Temporal Difference Learning

Observe transition samples $$\left(s_t, a_t, r_t, s_{t+1} \right)$$. If the value estimates are accurate, the following must hold in expectation:

$$V(s_t) = r_t + \gamma V(s_{t+1})$$

If not, there is a TD error:

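$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

To learn better estimates, reduce the TD error $\delta_t$ by moving $V(s_t)$ a step of size $\alpha$ toward the target $r_t + \gamma V(s_{t+1})$. The TD(0) update is: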

$$V(s_t) \leftarrow V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)$$
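
The update can be sketched in a few lines of Python. This is a minimal tabular illustration, not a reference implementation: the function name `td0_update`, the `terminal` flag, the state labels `"A"` and `"B"`, and the numeric values are all assumptions chosen for the example.

```python
from collections import defaultdict


def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular TD(0) update to V in place and return the TD error.

    V is a mapping from state to value estimate. The `terminal` flag is an
    assumption of this sketch: a terminal next state is given value 0.
    """
    # Bootstrapped target: r_t + gamma * V(s_{t+1})
    target = r + (0.0 if terminal else gamma * V[s_next])
    # TD error: target minus the current estimate V(s_t)
    td_error = target - V[s]
    # Move V(s_t) a fraction alpha of the way toward the target
    V[s] += alpha * td_error
    return td_error


# Hypothetical two-state example: unseen states default to a value of 0.0.
V = defaultdict(float)
V["A"] = 0.5
V["B"] = 1.0

# One observed transition: from A, receive reward 1.0, land in B.
err = td0_update(V, "A", r=1.0, s_next="B", alpha=0.1, gamma=0.9)
print(f"TD error: {err:.3f}, updated V(A): {V['A']:.3f}")
```

With these numbers the TD error is about 1 + 0.9 · 1.0 − 0.5 = 1.4, so V(A) moves from 0.5 to roughly 0.5 + 0.1 · 1.4 = 0.64. Repeating the update over many sampled transitions is what drives the estimates toward consistency with the equation above.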