Temporal Difference Learning

Observe samples $(s_{t}, a_{t}, r_{t}, s_{t + 1})$. If value estimates are accurate, the following must hold:

$V (s_{t}) = r_{t} + γ V (s_{t + 1})$

If not, there is a TD error:

$δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})$

To learn better estimates, reduce the TD error $δ_{t}$. The TD(0) update:

$V (s_{t}) \leftarrow V (s_{t}) + α (r_{t} + γ V (s_{t + 1}) - V (s_{t}))$
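
The update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a tabular value function stored in a dict, and the function name `td0_update`, the state labels, and the default values of `alpha` (the step size α) and `gamma` (the discount γ) are all hypothetical choices made for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the tabular value estimates V (a dict).

    V      : dict mapping state -> current value estimate (assumed tabular)
    s      : state s_t
    r      : reward r_t
    s_next : state s_{t+1}
    Returns the TD error r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    # TD error: gap between the one-step bootstrapped target and the estimate.
    td_error = r + gamma * V[s_next] - V[s]
    # Move V(s_t) a fraction alpha of the way toward the target.
    V[s] += alpha * td_error
    return td_error


# Usage on a single observed transition (state names are illustrative only).
values = {"A": 0.0, "B": 1.0}
error = td0_update(values, s="A", r=0.5, s_next="B")
print(error, values["A"])
```

Only the value of the visited state $s_{t}$ changes. If the estimates already satisfy $V (s_{t}) = r_{t} + γ V (s_{t + 1})$ for a sample, the TD error is zero and the update leaves $V$ unchanged.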