Jethro's Braindump

Temporal Difference Learning

Observe samples $(s_t, a_t, r_t, s_{t+1})$. If the value estimates are accurate, the following must hold:

$$V(s_t) = r_t + \gamma V(s_{t+1})$$

If not, there is a TD error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

To learn better estimates, minimize $\delta_t$. The TD(0) update is:

$$V(s_t) \leftarrow V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)$$
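
A minimal sketch of the tabular TD(0) update in Python. The function name `td0_update`, the three-state chain, and the values $\alpha = 0.1$, $\gamma = 0.9$ are all assumptions made for illustration, not part of these notes. The action $a_t$ is dropped from each sample because a state-value update does not use it, and a terminal next state is assumed to have value 0.

```python
from collections import defaultdict


def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # Bootstrapped target r_t + gamma * V(s_{t+1}); assumed 0 beyond a terminal step.
    target = r + (0.0 if terminal else gamma * V[s_next])
    td_error = target - V[s]
    # Move V(s_t) a fraction alpha of the way toward the target.
    V[s] += alpha * td_error
    return td_error


# Hypothetical deterministic chain A -> B -> C -> end, reward 1 on the last step only.
# Each sample is (s_t, r_t, s_{t+1}, terminal).
episode = [
    ("A", 0.0, "B", False),
    ("B", 0.0, "C", False),
    ("C", 1.0, "end", True),
]

V = defaultdict(float)  # all value estimates start at 0
for _ in range(2000):
    for s, r, s_next, terminal in episode:
        td0_update(V, s, r, s_next, terminal=terminal)
```

On this chain the estimates should approach $V(C) = 1$, $V(B) = \gamma$ and $V(A) = \gamma^2$, at which point the TD error on every sample is close to zero.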