
Actor-Critic

Actor-critic methods improve on policy gradient methods by introducing a critic: a learned value function used to reduce the variance of the gradient estimator.

Recall the objective:

\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}
\]

The question we want to address is: can we get a better estimate of the reward-to-go?

Originally, we used a single-trajectory estimate of the reward-to-go, \(\hat{Q}_{i,t}\). If we knew the true expected reward-to-go, we would get a lower-variance estimate of the policy gradient.

We define the advantage function as \(A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)\). \(V^\pi(s_t)\) can be used as the baseline \(b\), and we obtain the objective:

\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})
\]
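A minimal PyTorch sketch of this estimator written as a surrogate loss, assuming a hypothetical discrete-action `policy_net` and precomputed advantage estimates (the advantages are treated as constants so that only the log-probability term is differentiated):

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(policy_net, states, actions, advantages):
    """Surrogate loss whose gradient matches the advantage-weighted policy gradient.

    states:     (N, obs_dim) float tensor
    actions:    (N,) long tensor of sampled actions
    advantages: (N,) float tensor of advantage estimates
    """
    logits = policy_net(states)                       # (N, num_actions)
    log_probs = Categorical(logits=logits).log_prob(actions)
    # detach() keeps the advantages out of the gradient; the minus sign turns
    # gradient ascent on J(theta) into a loss we can minimize.
    return -(log_probs * advantages.detach()).mean()
```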

Value Function Fitting

Recall:

\[
Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]
\]

\[
V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right]
\]

\[
A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)
\]

\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})
\]

We can choose to fit \(Q^\pi\), \(V^\pi\), or \(A^\pi\); each has its pros and cons.

We can write:

\[
Q^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1})
\]

\[
A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)
\]

Classic actor-critic algorithms fit \(V^\pi\) and recover \(Q^\pi\) (and hence \(A^\pi\)) from it, paying the cost of approximating the expectation over the next state with the single sampled \(s_{t+1}\).

We do Monte Carlo evaluation with function approximation, estimating \(V^\pi(s_t)\) as:

\[
V^\pi(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'})
\]

Our training data consists of pairs \(\left\{ \left( s_{i,t}, \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right) \right\}\), and we can just fit a neural network with regression.
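A minimal numpy sketch of computing these regression targets for a single trajectory (array names are illustrative):

```python
import numpy as np

def reward_to_go(rewards):
    """Monte Carlo targets: sum of rewards from step t to the end of the trajectory."""
    targets = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        targets[t] = running
    return targets

# e.g. rewards [1, 0, 2] -> targets [3, 2, 2]
```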

Alternatively, we can decompose the ideal target and bootstrap with the previously fitted value function \(\hat{V}_\phi^\pi\):

\[
y_{i,t} = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_{i,t} \right] \approx r(s_{i,t}, a_{i,t}) + \hat{V}_\phi^\pi(s_{i,t+1})
\]

This is a biased estimate, but it can have much lower variance. It works well when the policy does not change much between updates and the previous value function is a decent estimate. Since it reuses the previous value function, it is also called a bootstrapped estimate.
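A minimal sketch of these bootstrapped targets, assuming `value_net` is the previously fitted \(\hat{V}_\phi^\pi\) (the `dones` mask and other names are illustrative details):

```python
import torch

def bootstrapped_targets(value_net, rewards, next_states, dones):
    """Targets y_t = r_t + V_hat(s_{t+1}), using the previous value network.

    rewards:     (N,) float tensor
    next_states: (N, obs_dim) float tensor
    dones:       (N,) float tensor, 1.0 at terminal transitions
    """
    with torch.no_grad():                 # targets are constants, not differentiated
        next_values = value_net(next_states).squeeze(-1)
    return rewards + (1.0 - dones) * next_values
```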

Discount Factors

The problem with the bootstrapped estimate is that in long-horizon (or infinite-horizon) problems, \(\hat{V}_\phi^\pi\) can grow arbitrarily large. A simple trick is to use a discount factor:

\[
y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}_\phi^\pi(s_{i,t+1})
\]

where \(\gamma \in [0, 1]\).

We can think of \(\gamma\) as changing the MDP: we introduce a death state with reward 0, and the probability of transitioning into it at each step is \(1 - \gamma\). This causes the agent to prefer rewards sooner rather than later.
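Concretely, the probability of still being alive \(k\) steps later in this modified MDP is \(\gamma^k\), so the expected undiscounted return there recovers the discounted return of the original MDP:

\[
\mathbb{E}_{\text{modified MDP}}\left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \right] = \sum_{t'=t}^{T} \gamma^{t'-t} \, \mathbb{E}\left[ r(s_{t'}, a_{t'}) \right]
\]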

Figure 1: \(\gamma\)-modified MDP

We can then modify \(\hat{A}^\pi\) accordingly:

\[
\hat{A}^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma \hat{V}^\pi(s_{t+1}) - \hat{V}^\pi(s_t)
\]

\(\gamma\) can also be interpreted as a way to limit variance and keep the reward sum finite (consider what happens to the sum and its variance as \(\gamma\) approaches 1).

Algorithm

  1. Sample \(\{s_i, a_i\}\) from \(\pi_\theta(a \mid s)\) (run the current policy).
  2. Fit \(\hat{V}_\phi^\pi(s)\) to the sampled reward sums.
  3. Evaluate \(\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}_\phi^\pi(s_i') - \hat{V}_\phi^\pi(s_i)\), where \(s_i'\) is the next state.
  4. \(\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \hat{A}^\pi(s_i, a_i)\)
  5. \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)
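A minimal PyTorch sketch of one iteration of this loop, assuming a discrete-action `policy_net`, a `value_net`, their optimizers, and an already-collected `batch` of transitions standing in for step 1 (all names are illustrative):

```python
import torch
from torch.distributions import Categorical

def batch_actor_critic_step(policy_net, value_net, policy_opt, value_opt,
                            batch, gamma=0.99):
    """One iteration of batch actor-critic: fit the critic, then update the actor.

    batch is a dict of tensors: 'states' (N, obs_dim), 'actions' (N,),
    'rewards' (N,), 'next_states' (N, obs_dim), 'dones' (N,).
    """
    states, actions = batch["states"], batch["actions"]
    rewards, next_states, dones = batch["rewards"], batch["next_states"], batch["dones"]

    # Step 2: regress V_hat toward bootstrapped targets (one gradient step shown).
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)
    value_loss = (value_net(states).squeeze(-1) - targets).pow(2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Step 3: advantage estimate A_hat = r + gamma * V_hat(s') - V_hat(s).
    with torch.no_grad():
        advantages = targets - value_net(states).squeeze(-1)

    # Steps 4 and 5: ascend the advantage-weighted policy gradient.
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```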

Online Actor-Critic

Online actor-critic updates using a single sample (a batch of one) at every step, which is a bad idea when the actor and critic are large neural networks: single-sample gradient estimates are too noisy. We need multiple samples to perform each update.

One fix is to run multiple parallel workers. The purpose of the workers here is not to make the algorithm faster, but to make it work by increasing the effective batch size.
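A sketch of the synchronized parallel variant, assuming a hypothetical `workers` list where each worker steps its own copy of the environment with the shared policy and returns one transition as a dict of tensors:

```python
import torch

def gather_parallel_batch(workers, policy_net):
    """Stack one transition per worker into a single update batch.

    Each (hypothetical) worker.step(policy_net) returns a dict with keys
    'states', 'actions', 'rewards', 'next_states', 'dones'.
    """
    transitions = [w.step(policy_net) for w in workers]
    return {
        key: torch.stack([t[key] for t in transitions])
        for key in ("states", "actions", "rewards", "next_states", "dones")
    }

# The stacked batch is then fed to the same update as in the batch algorithm
# above; with enough workers, the effective batch size becomes large enough
# for stable neural-network updates.
```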

Generalized Advantage Estimation

\[
\hat{A}_n^\pi(s_t, a_t) = \sum_{t'=t}^{t+n} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}_\phi^\pi(s_t) + \gamma^n \hat{V}_\phi^\pi(s_{t+n})
\]

\[
\hat{A}_{\text{GAE}}^\pi(s_t, a_t) = \sum_{n=1}^{\infty} w_n \hat{A}_n^\pi(s_t, a_t)
\]

which is a weighted combination of \(n\)-step returns. If we choose \(w_n \propto \lambda^{n-1}\), we can show that:

\[
\hat{A}_{\text{GAE}}^\pi(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma \lambda)^{t'-t} \delta_{t'}
\]

where

\[
\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}_\phi^\pi(s_{t'+1}) - \hat{V}_\phi^\pi(s_{t'})
\]

The roles of \(\gamma\) and \(\lambda\) turn out to be similar: both trade off bias and variance!
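A minimal numpy sketch of GAE for one trajectory, assuming `values` holds \(\hat{V}_\phi^\pi\) for each visited state plus the state after the final step (the default \(\lambda\) here is just a common choice):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single trajectory.

    rewards: (T,)   rewards r(s_t, a_t)
    values:  (T+1,) value estimates V_hat(s_t), including the state after the last step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                         # recursive form of the (gamma*lambda) sum
        advantages[t] = gae
    return advantages
```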

We need to balance learning speed and stability; several algorithms address this trade-off:

  • Conservative Policy Iteration (CPI)
    • propose surrogate objective, guarantee monotonic improvement under specific state distribution
  • Trust Region Policy Optimization (TRPO)
    • approximates CPI with trust region constraint
  • Proximal Policy Optimization (PPO)
    • replaces the TRPO constraint with a KL penalty + clipping (computationally efficient); a sketch of the clipped objective follows this list
  • Soft Actor-Critic (SAC)
    • stabilize learning by jointly maximizing expected reward and policy entropy (based on maximum entropy RL)
  • Optimistic Actor Critic (OAC)
    • Focus on exploration in deep Actor critic approaches.
    • Key insight: existing approaches tend to explore conservatively
    • Key result: Optimistic exploration leads to efficient, stable learning in modern Actor Critic methods
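As an illustration of the clipping idea mentioned in the PPO bullet above, here is a minimal PyTorch sketch of the clipped surrogate objective (negated for minimization); the tensor names and the \(\epsilon = 0.2\) default are illustrative:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: discourage probability ratios far from 1."""
    ratio = torch.exp(log_probs - old_log_probs)        # pi_theta(a|s) / pi_theta_old(a|s)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (element-wise minimum) of clipped and unclipped objectives.
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```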
