
Policy Gradients

tags
Machine Learning Algorithms, Reinforcement Learning ⭐

Key Idea

The objective is:

$$\theta^{\star} = \arg\max_{\theta} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$

To evaluate the objective, we need to estimate this expectation, often through sampling by generating multiple samples from the distribution:

$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] \approx \frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t})$$
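
As a minimal sketch of this sampling-based evaluation (the `sample_trajectory(policy)` helper, which runs one rollout and returns its per-step rewards, is a hypothetical name, not from the note):

```python
import numpy as np

def estimate_objective(sample_trajectory, policy, n_samples=100):
    """Monte Carlo estimate of J(theta): average total reward over sampled rollouts."""
    totals = [sum(sample_trajectory(policy)) for _ in range(n_samples)]
    return np.mean(totals)
```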

Recall that:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i)\, r(\tau_i) = \frac{1}{N}\sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) r(\tau_i)$$

This makes the good stuff (high-reward trajectories) more likely and the bad stuff less likely: each trajectory's log-probability gradient is scaled by its reward.
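
A sketch of this estimator for a tabular softmax policy (the parameterization and the trajectory data layout are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for a tabular softmax policy with logits theta[s, a]."""
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

def policy_gradient(trajectories, theta):
    """(1/N) sum_i (sum_t grad log pi(a_t|s_t)) r(tau_i) from sampled trajectories,
    where each trajectory is a (states, actions, rewards) triple of lists."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        score = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
        grad += score * sum(rewards)
    return grad / len(trajectories)
```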

Comparison to Maximum Likelihood

policy gradient
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$$
maximum likelihood
$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right)$$
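
Side by side in code, the two estimators differ only in the per-trajectory reward weight (a sketch; `grad_log_pi` stands for any function returning the score $\nabla_\theta \log \pi_\theta(a \mid s)$):

```python
import numpy as np

def pg_and_mle_gradients(trajectories, grad_log_pi, theta_shape):
    """Same score terms in both estimators; the policy gradient additionally
    weights each trajectory's score by its total reward."""
    pg, mle = np.zeros(theta_shape), np.zeros(theta_shape)
    for states, actions, rewards in trajectories:
        score = sum(grad_log_pi(s, a) for s, a in zip(states, actions))
        pg += score * sum(rewards)
        mle += score
    n = len(trajectories)
    return pg / n, mle / n
```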

Partial Observability

The policy gradient method does not assume that the system satisfies the Markov assumption! The algorithm only requires the ability to generate samples, and a function approximator for $\pi_\theta(a_t \mid o_t)$.

Issues

  • Policy gradients have high variance: the gradient estimate is noisy, and even a constant shift in the rewards affects it (see the sketch below)
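
A toy illustration of the reward-shift sensitivity on a one-step, two-action problem (the whole setup is a made-up example): adding a constant to all rewards leaves the expected gradient unchanged, but inflates the variance of the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.0, 0.0])                  # logits of a softmax policy over 2 actions
probs = np.exp(theta) / np.exp(theta).sum()

def gradient_samples(reward_shift, n=10000):
    grads = []
    for _ in range(n):
        a = rng.choice(2, p=probs)
        r = (1.0 if a == 0 else 0.0) + reward_shift   # same problem, rewards shifted by a constant
        score = -probs.copy()
        score[a] += 1.0                               # grad_theta log pi(a)
        grads.append(score * r)
    grads = np.array(grads)
    return grads.mean(axis=0), grads.var(axis=0)

for shift in (0.0, 100.0):
    mean, var = gradient_samples(shift)
    print(f"shift={shift}: mean grad {mean}, variance {var}")   # same mean, much larger variance
```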

Properties of Policy gradients

  1. On-policy

The objective is an expectation over trajectories sampled from that same policy. This can be turned into an off-policy method using Importance Sampling.

Figure 1: Off-policy policy gradients

$$\nabla_{\theta'} J(\theta') = E_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} \nabla_{\theta'} \log \pi_{\theta'}(\tau)\, r(\tau)\right] \quad \text{when } \theta \neq \theta'$$
$$= E_{\tau \sim \pi_\theta(\tau)}\left[\left(\prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\right)\left(\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$

Problem: with large T, the first term (a product of T per-step probability ratios) becomes exponentially large or small.
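
A quick numeric illustration (the per-step ratios here are synthetic, drawn near 1): even mild per-step ratios compound into a degenerate weight as T grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for T in (10, 100, 1000):
    ratios = rng.uniform(0.7, 1.3, size=T)   # hypothetical per-step ratios pi_theta'(a|s) / pi_theta(a|s)
    print(T, np.prod(ratios))                # the product moves exponentially far from 1 as T grows
```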

Variance Reduction

Causality

The policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$.

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right)$$

This is still an unbiased estimator, and has lower variance because the gradients are multiplied by smaller values. This is often written as:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$

where $\hat{Q}_{i,t}$ is the reward-to-go.
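
A reward-to-go helper computed with a backward pass (a small sketch; the discount $\gamma$ defaults to 1 to match the undiscounted sum above):

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """Q_hat_t = sum_{t'=t}^{T} gamma^(t'-t) r_{t'}, computed right to left."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

print(reward_to_go([1.0, 0.0, 2.0]))   # [3. 2. 2.]
```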

Baseline Reduction

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau)\left[r(\tau) - b\right]$$

Subtracting a baseline keeps the estimator unbiased in expectation, and may reduce its variance. We can compute the optimal baseline by writing down the variance of the gradient estimator and setting its derivative with respect to $b$ to 0:

Figure 2: Computing the optimal baseline

This gives $b = \frac{E\left[\left(\nabla_\theta \log \pi_\theta(\tau)\right)^2 r(\tau)\right]}{E\left[\left(\nabla_\theta \log \pi_\theta(\tau)\right)^2\right]}$ (per parameter dimension): just the expected reward, but weighted by gradient magnitudes.
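
A sketch computing both baselines from a batch of sampled trajectories (the array shapes are assumptions): the simple average-return baseline, and the variance-minimizing one weighted by squared gradient magnitudes.

```python
import numpy as np

def baselines(returns, score_grads):
    """returns: shape (N,), total reward r(tau_i) per trajectory.
    score_grads: shape (N, d), g_i = grad_theta log pi_theta(tau_i).
    Returns the average-return baseline and, per parameter dimension,
    the variance-minimizing baseline b_k = E[g_k^2 r] / E[g_k^2]."""
    returns = np.asarray(returns, dtype=float)
    g2 = np.asarray(score_grads, dtype=float) ** 2
    avg_b = returns.mean()
    opt_b = (g2 * returns[:, None]).mean(axis=0) / (g2.mean(axis=0) + 1e-8)
    return avg_b, opt_b
```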

Policy Gradient in practice

  • Gradients have high variance
  • Consider using much larger batches
  • Tweaking learning rates might be important
  • Adaptive learning rates are fine; there are also some policy-gradient-specific learning rate adjustment methods

REINFORCE

  1. For each episode,
    1. generate $\tau = s_0, a_0, r_1, \dots, s_{t-1}, a_{t-1}, r_t$ by following $\pi_\theta(a \mid s)$
    2. For each step $i = 0, \dots, t-1$:
      1. $R_i = \sum_{k=i}^{t} \gamma^{k-i} r_k$ (unbiased estimate of the remaining episode return under $\pi_\theta$ starting from step $i$)
      2. $\hat{A}_i = R_i - b$ (advantage estimate: subtract a baseline $b$ to lower the variance)
        1. The advantage tells you how relatively good this action is
      3. $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{A}_i$
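
A self-contained REINFORCE sketch following the steps above, on a made-up 5-state chain (the environment, the tabular softmax policy, and the running-average baseline are illustrative choices, not part of the note):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain: states 0..4, actions 0 = left, 1 = right, reward +1 on reaching state 4.
N_STATES, N_ACTIONS, GOAL, MAX_STEPS = 5, 2, 4, 50

def env_step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(theta):
    states, actions, rewards = [], [], []
    s, done, t = 0, False, 0
    while not done and t < MAX_STEPS:
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next, r, done = env_step(s, a)
        states.append(s); actions.append(a); rewards.append(r)
        s, t = s_next, t + 1
    return states, actions, rewards

def reinforce(n_episodes=2000, alpha=0.1, gamma=0.99):
    theta = np.zeros((N_STATES, N_ACTIONS))
    baseline = 0.0                                    # running estimate of the episode return, used as b
    for _ in range(n_episodes):
        states, actions, rewards = run_episode(theta)
        # R_i: discounted return from step i onwards, computed backwards.
        returns, running = np.zeros(len(rewards)), 0.0
        for i in reversed(range(len(rewards))):
            running = rewards[i] + gamma * running
            returns[i] = running
        for s, a, R in zip(states, actions, returns):
            advantage = R - baseline                  # A_hat_i = R_i - b
            grad_log = -softmax(theta[s])
            grad_log[a] += 1.0                        # grad_theta[s] log pi(a|s)
            theta[s] += alpha * grad_log * advantage  # theta <- theta + alpha * grad log pi * A_hat
        baseline = 0.95 * baseline + 0.05 * returns[0]
    return theta

theta = reinforce()
print(softmax(theta[0]))    # most of the probability mass should be on action 1 (right)
```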

Objective: $J(\theta) = \sum_\tau P_\theta(\tau) R(\tau)$

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P_\theta(\tau) R(\tau) = \sum_\tau \nabla_\theta P_\theta(\tau)\, R(\tau)$$
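
The step connecting this to the expectation form used above is the log-derivative trick, $\nabla_\theta P_\theta(\tau) = P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \sum_\tau P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)\, R(\tau) = E_{\tau \sim P_\theta(\tau)}\left[\nabla_\theta \log P_\theta(\tau)\, R(\tau)\right]$$

and, since the initial-state and transition probabilities do not depend on $\theta$, $\nabla_\theta \log P_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.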

Actor critics use a learned estimate (e.g. $\hat{A}(s, a) = \hat{Q}(s, a) - \hat{V}(s)$).

Policy Gradients and Policy Iteration

Policy gradients involve estimating $\hat{A}(s, a)$ and using it to improve the policy, much like policy iteration, which evaluates $A(s, a)$ and uses it to create a better, deterministic policy.

$$
\begin{aligned}
J(\theta') - J(\theta) &= J(\theta') - E_{s_0 \sim p(s_0)}\left[V^{\pi_\theta}(s_0)\right] \\
&= J(\theta') - E_{\tau \sim p_{\theta'}(\tau)}\left[V^{\pi_\theta}(s_0)\right] \\
&= J(\theta') - E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t V^{\pi_\theta}(s_t) - \sum_{t=1}^{\infty} \gamma^t V^{\pi_\theta}(s_t)\right] \\
&= J(\theta') + E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t \left(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\right)\right] \\
&= E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] + E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t \left(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\right)\right] \\
&= E_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t} \gamma^t A^{\pi_\theta}(s_t, a_t)\right]
\end{aligned}
$$

We have an expectation under $\theta'$, but samples under $\theta$. We rewrite the expectation in terms of per-timestep state marginals and use Importance Sampling over the actions to handle the expectation under $\pi_{\theta'}$, but can we ignore the remaining state-distribution mismatch?

We can bound the distribution change from $p_\theta(s_t)$ to $p_{\theta'}(s_t)$. (See the Trust Region Policy Optimization paper.)

We can measure the distribution mismatch with KL divergence.

Then, we can enforce the constraint of a small KL divergence by using a loss function with a Lagrange multiplier:

$$\mathcal{L}(\theta', \lambda) = \sum_t E_{s_t \sim p_\theta(s_t)}\left[E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right] - \lambda\left(D_{\mathrm{KL}}\left(\pi_{\theta'}(a_t \mid s_t) \,\|\, \pi_\theta(a_t \mid s_t)\right) - \epsilon\right)$$

  1. Maximize $\mathcal{L}$ with respect to $\theta'$
  2. $\lambda \leftarrow \lambda + \alpha\left(D_{\mathrm{KL}} - \epsilon\right)$

Intuition: raise λ if constraint violated too much, else lower it.
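
A toy sketch of this alternating scheme for a single state with a categorical policy (the old policy, the advantages, the step sizes, and the one-ascent-step-per-iteration alternation are all illustrative assumptions). For one state, the importance-weighted expectation under the old policy reduces to $\sum_a \pi_{\theta'}(a)\, A(a)$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

q = np.array([0.5, 0.3, 0.2])        # pi_theta(a|s): old policy, fixed
A = np.array([1.0, 0.0, -1.0])       # A^{pi_theta}(s, a): assumed given (e.g. from a critic)
eps = 0.01                           # KL budget epsilon
z = np.log(q)                        # logits of the new policy pi_theta', starting at the old policy
lam, alpha_z, alpha_lam = 1.0, 0.5, 1.0

for _ in range(200):
    p = softmax(z)
    kl = np.sum(p * (np.log(p) - np.log(q)))
    # 1. (approximately) maximize L(theta', lambda) w.r.t. theta': one ascent step on the logits
    grad_surrogate = p * (A - p @ A)                  # d/dz of sum_a p_a A_a
    grad_kl = p * ((np.log(p) - np.log(q)) - kl)      # d/dz of D_KL(p || q)
    z = z + alpha_z * (grad_surrogate - lam * grad_kl)
    # 2. lambda <- lambda + alpha (D_KL - eps): raise lambda if the constraint is violated
    lam = max(0.0, lam + alpha_lam * (kl - eps))

p = softmax(z)
print("new policy:", p, "KL from old:", np.sum(p * (np.log(p) - np.log(q))))
```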

Alternatively, optimize within some region, and use a Taylor expansion to approximate the function within that region.

Natural Gradients

Resources