Why Deep RL?

  1. Reinforcement learning provides a mathematical framework for decision-making
  2. Deep learning has shown to be extremely successful in unstructured environments (e.g. image, text)
  3. Deep RL allows for end-to-end training of policies
    1. Features are tedious and difficult to hand-design, and are not so transferable across tasks

Good Resources

  1. CS285 Fall 2019 - YouTube
  2. Welcome to Spinning Up in Deep RL! — Spinning Up documentation (Tensorflow, Pytorch)


The Simplest Policy Gradient (Spinning Up)

We can derive a simple policy gradient algorithm by using the likelihood ratio trick:

\begin{equation} \nabla_{\theta} P(\tau | \theta)=P(\tau | \theta) \nabla_{\theta} \log P(\tau | \theta) \end{equation}

The link provides a succinct derivation of the PG algorithm.

Vanilla Policy Gradient (Spinning Up)

Spinning Up’s implementation of VPG uses several tricks:

  1. Generalized Advantage Estimation (GAE)
  2. Actor-critic

Generalized Advantage Estimator (GAE) (Schulman et al., 2015)

The variance of a gradient estimator scales unfavourably with the time horizon, since the effect of an action is confounded with the effects of past and future actions.

The generalized advantage estimator (GAE) is a family of policy gradient estimators that reduce variance of the policy gradient estimators while maintaining a tolerable level of bias.

Policy gradient methods maximize the expected total reward by repeatedly estimating the gradient \(g\):

\begin{equation} g=\mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\right] \end{equation}

Where \( \Psi_{t} \) may be one of the following:

\begin{equation} \begin{array}{ll}{\text { 1. } \sum_{t=0}^{\infty} r_{t} : \text { total reward of the trajectory. }} & {\text { 4. } Q^{\pi}\left(s_{t}, a_{t}\right) : \text { state-action value function. }} \ {\text { 2. } \sum_{t^{\prime}=t}^{\infty} r_{t} : \text { reward following action } a_{t} .} & {\text { 5. } A^{\pi}\left(s_{t}, a_{t}\right) : \text { advantage function. }} \ {\text { 3. } \sum_{t^{\prime}=t}^{\infty} r_{t^{\prime}}-b\left(s_{t}\right) : \text { baselined version of }} & {} \ {\text { previous formula. }} & {\text { 6. } r_{t}+V^{\pi}\left(s_{t+1}\right)-V^{\pi}\left(s_{t}\right) : \text { TD residual. }}\end{array} \end{equation}

The choice of \( \Psi_{t} = A^{\pi}(s_t, a_t)\) yields almost the lowest variance, but in practice, the advantage function is not known and must be approximated.

The parameter \(\gamma\) allows us to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias. This parameter corresponds to the discount factor used in discounted formulations of MDPs, but is used as variance reduction in an undiscounted problem.

\begin{equation} V^{\pi, \gamma}\left(s_{t}\right) :=\underset{a_{s_{t+1} : \infty}}{\mathbb{E}_{s_{t+1 ; \infty}},}\left[\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}\right] \quad Q^{\pi, \gamma}\left(s_{t}, a_{t}\right) :=\underset{a_{s_{t+1} : \infty}}{\mathbb{E}_{s_{t+1 ; \infty}},}\left[\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}\right] \end{equation}

\begin{equation} A^{\pi, \gamma}\left(s_{t}, a_{t}\right) :=Q^{\pi, \gamma}\left(s_{t}, a_{t}\right)-V^{\pi, \gamma}\left(s_{t}\right) \end{equation}

Actor Critic

Batch actor-critic algorithm:

  1. sample \(\left\{ s_i, a_i \right\}\) from \(\pi_\theta (a|s)\) (run it on the robot)
  2. fit \(\hat{V}_\phi^\pi (s)\) to sample reward sums
  3. evaluate \(\hat{A}^\pi (s_i, a_i) = r(s_i, a_i) + \hat{V}_\phi^\pi(s_i’) - \hat{V}_\phi^\pi (s_i)\)
  4. \(\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i|s_i) \hat{A}^\pi (s_i|a_i)\)
  5. \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)

(Sutton et al., 2000), (Mnih et al., 2016), (Gu et al., 2016)

Deep RL with Q-functions

Instabilities in Q-Learning

  1. Correlations are present in the sequence of observations
  2. Small updates to \(Q\) may significantly change the policy and therefore change the data distribution
  3. Correlations between the action-values \(Q\) and the target values \(r + \gamma \mathrm{max}_{a’}Q(s’, a’)\).

Full fitted Q-iteration algorithm:

  1. collect dataset \(\left\{ (s_i, a_i, s_i’, r_i)\right\}\) using some policy
  2. set \(y_i \leftarrow r(s_i, a_i) + \gamma \mathrm{max}_{a_i’} Q_\phi(s_i’, a_i’)\)
  3. set \(\phi \leftarrow \mathrm{argmin}_\phi \frac{1}{2} \sum_i \lVert Q_\phi (s_i, a_i) - y_i \rVert ^2\)

Online Q-iteration algorithm:

  1. take some action \(a_i\) and observe \((s_i, a_i, s_i’, r_i)\)
  2. \(y_i = r(s_i, a_i) + \gamma \mathrm{max}_{a’}Q_\phi(s_i’, a_i’)\)
  3. \(\phi \leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi} (s_i, a_i) (Q_\phi (s_i, a_i) - y_i)\)

Q-learning is not gradient descent, and does not converge in general, because there are no gradients through target value.

Deep Q-network (DQN)

DQN (Mnih et al., 2015) aims to improve the stability of Q-learning by introducing 2 mechanisms: experience replay, and a periodically updated target.

Experience Replay

This idea is first proposed in (Mnih et al., 2015).

All episodic steps \(e_t = (S_t, A_t, R_t, S_{t+1})\) are stored in a replay buffer \(D_t = \left\{e_1, \dots, e_t\right\}\). \(D_t\) has experience tuples over many episodes. During Q-learning updates, samples are drawn at random from the replay buffer. This allows for multiple reuse of each episode, improving data efficiency, and smooths changes in the data distribution.

Is uniform sampling from the replay buffer the best approach? It is expected that some episodes may provide higher expected learning progress, and prioritizing these episodes should lead to faster learning. Prioritized experience replay (Schaul et al., 2015) uses TD-error as a measure of expected learning progress, correcting for the introduced bias by using importance-sampling weights.

One ability humans have is to learn almost as much from achieving an undesirable outcome as from the desired one. This property is missing from many model-free RL algorithms. Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) allows the algorithm to perform this kind of reasoning, and can be combined with any off-policy RL algorithm. This is applicable whenever multiple goals may be achieved. This makes learning more sample efficient, and possible when rewards are sparse.

Periodically Updated Target

The Q-function is optimized towards target values that are only periodically updated. The Q-network is cloned, and kept frozen as the optimization target every \(K\) steps, \(K\) being a tunable hyperparameter.

The modified loss function looks like this:

\begin{equation} L(\theta) = \mathcal{E}_{(s,a,r,s’) \sim U(D)}\left[ \left( r + \gamma \mathrm{max}_{a’}Q(s’,a’;\theta^-)-Q(s,a;\theta) \right)^2 \right] \end{equation}

where \(U(D)\) is a uniform distribution over the replay buffer, and \(\theta^-\) is the parameters of the frozen target Q-network.


Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P., High-Dimensional Continuous Control Using Generalized Advantage Estimation, CoRR, (), (2015).

Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y., Policy gradient methods for reinforcement learning with function approximation, In , Advances in neural information processing systems (pp. 1057–1063) (2000). : .

Mnih, V., Badia, Adri`a Puigdom`enech, Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., …, Asynchronous methods for deep reinforcement learning, CoRR, (), (2016).

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., & Levine, S., Q-prop: sample-efficient policy gradient with an off-policy critic, CoRR, (), (2016).

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., …, Human-level control through deep reinforcement learning, Nature, 518(7540), 529–533 (2015). http://dx.doi.org/10.1038/nature14236

Schaul, T., Quan, J., Antonoglou, I., & Silver, D., Prioritized Experience Replay, CoRR, (), (2015).

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., …, Hindsight Experience Replay, CoRR, (), (2017).