Jethro's Braindump

Deep Reinforcement Learning

Reinforcement Learning ⭐

Why Deep Reinforcement Learning?

  1. Reinforcement learning provides a mathematical framework for decision-making
  2. Deep learning has been shown to be extremely successful on unstructured inputs (e.g. images, text)
  3. Deep RL allows for end-to-end training of policies
    1. Hand-designed features are tedious and difficult to build, and do not transfer well across tasks
    2. Learned features are informed by the task

Anatomy of Deep RL algorithms

\begin{equation} \theta^{\star}=\arg \max _{\theta} E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \end{equation}
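
Here a trajectory is \(\tau=(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T})\). Assuming the standard finite-horizon MDP setup (the horizon \(T\), the initial state distribution \(p(\mathbf{s}_{1})\), the policy \(\pi_{\theta}\) and the transition dynamics are not defined in these notes and are assumptions here), the trajectory distribution in the objective factorizes as:

\begin{equation} p_{\theta}(\tau)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \end{equation}

Only the \(\pi_{\theta}\) factors depend on \(\theta\), which is why the algorithm families below differ mainly in how they handle the unknown dynamics term.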

  • policy gradients: directly differentiate the above objective
  • value-based: estimate the value function or Q-function of the optimal policy (no explicit policy)
  • actor-critic: estimate the value function or Q-function of the current policy, and use it to improve the policy
  • model-based RL: estimate the transition model, and then…
    • use it for planning (no explicit policy)
      • trajectory optimization/optimal control (continuous spaces)
      • discrete planning in discrete action spaces (Monte Carlo Tree Search)
    • use it to improve a policy (e.g. via backpropagation, with some tricks)
    • use the model to learn a value function (e.g. through dynamic programming)
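
As a rough illustration of the policy gradient family ("directly differentiate the objective"), here is a minimal sketch of the score-function (REINFORCE) estimator. It assumes a one-step, two-action bandit with a softmax policy; the name `reinforce_bandit`, the reward means, the noise level, the learning rate and the running-average baseline are all illustrative choices, not taken from the lecture.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Score-function policy gradient on a bandit (hypothetical toy setup).

    theta holds one logit per action; the policy is softmax(theta).
    """
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        # For a softmax policy, d/d theta_i of log pi(action) is
        # one_hot(action)_i - probs_i.
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * advantage * grad_log_pi
        baseline += (reward - baseline) / t
    return softmax(theta)


final_probs = reinforce_bandit([0.2, 1.0])
print(final_probs)
```

With these assumed settings the policy should end up putting most of its probability on the second action, since it has the higher mean reward. Every update uses a fresh sample from the current policy, which is what makes this an on-policy method.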

Why so many RL algorithms?

  • Different tradeoffs
    • sample efficiency
      • is the algorithm off-policy, i.e. can it improve the policy without generating new samples from that policy?
      • however, sample efficiency matters less when samples are cheap to obtain
Figure 1: Sample efficiency comparison

  • stability and ease of use (does it converge, and if so to what?)
    • Q-learning: fixed point iteration, not gradient descent on the objective
    • Model-based RL: the model is fit to minimize prediction error, not optimized for expected reward
  • Different assumptions
    • fully observable?
      • generally assumed by value function fitting methods (mitigated by adding recurrence)
    • episodic learning?
      • generally assumed by pure policy gradient methods
      • assumed by some model-based RL methods
    • continuity or smoothness?
      • assumed by some continuous value function learning methods
      • often assumed by some model-based RL methods
    • stochastic or deterministic?
  • Different things are easy or hard in different settings
    • easier to represent the policy?
    • easier to represent the model?
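
To make the "Q-learning: fixed point iteration" point above concrete, here is a minimal sketch of tabular Q-iteration: the Bellman optimality backup is applied repeatedly until the table stops changing. The two-state deterministic MDP, the function name `q_iteration` and the values of `gamma` and `tol` are assumptions made for illustration. In this exact tabular setting the backup is a contraction, so the iteration converges; that guarantee generally does not carry over once the table is replaced by a neural network, which is the stability concern.

```python
def q_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until a fixed point is reached.

    transitions[s][a] is the (deterministic) next state,
    rewards[s][a] is the immediate reward. Both are hypothetical toy inputs.
    Returns the Q-table and the number of backups performed.
    """
    n_states = len(transitions)
    n_actions = len(transitions[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for iteration in range(1, max_iters + 1):
        # Q_new(s, a) = r(s, a) + gamma * max_a' Q_old(s', a')
        new_q = [
            [
                rewards[s][a] + gamma * max(q[transitions[s][a]])
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        delta = max(
            abs(new_q[s][a] - q[s][a])
            for s in range(n_states)
            for a in range(n_actions)
        )
        q = new_q
        if delta < tol:
            return q, iteration
    return q, max_iters


# Toy MDP: action 0 stays in the current state, action 1 moves to the other one.
transitions = [[0, 1], [1, 0]]
# Reward 1 only for staying in state 1.
rewards = [[0.0, 0.0], [1.0, 0.0]]

q_star, n_iters = q_iteration(transitions, rewards)
print(q_star, n_iters)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and moving there from state 0 is worth 0.9 * 10 = 9.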

Challenges in Deep RL

Stability and Hyperparameter Tuning

  • Devising stable RL algorithms is hard
  • Can’t run hyperparameter sweeps in the real world

We would like algorithms with favourable improvement and convergence properties:

  • Trust region policy optimization (Schulman et al., n.d.)

or algorithms that adaptively adjust parameters:

  • Q-Prop (Gu et al., n.d.)

Problem Formulation

  • Multi-task reinforcement learning and generalization
  • Unsupervised or self-supervised learning


  1. CS285 Fall 2019 - YouTube
  2. Welcome to Spinning Up in Deep RL! - Spinning Up documentation (TensorFlow, PyTorch)
  3. David Silver’s Deep RL ICML Tutorial

