Deep Reinforcement Learning

Why Deep Reinforcement Learning?

  1. Reinforcement learning provides a mathematical framework for decision-making
  2. Deep learning has been shown to be extremely successful in unstructured environments (e.g. images, text)
  3. Deep RL allows for end-to-end training of policies
    1. Hand-designed features are tedious and difficult to engineer, and do not transfer well across tasks
    2. Learned features are informed by the task itself

Anatomy of Deep RL algorithms

\begin{equation} \theta^{\star}=\arg \max _{\theta} E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \end{equation}
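
Here τ denotes a trajectory (s_1, a_1, …, s_T, a_T), and its distribution under the policy π_θ factorizes as:

\begin{equation} p_{\theta}(\tau)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \end{equation}

The major algorithm families differ in how they carry out this maximization: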

policy gradients
directly differentiate the above objective with respect to the policy parameters θ
value-based
estimate the value function or Q-function of the optimal policy (no explicit policy)
actor-critic
estimate the value function or Q-function of the current policy, and use it to improve the policy
model-based RL
estimate the transition model, and then…
  • use it for planning (no explicit policy)
    • trajectory optimization/optimal control (continuous action spaces)
    • planning in discrete action spaces (e.g. Monte Carlo Tree Search)
  • use it to improve a policy (e.g. via backpropagation through the model, with some tricks)
  • use the model to learn a value function (e.g. through dynamic programming)
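
All of these families instantiate the same basic loop: generate samples by running the policy, fit a model or estimate the return, and improve the policy. As a concrete illustration, here is a minimal REINFORCE-style sketch of that loop for the policy gradient case, written in numpy. The env object, with an assumed reset() → state and step(action) → (next_state, reward, done) interface, and the linear softmax policy are placeholders for illustration, not part of the original notes.

    import numpy as np

    def softmax(logits):
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def sample_trajectory(env, theta, horizon=200):
        """Generate samples: run pi_theta (a linear softmax policy) in the environment."""
        states, actions, rewards = [], [], []
        s = env.reset()
        for _ in range(horizon):
            probs = softmax(theta.T @ s)              # pi_theta(a_t | s_t)
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)             # assumed env interface
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
            if done:
                break
        return states, actions, rewards

    def reinforce_update(theta, states, actions, rewards, lr=1e-2):
        """Improve the policy: one gradient ascent step on E[sum_t r(s_t, a_t)]."""
        ret = sum(rewards)                            # return of this trajectory
        for s, a in zip(states, actions):
            probs = softmax(theta.T @ s)
            grad_log_pi = np.outer(s, -probs)         # d log pi(a|s) / d theta ...
            grad_log_pi[:, a] += s                    # ... for a linear softmax policy
            theta = theta + lr * ret * grad_log_pi    # score-function gradient estimate
        return theta

Here theta has shape (state_dim, n_actions); deep RL uses the same estimator, but with a neural network policy in place of the linear map.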

Why so many RL algorithms?

  • Different tradeoffs
    • sample efficiency
      • off-policy: can the algorithm improve the policy without generating new samples from that policy? (e.g. Q-learning can reuse old transitions from a replay buffer)
      • even if not, sample efficiency matters less when samples are cheap to obtain (e.g. in simulation)
Figure 1: Sample efficiency comparison

  • stability and ease of use (does it converge, and if so, to what?)
    • Q-learning: fixed-point iteration, with no convergence guarantees under nonlinear function approximation (see the sketch after this list)
    • Model-based RL: the model is trained to fit the dynamics, not optimized for expected reward, so a better model does not guarantee a better policy
  • Different assumptions
    • fully observable?
      • generally assumed by value function fitting methods (mitigated by adding recurrence)
    • episodic learning?
      • generally assumed by pure policy gradient methods
      • assumed by some model-based RL methods
    • continuity or smoothness?
      • assumed by some continuous value function learning methods
      • often assumed by some model-based RL methods
    • stochastic or deterministic?
  • Different things are easy or hard in different settings
    • easier to represent the policy?
    • easier to represent the model?
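
To make the "fixed-point iteration" remark concrete, the sketch below runs tabular Q-value iteration: it repeatedly applies the Bellman optimality backup, whose unique fixed point is the optimal Q-function. The dynamics P and rewards R are made-up placeholders. In the tabular case the backup is a γ-contraction and convergence is guaranteed; with neural network function approximation this argument breaks down, which is one source of the stability issues discussed in the next section.

    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
    R = rng.standard_normal((n_states, n_actions))                    # r(s, a)

    Q = np.zeros((n_states, n_actions))
    for _ in range(1000):
        # Bellman optimality backup for all (s, a) at once:
        # Q(s, a) <- r(s, a) + gamma * E_{s'}[max_a' Q(s', a')]
        Q_next = R + gamma * P @ Q.max(axis=1)
        if np.abs(Q_next - Q).max() < 1e-8:           # contraction => convergence
            Q = Q_next
            break
        Q = Q_next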

Challenges in Deep RL

Stability and Hyperparameter Tuning

  • Devising stable RL algorithms is hard
  • Can’t run hyperparameter sweeps in the real world

We would like algorithms with favourable improvement and convergence properties (e.g. Schulman et al., 2015), or algorithms that adaptively adjust their parameters (e.g. Gu et al., 2016).

Problem Formulation

  • Multi-task reinforcement learning and generalization
  • Unsupervised or self-supervised learning

Resources

  1. CS285 Fall 2019 - YouTube
  2. Welcome to Spinning Up in Deep RL! — Spinning Up documentation (Tensorflow, Pytorch)
  3. David Silver’s Deep RL ICML Tutorial

Bibliography

Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. CoRR, abs/1506.02438.

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., & Levine, S. (2016). Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic. CoRR, abs/1611.02247.
