Deep Reinforcement Learning
Why Deep Reinforcement Learning?
- Reinforcement learning provides a mathematical framework for decision-making
- Deep learning has proven extremely successful in unstructured environments (e.g. images, text)
- Deep RL allows for end-to-end training of policies
- Hand-designed features are tedious and difficult to engineer, and they do not transfer well across tasks
- Useful features are informed by the task itself
Anatomy of Deep RL algorithms
\begin{equation} \theta^{\star}=\arg \max _{\theta} E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \end{equation}
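Concretely, the expectation in this objective can be approximated by averaging returns over sampled trajectories. A minimal sketch of that estimate is below; the `env` and `policy` callables are illustrative placeholders, assuming a classic Gym-style `reset()`/`step()` interface, not code from the course.

```python
import numpy as np

def estimate_objective(env, policy, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta}[sum_t r(s_t, a_t)].

    `env` is assumed to follow a Gym-style reset()/step() interface and
    `policy` maps a state to an action; both are placeholders for this sketch.
    """
    returns = []
    for _ in range(num_trajectories):
        s = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done, _ = env.step(a)
            total_reward += r
            if done:
                break
        returns.append(total_reward)
    return np.mean(returns)  # sample average approximates the expectation
```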
- policy gradients
- directly differentiate the objective above (see the REINFORCE sketch after this list)
- value-based
- estimate the value/Q-function of the optimal policy (no explicit policy)
- actor-critic
- estimate the value/Q-function of the current policy, and use it to improve the policy
- model-based RL
- estimate the transition model, and then…
- use it for planning (no explicit policy; see the random-shooting sketch after this list)
- Trajectory optimization/optimal control (continuous spaces)
- Discrete planning in discrete action spaces (Monte Carlo Tree Search)
- use it to improve a policy (e.g. by backpropagating gradients through the model into the policy, with some tricks)
- use the model to learn a value function (e.g. through dynamic programming)
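As a concrete example of the policy gradient family, here is a minimal REINFORCE-style update that directly ascends the objective above. The network architecture and the trajectory format are illustrative assumptions, not taken from the lecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical small policy network for a discrete-action task."""
    def __init__(self, obs_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, num_actions)
        )

    def forward(self, obs):
        # Return an action distribution so we can sample and compute log-probs
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, trajectories):
    """One REINFORCE update: ascend grad_theta E[sum_t log pi(a_t|s_t) * R(tau)].

    `trajectories` is a list of (observations, actions, rewards) arrays
    collected by running the current policy (on-policy data).
    """
    losses = []
    for obs, acts, rews in trajectories:
        obs = torch.as_tensor(obs, dtype=torch.float32)
        acts = torch.as_tensor(acts, dtype=torch.int64)
        ret = float(sum(rews))                   # total reward R(tau)
        log_probs = policy(obs).log_prob(acts)   # log pi(a_t | s_t)
        losses.append(-(log_probs.sum() * ret))  # negate for gradient ascent
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice one would subtract a baseline to reduce variance; estimating that baseline with a learned value function is where actor-critic methods come in.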
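For the model-based planning branch, here is a minimal random-shooting sketch of choosing actions with a learned model. The `model` and `reward_fn` callables and the action bounds are assumptions made for this illustration.

```python
import numpy as np

def random_shooting_action(model, reward_fn, state, action_dim,
                           horizon=15, num_candidates=1000):
    """Pick the first action of the best random action sequence under a
    learned model: a simple stand-in for 'use the model for planning'.

    `model(state, action) -> next_state` and `reward_fn(state, action) -> float`
    are assumed learned/known functions; both names are illustrative.
    """
    best_return, best_action = -np.inf, None
    for _ in range(num_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = model(s, a)  # roll the candidate sequence through the model
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action
```

In a model-predictive control loop, only this first action is executed and the plan is recomputed at the next step.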
Why so many RL algorithms?
- Different tradeoffs
- sample efficiency
- off-policy: can the policy be improved without generating new samples from that policy? (see the replay-buffer Q-learning sketch after this list)
- however, sample efficiency matters less when samples are cheap to obtain (e.g. from a fast simulator)
- stability and ease of use (does it converge, and if so to what?)
- Q-learning: fixed-point iteration, not gradient descent on the objective
- Model-based RL: the model is fit to the data, not optimized for expected reward
- Different assumptions
- fully observable?
- generally assumed by value function fitting methods (mitigated by adding recurrence)
- episodic learning
- generally assumed by pure policy gradient methods
- assumed by some model-based RL methods
- continuity or smoothness?
- assumed by some continuous value function learning methods
- often assumed by some model-based RL methods
- stochastic or deterministic?
- Different things are easy or hard in different settings
- easier to represent the policy?
- easier to represent the model?
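The off-policy point above is easiest to see in Q-learning: updates can reuse transitions from a replay buffer regardless of which behavior policy collected them. A minimal tabular sketch follows (states assumed discrete/hashable; all names here are illustrative).

```python
import random
from collections import defaultdict

def q_learning_from_buffer(buffer, num_actions, num_updates=10000,
                           alpha=0.1, gamma=0.99):
    """Tabular Q-learning from a fixed buffer of (s, a, r, s_next, done)
    transitions; the data can come from any behavior policy (off-policy)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(num_updates):
        s, a, r, s_next, done = random.choice(buffer)
        if done:
            target = r
        else:
            target = r + gamma * max(Q[(s_next, b)] for b in range(num_actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # move toward the TD target
    return Q
```

A pure policy gradient method, by contrast, needs fresh samples from the current policy for every update, which is one source of the sample-efficiency gap.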
Challenges in Deep RL
Stability and Hyperparameter Tuning
- Devising stable RL algorithms is hard
- Can’t run hyperparameter sweeps in the real world
Would like algorithms with favourable improvement and convergence properties:
- Trust region policy optimization (Schulman et al., 2015)
or algorithms that adaptively adjust parameters:
- Q-Prop (Gu et al., 2017)
Problem Formulation
- Multi-task reinforcement learning and generalization
- Unsupervised or self-supervised learning
Resources
- CS285 Fall 2019 - YouTube
- Welcome to Spinning Up in Deep RL! — Spinning Up documentation (Tensorflow, Pytorch)
- David Silver’s Deep RL ICML Tutorial
<biblio.bib>