Imitation Learning

Behavioural Cloning

Behavioural cloning is a fancy name for supervised learning. We collect tuples of observations and actions from demonstrations, and use supervised learning to learn a policy \(\pi_\theta(a_t \mid o_t)\).

The problem with behavioural cloning is that small errors accumulate: once the policy deviates, it sees states the expert never visited, and the state trajectory drifts further and further from the training distribution. When we run the learned policy, can we make \(p_{\mathrm{data}}(o_t) = p_{\pi_\theta}(o_t)\)?
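
As a concrete illustration of the supervised-learning step, a minimal behavioural cloning loop in PyTorch might look like the sketch below; the dimensions and the random tensors standing in for recorded demonstrations are made-up placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical demonstration data: N (observation, expert action) tuples.
# Random tensors stand in for real recorded o_t and a_t.
obs_dim, act_dim = 17, 6
observations = torch.randn(1000, obs_dim)
actions = torch.randn(1000, act_dim)

# A simple feed-forward policy for pi_theta(a_t | o_t), here just predicting
# the action directly (a deterministic mean).
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behavioural cloning = ordinary supervised regression on the demo tuples.
for epoch in range(100):
    loss = nn.functional.mse_loss(policy(observations), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that nothing in this loop addresses distributional shift: the loss only measures agreement on the states the expert happened to visit.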

DAgger: Dataset Aggregation

  • Goal: collect training data from \(p_{\pi_\theta}(o_t)\) instead of \(p_{\mathrm{data}}(o_t)\)
  • how? run \(\pi_\theta(a_t \mid o_t)\), but we need labels \(a_t\)!
  1. train \(\pi_\theta(a_t \mid o_t)\) from human data \(\mathcal{D}\)
  2. run \(\pi_\theta(a_t \mid o_t)\) to get dataset \(\mathcal{D}_\pi\)
  3. Ask a human to label \(\mathcal{D}_\pi\) with actions \(a_t\)
  4. Aggregate: \(\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_\pi\) (see the sketch after this list)
  • Problem: humans have to label large datasets iteratively, and labelling the policy's observations after the fact can be unnatural (resulting in bad labels)
  • Behavioural cloning may still work when we model the expert very accurately (no distributional “drift”)
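
A rough sketch of the loop above, assuming hypothetical helpers `train_policy`, `rollout_policy` (returns the observations visited by the current policy), and `query_expert_labels` (the human-labelling step):

```python
def dagger(expert_dataset, n_iterations=10):
    """Sketch of DAgger: iteratively aggregate expert-labelled on-policy data.

    The three helpers used here are hypothetical stand-ins for the steps
    described in the list above.
    """
    dataset = list(expert_dataset)       # D: initial human demonstrations, (o_t, a_t) pairs
    policy = train_policy(dataset)       # 1. train pi_theta(a_t | o_t) on D
    for _ in range(n_iterations):
        observations = rollout_policy(policy)       # 2. run pi_theta to collect D_pi (observations only)
        labels = query_expert_labels(observations)  # 3. ask the human to label D_pi with actions a_t
        dataset += list(zip(observations, labels))  # 4. aggregate: D <- D union D_pi
        policy = train_policy(dataset)              # retrain on the aggregated dataset
    return policy
```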

Why might we fail to fit the expert?

  1. non-Markovian behaviour

    1. Our policy assumes that the action depends only on the current observation.
    2. Perhaps a better model is to account for all observations.
    3. Problem: history exacerbates causal confusion (Haan, Jayaraman, and Levine, n.d.)
  2. Multimodal behaviour

    1. Solutions:
      1. output a mixture of Gaussians (easy to implement, works well in practice; see the sketch after the figure)
      2. Latent Variable models (additional latent variable as part of input)
      3. Autoregressive discretization

Figure 1: Autoregressive Discretization discretizes one dimension of the action space at a time
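
For the mixture-of-Gaussians option, a minimal sketch of a mixture-density output head is shown below; the architecture and hyperparameters are illustrative assumptions, not a prescribed model:

```python
import torch
import torch.nn as nn


class MixtureOfGaussiansPolicy(nn.Module):
    """Policy head that outputs a K-component Gaussian mixture over actions."""

    def __init__(self, obs_dim, act_dim, n_components=5, hidden=64):
        super().__init__()
        self.n_components = n_components
        self.act_dim = act_dim
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, n_components)              # mixture weights
        self.means = nn.Linear(hidden, n_components * act_dim)     # component means
        self.log_stds = nn.Linear(hidden, n_components * act_dim)  # component scales

    def distribution(self, obs):
        h = self.trunk(obs)
        mix = torch.distributions.Categorical(logits=self.logits(h))
        means = self.means(h).view(-1, self.n_components, self.act_dim)
        stds = self.log_stds(h).view(-1, self.n_components, self.act_dim).exp()
        comp = torch.distributions.Independent(
            torch.distributions.Normal(means, stds), 1)
        return torch.distributions.MixtureSameFamily(mix, comp)

    def loss(self, obs, expert_actions):
        # Maximise the likelihood of the expert's (possibly multimodal) actions.
        return -self.distribution(obs).log_prob(expert_actions).mean()
```

Training minimises the negative log-likelihood of the expert actions, so the policy can place probability mass on several distinct modes instead of averaging them into a single (often invalid) action.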

What’s the problem with imitation learning?

  • Humans need to provide data, which is typically finite. Deep models typically require large amounts of data.
  • Humans are not good at providing some kinds of actions
  • Humans can learn autonomously (from experience), whereas pure imitation cannot improve beyond the demonstrations

Imitation Learning in the RL context

Reward function:

\[
r(s, a) = \log p\big(a = \pi^{\star}(s) \mid s\big)
\]

Cost function:

\[
c(s, a) =
\begin{cases}
0 & \text{if } a = \pi^{\star}(s) \\
1 & \text{otherwise}
\end{cases}
\]

The number of mistakes grows quadratically in the time horizon in the worst case:

Assuming: \(\pi_\theta\big(a \neq \pi^{\star}(s) \mid s\big) \leq \epsilon\)

Figure 2: The tightrope walking problem

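A rough version of the worst-case argument behind the quadratic bound (the tightrope walker: one mistake takes the policy to states the expert never visited, where in the worst case it keeps paying cost for the rest of the horizon):

\[
\mathbb{E}\Big[\sum_{t=1}^{T} c(s_t, a_t)\Big]
\;\leq\; \epsilon T + (1-\epsilon)\Big(\epsilon (T-1) + (1-\epsilon)\big(\cdots\big)\Big)
\;\leq\; \epsilon T + \epsilon (T-1) + \dots + \epsilon
\;=\; O(\epsilon T^2)
\]

Each term corresponds to the first time step at which a mistake is made, after which the policy is assumed (worst case) to incur cost 1 at every remaining step.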

Bibliography

Haan, Pim de, Dinesh Jayaraman, and Sergey Levine. n.d. “Causal Confusion in Imitation Learning.” http://arxiv.org/abs/1905.11979v2.