Inverse Reinforcement Learning
Standard imitation learning copies the actions performed by the expert and does not reason about the outcomes of those actions. Humans, by contrast, copy the intent behind the actions, which can result in vastly different actions. We want to learn the reward function because it is often hard to specify by hand.
Inverse Reinforcement Learning is about learning reward functions. This problem is, however, underspecified: there are infinitely many reward functions that can explain the same behaviour. Formally:
Inverse RL is the setting where we are given:

- states \(s \in \mathcal{S}\) and actions \(a \in \mathcal{A}\)
- (sometimes) transitions \(p(s' \mid s, a)\)
- samples \(\{\tau_i\}\) sampled from the expert policy \(\pi^*(\tau)\)

and we would like to learn the reward function \(r_\psi(s, a)\), where \(\psi\) are the reward parameters, which we can then use to recover the expert policy \(\pi^*\).
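As a concrete sketch of what learning \(r_\psi\) might look like, here is a minimal (hypothetical) linear reward parameterization over hand-chosen features; the feature map and trajectory format are assumptions for illustration, not part of the original notes:

```python
import numpy as np

def features(state, action):
    """Hand-chosen feature map f(s, a); here a toy concatenation (assumption)."""
    return np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])

def reward(psi, state, action):
    """Linear reward r_psi(s, a) = psi^T f(s, a); psi are the learned parameters."""
    return float(psi @ features(state, action))

# Expert data: a list of sampled trajectories tau_i, each a list of (state, action) pairs.
expert_trajectories = [
    [(np.array([0.0, 1.0]), np.array([1.0])),
     (np.array([0.5, 0.9]), np.array([0.0]))],
]
```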
Feature Matching IRL
Idea: if features \(\mathbf{f}(s, a)\) are important, match their expectations under the learned policy and the expert:
\begin{equation} E_{\pi^{r_\psi}}[\mathbf{f}(s, a)] = E_{\pi^*}[\mathbf{f}(s, a)] \end{equation}
We can approximate the RHS using expert samples, and the LHS is the state-action marginal under \(\pi^{r_\psi}\). Since many rewards satisfy this constraint, the maximum margin principle chooses the one that separates the expert from all other policies by the largest margin:
\begin{equation} \max_{\psi, m} \ m \quad \text{s.t.} \quad \psi^T E_{\pi^*}[\mathbf{f}(s, a)] \geq \max_{\pi \in \Pi} \psi^T E_{\pi}[\mathbf{f}(s, a)] + m \end{equation}
where the reward is linear in the features, \(r_\psi(s, a) = \psi^T \mathbf{f}(s, a)\).
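A minimal sketch (toy discrete setting; the feature map, trajectory format, and marginal \(\mu\) are assumed inputs) of how the two sides of the matching constraint are estimated:

```python
import numpy as np

def expert_feature_expectation(trajectories, features):
    """RHS: Monte Carlo estimate of E_{pi*}[f(s, a)] from expert samples."""
    feats = [features(s, a) for tau in trajectories for (s, a) in tau]
    return np.mean(feats, axis=0)

def policy_feature_expectation(mu, features, num_states, num_actions):
    """LHS: E_{pi^{r_psi}}[f(s, a)] under the state-action marginal mu(s, a)."""
    return sum(mu[s, a] * features(s, a)
               for s in range(num_states) for a in range(num_actions))
```

Matching these two vectors still leaves \(\psi\) underdetermined, which is what the max-margin formulation above tries to resolve.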
Issues with the maximum margin principle:
- Maximizing margin is an arbitrary choice
- No clear model of sub-optimality
Maximum likelihood learning
Under the MaxEnt model the probability of a trajectory is \(p(\tau \mid \psi) = \frac{1}{Z} p(\tau) \exp(r_\psi(\tau))\), so the maximum likelihood objective over the expert samples is:
\begin{equation} \max_\psi \ L = \frac{1}{N}\sum_{i=1}^{N} r_\psi(\tau_i) - \log Z \end{equation}
where the IRL partition function is:
\begin{equation} Z = \int p(\tau) \exp(r_\psi(\tau)) \, d\tau \end{equation}
Taking the gradient of the objective:
\begin{equation} \nabla_\psi L = \frac{1}{N}\sum_{i=1}^{N}\nabla_\psi r_\psi(\tau_i) - \frac{1}{Z} \int p(\tau) \exp(r_\psi(\tau)) \nabla_\psi r_\psi(\tau) \, d\tau \end{equation}
The first term is estimated with expert samples; the second term is an expectation under the soft optimal policy for the current reward.
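To see why, note that the gradient of \(\log Z\) is exactly an expectation under the trajectory distribution induced by the current reward:
\begin{equation} \nabla_\psi \log Z = \frac{1}{Z} \int p(\tau) \exp(r_\psi(\tau)) \nabla_\psi r_\psi(\tau) \, d\tau = E_{\tau \sim p(\tau \mid \psi)}\left[\nabla_\psi r_\psi(\tau)\right], \quad \text{with } p(\tau \mid \psi) = \frac{1}{Z} p(\tau) \exp(r_\psi(\tau)) \end{equation}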
Maximum Entropy (MaxEnt) Inverse RL (Ziebart et al. 2008)
The algorithm iterates the following until convergence:

- Given \(\psi\), compute the backward message \(\beta(s_t, a_t)\)
- Given \(\psi\), compute the forward message \(\alpha(s_t)\)
- Compute the state-action visitation probability \(\mu_t(s_t, a_t) \propto \beta(s_t, a_t)\, \alpha(s_t)\)
- Evaluate:
\begin{equation} \nabla_\psi L = \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_\psi r_\psi(s_{i,t}, a_{i,t}) - \sum_{t} \int\!\!\int \mu_t(s_t, a_t)\, \nabla_\psi r_\psi(s_t, a_t) \, ds_t \, da_t \end{equation}
- Take a gradient step \(\psi \leftarrow \psi + \eta \nabla_\psi L\)
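A minimal tabular sketch of the backward/forward computation, written in value (log-message) form, assuming known dynamics, a small discrete MDP, and a fixed horizon; the array shapes and names are assumptions for illustration:

```python
import numpy as np

def maxent_visitation(reward, P, p0, horizon):
    """Backward/forward passes of tabular MaxEnt IRL.

    reward:  (S, A) array with r_psi(s, a) for the current psi
    P:       (S, A, S) transition probabilities p(s' | s, a)
    p0:      (S,) initial state distribution
    Returns mu with shape (horizon, S, A): state-action visitation probabilities.
    """
    S, A = reward.shape

    # Backward pass: soft value iteration gives the soft optimal policy per timestep.
    V = np.zeros(S)
    policies = []
    for _ in range(horizon):
        Q = reward + P @ V                        # Q(s, a) = r(s, a) + E_{s'}[V(s')]
        V = np.log(np.exp(Q).sum(axis=1))         # soft maximum over actions
        policies.append(np.exp(Q - V[:, None]))   # pi(a | s) = exp(Q(s, a) - V(s))
    policies.reverse()                            # policies[t] is the policy at time t

    # Forward pass: propagate the state distribution under the soft optimal policy.
    mu = np.zeros((horizon, S, A))
    d = p0.copy()
    for t in range(horizon):
        mu[t] = d[:, None] * policies[t]          # mu_t(s, a) = d_t(s) * pi_t(a | s)
        d = np.einsum("sa,sap->p", mu[t], P)      # d_{t+1}(s') = sum_{s,a} mu_t(s,a) p(s'|s,a)
    return mu
```

The gradient then compares the expert's empirical state-action counts against \(\sum_t \mu_t(s, a)\).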
When the reward function is linear in the features, this procedure can be shown to maximize the entropy of the policy under the constraint that the feature (and hence reward) expectations of the policy and the expert are equal.
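In other words, the linear case solves:
\begin{equation} \max_{\pi} \ \mathcal{H}(\pi) \quad \text{s.t.} \quad E_{\pi}[\mathbf{f}(s, a)] = E_{\pi^*}[\mathbf{f}(s, a)] \end{equation}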
MaxEnt IRL requires:
- Solving for the soft optimal policy in the inner loop
- Enumerating all state-action tuples to compute the visitation frequencies and the gradient

Both steps are only feasible for small, discrete problems with known dynamics.
Sample-based Updates
Sample-based updates handle unknown dynamics and large or continuous state-action spaces, under the assumption that we can sample from the environment.
We learn the soft optimal policy \(\pi_\theta(a \mid s)\) with any max-ent RL algorithm (improving it a little at each iteration rather than solving it to convergence), run it to sample trajectories \(\{\tau_j\}\), and estimate the second term of the gradient with those samples, using importance weights \(w_j \propto \exp(r_\psi(\tau_j)) / \pi_\theta(\tau_j)\) to correct for the policy not yet being soft optimal.
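A sketch of the resulting sample-based gradient estimate with self-normalized importance weights; the function signature and inputs are hypothetical, and the per-trajectory reward, reward gradient, and policy log-probability are assumed to be computed elsewhere:

```python
import numpy as np

def sampled_irl_gradient(expert_grads, policy_grads, policy_log_probs, policy_rewards):
    """Importance-weighted estimate of the MaxEnt IRL gradient.

    expert_grads:     grad_psi r_psi(tau_i) for each expert trajectory, shape (N, K)
    policy_grads:     grad_psi r_psi(tau_j) for each sampled trajectory, shape (M, K)
    policy_log_probs: sum_t log pi_theta(a_t | s_t) for each sampled trajectory, shape (M,)
    policy_rewards:   r_psi(tau_j) for each sampled trajectory, shape (M,)
    """
    # Self-normalized importance weights w_j proportional to exp(r_psi(tau_j)) / pi_theta(tau_j),
    # correcting for the sampling policy not yet being the soft optimal policy.
    log_w = np.asarray(policy_rewards) - np.asarray(policy_log_probs)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    expert_term = np.mean(expert_grads, axis=0)
    policy_term = w @ np.asarray(policy_grads)   # weighted average over sampled trajectories
    return expert_term - policy_term
```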
Resources
- CS285 Fa19 10/21/19 - YouTube
- Ratliff, Nathan D., J. Andrew Bagnell, and Martin A. Zinkevich. "Maximum Margin Planning."
- Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship Learning via Inverse Reinforcement Learning."
Bibliography
Ziebart, Brian D., Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. "Maximum Entropy Inverse Reinforcement Learning." In Proceedings of the AAAI Conference on Artificial Intelligence, 1433–38.