Inverse Reinforcement Learning
Standard imitation learning copies the actions performed by the expert and does not reason about the outcomes of those actions. Humans, by contrast, copy the intent behind the actions, which can result in vastly different actions. We want to learn the reward function because it is often hard to specify by hand.
Inverse Reinforcement Learning is about learning reward functions. This problem is, however, underspecified: there are infinitely many reward functions that can explain the same behaviour. Formally:
Inverse RL is the setting where we are given:

- states \(s \in \mathcal{S}\) and actions \(a \in \mathcal{A}\)
- (sometimes) transitions \(p(s' \mid s, a)\)
- samples \(\{\tau_i\}\) sampled from the expert policy \(\pi^*(\tau)\)

and we would like to learn the reward function \(r_\psi(s, a)\), where \(\psi\) are the reward parameters, which we can then use to recover the expert policy \(\pi^*\).
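As a concrete sketch of what learning \(r_\psi\) might look like, here is a minimal (hypothetical) linear reward parameterization over hand-chosen features; the feature map and trajectory format are assumptions for illustration, not part of the original notes:

```python
import numpy as np

def features(state, action):
    """Hand-chosen feature map f(s, a); here a toy concatenation (assumption)."""
    return np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])

def reward(psi, state, action):
    """Linear reward r_psi(s, a) = psi^T f(s, a); psi are the learned parameters."""
    return float(psi @ features(state, action))

# Expert data: a list of sampled trajectories tau_i, each a list of (state, action) pairs.
expert_trajectories = [
    [(np.array([0.0, 1.0]), np.array([1.0])),
     (np.array([0.5, 0.9]), np.array([0.0]))],
]
```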
Feature Matching IRL
Idea: if features \(\mathbf{f}(s, a)\) are important, match their expectations under the learned policy and the expert:
\begin{equation} E_{\pi^{r_\psi}}[\mathbf{f}(s, a)] = E_{\pi^*}[\mathbf{f}(s, a)] \end{equation}
We can approximate the RHS using expert samples, and the LHS is the state-action marginal under \(\pi^{r_\psi}\). Since many rewards satisfy this constraint, the maximum margin principle chooses the one that separates the expert from all other policies by the largest margin:
\begin{equation} \max_{\psi, m} \ m \quad \text{s.t.} \quad \psi^T E_{\pi^*}[\mathbf{f}(s, a)] \geq \max_{\pi \in \Pi} \psi^T E_{\pi}[\mathbf{f}(s, a)] + m \end{equation}
where the reward is linear in the features, \(r_\psi(s, a) = \psi^T \mathbf{f}(s, a)\).
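A minimal sketch (toy discrete setting; the feature map, trajectory format, and marginal \(\mu\) are assumed inputs) of how the two sides of the matching constraint are estimated:

```python
import numpy as np

def expert_feature_expectation(trajectories, features):
    """RHS: Monte Carlo estimate of E_{pi*}[f(s, a)] from expert samples."""
    feats = [features(s, a) for tau in trajectories for (s, a) in tau]
    return np.mean(feats, axis=0)

def policy_feature_expectation(mu, features, num_states, num_actions):
    """LHS: E_{pi^{r_psi}}[f(s, a)] under the state-action marginal mu(s, a)."""
    return sum(mu[s, a] * features(s, a)
               for s in range(num_states) for a in range(num_actions))
```

Matching these two vectors still leaves \(\psi\) underdetermined, which is what the max-margin formulation above tries to resolve.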
Issues with the maximum margin principle:
- Maximizing margin is an arbitrary choice
- No clear model of sub-optimality
Maximum likelihood learning
Under the MaxEnt model the probability of a trajectory is \(p(\tau \mid \psi) = \frac{1}{Z} p(\tau) \exp(r_\psi(\tau))\), so the maximum likelihood objective over the expert samples is:
\begin{equation} \max_\psi \ L = \frac{1}{N}\sum_{i=1}^{N} r_\psi(\tau_i) - \log Z \end{equation}
where the IRL partition function is:
\begin{equation} Z = \int p(\tau) \exp(r_\psi(\tau)) \, d\tau \end{equation}
Taking the gradient of the objective:
\begin{equation} \nabla_\psi L = \frac{1}{N}\sum_{i=1}^{N}\nabla_\psi r_\psi(\tau_i) - \frac{1}{Z} \int p(\tau) \exp(r_\psi(\tau)) \nabla_\psi r_\psi(\tau) \, d\tau \end{equation}
The first term is estimated with expert samples; the second term is an expectation under the soft optimal policy for the current reward.
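To see why, note that the gradient of \(\log Z\) is exactly an expectation under the trajectory distribution induced by the current reward:
\begin{equation} \nabla_\psi \log Z = \frac{1}{Z} \int p(\tau) \exp(r_\psi(\tau)) \nabla_\psi r_\psi(\tau) \, d\tau = E_{\tau \sim p(\tau \mid \psi)}\left[\nabla_\psi r_\psi(\tau)\right], \quad \text{with } p(\tau \mid \psi) = \frac{1}{Z} p(\tau) \exp(r_\psi(\tau)) \end{equation}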
Maximum Entropy (MaxEnt) Inverse RL (Ziebart et al. 2008)
The algorithm iterates the following until convergence:

- Given \(\psi\), compute the backward message \(\beta(s_t, a_t)\)
- Given \(\psi\), compute the forward message \(\alpha(s_t)\)
- Compute the state-action visitation probability \(\mu_t(s_t, a_t) \propto \beta(s_t, a_t)\, \alpha(s_t)\)
- Evaluate:
\begin{equation} \nabla_\psi L = \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_\psi r_\psi(s_{i,t}, a_{i,t}) - \sum_{t} \int\!\!\int \mu_t(s_t, a_t)\, \nabla_\psi r_\psi(s_t, a_t) \, ds_t \, da_t \end{equation}
- Take a gradient step \(\psi \leftarrow \psi + \eta \nabla_\psi L\)
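A minimal tabular sketch of the backward/forward computation, written in value (log-message) form, assuming known dynamics, a small discrete MDP, and a fixed horizon; the array shapes and names are assumptions for illustration:

```python
import numpy as np

def maxent_visitation(reward, P, p0, horizon):
    """Backward/forward passes of tabular MaxEnt IRL.

    reward:  (S, A) array with r_psi(s, a) for the current psi
    P:       (S, A, S) transition probabilities p(s' | s, a)
    p0:      (S,) initial state distribution
    Returns mu with shape (horizon, S, A): state-action visitation probabilities.
    """
    S, A = reward.shape

    # Backward pass: soft value iteration gives the soft optimal policy per timestep.
    V = np.zeros(S)
    policies = []
    for _ in range(horizon):
        Q = reward + P @ V                        # Q(s, a) = r(s, a) + E_{s'}[V(s')]
        V = np.log(np.exp(Q).sum(axis=1))         # soft maximum over actions
        policies.append(np.exp(Q - V[:, None]))   # pi(a | s) = exp(Q(s, a) - V(s))
    policies.reverse()                            # policies[t] is the policy at time t

    # Forward pass: propagate the state distribution under the soft optimal policy.
    mu = np.zeros((horizon, S, A))
    d = p0.copy()
    for t in range(horizon):
        mu[t] = d[:, None] * policies[t]          # mu_t(s, a) = d_t(s) * pi_t(a | s)
        d = np.einsum("sa,sap->p", mu[t], P)      # d_{t+1}(s') = sum_{s,a} mu_t(s,a) p(s'|s,a)
    return mu
```

The gradient then compares the expert's empirical state-action counts against \(\sum_t \mu_t(s, a)\).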
When the reward function is linear in the features, this procedure can be shown to maximize the entropy of the policy under the constraint that the feature (and hence reward) expectations of the policy and the expert are equal.
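In other words, the linear case solves:
\begin{equation} \max_{\pi} \ \mathcal{H}(\pi) \quad \text{s.t.} \quad E_{\pi}[\mathbf{f}(s, a)] = E_{\pi^*}[\mathbf{f}(s, a)] \end{equation}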
MaxEnt IRL requires:
- Solving for the soft optimal policy in the inner loop
- Enumerating all state-action tuples to compute the visitation frequencies and the gradient

Both steps are only feasible for small, discrete problems with known dynamics.
Sample-based Updates
Sample-based updates handle unknown dynamics and large or continuous state-action spaces, under the assumption that we can sample from the environment.
We learn the soft optimal policy \(\pi_\theta(a \mid s)\) with any max-ent RL algorithm (improving it a little at each iteration rather than solving it to convergence), run it to sample trajectories \(\{\tau_j\}\), and estimate the second term of the gradient with those samples, using importance weights \(w_j \propto \exp(r_\psi(\tau_j)) / \pi_\theta(\tau_j)\) to correct for the policy not yet being soft optimal.
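A sketch of the resulting sample-based gradient estimate with self-normalized importance weights; the function signature and inputs are hypothetical, and the per-trajectory reward, reward gradient, and policy log-probability are assumed to be computed elsewhere:

```python
import numpy as np

def sampled_irl_gradient(expert_grads, policy_grads, policy_log_probs, policy_rewards):
    """Importance-weighted estimate of the MaxEnt IRL gradient.

    expert_grads:     grad_psi r_psi(tau_i) for each expert trajectory, shape (N, K)
    policy_grads:     grad_psi r_psi(tau_j) for each sampled trajectory, shape (M, K)
    policy_log_probs: sum_t log pi_theta(a_t | s_t) for each sampled trajectory, shape (M,)
    policy_rewards:   r_psi(tau_j) for each sampled trajectory, shape (M,)
    """
    # Self-normalized importance weights w_j proportional to exp(r_psi(tau_j)) / pi_theta(tau_j),
    # correcting for the sampling policy not yet being the soft optimal policy.
    log_w = np.asarray(policy_rewards) - np.asarray(policy_log_probs)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    expert_term = np.mean(expert_grads, axis=0)
    policy_term = w @ np.asarray(policy_grads)   # weighted average over sampled trajectories
    return expert_term - policy_term
```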
Resources
- CS285 Fa19 10/21/19 - YouTube
- Ratliff, Nathan D., J. Andrew Bagnell, and Martin A. Zinkevich. "Maximum Margin Planning."
- Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship Learning via Inverse Reinforcement Learning."
Bibliography
Ziebart, Brian D., Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. "Maximum Entropy Inverse Reinforcement Learning." In Proceedings of the AAAI Conference on Artificial Intelligence, 1433–38.