Control As Inference

tags: Optimal Control and Planning, Reinforcement Learning ⭐

Figure 1: PGM for decision making for the first 3 time-steps

We introduce a binary variable for Optimality $O_{t}$ at each time-step. We want to infer: $p (τ | O_{1 : T})$

If we choose $p (O_{t} | s_{t}, a_{t}) = \exp (r (s_{t}, a_{t}))$ , then:

$\begin{aligned} p (τ | O_{1 : T}) & = \frac{p (τ, O_{1 : T})}{p (O_{1 : T})} \\ & \propto p (τ) \prod_{t} \exp (r (s_{t}, a_{t})) \\ & = p (τ) \exp \left( \sum_{t} r (s_{t}, a_{t}) \right) \end{aligned}$
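The reweighting above can be sketched numerically. This is a minimal illustration, assuming a hypothetical set of three candidate trajectories with made-up prior probabilities and total rewards; the function name `trajectory_posterior` is not from the source.

```python
import math

def trajectory_posterior(prior, returns):
    """Posterior over trajectories: p(tau | O_{1:T}) ∝ p(tau) * exp(sum_t r_t)."""
    weights = [p * math.exp(g) for p, g in zip(prior, returns)]
    z = sum(weights)  # plays the role of p(O_{1:T})
    return [w / z for w in weights]

# Hypothetical numbers: prior p(tau) and total reward of each trajectory.
prior = [0.5, 0.3, 0.2]
returns = [-3.0, -1.0, 0.0]
posterior = trajectory_posterior(prior, returns)
```

Trajectories with higher total reward gain probability mass, while low-reward trajectories keep a small but non-zero probability, which is what lets the model represent sub-optimal behaviour.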

With this probabilistic graphical model (PGM), we can:

model sub-optimal behaviour (important for inverse RL)
apply inference algorithms to solve control and planning problems
explain why stochastic behaviour may be preferred (useful for exploration and transfer learning)

Inference

compute backward messages $β_{t} (s_{t}, a_{t}) = p (O_{t : T} | s_{t}, a_{t})$
compute the policy $p (a_{t} | s_{t}, O_{1 : T})$ , i.e. the policy of this model under the assumption of optimality
compute forward messages $α_{t} (s_{t}) = p (s_{t} | O_{1 : t - 1})$ , which are useful for figuring out which states the optimal policy lands in (needed for the inverse RL problem, not for forward RL)

Backward Messages

$\begin{aligned} β_{t} (s_{t}, a_{t}) & = p (O_{t : T} | s_{t}, a_{t}) \\ & = \int p (O_{t : T}, s_{t + 1} | s_{t}, a_{t}) d s_{t + 1} \\ & = \int p (O_{t + 1 : T} | s_{t + 1}) p (s_{t + 1} | s_{t}, a_{t}) p (O_{t} | s_{t}, a_{t}) d s_{t + 1} \end{aligned}$

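$\begin{aligned} p (O_{t + 1 : T} | s_{t + 1}) & = \int p (O_{t + 1 : T} | s_{t + 1}, a_{t + 1}) p (a_{t + 1} | s_{t + 1}) d a_{t + 1} \\ & = \int β_{t + 1} (s_{t + 1}, a_{t + 1}) p (a_{t + 1} | s_{t + 1}) d a_{t + 1} \\ & \propto \int β_{t + 1} (s_{t + 1}, a_{t + 1}) d a_{t + 1} \end{aligned}$

where the last step assumes the action prior $p (a_{t + 1} | s_{t + 1})$ is uniform, so it is a constant that can be dropped. From these equations, we get: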

For $t = T - 1$ to $1$ :

$β_{t} (s_{t}, a_{t}) = p (O_{t} | s_{t}, a_{t}) E_{s_{t + 1} \sim p (s_{t + 1} | s_{t}, a_{t})} [β_{t + 1} (s_{t + 1})]$

$β_{t} (s_{t}) = E_{a_{t} \sim p (a_{t} | s_{t})} [β_{t} (s_{t}, a_{t})]$
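The recursion can be sketched for a small tabular problem. This is an illustrative sketch only, assuming discrete states and actions, a uniform action prior, non-positive rewards (so that $\exp (r)$ is a valid probability), 0-based time indices, and a hypothetical 2-state MDP; the names `backward_messages`, `P` and `r` are not from the source.

```python
import math

def backward_messages(P, r, horizon):
    """Tabular backward pass under a uniform action prior.

    P[s][a][s2] is p(s2 | s, a); r[s][a] <= 0 so exp(r) is a probability.
    Returns (beta_sa, beta_s), each a list indexed by t = 0..horizon-1.
    """
    n_states, n_actions = len(r), len(r[0])
    beta_sa = [None] * horizon
    beta_s = [None] * horizon
    for t in reversed(range(horizon)):
        table = []
        for s in range(n_states):
            row = []
            for a in range(n_actions):
                if t == horizon - 1:
                    future = 1.0  # no optimality variables remain after the last step
                else:
                    # E_{s2 ~ p(s2 | s, a)}[beta_{t+1}(s2)]
                    future = sum(P[s][a][s2] * beta_s[t + 1][s2]
                                 for s2 in range(n_states))
                row.append(math.exp(r[s][a]) * future)
            table.append(row)
        beta_sa[t] = table
        # E_{a ~ uniform}[beta_t(s, a)]
        beta_s[t] = [sum(row) / n_actions for row in table]
    return beta_sa, beta_s

# Hypothetical MDP: action 0 stays in place, action 1 switches state.
# State 1 has reward 0, state 0 has reward -1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[-1.0, -1.0], [0.0, 0.0]]
beta_sa, beta_s = backward_messages(P, r, horizon=3)
```

In this toy problem, switching out of the low-reward state receives a larger backward message than staying in it, as expected.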

If we choose $V_{t} (s_{t}) = \log β_{t} (s_{t})$ and $Q_{t} (s_{t}, a_{t}) = \log β_{t} (s_{t}, a_{t})$ :

$\begin{aligned} V_{t} (s_{t}) & = \log \int \exp (Q_{t} (s_{t}, a_{t})) d a_{t} \\ & \to \max_{a_{t}} Q_{t} (s_{t}, a_{t}) \text{ as } Q_{t} (s_{t}, a_{t}) \text{ gets bigger} \end{aligned}$
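A quick numerical check of this soft-max behaviour, assuming a discrete action set so the integral becomes a sum. The helper `soft_value` and the Q-values are hypothetical; scaling the Q-values up stands in for them "getting bigger".

```python
import math

def soft_value(q_values):
    """V = log sum_a exp(Q(a)), computed stably by subtracting the max."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))

q = [-1.0, -2.0, -3.0]  # hypothetical Q-values for three actions
gaps = []
for scale in (1.0, 10.0, 100.0):
    scaled = [scale * x for x in q]
    # Difference between the soft maximum and the hard maximum.
    gaps.append(soft_value(scaled) - max(scaled))
```

The gap between the soft value and the hard maximum is never negative and shrinks as the differences between Q-values grow.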

For $Q$ :

$Q_{t} (s_{t}, a_{t}) = r (s_{t}, a_{t}) + \log E_{s_{t + 1} \sim p (s_{t + 1} | s_{t}, a_{t})} [\exp (V_{t + 1} (s_{t + 1}))]$

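With deterministic transitions, the expectation is over a single next state, so the log and exp cancel and we recover the standard backup $Q_{t} (s_{t}, a_{t}) = r (s_{t}, a_{t}) + V_{t + 1} (s_{t + 1})$. With stochastic transitions, however, $\log E [\exp (\cdot)]$ acts like a soft maximum over next states, so the backup is dominated by the luckiest outcome. This optimistic treatment of the dynamics is not a good idea!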
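The optimism can be seen on a single hypothetical transition with one rare good outcome and one likely bad outcome. This is a sketch with made-up numbers; `soft_backup` and `expected_backup` are illustrative names, and the expected-value backup is shown only as the familiar point of comparison.

```python
import math

def soft_backup(probs, next_values):
    """log E_{s'}[exp(V(s'))], computed stably."""
    m = max(next_values)
    return m + math.log(sum(p * math.exp(v - m)
                            for p, v in zip(probs, next_values)))

def expected_backup(probs, next_values):
    """E_{s'}[V(s')], the usual expected-value backup."""
    return sum(p * v for p, v in zip(probs, next_values))

# Hypothetical: 10% chance of a good next state, 90% chance of a bad one.
probs = [0.1, 0.9]
next_values = [0.0, -10.0]
optimistic = soft_backup(probs, next_values)
realistic = expected_backup(probs, next_values)

# Deterministic transition: the log and exp cancel.
deterministic = soft_backup([1.0], [-3.0])
```

Here the soft backup stays close to the value of the rare good outcome (shifted by the log of its probability), while the expected value is dominated by the likely bad outcome.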

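What if the action prior is not uniform? We can always fold the action prior into the reward by using $\tilde{r} (s_{t}, a_{t}) = r (s_{t}, a_{t}) + \log p (a_{t} | s_{t})$ , so assuming a uniform prior loses no generality.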

Policy computation

$\begin{aligned} p (a_{t} | s_{t}, O_{1 : T}) & = π (a_{t} | s_{t}) \\ & = p (a_{t} | s_{t}, O_{t : T}) \\ & = \frac{β_{t} (s_{t}, a_{t})}{β_{t} (s_{t})} p (a_{t} | s_{t}) \\ & = \frac{β_{t} (s_{t}, a_{t})}{β_{t} (s_{t})} \end{aligned}$

The third line follows from Bayes' rule, and the last line drops the constant uniform action prior. It turns out the policy is just the ratio of the two backward messages. Substituting $V$ and $Q$ :

$π (a_{t} | s_{t}) = \exp (Q_{t} (s_{t}, a_{t}) - V_{t} (s_{t})) = \exp (A_{t} (s_{t}, a_{t}))$

One can also add a temperature: $π (a_{t} | s_{t}) = \exp (\frac{1}{α} A_{t} (s_{t}, a_{t}))$
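A minimal sketch of this policy for a discrete action set. It assumes that, with a temperature, the value is defined as $V = α \log \sum_{a} \exp (Q / α)$ so that the policy normalizes; the helper `soft_policy` and the Q-values are hypothetical.

```python
import math

def soft_policy(q_values, temperature=1.0):
    """pi(a | s) = exp((Q(s, a) - V(s)) / temperature) over discrete actions."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)
    # v equals V(s) / temperature, with V the soft maximum at this temperature.
    v = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [math.exp(x - v) for x in scaled]

q = [-1.0, -2.0, -3.0]  # hypothetical Q-values
warm = soft_policy(q, temperature=1.0)
cold = soft_policy(q, temperature=0.1)
```

Lowering the temperature concentrates probability on the highest-Q action, approaching a greedy policy, while higher temperatures keep the policy closer to uniform.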

Forward Messages

$p (s_{t} | O_{1 : T}) \propto β_{t} (s_{t}) α_{t} (s_{t})$

The derivation is the same as for a Hidden Markov Model!
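The forward pass and the resulting state marginals can be sketched for a tabular problem. This is a sketch under stated assumptions rather than a definitive implementation: discrete states and actions, a uniform action prior, non-positive rewards, 0-based time, a hypothetical 2-state MDP, and the HMM-style recursion $α_{t + 1} (s') \propto \sum_{s, a} p (s' | s, a) \exp (r (s, a)) p (a | s) α_{t} (s)$. All function and variable names are illustrative.

```python
import math

def backward_state_messages(P, r, horizon):
    """beta_t(s) = p(O_{t:T} | s_t = s) under a uniform action prior."""
    n_s, n_a = len(r), len(r[0])
    beta = [None] * horizon
    for t in reversed(range(horizon)):
        row = []
        for s in range(n_s):
            total = 0.0
            for a in range(n_a):
                future = 1.0 if t == horizon - 1 else sum(
                    P[s][a][s2] * beta[t + 1][s2] for s2 in range(n_s))
                total += math.exp(r[s][a]) * future / n_a
            row.append(total)
        beta[t] = row
    return beta

def forward_messages(P, r, p_init, horizon):
    """alpha_t(s) = p(s_t = s | O_{1:t-1}), normalized at every step."""
    n_s, n_a = len(r), len(r[0])
    alpha = [list(p_init)]
    for t in range(horizon - 1):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                # alpha_t(s) * p(a | s) * p(O_t | s, a)
                w = alpha[t][s] * math.exp(r[s][a]) / n_a
                for s2 in range(n_s):
                    nxt[s2] += w * P[s][a][s2]
        z = sum(nxt)
        alpha.append([x / z for x in nxt])
    return alpha

def state_marginals(alpha, beta):
    """p(s_t | O_{1:T}) ∝ beta_t(s_t) * alpha_t(s_t)."""
    out = []
    for a_t, b_t in zip(alpha, beta):
        w = [x * y for x, y in zip(a_t, b_t)]
        z = sum(w)
        out.append([x / z for x in w])
    return out

# Hypothetical MDP: action 0 mostly stays, action 1 mostly switches state.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
r = [[-1.0, -1.0], [0.0, 0.0]]
p_init = [0.5, 0.5]
horizon = 3
alpha = forward_messages(P, r, p_init, horizon)
beta = backward_state_messages(P, r, horizon)
marginals = state_marginals(alpha, beta)
```

For a problem this small, the marginals can be checked against brute-force enumeration of all trajectories weighted by $p (τ) \exp (\sum_{t} r (s_{t}, a_{t}))$.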

Resolving Optimism with Variational Inference

For more, see Levine (n.d.).

Resources

CS285 Fa19 10/16/19 - YouTube
