In Reinforcement Learning ⭐, exploration is important where rewards are sparse, and not a direct indication of how good an action is. Some environments where good exploration is necessary is Montezuma’s revenge, where finishing a game only weakly correlates with rewarding events.

Key Questions:

- How can an agent discover high-reward strategies that require a temporally-extended sequence of complex behaviours that, individually, are not rewarding?
- How can an agent decide whether to attempt new behaviours or continue to do the best thing it knows so far?
- Is there an
*optimal*exploration strategy?

In order of theoretical tractability (tractable to intractable):

- multi-armed bandits (1-step stateless RL problems) : can be formailized as POMDP identification
- contextual bandits (1-step RL problems) : policy learning is trivial even with POMDP
- small, finite MDPs (tractable planning, model-based RL setting) : can frame as Bayesian model identification, reason explicitly about value of information
- large, infinite MDPs, continuous spaces : optimal methods don’t work

## General Themes in Exploration

- Requires some form of uncertainty
- Assumes:
- Unknown is good (optimism)
- Sample = Truth
- information gain is good

## Exploration in Bandits

### Optimistic Exploration

Keep track of average reward \(\hat{\mu}_a\) for each action \(a\), and choose $a = \mathrm{argmax} \hat{\mu}_a + Cσ_a$for some variance estimate \(\sigma_a\). This method is model-free.

### Posterior/Thompson Sampling

Here, we assume $r(a_i) ∼ p_{θ_i}(r_i), defining a POMDP with
\(s = \left[\theta_1, \dots, \theta_n \right]\), and we have a belief
over the states.

Thompson sampling does this:

- sample \(\theta_1, \dots, \theta_n \sim \hat{p}(\theta_1, \dots, \theta_n)\)
- pretend the model \(\theta_1, \dots, \theta_n\) is correct
- take the optimal action
- update the model

Thompson sampling is hard to analyze theoretically, but can work well empirically.

### Information Gain

\begin{equation} IG(z, y|a) = E_y\left[ \mathcal{H}(\hat{p}(z)) - \mathcal{H}(\hat{p}(z)|y)|a \right] \end{equation}

is how much we learn about \(z\) from action \(a\), given current beliefs

If we have \(\Delta(a) = E[r(a^\star) - r(a)]\), the expected suboptimality of \(a\), and \(g(a) = IG(\theta_a, r_a | a)\), then we can choose \(a\) according to \(\mathrm{argmin}_a \frac{\Delta(a)^2}{g(a)}\).

### Upper Confidence Bound

\begin{equation} a = \mathrm{argmax} \hat{\mu}_a + \sqrt{\frac{2 \ln T}{N(a)}} \end{equation}

## Extending Exploration to RL

### Count-based exploration (Bellemare et al., 2016)

Use pseudo-counts:

\begin{equation} r_i^+ = r_i + \mathcal{B}(\hat{N}(s)) \end{equation}

There are many choices for the bonus.

# Bibliography

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., & Munos, R., *Unifying count-based exploration and intrinsic motivation*, In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 1471–1479) (2016). : Curran Associates, Inc. ↩