Markov Decision Process
A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called a Markov decision process. It consists of a set of states, with initial state \(s_0\), a set \(ACTIONS(s)\) of actions in each state, a transition model \(P(s'|s, a)\), and a reward function \(R(s)\).
A policy, denoted \(\pi\), specifies what the agent should do in any state \(s\). This action is denoted by \(\pi(s)\). The optimal policy \(\pi^*\) yields the highest expected utility.
The careful balancing of risk and reward is a characteristic of MDPs that does not arise in deterministic search problems.
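To make the components above concrete, here is a minimal, hypothetical sketch (not from the source): a three-state chain with the state set, \(ACTIONS(s)\), the transition model \(P(s'|s,a)\), the reward function \(R(s)\), and a fixed (not necessarily optimal) policy \(\pi\) encoded as plain Python dictionaries. All names and numbers are invented for illustration.

```python
import random

# Hypothetical 3-state MDP: states, ACTIONS(s), P(s'|s,a), R(s), and a policy pi.
STATES = ["s0", "s1", "s2"]

def ACTIONS(s):
    """Actions available in state s (here the same two everywhere)."""
    return ["stay", "move"]

# P[(s, a)] maps each successor state s' to its probability P(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 0.9, "s2": 0.1},
    ("s1", "move"): {"s2": 0.8, "s1": 0.2},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s2": 1.0},
}

R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}   # reward R(s) for being in state s

# A policy maps every state to an action; this one is fixed, not optimal.
pi = {"s0": "move", "s1": "move", "s2": "stay"}

def step(s, a):
    """Sample a successor state from P(.|s, a)."""
    successors = P[(s, a)]
    return random.choices(list(successors), weights=list(successors.values()))[0]

# Simulate a few steps from the initial state s0 under policy pi.
s = "s0"
for t in range(5):
    a = pi[s]
    s_next = step(s, a)
    print(f"t={t}: state={s}, action={a}, reward={R[s]}, next={s_next}")
    s = s_next
```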
Optimality in Markov Decision Processes
Finite Horizon
\begin{equation} E\left( \sum_{t=0}^{h} r_t \right) \end{equation}
Infinite Horizon
\begin{equation} E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) \end{equation}
Average-reward
\begin{equation} \lim_{h \rightarrow \infty} E\left( \frac{1}{h} \sum_{t=0}^{h} r_t \right) \end{equation}
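The following small sketch (rewards, horizon, and discount factor are invented for illustration) computes the quantity inside each of the three expectations for one fixed reward sequence, which makes the difference between the criteria easy to see.

```python
# Hypothetical reward sequence r_0, ..., r_h and discount factor gamma.
rewards = [1.0, 0.0, 2.0, 1.0, 0.5]
gamma = 0.9

# Finite horizon: sum of r_t for t = 0..h.
finite_horizon = sum(rewards)

# Infinite horizon (discounted): sum of gamma^t * r_t; converges for gamma < 1
# even when the reward stream never ends.
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

# Average reward: (1/h) * sum of r_t in the limit h -> infinity,
# approximated here over the finite prefix.
average = sum(rewards) / len(rewards)

print(finite_horizon, discounted, average)
```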
Learning Performance (Kaelbling, Littman, and Moore, n.d.)
- Asymptotic convergence:
\begin{equation} \pi_n \rightarrow \pi^* \text{ as } n \rightarrow \infty \end{equation}
- PAC:
\begin{equation} P(N_{\mathrm{errors}} > F(\cdot, \epsilon, \delta)) \le \delta \end{equation}
PAC bounds give no guarantee about the quality of the policy while the agent is still learning.
- Regret (e.g. bound \(B\) on total regret):
\begin{equation} \max_j \sum_{t=0}^{T} \left( r_{tj} - r_t \right) < B \end{equation}
Regret bounds alone capture no notion of “many small mistakes” versus “few major mistakes” (see the sketch after this list).
- Uniform-PAC:
Unifies the notions of PAC and regret (Dann, Lattimore, and Brunskill, n.d.).
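The following sketch illustrates total regret on invented numbers (the reward values and the bound \(B\) are assumptions, not from the source): at each step the reward actually obtained is compared with the best achievable reward, and the accumulated difference is checked against \(B\).

```python
# Hypothetical rewards: best achievable reward per step vs. what the learner got.
optimal_rewards = [1.0, 1.0, 1.0, 1.0]   # max_j r_{tj}
actual_rewards  = [0.5, 1.0, 0.8, 1.0]   # r_t
B = 1.0                                   # assumed bound on total regret

total_regret = sum(best - got for best, got in zip(optimal_rewards, actual_rewards))
print(total_regret, total_regret < B)     # 0.7 True

# Note: the same total regret (0.7) could arise from many small mistakes or
# from one large mistake; the bound itself does not distinguish the two,
# which is the gap Uniform-PAC is meant to close.
```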