# Markov Decision Process

A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called a Markov decision process. It consists of a set of states, with initial state $$s_0$$, a set $$ACTIONS(s)$$ of actions in each state, a transition model $$P(s'|s, a)$$, and a reward function $$R(s)$$.

A policy, denoted $$\pi$$, specifies what the agent should do in any state $$s$$. This action is denoted by $$\pi(s)$$. The optimal policy $$\pi^*$$ yields the highest expected utility.

The careful balancing of risk and reward is a characteristic of MDPs that does not arise in deterministic search problems.

## Optimality in Markov Decision Processes

### Finite Horizon

$$E\left( \sum_{t=0}^{h} r_t \right)$$

### Infinite Horizon

$$E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right)$$

### Average-reward

$$\lim_{h \rightarrow \infty} E\left( \sum_{t=0}^{h} \frac{1}{h} r_t \right)$$

## Learning Performance (Kaelbling et al., 1996)

1. Asymptotic convergence:

$$\pi_n \rightarrow \pi^* \text { as } n \rightarrow \infty$$

1. PAC:

$$P(N_{errors} > F(\cdot, \epsilon, \delta)) \le \delta$$

Does not give any guarantee about the policy while it is learning

1. Regret (e.g. bound $$B$$ on total regret):

$$\mathrm{max} \sum_{t=0}^{T} r_{tj} - r_t < B$$

No notion of “many small mistakes”, or “few major mistakes”.

1. Uniform-PAC

unifies notion of PAC and regret (Dann et al., 2017)

# Bibliography

Icon by Laymik from The Noun Project. Website built with ♥ with Org-mode, Hugo, and Netlify.