Markov Decision Process
A sequential decision problem for a fully observable, stochastic
environment with a Markovian transition model and additive rewards is
called a Markov decision process (MDP). It consists of a set of states
(with initial state s_0), a set of actions A(s) available in each state s,
a transition model P(s' | s, a), and a reward function R(s).
A policy, denoted π, specifies what the agent should do in any state it
might reach; π(s) is the action recommended by the policy for state s.
The careful balancing of risk and reward is a characteristic of MDPs that does not arise in deterministic search problems.
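The solution of an MDP can be sketched with value iteration. Below is a minimal, hypothetical example (not from the notes): a 3-state MDP with two actions per non-terminal state, where the transition model `P`, reward function `R`, and discount factor `gamma` are all assumed for illustration.

```python
# Hypothetical 3-state MDP. P[s][a] is a list of (probability, next_state)
# pairs; states absent from P (here, state 2) are terminal. R[s] is the
# reward received in state s.
P = {
    0: {"stay": [(0.9, 0), (0.1, 1)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(0.9, 1), (0.1, 0)], "go": [(0.8, 2), (0.2, 1)]},
}
R = {0: -0.04, 1: -0.04, 2: 1.0}

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    # Terminal states keep their reward as their value; others start at 0.
    V = {s: (0.0 if s in P else R[s]) for s in R}
    while True:
        V_new = dict(V)
        for s, actions in P.items():
            # Bellman update: R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
            V_new[s] = R[s] + gamma * max(
                sum(p * V[sp] for p, sp in outcomes)
                for outcomes in actions.values()
            )
        if max(abs(V_new[s] - V[s]) for s in V) < eps:
            return V_new
        V = V_new

def greedy_policy(P, V):
    # Extract the policy: in each state, pick the action that maximizes
    # the expected value of the successor state.
    return {
        s: max(actions, key=lambda a: sum(p * V[sp] for p, sp in actions[a]))
        for s, actions in P.items()
    }

V = value_iteration(P, R)
pi = greedy_policy(P, V)
```

With these (assumed) numbers, the stochastic transitions make "go" worth the small per-step cost in both states, illustrating the risk/reward trade-off noted above.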
Optimality in Markov Decision Processes
Finite Horizon: maximize the expected total reward over a fixed number of steps h.
Infinite Horizon: maximize the expected discounted total reward, with discount factor γ ∈ (0, 1).
Average-reward: maximize the expected reward per time step in the limit as the horizon grows.
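Under the usual assumptions (reward r_t received at step t, horizon h, discount factor γ ∈ (0, 1)), the three criteria can be written as:

```latex
% Finite horizon: expected total reward over the first h steps
E\left[\sum_{t=0}^{h} r_t\right]
% Infinite horizon, discounted: \gamma < 1 keeps the sum finite
E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]
% Average reward: expected reward per step in the limit
\lim_{h \to \infty} \frac{1}{h}\, E\left[\sum_{t=0}^{h} r_t\right]
```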
Learning Performance (Kaelbling, Littman, and Moore, n.d.)
- Asymptotic convergence: the learned policy converges to the optimal policy in the limit, with no guarantee on how fast.
- PAC: with high probability, the number of steps on which the agent acts more than ε-suboptimally is bounded. Does not give any guarantee about the policy while it is learning.
- Regret (e.g. a bound on total regret): bounds the cumulative shortfall in reward relative to the optimal policy. Has no notion of "many small mistakes" versus "few major mistakes".
- Uniform-PAC: unifies the notions of PAC and regret (Dann, Lattimore, and Brunskill, n.d.).
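Total regret can be made concrete with a small simulation. The sketch below (a hypothetical example, not from the cited surveys) runs an ε-greedy learner on an assumed 3-armed Bernoulli bandit and accumulates the expected per-step regret of each pull; note that the final number cannot distinguish many small mistakes from a few large ones.

```python
import random

random.seed(0)
means = [0.3, 0.5, 0.8]   # true arm means (assumed; unknown to the learner)
best = max(means)

def eps_greedy_regret(T=5000, eps=0.1):
    counts = [0] * len(means)
    values = [0.0] * len(means)   # running mean reward estimate per arm
    regret = 0.0
    for _ in range(T):
        if random.random() < eps:
            a = random.randrange(len(means))                       # explore
        else:
            a = max(range(len(means)), key=lambda i: values[i])    # exploit
        reward = 1.0 if random.random() < means[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # incremental mean
        regret += best - means[a]  # expected per-step regret of this pull
    return regret

total = eps_greedy_regret()
```

Because ε-greedy keeps exploring at rate ε forever, its total regret grows linearly in T; this is exactly the kind of behavior a regret bound summarizes in one number while a Uniform-PAC bound would also account for how the mistakes are distributed.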