
Markov Decision Process

A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called a Markov decision process. It consists of a set of states, with initial state \(s_0\), a set \(ACTIONS(s)\) of actions in each state, a transition model \(P(s' \mid s, a)\), and a reward function \(R(s)\).

A policy, denoted \(\pi\), specifies what the agent should do in any state \(s\). This action is denoted by \(\pi(s)\). The optimal policy \(\pi^*\) yields the highest expected utility.

The careful balancing of risk and reward is a characteristic of MDPs that does not arise in deterministic search problems.
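
To make these components concrete, here is a minimal Python sketch of an MDP and a policy on a made-up three-state example; the state names, transition probabilities, rewards, policy, and the Monte-Carlo `expected_utility` helper are all illustrative assumptions, not anything taken from the sources cited below.

```python
import random

# Made-up three-state MDP (all numbers are illustrative assumptions).
states = ["s0", "s1", "s2"]

def actions(s):
    # ACTIONS(s): the set of actions available in state s
    return ["stay", "move"]

# Transition model P(s' | s, a), stored as nested dicts: P[s][a][s'] = probability
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.8, "s2": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s2": 0.9, "s0": 0.1}},
    "s2": {"stay": {"s2": 1.0}, "move": {"s2": 1.0}},
}

# Reward function R(s)
R = {"s0": 0.0, "s1": -0.1, "s2": 1.0}

# A policy pi maps each state to an action
pi = {"s0": "move", "s1": "move", "s2": "stay"}

def expected_utility(pi, s0="s0", gamma=0.9, episodes=1000, horizon=50):
    """Monte-Carlo estimate of the expected discounted utility of policy pi."""
    total = 0.0
    for _ in range(episodes):
        s, ret = s0, 0.0
        for t in range(horizon):
            ret += (gamma ** t) * R[s]
            successors = P[s][pi[s]]
            s = random.choices(list(successors), weights=successors.values())[0]
        total += ret
    return total / episodes

print(expected_utility(pi))
```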

Optimality in Markov Decision Processes

Finite Horizon

\[ E\left( \sum_{t=0}^{h} r_t \right) \]

Infinite Horizon

\[ E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) \]

Average-reward

\[ \lim_{h \to \infty} E\left( \sum_{t=0}^{h} \frac{1}{h} r_t \right) \]
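
A small sketch, assuming a hypothetical reward sequence, horizon, and discount factor, that evaluates each of the three criteria above on that sequence:

```python
# Hypothetical reward sequence r_0, ..., r_h produced by some policy;
# the numbers are made up purely for illustration.
rewards = [1.0, 0.0, 2.0, 1.0, 0.5, 1.5]
h = len(rewards) - 1      # finite horizon
gamma = 0.9               # discount factor

# Finite horizon: sum_{t=0}^{h} r_t, here for one realised sequence
finite_horizon = sum(rewards)

# Infinite horizon (discounted): sum_{t=0}^{inf} gamma^t r_t,
# truncated to the observed prefix of the sequence
discounted = sum(gamma ** t * r for t, r in enumerate(rewards))

# Average reward: lim_{h->inf} (1/h) sum_{t=0}^{h} r_t,
# approximated by the empirical mean over the prefix
average = sum(rewards) / len(rewards)

print(finite_horizon, discounted, average)
```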

Learning Performance (Kaelbling, Littman, and Moore, n.d.)

  1. Asymptotic convergence:

\(\pi_n \to \pi^*\) as \(n \to \infty\)

  2. PAC:

\[ P\left( N_{\text{errors}} > F(\cdot, \epsilon, \delta) \right) \leq \delta \]

PAC gives no guarantee about the policy while the agent is still learning.

  3. Regret (e.g. a bound B on total regret):

\[ \max_{j} \sum_{t=0}^{T} \left( r_t^{j} - r_t \right) < B \]

Regret has no notion of “many small mistakes” versus “few major mistakes” (a toy regret computation is sketched after this list).

  4. Uniform-PAC:

Unifies the notions of PAC and regret (Dann, Lattimore, and Brunskill, n.d.).
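
As a toy illustration of the regret criterion (item 3 above), the following sketch runs a naive epsilon-greedy learner on a made-up two-armed bandit and reports its total regret against the best fixed arm in hindsight; the arm means, horizon, and exploration rate are all assumed for illustration.

```python
import random

random.seed(0)

# Made-up two-armed bandit: arm name -> mean Bernoulli reward (assumptions).
means = {"a": 0.3, "b": 0.7}
T = 1000          # number of decision steps
epsilon = 0.1     # exploration rate of the naive learner

counts = {arm: 0 for arm in means}
totals = {arm: 0.0 for arm in means}
learner_reward = 0.0

for t in range(T):
    # Epsilon-greedy choice: explore at random, otherwise pick the arm
    # with the best empirical mean so far.
    if random.random() < epsilon or min(counts.values()) == 0:
        arm = random.choice(list(means))
    else:
        arm = max(means, key=lambda a: totals[a] / counts[a])
    r = 1.0 if random.random() < means[arm] else 0.0
    counts[arm] += 1
    totals[arm] += r
    learner_reward += r

# Total regret: expected reward of the best fixed arm in hindsight minus the
# reward the learner actually collected.  A regret bound B caps this quantity.
best_fixed = max(means.values()) * T
regret = best_fixed - learner_reward
print(f"total regret over {T} steps: {regret:.1f}")
```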
