Jethro's Braindump

Markov Decision Process

A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called a Markov decision process. It consists of a set of states, with initial state \(s_0\), a set \(ACTIONS(s)\) of actions in each state, a transition model \(P(s'|s, a)\), and a reward function \(R(s)\).

A policy, denoted \(\pi\), specifies what the agent should do in any state \(s\). This action is denoted by \(\pi(s)\). The optimal policy \(\pi^*\) yields the highest expected utility.

The careful balancing of risk and reward is a characteristic of MDPs that does not arise in deterministic search problems.

Optimality in Markov Decision Processes

Finite Horizon

\begin{equation} E\left( \sum_{t=0}^{h} r_t \right) \end{equation}

Infinite Horizon

\begin{equation} E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) \end{equation}


\begin{equation} \lim_{h \rightarrow \infty} E\left( \sum_{t=0}^{h} \frac{1}{h} r_t \right) \end{equation}

Learning Performance (Kaelbling et al., 1996)

  1. Asymptotic convergence:

\begin{equation} \pi_n \rightarrow \pi^* \text { as } n \rightarrow \infty \end{equation}

  1. PAC:

\begin{equation} P(N_{errors} > F(\cdot, \epsilon, \delta)) \le \delta \end{equation}

Does not give any guarantee about the policy while it is learning

  1. Regret (e.g. bound \(B\) on total regret):

\begin{equation} \mathrm{max} \sum_{t=0}^{T} r_{tj} - r_t < B \end{equation}

No notion of “many small mistakes”, or “few major mistakes”.

  1. Uniform-PAC

unifies notion of PAC and regret (Dann et al., 2017)


Kaelbling, L. P., Littman, M. L., & Moore, A. W., Reinforcement learning: a survey, Journal of artificial intelligence research, 4(), 237–285 (1996).

Dann, C., Lattimore, T., & Brunskill, E., Unifying pac and regret: uniform pac bounds for episodic reinforcement learning, In , Advances in Neural Information Processing Systems (pp. 5713–5723) (2017). : .

Icon by Laymik from The Noun Project. Website built with ♥ with Org-mode, Hugo, and Netlify.