Partially Observable Markov Decision Processes (POMDPs)
The assumption of full observability, combined with the Markov assumption for the transition model, means that the optimal policy depends only on the current state. When the environment is partially observable, the agent does not know which state it is in, so it cannot execute the action $\pi(s)$ recommended for that state.

In addition to the elements of an MDP – the transition model $P(s' \mid s, a)$, actions $A(s)$, and reward function $R(s)$ – a POMDP also has a sensor model $P(e \mid s)$, which gives the probability of perceiving evidence $e$ in state $s$.
A belief state is not just the set of actual states the agent might be in: in a POMDP it is a probability distribution over all possible states. The agent can compute its current belief state as the conditional probability distribution over the actual states, given the sequence of percepts and actions so far.
If the previous belief state is $b(s)$ and the agent does action $a$ and then perceives evidence $e$, the new belief state is given by

$$b'(s') = \alpha \, P(e \mid s') \sum_s P(s' \mid s, a)\, b(s),$$

where $\alpha$ is a normalizing constant that makes the belief state sum to 1. We abbreviate this update as $b' = \text{FORWARD}(b, a, e)$.
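As a concrete illustration, here is a minimal sketch of this belief update in Python/NumPy. The function name and array layout are my own assumptions: the transition model is stored as `T[a, s, s'] = P(s' | s, a)` and the sensor model as `O[s, e] = P(e | s)`.

```python
import numpy as np

def belief_update(b, a, e, T, O):
    """Belief update b' = FORWARD(b, a, e).

    b : (S,) array, current belief over physical states
    a : int, action just executed
    e : int, percept just received
    T : (A, S, S) array, T[a, s, s'] = P(s' | s, a)
    O : (S, E) array,    O[s, e]     = P(e | s)
    """
    # Prediction step: sum_s P(s' | s, a) b(s)
    predicted = T[a].T @ b
    # Correction by the sensor model: P(e | s') * predicted(s')
    unnormalized = O[:, e] * predicted
    # Normalization plays the role of the constant alpha
    return unnormalized / unnormalized.sum()
```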
The optimal action in a POMDP depends only on the agent’s belief state, not on the actual state the agent happens to be in. The optimal policy can therefore be described by a mapping $\pi^*(b)$ from belief states to actions.
The decision cycle of a POMDP agent can be broken down into three steps (a small sketch of this loop follows the list):

- Given the current belief state $b$, execute the action $a = \pi^*(b)$.
- Receive the percept $e$.
- Update the belief state to $b' = \text{FORWARD}(b, a, e)$ and repeat.
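A minimal sketch of that loop, reusing the `belief_update` function above; the `policy` function and the `env.step(a)` interface are purely illustrative assumptions, not part of the text:

```python
def run_pomdp_agent(b, policy, env, T, O, n_steps=100):
    """Run the POMDP decision cycle for a fixed number of steps."""
    for _ in range(n_steps):
        a = policy(b)                     # a = pi*(b): act on the current belief
        e = env.step(a)                   # environment returns the next percept
        b = belief_update(b, a, e, T, O)  # b' = FORWARD(b, a, e)
    return b
```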
If we knew the action and the subsequent percept, the update to the belief state would be deterministic, following the update equation above. But the subsequent percept is not yet known, so the agent may arrive in one of several possible belief states. The probability of perceiving $e$, given that action $a$ was performed starting in belief state $b$, is obtained by summing over the actual states the agent might reach:

$$P(e \mid a, b) = \sum_{s'} P(e \mid s') \sum_s P(s' \mid s, a)\, b(s).$$

Writing the probability of reaching belief state $b'$ from $b$ after action $a$ as $P(b' \mid b, a)$, we have

$$P(b' \mid b, a) = \sum_e P(b' \mid e, a, b)\, P(e \mid a, b),$$

where $P(b' \mid e, a, b)$ is 1 if $b' = \text{FORWARD}(b, a, e)$ and 0 otherwise. This can be viewed as a transition model for the belief-state space; together with the expected reward $\rho(b) = \sum_s b(s) R(s)$, it defines an observable MDP on the space of belief states, whose optimal policy $\pi^*(b)$ is also an optimal policy for the original POMDP.
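Continuing the same NumPy conventions (again an illustrative sketch, not code from the text), the percept distribution $P(e \mid a, b)$ can be computed as:

```python
import numpy as np

def percept_probability(b, a, T, O):
    """Return the vector P(e | a, b) over all percepts e.

    T[a, s, s'] = P(s' | s, a) and O[s', e] = P(e | s').
    """
    predicted = T[a].T @ b   # P(s' | a, b) = sum_s P(s' | s, a) b(s)
    return O.T @ predicted   # P(e | a, b) = sum_{s'} P(e | s') P(s' | a, b)
```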
Because the belief-state space is continuous, the algorithms for computing or approximating optimal MDP policies described earlier do not apply directly; new algorithms are needed.
Value iteration for POMDPs
The value iteration algorithm for the MDP computed one utility value for each state. With infinitely many belief states, we need to be more creative.
Consider conditional plans – plans that specify an initial action and then, for each possible subsequent percept, a subplan to execute – and how the expected utility of executing a fixed conditional plan varies with the initial belief state.
- Let the utility of executing a fixed conditional plan $p$ starting in physical state $s$ be $\alpha_p(s)$. Then the expected utility of executing $p$ in belief state $b$ is $\sum_s b(s)\,\alpha_p(s)$, or $b \cdot \alpha_p$ if we think of both as vectors. Hence the expected utility of a fixed conditional plan varies linearly with $b$; that is, it corresponds to a hyperplane in belief space.
- At any given belief state $b$, the optimal policy will choose to execute the conditional plan with the highest expected utility, and the expected utility of $b$ under the optimal policy is just the utility of that conditional plan:

  $$U(b) = U^{\pi^*}(b) = \max_p\, b \cdot \alpha_p.$$
If the optimal policy $\pi^*$ chooses to execute plan $p$ starting at $b$, it is reasonable to expect that it might also choose $p$ in belief states very close to $b$; if we bound the depth of the conditional plans, there are only finitely many such plans, and the continuous belief space is divided into regions, each corresponding to the conditional plan that is optimal in that region.

From these two observations, we see that the utility function $U(b)$ on belief states, being the maximum of a collection of hyperplanes, is piecewise linear and convex.
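For intuition, here is a minimal sketch (with illustrative names of my own) of how $U(b)$ is evaluated when each conditional plan is represented by its utility vector $\alpha_p$:

```python
import numpy as np

def utility(b, alpha_vectors):
    """U(b) = max_p b . alpha_p over a set of plan utility vectors."""
    return max(np.dot(b, alpha) for alpha in alpha_vectors)

def best_plan_index(b, alpha_vectors):
    """Index of the conditional plan that is optimal at belief b."""
    return int(np.argmax([np.dot(b, alpha) for alpha in alpha_vectors]))
```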
Let $p$ be a depth-$d$ conditional plan whose initial action is $a$ and whose depth-$(d-1)$ subplan for percept $e$ is $p.e$. Then

$$\alpha_p(s) = R(s) + \gamma \left( \sum_{s'} P(s' \mid s, a) \sum_e P(e \mid s')\, \alpha_{p.e}(s') \right).$$
This recursion gives rise to a value iteration algorithm:
function POMDP-VALUE-ITERATION(pomdp, e) returns a utility function
  inputs: pomdp, a POMDP with states S, actions A(s), transition model P(s' | s, a),
              sensor model P(e | s), rewards R(s), and discount \gamma
          e, the maximum error allowed in the utility of any state
  local variables: U, U', sets of plans p with associated utility vectors \alpha_p

  U' <- a set containing just the empty plan [], with \alpha_[](s) = R(s)
  repeat
    U <- U'
    U' <- the set of all plans consisting of an action and, for each possible next
          percept, a plan in U, with utility vectors computed by the equation above
    U' <- REMOVE-DOMINATED-PLANS(U')
  until MAX-DIFFERENCE(U, U') < e(1 - \gamma) / \gamma
  return U
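For concreteness, here is a heavily simplified Python sketch of one backup step of this algorithm: it builds the utility vectors for depth-$(d+1)$ plans from the vectors of depth-$d$ plans, and prunes plans whose vectors are pointwise dominated (a cruder test than the linear-programming pruning usually used for REMOVE-DOMINATED-PLANS). Array conventions follow the earlier sketches; all names are my own assumptions.

```python
import itertools
import numpy as np

def backup_alpha_vectors(alphas, T, O, R, gamma):
    """One step of POMDP value iteration on utility (alpha) vectors.

    alphas : list of (S,) arrays, utility vectors of the current plans
    T      : (A, S, S) array, T[a, s, s'] = P(s' | s, a)
    O      : (S, E) array,    O[s', e]    = P(e | s')
    R      : (S,) array, rewards
    gamma  : discount factor
    Returns the utility vectors of all one-step-deeper plans.
    """
    n_actions = T.shape[0]
    n_percepts = O.shape[1]
    new_alphas = []
    for a in range(n_actions):
        # For each percept e, choose which existing plan to continue with.
        for choice in itertools.product(range(len(alphas)), repeat=n_percepts):
            alpha = np.array(R, dtype=float)
            for e in range(n_percepts):
                # gamma * sum_{s'} P(s'|s,a) P(e|s') alpha_{p.e}(s')
                alpha += gamma * T[a] @ (O[:, e] * alphas[choice[e]])
            new_alphas.append(alpha)
    return new_alphas

def remove_pointwise_dominated(alphas):
    """Keep only vectors not pointwise dominated by some other vector."""
    kept = []
    for i, a in enumerate(alphas):
        dominated = any(j != i and np.all(b >= a) and np.any(b > a)
                        for j, b in enumerate(alphas))
        if not dominated:
            kept.append(a)
    return kept
```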
The algorithm’s complexity is dominated by the number of plans generated: given $|A|$ actions and $|E|$ possible percepts, there are $|A|^{O(|E|^{d-1})}$ distinct depth-$d$ plans. Even with the removal of dominated plans, this doubly exponential growth makes the algorithm hopelessly inefficient in practice for all but the smallest problems.
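As a rough illustration of that growth (a worked example of my own, not from the text): with $|A| = 2$ actions and $|E| = 2$ percepts, a depth-$d$ conditional plan must specify an action at each of the $2^d - 1$ nodes of its percept tree, so there are $2^{2^d - 1}$ distinct depth-$d$ plans; already at $d = 5$ that is $2^{31} \approx 2 \times 10^9$.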