Artificial Intelligence
What is Artificial Intelligence?
Designing agents that act rationally (e.g. through maximising a reward function).
Humans often act in ways that do not maximise their own benefit (irrational).
Acting Humanly: Turing Test
A computer would require:
 natural language processing
 knowledge representation
 automated reasoning
 machine learning
Thinking Humanly
Cognitive science brings together computer models and experimental techniques in psychology to construct testable and provable theories of the human mind.
Thinking rationally
Taking informal knowledge and expressing it in logical terms.
Acting Rationally
A rational agent is one that acts so as to achieve the best outcome or,when there is uncertainty, the best expected outcome.
An agent is a function from percept histories to actions, i.e. \(f: P^* \rightarrow A\). We seek the bestperforming agent for a certain task; must consider computation limits.
Intelligent Agents
 Agents perceive the environment through sensors
 Agents act upon the environment through actuators
Rational Agents
 For each possible percept sequence, select an action that is expected to maximise its performance measure. The performance measure is a function of a given sequence of environment states.
 Given the evidence provided by the percept sequence and whatever builtin knowledge the agent has.
 Agents can perform actions that help them gather useful information (exploration)
 An agent is autonomous if its behaviour is determined by its own experience (with ability to learn and adapt)
Exploration vs Exploitation
Doing actions that modify future percepts (information gathering) is an important part of rationality. In most scenarios, agents don’t know the entire environment a priori.
Specifying Task Environment (PEAS)
 Performance measure
 Environment
 Actuators
 Sensors
Properties of Task Environments
 Fully observable
 whether an agent’s sensors gives it access to the complete state of the environment at any given point in time
 Deterministic
 if the next state is completely determined by the current environment. Otherwise it is stochastic.
 Episodic
 whether an agents experience is divided into atomic episodes. In each episode, an agent receives a percept and performs a single action. In sequential environments shortterm actions can have longterm consequences. For this reason, episodic environments are generally simpler.
 Static
 whether the environment can change while the agent is deliberating.
 Discrete
 whether the state of the environment, how time is handled, and the percepts and actions of the agent discretely quantized.
 Single agent
 in some environments, for example chess, there are
multiple agents acting in the same environment.
 cooperative: if the two agents need to work together.
 competitive: if the two agents are working against each other.
 Known
 whether the agent knows the outcome of its actions.
TableDriven Agent
Simple to implement, and works. However, the number of table entries is exponential in time: \(\text{#percepts}^\text{time}\). Hence it is doomed to failure. The key challenge to AI is to produce rational behaviour from a small program rather than a vast table.
Reflex agents
A simple reflex agent is one that selects actions on the basis of the current percept, ignoring the rest of the percept history. A conditionaction rule is triggered upon processing the current percept. E.g. if the car in front is braking, then brake too.
Basing actions on only the current percept can be highly limiting, and can also lead to infinite loops. Randomized actions of the right kind can help escape these infinite loops.
Modelbased Reflex Agents
The agent maintains some internal state that depends on percept history and reflects at least some of the unobserved aspects of the current state. Information about how the world evolves independently from the agent is encoded into the agent. This knowledge is called a model of the world, and this agent is hence a modelbased agent.
Goalbased agents
Knowing about the current state of the environment may not be enough to decide on what to do. Agents may need goal information that describes situations that are desirable. Sometimes goalbased action selection is straightforward, but in others searching and planning are required to achieve the goal. Goalbased agents are flexible because the knowledge that supports its decisions is represented explicitly and can be modified, although it is less efficient.
Utilitybased agents
Goals provide a binary distinction between good and bad states. A more general performance measure should allow a comparison between world states according to exactly how good it is to the agent. An agent’s utility function is an internalisation of the performance measure. An agent chooses actions to maximise its expected utility. A utilitybased agents has to model and keep track of its environment.
Learning Agents
A learning agent can be divided into four conceptual components.
 learning element
 responsible for making improvements
 performance element
 responsible for selecting external actions
 problem generator
 suggests actions that will lead to new and informative experiences
the learning element takes in feedback from the critic on how the agent is doing and determines show the performance element should be modified to do better in the future.
State representations
Atomic Representation
In an atomic representation each state of the world is indivisible, and has no internal structure. Search, gameplaying, hidden Markov models and Markov decision processes all work with atomic representations.
Factored Representation
A factored representation splits up each state into a fixed set of variables or attributes, each of which can have a value.
Constraint satisfaction algorithms, propositional logic, planning, Bayesian networks and machine learning algorithms work with factored representations.
Structured Representations
Structured representations underlie relational databases and firstorder logic, firstorder probability models, knowledgebased learning and much of natural language understanding.
Implications
A more expressive representation can capture, at least as concisely, a everything a more expressive one can capture, plus more. On the other hand, reasoning and learning become more complex as the expressive power of the representation increases.
ProblemSolving
Problemsolving agents use atomic representations, as compared to goalbased agents, which use more advanced factored or structured representations.
The process of looking for a sequence of actions that reaches the goal is called search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence.
Classical Search
This addresses observable, deterministic, and known environments where the solution is a sequence of actions.
How Search Algorithms Work
Search algorithms consider various possible action sequences. The possible action sequences start at the initial state form a search tree.
Search algorithms require a data structure to keep track of the search tree that is being constructed.
 state
 state in the state space to which the node corresponds
 parent
 the node in the search tree that generated this node
 action
 the action that was applied to the parent to generate this node
 pathcost
 the cost, traditionally denoted by \(g(n)\), of the path from the initial state to the node, as indicated by the parent pointers
Measuring Performance
 completeness
 is the algorithm guaranteed to find a solution if it exists?
 optimality
 does the strategy find the optimal solution?
 time complexity
 how long does it take to find a solution?
 space complexity
 how much memory is required to do the search?
Uninformed Search Strategies
Breadthfirst Search
The root node is expanded first, then all the successors of the root node are expanded next, then their successors, and so on.
performance  rating 

completeness  YES 
optimal  NO 
time complexity  \(O(b^d)\) 
space complexity  \(O(b^d)\) 
The shallowest node may not be the most optimal node.
The space used in the explored set is \(O(b^{d1})\) and the space used in the frontier is \(O(b^d)\).
In general, exponentialcomplexity search problems cannot be solved by uninformed methods for any but the smallest instances.
Uniformcost Search
Uniformcost search expands the node \(n\) with the lowest path cost \(g(n)\). The goal test is applied to a node when it is selected for expansion rather than when it is first generated.
This is equivalent to BFS if all step costs are qual.
performance  rating 

completeness  MAYBE 
optimal  YES 
time  \(O(b^{1+\lfloor{\frac{C^*}{\epsilon}}\rfloor})\), where \(C^*\) is the optimal cost. 
space  \(O(b^{1+\lfloor{\frac{C^*}{\epsilon}}\rfloor})\) 
Completeness is guaranteed only if the cost of every step exceeds some small positive constant \(\epsilon\). an infinite loop may occur if there is a path with an infinite sequence of zerocost actions.
Depthfirst Search
Always expands the deepest node in the current frontier of the search tree.
performance  rating 

completeness  YES 
optimal  NO 
time complexity  \(O(b^m)\) 
space complexity  \(O(b^m)\), \(O(m)\) if backtrack 
The time complexity of DFS may be worse than BFS: \(O(b^m)\) might be larger than \(O(b^d)\).
DFS only requires storage of \(O(bm)\) nodes, where \(m\) is the maximum depth of any node. backtracking search only generates one successor at a time, modifying the current state description rather than copying it. Memory requirements reduce to one state description and \(O(m)\) actions.
Depthlimited Search
In depthlimited search, nodes at depth of predetermined limit \(l\) are treated as if they had no successors. This limit solves the infinitepath problem.
performance  rating 

completeness  YES 
optimal  NO 
time complexity  \(O(b^l)\) 
space complexity  \(O(b^l)\), \(O(l)\) if backtrack 
Iterative Deepening Depthfirst Search
Key idea is to gradually increase the depth limit: first 0, then 1, then 2… until a goal is found.
\(N(IDS) = (d)b + (d1)b^2 + \dots + (1)b^d\), which gives a time complexity of \(O(b^d)\)
performance  rating 

completeness  YES 
optimal  NO (unless step cost is 1) 
time complexity  \(O(b^d)\) 
space complexity  \(O(b^d)\), \(O(m)\) if backtrack 
 BFS and IDS are complete if \(b\) is finite.
 UCS is complete if \(b\) is finite and step cost is \(\ge \epsilon\).
 BFS and IDS are optimal if all step costs are identical.
Bidirectional Search
Conduct two simultaneous searches – one forward from the initial state, and the other backward from the goal. This is implemented by replacing the goal test with a check to see whether the frontiers of two searches intersect. This reduces the time ad space complexity to \(O(b^{d/2})\).
Informed Search Strategies
Greedy bestfirst search
Greedy bestfirst search tries to expand the node that is closest to the goal, on the grounds that this is likely to lead to a solution quickly. It evaluates nodes by using just the heuristic function: \(f(n) = h(n)\).
Greedy bestfirst tree search is incomplete even in a finite state space. The graph search version is complete in finite spaces, but not in infinite ones. The worst case time and space complexity is \(O(b^m)\). However, with a good heuristic function, the complexity can be reduced substantially.
A* search
It evaluates nodes by combining \(g(n)\) the cost to reach the node, and \(h(n)\) the cost to get to the goal: \(f(n) = g(n) + h(n)\). Since \(g(n)\) gives the path cost from the start node to node \(n\), and \(h(n)\) is the estimated cost of the cheapest path from \(n\) to the goal,$f(n) = $ estimated cost of the cheapest solution through \(n\).
\(h(n)\) is an admissible heuristic iff it never overestimates the cost to reach the goal. For A*, this means that \(f(n)\) would never overestimate the cost of a solution along the current path.
Admissible heuristics are by nature optimistic because they think the cost of solving the problem is less than it actually is.
A second, slightly stronger condition is called consistency, and is required only for applications of A* to graph search. A heuristic \(h(n)\) is consistent iff for every node \(n\) and every successor \(n’\) of \(n\) generated by any action \(a\), the estimated cost of reaching the goal from \(n\) is no greater than the step cost of getting to \(n’\) plus the estimated cost of reaching the goal from \(n’\): \(h(n) \le c(n,a,n’) + h(n’)\). This is a form of the general triangle inequality.
A* search is complete, optimal and optimally efficient with a consistent heuristic. The latter means that no other optimal algorithm is guaranteed to expand fewer nodes than A*.
However, for most problems, the number of states within the goal contour search space is still exponential in the length of the solution.
The absolute error of a heuristic is defined as \(\Delta = h^*h\), and the relative error is defined as \(\epsilon = \frac{h^*h}{h*}\). The complexity results depend strongly on the assumptions made about the state space. For constant step costs, it is \(O(b^{\epsilon d})\), and the effective branching factor is \(b^\epsilon\).
A* keeps all generated nodes in memory, and hence it usually runs out of space long before it runs of time. Hence, it is not practical for largescale problems.
Other memorybounded heuristic searches include:
 iterativedeepening A* (IDA*)
 Recursive bestfirst search (RBFS)
 Memorybounded A* (MA*)
 simplified MA* (SMA*)
Learning to Search Better
Each state in a metalevel state space captures the internal computational state of a program that is searching in an objectlevel state space. A metalevel learning algorithm can learn from experiences to avoid exploring unpromising subtrees. The goal of the learning is to minimise the total cost of problem solving, trading off computational expense and path cost.
Heuristic Functions
If for any node n \(h_2(n) \ge h_1(n)\), we say that \(h_2\) dominates \(h_1\). Domination translates directly into efficiency: A* using \(h_2\) will never expand more nodes than \(h_1\). Hence it is generally better to use a heuristic function with higher value, while making sure it is consistent, and computing the heuristic function is computationally feasible.
Generating Admissible Heuristics
From Relaxed Problems
Because the relaxed problem adds edges to the state space, any optimal solution in the original problem is, by definition, also a solution in the relaxed problem. Hence, the cost of an optimal solution to a relaxed problem is an admissible heuristic for the original problem. Because the derived heuristic is an exact cost for the relaxed problem, it must obey the triangle inequality and is therefore consistent.
From Subproblems: Pattern Databases
Pattern Databases store exact solution costs for every possible subproblem instance. In the case of the 8puzzle, every possible configuration of the four tiles and the blank. Each pattern database yields an admissible heuristic, and these heuristics can be combined by taking the maximum value. Solutions to subproblems can overlap: disjoint pattern databases account for this. These work by dividing the problem in a way that each move affects only one subproblem.
From Experience
Inductive learning methods work best when supplied with features of a state that are relevant to predicting the state’s value. A common approach to combining features would be through a linear combination: \(h(n) = c_1x_1(n) + c_2x_2(n)\).
These heuristics satisfy the requirement that \(h(n) = 0\) for goal states, but are not necessarily admissible or consistent.
Beyond Classical Search
Here, we cover algorithms that perform purely local search in the state space, evaluating and modifying one or more current states rather than systematically exploring paths from an initial state. These include methods inspired by statistical physics (simulated annealing) and evolutionary biology (genetic algorithms).
If an agent cannot predict exactly what percept it will receive, then it will need to consider what to do under each contingency that its percepts may reveal.
If the path to the goal doesn’t matter, we giht consider a different class of algorithms, ones that do not worry about the paths at all. Local search algorithms operate using a single current node and generally move only to neighbours of that node. Its advantages include:
 They generally use a constant amount of memory
 They can often find reasonable solutions in large or infinite state spaces where systematic algorithms are not suitable.
Hillclimbing Search
The hillclimbing search is a loop that continually moves in the direction of increasing value.
Consider the 8queens problem.
Local search algorithms typically use a completestate formation. The successors of a state are all possible states generated by moving a single queen to another square in the same column.
We could use a heuristic cost function \(h\) equal to the number of queens that are attacking each other, either directly or indirectly.
The global minimum of this function is zero, which only occurs for perfect solutions. Hillclimbing algorithms typically choose randomly among the set of best successors having the lowest \(h\).
Hillclimbing algorithms can get stuck for the following reasons:
 local maxima
 sequence of local maxima
 flat local maximum, or shoulder, from which progress is possible.
Variants
 stochastic hillclimbing
 chooses at random from among the uphill moves; the probability of selection can vary with the steepness of the uphill move. Usually converges more slowly, but finds better solutions.
 firstchoice hillclimbing
 stochastic hillclimbing with randomly generated successors until one is generated that is better than the current state. Good when state has many successors.
 randomrestart hillclimbing
 conducts hillclimbing searches from randomly generated initial states, until a goal is found. Trivially complete with probability approaching 1.
Simulated Annealing
A hillclimbing algorithm that never makes ‘downhill’ moves towards states with lowervalue is guaranteed to be incomplete, because it can be stuck on a local maximum.
function SIMULATEDANNEALING(problem, schedule)
inputs: problem, a problem
schedule, a mapping from time to 'temperature'
current ← MAKENODE(problem, INITIALSTATE)
for t = 1 to ∞ do
T ← schedule(t)
if T = 0 then return current
next ← a randomly selected successor of current
𝞓E ← next.VALUE  current.VALUE
if 𝞓E > 0 then current ← next
else current ← next only with probability e^(𝞓E/T)
Local Beam Search
https://www.youtube.com/watch?v=RLWuzLLSIgw
Local beam search keeps track of \(n\) states rather than just one. It begins with \(n\) randomly generated states, at each step all the successors of all states are generated. If any one is a goal, the algorithm halts.
Localbeam search passes useful information between the parallel search threads (compared to running randomrestart \(n\) times), quickly abandoning unfruitful searches and moves its resources to where the most progress is being made.
Stochastic local beam search chooses \(n\) successors at random, with the probability of choosing a given successor being an increasing function of its value.
Genetic Algorithms
A genetic algorithm is a variant of stochastic beam search in which successor states are generated by combining two parent states rather than by modifying a single state.
GA begins with a set of \(n\) randomly generated states, called the population. Each state is also called an individual.
The production of the next generation of states is rated by the objective function, or fitness function. A fitness function returns higher values for better states.
Like stochastic beam search, genetic algorithms combine an uphill tendency with random exploration and exchange of information among parallel search threads. crossover in genetic algorithms raises the level of granularity at which the search operates.
function GENETICALGORITHM(population, FITNESSFN) returns an individual
inputs: population, a set of individuals
FITNESSFN, a function that measures the fitness of an
individual
repeat
new_population ← empty set
for i = 1 to SIZE(population) do
x ← RANDOMSELECTION(population, FITNESSFN)
y ← RANDOMSELECTION(population, FITNESSFN)
child ← REPRODUCE(x,y)
if (small random probability) then child ← MUTATE(child)
add child to new_population
population ← new_population
until some individual is fit enough, or enough has elapsed
return the best individual in population, according to FITNESSFN
function REPRODUCE(x,y) returns an individual
inputs: x,y, parent individuals
n ← LENGTH(x); c ← random(1,n)
return APPEND(SUBSTRING(x,1,c), SUBSTRING(y, c+1, n))
Local Search in Continuous Spaces
One way to avoid continuous problems is simply to discretize the neighbourhood of each state. Many methods attempt to use the gradient of the landscape to find a maximum: \(x \leftarrow x + \delta \nabla (x)\), where \(\delta\) is a small constant called the step size. For many problems, the NewtonRaphson method is effective. It solves the roots for equations \(g(x) = 0\), by computing a new estimate: \(x \leftarrow x  g’(x)/g(x)\). To find a maximum or minimum of \(f\), we need to find \(x\) such that the gradient is zero. In this case \(g(\mathbf{x})\) in Newton’s formula becomes \(\nabla f(\mathbf{x})\) and the update equation can be written in matrixvector form as:
\begin{align*} \mathbf{x} \leftarrow \mathbf{x}  H_f^{1}(\mathbf{x})\nabla f(\mathbf{x}) \end{align*}
where \(H_f\) is the Hessian matrix of second derivatives. For highdimensional problems, computing the \(n^2\) entries of the Hessian and inverting it may be expensive, and approximate versions have been developed.
Local search methods suffer from local maxima, ridges and plateaux in continuous spaces just as much as in discrete spaces.
Searching with Nondeterministic Actions
When the environment is either partially observable or nondeterministic, percepts become useful. In a partially observable environment, every percept helps narrow down the set of possible states the agent might be in. In a nondeterministic environment, percepts tell the agent which of the possible outcomes of its actions has actually occurred. Future percepts cannot be determined in advance, and the agent’s future actions will depend on those future percepts. The solution to a problem is not a sequence but a contingency plan
The solutions for nodeterministic problems can contain nested ifthenelse statements, meaning they are trees and not sequences.
ANDOR search trees
A solution for an ANDOR search problem is a subtree that:
 includes every outcome branch leaf
 specifies one action at each of its OR nodes
 includes every outcome branch at each of its AND nodes
function ANDORGRAPHSEARCH(problem) returns a conditional plan, or failure
ORSEARCH(problem, INITIALSTATE, problem, [])
function ORSEARCH(state,problem,path) returns a conditional plan, or failure
if problem, GOALTEST(state) then return the empty plan
if state is on path then return failure
for each action in problem, ACTIONS(state) do
plan ← ANDSEARCH(RESULTS(state,action), problem, [state  path])
if plan ≠ failure then return [action  plan]
return failure
function ANDSEARCH(states,problem,path) returns a conditional plan, or failure
for each s_i in states do
plan_i ← ORSEARCH(s_i,problem,path)
if plan_i = failure then return failure
return [if s_1 then plan_1 else if s_2 then plan_2 ...]
(stop at AIMA 4.4)
Adversarial Search
Competitive environments, in which the agent’s goals are in conflict, give rise to adversarial search problems.
Game theory views any multiagent environment as a game, provied that the impact of each agent on the others is significant.
Games often have large branching factors, and require making some decision even before computing the optimal decision.
Pruning allows us to ignore portions of the search tree that make no difference to the final choice, and heuristic evaluation functions allow us to approximate the true utility of a state without doing a complete search.
A game can be formally defined as a search problem with the following elements:
 \(S_0\)
 the initial state, which specifies how the game is set up at the start
 \(Player(s)\)
 Defines which player has the move in a state
 \(Actions(s)\)
 Returns the set of legal moves in a state
 \(Result(s,a)\)
 The transition model, which defines the result of a move
 \(TerminalTest(s)\)
 Terminal test, which is true when the game is over, and false otherwise.
 \(Utility(s,p)\)
 A utility function defines the numeric value for a game that ends in terminal state \(s\) for a player \(p\).
The initial state, \(Actions\) function and \(Result\) function define the game tree for the game.
Optimal Strategy
The optimal strategy can be determined from the minimax value of each node (\(Minimax(n)\)). The minimax value of a node is the utility of being in the corresponding state, assuming that players play optimally from there to the nd of the game. The minimax value of a terminal state is its utility.
\begin{align}
Minimax(s) =
\begin{cases}
Utility(s), \text{ if } TerminalTest(s) \\\
max_{a \in Actions(s)}MINIMAX(Result(s,a)), \text{if Player(s) =
Max} \\\
min_{a \in Actions(s)}MINIMAX(Result(s,a)), \text{if Player(s) = Min}
\end{cases}
\end{align}
Minimax uses utility function on leaf nodes, backing up through the tree, setting the node value to be the minimum of the children.
AlphaBeta Pruning
Eliminate parts of the search tree that do not affect decision.
Classical Planning
Problemsolving agents that deal with atomic representations of states require good domainspecific heuristics to perform well. The hybrid propositional logical agent can find plans without domainspecific heuristics because it uses domainindependent heuristics based on the logical structure of the problem. However, it relies on ground propositional inference, and suffers when there are large numbers of actions and states.
Planning Domain Definition Language (PDDL) was created in response to these deficiencies.
Each state is represented as a conjunction of fluents that are ground, functionless atoms. Database semantics is used, which involves:
 closedworld assumption
 all fluents not mentioned are false
 unique names assumption
 fluents with different names are distinct
State representations are carefully designed so that they can be manipulated by set operations or logical inference.
Actions are defined in terms of the preconditions and effects. Preconditions and effects are described in terms of a conjunction of literals.
Complexities of classical planning
PlanSAT is the question of whether there exists any plan that solves a planning problem. Bounded PlanSAT asks whether there is a solution of length k or less.
While the number of states is finite, adding function symbols make them infinite, making these problems semidecidable at best. Certain restrictions can reduce the questions into a P class problem.
However, most agents would not be asked to derive plans for worstcase problem instances. For many problem domains, bounded PlanSAT is NPcomplete, while PlanSAT is in P.
Statespace search for planning
The first approach is forward (propogation) statespace search, which searches forward from the initial state. However, it is inefficient, for the following reasons:
 It tends to explore irrelevant actions
 Planning problems tend to have large state spaces, and relatively small problems will be an issue without a good heuristic
It turn out that good domainindependent heuristics can be derived for forward search, which makes it feasible.
Alternatively, we can do a backward (regression) statespace search, which looks for actions that can lead to the goal. Unlike forward search, backward search only explores relevant actions, hence has a low branching factor. However, backward search deals with sets, which are make it harder to derive good domainindependent heuristics.
Heuristics for planning
Framing the search problem as a graph where the nodes are states and the edges are actions. We can think of a number of ways to relax the problem, generating admissible heuristics:
 Add more edges to the graph, making it easier to find a path
The ignore preconditions heuristic drops all preconditions from actions, and every action becomes applicable in every state. We count the number of actions required such that the union of the action’s effects satisfy the goal. This is called the setcover problem, which is unfortunately NPhard. We can also ignore selected preconditions of actions.
The ignore delete list heuristic, drops all negative literals in goals and preconditions. This way, an action cannot undo progress towards the goal, and each action taken would monotonically progress towards it.
 Grouping multiple nodes together, shrinking the size of the graph
We can reduce the number of states by forming a state abstraction – a manytoone mapping from states in the ground representation of the problem to the abstract representation. For example, one can ignore some fluents.
The key idea in defining heuristics is decomposition: diving a problem into parts. Subgoal independence is the assumption that the cost of solving a conjunction of subgoals is approximated by the sum of the costs of solving a subgoal independently.
Other classical planning approaches
We can translate a problem description in PDDL to a form that can be processed by SATPlan. The steps are below:
 Propositionalize the actions: replace each action schema with a set of ground actions formed by substituting constants for each of the variables. These ground actions are not part of the translation, but will be used in subsequent steps.
 Define the initial state: assert \(F^0\) for every fluent \(F\) in the problem’s initial state, and \(\neg F^0\) for every fluent not mentioned in the final state.
 Propositionalize the goal: the goal becomes a disjunction over all of its ground instances, where variables are replaced by constants.
 Add successorstate axioms: For each fluent \(F\), add an axiom of the form:
\begin{equation} F^{t+1} \iff ActionCausesF^t \vee \left( F^t \wedge \neg ActionCausesNotF^t \right) \end{equation}
Where \(ActionCausesF\) is a disjunction of all the ground actions that have \(F\) in their add list, and \(ActionCausesNotF\) is a disjunction of all ground actions that have \(F\) in their delete list.
 Add precondition axioms: for each ground action \(A\), add the axiom \(A^t \iff PRE(A)^t\), i.e. if an action is taken at time \(t\), then the preconditions must be true.
 Add action exclusion axioms: say that exactly one ground action can occur at each step.
function SATPLAN(init, transition, goal, Tmax) returns solution or failure
""inputs: init, transition, goal constitute the problem
description
Tmax: upper limit for plan length""
for t = 0 to Tmax do
cnf < translatetosat(init, transition, goal, t)
model < SATSolver(cnf)
if model is not null then
return extractsolution(model)
return failure
Decision Theory
Reference: AIMA Chapter 16
Utility theory combines with probability theory to yield a decisionmaking agent, that can make rational decisions based on what it believes and what it wants. This is opposed to goalbased agents, that make binary decisions between good and bad states.
Combining beliefs and desires under uncertainty
Under nondeterministic, partially observable environments, the result of an action cannot be represented by a single number. \(RESULT(\alpha)\) is now expressed as a random variable, whose values are the possible outcome states with corresponding probabilities of occurrence. The probability of an outcome \(s’\), given evidence observations \(\mathbb{e}\), is written \(P(RESULT(\alpha) = s’  a, \mathbb{e})\).
The agent’s preferences are captured by a utility function \(U(s)\), which expresses the desirability of a state. We can calculate the expected utility of a function as follows:
\begin{equation} EU(a  \mathbb{e}) = \sum_{s’} P(RESULT(\alpha) = s’  a, \mathbb{e}) U(s’) \end{equation}
The principle of maximum expected utility (MEU) says that a rational agent should choose the action that maximizes the expected utility. If an agent acts to maximise a utility function that correctly reflects all possible outcomes, then it will achieve the highest possible performance score (averaged across all environments).
Axioms of Utility Theory
To show why MEU is a rational choice, we turn to utility theory. First, we define notation:
 \(A \succ B\): the agent prefers \(A\) over \(B\).
 \(A \sim B\): the agent is indifferent between \(A\) and \(B\).
 \(A \succeq B\): the agent prefers \(A\) over \(B\), or is indifferent.
The axioms of utility theory are:
 orderability
 Exactly one of \(A \succ B\), \(B \succ A\) or \(A \sim B\) holds.
 transitivity
 \((A \succ B) \wedge (B \succ C) \Rightarrow A \succ C\)
 continuity
 \(A \succ B \succ C \Rightarrow \exist p [p, A; 1p, C] \sim B\)
 substitutability
 \(A \sim B \Rightarrow [p, A; 1p, C] \sim [p, B; 1p, C]\)
 monotonicity
 \(A \succ B \Rightarrow (p > q) \iff [p, A; 1p B] \succ [q, A; 1q, B]\)
 Decomposability
 \([p, A; 1p, [q, B; 1q, C]] \sim[p, A;(1p)q, B; (1p)q, C]\)
von Neumann and Morgenstern showed that following these axioms, it can be shown that.
 Utility functions exist. \(U(A) > U(B) \iff A \succ B\) and \((U(A) = U(B) \iff A \sim B)\).
 The utility of a lottery, can be represented as the sum of probability of each outcome, multiplied by the utility of that outcome: \(U([p_1, S_1; \dots; p_n, S_n]) = \sum_i p_iU(S_i)\).
Utility functions are not unique: in fact, that are not changed with affine transformations, of the form \(U’(s) = aU(s) + b\). This is because the agent only needs a preference on the ranking of states, and the actual value of the utility does not matter.
Utility assessment and utility scales
To build a decisiontheoretic system, we must first work out what the agent’s utility function is. This process is called preference elicitation, which involves presenting choices to the agent and using the observed preferences to pin down the underlying utility function.
We have established above that utility functions are not unique, but it helps to have a normalized utility. Normalized utilities fix the utility of a “best possible prize” at \(u_\top = 1\) and the “worst possible prize” at \(u_\bot = 0\).
Given this utility scale, we can assess the utility of a prize \(S\), by asking the agent to choose between \(S\) and the standard lottery \([p, u_\top;1p, u_\bot]\). \(p\) is adjusted until the agent is indifferent between \(S\) and the lottery. This is done for each prize \(S\).
Multiattribute utility functions
When outcomes are characterized by two or more attributes, we need a multiattribute utility function.
Dominance
We say that there is strict dominance of \(S_1\) over \(S_2\) if \(S_2\) is lower on all attributes as compared to \(S_1\). Under uncertainty, we use a more useful generalization: stochastic dominance.
If \(A_1\) stochastically dominates \(A_2\), then for any monotonically increasing utility function \(U(x)\), the expected utility of \(A_1\) is at least as high as the expected utility of \(A_2\). We can understand this via the cumulative distribution. For any action \(A_1\) and \(A_2\), that lead to probability distributions \(p_1(x)\) and \(p_2(x)\) on the random variable \(X\):
\begin{equation} \forall x \int_{\infty}^{x} p_1(x’)dx’ \le \int_{\infty}^{x} p_2(x’)dx’ \end{equation}
Preference structure and multiattribute utility
Multiattribute utility theory is based on the supposition that preferences of typical agents have structure. This alleviates the difficulty in expressing a utility function with many attributes and possible distinct values.
One simple regularity in preference structures is called preference independence. Two attributes \(X_1\) and \(X_2\) are preferentially independent if the preference between outcomes \((x_1, x_2, x_3)\) and \((x_1’, x_2’, x_3)\) does not depend on the value of the value \(x_3\) of attribute \(X_3\).
If all attributes are mutually preferentially independent, then an agent’s preference can be described by maximising the function:
\begin{equation} V(x_1, x_2, \dots, x_n) = \sum_i V_i(x_i) \end{equation}
An example is:
\begin{equation} V(noise, cost, deaths) =  noise \times 10^4  cost  deaths \times 10^{12} \end{equation}
Under uncertainty, the mathematics gets a bit more complicated. Utility independence extends preference independence to cover lotteries: a set of attributes \(\mathbb{X}\) is utility independent of a set of attributes \(\mathbb{Y}\) if preference between lotteries on the attributes in \(\mathbb{X}\) are independent of the particular values of the attributes in \(\mathbb{Y}\).
Mutual utility independence implies that the agent’s behaviour can be described using a multiplicative utility function. If we denote \(U_i\) to be \(U_i(x_i)\), then a 3 attribute utility function can be expressed as:
\begin{equation} U = k_1U_1 + k_2U_2 + k_3U_3 + k_1k_2U_1U_2 + k_1k_3U_1U_3 + k_2k_3U_2U_3 + k_1k_2k_3U_1U_2U_3 \end{equation}
Decision Networks
Decision networks extend Bayesian networks with nodes for actions and utilities. An example of a decision network is below:
The notation is as such:
 chance nodes (oval)
 these represent random variables
 decision nodes (rectangles)
 these represent points where the decision maker has a choice of actions.
 utility nodes (diamonds)
 represent the agent’s utility function
A simplified form is used in many cases. The notation remains identical, but the chance nodes describing the outcome state are omitted. Rather than representing a utility function on outcome states, the utility node represents the expected utility associated with each action. This node is associated with an actionutility function (also known as Qfunction).
Information Value Theory
The value of a given piece of information is defined to be the difference in expected value between the best action before and after information is obtained. This is one of the most important parts of decisionmaking: knowing what information to obtain. Information has value to the extent that it is likely to cause a change of plan and to the extent that the new plan is significantly better than the old plan.
Mathematically, the value of perfect information (VPI) is defined as:
\begin{equation} VPI_e(E_j) = \left( \sum_k P(E_j = e_{jk}  \mathbb{e}) MEU(\alpha_{e_{jk}}  \mathbb{e}, E_j = e_{jk}) \right)  MEU(\alpha  \mathbb{e}) \end{equation}
This is obtained by considering the best action (maximum expected utility) before and after obtaining the information, and averaging it across all possible values for the new information, using our current beliefs of its value.

Properties of the value of information
First, the expected value of information is nonnegative.
\begin{equation} \forall \mathbb{e}, E_j VPI_{\mathbb{e}} (E_j) \ge 0 \end{equation}
The theorem is about the expected value, and not the real value. This means that information can sometimes lead to a plan that is harmful.
It is important to note that VPI is dependent on the current state of information. VPI is not additive in general:
\begin{equation} VPI_{\mathbb{e}}(E_j, E_k) \ne VPI_{\mathbb{e}}(E_j) + VPI_{\mathbb{e}}(E_k) \end{equation}
VPI is order independent. That is:
\begin{equation} VPI_{\mathbb{e}}(E_j, E_k) = VPI_{\mathbb{e}}(E_j) + VPI_{\mathbb{e}, e_j}(E_k) = VPI_{\mathbb{e}}(E_k) + VPI_{\mathbb{e}, e_k}(E_j) \end{equation}

Information Gathering Agents
we can implement a myopic informationgathering agent, by using the VPI formula shortsightedly.
function INFORMATIONGATHERINGAGENT(percept) returns an action persistent: D, a decision network integrate percept into D j < the value that maximises VPI(E_j) / Cost(E_j) if VPI(E_j) > Cost(E_j) return REQUEST(E_j) else return the best action from D
If we know the associated cost of observing evidence, it simply retrieves the evidence if the cost of observing it is less than the value it provides.
RANDOM
Simon’s Ant
Simon, noble prize in economics
The complexity of the behaviour is the manifestation of the complexity of the environment and not the complexity of the program.
REFILE
Papers
Improving Policy Gradient by Exploring UnderAppreciated Rewards
The REINFORCE algorithm minimizes the KL divergence between \(\pi_\theta\) and \(\pi^*\), the optimal policy. However, learning an optimal policy by optimizing this direction of the KL is known to be modeseeking. Hence, it the agent is prone to falling into a local optimum, and miss out some modes of \(\pi^*\).
Hence, a entropy regularization loss is added to encourage exploration.
However, this exploration is undirected, and requires a small regularization coefficient to prevent too much random exploration.
Optimizing in the meanseeking direction of the KL divergence is to learn a policy by following:
\begin{equation} O_{RAML}(\theta; \tau) = \mathcal{E}_{h \sim p(h)}\left\{ \tau \sum_{a \in A} \pi_\tau^* (a  h) \log \pi_\theta (a  h) \right\} \end{equation}
In RL, the reward landscape is unknown, hence sampling from \(\pi_\tau^*\) is not straightforward. We can approximate the expectation with respect to \(\pi_\tau^*\) by using selfnormalized importance sampling. For importance sampling, one draws \(K\) i.i.d samples \(\{ a^{(k)}\}^{K}_{k=1}\) from \(\pi_\theta\) and computes a set of normalized weights to approximate \(O_{RAML}(\theta; \tau  h)\).
\begin{equation} O_{RAML} (\theta; \tau  h) \approx \tau \sum_{k=1}^{K}\frac{w_\tau(a^{(k)}  h)}{\sum_{m=1}^{K}w_\tau(a^{(m)}  h)} \log \pi_\theta(a^{(k)}  h) \end{equation}
where \(w_\tau(a^{(k)}  h) \propto \pi_\tau^* / \pi_\theta\) denotes an importance weight defined by:
\begin{equation} w_\tau (a^{(k)}  h) = exp \left\{ \frac{1}{\tau}r (a^{(k)}  h)  \log \pi_\theta (a^{(k)}  h) \right\} \end{equation}
One can view the weights as the discrepancy between scaled rewards \(r / \tau\) and the policy’s logprobabilities \(\log \pi_\theta\).
In UREX, both the RAML objective and the expected reward objective is combined and jointly maximized.