## Definitions ^{(Simard {\it et al.}, 2017)}

- **machine learning research**: aims at making the learner better by improving ML algorithms
- **machine teaching research**: aims at making the teacher more productive at building machine learning models

## Problem Formulation

Machine learning takes a given training dataset \(D\) and learns a model \(\hat{\theta}\). The learner can take various forms, including version space learners, Bayesian learners, deep neural networks, or cognitive models.

In contrast, in machine teaching the target model \(\theta^*\) is given, and the teacher finds a teaching set \(D\) (not necessarily i.i.d.) such that a machine learner trained on \(D\) will approximately learn \(\theta^*\).

In machine teaching, we wish to solve the following optimization problem:

\begin{align}
\min_{D, \hat{\theta}} \quad & \textrm{TeachingRisk}(\hat{\theta}) + \eta \, \textrm{TeachingCost}(D) \\
\textrm{s.t.} \quad & \hat{\theta} = \textrm{MachineLearning}(D).
\end{align}

Here \(\textrm{TeachingRisk}(\hat{\theta})\) is a generic function measuring how
unsatisfied the teacher is with the learned model. The target model \(\theta^*\) is folded
into the teaching risk function. The teaching cost function is also
generalized beyond the number of teaching items: for example,
different teaching items may impose different cognitive burdens on a
human student. ^{(Zhu {\it et al.}, 2018)}

There are other formulations of machine teaching that place different constraints on the teaching. For example, one may want to minimize the teaching cost, while constraining the teaching risk, or instead choose to optimize the teaching risk given constraints on the teaching cost.
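As a toy sketch of the optimization above, one can brute-force the teaching set for a very simple learner. The threshold learner, the candidate pool, and the value of \(\eta\) below are illustrative assumptions, not taken from the cited papers:

```python
# Brute-force the machine-teaching objective
#   min_{D} TeachingRisk(theta_hat) + eta * TeachingCost(D)
#   s.t.    theta_hat = MachineLearning(D)
# for an assumed 1-D threshold learner.
from itertools import combinations

def machine_learning(D):
    """Threshold learner: place the boundary midway between the
    largest negative and the smallest positive example."""
    neg = [x for x, y in D if y == 0]
    pos = [x for x, y in D if y == 1]
    if not neg or not pos:
        return 0.5  # uninformative default when one class is missing
    return (max(neg) + min(pos)) / 2

theta_star = 0.6   # target model, known to the teacher
eta = 0.01         # trade-off between teaching risk and teaching cost
pool = [(round(i * 0.1, 1), int(round(i * 0.1, 1) >= theta_star))
        for i in range(11)]  # labelled candidate items on a grid

best = None
for k in range(1, 4):  # consider teaching sets of size 1..3
    for D in combinations(pool, k):
        theta_hat = machine_learning(D)
        objective = abs(theta_hat - theta_star) + eta * len(D)
        if best is None or objective < best[0]:
            best = (objective, D, theta_hat)

objective, D, theta_hat = best
print(D, theta_hat)
```

With a small \(\eta\), the teacher prefers a two-item set straddling the target boundary over a cheaper one-item set that leaves more teaching risk.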

### Why bother if \(\theta^*\) is known?

There are applications where the teacher needs to convey the target model \(\theta^*\) to a learner via training data. For example:

- In education problems, the teacher may know \(\theta^*\) but cannot directly transmit the model to the students. If the teacher possesses a good cognitive model of how students learn from examples, they can use machine teaching to optimize the choice of teaching examples.
- In training-set poisoning, an attacker manipulates the behaviour of a machine learning system by maliciously modifying the training data. An attacker knowing the algorithm may send specially designed training examples to manipulate the learning algorithm.

## Faster Teaching Via POMDP Planning ^{(Rafferty {\it et al.}, 2015)}

The authors formulate teaching as a POMDP, and use a decision-theoretic approach to planning teaching. Assuming knowledge about the student’s learning model, the teacher is able to find optimal teaching actions.

A POMDP is specified as a tuple:

\begin{equation} \langle S, A, Z, p(s'|s, a), p(z|s,a), r(s,a), \gamma \rangle \end{equation}

- \(S\): set of states
- \(A\): set of actions
- \(Z\): set of observations
- \(p(s' | s, a)\): transition model
- \(p(z | s, a)\): observation model, the probability of observing \(z\)
- \(r(s, a)\): reward/cost model
- \(\gamma\): discount factor

POMDP planning seeks to choose actions that minimize the expected discounted cumulative cost \(\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\), treating \(r\) as a cost (with rewards, one maximizes instead).

The learner model specifies the space \(S\) of possible knowledge states, and the transition model \(p(s'|s, a)\) for how knowledge changes in response to teaching actions.

Simple learner models for concept learning can be specified. For example, in the memoryless model, if an action is consistent with the learner's current concept, the state stays the same. Otherwise, the learner transitions to a state that is consistent with the action, with probability proportional to the prior probability of that concept:

\begin{equation}
p(s_{t+1} = c_i | s_t = c_j, a_t) \propto \begin{cases}
p_0(c_i) & \textrm{ if $c_i$ is consistent with $a_t$} \\
0 & \textrm{otherwise}
\end{cases}
\end{equation}

Experiments showed that POMDP planning leads to faster teaching.
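A minimal simulation of this memoryless model can make it concrete. The toy space of threshold concepts over five items, and the two teaching examples shown, are illustrative assumptions:

```python
# Memoryless learner model: if a shown example contradicts the
# learner's current concept, the learner jumps to a concept consistent
# with the example, with probability proportional to its prior p0.
import random

# concept k labels item x positive iff x >= k
concepts = list(range(6))
p0 = {c: 1 / len(concepts) for c in concepts}  # uniform prior

def consistent(c, example):
    """Does concept c assign the example's label to its item?"""
    x, y = example
    return (x >= c) == (y == 1)

def step(state, action, rng):
    """Memoryless transition: stay if consistent, else resample
    among consistent concepts, weighted by the prior."""
    if consistent(state, action):
        return state
    cands = [c for c in concepts if consistent(c, action)]
    weights = [p0[c] for c in cands]
    return rng.choices(cands, weights=weights, k=1)[0]

rng = random.Random(0)
target = 3
state = 0
# the teacher repeatedly shows the two examples that jointly pin
# down concept 3: item 2 is negative, item 3 is positive
for action in [(2, 0), (3, 1)] * 10:
    state = step(state, action, rng)
print(state)
```

Because only the target concept is consistent with both examples, the target state is absorbing: once reached, no further example moves the learner away.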

### Limitations

POMDP teaching relies on having models of learning, and the models we currently have do not fully capture human learning. Human learners can also learn better when they are aware they are being taught. Moreover, the models are not accurate enough to decide when to terminate teaching.

**Are we able to learn to teach without explicitly assuming a student model?**

Beyond time to teach, it is difficult to incorporate other factors such as motivation. Learners may also have their own reward function, and jointly optimizing the student and teacher rewards is computationally more difficult.

POMDP planning can be computationally intractable, requiring techniques such as MCTS and forward search, which sample only a subset of the possible actions.

Suppose a task is modelled with a discrete state space: a task with 1000 states yields a 1000-dimensional belief state. To overcome this “curse of dimensionality”, point-based POMDP algorithms like HSVI2 and SARSOP use probabilistic sampling.

One can also factor out the fully observable state components to reduce the dimensionality of the belief space, writing \(S = X \times Y\), where \(X\) is the space of all possible values of the fully observable variables, and \(Y\) is the space of the partially observable variables.

Since the state variable \(x\) is fully observable, we only need to maintain a belief \(b_Y\) over the state variables in \(Y\). ^{(Du {\it et al.}, 2010)}
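A sketch of such a factored belief update, maintaining a belief only over \(Y\). The transition and observation tables are illustrative assumptions; in a full implementation they would additionally be conditioned on the observed \(x\) and the chosen action:

```python
# Factored belief update under S = X x Y: x is observed directly,
# so the Bayes filter runs only over the partially observable y.
Y = [0, 1]  # partially observable component, e.g. hidden knowledge

# assumed transition table p(y' | y) for one fixed observed x and action
T = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.2, 1: 0.8}}
# assumed observation table p(z | y')
O = {0: {"z0": 0.7, "z1": 0.3},
     1: {"z0": 0.1, "z1": 0.9}}

def belief_update(b, z):
    """One Bayes-filter step over Y only: predict with T, weight by
    the likelihood of observation z, then renormalize."""
    new = {y2: O[y2][z] * sum(T[y][y2] * b[y] for y in Y) for y2 in Y}
    norm = sum(new.values())
    return {y: v / norm for y, v in new.items()}

b = {0: 0.5, 1: 0.5}     # initial belief over Y
b = belief_update(b, "z1")
print(b)
```

The belief vector has dimension \(|Y|\) rather than \(|X \times Y|\), which is the source of the dimensionality reduction.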

## Bayesian Teaching

Bayesian teaching aims to induce a target model in the learner by presenting teaching sets of data. This involves two sides of inference:

- Teacher’s inference: over the space of possible teaching sets
- Learner’s inference: over the space of possible target models

### Bayesian Teaching as Model Explanation

The intuition is that subsets of the training data that lead a model to
the same (or approximately the same) inference as the model trained on
all the data should be useful for understanding the fitted model.
^{(Sojitra, 2018)}

Below is an example of using Bayesian teaching, limited to teaching sets of size 2, to understand an MNIST model.

One can inspect the best and worst teaching sets to understand what the model finds to be the best and worst representations for a particular number.

Hence, Bayesian teaching is also useful in telling us which examples are most valuable: those best suited to induce the desired target model.
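The two sides of inference can be sketched in a toy setting. The hypotheses, likelihoods, and candidate pool below are illustrative assumptions, not the MNIST model from the example:

```python
# Toy Bayesian teaching: the teacher scores every size-2 teaching set
# by the posterior mass the learner places on the target hypothesis.
from itertools import combinations
from math import prod

pool = [0.1, 0.3, 0.5, 0.7, 0.9]  # candidate teaching items
# assumed likelihoods p(x | h) for two simple hypotheses
likelihood = {
    "low":  lambda x: 0.8 if x < 0.5 else 0.2,
    "high": lambda x: 0.8 if x >= 0.5 else 0.2,
}
prior = {"low": 0.5, "high": 0.5}

def learner_posterior(D):
    """Learner's inference: posterior over hypotheses given set D."""
    joint = {h: prior[h] * prod(likelihood[h](x) for x in D)
             for h in likelihood}
    norm = sum(joint.values())
    return {h: v / norm for h, v in joint.items()}

target = "high"
# Teacher's inference: rank teaching sets by the learner's posterior
# on the target hypothesis after seeing each set.
ranked = sorted(combinations(pool, 2),
                key=lambda D: learner_posterior(D)[target],
                reverse=True)
best, worst = ranked[0], ranked[-1]
print(best, worst)
```

Inspecting `best` and `worst` mirrors inspecting the best and worst teaching sets for a digit: the best sets concentrate where the target hypothesis places its mass, the worst sets actively argue for the rival hypothesis.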

# Bibliography

Simard, P. Y., Amershi, S., Chickering, D. M., Pelton, A. E., Ghorashi, S., Meek, C., Ramos, G., …, *Machine teaching: a new paradigm for building machine learning systems*, CoRR, (2017). ↩

Zhu, X., Singla, A., Zilles, S., & Rafferty, A. N., *An overview of machine teaching*, CoRR, (2018). ↩

Rafferty, A. N., Brunskill, E., Griffiths, T. L., & Shafto, P., *Faster teaching via pomdp planning*, Cognitive Science, *40(6)*, 1290–1332 (2015). http://dx.doi.org/10.1111/cogs.12290 ↩

Du, Y., Hsu, D., Kurniawati, H., Lee, W. S., Ong, S. C. W., & Png, S. W., *A pomdp approach to robot motion planning under uncertainty* (2010). ↩

Sojitra, R. (2018). *Bayesian teaching as model explanation: an mnist example*. Retrieved from https://ravisoji.com/2018/03/04/bayesian-teaching-as-explanation.html. Online; accessed 19 May 2019. ↩