Machine Learning
When do we need machine learning?
Two aspects of a given problem may call for the use of programs that learn and improve on the basis of their “experience”:
 The problem’s complexity: some tasks that require elaborate introspection that cannot be welldefined in programs, such as driving, are illsuited for coding by hand. Tasks that are beyond human capabilities such as the analysis of large datasets also fall in this category.
 Adaptivity: Programmed tools are rigid, while machine learning tools allows for adaptation to the environment they interact with.
The Learning Problem
What is Learning?
An agent is said to be learning if it improves its performance P on task T based on experience/observations/data E. T must be fixed, P must be measurable, E must exist. See Learning agents.
Inductive Bias
The incorporation of prior knowledge biases the learning mechanism. This is also called inductive bias. The incorporation of prior knowledge is inevitable for the success of learning algorithms (the nofreelunch theorem). The stronger the prior knowledge that one starts the learning process with, the easier it is to learn from further examples.
Types of Learning
Active vs Passive Learning
An active learner interacts with the environment at training time, (e.g. by posing queries or performing experiments), while a passive learner only observes the information provided by the environment.
Online vs Batch Learning
In online learning, the hypothesis has to be updated each time a new label is received.
Reinforcement Learning
Each action in a state has an associated cost and a probability distribution of the next state.
Goal is to learn a policy (mapping from state to action) that minimizes the sum of expected current and future costs.
Supervised Learning
Since learning involves an interaction between the learner and the environment, one can divide tasks according to the nature of that interaction.
Supervised learning describes a scenario in which the training examples contain significant information that is missing in the unseen “test examples”. In unsupervised learning, there is no distinction between the training and test data. The goal is to come up with some summary, or compressed version of that data.

Measure of success
A loss function helps measure our success. Given a set \(H\) of hypothesis of models, and a domain \(Z\), let \(l\) be a function from \(H \times Z\) to nonnegative real numbers $l: \(H \times Z \rightarrow \mathbb{R}_{+}\).
The risk function is the expected loss of the hypothesis,
\begin{equation*} L_D(h) = E_{z \sim D}[l(h,z)] \end{equation*}
We are interested in finding a hypothesis \(h\) that has a small risk, or expected loss, typically using Empirical Risk Minimization.

Assumptions Made ⚠
 One common assumption is that the data in the data generation process is independently and identically distributed (IID), according to the distribution \(D\).
Q: Given a large enough training set, do you expect the long term test error to be similar to the training error?
 If IID, then yes
 If not, there is likely dependencies, but under certain conditions,
yes.
 If sampling mixes well, it will not take long for D’ to look like a steady set distribution.
 If dependencies are exploited, there is a possibility of attaining lower training and test error.
Is Learning Feasible?
The target function \(f\) that we want to learn is unknown. The performance of a hypothesis on the training set \(D\) tells us nothing about the performance on the data outside of \(D\).
As long as \(f\) is unknown, knowing \(D\) cannot exclude any patterns of \(f\) outside of \(D\), and the predictions of \(g\) would be meaningless.
Probabilistic View
If we accept a probabilistic answer, that is \(D\) tells us something likely about \(f\) outside of \(D\), then learning is feasible, only with a small price.
Learning a hypothesis \(g\) approximates the target function \(f\) well, i.e. \(E_{out}(g) \approx 0\). However, probabilistic analysis via Hoeffding’s Inequality gives \(E_{out}(g) \approx E_{in}(g)\). Therefore, we still need to ensure \(E_{in}(g) \approx 0\).
Training vs Testing
Generalisation Error
We can define generalisation error as the discrepancy between \(E_in\) and \(E_out\). The Hoeffding Inequality characterises the generalization error with a probabilistic bound:
\begin{align} P[E_{in}(g)  E_{out}(g) > \epsilon] \le 2Me^{2\epsilon^2N} \end{align}
Pick a tolerance level \(\delta\), and assert with probability \(1\delta\) that
\begin{align} E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\ln \frac{2M}{\delta}} \end{align}
Notice the error bound depends on \(M\), the size of the hypothesis set \(H\). Most learning models have infinite \(H\), including the simple perceptron. Hence, to study generalisation in such models, we need to derive a counterpart that deals with infinite \(H\).
Notice that the \(M\) factor was obtained by taking the disjunction of events. Let \(B_m\) be the bad event that \(E_{in}(h_m)  E_{out}(h_m)
\epsilon\). Notice that these bad events are often strongly overlapping, and the disjunction of these events form a much smaller area.
The mathematical theory of generalisation hinges on this observation. Upon accounting for the overlaps of different hypotheses, we will be able to replace the number of hypotheses \(M\) with an effective finite number, even while \(M\) is infinite.
Growth Function
The growth function is the quantity that will formalize the effective number of hypotheses.
Each \(h \in H\) generates a dichotomy which is \(h\) is \(1\) or \(h\) i \(+1\). We then formally define dichotomies as follows:
\begin{align} H(x_1, \dots, x_n) = \left\{ h(x_1), h(x_2), \dots, h(x_n)  h \in H \right\} \end{align}
Concept Learning
A concept is a booleanvalued function over a set of input instances (each comprising input attributes). Concept learning is a form of supervised learning. Infer an unknown booleanvalued function from trainingexamples.
Hypothesis
There is a tradeoff between expressive power and smaller hypothesis space. Large hypothesis spaces are bad, because search is going to take a long time, and also requires more data. Humans exploit structure in the hypothesis space to guide search and learn faster.
A hypothesis \(h\) is consistent with a set of training examples \(D\) iff \(h(x) = c(x)\) for all \(<x,c(x)> \in D\).
Inductive Learning
Any hypothesis found to approximate the target function well over a sufficient large set of training examples will also approximate the target function well over other unobserved examples.
Concept Learning is Search
The goal is to search for a hypothesis \(h \in H\) that is consistent with \(D\).
Exploit Structure in Concept Learning
\(h_j\) is more general than or equal to \(h_k\) (denoted \(h_j \ge_{g} h_k\)) iff any input instance \(x\) that satisfies \(h_j\) also satisfies \(h_k\).
This is relation is a partial order.
FindS Algorithm
Intuition: Start with the most specific hypothesis \(h\). Whenever it wrongly classifies a positive training example, we “minimally” generalize it to satisfy its input instance.
Limitations
 Can’t tell whether FindS has learnt the target concept
 Can’t tell when training examples are inconsistent
 Picks a maximally specific \(h\)
 Depending on \(H\), there may be several solutions
Version Space
\begin{equation*} VS_{H,D} = {h \in H  h \text{ is consistent with }D} \end{equation*}
 If \(c \in H\), then D can reduce \(VS_{H,D}\) to \({c}\).
 If D is insufficient, then \(VS_{H,D}\) represents the uncertainty of what the target concept is
 \(VS_{H,D}\) contains all consistent hypotheses, including maximally specific hypotheses
The general boundary G of \(VS_{H,D}\) is the set of maximally general members of \(H\) consistent with \(D\).
The specific boundary S of \(VS_{H,D}\) is the set of maximally general members of \(H\) consistent with \(D\).
\begin{equation*} VS_{H,D} = {h \in H  \exists s \in S \exists g \in G g \ge_g h \ge_g s } \end{equation*}
ListThenEliminate Algorithm
Iterate through all hypotheses in \(H\), and eliminate any hypothesis found inconsistent with any training example. This algorithm is often prohibitively expensive.
CandidateElimination Algorithm
Start with most general and specific hypotheses. Each training example “minimally” generalizes S and specializes G to remove inconsistent hypotheses from version space.
Decision Tree Learning
Decision Tree Learning is a method of learning which approximates discretevalued functions that is robust to noisy data, and is capable of learning disjunctive expressions
It is most appropriate when:
 instances are represented as attribute pairs
 the target function has discrete output values
 Disjunctive descriptions may be required
 The training data may contain errors
 The training data may contain missing attribute values
ID3 algorithm
ID3 learns decision trees by constructing them top down. Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the examples. The best attribute is selected and used as the test at the root node of the tree.
Which is the best attribute?
A statistical property called information gain measures how well a given attribute separates the training examples according to their target classification.
Information gain is the expected reduction in entropy caused by partitioning the examples according to this attribute:
\begin{align} Gain(S,A) = Entropy(S)  \sum_{v\in Values(A)}\frac{S_v}{S}Entropy(S_v) \end{align}
For example:
\begin{align}
Values(Wind) &= Weak, Strong \\\
S &= [9+, 5] \\\
S_{Weak} &\leftarrow [6+, 2] \\\
S_{Strong} &\leftarrow [3+, 3] \\\
Gain(S, Wind) &= Entropy(S)  \frac{8}{14}Entropy(S_{Weak}) 
\frac{6}{14}Entropy(S_{Strong}) \\\
&=0.048
\end{align}
Hypothesis Space Search
ID3 can be characterised as searching a space of hypotheses for one that fits the training examples. The hypothesis space searched is the set of possible decision trees. ID3 performs a simpletocomplex, hillclimbing search. The evaluation measure that guides the search is the information gain measure.
Because ID3’s hypothesis space of all decision trees is a complete space of finite discretevalued functions, it avoids the risk that the hypothesis space might not contain the target function.
ID3 maintains only a single hypothesis as it searches through the space of decision trees. ID3 loses the capabilities that follow from explicitly representing all consistent hypothesis.
ID3 in its pure form performs no backtracking in its search, and can result in locally but not globally optimal target functions.
ID3 uses all training examples at each step to make statistically based decisions, unlike other algorithms that make decisions incrementally.
Inductive bias
The inductive bias of decision tree learning is that shorter trees are preferred over larger trees (Occam’s razor). Trees that place high information gain attributes close to the root are preferred over those that do not. ID3 can be viewed as a greedy heuristic search for the shortest tree without conducting the entire breadthfirst search through the hypothesis space.
Notice that ID3 searches a complete hypothesis space incompletely, and candidateelimination searches an incomplete hypothesis space completely. The inductive bias of ID3 follows from its search strategy (preference bias), while that of candidate elimination follows from the definition of its search space. (restriction bias).
Why Prefer Shorter Hypotheses?
 fewer shorter hypothesis than larger ones, means it’s less likely to overgeneralise
Density Estimation
Density Estimation refers to the problem of modeling the probability distribution \(p(x)\) of a random variable \(x\), given a finite set \(x_1, x_2, \dots, x_n\) of observations.
We first look at parametric distributions, which are governed by a small number of adaptive parameters. In a frequentist treatment, we choose specific values for the parameters optimizing some criterion, such as the likelihood function. In a Bayesian treatment, we introduce prior distributions and then use Bayes’ theorem to compute the corresponding posterior distribution given the observed data.
An important role is played by conjugate priors, which yield posterior distributions of the same functional form.
The maximum likelihood setting for parameters can give severely overfitted results for small data sets. To develop a Bayesian treatment to this problem, we consider a form of prior distribution with similar form as the maximum likelihood function. this property is called conjugacy. For a binomial distribution, we can choose the beta distribution as the prior.
Unsupervised Learning
In unsupervised learning, given a training set \(S = \left(x_1, \dots, x_m\right)\), without a labeled output, one must construct a “good” model/description of the data.
Example use cases include:
 clustering
 dimension reduction to ind essential parts of the data and reduce noise (e.g. PCA)
 minimises description length of data
Kmeans Clustering
Input: \(\{x^{(1), x^{(2)}, x^{(3)}, \dots, x^{(m)}}\}\).
 Randomly initialize cluster centroids.
 For all points, compute which cluster centroid is the closest.
 For each cluster centroid, move centroids to the average points belonging to the cluster.
 Repeat until convergence.
Kmeans is guaranteed to converge. To show this, we define a distortion function:
\begin{equation} J(c, \mu) = \sum_{i=1}^m  x^{(i)}  \mu_{c^{(i)}}^2 \end{equation}
K means is coordinate ascent on J. Since \(J\) always decreases, the algorithm converges.
Gaussian Mixture Model
By Bayes’ Theorem:
\begin{equation} P(X^{(i)}, Z^{(i)}) = P(X^{(i)}  Z^{(i)})P(Z^{(i)}) \end{equation}
\begin{equation} Z^{(i)} \sim \text{multinomial}(\phi) \end{equation}
\begin{equation} X^{(i)}  Z^{(j)} \sim \mathcal{N}(\mu_j, \Sigma_j) \end{equation}
Refile
Data Compression
In lossy compression, we seek to trade off code length with reconstruction error.
In vector quantization, we seek a small set of vectors \({z_i}\) to describe a large dataset of vectors \({x_i}\), such that we can represent each \(xi\) with its closest approximation in \({z_i}\) with small error. (Clustering problem)
In transform coding, we transform the data, usually using a linear tranformation. The data in the transformed domain is quantized, usually discarding the small coefficients, corresponding to removing some of the dimensions.
Generative Learning Algorithms
Discriminative algorithms model \(p(y  x)\) directly from the training set.
Generative algorithms model \(p(y  x)\) and \(p(y)\). Then \(argmax_y p(yx) = argmax_y \frac{p(xy)p(y)}{p(x)} = argmax_y p(xy)p(y)\).
Multivariate Normal Distribution
A multivariate normal distribution is parameterized by a mean vector \(\mu \in R^n\) and a covariance matrix \(\Sigma \in R^{n \times n}\), where \(\Sigma \ge 0\) is symmetric and positive semidefinite.
TODO Gaussian Discriminant Analysis
In Gaussian Discriminant Analysis, p(x  y) is distributed to a Multivariate Normal Distribution.
\begin{align}
y &\sim Bernoulli(\phi) \\\
xy = 0 &\sim N(\mu_0, \Sigma) \\\
xy = 1 &\sim N(\mu_1, \Sigma)
\end{align}
We can write out the distributions:
\begin{align}
p(y) &= \phi^y (1  \phi)^{1y} \\\
p(x  y = 0) &= \frac{1}{(2\pi)^{n/2}\Sigma^{n/2}} exp \left(  \frac{1}{2} (x  \mu_0)^T \Sigma^{1}(x  \mu_0) \right) \\\
p(x  y = 1) &= \frac{1}{(2\pi)^{n/2}\Sigma^{n/2}} exp \left(  \frac{1}{2} (x  \mu_1)^T \Sigma^{1}(x  \mu_1) \right)
\end{align}
Then, the loglikelihood of the data is:
\begin{align}
l(\phi, \mu_0, \mu_1, \Sigma) &= \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \mu_0, \mu_1, \Sigma) \\\
&= \log \prod_{i=1}^m p(x^{(i) } y^{(i)}; \mu_0, \mu_1, \Sigma)p(y^{(i)}; \phi)
\end{align}
We maximize \(l\) with respect to the parameters.
The Natural Language Decathlon: Multitask Learning as Question Answering: Richard Socher
 Joint work with Bryan McCann, Nitish Keskar and Caiming Xiong
Limits of Singletask Learning
 We can hill climb to local optima if \(dataset > 100 \times C\)
 For more general model, we need continuous learning in a single model
For pretraining in NLP, we’re still stuck at the word vector level. This compared to vision, where most of the model can be pretrained, only retraining the final few layers.
Why has weight & model sharing not happened so much in NLP?
 NLP requires many types of reasoning: logical, linguistic etc.
 Requires short and longterm memory
 NLP has been divided into intermediate and separate tasks to make progress (Benchmark chasing in each community)
 Can a single unsupervised task solve it all? No, language clearly requires supervision in nature.
Motivation for Single Multitask model
 Step towards AGI
 Important building block for:
 Sharing weights
 Transfer learning
 Zeroshot learning
 Domain adaptation
 Easier deployment in production
 Lowering the bar for anybody to solve their NLP task
End2end model vs parsing as intermediate step (e.g. running POS tagger first).
The 3 equivalent supertasks of NLP
Any NLP task can be mapped to these 3 super tasks:
 Language Modeling
 Question Answering
 Dialogue
Multitask learning as QA
 Question Answering
 Machine Translation
 Summarization
 NLI
 Sentiment Classification
 Semantic Role Labeling
 Relation Extraction
Meta supervised learning: {x, y} to {x, t, y}
Designing a model for decaNLP
 No taskspecific modules or parameters because task ID assumed to be unavailable
 Start with a context
 Ask a question
 Generate answer one at a time by
 Pointing to context
 Pointing to question
 Choosing a word
Learnings
 Transformer Layers yield benefits in singletask and multitask setting
 QA and SRL have strong connections
 Pointing to the question is essential, despite the task being just classification for some subtasks
 Mulitasking helps a lot with zeroshot tasks
(Latest version of the paper coming out soon – ICLR 2018)
Training Strategies
 Fully Joint
 Curriculum learning doesn’t work
 Anticurriculum training works instead
 Start with a really hard task
Structuring Data Science Projects
Cookiecutter Data Science provides a decent project structure, and
uses the ubiquitous build tool Make
to build data projects. (DrivenData, n.d.)
├── LICENSE
├── Makefile < Makefile with commands like `make data` or `make train`
├── README.md < The toplevel README for developers using this project.
├── data
│ ├── external < Data from third party sources.
│ ├── interim < Intermediate data that has been transformed.
│ ├── processed < The final, canonical data sets for modeling.
│ └── raw < The original, immutable data dump.
│
├── docs < A default Sphinx project; see sphinxdoc.org for details
│
├── models < Trained and serialized models, model predictions, or model summaries
│
├── notebooks < Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `` delimited description, e.g.
│ `1.0jqpinitialdataexploration`.
│
├── references < Data dictionaries, manuals, and all other explanatory materials.
│
├── reports < Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures < Generated graphics and figures to be used in reporting
│
├── requirements.txt < The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py < Make this project pip installable with `pip install e`
├── src < Source code for use in this project.
│ ├── __init__.py < Makes src a Python module
│ │
│ ├── data < Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features < Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models < Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization < Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini < tox file with settings for running tox; see tox.testrun.org
Stripe’s approach (Frank, n.d.) still primarily uses Jupyter notebooks, but has 2 main points. First, they strip the results from the Jupyter notebooks before committing. Second, they ensure that the notebooks can be reproduced on the work laptops and on their cloud infrastructure.
Bibliography
DrivenData. n.d. “Home  Cookiecutter Data Science.” https://drivendata.github.io/cookiecutterdatascience/.
Frank, Dan. n.d. “Reproducible Research: Stripe’s Approach to Data Science.” https://stripe.com/blog/reproducibleresearch.