Statistical Learning
Introduction
Statistical learning refers to a vast set of tools for understanding data. It involves building a statistical model for predicting, or estimating, an output based on one or more inputs.
Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function $f$.
Parametric Methods
Parametric methods involve a two-step model-based approach:
- First, we make an assumption about the functional form, or shape, of $f$. For example, one simple assumption is that $f$ is linear in $X$:

  $$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

  This is a linear model, and with this assumption the problem of estimating $f$ reduces to estimating the $p + 1$ coefficients $\beta_0, \beta_1, \ldots, \beta_p$.
- Second, after a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model, we need to estimate the parameters $\beta_0, \beta_1, \ldots, \beta_p$; the most common approach is (ordinary) least squares.
The model-based approach just described is referred to as parametric; it reduces the problem of estimating $f$ down to the problem of estimating a fixed set of parameters.
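To make the two steps concrete, here is a minimal sketch (not from the text) of fitting the linear model above by least squares on synthetic data; the data-generating coefficients are assumed purely for illustration.

```python
# A sketch of the parametric approach: assume f is linear, then estimate
# the coefficients beta_0, ..., beta_p from training data by least squares.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])                 # assumed "true" coefficients
y = 1.0 + X @ true_beta + rng.normal(scale=0.3, size=n)

X1 = np.column_stack([np.ones(n), X])                  # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)      # least-squares estimates
print(beta_hat)                                        # approx. (1.0, 2.0, -1.0, 0.5)
```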
Non-parametric Methods
Non-parametric methods do not make explicit assumptions about the functional form of $f$. Instead, they seek an estimate of $f$ that gets as close to the data points as possible.
The Trade-Off Between Prediction Accuracy and Model Interpretability

Figure 1: A representation of the trade-off between flexibility and interpretability, using different statistical methods.
There are several reasons to choose a restrictive model over a flexible approach. First, if the interest is mainly in inference, restrictive models tend to be much more interpretable. For example, linear models make it easy to understand the associations between individual predictors and the response. Even when predictions are the only concern, highly flexible models are susceptible to overfitting, and restrictive models can often outperform them.
Measuring Quality of Fit
In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data.
In the regression setting, the most commonly used measure is the mean squared error (MSE), given by:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2$$
The MSE will be small if the predicted responses are very close to the true responses.
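As a small illustration, the MSE above is straightforward to compute; this sketch assumes `y` holds the observed responses and `y_pred` the fitted values $\hat{f}(x_i)$.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error: the average of (y_i - f_hat(x_i))^2."""
    return np.mean((y - y_pred) ** 2)

print(mse(np.array([3.0, 1.5, 2.0]), np.array([2.8, 1.7, 2.4])))  # small if predictions are close
```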
Bias-Variance Trade-Off
It can be shown that the expected test MSE, for a given value $x_0$, can be decomposed into the sum of three fundamental quantities: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\epsilon$:

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)$$

The expected test MSE refers to the average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets, and evaluated each estimate at $x_0$.
The variance of a statistical learning method refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. In general, more flexible methods have higher variance.
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. In general, more flexible approaches have lower bias.
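The decomposition can be checked by simulation. The sketch below (an illustration, not from the text) repeatedly draws training sets from an assumed true function, fits a cubic polynomial each time, and adds up variance, squared bias and noise variance at a single point $x_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                      # assumed true function
x0, sigma = 1.0, 0.3                         # test point and noise standard deviation

preds = []
for _ in range(500):                         # many independent training sets
    x = rng.uniform(0, 3, size=30)
    y = f(x) + rng.normal(scale=sigma, size=30)
    coef = np.polyfit(x, y, deg=3)           # a moderately flexible fit
    preds.append(np.polyval(coef, x0))

preds = np.array(preds)
variance = preds.var()                       # Var(f_hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2        # [E f_hat(x0) - f(x0)]^2
print(variance + bias_sq + sigma**2)         # approx. the expected test MSE at x0
```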
The Bayes Classifier
When the test error rate is minimized, the classifier assigns each observation to the most likely class, given its predictor values. This is known as the Bayes classifier. The Bayes classifier produces the lowest possible error rate, known as the Bayes error rate, given by:

$$1 - E\left[\max_j \Pr(Y = j \mid X)\right]$$
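On simulated data, where the true conditional probabilities are known by construction, the Bayes classifier and its error rate can be computed directly. This sketch assumes a one-predictor problem with $\Pr(Y = 1 \mid X = x)$ given by a logistic curve.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100_000)
p1 = 1 / (1 + np.exp(-2 * x))            # assumed true Pr(Y = 1 | X = x)
y = rng.binomial(1, p1)

bayes_pred = (p1 > 0.5).astype(int)      # assign each x to its most likely class
print(np.mean(bayes_pred != y))          # Monte-Carlo estimate of the Bayes error rate
```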
In theory, we would always like to predict qualitative responses using the Bayes classifier. However, for real data, we do not know the conditional distribution of $Y$ given $X$, so computing the Bayes classifier is impossible.
K-Nearest Neighbours
The Bayes classifier serves as the unattainable gold standard against which to compare other methods. Many approaches attempt to estimate the conditional distribution of $Y$ given $X$, and then classify a given observation to the class with the highest estimated probability. One such method is the $K$-nearest neighbours (KNN) classifier.
Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$:

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)$$

Finally, KNN applies the Bayes rule and classifies the test observation $x_0$ to the class with the largest estimated probability.
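A minimal NumPy sketch of the KNN classifier just described (the names `X_train`, `y_train`, `x0` and `K` are assumed inputs; distances are Euclidean):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K):
    """Classify x0 by a majority vote among its K nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)    # distance from x0 to every training point
    neighbours = np.argsort(dists)[:K]              # indices of the K closest points
    labels, counts = np.unique(y_train[neighbours], return_counts=True)
    return labels[np.argmax(counts)]                # class with the largest estimated probability
```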
The Statistical Learning Framework
Consider the problem of classifying a papaya into 2 bins: tasty or not tasty. We’ve chosen 2 features:
- The papaya’s colour, ranging from dark green through orange and red to dark brown
- The papaya’s softness, ranging from rock hard to mushy
The learner’s input consists of:
- Domain set: An arbitrary set, $\mathcal{X}$. This is the set of objects that we may wish to label; in our example it is the set of all papayas. Usually, a domain point is represented by a vector of features (like colour and softness). We also refer to domain points as instances and to $\mathcal{X}$ as the instance space.
- Label set: The label set $\mathcal{Y}$ is restricted in our example to a two-element set, usually $\{0, 1\}$ or $\{-1, +1\}$.
- Training data: $S = ((x_1, y_1), \ldots, (x_m, y_m))$ is a finite sequence of pairs in $\mathcal{X} \times \mathcal{Y}$. This is the input that the learner has access to. Such labelled examples are often called training examples. $S$ is also sometimes referred to as the training set.
- The learner's output: The learner is requested to output a prediction rule $h : \mathcal{X} \to \mathcal{Y}$. This function is also called a predictor, a hypothesis, or a classifier. The predictor can be used to predict the label of new domain points.
- A simple data-generation model: This explains how the training data is generated. First, we assume that the instances are generated by some probability distribution. We denote the probability distribution over $\mathcal{X}$ by $\mathcal{D}$; we do not assume that the learner knows anything about this distribution.
- Measures of success: We define the error of a classifier to be the probability that it does not predict the correct label on a random data point generated by the underlying distribution. That is, the error of $h$ is the probability of drawing a random instance $x$ from $\mathcal{D}$ such that $h(x) \neq f(x)$, where $f$ is the true labelling function:

  $$L_{\mathcal{D}, f}(h) = \Pr_{x \sim \mathcal{D}}\left[h(x) \neq f(x)\right]$$
The learner is blind to the underlying probability distribution $\mathcal{D}$ and to the true labelling function $f$.
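Although the learner cannot compute the true error $L_{\mathcal{D}, f}(h)$, it can be estimated in a simulation where $\mathcal{D}$ and $f$ are known. The sketch below uses hypothetical callables `h`, `f` and `sample_from_D`.

```python
import numpy as np

def estimate_error(h, f, sample_from_D, n=10_000):
    """Monte-Carlo estimate of Pr_{x ~ D}[h(x) != f(x)]."""
    xs = sample_from_D(n)                           # draw n instances from D
    return np.mean([h(x) != f(x) for x in xs])      # fraction of instances h mislabels
```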
Classification
The linear regression model assumes that the response variable $Y$ is quantitative. In many situations, however, the response is qualitative.
Why not Linear Regression?
Encoding a non-binary categorical variable as a dummy variable using integers can lead to an unwanted ordering and spacing relationship between the different options. With binary outcomes, linear regression does do a good job as a classifier: in fact, it is equivalent to linear discriminant analysis.
Suppose we encode the outcome $Y$ as 0 or 1. Then the population regression function is $E(Y \mid X = x) = \Pr(Y = 1 \mid X = x)$, so linear regression does estimate the probability of interest; its drawback is that the fitted values can fall outside the $[0, 1]$ interval.
Logistic Regression
Rather than modelling the response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category.
How should we model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$? A linear model of the form $p(X) = \beta_0 + \beta_1 X$ can produce probabilities below 0 or above 1. In logistic regression, we instead use the logistic function:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

This restricts values of $p(X)$ to lie between 0 and 1. Rearranging gives

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X,$$

and this monotone transformation is called the log odds or logit transformation of $p(X)$.
We use maximum likelihood to estimate the parameters:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i : y_i = 0} \left(1 - p(x_i)\right)$$

This likelihood gives the probability of the observed zeros and ones in the data. We pick $\beta_0$ and $\beta_1$ to maximize the likelihood of the observed data.
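As an illustration (not from the text), logistic regression can be fit by maximum likelihood with scikit-learn on synthetic data; the true coefficients below are assumed, and regularisation is effectively disabled so the fit approximates the plain maximum-likelihood solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x[:, 0])))   # assumed true beta_0 = -1, beta_1 = 2
y = rng.binomial(1, p)

model = LogisticRegression(C=1e10).fit(x, y)    # very large C => negligible regularisation
print(model.intercept_, model.coef_)            # estimates of (beta_0, beta_1)
```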
As with linear regression, we can compute the coefficient values, the standard error of the coefficients, the z-statistic, and the p-value. The z-statistic plays the same role as the t-statistic. A large absolute value of the z-statistic indicates evidence against the null hypothesis.
Multiple Logistic Regression
It is easy to generalize the formula to multiple logistic regression:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
Similarly, we use the maximum likelihood method to estimate the coefficients.
TODO Case Control Sampling
Case control sampling is most effective when the prior probabilities of the classes are very unequal.
Linear Discriminant Analysis
Logistic regression involves directly modelling $\Pr(Y = k \mid X = x)$ using the logistic function. Discriminant analysis takes a less direct approach: it models the distribution of the predictors $X$ separately in each of the response classes, and then uses Bayes' theorem to flip these around into estimates of $\Pr(Y = k \mid X = x)$.
When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.
Why do we need another method?
- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. LDA does not suffer from this issue.
- If $n$ is small, and the distribution of the predictors $X$ is approximately normal in each of the classes, the LDA model is more stable than the logistic regression model.
- LDA is more popular when we have more than two response classes.
We first state Bayes' theorem, and write it differently for discriminant analysis:

$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

where $f_k(x)$ is the density of $X$ in class $k$, and $\pi_k$ is the prior probability of class $k$.
We first discuss LDA when $p = 1$, i.e. when there is only one predictor. We assume that $f_k(x)$ is Gaussian, with a class-specific mean $\mu_k$ and a variance $\sigma^2$ shared by all classes. We can plug this into Bayes' formula and get a complicated expression for $p_k(x) = \Pr(Y = k \mid X = x)$; taking logs and discarding terms that do not depend on $k$, classifying to the largest $p_k(x)$ is equivalent to classifying to the largest discriminant score:

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

Note that $\delta_k(x)$ is a linear function of $x$, which is where the name linear discriminant analysis comes from.
We can estimate the parameters:

$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} \left(x_i - \hat{\mu}_k\right)^2$$

where $n_k$ is the number of training observations in class $k$.
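The estimates and discriminant scores above translate directly into code. This sketch (one predictor, classes labelled $0, \ldots, K-1$; `x` and `y` are assumed NumPy arrays) is an illustration rather than a full implementation.

```python
import numpy as np

def lda_fit_1d(x, y):
    """Per-class priors and means, plus the pooled variance estimate."""
    classes, n = np.unique(y), len(y)
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([x[y == k].mean() for k in classes])
    sigma2 = sum(((x[y == k] - mu[i]) ** 2).sum()
                 for i, k in enumerate(classes)) / (n - len(classes))
    return classes, pi, mu, sigma2

def lda_predict_1d(x0, classes, pi, mu, sigma2):
    """Assign x0 to the class with the largest discriminant score delta_k(x0)."""
    delta = x0 * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)
    return classes[np.argmax(delta)]
```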
We can extend Linear Discriminant Analysis to the case of multiple predictors. To do that, we will assume that $X = (X_1, X_2, \ldots, X_p)$ is drawn from a multivariate Gaussian distribution.
The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors. Formally, the multivariate Gaussian density is defined as:

$$f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)$$

In the case of $p > 1$ predictors, the LDA classifier assumes that the observations in the $k$th class are drawn from $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector and $\Sigma$ is a covariance matrix common to all classes. The discriminant functions

$$\delta_k(x) = x^{T} \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^{T} \Sigma^{-1} \mu_k + \log \pi_k$$

are again linear in $x$.
The LDA model has the lowest error rate when the Gaussian model is correct, since it approximates the Bayes classifier. However, misclassifications can still happen, and a good way to visualize them is through a confusion matrix. The probability threshold can also be adjusted to reduce the error rate for a particular class, at the cost of more errors of the other kind.
The ROC (Receiver Operating Characteristic) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. An ideal ROC curve hugs the top-left corner, so the larger the AUC (Area Under the Curve), the better the classifier; the AUC summarizes the overall performance of a classifier over all possible thresholds.
Varying the classifier threshold also changes its true positive and false positive rates. These are also called the sensitivity and $1 -$ specificity of the classifier.
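A short sketch of computing an ROC curve and its AUC with scikit-learn; the labels and scores below are made-up placeholders.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])                 # assumed observed labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])    # assumed predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)      # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))                  # the larger the AUC, the better
```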
Quadratic Discriminant Analysis
In LDA with multiple predictors, we assumed that observations are
drawn from a multivariate Gaussian distribution with a class-specific
mean vector and a common covariance matrix. Quadratic Discriminant
Analysis (QDA) assumes that each class has its own covariance matrix.
Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which the discriminant score $\delta_k(x)$ is largest; with class-specific covariance matrices, $\delta_k(x)$ is a quadratic function of $x$, which gives QDA its name.
When would one prefer LDA to QDA, or vice-versa? The answer lies in the bias-variance trade-off. With $p$ predictors, estimating a separate covariance matrix for each of the $K$ classes requires estimating $K p (p + 1) / 2$ parameters, so QDA is a far more flexible classifier with higher variance. LDA tends to be preferable when there are relatively few training observations, so that reducing variance is crucial; QDA is recommended when the training set is very large, or when the assumption of a common covariance matrix is clearly untenable.
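To see the trade-off in practice, this sketch (assumed synthetic data) fits both classifiers with scikit-learn on two classes generated with different covariance structures, a setting in which QDA's class-specific covariances should help.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=500)
X1 = rng.multivariate_normal([1, 1], [[2.0, 0.8], [0.8, 0.5]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    print(type(clf).__name__, clf.fit(X, y).score(X, y))   # training accuracy of each model
```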
Comparison of Classification Methods
Logistic regression and LDA both produce linear decision boundaries. The only difference between the two approaches is that in logistic regression the coefficients are estimated using maximum likelihood, while in LDA the coefficients are computed from the estimated means and variances of a normal distribution.
Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. Logistic regression can outperform LDA if the Gaussian assumptions are not met. On the other hand, LDA can show improvements over logistic regression if they are.
KNN takes a completely different approach from the other classifiers seen in this chapter. In order to make a prediction for an observation $X = x$, the $K$ training observations closest to $x$ are identified, and $x$ is assigned to the class to which the plurality of these observations belong. KNN is therefore a completely non-parametric approach: no assumptions are made about the shape of the decision boundary.
Though not as flexible as KNN, QDA can perform better when the number of training observations is limited, because it makes some assumptions about the form of the decision boundary.
Reference Textbooks
- An Introduction to Statistical Learning (James et al. 2013)
- Understanding Machine Learning (Shalev-Shwartz and Ben-David 2014)
Bibliography
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.