PAC Learning
- tags
- Machine Learning
ERM for finite hypothesis classes
We note that Empirical Risk Minimization can easily overfit to the training data. To correct for this, we introduce an inductive bias: we restrict the learner to a hypothesis class $\mathcal{H}$ chosen before seeing the data.
The simplest type of restriction is to impose an upper bound on the size of a class. Here, we show that if a hypothesis class is finite, then ERM will not overfit given a sufficiently large sample size.
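To illustrate the overfitting point, here is a minimal sketch (not from the text; the data and the memorizing predictor are illustrative) of an unrestricted ERM learner that achieves zero empirical risk while performing at chance level on fresh examples:

```python
import random

# Illustrative sketch: an unrestricted ERM learner can simply memorize the
# training set, reaching zero empirical risk but chance-level true risk.
random.seed(0)
train = {random.random(): random.randint(0, 1) for _ in range(100)}   # made-up data

def memorizer(x):
    """Predict the stored label for seen points, an arbitrary label otherwise."""
    return train.get(x, 0)

train_error = sum(memorizer(x) != y for x, y in train.items()) / len(train)
test_error = sum(memorizer(random.random()) != random.randint(0, 1)
                 for _ in range(1000)) / 1000
print(f"empirical risk: {train_error}, risk on fresh samples: ~{test_error}")
```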
Let $\mathcal{H}$ be a finite hypothesis class, and let $h_S$ denote a hypothesis returned by applying $\mathrm{ERM}_{\mathcal{H}}$ to a training sample $S$ of size $m$, namely $h_S \in \arg\min_{h \in \mathcal{H}} L_S(h)$.
We make two assumptions. The first is the realizability assumption, which implies that for every ERM hypothesis we have $L_S(h_S) = 0$ with probability 1 over the choice of the sample.
The Realizability Assumption: There exists $h^{\star} \in \mathcal{H}$ such that $L_{(\mathcal{D},f)}(h^{\star}) = 0$.
The second is the i.i.d. assumption: the examples in the training set are sampled independently according to the distribution $\mathcal{D}$ and labeled by $f$; we denote this by $S \sim \mathcal{D}^m$.
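To make the setup concrete, here is a minimal sketch of the ERM rule over a finite hypothesis class for which realizability holds; the threshold class, the grid of 21 thresholds, and the target function are illustrative choices, not taken from the text:

```python
import random

# Illustrative finite class: threshold classifiers h_t(x) = 1[x >= t] with t on a grid.
# Realizability holds because the labeling function f_star is itself in the class.
THRESHOLDS = [i / 20 for i in range(21)]      # finite H, |H| = 21
f_star = lambda x: int(x >= 0.35)             # target labeling function (0.35 = 7/20 is on the grid)

def sample_training_set(m):
    """Draw m i.i.d. examples x ~ Uniform[0, 1], labeled by f_star."""
    xs = [random.random() for _ in range(m)]
    return [(x, f_star(x)) for x in xs]

def empirical_risk(t, S):
    """L_S(h_t): fraction of training examples misclassified by threshold t."""
    return sum(int(x >= t) != y for x, y in S) / len(S)

def erm(S):
    """Return a threshold minimizing the empirical risk (an ERM hypothesis h_S)."""
    return min(THRESHOLDS, key=lambda t: empirical_risk(t, S))

S = sample_training_set(m=50)
t_hat = erm(S)
print("ERM threshold:", t_hat, "empirical risk:", empirical_risk(t_hat, S))
```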
Any guarantee on the error with respect to the underlying distribution $\mathcal{D}$ must therefore be probabilistic.
Since the training set $S$ is picked by a random process, the ERM output $h_S$ and its true error $L_{(\mathcal{D},f)}(h_S)$ are random variables, and there is always some chance that the sampled training set is non-representative of $\mathcal{D}$.
We introduce two parameters:
- The probability of getting a non-representative sample, denoted by $\delta$. We call $1 - \delta$ the confidence parameter of our prediction.
- The accuracy parameter of the prediction, denoted by $\varepsilon$. The event $L_{(\mathcal{D},f)}(h_S) > \varepsilon$ is a failure of the learner, while $L_{(\mathcal{D},f)}(h_S) \le \varepsilon$ is the event where the predictor is approximately correct.
We are interested in upper bounding the probability of sampling an $m$-tuple of instances that will lead to failure of the learner. Formally, let $S|_x = (x_1, \ldots, x_m)$ be the instances of the training set. We would like to upper bound $\mathcal{D}^m\!\left(\{ S|_x : L_{(\mathcal{D},f)}(h_S) > \varepsilon \}\right)$.
Let $\mathcal{H}_B$ be the set of "bad" hypotheses, $\mathcal{H}_B = \{ h \in \mathcal{H} : L_{(\mathcal{D},f)}(h) > \varepsilon \}$.
In addition, let $M = \{ S|_x : \exists h \in \mathcal{H}_B,\; L_S(h) = 0 \}$ be the set of misleading samples. For every $S|_x \in M$, there is some "bad" hypothesis $h \in \mathcal{H}_B$ that looks like a "good" hypothesis on $S|_x$.
Since the realizability assumption implies $L_S(h_S) = 0$, the event $L_{(\mathcal{D},f)}(h_S) > \varepsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$, i.e. $\{ S|_x : L_{(\mathcal{D},f)}(h_S) > \varepsilon \} \subseteq M$. Note that we can rewrite $M$ as $M = \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \}$.
Then:
$$\mathcal{D}^m\!\left(\{ S|_x : L_{(\mathcal{D},f)}(h_S) > \varepsilon \}\right) \le \mathcal{D}^m(M) = \mathcal{D}^m\!\left(\bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \}\right)$$
Because the training samples are i.i.d., for any fixed $h \in \mathcal{H}_B$:
$$\mathcal{D}^m\!\left(\{ S|_x : L_S(h) = 0 \}\right) = \prod_{i=1}^{m} \mathcal{D}\!\left(\{ x_i : h(x_i) = f(x_i) \}\right)$$
For each individual sampling of an element of the training set, we have:
$$\mathcal{D}\!\left(\{ x_i : h(x_i) = f(x_i) \}\right) = 1 - L_{(\mathcal{D},f)}(h) \le 1 - \varepsilon$$
Using the inequality $1 - \varepsilon \le e^{-\varepsilon}$, we obtain for every $h \in \mathcal{H}_B$:
$$\mathcal{D}^m\!\left(\{ S|_x : L_S(h) = 0 \}\right) \le (1 - \varepsilon)^m \le e^{-\varepsilon m}$$
Applying the union bound over the hypotheses in $\mathcal{H}_B$, we get:
$$\mathcal{D}^m\!\left(\{ S|_x : L_{(\mathcal{D},f)}(h_S) > \varepsilon \}\right) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\!\left(\{ S|_x : L_S(h) = 0 \}\right) \le |\mathcal{H}_B|\, e^{-\varepsilon m} \le |\mathcal{H}|\, e^{-\varepsilon m}$$
With this result, we can show that if $m \ge \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon}$, then $|\mathcal{H}|\, e^{-\varepsilon m} \le \delta$. In other words, with probability at least $1 - \delta$ over the choice of an i.i.d. training sample of size $m$, every ERM hypothesis $h_S$ satisfies $L_{(\mathcal{D},f)}(h_S) \le \varepsilon$.
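As a sanity check, here is a small simulation sketch (same hypothetical threshold class as above, with instances drawn uniformly from $[0,1]$) that estimates the failure probability of ERM at the sample size suggested by this bound:

```python
import math, random

# Simulation sketch (illustrative): estimate how often ERM fails, i.e. returns
# h_S with true error > eps, when m >= log(|H|/delta)/eps under realizability.
THRESHOLDS = [i / 20 for i in range(21)]                 # finite class, |H| = 21
f_star = lambda x: int(x >= 0.35)                        # target is in H (0.35 = 7/20)
eps, delta = 0.1, 0.05
m = math.ceil(math.log(len(THRESHOLDS) / delta) / eps)   # sample size from the bound

def erm(S):
    """Return a threshold minimizing the empirical risk on S."""
    return min(THRESHOLDS, key=lambda t: sum(int(x >= t) != y for x, y in S))

def true_error(t):
    """For x ~ Uniform[0,1], h_t and f_star disagree exactly on [min(t, 0.35), max(t, 0.35))."""
    return abs(t - 0.35)

trials, failures = 2000, 0
for _ in range(trials):
    S = [(x, f_star(x)) for x in (random.random() for _ in range(m))]
    if true_error(erm(S)) > eps:
        failures += 1

print(f"m = {m}, empirical failure rate = {failures / trials:.4f}, bound delta = {delta}")
```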
Formulation
A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for every $\varepsilon, \delta \in (0,1)$, for every distribution $\mathcal{D}$ over $\mathcal{X}$, and for every labeling function $f : \mathcal{X} \to \{0,1\}$, if the realizability assumption holds with respect to $\mathcal{H}, \mathcal{D}, f$, then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\varepsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that, with probability at least $1 - \delta$ over the choice of the examples, $L_{(\mathcal{D},f)}(h) \le \varepsilon$.
The accuracy parameter $\varepsilon$ determines how far the output classifier can be from the optimal one, while the confidence parameter $\delta$ indicates how likely the classifier is to meet that accuracy requirement.
Under the data access model, these approximations are inevitable: for example, the training set is randomly generated, and there is a chance the training samples will be non-informative.
Sample Complexity
The function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ determines the sample complexity of learning $\mathcal{H}$: how many examples are required to guarantee a probably approximately correct solution, as a function of the accuracy parameter $\varepsilon$ and the confidence parameter $\delta$.
We have shown previously that every finite hypothesis class is PAC learnable with sample complexity:
$$m_{\mathcal{H}}(\varepsilon, \delta) \le \left\lceil \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon} \right\rceil$$
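For concreteness, a tiny sketch evaluating this bound (the numbers are arbitrary examples):

```python
import math

def sample_complexity(h_size, eps, delta):
    """Upper bound ceil(log(|H| / delta) / eps) on m_H(eps, delta) for a finite class."""
    return math.ceil(math.log(h_size / delta) / eps)

# Arbitrary example: |H| = 1000, eps = 0.05, delta = 0.01
print(sample_complexity(1000, 0.05, 0.01))   # -> 231
```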
There are also infinite hypothesis classes that are PAC learnable; a combinatorial measure called the VC dimension is what determines whether a class is PAC learnable.
Generalizing the Learning Model
Removing the Realizability Assumption
We initially required that the learning algorithm succeed for every pair of data distribution $\mathcal{D}$ over $\mathcal{X}$ and labeling function $f$ for which the realizability assumption holds.
The realizability assumption requires that there exists $h^{\star} \in \mathcal{H}$ with $L_{(\mathcal{D},f)}(h^{\star}) = 0$, which is often unrealistic. In the agnostic setting we drop this assumption and instead model the data as a joint distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, so the labels themselves may be noisy.
Learning Problems beyond binary classification
Many learning tasks take a different form, such as regression tasks, where a real-valued quantity is predicted. Formally, let $\mathcal{H}$ be a set of functions from $\mathcal{X}$ to $\mathcal{Y} = \mathbb{R}$, and let the quality of a prediction be measured by a loss function such as the squared loss $\ell(h, (x, y)) = (h(x) - y)^2$.
For a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$:
- We redefine the true error to be the expected loss, $L_{\mathcal{D}}(h) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(h, (x, y))\big]$; for binary classification with the 0-1 loss this reduces to $L_{\mathcal{D}}(h) = \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \ne y]$.
We would like to find a predictor $h$ whose true error $L_{\mathcal{D}}(h)$ is as small as possible.
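A minimal sketch of the squared-loss empirical risk and the ERM rule for regression; the data points and the small class of constant predictors are made up for illustration:

```python
# Made-up training data and a small hypothetical class of constant predictors.
S = [(0.1, 1.2), (0.4, 0.9), (0.7, 1.6), (0.9, 1.1)]      # (x, y) pairs

def empirical_squared_loss(h, S):
    """L_S(h) = (1/m) * sum_i (h(x_i) - y_i)^2."""
    return sum((h(x) - y) ** 2 for x, y in S) / len(S)

H = [lambda x, c=c: c for c in (0.5, 1.0, 1.5, 2.0)]       # h_c(x) = c

h_S = min(H, key=lambda h: empirical_squared_loss(h, S))
print("ERM prediction:", h_S(0.0), "empirical loss:", empirical_squared_loss(h_S, S))
```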
The Bayes predictor
Given any probability distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$, the best label-predicting function from $\mathcal{X}$ to $\{0, 1\}$ is
$$f_{\mathcal{D}}(x) = \begin{cases} 1 & \text{if } \mathbb{P}[y = 1 \mid x] \ge \tfrac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$
This is the Bayes optimal predictor, and no other classifier has a lower error. It can be shown that no algorithm can be guaranteed to find a predictor that is as good as the Bayes optimal predictor.
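A minimal sketch of the Bayes predictor for a small discrete domain where $\mathbb{P}[y = 1 \mid x]$ is known exactly (the distribution numbers are hypothetical):

```python
# Hypothetical distribution over X = {"a", "b", "c"} and Y = {0, 1}.
p_y1_given_x = {"a": 0.9, "b": 0.3, "c": 0.5}   # P[y = 1 | x]
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}            # marginal P[x]

def bayes_predictor(x):
    """f_D(x) = 1 iff P[y = 1 | x] >= 1/2."""
    return 1 if p_y1_given_x[x] >= 0.5 else 0

def true_error(h):
    """L_D(h) = sum_x P[x] * P[h(x) != y | x]."""
    return sum(p_x[x] * ((1 - p) if h(x) == 1 else p)
               for x, p in p_y1_given_x.items())

print("Bayes error:   ", true_error(bayes_predictor))   # 0.5*0.1 + 0.3*0.3 + 0.2*0.5 = 0.24
print("Always-1 error:", true_error(lambda x: 1))       # 0.5*0.1 + 0.3*0.7 + 0.2*0.5 = 0.36
```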
Now, we require that the learning algorithm will find a predictor
whose error is not much larger than the best possible error of a
predictor in some given benchmark hypothesis class. The strength of
such a requirement depends on the choice of that hypothesis class.
That is, the algorithm returns a hypothesis $h$ such that, with probability at least $1 - \delta$, $L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \varepsilon$. A class that admits such a learner is called agnostically PAC learnable.
Learning Via Uniform Convergence
Given a hypothesis class $\mathcal{H}$, the ERM rule returns a hypothesis minimizing the empirical risk over $\mathcal{H}$. For this hypothesis to also (approximately) minimize the true risk, it suffices that the empirical risks of all members of $\mathcal{H}$ are good approximations of their true risks, uniformly over the class.
A training set $S$ is called $\varepsilon$-representative (with respect to a domain $Z$, a hypothesis class $\mathcal{H}$, a loss function $\ell$, and a distribution $\mathcal{D}$) if for all $h \in \mathcal{H}$, $|L_S(h) - L_{\mathcal{D}}(h)| \le \varepsilon$.
Whenever a sample $S$ is $(\varepsilon/2)$-representative, the ERM learning rule is guaranteed to return a good hypothesis: any $h_S \in \arg\min_{h \in \mathcal{H}} L_S(h)$ satisfies $L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \varepsilon$.
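For completeness, the short chain of inequalities behind this lemma (the standard argument; not spelled out above): for every $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h_S) \le L_S(h_S) + \frac{\varepsilon}{2} \le L_S(h) + \frac{\varepsilon}{2} \le L_{\mathcal{D}}(h) + \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = L_{\mathcal{D}}(h) + \varepsilon,$$
where the first and third inequalities use the $(\varepsilon/2)$-representativeness of $S$ and the second uses the fact that $h_S$ minimizes the empirical risk.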
This lemma implies that to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least $1 - \delta$ over the random choice of a training set, it will be an $(\varepsilon/2)$-representative training set.
A hypothesis class $\mathcal{H}$ has the uniform convergence property (with respect to a domain $Z$ and a loss function $\ell$) if there exists a function $m^{\mathrm{UC}}_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ such that for every $\varepsilon, \delta \in (0,1)$ and every probability distribution $\mathcal{D}$ over $Z$, if $S$ is a sample of $m \ge m^{\mathrm{UC}}_{\mathcal{H}}(\varepsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then, with probability of at least $1 - \delta$, $S$ is $\varepsilon$-representative.
We also need the measure concentration inequality due to Hoeffding, which quantifies the gap between empirical averages and their expected value:
Let $\theta_1, \ldots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any $\varepsilon > 0$,
$$\mathbb{P}\!\left[\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \varepsilon\right] \le 2\exp\!\left(-\frac{2 m \varepsilon^2}{(b - a)^2}\right)$$
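As a sketch of how this inequality is typically applied here (the standard argument for finite classes, assuming a loss bounded in $[0,1]$; this step is not spelled out above): fixing $h \in \mathcal{H}$ and setting $\theta_i = \ell(h, z_i)$, Hoeffding's inequality gives $\mathbb{P}\big[|L_S(h) - L_{\mathcal{D}}(h)| > \varepsilon\big] \le 2 e^{-2 m \varepsilon^2}$, and a union bound over the finitely many members of $\mathcal{H}$ yields
$$\mathcal{D}^m\!\left(\{ S : \exists h \in \mathcal{H},\; |L_S(h) - L_{\mathcal{D}}(h)| > \varepsilon \}\right) \le 2\,|\mathcal{H}|\, e^{-2 m \varepsilon^2},$$
which is at most $\delta$ whenever $m \ge \frac{\log(2|\mathcal{H}|/\delta)}{2\varepsilon^2}$. Hence every finite hypothesis class has the uniform convergence property.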
Classes of functions for which the uniform convergence property holds are also called Glivenko-Cantelli classes. The fundamental theorem of learning theory states that in binary classification problems, uniform convergence is not only a sufficient condition for learnability, but is also a necessary condition. This is not the case for more general learning problems.