ERM for finite hypothesis classes

We note that empirical risk minimization can easily overfit to the training data. To correct for this, we introduce inductive bias, restricting the hypothesis space \(\mathcal{H}\).

The simplest type of restriction is to impose an upper bound on the size of a class. Here, we show that if a hypothesis class is finite, then ERM will not overfit given a sufficiently large sample size.

Let \(h_S\) denote the result of applying ERM to \(S\):

\begin{equation} h_S \in \textrm{argmin}_{h \in \mathcal{H}} L_S(h) \end{equation}

We make 2 assumptions. First, the realizability assumption, implying that every ERM hypothesis we have that \(L_S(h_S) = 0\). However, we are more interested in the true risk \(L_{(D,f)}(h_S)\) rather than the empirical risk.

The Realizability Assumption: There exists \(h^* \in \mathcal{H}\) such that \(L_{(D,f)}(h^*)= 0\). That is, with probability 1 over random samples \(S\), where the instances are sampled according to \(D\), and labelled according to \(f\), we have \(L_S(h^*) = 0\).

Any guarantee on the error with respect to the underlying distribution \(D\), must depend on the relationship between \(D\) and \(S\). Here, we make the second assumption that the training examples are drawn i.i.d.

Since \(L_{(D,f)}(h_S)\) depends on the training set, which is drawn via a random process, it is also a random variable.

We introduce 2 parameters:

  1. the probability of getting a non-representative sample, denoted by \(\delta\). We denote \((1 - \delta)\) the confidence parameter of our prediction.
  2. We denote \(\epsilon\) as the accuracy parameter of the prediction. The event that \(L_{(D,f)}(h_S) > \epsilon\) is a failure of the learner, while \(L_{(D,f)}(h_S) \le \epsilon\) is the event where the predictor is approximately correct.

We are interested in upper bounding the probability to sample m-tuple of instances that will lead to failure of the learner. Formally, let \(S_x = \left(x_1, \dots, x_m \right)\) be the instances of the training set. We would like to upper-bound:

\begin{equation} D^M(\left\{ S_x ; L_{(D,f)}(h_S) > \epsilon \right\}) \end{equation}

Let \(H_B\) be the set of bad hypotheses, that is,

\begin{equation} \mathcal{H}_B = \left\{ h \in \mathcal{H} : L_{(D,f)}(h)> \epsilon \right\} \end{equation}

In addition, let:

\begin{equation} M = \left\{ S_x: \exists h \in \mathcal{H}_B, L_S(h) = 0 \right\} \end{equation}

be the set of misleading samples. For every \(S_x \in M\), there is a bad hypothesis, \(h \in \mathcal{H}_B\) that looks like a good hypothesis in \(S_x\).

Since the realizability assumption implies \(L_S(h_S) = 0\), then the event \(L_{(D,f)}(h_S) > \epsilon\) will only happen if our sample is in the set of misleading examples \(M\).

Then:

\begin{equation} D^m(\left\{ S_x : L_{(D,f)}(h_S) > \epsilon \right\}) \le D^m(M) =D^m(\cup_{h \in \mathcal{H}_B} {S_x: L_S(h) = 0}) \end{equation}

Because the training samples are i.i.d.:

\begin{align} D^m(\left\{ S_x: L_S(h) = 0\right\}) &= D^m(\left\{ S_x: \forall i, h(x_i) = f(x_i) \right\}) \\
&= \prod_{i=1}^{m}D(\left\{ x_i: h(x_i) = f(x_i) \right\}) \end{align}

for each individual sampling of an element of the training set, we have:

\begin{equation} D(\left\{ x_i: h(x_i) = y_i \right\}) = 1 - L_{(D,f)}(h) \le 1- \epsilon \end{equation}

Using the inequality \(1 - \epsilon \le e^{-\epsilon}\), we obtain that:

\begin{equation} D^m(\left\{ S_x: L_S(h) = 0 \right\}) \le (1 - \epsilon)^m \le e^{-\epsilon m} \end{equation}

Applying the union bound, we get:

\begin{equation} D^m(\left\{ S_x: L_{(D,f)}(h_S) > \epsilon \right\}) \le \left| \mathcal{H}_B \right|(1 - \epsilon)^m \le \left| \mathcal{H}_B \right| e^{-\epsilon m} \end{equation}

With this result, we can show that where \(m \ge \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}\), the error \(L_{(D,f)(h_S) \le \epsilon}\) for every ERM hypothesis \(h_S\).

Formulation

A hypothesis class \(\mathcal{H}\) is PAC learnable if there exist a function \(m_{\mathcal{H}} : (0,1)^2 \rightarrow \mathbb{N}\) and a learning algorithm with the following property: For every \(\epsilon, \delta \in (0,1)\), for every distribution \(\mathcal{D}\) over \(\mathcal{X}\), and with respect to \(\mathcal{H}, \mathcal{D}, \mathcal{f}\), then when running the learning algorithm on \(m \ge m_{\mathcal{H}}(\epsilon, \delta)\) i.i.d. examples generated by \(\mathcal{D}\), and labeled by \(f\), the algorithm returns a hypothesis \(h\) such that, with probability of at least \(1 - \delta\) (over the choice of examples), \(L_{(D,f)}(h) \le \epsilon\).

The accuracy parameter \(\epsilon\) determines how far the output classifier can be from the optimal one, and the confidence parameter \(\delta\) indicates how likely the classifier is to meet the accuracy requirement.

Under the data access model, these approximations are inevitable: for example, the training set is randomly generated, and there is a chance the training samples will be non-informative.

Sample Complexity

The function \(m_{\mathcal{H}}: (0,1)^2 \rightarrow \mathbb{N}\) determines the sample complexity of learning \(\mathcal{H}\): that is, how many examples are required to guarantee a probably approximately correct solution.

We have shown previously that every finite hypothesis class is PAC learnable with sample complexity:

\begin{equation} m_{\mathcal{H}} (\epsilon, \delta) \le \lceil \frac{\log(|\mathcal{H}|/\delta)}{\epsilon} \rceil \end{equation}

There are infinite hypothesis classes that are PAC learnable, and the combinatorial measure called the VC dimension is what deterimines its PAC learnability.

Generalizing the Learning Model

Removing the Realizability Assumption

We have initially required that the learning algorithm succeeds on a pair of data distribution \(D\) and labeling function \(f\) provided that the realizability assumption is met. For practical learning tasks, this assumption is too strong.

The realizability assumption requires that there exists \(h^* \in \mathcal{H}\) such that \(\mathbb{P}_{x \sim D} [h^*(x) = f(x)] = 1\). We replace the

Learning Problems beyond binary classification

Many learning tasks take a different form, such as regression tasks, where a real-world value is predicted. Formally, let \(\mathcal{D}\) be a probability distribution over \(\mathcal{X} \times \mathcal{Y}\), where \(\mathcal{X}\) is our domain set, and \(\mathcal{Y}\) is a set of labels. \(\mathcal{D}\) is a joint distribution over domain points and labels. We can think of it as being composed of two parts: a distribution \(D_x\) over unlabeled domain points, and a conditional probability over labels for each domain point, \(D(x,y) | x\).

For a probability distribution \(D\), one can measure how likely \(h\) is to make an error when labeled points are randomly drawn according to D. We redefine the true error to be:

\begin{equation} L_D(h) = D(\left\{ (x,y): h(x) \ne y \right\}) \end{equation}

We would like to find a predictor \(h\) for which the error is minimized. However, the learner does not know \(D\), but instead has access to its training data, \(S\).

The Bayes predictor

Given any probability distribution \(D\) over \(X\), the best label predicting function \(X\) to \(\left\{ 0,1 \right\}\) will be:

\begin{equation} f_D(x) = \begin{cases} 1 & \textrm{if } P[y=1|x] \ge \frac{1}{2} \\
0 & \textrm{otherwise } \end{cases} \end{equation}

This is the Bayes optimal predictor, and no other classifier has a lower error. It can be shown that no algorithm can be guaranteed to find a predictor that is as good as the Bayes optimal predictor.

Now, we require that the learning algorithm will find a predictor whose error is not much larger than the best possible error of a predictor in some given benchmark hypothesis class. The strength of such a requirement depends on the choice of that hypothesis class. That is, the algorithm returns a hypothesis \(h\) such that with probability \(1 - \delta\):

\begin{equation} L_D(h) \le \textrm{min}{h’\in \mathcal{H}} L_D(h’) + \epsilon \end{equation}

Learning Via Uniform Convergence

Given a hypothesis class, \(\mathcal{H}\) The ERM learning paradigm works as follows: upon receiving a training sample, \(S\), the learner evaluates the risk of each \(h\) in \(H\), on the given sample, and outputs a member of \(H\) that minimizes this risk. Hence, all we need is that uniformly over all hypotheses in the hypothesis class, the empirical risk will be close to the true risk. We formalize it as follows:

A training set \(S\) is called a $ε$-representative sample (w.r.t. domain \(Z\), hypothesis class \(\mathcal{H}\), loss function \(l\), and distribution \(D\)) if

\begin{equation} \forall h \in \mathcal{H}, | L_S(h) - L_D(h)| \le \epsilon \end{equation}

Whenever a sample is (ε/2)-representative, the ERM learning rule is guaranteed to return a good hypothesis.

This lemma implies that to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least \(1 - \delta\) over the random choice of a training set, it will be an $ε$-representative set. This is referred to as the uniform convergence property.

A hypothesis class \(H\) has the uniform convergence property (w.r.t. domain \(Z\), and loss function \(l\)) if there exists a function \(m_{\mathcal{H}}^{Vc} : (0,1)^2 \rightarrow \mathbb{N}\) such that for every \(\epsilon, \delta \in (0,1)\) and for every probability distribution \(\mathcal{D}\) over \(Z\), if \(S\) is a sample of \(m \ge m_{\mathcal{H}}^{VC}(\epsilon, \delta)\) examples drawn i.i.d according to \(\mathcal{D}\), then with probability of at least \(1 - \delta\), \(S\) is $ε$-representative.

We also need the measure concentration inequality by Hoeffding, which quantifies the gap between empirical averages andn their expected value:

Let \(\theta_1, \dots, \theta_m\) be a sequence of i.i.d. random variables, and assume that for all \(i\), \(E[\theta_i] = \mu\) and P[a ≤ θ_i ≤ b] = 1$. Then for any \(\epsilon > 0\):

\begin{equation} P\left[ \left| \frac{1}{m}\sum_{i=1}^{m}\theta_i - \mu \right| > \epsilon \right] \le 2 \textrm{exp} \left( \frac{-2m\epsilon^2}{(b-a)^2} \right) \end{equation}

Classes of functions for which the uniform convergence property holds are also called Glivenko-Cantelli classes. The fundamental theorem of learning theory states that in binary classification problems, uniform convergence is not only a sufficient condition for learnability, but is also a necessary condition. This is not the case for more general learning problems.