(Shalev-Shwartz & Ben-David, 2014)

What makes one class learnable and another unlearnable? The family of learnable classes in the setup of binary valued classification with the zero-one loss relies on a combinatorial notion called the Vapnik-Chervonenkis dimension (VC-dimension).

## Infinite-size classes can be learnable

To see that this is true, we provide a counterexample.

let \(\mathcal{H}\) be the set of threshold functions over the real line, namely, \(\mathcal{H} = \left\{h_a : a \in \mathbb{R}\right\}\), where \(h_a : \mathbb{R} \rightarrow \left\{ 0,1\rightarrow \right\}\) is a function such that \(h_a(x) = \mathbb{I}_{[x < a]}\). Clearly \(H\) is of infinite size. However, we can easily show that \(\mathcal{H}\) is PAC learnable, with sample complexity:

\begin{equation} m_H(\epsilon, \delta) \le \lceil \log (2/\delta) / \epsilon \rceil \end{equation}

## The VC-dimension

Hence, while finiteness of \(\mathcal{H}\) is a sufficient condition for PAC learnability, it is not a necessary condition. Here we show that the VC-dimension of a hypothesis class gives the correct characterization of its learnability.

Let \(\mathcal{H}\) be a class of functions from \(\mathcal{X}\) to \(\left\{0,1\right\}\), and let \(C = \{c_1, \dots, c_m\} \subset X\). The restriction of \(\mathcal{H}\) to \(C\) is the set of functions from \(C\) to \(\{0, 1\}\) that can be derived from \(\mathcal{H}\). That is:

\begin{equation} \mathcal{H}_C = \left\{ h(c_1), \dots, h(c_m) : h \in \mathcal{H} \right\} \end{equation}

where we represent each function from \(C\) to \(\{0, 1\}\) as a vector in \(\{0,1\}^{|C|}\).

A hypothesis class \(\mathcal{H}\) shatters a finite set \(C \subset \mathcal{X}\) if the restriction of \(\mathcal{H}\) to \(C\) is the set of all functions from \(C\) to \(\{0, 1\}\). That is, \(|\mathcal{H}_C| = 2^{|C|}\).

Whenever some set \(C\) is shattered by \(\mathcal{H}\), the adversary is not restricted by \(\mathcal{H}\), as they can construct a distribution over \(C\) based on any target function from \(C\) to \(\{0,1\}\), while still maintaining the realizability assumption.

This leads us to the definition of VC-dimension:

The VC-dimension of a hypothesis class \(\mathcal{H}\), denoted \(\textrm{VCdim}(\mathcal{H})\), is the maximal size of a set \(C \subset \mathcal{X}\) that can be shattered by \(\mathcal{H}\). If \(\mathcal{H}\) can shatter \(C\) of any arbitrary size, then \(\mathcal{H}\) has infinite VC-dimension.

## Examples

### Threshold Functions

Let \(\mathcal{H}\) be the class of threshold functions over \(\mathbb{R}\). We have shown that for an arbitrary set \(C = \{c_1\}\), \(\mathcal{H}\) shatters \(C\). However, we have shown that for an arbitrary set \(C = \{c_1, c_2\}\) where \(c_1 \le c_2\), \(\mathcal{H}\) does not shatter \(C\). Hence \(\textrm{VCdim}(\mathcal{H}) = 1\).

### Intervals

Take \(C = {1, 2}\), and we see that \(\mathcal{H}\) shatters \(C\). Hence \(\textrm{VCdim}(\mathcal{H}) \ge 2\). However, take an arbitrary set \(C = \{c_1, c_2, c_3\}\) where \(c_1 \le c_2 \le c_3\). Then the labelling (1,0,1) cannot be obtained by an interval. Therefore, \(\textrm{VCdim}(\mathcal{H}) = 2\).

## The Fundamental Theorem of Statistical Learning

Let \(\mathcal{H}\) be a hypothesis class of functions from a domain \(\mathcal{X}\) to \(\{0, 1\}\) and let the loss function be the 0-1 loss. Then the following are equivalent:

- \(\mathcal{H}\) has the uniform convergence property.
- Any ERM rule is a successful agnostic PAC learner for \(\mathcal{H}\).
- \(\mathcal{H}\) is agnostic PAC learnable.
- \(\mathcal{H}\) is PAC learnable.
- Any ERM rule is a successful PAC learner for \(\mathcal{H}\).
- \(\mathcal{H}\) has a finite VC-dimension.

## Sauer’s Lemma and the Growth Function

We have defined the notion of shattering, by considering the restriction of \(\mathcal{H}\) to a finite set of instances. The growth function measures the maximal “effective” size of \(\mathcal{H}\) on a set of \(m\) examples. Formally:

Let \(\mathcal{H}\) be a hypothesis class. Then the growth function of \(\mathcal{H}\), denoted \(\tau_{\mathcal{H}}(m) : \mathbb{N} \rightarrow \mathbb{N}\), is defined as:

\begin{equation} \tau_{\mathcal{H}}(m) = \textrm{max}_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C| \end{equation}

\(\tau_{\mathcal{H}}(m)\) is the number of different functions from a set \(C\) of size \(m\) to \(\{0,1\}\) that can be obtained by restricting \(\mathcal{H}\) to \(C\). With this definition we can now state Sauer’s lemma:

Let \(\mathcal{H}\) be a hypothesis class with \(\textrm{VCdim}(\mathcal{H}) \le d < \infty\). Then for all \(m\),

\begin{equation} \tau_{\mathcal{H}}(m) \le \sum_{i=0}^{d}{m \choose i} \end{equation}

In particular, if \(m > d + 1\), then \(\tau_{\mathcal{H}}(m) \le (em/d)^d\).

# Bibliography

Shalev-Shwartz, S., & Ben-David, S., *Understanding machine learning: from theory to algorithms* (2014), : Cambridge university press. ↩