VC-Dimension

tags
Bias-Complexity Tradeoff, PAC Learning

Shalev-Shwartz and Ben-David, n.d.

What makes one class learnable and another unlearnable? The family of learnable classes in the setup of binary valued classification with the zero-one loss relies on a combinatorial notion called the Vapnik-Chervonenkis dimension (VC-dimension).

Infinite-size classes can be learnable

To see that this is true, we provide a counterexample.

let $$\mathcal{H}$$ be the set of threshold functions over the real line, namely, $$\mathcal{H} = \left\{h_a : a \in \mathbb{R}\right\}$$, where $$h_a : \mathbb{R} \rightarrow \left\{ 0,1\rightarrow \right\}$$ is a function such that $$h_a(x) = \mathbb{I}_{[x < a]}$$. Clearly $$H$$ is of infinite size. However, we can easily show that $$\mathcal{H}$$ is PAC learnable, with sample complexity:

$$m_H(\epsilon, \delta) \le \lceil \log (2/\delta) / \epsilon \rceil$$

The VC-dimension

Hence, while finiteness of $$\mathcal{H}$$ is a sufficient condition for PAC learnability, it is not a necessary condition. Here we show that the VC-dimension of a hypothesis class gives the correct characterization of its learnability.

Let $$\mathcal{H}$$ be a class of functions from $$\mathcal{X}$$ to $$\left\{0,1\right\}$$, and let $$C = \{c_1, \dots, c_m\} \subset X$$. The restriction of $$\mathcal{H}$$ to $$C$$ is the set of functions from $$C$$ to $$\{0, 1\}$$ that can be derived from $$\mathcal{H}$$. That is:

$$\mathcal{H}_C = \left\{ h(c_1), \dots, h(c_m) : h \in \mathcal{H} \right\}$$

where we represent each function from $$C$$ to $$\{0, 1\}$$ as a vector in $$\{0,1\}^{|C|}$$.

A hypothesis class $$\mathcal{H}$$ shatters a finite set $$C \subset \mathcal{X}$$ if the restriction of $$\mathcal{H}$$ to $$C$$ is the set of all functions from $$C$$ to $$\{0, 1\}$$. That is, $$|\mathcal{H}_C| = 2^{|C|}$$.

Whenever some set $$C$$ is shattered by $$\mathcal{H}$$, the adversary is not restricted by $$\mathcal{H}$$, as they can construct a distribution over $$C$$ based on any target function from $$C$$ to $$\{0,1\}$$, while still maintaining the realizability assumption.

This leads us to the definition of VC-dimension:

The VC-dimension of a hypothesis class $$\mathcal{H}$$, denoted $$\textrm{VCdim}(\mathcal{H})$$, is the maximal size of a set $$C \subset \mathcal{X}$$ that can be shattered by $$\mathcal{H}$$. If $$\mathcal{H}$$ can shatter $$C$$ of any arbitrary size, then $$\mathcal{H}$$ has infinite VC-dimension.

Examples

Threshold Functions

Let $$\mathcal{H}$$ be the class of threshold functions over $$\mathbb{R}$$. We have shown that for an arbitrary set $$C = \{c_1\}$$, $$\mathcal{H}$$ shatters $$C$$. However, we have shown that for an arbitrary set $$C = \{c_1, c_2\}$$ where $$c_1 \le c_2$$, $$\mathcal{H}$$ does not shatter $$C$$. Hence $$\textrm{VCdim}(\mathcal{H}) = 1$$.

Intervals

Take $$C = {1, 2}$$, and we see that $$\mathcal{H}$$ shatters $$C$$. Hence $$\textrm{VCdim}(\mathcal{H}) \ge 2$$. However, take an arbitrary set $$C = \{c_1, c_2, c_3\}$$ where $$c_1 \le c_2 \le c_3$$. Then the labelling (1,0,1) cannot be obtained by an interval. Therefore, $$\textrm{VCdim}(\mathcal{H}) = 2$$.

The Fundamental Theorem of Statistical Learning

Let $$\mathcal{H}$$ be a hypothesis class of functions from a domain $$\mathcal{X}$$ to $$\{0, 1\}$$ and let the loss function be the 0-1 loss. Then the following are equivalent:

1. $$\mathcal{H}$$ has the uniform convergence property.
2. Any ERM rule is a successful agnostic PAC learner for $$\mathcal{H}$$.
3. $$\mathcal{H}$$ is agnostic PAC learnable.
4. $$\mathcal{H}$$ is PAC learnable.
5. Any ERM rule is a successful PAC learner for $$\mathcal{H}$$.
6. $$\mathcal{H}$$ has a finite VC-dimension.

Sauer’s Lemma and the Growth Function

We have defined the notion of shattering, by considering the restriction of $$\mathcal{H}$$ to a finite set of instances. The growth function measures the maximal “effective” size of $$\mathcal{H}$$ on a set of $$m$$ examples. Formally:

Let $$\mathcal{H}$$ be a hypothesis class. Then the growth function of $$\mathcal{H}$$, denoted $$\tau_{\mathcal{H}}(m) : \mathbb{N} \rightarrow \mathbb{N}$$, is defined as:

$$\tau_{\mathcal{H}}(m) = \textrm{max}_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|$$

$$\tau_{\mathcal{H}}(m)$$ is the number of different functions from a set $$C$$ of size $$m$$ to $$\{0,1\}$$ that can be obtained by restricting $$\mathcal{H}$$ to $$C$$. With this definition we can now state Sauer’s lemma:

Let $$\mathcal{H}$$ be a hypothesis class with $$\textrm{VCdim}(\mathcal{H}) \le d < \infty$$. Then for all $$m$$,

$$\tau_{\mathcal{H}}(m) \le \sum_{i=0}^{d}{m \choose i}$$

In particular, if $$m > d + 1$$, then $$\tau_{\mathcal{H}}(m) \le (em/d)^d$$.

<biblio.bib>