VC-Dimension
- tags
- Bias-Complexity Tradeoff, PAC Learning
Shalev-Shwartz and Ben-David, 2014
What makes one class learnable and another unlearnable? In the setting of
binary classification with the zero-one loss, learnability is characterized
by a combinatorial notion called the Vapnik-Chervonenkis dimension
(VC-dimension).
Infinite-size classes can be learnable
To see that this is true, we give an example.
Let $\mathcal{H}$ be the set of threshold functions over the real
line, namely, $\mathcal{H} = \{h_a : a \in \mathbb{R}\}$,
where $h_a : \mathbb{R} \to \{0, 1\}$ is the
function $h_a(x) = \mathbb{1}_{[x < a]}$. Clearly $\mathcal{H}$ is of
infinite size. However, we can easily show that $\mathcal{H}$ is PAC
learnable, with sample complexity:
$$m_{\mathcal{H}}(\epsilon, \delta) \leq \left\lceil \frac{\log(2/\delta)}{\epsilon} \right\rceil$$
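As a sanity check, here is a minimal, hypothetical sketch of an ERM learner for this class (the helper `erm_threshold` and the toy data are illustrative, not from the text): any threshold that sits above every positively labeled point and at or below every negatively labeled point has zero empirical risk.

```python
import numpy as np

def erm_threshold(xs, ys):
    """ERM for the threshold class h_a(x) = 1[x < a] on a realizable sample:
    return a threshold with zero empirical risk."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=int)
    lo = xs[ys == 1].max() if (ys == 1).any() else -np.inf  # largest point labeled 1
    hi = xs[ys == 0].min() if (ys == 0).any() else np.inf   # smallest point labeled 0
    assert lo < hi, "sample is not realizable by a threshold"
    if np.isinf(lo) and np.isinf(hi):
        return 0.0          # empty sample: any threshold is consistent
    if np.isinf(lo):
        return hi           # all labels 0: any a <= min of the points works
    if np.isinf(hi):
        return lo + 1.0     # all labels 1: any a > max of the points works
    return (lo + hi) / 2.0  # midpoint of the feasible interval

# toy usage: data labeled by an (assumed) true threshold a* = 0.3
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=50)
ys = (xs < 0.3).astype(int)
print(erm_threshold(xs, ys))  # close to 0.3 once the sample is moderately large
```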
The VC-dimension
Hence, while finiteness of $\mathcal{H}$ is a sufficient condition for
PAC learnability, it is not a necessary condition. Here we show that
the VC-dimension of a hypothesis class gives the correct
characterization of its learnability.
Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to
$\{0, 1\}$, and let $C = \{c_1, \dots, c_m\} \subset \mathcal{X}$. The
restriction of $\mathcal{H}$ to $C$ is the set of functions from
$C$ to $\{0, 1\}$ that can be derived from $\mathcal{H}$. That is:
$$\mathcal{H}_C = \{(h(c_1), \dots, h(c_m)) : h \in \mathcal{H}\},$$
where we represent each function from $C$ to $\{0, 1\}$ as a vector in
$\{0, 1\}^{|C|}$.
A hypothesis class $\mathcal{H}$ shatters a finite set $C \subset \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of
all functions from $C$ to $\{0, 1\}$. That is, $|\mathcal{H}_C| = 2^{|C|}$.
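To make the restriction and shattering definitions concrete, here is a small brute-force sketch in Python; it only works for a finite pool of hypotheses standing in for $\mathcal{H}$, and the helper names (`restriction`, `shatters`) are mine, not standard:

```python
from itertools import product

def restriction(hypotheses, C):
    """H_C: the set of distinct label vectors (h(c_1), ..., h(c_m))
    induced on the points of C by the hypotheses in the pool."""
    return {tuple(h(c) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    """The pool shatters C iff its restriction to C contains all 2^{|C|} label vectors."""
    return restriction(hypotheses, C) == set(product([0, 1], repeat=len(C)))
```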
Whenever some set $C$ is shattered by $\mathcal{H}$, the adversary is
not restricted by $\mathcal{H}$, as they can construct a distribution
over $C$ based on any target function from $C$ to $\{0, 1\}$, while
still maintaining the realizability assumption.
This leads us to the definition of VC-dimension:
The VC-dimension of a hypothesis class $\mathcal{H}$, denoted
$\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C \subset \mathcal{X}$ that can be shattered by $\mathcal{H}$. If
$\mathcal{H}$ can shatter sets of arbitrarily large size, then $\mathcal{H}$ has infinite VC-dimension.
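Continuing the sketch above (and reusing the `shatters` helper), a brute-force search over subsets of a finite pool of points gives a lower bound on the VC-dimension; it cannot certify an exact value for an infinite domain or class:

```python
from itertools import combinations

def vcdim_lower_bound(hypotheses, points):
    """Size of the largest subset of `points` shattered by the hypothesis pool.
    This is only a lower bound on VCdim(H), since both pools are finite samples."""
    best = 0
    for k in range(1, len(points) + 1):
        if any(shatters(hypotheses, C) for C in combinations(points, k)):
            best = k
        else:
            break  # every subset of a shattered set is shattered, so we can stop here
    return best
```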
Examples
Threshold Functions
Let $\mathcal{H}$ be the class of threshold functions over
$\mathbb{R}$. We have shown that for a set $C = \{c_1\}$,
$\mathcal{H}$ shatters $C$. However, for an
arbitrary set $C = \{c_1, c_2\}$ where $c_1 \leq c_2$, the labeling
$(0, 1)$ cannot be obtained (any threshold that labels $c_2$ with $1$ must also label $c_1$ with $1$), so $\mathcal{H}$
does not shatter $C$. Hence $\mathrm{VCdim}(\mathcal{H}) = 1$.
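Reusing the `shatters` helper from the sketch above, a finite grid of thresholds is enough to illustrate both claims (the grid is an arbitrary stand-in for the infinite class):

```python
# h_a(x) = 1[x < a] for thresholds a on a grid in [-3, 3]
thresholds = [lambda x, a=a: int(x < a) for a in (i / 10 for i in range(-30, 31))]

print(shatters(thresholds, [0.0]))       # True: both labelings of a single point occur
print(shatters(thresholds, [0.0, 1.0]))  # False: (0, 1) would need a <= 0 and 1 < a
```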
Intervals
Let $\mathcal{H} = \{h_{a,b} : a < b\}$ be the class of intervals over
$\mathbb{R}$, where $h_{a,b}(x) = \mathbb{1}_{[a < x < b]}$. Take
$C = \{1, 2\}$, and we see that $\mathcal{H}$ shatters $C$. Hence
$\mathrm{VCdim}(\mathcal{H}) \geq 2$. However, take an arbitrary set
$C = \{c_1, c_2, c_3\}$ where $c_1 < c_2 < c_3$. Then the labelling
$(1, 0, 1)$ cannot be obtained by an interval. Therefore,
$\mathrm{VCdim}(\mathcal{H}) = 2$.
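The same brute-force check (again reusing `shatters`, with a finite grid standing in for the class) illustrates the intervals example:

```python
# h_{a,b}(x) = 1[a < x < b] for interval endpoints on a grid in [-3, 3]
grid = [i / 2 for i in range(-6, 7)]
intervals = [lambda x, a=a, b=b: int(a < x < b) for a in grid for b in grid if a < b]

print(shatters(intervals, [1.0, 2.0]))       # True: all four labelings occur, so VCdim >= 2
print(shatters(intervals, [0.0, 1.0, 2.0]))  # False: (1, 0, 1) would need two disjoint pieces
```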
The Fundamental Theorem of Statistical Learning
Let $\mathcal{H}$ be a hypothesis class of functions from a domain
$\mathcal{X}$ to $\{0, 1\}$ and let the loss function be the 0-1 loss.
Then the following are equivalent:
- $\mathcal{H}$ has the uniform convergence property.
- Any ERM rule is a successful agnostic PAC learner for $\mathcal{H}$.
- $\mathcal{H}$ is agnostic PAC learnable.
- $\mathcal{H}$ is PAC learnable.
- Any ERM rule is a successful PAC learner for $\mathcal{H}$.
- $\mathcal{H}$ has a finite VC-dimension.
Sauer’s Lemma and the Growth Function
We have defined the notion of shattering by considering the
restriction of $\mathcal{H}$ to a finite set of instances. The growth
function measures the maximal “effective” size of $\mathcal{H}$ on a
set of $m$ examples. Formally:
Let $\mathcal{H}$ be a hypothesis class. Then the growth function of
$\mathcal{H}$, denoted $\tau_{\mathcal{H}} : \mathbb{N} \to \mathbb{N}$, is defined as:
$$\tau_{\mathcal{H}}(m) = \max_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|.$$
In other words, $\tau_{\mathcal{H}}(m)$ is the number of different functions from a
set $C$ of size $m$ to $\{0, 1\}$ that can be obtained by restricting
$\mathcal{H}$ to $C$. With this definition we can now state Sauer’s
lemma:
Let $\mathcal{H}$ be a hypothesis class with
$\mathrm{VCdim}(\mathcal{H}) \leq d < \infty$. Then for all $m$,
$$\tau_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}.$$
In particular, if $m > d + 1$, then $\tau_{\mathcal{H}}(m) \leq (em/d)^d$.
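A quick numeric check of the bound (a sketch; `sauer_bound` is just the right-hand side of the lemma, not a library function):

```python
from math import comb, e

def sauer_bound(m, d):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{d} C(m, i), valid when VCdim(H) <= d."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3
for m in (2, 5, 10, 50):
    poly = (e * m / d) ** d if m > d + 1 else None  # polynomial form, only for m > d + 1
    print(m, 2 ** m, sauer_bound(m, d), poly)
# for m <= d the bound equals 2^m; beyond that it grows only polynomially, like O(m^d)
```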
<biblio.bib>