
VC-Dimension

tags: Bias-Complexity Tradeoff, PAC Learning

(Shalev-Shwartz and Ben-David 2014)

What makes one class learnable and another unlearnable? In the setting of binary-valued classification with the zero-one loss, learnability is characterized by a combinatorial notion called the Vapnik-Chervonenkis dimension (VC-dimension).

Infinite-size classes can be learnable

To see that this is true, we provide a counterexample: an infinite hypothesis class that is nevertheless PAC learnable.

Let $H$ be the set of threshold functions over the real line, namely $H = \{h_a : a \in \mathbb{R}\}$, where $h_a : \mathbb{R} \to \{0, 1\}$ is the function $h_a(x) = \mathbb{1}[x < a]$. Clearly $H$ is of infinite size. However, we can easily show that $H$ is PAC learnable, with sample complexity:

$$m_H(\epsilon, \delta) \leq \left\lceil \frac{\log(2/\delta)}{\epsilon} \right\rceil$$
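To make the bound concrete, here is a minimal Python sketch of an ERM learner for the threshold class, checked numerically against the bound. The function name `erm_threshold`, the uniform data distribution, and the choice $\epsilon = 0.1$, $\delta = 0.05$ (for which the bound gives $\lceil \log(40)/0.1 \rceil = 37$ samples) are illustrative assumptions, not from the text.

```python
import random

def erm_threshold(samples):
    """ERM for the class h_a(x) = 1[x < a]: return a threshold consistent with the sample."""
    positives = [x for x, y in samples if y == 1]
    negatives = [x for x, y in samples if y == 0]
    if positives and negatives:
        # Under realizability every positive point lies below every negative point,
        # so any threshold in the gap between them is consistent.
        return (max(positives) + min(negatives)) / 2
    if negatives:
        return min(negatives)        # no positives: the smallest negative works as the threshold
    if positives:
        return max(positives) + 1.0  # no negatives: anything above the largest positive works
    return 0.0                       # empty sample: any threshold is vacuously consistent

# Empirical check with epsilon = 0.1, delta = 0.05: the bound suggests 37 samples suffice.
target_a = 0.3
samples = [(x, int(x < target_a)) for x in (random.uniform(-1, 1) for _ in range(37))]
a_hat = erm_threshold(samples)
test_points = [random.uniform(-1, 1) for _ in range(10_000)]
error = sum(int(x < a_hat) != int(x < target_a) for x in test_points) / len(test_points)
print(f"learned threshold {a_hat:.3f}, estimated true error {error:.4f}")
```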

The VC-dimension

Hence, while finiteness of H is a sufficient condition for PAC learnability, it is not a necessary condition. Here we show that the VC-dimension of a hypothesis class gives the correct characterization of its learnability.

Let $H$ be a class of functions from $X$ to $\{0, 1\}$, and let $C = \{c_1, \dots, c_m\} \subset X$. The restriction of $H$ to $C$ is the set of functions from $C$ to $\{0, 1\}$ that can be derived from $H$. That is:

$$H_C = \{ (h(c_1), \dots, h(c_m)) : h \in H \}$$

where we represent each function from $C$ to $\{0, 1\}$ as a vector in $\{0, 1\}^{|C|}$.

A hypothesis class $H$ shatters a finite set $C \subset X$ if the restriction of $H$ to $C$ is the set of all functions from $C$ to $\{0, 1\}$. That is, $|H_C| = 2^{|C|}$.

Whenever some set $C$ is shattered by $H$, an adversary choosing the data distribution is not restricted by $H$: it can construct a distribution over $C$ based on any target function from $C$ to $\{0, 1\}$, while still maintaining the realizability assumption.

This leads us to the definition of VC-dimension:

The VC-dimension of a hypothesis class $H$, denoted $\mathrm{VCdim}(H)$, is the maximal size of a set $C \subset X$ that can be shattered by $H$. If $H$ can shatter sets of arbitrarily large size, we say that $H$ has infinite VC-dimension.
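Since shattering is a purely combinatorial condition, it can be checked by brute force once both the hypothesis class and the candidate set are finite. The Python sketch below is an illustration under that assumption; the helper names (`restriction`, `shatters`, `vc_dim_lower_bound`) are hypothetical, and searching a finite pool of points only yields a lower bound on the true VC-dimension.

```python
from itertools import combinations

def restriction(hypotheses, C):
    """H_C: the set of labelings of C (as tuples) realized by the given hypotheses."""
    return {tuple(h(c) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    """True iff the hypotheses realize all 2^|C| labelings of C."""
    return len(restriction(hypotheses, C)) == 2 ** len(C)

def vc_dim_lower_bound(hypotheses, domain_points, max_size=4):
    """Largest k such that some k-subset of domain_points is shattered.

    Only a lower bound on VCdim: it searches a finite pool of points
    and a finite list of hypotheses rather than the full class.
    """
    best = 0
    for k in range(1, max_size + 1):
        if any(shatters(hypotheses, C) for C in combinations(domain_points, k)):
            best = k
    return best
```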

Examples

Threshold Functions

Let $H$ be the class of threshold functions over $\mathbb{R}$. For an arbitrary singleton $C = \{c_1\}$, $H$ shatters $C$, so $\mathrm{VCdim}(H) \geq 1$. However, for any set $C = \{c_1, c_2\}$ with $c_1 < c_2$, $H$ does not shatter $C$: the labeling $(0, 1)$ cannot be obtained, since $h_a(c_2) = 1$ implies $c_2 < a$, and then $c_1 < c_2 < a$ gives $h_a(c_1) = 1$ as well. Hence $\mathrm{VCdim}(H) = 1$.
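Using the brute-force checker above with a finite grid of thresholds as an illustrative stand-in for the full class, we can reproduce both claims: any singleton is shattered, while the labeling $(0, 1)$ on a pair of ordered points is never realized.

```python
# Finite stand-in for {h_a : a in R}, with h_a(x) = 1[x < a], on a small grid of a's.
thresholds = [lambda x, a=a: int(x < a) for a in [-2, -1, 0, 1, 2]]

print(shatters(thresholds, [0.5]))           # True: any singleton is shattered
print(shatters(thresholds, [0.5, 1.5]))      # False
print(restriction(thresholds, [0.5, 1.5]))   # {(0, 0), (1, 0), (1, 1)}: (0, 1) is missing
print(vc_dim_lower_bound(thresholds, [-1.5, -0.5, 0.5, 1.5]))  # 1
```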

Intervals

Now let $H$ be the class of intervals over $\mathbb{R}$, namely $H = \{h_{a,b} : a, b \in \mathbb{R}, a \leq b\}$, where $h_{a,b}(x) = \mathbb{1}[a \leq x \leq b]$. Take $C = \{1, 2\}$; $H$ shatters $C$, hence $\mathrm{VCdim}(H) \geq 2$. However, take an arbitrary set $C = \{c_1, c_2, c_3\}$ with $c_1 < c_2 < c_3$. Then the labeling $(1, 0, 1)$ cannot be obtained by an interval. Therefore $\mathrm{VCdim}(H) = 2$.
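The same check works here; below, a small grid of closed intervals stands in for the full class (an illustrative assumption). It shatters a two-point set but fails to realize $(1, 0, 1)$ on three ordered points.

```python
# Finite stand-in for the class of closed intervals h_{a,b}(x) = 1[a <= x <= b].
endpoints = [0, 1, 2, 3, 4]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in endpoints for b in endpoints if a <= b]

print(shatters(intervals, [1, 2]))                          # True, so VCdim >= 2
print((1, 0, 1) in restriction(intervals, [1, 2, 3]))       # False: no interval skips the middle point
print(vc_dim_lower_bound(intervals, [0.5, 1.5, 2.5, 3.5]))  # 2
```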

The Fundamental Theorem of Statistical Learning

Let H be a hypothesis class of functions from a domain X to {0,1} and let the loss function be the 0-1 loss. Then the following are equivalent:

  1. H has the uniform convergence property.
  2. Any ERM rule is a successful agnostic PAC learner for H.
  3. H is agnostic PAC learnable.
  4. H is PAC learnable.
  5. Any ERM rule is a successful PAC learner for H.
  6. H has a finite VC-dimension.

Sauer’s Lemma and the Growth Function

We have defined the notion of shattering by considering the restriction of $H$ to a finite set of instances. The growth function measures the maximal “effective” size of $H$ on a set of $m$ examples. Formally:

Let $H$ be a hypothesis class. Then the growth function of $H$, denoted $\tau_H : \mathbb{N} \to \mathbb{N}$, is defined as:

$$\tau_H(m) = \max_{C \subset X : |C| = m} |H_C|$$

In other words, $\tau_H(m)$ is the maximal number of different functions from a set $C$ of size $m$ to $\{0, 1\}$ that can be obtained by restricting $H$ to $C$. With this definition we can now state Sauer's lemma:

Let $H$ be a hypothesis class with $\mathrm{VCdim}(H) \leq d < \infty$. Then, for all $m$,

$$\tau_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}$$

In particular, if $m > d + 1$, then $\tau_H(m) \leq (em/d)^d$.
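As a numeric illustration (not a proof), the sketch below reuses the `restriction` helper and the `intervals` stand-in from the snippets above to lower-bound the growth function over a small pool of points, and compares it with the binomial sum and, where it applies, the $(em/d)^d$ bound, using $d = \mathrm{VCdim} = 2$ for intervals.

```python
from itertools import combinations
from math import comb, e

def growth_lower_bound(hypotheses, domain_points, m):
    """max |H_C| over subsets C of the pool with |C| = m (a lower bound on tau_H(m))."""
    return max(len(restriction(hypotheses, C))
               for C in combinations(domain_points, m))

def sauer_bound(m, d):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

d = 2  # VCdim of the interval class
points = [0.5, 1.5, 2.5, 3.5]
for m in range(1, 5):
    print(m, growth_lower_bound(intervals, points, m), sauer_bound(m, d),
          round((e * m / d) ** d, 1) if m > d + 1 else None)
# For this pool the observed growth is quadratic (m(m+1)/2 + 1 labelings), well below 2^m,
# and for m > d + 1 it also stays under the (em/d)^d bound.
```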

Bibliography

Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.