VC-Dimension
- tags
- Bias-Complexity Tradeoff, PAC Learning
Shalev-Shwartz and Ben-David, 2014
What makes one class learnable and another unlearnable? In the setting of
binary classification with the zero-one loss, learnability is characterized
by a combinatorial notion called the Vapnik-Chervonenkis dimension
(VC-dimension).
Infinite-size classes can be learnable
To see that this is true, we give an example.
Let $\mathcal{H}$ be the set of threshold functions over the real
line, namely, $\mathcal{H} = \{h_a : a \in \mathbb{R}\}$,
where $h_a : \mathbb{R} \to \{0, 1\}$ is the
function $h_a(x) = \mathbb{1}_{[x < a]}$. Clearly $\mathcal{H}$ is of
infinite size. However, we can easily show that $\mathcal{H}$ is PAC
learnable, with sample complexity:
$$m_{\mathcal{H}}(\epsilon, \delta) \leq \left\lceil \frac{\log(2/\delta)}{\epsilon} \right\rceil$$
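As a sanity check, here is a minimal, hypothetical sketch of an ERM learner for this class (the helper `erm_threshold` and the toy data are illustrative, not from the text): any threshold that sits above every positively labeled point and at or below every negatively labeled point has zero empirical risk.

```python
import numpy as np

def erm_threshold(xs, ys):
    """ERM for the threshold class h_a(x) = 1[x < a] on a realizable sample:
    return a threshold with zero empirical risk."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=int)
    lo = xs[ys == 1].max() if (ys == 1).any() else -np.inf  # largest point labeled 1
    hi = xs[ys == 0].min() if (ys == 0).any() else np.inf   # smallest point labeled 0
    assert lo < hi, "sample is not realizable by a threshold"
    if np.isinf(lo) and np.isinf(hi):
        return 0.0          # empty sample: any threshold is consistent
    if np.isinf(lo):
        return hi           # all labels 0: any a <= min of the points works
    if np.isinf(hi):
        return lo + 1.0     # all labels 1: any a > max of the points works
    return (lo + hi) / 2.0  # midpoint of the feasible interval

# toy usage: data labeled by an (assumed) true threshold a* = 0.3
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=50)
ys = (xs < 0.3).astype(int)
print(erm_threshold(xs, ys))  # close to 0.3 once the sample is moderately large
```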
The VC-dimension
Hence, while finiteness of $\mathcal{H}$ is a sufficient condition for
PAC learnability, it is not a necessary condition. Here we show that
the VC-dimension of a hypothesis class gives the correct
characterization of its learnability.
Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to
$\{0, 1\}$, and let $C = \{c_1, \dots, c_m\} \subset \mathcal{X}$. The
restriction of $\mathcal{H}$ to $C$ is the set of functions from
$C$ to $\{0, 1\}$ that can be derived from $\mathcal{H}$. That is:
$$\mathcal{H}_C = \{(h(c_1), \dots, h(c_m)) : h \in \mathcal{H}\},$$
where we represent each function from $C$ to $\{0, 1\}$ as a vector in
$\{0, 1\}^{|C|}$.
A hypothesis class $\mathcal{H}$ shatters a finite set $C \subset \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of
all functions from $C$ to $\{0, 1\}$. That is, $|\mathcal{H}_C| = 2^{|C|}$.
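To make the restriction and shattering definitions concrete, here is a small brute-force sketch in Python; it only works for a finite pool of hypotheses standing in for $\mathcal{H}$, and the helper names (`restriction`, `shatters`) are mine, not standard:

```python
from itertools import product

def restriction(hypotheses, C):
    """H_C: the set of distinct label vectors (h(c_1), ..., h(c_m))
    induced on the points of C by the hypotheses in the pool."""
    return {tuple(h(c) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    """The pool shatters C iff its restriction to C contains all 2^{|C|} label vectors."""
    return restriction(hypotheses, C) == set(product([0, 1], repeat=len(C)))
```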
Whenever some set $C$ is shattered by $\mathcal{H}$, the adversary is
not restricted by $\mathcal{H}$, as they can construct a distribution
over $C$ based on any target function from $C$ to $\{0, 1\}$, while
still maintaining the realizability assumption.
This leads us to the definition of VC-dimension:
The VC-dimension of a hypothesis class $\mathcal{H}$, denoted
$\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C \subset \mathcal{X}$ that can be shattered by $\mathcal{H}$. If
$\mathcal{H}$ can shatter sets of arbitrarily large size, then $\mathcal{H}$ has infinite VC-dimension.
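Continuing the sketch above (and reusing the `shatters` helper), a brute-force search over subsets of a finite pool of points gives a lower bound on the VC-dimension; it cannot certify an exact value for an infinite domain or class:

```python
from itertools import combinations

def vcdim_lower_bound(hypotheses, points):
    """Size of the largest subset of `points` shattered by the hypothesis pool.
    This is only a lower bound on VCdim(H), since both pools are finite samples."""
    best = 0
    for k in range(1, len(points) + 1):
        if any(shatters(hypotheses, C) for C in combinations(points, k)):
            best = k
        else:
            break  # every subset of a shattered set is shattered, so we can stop here
    return best
```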
Examples
Threshold Functions
Let $\mathcal{H}$ be the class of threshold functions over
$\mathbb{R}$. We have shown that for a set $C = \{c_1\}$,
$\mathcal{H}$ shatters $C$. However, for an
arbitrary set $C = \{c_1, c_2\}$ where $c_1 \leq c_2$, the labeling
$(0, 1)$ cannot be obtained (any threshold that labels $c_2$ with $1$ must also label $c_1$ with $1$), so $\mathcal{H}$
does not shatter $C$. Hence $\mathrm{VCdim}(\mathcal{H}) = 1$.
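Reusing the `shatters` helper from the sketch above, a finite grid of thresholds is enough to illustrate both claims (the grid is an arbitrary stand-in for the infinite class):

```python
# h_a(x) = 1[x < a] for thresholds a on a grid in [-3, 3]
thresholds = [lambda x, a=a: int(x < a) for a in (i / 10 for i in range(-30, 31))]

print(shatters(thresholds, [0.0]))       # True: both labelings of a single point occur
print(shatters(thresholds, [0.0, 1.0]))  # False: (0, 1) would need a <= 0 and 1 < a
```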
Intervals
Let $\mathcal{H} = \{h_{a,b} : a < b\}$ be the class of intervals over
$\mathbb{R}$, where $h_{a,b}(x) = \mathbb{1}_{[a < x < b]}$. Take
$C = \{1, 2\}$, and we see that $\mathcal{H}$ shatters $C$. Hence
$\mathrm{VCdim}(\mathcal{H}) \geq 2$. However, take an arbitrary set
$C = \{c_1, c_2, c_3\}$ where $c_1 < c_2 < c_3$. Then the labelling
$(1, 0, 1)$ cannot be obtained by an interval. Therefore,
$\mathrm{VCdim}(\mathcal{H}) = 2$.
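The same brute-force check (again reusing `shatters`, with a finite grid standing in for the class) illustrates the intervals example:

```python
# h_{a,b}(x) = 1[a < x < b] for interval endpoints on a grid in [-3, 3]
grid = [i / 2 for i in range(-6, 7)]
intervals = [lambda x, a=a, b=b: int(a < x < b) for a in grid for b in grid if a < b]

print(shatters(intervals, [1.0, 2.0]))       # True: all four labelings occur, so VCdim >= 2
print(shatters(intervals, [0.0, 1.0, 2.0]))  # False: (1, 0, 1) would need two disjoint pieces
```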
The Fundamental Theorem of Statistical Learning
Let $\mathcal{H}$ be a hypothesis class of functions from a domain
$\mathcal{X}$ to $\{0, 1\}$ and let the loss function be the 0-1 loss.
Then the following are equivalent:
- $\mathcal{H}$ has the uniform convergence property.
- Any ERM rule is a successful agnostic PAC learner for $\mathcal{H}$.
- $\mathcal{H}$ is agnostic PAC learnable.
- $\mathcal{H}$ is PAC learnable.
- Any ERM rule is a successful PAC learner for $\mathcal{H}$.
- $\mathcal{H}$ has a finite VC-dimension.
Sauer’s Lemma and the Growth Function
We have defined the notion of shattering by considering the
restriction of $\mathcal{H}$ to a finite set of instances. The growth
function measures the maximal “effective” size of $\mathcal{H}$ on a
set of $m$ examples. Formally:
Let $\mathcal{H}$ be a hypothesis class. Then the growth function of
$\mathcal{H}$, denoted $\tau_{\mathcal{H}} : \mathbb{N} \to \mathbb{N}$, is defined as:
$$\tau_{\mathcal{H}}(m) = \max_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|.$$
In other words, $\tau_{\mathcal{H}}(m)$ is the number of different functions from a
set $C$ of size $m$ to $\{0, 1\}$ that can be obtained by restricting
$\mathcal{H}$ to $C$. With this definition we can now state Sauer’s
lemma:
Let $\mathcal{H}$ be a hypothesis class with
$\mathrm{VCdim}(\mathcal{H}) \leq d < \infty$. Then for all $m$,
$$\tau_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}.$$
In particular, if $m > d + 1$, then $\tau_{\mathcal{H}}(m) \leq (em/d)^d$.
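A quick numeric check of the bound (a sketch; `sauer_bound` is just the right-hand side of the lemma, not a library function):

```python
from math import comb, e

def sauer_bound(m, d):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{d} C(m, i), valid when VCdim(H) <= d."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3
for m in (2, 5, 10, 50):
    poly = (e * m / d) ** d if m > d + 1 else None  # polynomial form, only for m > d + 1
    print(m, 2 ** m, sauer_bound(m, d), poly)
# for m <= d the bound equals 2^m; beyond that it grows only polynomially, like O(m^d)
```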
<biblio.bib>