Lottery Ticket Hypothesis
The Lottery Ticket hypothesis states that:
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that – when trained in isolation – it can match the test accuracy of the original network after training for at most the same number of iterations.
Traditional literature has shown that a fully trained dense network can be pruned to a small fraction of its parameters without much performance degradation, but it has proven difficult to train such a sparse sub-network from scratch. The paper provides one explanation for this phenomenon:
After pruning, prior work re-initializes the resulting sub-network randomly, which changes its training dynamics. If one instead initializes the sub-network's weights to the original weights under the pruning mask, performance can be maintained.
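A minimal PyTorch sketch of this contrast is below; the layer, the mask construction, and all names are illustrative assumptions, not details from the paper:

```python
import torch

# Illustrative layer and pruning mask (not from the paper).
torch.manual_seed(0)
layer = torch.nn.Linear(256, 128)
W_0 = layer.weight.detach().clone()              # original dense initialization
mask = (W_0.abs() > W_0.abs().median()).float()  # keep the larger-magnitude half

# Prior work: randomly re-initialize the pruned sub-network before retraining.
random_reinit = torch.empty_like(W_0)
torch.nn.init.kaiming_uniform_(random_reinit)
random_subnet = random_reinit * mask

# Lottery ticket: reset the surviving weights to their original values.
winning_ticket = W_0 * mask
```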
How do we find such a sub-network?
Iterative Magnitude Pruning (IMP)
- Start with dense initialization \(W_0\), and train to convergence to obtain \(W_{T_*}^{(1)}\)
- Determine the \(s\%\) of surviving weights with the smallest magnitudes and remove them, producing a binary mask \(m^{(1)}\)
- Re-initialize the network with the masked original weights \(m^{(1)} \odot W_0\), and repeat the train-and-prune process
- Halt when the desired sparsity is reached, or when performance drops significantly
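A minimal sketch of this loop in PyTorch is shown below. The `iterative_magnitude_pruning` and `train_fn` names, the layer-wise pruning criterion, and the fixed number of rounds are assumptions for illustration: `train_fn(model, masks)` is assumed to train the model to convergence while keeping masked-out weights at zero, and the accuracy-based halting criterion from the last step is replaced by a fixed round count. The paper also considers global rather than per-layer pruning.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, prune_fraction=0.2, rounds=5):
    """Sketch of IMP. `train_fn(model, masks)` is assumed to train the model
    to convergence while keeping masked-out weights at zero."""
    # Save the original dense initialization W_0 so we can rewind to it.
    init_state = copy.deepcopy(model.state_dict())

    # One binary mask per weight matrix (biases are left unpruned here).
    masks = {name: torch.ones_like(p)
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model, masks)  # train m . W_0 to convergence

        # Prune the smallest-magnitude weights that are still alive (layer-wise).
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p.detach().abs()[masks[name].bool()]
            threshold = torch.quantile(alive, prune_fraction)
            masks[name] = masks[name] * (p.detach().abs() > threshold).float()

        # Rewind the surviving weights to their original initialization m . W_0.
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

    return model, masks
```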
In its original form, IMP was restricted to small-scale tasks; more robust pruning procedures were required for larger models (e.g. tuned learning-rate/warmup schedules, LTH with rewinding).
Resources
Robert Lange provides a great survey and visualization of LTH in his blog post.