Lottery Ticket Hypothesis

The Lottery Ticket hypothesis states that:

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that – when trained in isolation – it can match the test accuracy of the original network after training for at most the same number of iterations.

Traditional literature has shown that a fully trained dense network can be pruned to fewer parameters without much performance degradation, but it has impossible to train such a sparse sub-network from scratch. The paper provides one explanation to this phenomenon:

After pruning, the resultant sub-networks are randomly initialized, resulting in different training dynamics. If one instead initializes the weights of the sub-network to the original masked weights, performance can be maintained.

How do we find such a sub-network?

Iterative Magnitude Pruning (IMP)

1. Start with dense initialization $$W_0$$, and train to convergence to obtain $$W_{T_*}^{(1)}$$
2. Determine the $$s$$ percent smallest magnitude weights to create a binary mask $$m^{(1)}$$
3. Initialize the network with weights $$m^{(1)} \cdot W_0$$, and repeat the pruning process
4. Halt when the desired sparsity is reached, or when performance drops significantly

IMP was only restricted to small-scale tasks. A more robust pruning method was required for larger models (e.g. tuning learning rate schedules, LTH with rewinding).

Resources

Robert Lange provides a great survey and visualization on LTH in his blog post.