
Making Transformer Models Efficient

The standard Transformer's self-attention has memory and computational complexity that is quadratic in the input sequence length (\(O(N^2)\)). This limits the utility of Transformer models, since their main benefit is the ability to learn alignments across long sequences.
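To make the quadratic cost concrete, here is a minimal NumPy sketch of plain scaled dot-product attention (the function name and shapes are illustrative, not any particular library's API). The score matrix \(QK^\top\) has shape \(N \times N\), so both memory and compute grow quadratically with the sequence length.

```python
import numpy as np

def full_attention(Q, K, V):
    """Plain scaled dot-product attention (illustrative sketch, not a library API).

    Q, K, V: arrays of shape (N, d). The score matrix Q @ K.T has shape
    (N, N), which is where the O(N^2) memory and compute cost comes from.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (N, N) -- quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # (N, d)

# Doubling N quadruples the size of the intermediate score matrix:
N, d = 1024, 64
Q = K = V = np.random.randn(N, d)
out = full_attention(Q, K, V)
print(out.shape)  # (1024, 64), but a (1024, 1024) score matrix was materialised
```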

Efficient Transformer models attempt to alleviate the cost of computing the attention matrix, either by approximating the matrix or by introducing sparsity. Tay et al. (2020) provide a good overview of these efficient Transformer models; the key summary table from the paper is reproduced below.

Figure 1: Summary of Efficient Transformer Models (from Tay et al., 2020)
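As one illustration of the sparsity approach mentioned above, the sketch below restricts each query to a fixed local window of keys, the pattern used by sliding-window models such as Longformer. This is a naive NumPy toy for clarity, not the implementation of any specific model; with window size \(w\), the work drops from \(O(N^2)\) to roughly \(O(Nw)\).

```python
import numpy as np

def local_window_attention(Q, K, V, w=64):
    """Toy sliding-window (local) attention, one common sparsity pattern.

    Each query attends only to keys within +/- w positions, so the cost is
    O(N * w) instead of O(N^2). Written as a plain loop for readability,
    not as an efficient implementation of any particular model.
    """
    N, d = Q.shape
    out = np.zeros_like(V)
    for i in range(N):
        lo, hi = max(0, i - w), min(N, i + w + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)  # at most (2w + 1) scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]
    return out

N, d = 1024, 64
Q = K = V = np.random.randn(N, d)
print(local_window_attention(Q, K, V).shape)  # (1024, 64)
```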

Bibliography

Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. "Efficient Transformers: A Survey." arXiv:2009.06732.