Math Problemsolving with Machine Learning
Papers
Measuring Mathematical Problem Solving With the MATH Dataset
This paper is primarily a dataset paper. It introduces two datasets:
 MATH dataset: A challenging dataset (12,500 problems) which contains
questions, and stepbystep solutions, with answers in boxes
 Problems are categorized into seven subjects, and classified into 5 difficulties
 Question figures encoded using Asymptote language to allow language models to learn to read/generate them.
 AMPS dataset: Large dataset for pretraining
 100,000 problems from Khan Academy exercises
 5 million problems generated from Mathematica scripts to help model learn mathematics fundamentals
Language models achieve poor accuracies on the MATH dataset (2.9% to 6.9%): it requires the models to learn advanced problemsolving techniques. Human performance is also relatively poor on the dataset (40% for nonadvanced students, 90% for IMO medallist).
Model Training
GPT2 pretrained on AMPS using the standard autoregressive language modelling objective.
Finetuned model trained to output both solution and answer. Inputs are equal mix of:

Final Answer:

Full Solution:
This allows the model to output both answer and solution based on prompt tuning. NOTE: Seems similar to T5’s texttotext format.
Evaluation is performed by computing the probability that a correct answer has higher confidence than an incorrect answer (AUROC). Large models are more overconfident.
Key Findings
 Accuracy increases slowly as model size increases: suggests the need for algorithmic improvements
 Pretraining on AMPS results in ~25% increase in relative accuracy, suggesting that algorithmically generated questions can be a useful pretraining step
 Having model generate stepbystep solution decreases resulting accuracy
 Possibly because incorrect generation can derail the model
 Training the models with partially observed solutions is useful, but not by much: only when the model sees 99% of the solutions is able to output the answer with high accuracy.
Entailment as FewShot Learner
Key Idea
Reformulate NLP tasks into entailment tasks. Example inputs:
sentence [SEP] It was great sentence [SEP] It is World news sentence [SEP] sentence2
The authors also introduce data augmentations to increase the size of the training data, augmenting existing limited annotated data. Consider onesentence and twosentence tasks:
S1 [SEP] It is CLS
and S1 [SEP] S2
Positive samples:
 Augment S1 slightly, add
S1 [SEP] S1'
, orS1 [SEP] S2
. Similar for S2.
Negative samples:
 Change S1 drastically, add
S1 [SEP] S1'
, orS1 [SEP] S2
. Similar for S2.  Randomly sample R1, R2 from different dataset, and add
R1 [SEP] R2
.