# Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data

## Grammatical Error Correction

Unlike translation, GEC only changes several words of the source sentence – roughly 80% can be copied from the source sentence.

## Key Ideas

- Leverage unlabelled data by training denoising autoencoders.
- Add 2 multi-task for copy-augmented architecture:
- token labeling task
- sentence-level copying task

The copying mechanism enables training a model with a small vocabulary, as it can copy unchanged and out-of-vocabulary words from the source input tokens.

## Architecture

Transformer encodes source sentence with stack of L identical blocks, each of them applies multi-head self-attention over the source tokens followed by a position-wise feedforward layer to produce its context aware hidden state. The decoder is the same architecture as the encoder, with an additional attention layer over the encoder’s hidden states.

\begin{equation} h_{1 \ldots N}^{s r c}=\text {encoder }\left(L^{s r c} x_{1 \dots N}\right) \end{equation}

\begin{equation} h_{t}=\operatorname{decoder}\left(L^{\text{train}} y_{t-1 \ldots 1}, h_{1 \ldots N}^{s r c}\right) \end{equation}

\begin{equation} P_{t}(w)=\operatorname{softmax}\left(L^{\text{train}} h_{t}\right) \end{equation}