Interest in text-generating models has been rekindled in the past year, in large part due to GPT2, which primarily demonstrates the effectiveness of using the Transformer architecture with bigger models, bigger data, and bigger compute. Notably, this model achieved SOTA results on several language modelling datasets without even training on those datasets, showing its impressive generalization capabilities. Following GPT2, several other entities have also jumped on the bandwagon and released their own large unidirectional language models, such as Grover, Nvidia’s Megatron-LM, and Salesforce’s CTRL. Setting aside the controversy surrounding OpenAI’s claims that the model is “too dangerous to release,” the text generated by GPT2 is undeniably far better than that of previous text generation models. However, these models also exhibit some flaws that may not be fixable purely using the bigger-model paradigm. In this post, we take a quick look at some of these flaws and the attempts to solve them, and discuss some potential directions for future research.

## What is an autoregressive language model and why does it matter?

The core problem of language modelling is approximating the distribution of natural language sequences occurring in English (or Lojban, Navajo, Python, etc.) using a parameterized function. To make modelling more manageable, the autoregressive language model formulation factors the ideal language model $p^*(x)$ into a product of per-token conditional distributions:

$\begin{aligned} p^*(x) \approx \prod_{i=1}^{n} \hat{p}_\theta(x_i \mid x_{<i}) \end{aligned}$

In other words, to make the modelling problem more tractable, we instead train the parameterized function $\hat{p}_\theta$ to predict the next token conditioned on the previous tokens. To generate text, we repeat this prediction step, using the newly generated tokens, appended to the original context, as the new context. We can then obtain an estimate for the likelihood of any given sequence by taking the product across these conditional probabilities.
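The factorization can be made concrete with a very small sketch. Everything here is illustrative rather than taken from any real model: `BIGRAM` is a hypothetical hand-written table standing in for $\hat{p}_\theta$, and it conditions only on the last token, whereas a real autoregressive model conditions on the whole prefix.

```python
import math
import random

# Hypothetical toy stand-in for p_theta: next-token probabilities that depend
# only on the last token of the prefix (a real model would use the full prefix).
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.7, "</s>": 0.2},
    "b": {"a": 0.3, "b": 0.2, "</s>": 0.5},
}


def next_token_probs(prefix):
    """Return the conditional distribution over the next token given the prefix."""
    return BIGRAM[prefix[-1]]


def sequence_log_prob(tokens):
    """Sum the log conditional probabilities, i.e. the log of the product."""
    prefix = ["<s>"]
    log_prob = 0.0
    for tok in tokens:
        log_prob += math.log(next_token_probs(prefix)[tok])
        prefix.append(tok)
    return log_prob


def generate(rng, max_len=20):
    """Sample one token at a time, feeding each sample back in as context."""
    prefix = ["<s>"]
    while len(prefix) - 1 < max_len:
        probs = next_token_probs(prefix)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        prefix.append(tok)
        if tok == "</s>":
            break
    return prefix[1:]


if __name__ == "__main__":
    # 0.6 * 0.7 * 0.5 = 0.21 under the toy table above.
    print(math.exp(sequence_log_prob(["a", "b", "</s>"])))
    print(generate(random.Random(0)))
```

Working in log space is a design choice to avoid numerical underflow when multiplying many small probabilities over a long sequence.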

Many problems—including classification and translation—can be equivalently formulated as autoregressive problems or would benefit significantly from a strong pretrained language model. Improving language modelling would also potentially be a major step towards solving the general AI problem.

## Beam search and repetition

[…] using likelihood as a decoding objective leads to text that is bland and strangely repetitive. —Holtzman et al. 2019

In the GPT2 samples provided, the authors decided to sample with top-k filtering and temperature rather than with beam search, which would be expected to return much higher-quality samples by maximizing likelihood. It was rather surprising, then, when “The Curious Case of Neural Text Degeneration” (Holtzman et al. 2019) showed that GPT2 samples with higher predicted likelihood (i.e., those found via beam search) actually have much lower quality, tending to be extremely repetitive. The authors argue that this modelling problem is due to maximum-likelihood being a fundamentally incorrect sampling objective, and propose nucleus sampling, a sampling method that truncates low-likelihood token predictions (which can lead the model to a “downward spiral”), similar to top-k, while preserving “broad” (tail-heavy) distributions. It could be argued, however, that since the maximum-likelihood sequence under the ideal language model, $\arg\max_{x} p^*(x)$, would, by definition, be the most likely English text, it would *already take into account* the unlikelihood of extremely bland and repetitive text in English! Thus, the fault lies with the training objective, not the sampling objective.
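The difference between top-k and nucleus truncation is easiest to see on a single next-token distribution. The following is a minimal sketch, not the paper's implementation: the function names, the example vocabularies, and the thresholds `k=2` and `p=0.9` are all hypothetical choices for illustration.

```python
def _renormalize(probs, kept):
    """Rescale the kept tokens so their probabilities sum to 1."""
    total = sum(probs[tok] for tok in kept)
    return {tok: probs[tok] / total for tok in kept}


def top_k_filter(probs, k):
    """Keep a fixed number of tokens: the k most probable ones."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    return _renormalize(probs, kept)


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.append(tok)
        cumulative += probs[tok]
        if cumulative >= p:
            break
    return _renormalize(probs, kept)


# A peaked distribution (one obvious continuation) and a broad one
# (many equally plausible continuations). Both are made up for illustration.
peaked = {"the": 0.92, "a": 0.05, "cat": 0.02, "dog": 0.01}
broad = {f"w{i}": 0.125 for i in range(8)}

if __name__ == "__main__":
    print(len(top_k_filter(peaked, 2)), len(top_k_filter(broad, 2)))
    print(len(nucleus_filter(peaked, 0.9)), len(nucleus_filter(broad, 0.9)))
```

With these inputs, top-k keeps two tokens in both cases, while the nucleus keeps a single token for the peaked distribution and all eight for the broad one, which is the sense in which it preserves broad distributions while still cutting off the low-probability tail.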

Another tempting solution is simply to penalize repetition. In fact, shortly following the publication of the Neural Text Degeneration paper, I independently implemented my own GPT2 beam search sampler; after reproducing the text degeneration issues, I added a simple, arbitrary decoding-time penalty for repeated ngrams, with acceptable results at first glance but little theoretical justification.^{[1]} More recently, “Neural Text ~~De~~Generation with Unlikelihood Training” (Welleck, Kulikov et al. 2019) has proposed using a more complex training-time penalization scheme that involves adding a term $-k \sum_{c \in \mathcal{C}^t} \log(1 - p_\theta(c \mid x_{<t}))$ to the training objective, where $\mathcal{C}^t$ is a set of previously used tokens.^{[2]} While this approach is empirically successful, there is no good theoretical reason why less repetition would better model the underlying distribution.
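To see what the unlikelihood term rewards, it helps to evaluate it by hand for a single step. This is a sketch under stated assumptions, not the authors' code: `unlikelihood_term` and `previous_tokens` are hypothetical helper names, the probabilities are invented, and excluding the current target token from $\mathcal{C}^t$ is my reading of the paper's token-level variant.

```python
import math


def previous_tokens(context, target):
    """Candidate set C^t: tokens already used, minus the current target
    (assumption: the true next token should not be penalized)."""
    return set(context) - {target}


def unlikelihood_term(probs, candidates, k=1.0):
    """Compute -k * sum over c in C^t of log(1 - p(c | x_<t)).

    The term is non-negative and grows as the model assigns more
    probability to the candidate (previously used) tokens.
    """
    return -k * sum(math.log(1.0 - probs[c]) for c in candidates)


context, target = ["the", "cat"], "sat"
candidates = previous_tokens(context, target)

# Two made-up next-token distributions over the same vocabulary.
repetitive = {"the": 0.5, "cat": 0.3, "sat": 0.2}  # lots of mass on repeats
focused = {"the": 0.1, "cat": 0.1, "sat": 0.8}     # little mass on repeats

if __name__ == "__main__":
    print(unlikelihood_term(repetitive, candidates))
    print(unlikelihood_term(focused, candidates))
```

The repetitive distribution incurs the larger penalty, so minimizing this term alongside the usual likelihood loss pushes probability mass away from tokens that have already appeared.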

## Exposure Bias

[…] the text will usually fall off a quality cliff after a certain point, suddenly becoming strikingly ungrammatical and typo-ridden and full of anomalous paragraph breaks. —nostalgebraist

One major problem with maximum-likelihood training of autoregressive models is exposure bias (Ranzato et al., 2015). During training, autoregressive models only ever see contexts drawn from the target language distribution, but at generation time they are fed contexts that they themselves generated. Any error pushes the context away from what the model saw in training, which makes further errors more likely. This error compounds extremely quickly, and it has been observed, though admittedly anecdotally, that GPT2 exhibits a sharp drop-off in quality after a certain number of steps.
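A back-of-the-envelope calculation shows how quickly small per-step errors can add up. This is a deliberately crude toy model and not a claim about GPT2: it assumes each generated token independently knocks the context off the training distribution with a fixed, hypothetical probability `eps`, and that there is no recovery once that happens.

```python
def on_distribution_prob(eps, n):
    """Probability that all n generated tokens stay on-distribution,
    assuming independent per-step failure probability eps (toy assumption)."""
    return (1.0 - eps) ** n


if __name__ == "__main__":
    # eps = 0.01 is an arbitrary illustrative value, not a measurement.
    for n in (10, 100, 500):
        print(n, round(on_distribution_prob(0.01, n), 4))
```

Even with a 1% per-token failure rate, the chance of a fully on-distribution sequence falls to roughly 37% by 100 tokens and below 1% by 500 tokens under these assumptions. Real models violate the independence assumption in both directions, so this only gives a feel for the geometric decay.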

## Future Work

The exposure bias problem bears a striking resemblance to many problems in reinforcement learning. Indeed, existing works such as “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient” (Yu et al., 2016), “Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation” (Tuan et al., 2018), and “Toward Diverse Text Generation with Inverse Reinforcement Learning” (Shi et al., 2018) use RL for various components of the training pipeline, from propagating the Generator gradient in a GAN setting to using Inverse Reinforcement Learning (which is itself deeply connected to GANs). This is not intended to be an exhaustive list by any means.

There is still a long way to go before these reinforcement-learning-based options become practical for models as large as GPT2. An intermediate step is to take existing pretrained language models and fine-tune them in an RL environment. Additionally, an evaluation metric that can quantify exposure bias well would be important for proper quantitative analysis. One promising paper in this direction is “Jointly Measuring Diversity and Quality in Text Generation Models” (Montahaei et al., 2019).

## Conclusion

While recent work has demonstrated an immense improvement in the quality of neural text generation due to the increase in model sizes, the problem of exposure bias still persists for long sequences of generated tokens. Progress in this area will likely require drawing on work in Reinforcement Learning; indeed, many promising works at the intersection of Reinforcement Learning and Language Modelling have already emerged. Hopefully, these improved language models will be competitive with human text not only at the scale of single paragraphs, but potentially entire articles.

[1] An unmodified sample from the 117M model, generated with a beam-width of 8, top-k of 2, repetition penalty, and conditioned on the unicorn prompt is available here.

[2] The authors also add an additional “sequence-level” objective that generates sequences from the model and uses repeating ngrams from those sequences to populate $\mathcal{C}^t$. While this does help a bit with exposure bias, the training objective still aims to reduce repetition explicitly.