BERT: Pre-training of Deep Bidirectional Transformers for  
Language Understanding  
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova  
Google AI Language  
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract
We introduce a new language representa-  
tion model called BERT, which stands for  
Bidirectional Encoder Representations from  
Transformers. Unlike recent language repre-  
sentation models (Peters et al., 2018; Radford  
et al., 2018), BERT is designed to pre-train  
deep bidirectional representations by jointly  
conditioning on both left and right context in  
all layers. As a result, the pre-trained BERT  
representations can be fine-tuned with just one  
additional output layer to create state-of-the-  
art models for a wide range of tasks, such  
as question answering and language inference,  
without substantial task-specific architecture  
modifications.  
There are two existing strategies for apply-  
ing pre-trained language representations to down-  
stream tasks: feature-based and fine-tuning. The  
feature-based approach, such as ELMo (Peters  
et al., 2018), uses task-specific architectures that
include the pre-trained representations as addi-  
tional features. The fine-tuning approach, such as  
the Generative Pre-trained Transformer (OpenAI  
GPT) (Radford et al., 2018), introduces minimal  
task-specific parameters, and is trained on the  
downstream tasks by simply fine-tuning the pre-  
trained parameters. In previous work, both ap-  
proaches share the same objective function dur-  
ing pre-training, where they use unidirectional lan-  
guage models to learn general language represen-  
tations.  
We argue that current techniques severely re-  
strict the power of the pre-trained representations,  
especially for the fine-tuning approaches. The ma-  
jor limitation is that standard language models are  
unidirectional, and this limits the choice of archi-  
tectures that can be used during pre-training. For  
example, in OpenAI GPT, the authors use a left-  
to-right architecture, where every token can only
attend to previous tokens in the self-attention
layers of the Transformer (Vaswani et al., 2017).  
Such restrictions are sub-optimal for sentence-  
level tasks, and could be devastating when ap-  
plying fine-tuning based approaches to token-level  
tasks such as SQuAD question answering (Ra-  
jpurkar et al., 2016), where it is crucial to incor-  
porate context from both directions.  
BERT is conceptually simple and empirically  
powerful. It obtains new state-of-the-art re-  
sults on eleven natural language processing  
tasks, including pushing the GLUE bench-  
mark to 80.4% (7.6% absolute improvement),  
MultiNLI accuracy to 86.7% (5.6% abso-  
lute improvement) and the SQuAD v1.1 ques-  
tion answering Test F1 to 93.2 (1.5 absolute  
improvement), outperforming human perfor-  
mance by 2.0.  
1
Introduction  
Language model pre-training has been shown to be ef-
fective for improving many natural language pro-  
cessing tasks (Dai and Le, 2015; Peters et al.,  
2017, 2018; Radford et al., 2018; Howard and
Ruder, 2018). These tasks include sentence-level  
tasks such as natural language inference (Bow-  
man et al., 2015; Williams et al., 2018) and para-  
phrasing (Dolan and Brockett, 2005), which aim  
to predict the relationships between sentences by  
analyzing them holistically, as well as token-level  
tasks such as named entity recognition (Tjong  
Kim Sang and De Meulder, 2003) and SQuAD  
question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token level.
In this paper, we improve the fine-tuning based  
approaches by proposing BERT: Bidirectional  
Encoder Representations from Transformers.  
BERT addresses the previously mentioned uni-  
directional constraints by proposing a new  
pre-training objective: the “masked language  
model” (MLM), inspired by the Cloze task (Tay-  
lor, 1953). The masked language model randomly  
masks some of the tokens from the input, and the  
objective is to predict the original vocabulary id of  
the masked word based only on its context. Un-  
like left-to-right language model pre-training, the  
MLM objective allows the representation to fuse  
the left and the right context, which allows us  
to pre-train a deep bidirectional Transformer. In  
addition to the masked language model, we also  
introduce a “next sentence prediction” task that  
jointly pre-trains text-pair representations.  
The contributions of our paper are as follows:  
• We demonstrate the importance of bidirec-
tional pre-training for language representa-  
tions. Unlike Radford et al. (2018), which  
uses unidirectional language models for pre-  
training, BERT uses masked language mod-  
els to enable pre-trained deep bidirectional  
representations. This is also in contrast to  
Peters et al. (2018), which uses a shallow  
concatenation of independently trained left-  
to-right and right-to-left LMs.  
• We show that pre-trained representations
eliminate the need for many heavily-
engineered task-specific architectures. BERT  
is the first fine-tuning based representation  
model that achieves state-of-the-art perfor-  
mance on a large suite of sentence-level and  
token-level tasks, outperforming many sys-  
tems with task-specific architectures.  
• BERT advances the state-of-the-art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained models will be made publicly available.1

2 Related Work

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.

2.1 Feature-based Approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension: it extracts context-sensitive features from a language model. When contextual word embeddings are integrated with existing task-specific architectures, ELMo advances the state of the art on several major NLP benchmarks (Peters et al., 2018), including question answering on SQuAD (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).

2.2 Fine-tuning Approaches

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on an LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).

2.3 Transfer Learning from Supervised Data

While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).

1Will be released before the end of October 2018.
Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.
3 BERT

We introduce BERT and its detailed implementation in this section. We first cover the model architecture and the input representation for BERT. We then introduce the pre-training tasks, the core innovation in this paper, in Section 3.3. The pre-training procedures and fine-tuning procedures are detailed in Section 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.

3.1 Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.2 Because the use of Transformers has become ubiquitous recently and our implementation is effectively identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”3

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for H = 768 and 4096 for H = 1024. We primarily report results on two model sizes:

• BERTBASE: L=12, H=768, A=12, Total Parameters=110M
• BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M

BERTBASE was chosen to have an identical model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation. The comparisons between BERT, OpenAI GPT and ELMo are shown visually in Figure 1.

3.2 Input Representation

Our input representation is able to unambiguously represent both a single text sentence or a pair of text sentences (e.g., [Question, Answer]) in one token sequence.4 For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings. A visual representation of our input representation is given in Figure 2. The specifics are:

• We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.
• We use learned positional embeddings with supported sequence lengths up to 512 tokens.

2https://github.com/tensorflow/tensor2tensor
3http://nlp.seas.harvard.edu/2018/04/03/attention.html
4Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
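For concreteness, the two model sizes can be written down as a small configuration object. The following Python sketch is illustrative only; the field names and defaults are our assumptions, not the released code.

from dataclasses import dataclass

@dataclass
class BertConfig:
    """Illustrative hyperparameter container for the notation above."""
    num_layers: int           # L: number of Transformer blocks
    hidden_size: int          # H: hidden size
    num_attention_heads: int  # A: number of self-attention heads
    vocab_size: int = 30000   # WordPiece vocabulary size
    max_position: int = 512   # maximum supported sequence length

    @property
    def feed_forward_size(self) -> int:
        # The feed-forward/filter size is always 4H in the paper.
        return 4 * self.hidden_size

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)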
Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segment embeddings and the position embeddings.
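As an illustration of this input representation, the following sketch sums token, segment and position embeddings for a packed sentence pair. The embedding tables, sizes and ids here are made up for the example; only the summation itself reflects the description above.

import numpy as np

# Illustrative sizes; BERT-Base uses a 30,000-token vocabulary and H = 768.
vocab_size, H, max_len = 16, 8, 512
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(vocab_size, H))   # learned token embeddings
segment_emb = rng.normal(size=(2, H))          # sentence A / sentence B embeddings
position_emb = rng.normal(size=(max_len, H))   # learned position embeddings

def input_representation(token_ids, segment_ids):
    """Sum token, segment and position embeddings for one packed sequence."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Toy ids standing in for "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]"
token_ids = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5])
segment_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(input_representation(token_ids, segment_ids).shape)   # (11, 8)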
The first token of every sequence is al-  
ways the special classification embedding  
refer to this procedure as a “masked LM” (MLM),  
although it is often referred to as a Cloze task in  
the literature (Taylor, 1953). In this case, the fi-  
nal hidden vectors corresponding to the mask to-  
kens are fed into an output softmax over the vo-  
cabulary, as in a standard LM. In all of our exper-  
iments, we mask 15% of all WordPiece tokens in  
each sequence at random. In contrast to denoising  
auto-encoders (Vincent et al., 2008), we only pre-  
dict the masked words rather than reconstructing  
the entire input.  
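The masked LM objective can be sketched as a standard cross-entropy computed only at the masked positions. The following NumPy snippet is a simplified illustration under the notation above, not the released implementation.

import numpy as np

def masked_lm_loss(final_hidden, output_weights, masked_positions, masked_label_ids):
    """Mean negative log-likelihood over the masked positions only.

    final_hidden:     (seq_len, H) final Transformer hidden states
    output_weights:   (vocab_size, H) output softmax weights
    masked_positions: indices of the masked tokens in the sequence
    masked_label_ids: original vocabulary ids at those positions
    """
    logits = final_hidden[masked_positions] @ output_weights.T   # (n_masked, vocab)
    logits -= logits.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(masked_positions)), masked_label_ids].mean()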
Although this does allow us to obtain a bidirec-  
tional pre-trained model, there are two downsides  
to such an approach. The first is that we are cre-  
ating a mismatch between pre-training and fine-  
tuning, since the [MASK] token is never seen dur-  
ing fine-tuning. To mitigate this, we do not always  
replace “masked” words with the actual [MASK]  
token. Instead, the training data generator chooses  
([CLS]). The final hidden state (i.e., out-
put of Transformer) corresponding to this to-  
ken is used as the aggregate sequence rep-  
resentation for classification tasks. For non-  
classification tasks, this vector is ignored.  
Sentence pairs are packed together into a sin-  
gle sequence. We differentiate the sentences  
in two ways. First, we separate them with  
a special token ([SEP]). Second, we add a  
learned sentence A embedding to every token  
of the first sentence and a sentence B embed-  
ding to every token of the second sentence.  
For single-sentence inputs we only use the  
sentence A embeddings.  
3.3 Pre-training Tasks
Unlike Peters et al. (2018) and Radford et al.
(2018), we do not use traditional left-to-right or
right-to-left language models to pre-train BERT.  
Instead, we pre-train BERT using two novel unsu-  
pervised prediction tasks, described in this section.  
15% of tokens at random, e.g., in the sentence my
dog is hairy it chooses hairy. It then performs  
the following procedure:  
Rather than always replacing the chosen
words with [MASK], the data generator will  
do the following:  
3.3.1 Task #1: Masked LM
Intuitively, it is reasonable to believe that a  
deep bidirectional model is strictly more power-  
ful than either a left-to-right model or the shal-  
low concatenation of a left-to-right and right-to-  
left model. Unfortunately, standard conditional  
language models can only be trained left-to-right  
or right-to-left, since bidirectional conditioning  
would allow each word to indirectly “see itself”  
in a multi-layered context.  
In order to train a deep bidirectional representa-  
tion, we take a straightforward approach of mask-  
ing some percentage of the input tokens at random,  
and then predicting only those masked tokens. We  
• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
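The generator just described can be sketched as follows. The 15% rate and the 80%/10%/10% split come from the text above, while the handling of special tokens and the helper names are assumptions made for the example.

import random

MASK_TOKEN = "[MASK]"

def create_mlm_predictions(tokens, vocab, mask_rate=0.15, rng=random.Random(12345)):
    """Return (possibly modified tokens, list of (position, original token))."""
    output = list(tokens)
    targets = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue
        if rng.random() >= mask_rate:          # leave roughly 85% of tokens untouched
            continue
        targets.append((i, token))             # the original token is always the label
        dice = rng.random()
        if dice < 0.8:
            output[i] = MASK_TOKEN             # 80%: replace with [MASK]
        elif dice < 0.9:
            output[i] = rng.choice(vocab)      # 10%: replace with a random word
        # else: 10%: keep the word unchanged
    return output, targets

tokens = "[CLS] my dog is hairy [SEP]".split()
print(create_mlm_predictions(tokens, vocab=["apple", "cute", "the"]))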
The Transformer encoder does not know which  
words it will be asked to predict or which have  
been replaced by random words, so it is forced to  
keep a distributional contextual representation of  
every input token. Additionally, because random  
replacement only occurs for 1.5% of all tokens  
For the pre-training corpus we use the concatena-  
tion of BooksCorpus (800M words) (Zhu et al.,  
2015) and English Wikipedia (2,500M words).  
For Wikipedia we extract only the text passages  
and ignore lists, tables, and headers. It is criti-  
cal to use a document-level corpus rather than a  
shuffled sentence-level corpus such as the Billion  
Word Benchmark (Chelba et al., 2013) in order to  
extract long contiguous sequences.  
(i.e., 10% of 15%), this does not seem to harm the  
model’s language understanding capability.  
The second downside of using an MLM is that  
only 15% of tokens are predicted in each batch,  
which suggests that more pre-training steps may  
be required for the model to converge. In Sec-  
tion 5.3 we demonstrate that MLM does con-  
verge marginally slower than a left-to-right model  
To generate each training input sequence, we  
sample two spans of text from the corpus, which  
we refer to as “sentences” even though they are  
typically much longer than single sentences (but  
can be shorter also). The first sentence receives  
the A embedding and the second receives the B  
embedding. 50% of the time B is the actual next  
sentence that follows A and 50% of the time it is  
a random sentence, which is done for the “next  
sentence prediction” task. They are sampled such  
that the combined length is 512 tokens. The  
LM masking is applied after WordPiece tokeniza-  
tion with a uniform masking rate of 15%, and no  
special consideration given to partial word pieces.  
We train with batch size of 256 sequences (256  
sequences * 512 tokens = 128,000 tokens/batch)  
for 1,000,000 steps, which is approximately 40  
epochs over the 3.3 billion word corpus. We  
use Adam with learning rate of 1e-4, β1 = 0.9,  
β2 = 0.999, L2 weight decay of 0.01, learning  
rate warmup over the first 10,000 steps, and linear  
decay of the learning rate. We use a dropout prob-  
ability of 0.1 on all layers. We use a gelu acti-  
vation (Hendrycks and Gimpel, 2016) rather than  
the standard relu, following OpenAI GPT. The  
training loss is the sum of the mean masked LM  
likelihood and mean next sentence prediction like-  
lihood.  
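The learning-rate schedule described above can be written as a small helper. This sketch assumes the linear decay runs to zero at step 1,000,000, which is our reading of the text rather than a detail it states explicitly.

def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup over the first 10k steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# e.g. learning_rate(5_000) == 5e-5 and learning_rate(1_000_000) == 0.0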
(which predicts every token), but the empirical im-  
provements of the MLM model far outweigh the  
increased training cost.  
3.3.2 Task #2: Next Sentence Prediction
Many important downstream tasks such as Ques-  
tion Answering (QA) and Natural Language In-  
ference (NLI) are based on understanding the re-  
lationship between two text sentences, which is  
not directly captured by language modeling. In  
order to train a model that understands sentence  
relationships, we pre-train a binarized next sen-  
tence prediction task that can be trivially gener-  
ated from any monolingual corpus. Specifically,  
when choosing the sentences A and B for each pre-  
training example, 50% of the time B is the actual  
next sentence that follows A, and 50% of the time  
it is a random sentence from the corpus. For ex-  
ample:  
Input = [CLS] the man went to [MASK] store [SEP]  
he bought a gallon [MASK] milk [SEP]  
Label = IsNext  
Input = [CLS] the man [MASK] to the store [SEP]
penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

We choose the NotNext sentences completely at random, and the final pre-trained model achieves 97%-98% accuracy at this task. Despite its sim-

Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.5

3.5 Fine-tuning Procedure
For sequence-level classification tasks, BERT  
fine-tuning is straightforward. In order to obtain  
a fixed-dimensional pooled representation of the  
input sequence, we take the final hidden state (i.e.,  
the output of the Transformer) for the first token  
5https://cloudplatform.googleblog.com/2018/06/Cloud-  
TPU-now-offers-preemptible-pricing-and-global-  
availability.html  
plicity, we demonstrate in Section 5.1 that pre-  
training towards this task is very beneficial to both  
QA and NLI.  
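The binarized next sentence prediction data can be generated with a few lines of code. The sketch below only encodes the 50/50 IsNext/NotNext split described above; the remaining sampling details are assumptions.

import random

def make_nsp_example(documents, rng=random.Random(0)):
    """Pick (sentence A, sentence B, label) with a 50/50 IsNext/NotNext split."""
    doc = rng.choice(documents)
    i = rng.randrange(len(doc) - 1)
    sent_a = doc[i]
    if rng.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"    # actual next sentence
    else:
        other = rng.choice(documents)           # random sentence from the corpus
        sent_b, label = rng.choice(other), "NotNext"
    return sent_a, sent_b, label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(docs))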
3.4 Pre-training Procedure
The pre-training procedure largely follows the ex-  
isting literature on language model pre-training.  
in the input, which by construction corresponds to  
• GPT uses a sentence separator ([SEP]) and  
classifier token ([CLS]) which are only in-  
troduced at fine-tuning time; BERT learns  
[SEP], [CLS] and sentence A/B embed-  
dings during pre-training.  
• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

the special [CLS] word embedding. We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are for a classification layer W ∈ R^{K×H}, where K is the number of classifier labels. The label probabilities P ∈ R^K are computed with a standard softmax, P = softmax(CW^T). All of the parameters of
BERT and W are fine-tuned jointly to maximize  
the log-probability of the correct label. For span-  
level and token-level prediction tasks, the above  
procedure must be modified slightly in a task-  
specific manner. Details are given in the corre-  
sponding subsection of Section 4.  
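In code, the new classification parameters amount to a single matrix applied to the pooled vector C. A NumPy sketch of P = softmax(CW^T), purely for illustration:

import numpy as np

def label_probabilities(C, W):
    """P = softmax(C W^T) for a pooled [CLS] vector C of shape (H,) and W of shape (K, H)."""
    logits = W @ C                      # one logit per label
    logits -= logits.max()              # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

H, K = 768, 3                           # e.g. MNLI has three labels
C = np.zeros(H)
W = np.random.default_rng(0).normal(size=(K, H))
print(label_probabilities(C, W))        # sums to 1.0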
• GPT used the same learning rate of 5e-5 for
all fine-tuning experiments; BERT chooses a  
task-specific fine-tuning learning rate which  
performs the best on the development set.  
For fine-tuning, most model hyperparameters  
are the same as in pre-training, with the excep-  
tion of the batch size, learning rate, and number  
of training epochs. The dropout probability was  
always kept at 0.1. The optimal hyperparameter  
To isolate the effect of these differences, we per-  
form ablation experiments in Section 5.1 which  
demonstrate that the majority of the improvements  
are in fact coming from the new pre-training tasks.  
values are task-specific, but we found the following range of possible values to work well across all tasks:

• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Number of epochs: 3, 4

4 Experiments

In this section, we present BERT fine-tuning results on 11 NLP tasks.
4.1 GLUE Datasets

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a col-
We also observed that large data sets (e.g., 100k+ labeled training examples) were far less
lection of diverse natural language understand-  
ing tasks. Most of the GLUE datasets have al-  
ready existed for a number of years, but the pur-  
pose of GLUE is to (1) distribute these datasets  
with canonical Train, Dev, and Test splits, and
sensitive to hyperparameter choice than small data  
sets. Fine-tuning is typically very fast, so it is rea-  
sonable to simply run an exhaustive search over  
the above parameters and choose the model that  
performs best on the development set.  
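Such an exhaustive search is simple to express. In the sketch below, fine_tune_and_eval is a hypothetical callback standing in for one complete fine-tuning run that returns a Dev-set score; it is not part of the released code.

import itertools

BATCH_SIZES = [16, 32]
LEARNING_RATES = [5e-5, 3e-5, 2e-5]
NUM_EPOCHS = [3, 4]

def select_best_model(fine_tune_and_eval):
    """Try every combination and keep the setting with the best Dev-set score."""
    best_score, best_setting = float("-inf"), None
    for bs, lr, epochs in itertools.product(BATCH_SIZES, LEARNING_RATES, NUM_EPOCHS):
        dev_score = fine_tune_and_eval(batch_size=bs, learning_rate=lr, epochs=epochs)
        if dev_score > best_score:
            best_score, best_setting = dev_score, (bs, lr, epochs)
    return best_setting, best_score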
(2) set up an evaluation server to mitigate issues  
with evaluation inconsistencies and Test set over-  
fitting. GLUE does not distribute labels for the  
Test set and users must upload their predictions to  
the GLUE server for evaluation, with limits on the  
number of submissions.  
The GLUE benchmark includes the following  
datasets, the descriptions of which were originally  
summarized in Wang et al. (2018):  
3.6 Comparison of BERT and OpenAI GPT
The most comparable existing pre-training method  
to BERT is OpenAI GPT, which trains a left-to-  
right Transformer LM on a large text corpus. In  
fact, many of the design decisions in BERT were  
intentionally chosen to be as close to GPT as pos-  
sible so that the two methods could be minimally  
compared. The core argument of this work is that  
the two novel pre-training tasks presented in Sec-  
tion 3.3 account for the majority of the empiri-  
cal improvements, but we do note that there are  
several other differences between how BERT and  
GPT were trained:  
MNLI Multi-Genre Natural Language Inference  
is a large-scale, crowdsourced entailment classifi-  
cation task (Williams et al., 2018). Given a pair of  
sentences, the goal is to predict whether the sec-  
ond sentence is an entailment, contradiction, or  
neutral with respect to the first one.  
• GPT is trained on the BooksCorpus (800M
words); BERT is trained on the BooksCor-  
pus (800M words) and Wikipedia (2,500M  
words).  
QQP Quora Question Pairs is a binary classifi-  
cation task where the goal is to determine if two  
questions asked on Quora are semantically equiv-  
alent (Chen et al., 2018).  
Figure 3: Our task specific models are formed by incorporating BERT with one additional output layer, so a  
minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level  
tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, Ti represents the  
contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special  
symbol to separate non-consecutive token sequences.  
QNLI Question Natural Language Inference is  
a version of the Stanford Question Answering  
Dataset (Rajpurkar et al., 2016) which has been  
converted to a binary classification task (Wang  
et al., 2018). The positive examples are (ques-  
tion, sentence) pairs which do contain the correct  
answer, and the negative examples are (question,  
sentence) from the same paragraph which do not  
contain the answer.  
STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).
System | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average
(training examples) | 392k | 363k | 108k | 67k | 8.5k | 5.7k | 3.5k | 2.5k | -
Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0
BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.9 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0
OpenAI GPT | 82.1/81.4 | 70.3 | 88.1 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.2
BERTBASE | 84.6/83.4 | 71.2 | 90.1 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6
BERTLARGE | 86.7/85.9 | 72.1 | 91.1 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 81.9

Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12, H=768, A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single task.
RTE Recognizing Textual Entailment is a bi-  
nary entailment task similar to MNLI, but with  
much less training data (Bentivogli et al., 2009).6  
small data sets (i.e., some runs would produce de-  
generate results), so we ran several random restarts  
and selected the model that performed best on the  
Dev set. With random restarts, we use the same  
pre-trained checkpoint but perform different fine-  
tuning data shuffling and classifier layer initializa-  
tion. We note that the GLUE data set distribution  
does not include the Test labels, and we only made  
a single GLUE evaluation server submission for  
WNLI Winograd NLI is a small natural lan-  
guage inference dataset deriving from (Levesque  
et al., 2011). The GLUE webpage notes that there  
are issues with the construction of this dataset, 7  
and every trained system that’s been submitted
to GLUE has performed worse than the 65.1
baseline accuracy of predicting the majority class.  
We therefore exclude this set out of fairness to  
OpenAI GPT. For our GLUE submission, we al-  
ways predicted the majority class.  
each BERTBASE and BERTLARGE.
Results are presented in Table 1.  
Both  
BERTBASE and BERTLARGE outperform all exist-  
ing systems on all tasks by a substantial margin,  
obtaining 4.4% and 6.7% respective average accu-  
racy improvement over the state-of-the-art. Note  
that BERTBASE and OpenAI GPT are nearly iden-  
tical in terms of model architecture outside of  
the attention masking. For the largest and most  
widely reported GLUE task, MNLI, BERT ob-  
tains a 4.7% absolute accuracy improvement over  
the state-of-the-art. On the official GLUE leader-  
4.1.1 GLUE Results
To fine-tune on GLUE, we represent the input se-  
quence or sequence pair as described in Section 3,  
and use the final hidden vector C ∈ R^H corre-
sponding to the first input token ([CLS]) as the  
aggregate representation. This is demonstrated vi-  
sually in Figure 3 (a) and (b). The only new pa-  
rameters introduced during fine-tuning are a classification layer W ∈ R^{K×H}, where K is the num-
ber of labels. We compute a standard classification  
board,8 BERTLARGE obtains a score of 80.4, com-
pared to the top leaderboard system, OpenAI GPT,  
which obtains 72.8 as of the date of writing.  
It is interesting to observe that BERTLARGE sig-  
nificantly outperforms BERTBASE across all tasks,  
even those with very little training data. The effect  
of BERT model size is explored more thoroughly  
in Section 5.2.  
loss with C and W, i.e., log(softmax(CW^T)).
We use a batch size of 32 and 3 epochs over  
the data for all GLUE tasks. For each task, we ran  
fine-tunings with learning rates of 5e-5, 4e-5, 3e-5,  
and 2e-5 and selected the one that performed best  
on the Dev set. Additionally, for BERTLARGE we  
found that fine-tuning was sometimes unstable on  
4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced
6Note that we only report single-task fine-tuning results in this paper. A multitask fine-tuning approach could potentially push the results even further. For example, we did observe substantial improvements on RTE from multi-task training with MNLI.
question/answer pairs (Rajpurkar et al., 2016).
Given a question and a paragraph from Wikipedia  
8https://gluebenchmark.com/leaderboard  
containing the answer, the task is to predict the an-  
swer text span in the paragraph. For example:  
Input Question:
Where do water droplets collide with ice crystals to form precipitation?

Input Paragraph:
... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. ...

Output Answer:
within a cloud

System | Dev EM | Dev F1 | Test EM | Test F1
Leaderboard (Oct 8th, 2018)
Human | - | - | 82.3 | 91.2
#1 Ensemble - nlnet | - | - | 86.0 | 91.7
#2 Ensemble - QANet | - | - | 84.5 | 90.5
#1 Single - nlnet | - | - | 83.5 | 90.1
#2 Single - QANet | - | - | 82.5 | 89.3
Published
BiDAF+ELMo (Single) | - | 85.8 | - | -
R.M. Reader (Single) | 78.9 | 86.3 | 79.5 | 86.6
R.M. Reader (Ensemble) | 81.2 | 87.9 | 82.3 | 88.5
Ours
BERTBASE (Single) | 80.8 | 88.5 | - | -
BERTLARGE (Single) | 84.1 | 90.9 | - | -
BERTLARGE (Ensemble) | 85.8 | 91.8 | - | -
BERTLARGE (Sgl.+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8
BERTLARGE (Ens.+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2

Table 2: SQuAD results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.
This type of span prediction task is quite dif-  
ferent from the sequence classification tasks of  
GLUE, but we are able to adapt BERT to run  
on SQuAD in a straightforward manner. Just as  
with GLUE, we represent the input question and  
paragraph as a single packed sequence, with the  
question using the A embedding and the paragraph  
using the B embedding. The only new parame-  
ters learned during fine-tuning are a start vector  
S ∈ R^H and an end vector E ∈ R^H. Let the final hidden vector from BERT for the i-th input token be denoted as T_i ∈ R^H. See Figure 3 (c) for a visualization. Then, the probability of word i being the start of the answer span is computed as a dot product between T_i and S followed by a softmax over all of the words in the paragraph:

P_i = e^{S·T_i} / Σ_j e^{S·T_j}
The same formula is used for the end of the an-  
swer span, and the maximum scoring span is used  
as the prediction. The training objective is the log-  
likelihood of the correct start and end positions.  
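Putting the span formulation together with the end-after-start constraint used at inference (described below), a scoring sketch looks as follows; shapes follow the notation above and the code is illustrative only.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def best_span(T, S, E):
    """T: (seq_len, H) final hidden vectors; S, E: (H,) start and end vectors.

    Returns the (start, end) pair maximizing start/end probability with end >= start.
    """
    p_start = softmax(T @ S)
    p_end = softmax(T @ E)
    best, best_score = (0, 0), -1.0
    for i in range(len(T)):
        for j in range(i, len(T)):          # constrain the end to come after the start
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best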
We train for 3 epochs with a learning rate of 5e-5 and a batch size of 32. At inference time, since the end prediction is not conditioned on the start, we add the constraint that the end must come after the start, but no other heuristics are used. The tokenized labeled span is aligned back to the original untokenized input for evaluation.

Results are presented in Table 2. SQuAD uses a highly rigorous testing procedure where the submitter must manually contact the SQuAD organizers to run their system on a hidden test set, so we only submitted our best system for testing. The result shown in the table is our first and only Test submission to SQuAD. We note that the top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, and are allowed to use any public data when training their systems. We therefore use very modest data augmentation in our submitted system by jointly training on SQuAD and TriviaQA (Joshi et al., 2017). Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. If we fine-tune on only SQuAD (without TriviaQA) we lose 0.1-0.4 F1 and still outperform all existing systems by a wide margin.

4.3 Named Entity Recognition

To evaluate performance on a token tagging task, we fine-tune BERT on the CoNLL 2003 Named Entity Recognition (NER) dataset. This dataset consists of 200k training words which have been annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).

For fine-tuning, we feed the final hidden representation T_i ∈ R^H for each token i into a classification layer over the NER label set. The predictions are not conditioned on the surrounding predictions (i.e., non-autoregressive and no CRF). To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first
System | Dev F1 | Test F1
ELMo+BiLSTM+CRF | 95.7 | 92.2
CVT+Multi (Clark et al., 2018) | - | 92.6
BERTBASE | 96.4 | 92.4
BERTLARGE | 96.6 | 92.8

Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were selected using the Dev set, and the reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

System | Dev | Test
ESIM+GloVe | 51.9 | 52.7
ESIM+ELMo | 59.1 | 59.2
BERTBASE | 81.6 | -
BERTLARGE | 86.6 | 86.3
Human (expert)† | - | 85.0
Human (5 annotations)† | - | 88.0

Table 4: SWAG Dev and Test accuracies. Test results were scored against the hidden labels by the SWAG authors. †Human performance is measured with 100 samples, as reported in the SWAG paper.
sub-token as input to the classifier. For example:  
Jim    Hen    ##son  was  a    puppet  ##eer
I-PER  I-PER  X      O    O    O       X
Where no prediction is made for X. Since  
the WordPiece tokenization boundaries are a  
known part of the input, this is done for both  
training and test. A visual representation is also  
given in Figure 3 (d). A cased WordPiece model  
is used for NER, whereas an uncased model is  
used for all other tasks.  
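The first-sub-token labeling scheme can be sketched as below. The toy tokenizer is a stand-in for the actual WordPiece tokenizer.

def align_labels_to_wordpieces(words, labels, wordpiece_tokenize):
    """Label only the first sub-token of each word; mark continuations with 'X'."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = wordpiece_tokenize(word)      # e.g. "Henson" -> ["Hen", "##son"]
        tokens.extend(pieces)
        aligned.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, aligned

toy_pieces = {"Henson": ["Hen", "##son"], "puppeteer": ["puppet", "##eer"]}
toy_tokenizer = lambda w: toy_pieces.get(w, [w])
print(align_labels_to_wordpieces(
    ["Jim", "Henson", "was", "a", "puppeteer"],
    ["I-PER", "I-PER", "O", "O", "O"],
    toy_tokenizer))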
Results are presented in Table 3. BERTLARGE  
outperforms the existing SOTA, Cross-View  
Training with multi-task learning (Clark et al.,  
score for each choice i. The probability distribution is the softmax over the four choices:

P_i = e^{V·C_i} / Σ_{j=1}^{4} e^{V·C_j}
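A sketch of this multiple-choice scoring: each candidate ending yields a pooled vector C_i, the task vector V scores each one, and a softmax normalizes over the four choices. Shapes and values here are illustrative only.

import numpy as np

def choice_probabilities(C, V):
    """C: (4, H) pooled [CLS] vectors, one per candidate ending; V: (H,) task vector."""
    scores = C @ V                    # dot product with each choice
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()                # softmax over the four choices

H = 768
rng = np.random.default_rng(0)
print(choice_probabilities(rng.normal(size=(4, H)), rng.normal(size=H)))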
We fine-tune the model for 3 epochs with a  
learning rate of 2e-5 and a batch size of 16. Re-  
sults are presented in Table 4. BERTLARGE out-  
performs the authors’ baseline ESIM+ELMo sys-  
tem by +27.1%.  
5 Ablation Studies
2018), by +0.2 on CoNLL-2003 NER Test.
Although we have demonstrated extremely strong  
empirical results, the results presented so far have  
not isolated the specific contributions from each  
aspect of the BERT framework. In this section,  
we perform ablation experiments over a number of  
facets of BERT in order to better understand their  
relative importance.  
4.4 SWAG
The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018).
Given a sentence from a video captioning  
dataset, the task is to decide among four choices  
the most plausible continuation. For example:  
5.1 Effect of Pre-training Tasks  
One of our core claims is that the deep bidirec-  
tionality of BERT, which is enabled by masked  
LM pre-training, is the single most important im-  
provement of BERT compared to previous work.  
To give evidence for this claim, we evaluate two  
new models which use the exact same pre-training  
data, fine-tuning scheme and Transformer hyper-  
A girl is going across a set of monkey bars. She  
(i) jumps up across the monkey bars.
(ii) struggles onto the bars to grab her head.
(iii) gets to the end and stands on a wooden plank.
(iv) jumps up and does a back flip.
parameters as BERTBASE:

1. No NSP: A model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

Adapting BERT to the SWAG dataset is similar to the adaptation for GLUE. For each example, we construct four input sequences, which each contain the concatenation of the given sentence
(sentence A) and a possible continuation (sentence  
B). The only task-specific parameters we introduce  
is a vector V ∈ R^H, whose dot product with the final aggregate representation C_i ∈ R^H denotes a

2. LTR & No NSP: A model which is trained using a Left-to-Right (LTR) LM, rather than
an MLM. In this case, we predict every in-  
put word and do not apply any masking. The  
left-only constraint was also applied at fine-  
tuning, because we found it is always worse  
to pre-train with left-only-context and fine-  
tune with bidirectional context. Additionally,  
this model was pre-trained without the NSP  
task. This is directly comparable to OpenAI  
GPT, but using our larger training dataset,  
our input representation, and our fine-tuning  
scheme.  
pre-trained bidirectional models. It also hurts per-  
formance on all four GLUE tasks.  
We recognize that it would also be possible to  
train separate LTR and RTL models and represent  
each token as the concatenation of the two mod-  
els, as ELMo does. However: (a) this is twice as  
expensive as a single bidirectional model; (b) this  
is non-intuitive for tasks like QA, since the RTL  
model would not be able to condition the answer  
on the question; (c) it is strictly less powerful
than a deep bidirectional model, since a deep bidi-  
rectional model could choose to use either left or  
right context.  
Results are presented in Table 5. We first ex-  
amine the impact brought by the NSP task. We  
can see that removing NSP hurts performance sig-  
nificantly on QNLI, MNLI, and SQuAD. These  
results demonstrate that our pre-training method  
is critical in obtaining the strong empirical results  
presented previously.  
5.2 Effect of Model Size
In this section, we explore the effect of model size  
on fine-tuning task accuracy. We trained a number  
of BERT models with a differing number of layers,  
hidden units, and attention heads, while otherwise  
using the same hyperparameters and training pro-  
cedure as described previously.  
Next, we evaluate the impact of training bidi-  
rectional representations by comparing “No NSP”  
to “LTR & No NSP”. The LTR model performs  
worse than the MLM model on all tasks, with ex-  
tremely large drops on MRPC and SQuAD. For  
SQuAD it is intuitively clear that an LTR model  
will perform very poorly at span and token predic-  
tion, since the token-level hidden states have no  
right-side context. For MRPC it is unclear whether
the poor performance is due to the small data size  
or the nature of the task, but we found this poor  
performance to be consistent across a full hyper-  
parameter sweep with many random restarts.  
In order to make a good-faith attempt at strength-
ening the LTR system, we tried adding a ran-  
domly initialized BiLSTM on top of it for fine-  
tuning. This does significantly improve results on  
SQuAD, but the results are still far worse than the  
Results on selected GLUE tasks are shown in  
Table 6. In this table, we report the average Dev  
Set accuracy from 5 random restarts of fine-tuning.  
We can see that larger models lead to a strict ac-  
curacy improvement across all four datasets, even  
for MRPC which only has 3,600 labeled train-  
ing examples, and is substantially different from  
the pre-training tasks. It is also perhaps surpris-  
ing that we are able to achieve such significant  
improvements on top of models which are al-  
ready quite large relative to the existing literature.  
For example, the largest Transformer explored in  
Vaswani et al. (2017) is (L=6, H=1024, A=16)  
with 100M parameters for the encoder, and the  
largest Transformer we have found in the literature  
is (L=64, H=512, A=2) with 235M parameters  
(Al-Rfou et al., 2018). By contrast, BERTBASE  
Dev Set
Tasks | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1)
BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5
No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9
LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8
+ BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR + No NSP” model during fine-tuning.

Hyperparams | | | | Dev Set Accuracy | |
#L | #H | #A | LM (ppl) | MNLI-m | MRPC | SST-2
3 | 768 | 12 | 5.84 | 77.9 | 79.8 | 88.4
6 | 768 | 3 | 5.24 | 80.6 | 82.2 | 90.7
6 | 768 | 12 | 4.68 | 81.9 | 84.8 | 91.3
12 | 768 | 12 | 3.99 | 84.4 | 86.7 | 92.9
12 | 1024 | 16 | 3.54 | 85.7 | 86.9 | 93.3
24 | 1024 | 16 | 3.23 | 86.6 | 87.8 | 93.7

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.
contains 110M parameters and BERTLARGE con-  
tains 340M parameters.  
5.4 Feature-based Approach with BERT  
It has been known for many years that increas-  
ing the model size will lead to continual improve-  
ments on large-scale tasks such as machine trans-  
lation and language modeling, which is demon-  
strated by the LM perplexity of held-out training  
data shown in Table 6. However, we believe that  
this is the first work to demonstrate that scaling to  
extreme model sizes also leads to large improve-  
ments on very small scale tasks, provided that the  
model has been sufficiently pre-trained.  
All of the BERT results presented so far have used  
the fine-tuning approach, where a simple classifi-  
cation layer is added to the pre-trained model, and  
all parameters are jointly fine-tuned on a down-  
stream task. However, the feature-based approach,  
where fixed features are extracted from the pre-  
trained model, has certain advantages. First, not  
all NLP tasks can easily be represented by a
Transformer encoder architecture, and therefore  
require a task-specific model architecture to be  
added. Second, there are major computational  
benefits to being able to pre-compute an expensive  
representation of the training data once and then  
run many experiments with less expensive models  
on top of this representation.  
5.3 Effect of Number of Training Steps
Figure 4 presents MNLI Dev accuracy after fine-  
tuning from a checkpoint that has been pre-trained  
for k steps. This allows us to answer the following  
questions:  
In this section we evaluate how well BERT per-  
forms in the feature-based approach by generating  
ELMo-like pre-trained contextual representations  
on the CoNLL-2003 NER task. To do this, we use  
the same input representation as in Section 4.3, but  
use the activations from one or more layers with-  
out fine-tuning any parameters of BERT. These  
contextual embeddings are used as input to a ran-  
domly initialized two-layer 768-dimensional BiL-  
STM before the classification layer.  
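The feature extraction described above reduces to slicing and concatenating frozen layer activations. In this sketch, bert_layer_activations is a hypothetical function returning one (seq_len, H) array per layer; it stands in for a forward pass through the pre-trained model.

import numpy as np

def concat_last_four_layers(bert_layer_activations, tokens):
    """Concatenate the token representations from the top four hidden layers.

    The result, of shape (seq_len, 4 * H), is fed to a downstream two-layer
    BiLSTM without backpropagating into BERT.
    """
    layers = bert_layer_activations(tokens)      # e.g. 12 layers for BERT-Base
    return np.concatenate(layers[-4:], axis=-1)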
1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy?
Answer: Yes, BERTBASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps.
2. Question: Does MLM pre-training converge
slower than LTR pre-training, since only 15%  
of words are predicted in each batch rather  
than every word?  
Results are shown in Table 7. The best perform-  
ing method is to concatenate the token representa-  
tions from the top four hidden layers of the pre-  
trained Transformer, which is only 0.3 F1 behind  
fine-tuning the entire model. This demonstrates  
that BERT is effective for both the fine-tuning and  
feature-based approaches.  
Answer: The MLM model does converge  
slightly slower than the LTR model. How-  
ever, in terms of absolute accuracy the MLM  
model begins to outperform the LTR model  
almost immediately.  
Figure 4: Ablation over number of training steps. This shows the MNLI Dev accuracy after fine-tuning, starting from model parameters that have been pre-trained for k steps, for BERTBASE (Masked LM) and BERTBASE (Left-to-Right). The x-axis is the value of k in thousands of pre-training steps.

Layers | Dev F1
Finetune All | 96.4
First Layer (Embeddings) | 91.0
Second-to-Last Hidden | 95.6
Last Hidden | 94.9
Sum Last Four Hidden | 95.9
Concat Last Four Hidden | 96.1
Sum All 12 Layers | 95.5

Table 7: Ablation using BERT with a feature-based approach on CoNLL-2003 NER. The activations from the specified layers are combined and fed into a two-layer BiLSTM, without backpropagation to BERT.
6 Conclusion
Recent empirical improvements due to transfer  
learning with language models have demonstrated  
that rich, unsupervised pre-training is an integral  
part of many language understanding systems. In  
particular, these results enable even low-resource  
tasks to benefit from very deep unidirectional ar-  
chitectures. Our major contribution is further gen-  
eralizing these findings to deep bidirectional ar-  
chitectures, allowing the same pre-trained model  
to successfully tackle a broad set of NLP tasks.  
While the empirical results are strong, in some  
cases surpassing human performance, important  
future work is to investigate the linguistic phenom-  
ena that may or may not be captured by BERT.  
Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.
Kevin Clark, Minh-Thang Luong, Christopher D Man-  
ning, and Quoc V Le. 2018. Semi-supervised se-  
quence modeling with cross-view training. arXiv  
preprint arXiv:1809.08370.  
Ronan Collobert and Jason Weston. 2008. A unified  
architecture for natural language processing: Deep  
neural networks with multitask learning. In Pro-  
ceedings of the 25th International Conference on  
Machine Learning, ICML ’08.  
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of
the 2017 Conference on Empirical Methods in Nat-  
ural Language Processing, pages 670–680, Copen-  
hagen, Denmark. Association for Computational  
Linguistics.  
References  
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy  
Guo, and Llion Jones. 2018. Character-level lan-  
guage modeling with deeper self-attention. arXiv  
preprint arXiv:1808.04444.  
Andrew M Dai and Quoc V Le. 2015. Semi-supervised  
sequence learning. In Advances in neural informa-  
tion processing systems, pages 3079–3087.  
Rie Kubota Ando and Tong Zhang. 2005. A framework  
for learning predictive structures from multiple tasks  
and unlabeled data. Journal of Machine Learning  
Research, 6(Nov):1817–1853.  
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-  
Fei. 2009. ImageNet: A Large-Scale Hierarchical  
Image Database. In CVPR09.  
William B Dolan and Chris Brockett. 2005. Automati-  
cally constructing a corpus of sentential paraphrases.  
In Proceedings of the Third International Workshop  
on Paraphrasing (IWP2005).  
Luisa Bentivogli, Bernardo Magnini, Ido Dagan,  
Hoa Trang Dang, and Danilo Giampiccolo. 2009.  
The fifth PASCAL recognizing textual entailment  
challenge. In TAC. NIST.  
Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.
John Blitzer, Ryan McDonald, and Fernando Pereira.  
2006. Domain adaptation with structural correspon-
dence learning. In Proceedings of the 2006 confer-  
ence on empirical methods in natural language pro-  
cessing, pages 120–128. Association for Computa-  
tional Linguistics.  
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.
Samuel R. Bowman, Gabor Angeli, Christopher Potts,  
and Christopher D. Manning. 2015. A large anno-  
tated corpus for learning natural language inference.  
In EMNLP. Association for Computational Linguis-  
tics.  
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke  
Zettlemoyer. 2017. Triviaqa: A large scale distantly  
supervised challenge dataset for reading comprehen-  
sion. In ACL.  
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov,  
Richard Zemel, Raquel Urtasun, Antonio Torralba,  
and Sanja Fidler. 2015. Skip-thought vectors. In  
Advances in neural information processing systems,  
pages 3294–3302.  
Peter F Brown, Peter V Desouza, Robert L Mercer,  
Vincent J Della Pietra, and Jenifer C Lai. 1992.  
Class-based n-gram models of natural language.  
Computational linguistics, 18(4):467–479.  
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-  
Gazpio, and Lucia Specia. 2017. Semeval-2017  
task 1: Semantic textual similarity-multilingual and  
cross-lingual focused evaluation. arXiv preprint  
arXiv:1708.00055.  
Quoc Le and Tomas Mikolov. 2014. Distributed rep-  
resentations of sentences and documents. In Inter-  
national Conference on Machine Learning, pages  
1188–1196.  
Hector J Levesque, Ernest Davis, and Leora Morgen-  
stern. 2011. The winograd schema challenge. In  
Aaai spring symposium: Logical formalizations of  
commonsense reasoning, volume 46, page 47.  
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob  
Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz  
Kaiser, and Illia Polosukhin. 2017. Attention is all  
you need. In Advances in Neural Information Pro-  
cessing Systems, pages 6000–6010.  
Bryan McCann, James Bradbury, Caiming Xiong, and  
Richard Socher. 2017. Learned in translation: Con-  
textualized word vectors. In NIPS.  
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and  
Pierre-Antoine Manzagol. 2008. Extracting and  
composing robust features with denoising autoen-  
coders. In Proceedings of the 25th international  
conference on Machine learning, pages 1096–1103.  
ACM.  
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-  
rado, and Jeff Dean. 2013. Distributed representa-  
tions of words and phrases and their compositional-  
ity. In Advances in Neural Information Processing  
Systems 26, pages 3111–3119. Curran Associates,  
Inc.  
Alex Wang, Amapreet Singh, Julian Michael, Felix  
Hill, Omer Levy, and Samuel R Bowman. 2018.  
Glue: A multi-task benchmark and analysis platform  
for natural language understanding. arXiv preprint  
arXiv:1804.07461.  
Jeffrey Pennington, Richard Socher, and Christo-  
pher D. Manning. 2014. Glove: Global vectors for  
word representation. In Empirical Methods in Nat-  
ural Language Processing (EMNLP), pages 1532–1543.

A. Warstadt, A. Singh, and S. R. Bowman. 2018. Corpus of linguistic acceptability.
Matthew Peters, Waleed Ammar, Chandra Bhagavat-  
ula, and Russell Power. 2017. Semi-supervised se-  
quence tagging with bidirectional language models.  
In ACL.  
Adina Williams, Nikita Nangia, and Samuel R Bow-  
man. 2018. A broad-coverage challenge corpus  
for sentence understanding through inference. In  
NAACL.  
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt  
Gardner, Christopher Clark, Kenton Lee, and Luke  
Zettlemoyer. 2018. Deep contextualized word rep-  
resentations. In NAACL.  
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V  
Le, Mohammad Norouzi, Wolfgang Macherey,  
Maxim Krikun, Yuan Cao, Qin Gao, Klaus  
Macherey, et al. 2016.  
Google’s neural ma-  
chine translation system: Bridging the gap between  
human and machine translation. arXiv preprint  
arXiv:1609.08144.  
Alec Radford, Karthik Narasimhan, Tim Salimans, and  
Ilya Sutskever. 2018. Improving language under-  
standing with unsupervised learning. Technical re-  
port, OpenAI.  
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod  
Lipson. 2014. How transferable are features in deep  
neural networks? In Advances in neural information  
processing systems, pages 3320–3328.  
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and  
Percy Liang. 2016. Squad: 100,000+ questions  
for machine comprehension of text. arXiv preprint  
arXiv:1606.05250.  
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin  
Choi. 2018. Swag: A large-scale adversarial dataset  
for grounded commonsense inference. In Proceed-  
ings of the 2018 Conference on Empirical Methods  
in Natural Language Processing (EMNLP).  
Richard Socher, Alex Perelygin, Jean Wu, Jason  
Chuang, Christopher D Manning, Andrew Ng, and  
Christopher Potts. 2013. Recursive deep models  
for semantic compositionality over a sentiment tree-  
bank. In Proceedings of the 2013 conference on  
empirical methods in natural language processing,  
pages 1631–1642.  
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut-  
dinov, Raquel Urtasun, Antonio Torralba, and Sanja  
Fidler. 2015. Aligning books and movies: Towards  
story-like visual explanations by watching movies  
and reading books. In Proceedings of the IEEE  
international conference on computer vision, pages  
Wilson L Taylor. 1953. cloze procedure: A new  
tool for measuring readability. Journalism Bulletin,  
3
0(4):415–433.  
19–27.  
Erik F Tjong Kim Sang and Fien De Meulder.  
2003. Introduction to the conll-2003 shared task:  
Language-independent named entity recognition. In  
Proceedings of the seventh conference on Natural  
language learning at HLT-NAACL 2003-Volume 4,  
pages 142–147. Association for Computational Lin-  
guistics.  
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.  
Word representations: A simple and general method  
for semi-supervised learning. In Proceedings of the  
48th Annual Meeting of the Association for Compu-
tational Linguistics, ACL ’10, pages 384–394.