BERT: Pre-training of Deep Bidirectional Transformers for  
Language Understanding  
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova  
Google AI Language  
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract
We introduce a new language representa-  
tion model called BERT, which stands for  
Bidirectional Encoder Representations from  
Transformers. Unlike recent language repre-  
sentation models (Peters et al., 2018; Radford  
et al., 2018), BERT is designed to pre-train  
deep bidirectional representations by jointly  
conditioning on both left and right context in  
all layers. As a result, the pre-trained BERT  
representations can be fine-tuned with just one  
additional output layer to create state-of-the-  
art models for a wide range of tasks, such  
as question answering and language inference,  
without substantial task-specific architecture  
modifications.  
There are two existing strategies for apply-  
ing pre-trained language representations to down-  
stream tasks: feature-based and fine-tuning. The  
feature-based approach, such as ELMo (Peters  
et al., 2018), uses task-specific architectures that
include the pre-trained representations as addi-  
tional features. The fine-tuning approach, such as  
the Generative Pre-trained Transformer (OpenAI  
GPT) (Radford et al., 2018), introduces minimal  
task-specific parameters, and is trained on the  
downstream tasks by simply fine-tuning the pre-  
trained parameters. In previous work, both ap-  
proaches share the same objective function dur-  
ing pre-training, where they use unidirectional lan-  
guage models to learn general language represen-  
tations.  
We argue that current techniques severely re-  
strict the power of the pre-trained representations,  
especially for the fine-tuning approaches. The ma-  
jor limitation is that standard language models are  
unidirectional, and this limits the choice of archi-  
tectures that can be used during pre-training. For  
example, in OpenAI GPT, the authors use a left-  
to-right architecture, where every token can only
attend to previous tokens in the self-attention
layers of the Transformer (Vaswani et al., 2017).  
Such restrictions are sub-optimal for sentence-  
level tasks, and could be devastating when ap-  
plying fine-tuning based approaches to token-level  
tasks such as SQuAD question answering (Ra-  
jpurkar et al., 2016), where it is crucial to incor-  
porate context from both directions.  
BERT is conceptually simple and empirically  
powerful. It obtains new state-of-the-art re-  
sults on eleven natural language processing  
tasks, including pushing the GLUE bench-  
mark to 80.4% (7.6% absolute improvement),  
MultiNLI accuracy to 86.7% (5.6% abso-  
lute improvement) and the SQuAD v1.1 ques-  
tion answering Test F1 to 93.2 (1.5 absolute  
improvement), outperforming human perfor-  
mance by 2.0.  
1
Introduction  
Language model pre-training has been shown to be ef-
fective for improving many natural language pro-  
cessing tasks (Dai and Le, 2015; Peters et al.,  
2017, 2018; Radford et al., 2018; Howard and
Ruder, 2018). These tasks include sentence-level  
tasks such as natural language inference (Bow-  
man et al., 2015; Williams et al., 2018) and para-  
phrasing (Dolan and Brockett, 2005), which aim  
to predict the relationships between sentences by  
analyzing them holistically, as well as token-level  
tasks such as named entity recognition (Tjong  
Kim Sang and De Meulder, 2003) and SQuAD  
question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token level.
In this paper, we improve the fine-tuning based  
approaches by proposing BERT: Bidirectional  
Encoder Representations from Transformers.  
BERT addresses the previously mentioned uni-  
directional constraints by proposing a new  
pre-training objective: the “masked language  
model” (MLM), inspired by the Cloze task (Tay-  
lor, 1953). The masked language model randomly  
masks some of the tokens from the input, and the  
objective is to predict the original vocabulary id of  
the masked word based only on its context. Un-  
like left-to-right language model pre-training, the  
MLM objective allows the representation to fuse  
the left and the right context, which allows us  
to pre-train a deep bidirectional Transformer. In  
addition to the masked language model, we also  
introduce a “next sentence prediction” task that  
jointly pre-trains text-pair representations.  
The contributions of our paper are as follows:  
• We demonstrate the importance of bidirec-
tional pre-training for language representa-  
tions. Unlike Radford et al. (2018), which  
uses unidirectional language models for pre-  
training, BERT uses masked language mod-  
els to enable pre-trained deep bidirectional  
representations. This is also in contrast to  
Peters et al. (2018), which uses a shallow  
concatenation of independently trained left-  
to-right and right-to-left LMs.  
• We show that pre-trained representations
eliminate the need for many heavily-
engineered task-specific architectures. BERT  
is the first fine-tuning based representation  
model that achieves state-of-the-art perfor-  
mance on a large suite of sentence-level and  
token-level tasks, outperforming many sys-  
tems with task-specific architectures.  
• BERT advances the state-of-the-art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained models will be made publicly available.1

2 Related Work

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.

2.1 Feature-based Approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension: it extracts context-sensitive features from a language model. When contextual word embeddings are integrated with existing task-specific architectures, ELMo advances the state of the art on several major NLP benchmarks (Peters et al., 2018), including question answering on SQuAD (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).

2.2 Fine-tuning Approaches

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on an LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).

2.3 Transfer Learning from Supervised Data

While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).

1Will be released before the end of October 2018.
Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.
3 BERT

We introduce BERT and its detailed implementation in this section. We first cover the model architecture and the input representation for BERT. We then introduce the pre-training tasks, the core innovation in this paper, in Section 3.3. The pre-training procedures and fine-tuning procedures are detailed in Section 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.

3.1 Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.2 Because the use of Transformers has become ubiquitous recently and our implementation is effectively identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”3

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for H = 768 and 4096 for H = 1024. We primarily report results on two model sizes:

• BERTBASE: L=12, H=768, A=12, Total Parameters=110M
• BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M

BERTBASE was chosen to have an identical model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation. The comparisons between BERT, OpenAI GPT and ELMo are shown visually in Figure 1.

3.2 Input Representation

Our input representation is able to unambiguously represent both a single text sentence or a pair of text sentences (e.g., [Question, Answer]) in one token sequence.4 For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings. A visual representation of our input representation is given in Figure 2. The specifics are:

• We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.
• We use learned positional embeddings with supported sequence lengths up to 512 tokens.

2https://github.com/tensorflow/tensor2tensor
3http://nlp.seas.harvard.edu/2018/04/03/attention.html
4Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
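For concreteness, the two model sizes can be written down as a small configuration object. The following Python sketch is illustrative only; the field names and defaults are our assumptions, not the released code.

from dataclasses import dataclass

@dataclass
class BertConfig:
    """Illustrative hyperparameter container for the notation above."""
    num_layers: int           # L: number of Transformer blocks
    hidden_size: int          # H: hidden size
    num_attention_heads: int  # A: number of self-attention heads
    vocab_size: int = 30000   # WordPiece vocabulary size
    max_position: int = 512   # maximum supported sequence length

    @property
    def feed_forward_size(self) -> int:
        # The feed-forward/filter size is always 4H in the paper.
        return 4 * self.hidden_size

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)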
Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segment embeddings and the position embeddings.
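As an illustration of this input representation, the following sketch sums token, segment and position embeddings for a packed sentence pair. The embedding tables, sizes and ids here are made up for the example; only the summation itself reflects the description above.

import numpy as np

# Illustrative sizes; BERT-Base uses a 30,000-token vocabulary and H = 768.
vocab_size, H, max_len = 16, 8, 512
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(vocab_size, H))   # learned token embeddings
segment_emb = rng.normal(size=(2, H))          # sentence A / sentence B embeddings
position_emb = rng.normal(size=(max_len, H))   # learned position embeddings

def input_representation(token_ids, segment_ids):
    """Sum token, segment and position embeddings for one packed sequence."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Toy ids standing in for "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]"
token_ids = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5])
segment_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(input_representation(token_ids, segment_ids).shape)   # (11, 8)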
The first token of every sequence is al-  
ways the special classification embedding  
refer to this procedure as a “masked LM” (MLM),  
although it is often referred to as a Cloze task in  
the literature (Taylor, 1953). In this case, the fi-  
nal hidden vectors corresponding to the mask to-  
kens are fed into an output softmax over the vo-  
cabulary, as in a standard LM. In all of our exper-  
iments, we mask 15% of all WordPiece tokens in  
each sequence at random. In contrast to denoising  
auto-encoders (Vincent et al., 2008), we only pre-  
dict the masked words rather than reconstructing  
the entire input.  
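The masked LM objective can be sketched as a standard cross-entropy computed only at the masked positions. The following NumPy snippet is a simplified illustration under the notation above, not the released implementation.

import numpy as np

def masked_lm_loss(final_hidden, output_weights, masked_positions, masked_label_ids):
    """Mean negative log-likelihood over the masked positions only.

    final_hidden:     (seq_len, H) final Transformer hidden states
    output_weights:   (vocab_size, H) output softmax weights
    masked_positions: indices of the masked tokens in the sequence
    masked_label_ids: original vocabulary ids at those positions
    """
    logits = final_hidden[masked_positions] @ output_weights.T   # (n_masked, vocab)
    logits -= logits.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(masked_positions)), masked_label_ids].mean()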
Although this does allow us to obtain a bidirec-  
tional pre-trained model, there are two downsides  
to such an approach. The first is that we are cre-  
ating a mismatch between pre-training and fine-  
tuning, since the [MASK] token is never seen dur-  
ing fine-tuning. To mitigate this, we do not always  
replace “masked” words with the actual [MASK]  
token. Instead, the training data generator chooses  
([CLS]). The final hidden state (i.e., out-
put of Transformer) corresponding to this to-  
ken is used as the aggregate sequence rep-  
resentation for classification tasks. For non-  
classification tasks, this vector is ignored.  
Sentence pairs are packed together into a sin-  
gle sequence. We differentiate the sentences  
in two ways. First, we separate them with  
a special token ([SEP]). Second, we add a  
learned sentence A embedding to every token  
of the first sentence and a sentence B embed-  
ding to every token of the second sentence.  
For single-sentence inputs we only use the  
sentence A embeddings.  
3.3 Pre-training Tasks
Unlike Peters et al. (2018) and Radford et al.
(2018), we do not use traditional left-to-right or
right-to-left language models to pre-train BERT.  
Instead, we pre-train BERT using two novel unsu-  
pervised prediction tasks, described in this section.  
15% of tokens at random, e.g., in the sentence my
dog is hairy it chooses hairy. It then performs  
the following procedure:  
Rather than always replacing the chosen
words with [MASK], the data generator will  
do the following:  
3.3.1 Task #1: Masked LM
Intuitively, it is reasonable to believe that a  
deep bidirectional model is strictly more power-  
ful than either a left-to-right model or the shal-  
low concatenation of a left-to-right and right-to-  
left model. Unfortunately, standard conditional  
language models can only be trained left-to-right  
or right-to-left, since bidirectional conditioning  
would allow each word to indirectly “see itself”  
in a multi-layered context.  
In order to train a deep bidirectional representa-  
tion, we take a straightforward approach of mask-  
ing some percentage of the input tokens at random,  
and then predicting only those masked tokens. We  
• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
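The generator just described can be sketched as follows. The 15% rate and the 80%/10%/10% split come from the text above, while the handling of special tokens and the helper names are assumptions made for the example.

import random

MASK_TOKEN = "[MASK]"

def create_mlm_predictions(tokens, vocab, mask_rate=0.15, rng=random.Random(12345)):
    """Return (possibly modified tokens, list of (position, original token))."""
    output = list(tokens)
    targets = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue
        if rng.random() >= mask_rate:          # leave roughly 85% of tokens untouched
            continue
        targets.append((i, token))             # the original token is always the label
        dice = rng.random()
        if dice < 0.8:
            output[i] = MASK_TOKEN             # 80%: replace with [MASK]
        elif dice < 0.9:
            output[i] = rng.choice(vocab)      # 10%: replace with a random word
        # else: 10%: keep the word unchanged
    return output, targets

tokens = "[CLS] my dog is hairy [SEP]".split()
print(create_mlm_predictions(tokens, vocab=["apple", "cute", "the"]))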
The Transformer encoder does not know which  
words it will be asked to predict or which have  
been replaced by random words, so it is forced to  
keep a distributional contextual representation of  
every input token. Additionally, because random  
replacement only occurs for 1.5% of all tokens  
For the pre-training corpus we use the concatena-  
tion of BooksCorpus (800M words) (Zhu et al.,  
2015) and English Wikipedia (2,500M words).  
For Wikipedia we extract only the text passages  
and ignore lists, tables, and headers. It is criti-  
cal to use a document-level corpus rather than a  
shuffled sentence-level corpus such as the Billion  
Word Benchmark (Chelba et al., 2013) in order to  
extract long contiguous sequences.  
(i.e., 10% of 15%), this does not seem to harm the  
model’s language understanding capability.  
The second downside of using an MLM is that  
only 15% of tokens are predicted in each batch,  
which suggests that more pre-training steps may  
be required for the model to converge. In Sec-  
tion 5.3 we demonstrate that MLM does con-  
verge marginally slower than a left-to-right model  
To generate each training input sequence, we  
sample two spans of text from the corpus, which  
we refer to as “sentences” even though they are  
typically much longer than single sentences (but  
can be shorter also). The first sentence receives  
the A embedding and the second receives the B  
embedding. 50% of the time B is the actual next  
sentence that follows A and 50% of the time it is  
a random sentence, which is done for the “next  
sentence prediction” task. They are sampled such  
that the combined length is 512 tokens. The  
LM masking is applied after WordPiece tokeniza-  
tion with a uniform masking rate of 15%, and no  
special consideration given to partial word pieces.  
We train with batch size of 256 sequences (256  
sequences * 512 tokens = 128,000 tokens/batch)  
for 1,000,000 steps, which is approximately 40  
epochs over the 3.3 billion word corpus. We  
use Adam with learning rate of 1e-4, β1 = 0.9,  
β2 = 0.999, L2 weight decay of 0.01, learning  
rate warmup over the first 10,000 steps, and linear  
decay of the learning rate. We use a dropout prob-  
ability of 0.1 on all layers. We use a gelu acti-  
vation (Hendrycks and Gimpel, 2016) rather than  
the standard relu, following OpenAI GPT. The  
training loss is the sum of the mean masked LM  
likelihood and mean next sentence prediction like-  
lihood.  
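The learning-rate schedule described above can be written as a small helper. This sketch assumes the linear decay runs to zero at step 1,000,000, which is our reading of the text rather than a detail it states explicitly.

def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup over the first 10k steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# e.g. learning_rate(5_000) == 5e-5 and learning_rate(1_000_000) == 0.0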
(which predicts every token), but the empirical im-  
provements of the MLM model far outweigh the  
increased training cost.  
3.3.2 Task #2: Next Sentence Prediction
Many important downstream tasks such as Ques-  
tion Answering (QA) and Natural Language In-  
ference (NLI) are based on understanding the re-  
lationship between two text sentences, which is  
not directly captured by language modeling. In  
order to train a model that understands sentence  
relationships, we pre-train a binarized next sen-  
tence prediction task that can be trivially gener-  
ated from any monolingual corpus. Specifically,  
when choosing the sentences A and B for each pre-  
training example, 50% of the time B is the actual  
next sentence that follows A, and 50% of the time  
it is a random sentence from the corpus. For ex-  
ample:  
Input = [CLS] the man went to [MASK] store [SEP]  
he bought a gallon [MASK] milk [SEP]  
Label = IsNext  
Input = [CLS] the man [MASK] to the store [SEP]
penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

We choose the NotNext sentences completely at random, and the final pre-trained model achieves 97%-98% accuracy at this task. Despite its sim-

Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.5

3.5 Fine-tuning Procedure
For sequence-level classification tasks, BERT  
fine-tuning is straightforward. In order to obtain  
a fixed-dimensional pooled representation of the  
input sequence, we take the final hidden state (i.e.,  
the output of the Transformer) for the first token  
5https://cloudplatform.googleblog.com/2018/06/Cloud-  
TPU-now-offers-preemptible-pricing-and-global-  
availability.html  
plicity, we demonstrate in Section 5.1 that pre-  
training towards this task is very beneficial to both  
QA and NLI.  
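The binarized next sentence prediction data can be generated with a few lines of code. The sketch below only encodes the 50/50 IsNext/NotNext split described above; the remaining sampling details are assumptions.

import random

def make_nsp_example(documents, rng=random.Random(0)):
    """Pick (sentence A, sentence B, label) with a 50/50 IsNext/NotNext split."""
    doc = rng.choice(documents)
    i = rng.randrange(len(doc) - 1)
    sent_a = doc[i]
    if rng.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"    # actual next sentence
    else:
        other = rng.choice(documents)           # random sentence from the corpus
        sent_b, label = rng.choice(other), "NotNext"
    return sent_a, sent_b, label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(docs))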
3.4 Pre-training Procedure
The pre-training procedure largely follows the ex-  
isting literature on language model pre-training.  
in the input, which by construction corresponds to  
• GPT uses a sentence separator ([SEP]) and  
classifier token ([CLS]) which are only in-  
troduced at fine-tuning time; BERT learns  
[SEP], [CLS] and sentence A/B embed-  
dings during pre-training.  
• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

the special [CLS] word embedding. We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are for a classification layer W ∈ R^{K×H}, where K is the number of classifier labels. The label probabilities P ∈ R^K are computed with a standard softmax, P = softmax(CW^T). All of the parameters of
BERT and W are fine-tuned jointly to maximize  
the log-probability of the correct label. For span-  
level and token-level prediction tasks, the above  
procedure must be modified slightly in a task-  
specific manner. Details are given in the corre-  
sponding subsection of Section 4.  
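In code, the new classification parameters amount to a single matrix applied to the pooled vector C. A NumPy sketch of P = softmax(CW^T), purely for illustration:

import numpy as np

def label_probabilities(C, W):
    """P = softmax(C W^T) for a pooled [CLS] vector C of shape (H,) and W of shape (K, H)."""
    logits = W @ C                      # one logit per label
    logits -= logits.max()              # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

H, K = 768, 3                           # e.g. MNLI has three labels
C = np.zeros(H)
W = np.random.default_rng(0).normal(size=(K, H))
print(label_probabilities(C, W))        # sums to 1.0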
• GPT used the same learning rate of 5e-5 for
all fine-tuning experiments; BERT chooses a  
task-specific fine-tuning learning rate which  
performs the best on the development set.  
For fine-tuning, most model hyperparameters  
are the same as in pre-training, with the excep-  
tion of the batch size, learning rate, and number  
of training epochs. The dropout probability was  
always kept at 0.1. The optimal hyperparameter  
To isolate the effect of these differences, we per-  
form ablation experiments in Section 5.1 which  
demonstrate that the majority of the improvements  
are in fact coming from the new pre-training tasks.  
values are task-specific, but we found the following range of possible values to work well across all tasks:

• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Number of epochs: 3, 4

4 Experiments

In this section, we present BERT fine-tuning results on 11 NLP tasks.
4.1 GLUE Datasets

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a col-
We also observed that large data sets (e.g., 100k+ labeled training examples) were far less
lection of diverse natural language understand-  
ing tasks. Most of the GLUE datasets have al-  
ready existed for a number of years, but the pur-  
pose of GLUE is to (1) distribute these datasets  
with canonical Train, Dev, and Test splits, and
sensitive to hyperparameter choice than small data  
sets. Fine-tuning is typically very fast, so it is rea-  
sonable to simply run an exhaustive search over  
the above parameters and choose the model that  
performs best on the development set.  
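Such an exhaustive search is simple to express. In the sketch below, fine_tune_and_eval is a hypothetical callback standing in for one complete fine-tuning run that returns a Dev-set score; it is not part of the released code.

import itertools

BATCH_SIZES = [16, 32]
LEARNING_RATES = [5e-5, 3e-5, 2e-5]
NUM_EPOCHS = [3, 4]

def select_best_model(fine_tune_and_eval):
    """Try every combination and keep the setting with the best Dev-set score."""
    best_score, best_setting = float("-inf"), None
    for bs, lr, epochs in itertools.product(BATCH_SIZES, LEARNING_RATES, NUM_EPOCHS):
        dev_score = fine_tune_and_eval(batch_size=bs, learning_rate=lr, epochs=epochs)
        if dev_score > best_score:
            best_score, best_setting = dev_score, (bs, lr, epochs)
    return best_setting, best_score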
(2) set up an evaluation server to mitigate issues  
with evaluation inconsistencies and Test set over-  
fitting. GLUE does not distribute labels for the  
Test set and users must upload their predictions to  
the GLUE server for evaluation, with limits on the  
number of submissions.  
The GLUE benchmark includes the following  
datasets, the descriptions of which were originally  
summarized in Wang et al. (2018):  
3.6 Comparison of BERT and OpenAI GPT
The most comparable existing pre-training method  
to BERT is OpenAI GPT, which trains a left-to-  
right Transformer LM on a large text corpus. In  
fact, many of the design decisions in BERT were  
intentionally chosen to be as close to GPT as pos-  
sible so that the two methods could be minimally  
compared. The core argument of this work is that  
the two novel pre-training tasks presented in Sec-  
tion 3.3 account for the majority of the empiri-  
cal improvements, but we do note that there are  
several other differences between how BERT and  
GPT were trained:  
MNLI Multi-Genre Natural Language Inference  
is a large-scale, crowdsourced entailment classifi-  
cation task (Williams et al., 2018). Given a pair of  
sentences, the goal is to predict whether the sec-  
ond sentence is an entailment, contradiction, or  
neutral with respect to the first one.  
• GPT is trained on the BooksCorpus (800M
words); BERT is trained on the BooksCor-  
pus (800M words) and Wikipedia (2,500M  
words).  
QQP Quora Question Pairs is a binary classifi-  
cation task where the goal is to determine if two  
questions asked on Quora are semantically equiv-  
alent (Chen et al., 2018).  
Figure 3: Our task specific models are formed by incorporating BERT with one additional output layer, so a  
minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level  
tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, Ti represents the  
contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special  
symbol to separate non-consecutive token sequences.  
QNLI Question Natural Language Inference is  
a version of the Stanford Question Answering  
Dataset (Rajpurkar et al., 2016) which has been  
converted to a binary classification task (Wang  
et al., 2018). The positive examples are (ques-  
tion, sentence) pairs which do contain the correct  
answer, and the negative examples are (question,  
sentence) from the same paragraph which do not  
contain the answer.  
STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).
System | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average
(training examples) | 392k | 363k | 108k | 67k | 8.5k | 5.7k | 3.5k | 2.5k | -
Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0
BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.9 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0
OpenAI GPT | 82.1/81.4 | 70.3 | 88.1 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.2
BERTBASE | 84.6/83.4 | 71.2 | 90.1 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6
BERTLARGE | 86.7/85.9 | 72.1 | 91.1 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 81.9

Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12, H=768, A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single task.
RTE Recognizing Textual Entailment is a bi-  
nary entailment task similar to MNLI, but with  
much less training data (Bentivogli et al., 2009).6  
small data sets (i.e., some runs would produce de-  
generate results), so we ran several random restarts  
and selected the model that performed best on the  
Dev set. With random restarts, we use the same  
pre-trained checkpoint but perform different fine-  
tuning data shuffling and classifier layer initializa-  
tion. We note that the GLUE data set distribution  
does not include the Test labels, and we only made  
a single GLUE evaluation server submission for  
WNLI Winograd NLI is a small natural lan-  
guage inference dataset deriving from (Levesque  
et al., 2011). The GLUE webpage notes that there  
are issues with the construction of this dataset, 7  
and every trained system that’s been submitted
to GLUE has performed worse than the 65.1
baseline accuracy of predicting the majority class.  
We therefore exclude this set out of fairness to  
OpenAI GPT. For our GLUE submission, we al-  
ways predicted the majority class.  
each BERTBASE and BERTLARGE.
Results are presented in Table 1.  
Both  
BERTBASE and BERTLARGE outperform all exist-  
ing systems on all tasks by a substantial margin,  
obtaining 4.4% and 6.7% respective average accu-  
racy improvement over the state-of-the-art. Note  
that BERTBASE and OpenAI GPT are nearly iden-  
tical in terms of model architecture outside of  
the attention masking. For the largest and most  
widely reported GLUE task, MNLI, BERT ob-  
tains a 4.7% absolute accuracy improvement over  
the state-of-the-art. On the official GLUE leader-  
4.1.1 GLUE Results
To fine-tune on GLUE, we represent the input se-  
quence or sequence pair as described in Section 3,  
and use the final hidden vector C ∈ R^H corre-
sponding to the first input token ([CLS]) as the  
aggregate representation. This is demonstrated vi-  
sually in Figure 3 (a) and (b). The only new pa-  
rameters introduced during fine-tuning are a classification layer W ∈ R^{K×H}, where K is the num-
ber of labels. We compute a standard classification  
board,8 BERTLARGE obtains a score of 80.4, com-
pared to the top leaderboard system, OpenAI GPT,  
which obtains 72.8 as of the date of writing.  
It is interesting to observe that BERTLARGE sig-  
nificantly outperforms BERTBASE across all tasks,  
even those with very little training data. The effect  
of BERT model size is explored more thoroughly  
in Section 5.2.  
loss with C and W, i.e., log(softmax(CW^T)).
We use a batch size of 32 and 3 epochs over  
the data for all GLUE tasks. For each task, we ran  
fine-tunings with learning rates of 5e-5, 4e-5, 3e-5,  
and 2e-5 and selected the one that performed best  
on the Dev set. Additionally, for BERTLARGE we  
found that fine-tuning was sometimes unstable on  
4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced
6Note that we only report single-task fine-tuning results in this paper. A multitask fine-tuning approach could potentially push the results even further. For example, we did observe substantial improvements on RTE from multi-task training with MNLI.
question/answer pairs (Rajpurkar et al., 2016).
Given a question and a paragraph from Wikipedia  
8https://gluebenchmark.com/leaderboard  
containing the answer, the task is to predict the an-  
swer text span in the paragraph. For example:  
Input Question:
Where do water droplets collide with ice crystals to form precipitation?

Input Paragraph:
... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. ...

Output Answer:
within a cloud

System | Dev EM | Dev F1 | Test EM | Test F1
Leaderboard (Oct 8th, 2018)
Human | - | - | 82.3 | 91.2
#1 Ensemble - nlnet | - | - | 86.0 | 91.7
#2 Ensemble - QANet | - | - | 84.5 | 90.5
#1 Single - nlnet | - | - | 83.5 | 90.1
#2 Single - QANet | - | - | 82.5 | 89.3
Published
BiDAF+ELMo (Single) | - | 85.8 | - | -
R.M. Reader (Single) | 78.9 | 86.3 | 79.5 | 86.6
R.M. Reader (Ensemble) | 81.2 | 87.9 | 82.3 | 88.5
Ours
BERTBASE (Single) | 80.8 | 88.5 | - | -
BERTLARGE (Single) | 84.1 | 90.9 | - | -
BERTLARGE (Ensemble) | 85.8 | 91.8 | - | -
BERTLARGE (Sgl.+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8
BERTLARGE (Ens.+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2

Table 2: SQuAD results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.
This type of span prediction task is quite dif-  
ferent from the sequence classification tasks of  
GLUE, but we are able to adapt BERT to run  
on SQuAD in a straightforward manner. Just as  
with GLUE, we represent the input question and  
paragraph as a single packed sequence, with the  
question using the A embedding and the paragraph  
using the B embedding. The only new parame-  
ters learned during fine-tuning are a start vector  
S ∈ R^H and an end vector E ∈ R^H. Let the final hidden vector from BERT for the i-th input token be denoted as T_i ∈ R^H. See Figure 3 (c) for a visualization. Then, the probability of word i being the start of the answer span is computed as a dot product between T_i and S followed by a softmax over all of the words in the paragraph:

P_i = e^{S·T_i} / Σ_j e^{S·T_j}
The same formula is used for the end of the an-  
swer span, and the maximum scoring span is used  
as the prediction. The training objective is the log-  
likelihood of the correct start and end positions.  
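Putting the span formulation together with the end-after-start constraint used at inference (described below), a scoring sketch looks as follows; shapes follow the notation above and the code is illustrative only.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def best_span(T, S, E):
    """T: (seq_len, H) final hidden vectors; S, E: (H,) start and end vectors.

    Returns the (start, end) pair maximizing start/end probability with end >= start.
    """
    p_start = softmax(T @ S)
    p_end = softmax(T @ E)
    best, best_score = (0, 0), -1.0
    for i in range(len(T)):
        for j in range(i, len(T)):          # constrain the end to come after the start
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best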
We train for 3 epochs with a learning rate of 5e-5 and a batch size of 32. At inference time, since the end prediction is not conditioned on the start, we add the constraint that the end must come after the start, but no other heuristics are used. The tokenized labeled span is aligned back to the original untokenized input for evaluation.

Results are presented in Table 2. SQuAD uses a highly rigorous testing procedure where the submitter must manually contact the SQuAD organizers to run their system on a hidden test set, so we only submitted our best system for testing. The result shown in the table is our first and only Test submission to SQuAD. We note that the top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, and are allowed to use any public data when training their systems. We therefore use very modest data augmentation in our submitted system by jointly training on SQuAD and TriviaQA (Joshi et al., 2017). Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. If we fine-tune on only SQuAD (without TriviaQA) we lose 0.1-0.4 F1 and still outperform all existing systems by a wide margin.

4.3 Named Entity Recognition

To evaluate performance on a token tagging task, we fine-tune BERT on the CoNLL 2003 Named Entity Recognition (NER) dataset. This dataset consists of 200k training words which have been annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).

For fine-tuning, we feed the final hidden representation T_i ∈ R^H for each token i into a classification layer over the NER label set. The predictions are not conditioned on the surrounding predictions (i.e., non-autoregressive and no CRF). To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first
System | Dev F1 | Test F1
ELMo+BiLSTM+CRF | 95.7 | 92.2
CVT+Multi (Clark et al., 2018) | - | 92.6
BERTBASE | 96.4 | 92.4
BERTLARGE | 96.6 | 92.8

Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were selected using the Dev set, and the reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

System | Dev | Test
ESIM+GloVe | 51.9 | 52.7
ESIM+ELMo | 59.1 | 59.2
BERTBASE | 81.6 | -
BERTLARGE | 86.6 | 86.3
Human (expert)† | - | 85.0
Human (5 annotations)† | - | 88.0

Table 4: SWAG Dev and Test accuracies. Test results were scored against the hidden labels by the SWAG authors. †Human performance is measured with 100 samples, as reported in the SWAG paper.
sub-token as input to the classifier. For example:  
Jim    Hen    ##son  was  a    puppet  ##eer
I-PER  I-PER  X      O    O    O       X
Where no prediction is made for X. Since  
the WordPiece tokenization boundaries are a  
known part of the input, this is done for both  
training and test. A visual representation is also  
given in Figure 3 (d). A cased WordPiece model  
is used for NER, whereas an uncased model is  
used for all other tasks.  
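The first-sub-token labeling scheme can be sketched as below. The toy tokenizer is a stand-in for the actual WordPiece tokenizer.

def align_labels_to_wordpieces(words, labels, wordpiece_tokenize):
    """Label only the first sub-token of each word; mark continuations with 'X'."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = wordpiece_tokenize(word)      # e.g. "Henson" -> ["Hen", "##son"]
        tokens.extend(pieces)
        aligned.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, aligned

toy_pieces = {"Henson": ["Hen", "##son"], "puppeteer": ["puppet", "##eer"]}
toy_tokenizer = lambda w: toy_pieces.get(w, [w])
print(align_labels_to_wordpieces(
    ["Jim", "Henson", "was", "a", "puppeteer"],
    ["I-PER", "I-PER", "O", "O", "O"],
    toy_tokenizer))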
Results are presented in Table 3. BERTLARGE  
outperforms the existing SOTA, Cross-View  
Training with multi-task learning (Clark et al.,  
score for each choice i. The probability distribution is the softmax over the four choices:

P_i = e^{V·C_i} / Σ_{j=1}^{4} e^{V·C_j}
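A sketch of this multiple-choice scoring: each candidate ending yields a pooled vector C_i, the task vector V scores each one, and a softmax normalizes over the four choices. Shapes and values here are illustrative only.

import numpy as np

def choice_probabilities(C, V):
    """C: (4, H) pooled [CLS] vectors, one per candidate ending; V: (H,) task vector."""
    scores = C @ V                    # dot product with each choice
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()                # softmax over the four choices

H = 768
rng = np.random.default_rng(0)
print(choice_probabilities(rng.normal(size=(4, H)), rng.normal(size=H)))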
We fine-tune the model for 3 epochs with a  
learning rate of 2e-5 and a batch size of 16. Re-  
sults are presented in Table 4. BERTLARGE out-  
performs the authors’ baseline ESIM+ELMo sys-  
tem by +27.1%.  
5 Ablation Studies
2018), by +0.2 on CoNLL-2003 NER Test.
Although we have demonstrated extremely strong  
empirical results, the results presented so far have  
not isolated the specific contributions from each  
aspect of the BERT framework. In this section,  
we perform ablation experiments over a number of  
facets of BERT in order to better understand their  
relative importance.  
4.4 SWAG
The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018).
Given a sentence from a video captioning  
dataset, the task is to decide among four choices  
the most plausible continuation. For example:  
5.1 Effect of Pre-training Tasks  
One of our core claims is that the deep bidirec-  
tionality of BERT, which is enabled by masked  
LM pre-training, is the single most important im-  
provement of BERT compared to previous work.  
To give evidence for this claim, we evaluate two  
new models which use the exact same pre-training  
data, fine-tuning scheme and Transformer hyper-  
A girl is going across a set of monkey bars. She  
(i) jumps up across the monkey bars.
(ii) struggles onto the bars to grab her head.
(iii) gets to the end and stands on a wooden plank.
(iv) jumps up and does a back flip.
parameters as BERTBASE:

1. No NSP: A model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

Adapting BERT to the SWAG dataset is similar to the adaptation for GLUE. For each example, we construct four input sequences, which each contain the concatenation of the given sentence
(sentence A) and a possible continuation (sentence  
B). The only task-specific parameters we introduce  
is a vector V ∈ R^H, whose dot product with the final aggregate representation C_i ∈ R^H denotes a

2. LTR & No NSP: A model which is trained using a Left-to-Right (LTR) LM, rather than
an MLM. In this case, we predict every in-  
put word and do not apply any masking. The  
left-only constraint was also applied at fine-  
tuning, because we found it is always worse  
to pre-train with left-only-context and fine-  
tune with bidirectional context. Additionally,  
this model was pre-trained without the NSP  
task. This is directly comparable to OpenAI  
GPT, but using our larger training dataset,  
our input representation, and our fine-tuning  
scheme.  
pre-trained bidirectional models. It also hurts per-  
formance on all four GLUE tasks.  
We recognize that it would also be possible to  
train separate LTR and RTL models and represent  
each token as the concatenation of the two mod-  
els, as ELMo does. However: (a) this is twice as  
expensive as a single bidirectional model; (b) this  
is non-intuitive for tasks like QA, since the RTL  
model would not be able to condition the answer  
on the question; (c) it is strictly less powerful
than a deep bidirectional model, since a deep bidi-  
rectional model could choose to use either left or  
right context.  
Results are presented in Table 5. We first ex-  
amine the impact brought by the NSP task. We  
can see that removing NSP hurts performance sig-  
nificantly on QNLI, MNLI, and SQuAD. These  
results demonstrate that our pre-training method  
is critical in obtaining the strong empirical results  
presented previously.  
5.2 Effect of Model Size
In this section, we explore the effect of model size  
on fine-tuning task accuracy. We trained a number  
of BERT models with a differing number of layers,  
hidden units, and attention heads, while otherwise  
using the same hyperparameters and training pro-  
cedure as described previously.  
Next, we evaluate the impact of training bidi-  
rectional representations by comparing “No NSP”  
to “LTR & No NSP”. The LTR model performs  
worse than the MLM model on all tasks, with ex-  
tremely large drops on MRPC and SQuAD. For  
SQuAD it is intuitively clear that an LTR model  
will perform very poorly at span and token predic-  
tion, since the token-level hidden states have no  
right-side context. For MRPC it is unclear whether
the poor performance is due to the small data size  
or the nature of the task, but we found this poor  
performance to be consistent across a full hyper-  
parameter sweep with many random restarts.  
In order to make a good-faith attempt at strength-
ening the LTR system, we tried adding a ran-  
domly initialized BiLSTM on top of it for fine-  
tuning. This does significantly improve results on  
SQuAD, but the results are still far worse than the  
Results on selected GLUE tasks are shown in  
Table 6. In this table, we report the average Dev  
Set accuracy from 5 random restarts of fine-tuning.  
We can see that larger models lead to a strict ac-  
curacy improvement across all four datasets, even  
for MRPC which only has 3,600 labeled train-  
ing examples, and is substantially different from  
the pre-training tasks. It is also perhaps surpris-  
ing that we are able to achieve such significant  
improvements on top of models which are al-  
ready quite large relative to the existing literature.  
For example, the largest Transformer explored in  
Vaswani et al. (2017) is (L=6, H=1024, A=16)  
with 100M parameters for the encoder, and the  
largest Transformer we have found in the literature  
is (L=64, H=512, A=2) with 235M parameters  
(Al-Rfou et al., 2018). By contrast, BERTBASE  
Dev Set
Tasks | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1)
BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5
No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9
LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8
+ BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR + No NSP” model during fine-tuning.

Hyperparams | | | | Dev Set Accuracy | |
#L | #H | #A | LM (ppl) | MNLI-m | MRPC | SST-2
3 | 768 | 12 | 5.84 | 77.9 | 79.8 | 88.4
6 | 768 | 3 | 5.24 | 80.6 | 82.2 | 90.7
6 | 768 | 12 | 4.68 | 81.9 | 84.8 | 91.3
12 | 768 | 12 | 3.99 | 84.4 | 86.7 | 92.9
12 | 1024 | 16 | 3.54 | 85.7 | 86.9 | 93.3
24 | 1024 | 16 | 3.23 | 86.6 | 87.8 | 93.7

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.
contains 110M parameters and BERTLARGE con-  
tains 340M parameters.  
5.4 Feature-based Approach with BERT  
It has been known for many years that increas-  
ing the model size will lead to continual improve-  
ments on large-scale tasks such as machine trans-  
lation and language modeling, which is demon-  
strated by the LM perplexity of held-out training  
data shown in Table 6. However, we believe that  
this is the first work to demonstrate that scaling to  
extreme model sizes also leads to large improve-  
ments on very small scale tasks, provided that the  
model has been sufficiently pre-trained.  
All of the BERT results presented so far have used  
the fine-tuning approach, where a simple classifi-  
cation layer is added to the pre-trained model, and  
all parameters are jointly fine-tuned on a down-  
stream task. However, the feature-based approach,  
where fixed features are extracted from the pre-  
trained model, has certain advantages. First, not  
all NLP tasks can easily be represented by a
Transformer encoder architecture, and therefore  
require a task-specific model architecture to be  
added. Second, there are major computational  
benefits to being able to pre-compute an expensive  
representation of the training data once and then  
run many experiments with less expensive models  
on top of this representation.  
5.3 Effect of Number of Training Steps
Figure 4 presents MNLI Dev accuracy after fine-  
tuning from a checkpoint that has been pre-trained  
for k steps. This allows us to answer the following  
questions:  
In this section we evaluate how well BERT per-  
forms in the feature-based approach by generating  
ELMo-like pre-trained contextual representations  
on the CoNLL-2003 NER task. To do this, we use  
the same input representation as in Section 4.3, but  
use the activations from one or more layers with-  
out fine-tuning any parameters of BERT. These  
contextual embeddings are used as input to a ran-  
domly initialized two-layer 768-dimensional BiL-  
STM before the classification layer.  
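The feature extraction described above reduces to slicing and concatenating frozen layer activations. In this sketch, bert_layer_activations is a hypothetical function returning one (seq_len, H) array per layer; it stands in for a forward pass through the pre-trained model.

import numpy as np

def concat_last_four_layers(bert_layer_activations, tokens):
    """Concatenate the token representations from the top four hidden layers.

    The result, of shape (seq_len, 4 * H), is fed to a downstream two-layer
    BiLSTM without backpropagating into BERT.
    """
    layers = bert_layer_activations(tokens)      # e.g. 12 layers for BERT-Base
    return np.concatenate(layers[-4:], axis=-1)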
1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy?
Answer: Yes, BERTBASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps.
2. Question: Does MLM pre-training converge
slower than LTR pre-training, since only 15%  
of words are predicted in each batch rather  
than every word?  
Results are shown in Table 7. The best perform-  
ing method is to concatenate the token representa-  
tions from the top four hidden layers of the pre-  
trained Transformer, which is only 0.3 F1 behind  
fine-tuning the entire model. This demonstrates  
that BERT is effective for both the fine-tuning and  
feature-based approaches.  
Answer: The MLM model does converge  
slightly slower than the LTR model. How-  
ever, in terms of absolute accuracy the MLM  
model begins to outperform the LTR model  
almost immediately.  
Figure 4: Ablation over number of training steps. This shows the MNLI Dev accuracy after fine-tuning, starting from model parameters that have been pre-trained for k steps, for BERTBASE (Masked LM) and BERTBASE (Left-to-Right). The x-axis is the value of k in thousands of pre-training steps.

Layers | Dev F1
Finetune All | 96.4
First Layer (Embeddings) | 91.0
Second-to-Last Hidden | 95.6
Last Hidden | 94.9
Sum Last Four Hidden | 95.9
Concat Last Four Hidden | 96.1
Sum All 12 Layers | 95.5

Table 7: Ablation using BERT with a feature-based approach on CoNLL-2003 NER. The activations from the specified layers are combined and fed into a two-layer BiLSTM, without backpropagation to BERT.
6 Conclusion
Recent empirical improvements due to transfer  
learning with language models have demonstrated  
that rich, unsupervised pre-training is an integral  
part of many language understanding systems. In  
particular, these results enable even low-resource  
tasks to benefit from very deep unidirectional ar-  
chitectures. Our major contribution is further gen-  
eralizing these findings to deep bidirectional ar-  
chitectures, allowing the same pre-trained model  
to successfully tackle a broad set of NLP tasks.  
While the empirical results are strong, in some  
cases surpassing human performance, important  
future work is to investigate the linguistic phenom-  
ena that may or may not be captured by BERT.  
Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.
Kevin Clark, Minh-Thang Luong, Christopher D Man-  
ning, and Quoc V Le. 2018. Semi-supervised se-  
quence modeling with cross-view training. arXiv  
preprint arXiv:1809.08370.  
Ronan Collobert and Jason Weston. 2008. A unified  
architecture for natural language processing: Deep  
neural networks with multitask learning. In Pro-  
ceedings of the 25th International Conference on  
Machine Learning, ICML ’08.  
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of
the 2017 Conference on Empirical Methods in Nat-  
ural Language Processing, pages 670–680, Copen-  
hagen, Denmark. Association for Computational  
Linguistics.  
References  
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy  
Guo, and Llion Jones. 2018. Character-level lan-  
guage modeling with deeper self-attention. arXiv  
preprint arXiv:1808.04444.  
Andrew M Dai and Quoc V Le. 2015. Semi-supervised  
sequence learning. In Advances in neural informa-  
tion processing systems, pages 3079–3087.  
Rie Kubota Ando and Tong Zhang. 2005. A framework  
for learning predictive structures from multiple tasks  
and unlabeled data. Journal of Machine Learning  
Research, 6(Nov):1817–1853.  
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-  
Fei. 2009. ImageNet: A Large-Scale Hierarchical  
Image Database. In CVPR09.  
William B Dolan and Chris Brockett. 2005. Automati-  
cally constructing a corpus of sentential paraphrases.  
In Proceedings of the Third International Workshop  
on Paraphrasing (IWP2005).  
Luisa Bentivogli, Bernardo Magnini, Ido Dagan,  
Hoa Trang Dang, and Danilo Giampiccolo. 2009.  
The fifth PASCAL recognizing textual entailment  
challenge. In TAC. NIST.  
Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.
John Blitzer, Ryan McDonald, and Fernando Pereira.  
2006. Domain adaptation with structural correspon-
dence learning. In Proceedings of the 2006 confer-  
ence on empirical methods in natural language pro-  
cessing, pages 120–128. Association for Computa-  
tional Linguistics.  
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.
Samuel R. Bowman, Gabor Angeli, Christopher Potts,  
and Christopher D. Manning. 2015. A large anno-  
tated corpus for learning natural language inference.  
In EMNLP. Association for Computational Linguis-  
tics.  
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke  
Zettlemoyer. 2017. Triviaqa: A large scale distantly  
supervised challenge dataset for reading comprehen-  
sion. In ACL.  
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov,  
Richard Zemel, Raquel Urtasun, Antonio Torralba,  
and Sanja Fidler. 2015. Skip-thought vectors. In  
Advances in neural information processing systems,  
pages 3294–3302.  
Peter F Brown, Peter V Desouza, Robert L Mercer,  
Vincent J Della Pietra, and Jenifer C Lai. 1992.  
Class-based n-gram models of natural language.  
Computational linguistics, 18(4):467–479.  
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-  
Gazpio, and Lucia Specia. 2017. Semeval-2017  
task 1: Semantic textual similarity-multilingual and  
cross-lingual focused evaluation. arXiv preprint  
arXiv:1708.00055.  
Quoc Le and Tomas Mikolov. 2014. Distributed rep-  
resentations of sentences and documents. In Inter-  
national Conference on Machine Learning, pages  
1188–1196.  
Hector J Levesque, Ernest Davis, and Leora Morgen-  
stern. 2011. The winograd schema challenge. In  
Aaai spring symposium: Logical formalizations of  
commonsense reasoning, volume 46, page 47.  
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob  
Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz  
Kaiser, and Illia Polosukhin. 2017. Attention is all  
you need. In Advances in Neural Information Pro-  
cessing Systems, pages 6000–6010.  
Bryan McCann, James Bradbury, Caiming Xiong, and  
Richard Socher. 2017. Learned in translation: Con-  
textualized word vectors. In NIPS.  
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and  
Pierre-Antoine Manzagol. 2008. Extracting and  
composing robust features with denoising autoen-  
coders. In Proceedings of the 25th international  
conference on Machine learning, pages 1096–1103.  
ACM.  
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-  
rado, and Jeff Dean. 2013. Distributed representa-  
tions of words and phrases and their compositional-  
ity. In Advances in Neural Information Processing  
Systems 26, pages 3111–3119. Curran Associates,  
Inc.  
Alex Wang, Amapreet Singh, Julian Michael, Felix  
Hill, Omer Levy, and Samuel R Bowman. 2018.  
Glue: A multi-task benchmark and analysis platform  
for natural language understanding. arXiv preprint  
arXiv:1804.07461.  
Jeffrey Pennington, Richard Socher, and Christo-  
pher D. Manning. 2014. Glove: Global vectors for  
word representation. In Empirical Methods in Nat-  
ural Language Processing (EMNLP), pages 1532–1543.

A. Warstadt, A. Singh, and S. R. Bowman. 2018. Corpus of linguistic acceptability.
Matthew Peters, Waleed Ammar, Chandra Bhagavat-  
ula, and Russell Power. 2017. Semi-supervised se-  
quence tagging with bidirectional language models.  
In ACL.  
Adina Williams, Nikita Nangia, and Samuel R Bow-  
man. 2018. A broad-coverage challenge corpus  
for sentence understanding through inference. In  
NAACL.  
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt  
Gardner, Christopher Clark, Kenton Lee, and Luke  
Zettlemoyer. 2018. Deep contextualized word rep-  
resentations. In NAACL.  
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V  
Le, Mohammad Norouzi, Wolfgang Macherey,  
Maxim Krikun, Yuan Cao, Qin Gao, Klaus  
Macherey, et al. 2016.  
Google’s neural ma-  
chine translation system: Bridging the gap between  
human and machine translation. arXiv preprint  
arXiv:1609.08144.  
Alec Radford, Karthik Narasimhan, Tim Salimans, and  
Ilya Sutskever. 2018. Improving language under-  
standing with unsupervised learning. Technical re-  
port, OpenAI.  
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod  
Lipson. 2014. How transferable are features in deep  
neural networks? In Advances in neural information  
processing systems, pages 3320–3328.  
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and  
Percy Liang. 2016. Squad: 100,000+ questions  
for machine comprehension of text. arXiv preprint  
arXiv:1606.05250.  
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin  
Choi. 2018. Swag: A large-scale adversarial dataset  
for grounded commonsense inference. In Proceed-  
ings of the 2018 Conference on Empirical Methods  
in Natural Language Processing (EMNLP).  
Richard Socher, Alex Perelygin, Jean Wu, Jason  
Chuang, Christopher D Manning, Andrew Ng, and  
Christopher Potts. 2013. Recursive deep models  
for semantic compositionality over a sentiment tree-  
bank. In Proceedings of the 2013 conference on  
empirical methods in natural language processing,  
pages 1631–1642.  
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut-  
dinov, Raquel Urtasun, Antonio Torralba, and Sanja  
Fidler. 2015. Aligning books and movies: Towards  
story-like visual explanations by watching movies  
and reading books. In Proceedings of the IEEE  
international conference on computer vision, pages  
Wilson L Taylor. 1953. cloze procedure: A new  
tool for measuring readability. Journalism Bulletin,  
3
0(4):415–433.  
19–27.  
Erik F Tjong Kim Sang and Fien De Meulder.  
2003. Introduction to the conll-2003 shared task:  
Language-independent named entity recognition. In  
Proceedings of the seventh conference on Natural  
language learning at HLT-NAACL 2003-Volume 4,  
pages 142–147. Association for Computational Lin-  
guistics.  
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.  
Word representations: A simple and general method  
for semi-supervised learning. In Proceedings of the  
48th Annual Meeting of the Association for Compu-
tational Linguistics, ACL ’10, pages 384–394.