model” (MLM), inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, enabling us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.
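To make the MLM objective concrete, the following is a minimal sketch of how a masked training example could be constructed. The masking rate, the special-token id, and the function name are illustrative assumptions rather than BERT's exact recipe, and the construction of next-sentence-prediction pairs is not shown.

```python
import random

MASK_ID = 103     # assumed id of a [MASK] token; illustrative only
MASK_PROB = 0.15  # assumed masking rate; illustrative only
IGNORE = -100     # label for positions where no prediction is made

def make_mlm_example(token_ids):
    """Randomly mask tokens and build labels so that the model must predict
    the original vocabulary id at each masked position from its context."""
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            inputs.append(MASK_ID)  # hide the token from the model
            labels.append(tok)      # ...but keep its original id as the target
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # no loss is computed at unmasked positions
    return inputs, labels

# Toy usage: a sentence already converted to vocabulary ids.
masked_inputs, targets = make_mlm_example([7592, 2088, 2003, 1037, 3231])
```

Because the model receives the full, partially masked sequence, the prediction at a masked position can condition on tokens to both its left and its right, which is what makes deep bidirectional pre-training possible.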
The contributions of our paper are as follows:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• We show that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.

• BERT advances the state-of-the-art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained model will be made available.¹

2 Related Work

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.

2.1 Feature-based Approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).
These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.
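As a deliberately tiny illustration of this feature-based pattern, the sketch below treats pre-trained vectors as frozen input features for a separately trained downstream model. The embedding values and the featurize helper are hypothetical and are not taken from any of the cited systems.

```python
import numpy as np

# Frozen, pre-trained word vectors standing in for word2vec/GloVe-style
# embeddings; the values are made up for illustration.
pretrained = {
    "the":   np.array([0.1, 0.3, -0.2]),
    "movie": np.array([0.7, -0.1, 0.4]),
    "was":   np.array([0.0, 0.2, 0.1]),
    "great": np.array([0.9, 0.5, -0.3]),
}

def featurize(tokens):
    """Average the frozen pre-trained vectors into a fixed-size feature;
    the embeddings themselves are never updated."""
    vectors = [pretrained[t] for t in tokens if t in pretrained]
    return np.mean(vectors, axis=0)

# The resulting feature vector is fed to a task-specific model whose
# parameters are the only ones trained on the downstream task.
sentence_feature = featurize(["the", "movie", "was", "great"])
```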
ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension by extracting context-sensitive features from a language model. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state-of-the-art for several major NLP benchmarks (Peters et al., 2018), including question answering on SQuAD (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).
2.2 Fine-tuning Approaches

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on an LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved the previous state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).
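To make the fine-tuning recipe concrete, here is a schematic sketch: an encoder whose weights come from LM pre-training is paired with a small, randomly initialized task head, and every parameter is then updated on the labeled downstream task. The FineTuneClassifier wrapper, the mean pooling, and all hyperparameters are illustrative assumptions, not the architecture or interface of any of the cited systems.

```python
import torch
import torch.nn as nn

class FineTuneClassifier(nn.Module):  # hypothetical wrapper, not a library API
    """Pre-trained encoder plus a small, randomly initialized task head."""

    def __init__(self, pretrained_encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = pretrained_encoder                # weights come from LM pre-training
        self.head = nn.Linear(hidden_size, num_labels)   # the only parameters learned from scratch

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(token_embeddings)          # (batch, seq, hidden)
        sentence_repr = hidden.mean(dim=1)               # simple pooling, purely illustrative
        return self.head(sentence_repr)                  # task logits

# Toy usage: a tiny Transformer encoder stands in for the pre-trained LM.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True), num_layers=2
)
model = FineTuneClassifier(encoder, hidden_size=32, num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)  # all weights are fine-tuned jointly
logits = model(torch.randn(8, 16, 32))                     # batch of 8 sequences of length 16
```

Only the task head is new; the bulk of the parameters start from the pre-trained values, which is the sense in which few parameters are learned from scratch.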
2.3 Transfer Learning from Supervised Data
While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models that were pre-trained on ImageNet.
¹ Will be released before the end of October 2018.