This tutorial introduces a new and powerful set of techniques variously called “neural machine translation” or “neural sequence-to-sequence models”. These techniques have been used in a number of tasks regarding the handling of human language, and can be a powerful tool in the toolbox of anyone who wants to model sequential data of some sort. The tutorial assumes that the reader knows the basics of math and programming, but does not assume any particular experience with neural networks or natural language processing. It attempts to explain the intuition behind the various methods covered, then delves into them with enough mathematical detail to understand them concretely, and culminates with a suggestion for an implementation exercise, in which readers can test that they understood the content in practice.
Before getting into the details, it is worth describing each of the terms that appear in the title “Neural Machine Translation and Sequence-to-sequence Models”. Machine translation is the technology used to translate between human languages. Think of the universal translation device that shows up in sci-fi movies to allow you to communicate effortlessly with those who speak a different language, or any of the plethora of online translation web sites that you can use to assimilate content that is not in your native language. This ability to remove language barriers, needless to say, has the potential to be very useful, and thus machine translation has been a target technology for the use of computers, and an active area of research, since shortly after the advent of the computing era.
We call the language input to the machine translation system the source language, and call the output language the target language. Thus, machine translation can be described as the task of converting a sequence of words in the source language into a sequence of words in the target language. The goal of the machine translation practitioner is to come up with an effective model that allows us to perform this conversion accurately over a broad variety of languages and content.
The second part of the title, sequence-to-sequence models, refers to the broader class of models that map one sequence to another. This, of course, includes machine translation, but it also covers a broad spectrum of methods used to handle other tasks. In fact, if we think of a computer program as something that takes in a sequence of input bits, then outputs a sequence of output bits, we could say that every single program is a sequence-to-sequence model expressing some behavior (although of course in many cases this is not the most natural or intuitive way to express things).
The motivation for using machine translation as a representative of this larger class of sequence-to-sequence models is threefold:
Machine translation is a widely-recognized and useful instance of sequence-to-sequence models, and allows us to use many intuitive examples demonstrating the difficulties encountered when trying to tackle these problems.
Machine translation is often one of the main driving tasks behind the development of new models, and thus these models tend to be tailored to MT first, then applied to other tasks.
However, there are also cases where MT has learned from other tasks, and introducing these tasks also helps explain the techniques used in MT.
This tutorial first starts out with a general mathematical definition of statistical techniques for machine translation. The rest of the tutorial then sequentially describes techniques of increasing complexity, leading up to attentional models, which represent the current state of the art in the field.
First, Sections [sec:ngramlm]-[sec:rnnlm] focus on language models, which calculate the probability of a target sequence of interest. These models are not capable of performing translation or sequence transduction on their own, but will provide useful preliminaries for understanding sequence-to-sequence models.
Section [sec:ngramlm] describes n-gram language models, simple models that calculate the probability of words based on their counts in a set of data. It also describes how we evaluate how well these models are doing using measures such as perplexity.
The next section describes log-linear language models, models that instead calculate the probability of the next word based on features of the context. It describes how we can learn the parameters of these models through stochastic gradient descent – calculating derivatives and gradually updating the parameters to increase the likelihood of the observed data.
The section that follows introduces the concept of neural networks, which allow us to combine multiple pieces of information more easily than log-linear models, resulting in increased modeling accuracy. It gives an example of neural network language models, which calculate the probability of the next word based on a few previous words using neural networks.
Section [sec:rnnlm] introduces recurrent neural networks, a variety of neural networks that have mechanisms to allow them to remember information over multiple time steps. These lead to recurrent neural network language models, which allow for the handling of long-term dependencies that are useful when modeling language or other sequential data.
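As a concrete preview of these language-modeling sections, the following sketch shows a minimal bigram model with add-one smoothing, evaluated with perplexity. This is an illustrative toy, not code from the tutorial: the function names and the `<s>`/`</s>` sentence markers are assumptions, and real systems use larger n-grams and better smoothing.

```python
import math
from collections import defaultdict

def train_bigram(corpus):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    uni, bi = defaultdict(int), defaultdict(int)
    vocab = set()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        for prev, word in zip(toks, toks[1:]):
            uni[prev] += 1
            bi[(prev, word)] += 1
    return uni, bi, vocab

def bigram_prob(prev, word, uni, bi, vocab):
    # Add-one (Laplace) smoothing: unseen bigrams still get nonzero probability,
    # and the probabilities over the vocabulary still sum to one.
    return (bi[(prev, word)] + 1) / (uni[prev] + len(vocab))

def perplexity(sent, uni, bi, vocab):
    """Per-word perplexity: 2 to the power of the average negative log2 probability."""
    toks = ["<s>"] + sent + ["</s>"]
    logp = sum(math.log2(bigram_prob(p, w, uni, bi, vocab))
               for p, w in zip(toks, toks[1:]))
    return 2 ** (-logp / (len(toks) - 1))
```

A lower perplexity on held-out text indicates that the model assigns higher probability to the data, which is exactly the evaluation criterion discussed in Section [sec:ngramlm].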
Finally, Sections [sec:encdec] and [sec:attention] describe actual sequence-to-sequence models capable of performing machine translation or other tasks.
Section [sec:encdec] describes encoder-decoder models, which use a recurrent neural network to encode the source sequence into a vector of numbers, and another network to decode this vector of numbers into an output sentence. It also describes search algorithms used to generate output sequences based on this model.
Finally, Section [sec:attention] describes attention, a method that allows the model to focus on different parts of the input sentence while generating translations. This allows for a more efficient and intuitive method of representing sentences, and is often more effective than its simpler encoder-decoder counterpart.
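The core idea of attention at a single decoder step can be sketched in a few lines of plain Python. This is a simplified dot-product-scoring sketch over hand-written vectors; the function names are illustrative, and real models learn the encoder states and query, compute everything with matrix operations, and often use more elaborate scoring functions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, enc_states):
    """Dot-product attention: score each encoder state against the decoder query,
    normalize the scores into weights, and return the weighted sum (the context
    vector). Assumes the query and each encoder state have the same dimension."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in enc_states]
    weights = softmax(scores)
    context = [sum(w * state[i] for w, state in zip(weights, enc_states))
               for i in range(len(query))]
    return weights, context
```

The weights form a probability distribution over input positions, so the decoder can "focus" on the most relevant parts of the source sentence at each output step rather than relying on a single fixed-size vector.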