1 Introduction

This tutorial introduces a new and powerful set of techniques variously called “neural machine translation” or “neural sequence-to-sequence models”. These techniques have been used in a number of tasks involving the handling of human language, and can be a powerful tool in the toolbox of anyone who wants to model sequential data of some sort. The tutorial assumes that the reader knows the basics of math and programming, but does not assume any particular experience with neural networks or natural language processing. It attempts to explain the intuition behind the various methods covered, then delves into them with enough mathematical detail to understand them concretely, and culminates with a suggestion for an implementation exercise, in which readers can test that they understood the content in practice.

Background

Before getting into the details, it might be worth describing each of the terms that appear in the title “Neural Machine Translation and Sequence-to-sequence Models”. Machine translation is the technology used to translate between human languages. Think of the universal translation device that shows up in sci-fi movies to allow you to communicate effortlessly with those who speak a different language, or any of the plethora of online translation web sites that you can use to assimilate content that is not in your native language. This ability to remove language barriers, needless to say, has the potential to be very useful, and thus machine translation has been a target technology for computers since shortly after the advent of the digital computing era; a later section describes a bit more about this long history.

We call the language input to the machine translation system the source language, and call the output language the target language. Thus, machine translation can be described as the task of converting a sequence of words in the source language into a sequence of words in the target language. The goal of the machine translation practitioner is to come up with an effective model that allows us to perform this conversion accurately over a broad variety of languages and content.
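To make this conversion concrete, it helps to fix some notation. Following a convention common in the machine translation literature (the exact symbols here are an assumption of this sketch, not a fixed standard), we write the source sentence as a sequence of words F and the target sentence as a sequence of words E:

\[ F = f_1, f_2, \ldots, f_{|F|} \quad \text{(source)}, \qquad E = e_1, e_2, \ldots, e_{|E|} \quad \text{(target)}, \]

and machine translation is then the task of predicting E given F.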

[Figure: Examples of sequence-to-sequence modeling tasks.]

The second part of the title, “sequence-to-sequence models”, refers to the broader class of models that map one sequence to another. This, of course, includes machine translation, but it also covers a broad spectrum of methods used to handle other tasks, as shown in the figure above. In fact, if we think of a computer program as something that takes in a sequence of input bits, then outputs a sequence of output bits, we could say that every single program is a sequence-to-sequence model expressing some behavior (although of course in many cases this is not the most natural or intuitive way to express things).
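As a minimal illustration of this shared view, the sketch below writes the sequence-to-sequence abstraction as a single Python interface; the names (Seq2SeqModel, __call__) and the string token type are hypothetical choices for this example, not part of any particular library:

    from typing import List

    class Seq2SeqModel:
        """Hypothetical interface: any model that maps one sequence to another.

        Machine translation, summarization, dialog response generation, and
        speech recognition can all be expressed through this one signature;
        only the contents of the input and output sequences differ.
        """

        def __call__(self, source: List[str]) -> List[str]:
            raise NotImplementedError

    # For machine translation, the sequences would be words in two languages,
    # e.g. model(["kare", "wa", "hashitta"]) -> ["he", "ran"]  (illustrative)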

The motivation for using machine translation as a representative of this larger class of sequence-to-sequence models is many-fold:

  1. Machine translation is a widely-recognized and useful instance of sequence-to-sequence models, and allows us to use many intuitive examples demonstrating the difficulties encountered when trying to tackle these problems.

  2. Machine translation is often one of the main driving tasks behind the development of new models, and thus these models tend to be tailored to MT first, then applied to other tasks.

  3. Conversely, there are also cases where machine translation has learned from other tasks, and introducing these tasks helps explain the techniques used in MT as well.

Structure of this Tutorial

This tutorial starts out with a general mathematical definition of statistical techniques for machine translation in the next section. The rest of the tutorial will sequentially describe techniques of increasing complexity, leading up to attentional models, which represent the current state of the art in the field.
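As a brief preview of that definition, statistical machine translation is usually framed probabilistically: we learn a model of the conditional probability of the target sentence given the source sentence, and translate by searching for the most probable output. A sketch of this standard formulation (the symbol θ for the model parameters is a notational assumption of this sketch):

\[ \hat{E} = \arg\max_{E} P(E \mid F; \theta), \]

where θ denotes the parameters of the model, learned from example translations.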

First, Sections [sec:ngramlm]-[sec:rnnlm] focus on language models, which calculate the probability of a target sequence of interest. These models are not capable of performing translation or sequence transduction on their own, but will provide useful preliminaries for understanding sequence-to-sequence models.
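Concretely, a language model assigns a probability P(E) to a target sequence E, almost always by decomposing it word by word with the chain rule; this is the decomposition shared by the n-gram and recurrent models covered in those sections (the sentence-end symbol below is a common convention, included here as an assumption):

\[ P(E) = \prod_{t=1}^{|E|+1} P(e_t \mid e_1, \ldots, e_{t-1}), \]

where e_{|E|+1} is a special sentence-end symbol that allows the model to assign probability to sequences of any length.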

Finally, Sections [sec:encdec] and [sec:attention] describe actual sequence-to-sequence models capable of performing machine translation or other tasks.