9 Conclusion

This tutorial has covered the basics of neural machine translation and sequence-to-sequence models. It gradually stepped through models of increasing sophistication, starting with n-gram language models and culminating in attentional encoder-decoder models, which now represent the state of the art in many sequence-to-sequence modeling tasks.

It should be noted that this is a very active research field, and there are a number of advanced research topics that are beyond the scope of this tutorial, but may be of interest to readers who have mastered the basics and would like to learn more.

Handling large vocabularies:

One difficulty with neural MT models is that they perform poorly with large vocabularies: it is hard to learn how to properly translate rare words from limited data, and computation becomes a burden. One method to handle this is to break words into smaller units such as characters or subwords. It is also possible to incorporate translation dictionaries with broad coverage to handle low-frequency phenomena.
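
To give a concrete flavor of the subword approach, the following sketch learns byte-pair-encoding (BPE) style merges from a toy word list; the corpus, function name, and number of merges here are illustrative assumptions, not the exact procedure of any particular cited work or toolkit.

# A minimal sketch of learning subword units with byte-pair encoding (BPE).
# The toy corpus, function name, and number of merges are illustrative
# assumptions, not the exact procedure of any particular toolkit.
import re
from collections import Counter

def learn_bpe(words, num_merges):
    # Start with each word as a sequence of characters plus an end-of-word marker.
    vocab = Counter()
    for word, count in Counter(words).items():
        vocab[" ".join(word) + " </w>"] += count
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs across the corpus.
        pairs = Counter()
        for segmented, count in vocab.items():
            symbols = segmented.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol everywhere it occurs.
        best = max(pairs, key=pairs.get)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = Counter({pattern.sub("".join(best), seg): c for seg, c in vocab.items()})
        merges.append(best)
    return merges

# Frequent character sequences become single subword units, so rare words can
# still be represented as a few known pieces.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
print(learn_bpe(corpus, num_merges=10))

Rare or unseen words can then be segmented into these learned pieces (falling back to single characters when necessary), keeping the vocabulary small while still covering the full text.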

Optimizing translation performance:

While the models presented in this tutorial are trained to maximize the likelihood of the target sentence given the source, P(E ∣ F), what we actually care about is the accuracy of the generated translations. A number of methods have been proposed to resolve this disconnect by directly considering the accuracy of the generated results during training. These include methods that sample translation results from the current model and move the parameters towards values that produce good translations, methods that optimize parameters on partially mistaken hypotheses to improve robustness to mistakes made during generation, and methods that try to prevent mistakes that may occur during the search process.
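
As one concrete (and simplified) illustration of such an objective, minimum-risk-style training minimizes the expected error of translations sampled from the model, and its gradient can be estimated from those same samples:

L_risk(θ) = E_{Ê ∼ P(Ê ∣ F; θ)}[Δ(Ê, E)],    ∇_θ L_risk(θ) = E_{Ê ∼ P(Ê ∣ F; θ)}[Δ(Ê, E) ∇_θ log P(Ê ∣ F; θ)],

where Δ(Ê, E) is an error measure comparing a sampled hypothesis Ê to the reference E (for example, 1 − BLEU), and both expectations are approximated with a small number of samples from the current model. The notation L_risk and Δ is introduced here purely for illustration; the cited works differ in how they define and optimize such objectives.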

Multi-lingual learning:

Up until now we assumed that we were training a model between two languages F and E. However, in reality there are many languages in the world, and some work has shown that we can benefit from using data from all these languages to learn models together. It is also possible to perform transfer across languages, training a model first on one language pair, then fine-tuning it to others.

Other applications:

Similar sequence-to-sequence models have been used for a wide variety of tasks, from dialog systems to text summarization, speech recognition, speech synthesis, image captioning, image generation, and more.

This is just a small sampling of topics from this exciting and rapidly expanding field, and hopefully this tutorial has given readers the tools to strike out on their own and apply these models to their applications of interest.

Acknowledgements

I am extremely grateful to Qinlan Shen and Dongyeop Kang for their careful reading of these materials and useful comments about unclear parts. I also thank the students in the Machine Translation and Sequence-to-sequence Models class at CMU for pointing out various bugs in the materials when a preliminary version was used in the class.