⁴ In the actual implementation, we use the same subword tokenization as Vaswani et al. (2018). We run it for four iterations and retain only subwords that occur at least 250 times, contain no more than 20 UTF-8 characters, and do not include more than four consecutive digits.
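The vocabulary-pruning constraints above can be sketched as a simple candidate filter. This is a minimal illustration, not the authors' implementation; the function and variable names, the example counts, and the use of code-point length as a proxy for "UTF-8 characters" are all assumptions.

```python
import re

# Thresholds as described in the text.
MIN_COUNT = 250          # subword must occur at least 250 times
MAX_CHARS = 20           # at most 20 characters
DIGIT_RUN = re.compile(r"\d{5,}")  # reject runs of more than 4 consecutive digits

def keep_subword(subword: str, count: int) -> bool:
    """Hypothetical filter applying the three retention constraints."""
    return (
        count >= MIN_COUNT
        and len(subword) <= MAX_CHARS
        and not DIGIT_RUN.search(subword)
    )

# Illustrative pruning of a candidate vocabulary (counts are made up).
candidates = {"tion": 10432, "qxz": 3, "a" * 25: 900, "12345": 800}
vocab = {s for s, c in candidates.items() if keep_subword(s, c)}
```

In this sketch only `"tion"` survives: `"qxz"` is too rare, the 25-character string is too long, and `"12345"` contains five consecutive digits.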