Large batch training can improve training efficiency even without large scale parallel hardware through gradient accumulation, whereby gradients from multiple mini-batches are accumulated locally before each optimization step. This functionality is supported natively in fairseq (Ott et al., 2019).
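As a minimal illustrative sketch (not fairseq's actual implementation), gradient accumulation in PyTorch amounts to delaying the optimizer step until several backward passes have summed their gradients; the names `model`, `optimizer`, and `data_loader` below are assumed to be defined elsewhere, and `accumulation_steps` is a hypothetical setting.

```python
import torch
import torch.nn.functional as F

accumulation_steps = 8  # hypothetical: number of mini-batches per optimizer update

model.train()
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    outputs = model(inputs)
    loss = F.cross_entropy(outputs, targets)
    # Scale the loss so the summed gradients match one large effective batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one update per accumulation_steps mini-batches
        optimizer.zero_grad()   # reset accumulated gradients
```

The effective batch size is the mini-batch size times `accumulation_steps`, which reproduces large-batch updates on a single device at the cost of proportionally fewer optimizer steps per epoch.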