LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Yinquan Lu¹, Wenhao Zhu¹,², Lei Li³, Yu Qiao¹, Fei Yuan¹
¹ Shanghai AI Laboratory, ² Nanjing University, ³ Carnegie Mellon University
{luyinquan,yuanfei}@pjlab.org.cn, zhuwh@smail.nju.edu.cn, leili@cs.cmu.edu

Abstract

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with a specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.

1 Introduction

Figure 1: We assess translations in both directions, X $\rightarrow$ LG and LG $\rightarrow$ X, across various models using Flores-101 test, with X representing all 101 languages included in Flores-101. The results are visualized in a figure where different markers represent various models, a red marker indicates that the language (LG) is Arabic, while a blue marker indicates English. We count the number of translation directions that achieve a spBLEU score higher than 10. The findings indicate that modest LLMs demonstrate strong support for English-centric translation, but underperform in Arabic-centric translation.

Large Language Models (LLMs; Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022; OpenAI, 2023; Touvron et al., 2023a, b) excel in translation tasks involving high-resource languages (Vilar et al., 2023; Zhu et al., 2024b), yet their effectiveness in low-resource translation is suboptimal (Hendy et al., 2023; Bang et al., 2023; Zhu et al., 2024b). Figure 1 illustrates the number of translation directions with performance exceeding 10 spBLEU (Goyal et al., 2022) on Flores-101 (Goyal et al., 2022). It is evident that the majority of models are clustered around the origin for Arabic-centric translations, demonstrating a significant disparity when compared to their English-centric performance.

This discrepancy is primarily due to the lack of pre-training data for these languages (Wei et al., 2023; Yuan et al., 2023b; Alves et al., 2024). Many researchers are actively working to address this issue. Guo et al. (2024) enable LLMs to translate low-resource languages after learning from textbooks. Zhu et al. (2024b) find that cross-lingual examples can provide better task guidance for low-resource translation. In addition to efforts focused on the fine-tuning stage, some studies have attempted to train a multilingual LLM from scratch (Wei et al., 2023), or to train a language-specific LLM (Faysse et al., 2024; Alves et al., 2024; Cui et al., 2024). However, the languages covered by these works are not extensive (Wei et al., 2023; Alves et al., 2024; Luo et al., 2023), and the translation performance is still unsatisfactory (Wei et al., 2023; Alves et al., 2024; Luo et al., 2023).

Input:
    $A$: the list of all languages.
    $\mathcal{D}_{\mathrm{mono}}^{A}$: monolingual data for all languages.
    $\mathcal{D}_{\mathrm{En}}$: English monolingual data.
    $\mathcal{D}_{\mathrm{para}}^{A}$: parallel data for all translation directions. Notably, $\mathcal{D}_{\mathrm{mono}}^{A}\cap\mathcal{D}_{\mathrm{En}}=\varnothing$.
    $\boldsymbol{x}$: a single data point.
    $g(\boldsymbol{x};\boldsymbol{\varphi})$: a translation model with parameter $\boldsymbol{\varphi}$.
    $f(\boldsymbol{x};\boldsymbol{\theta})$: a large language model with parameter $\boldsymbol{\theta}$.
Output:
    $\mathcal{D}_{\mathrm{train}}$: a training dataset for the current training epoch.

$\mathcal{D}_{\mathrm{train}}=\{\}$
for $s\in A$ do
    $\mathcal{D}_{\mathrm{mono}}^{s}\subset\mathcal{D}_{\mathrm{mono}}^{A}$    // Extract an $s$-specific monolingual subset
    for $t\in A$ do
        $\mathcal{D}_{\mathrm{para}}\leftarrow\mathcal{D}_{\mathrm{para}}^{s\rightarrow t}\cup\mathcal{D}_{\mathrm{para}}^{t\rightarrow s}$
        $\mathcal{D}_{\mathrm{para}}^{s}\subset\mathcal{D}_{\mathrm{para}}$    // Extract the $s$-centric parallel subset
        if $|\mathcal{D}_{\mathrm{para}}^{s}|<25,000$ then    // The quantity of 25,000 is determined by the machine’s memory capacity
            $\mathcal{D}_{\mathrm{En}}^{s}\subset\mathcal{D}_{\mathrm{En}}$, s.t. $|\mathcal{D}_{\mathrm{En}}^{s}|=25,000-|\mathcal{D}_{\mathrm{para}}^{s}|$    // Extract an English subset for language $s$
            $\mathcal{D}_{\mathrm{En}}^{s\rightarrow t}\leftarrow g(\boldsymbol{x};\boldsymbol{\varphi})$
            $\mathcal{D}_{\mathrm{En}}^{t\rightarrow s}\leftarrow g(\boldsymbol{x};\boldsymbol{\varphi})$, where $\boldsymbol{x}\in\mathcal{D}_{\mathrm{En}}^{s}$
            $\mathcal{D}_{\mathrm{aug}}^{s\rightarrow t},\mathcal{D}_{\mathrm{aug}}^{t\rightarrow s}$    // Use the dictionary to augment
            $\mathcal{D}_{\mathrm{aug}}^{s}\leftarrow\mathcal{D}_{\mathrm{aug}}^{s\rightarrow t}\cup\mathcal{D}_{\mathrm{aug}}^{t\rightarrow s}$
        end if
    end for
    $\mathcal{D}_{\mathrm{train}}\leftarrow\mathcal{D}_{\mathrm{train}}\cup\mathcal{D}_{\mathrm{mono}}^{s}\cup\mathcal{D}_{\mathrm{para}}^{s}\cup\mathcal{D}_{\mathrm{aug}}^{s}$
end for

Algorithm 1 Illustration of the Training Data Construction Process During a Single Training Epoch
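
As a rough illustration of Algorithm 1, the loop can be sketched in Python as below. This is a minimal sketch under several assumptions that are not stated in the source: `translate` and `augment` are hypothetical callables standing in for the translation model $g$ and the dictionary-based augmentation, the synthetic pairs are assumed to be built by rendering each sampled English sentence into both $s$ and $t$, and the tuple layout of the returned entries is purely illustrative.

```python
import random


def build_epoch_data(languages, mono, para, english, translate, augment,
                     budget=25_000, seed=0):
    """Assemble one epoch of training data, loosely following Algorithm 1.

    mono:      dict lang -> list of monolingual entries
    para:      dict (src, tgt) -> list of (src_sentence, tgt_sentence)
    english:   list of English sentences, disjoint from `mono`
    translate: hypothetical callable (english_sentence, lang) -> sentence
    augment:   hypothetical callable (pairs, src, tgt) -> augmented pairs
    """
    rng = random.Random(seed)
    train = []
    for s in languages:
        # s-specific monolingual subset
        train += [("mono", s, text) for text in mono.get(s, [])]
        for t in languages:
            if t == s:
                continue
            # s-centric parallel subset: both directions of the (s, t) pair
            pairs = [("para", s, t, a, b) for a, b in para.get((s, t), [])]
            pairs += [("para", t, s, a, b) for a, b in para.get((t, s), [])]
            train += pairs
            shortfall = budget - len(pairs)
            if shortfall > 0:
                # Assumption: top up with English sentences rendered into
                # s and t, then apply dictionary-based augmentation.
                sample = rng.sample(english, min(shortfall, len(english)))
                synthetic = [(translate(x, s), translate(x, t)) for x in sample]
                train += [("aug", s, t, a, b)
                          for a, b in augment(synthetic, s, t)]
    return train
```

The `budget` default mirrors the 25,000 threshold in the algorithm; the paper attributes that value to machine memory limits, so it would likely differ on other hardware.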

To tackle this discrepancy, we conduct a massive multilingual continual pre-training for non-English languages. Firstly, we present a comprehensive analysis of critical technical designs, including vocabulary extension (Section 3.1) and data augmentation (Section 3.2). These analyses establish the groundwork for the training procedure, directly influencing the efficacy and, ultimately, the performance of the LLMs. Subsequently, we apply those strategies in continual pre-training using both parallel and monolingual data to enhance the translation performance of LLMs across the 102 languages covered by Flores-101, particularly for low-resource languages.

A primary challenge in expanding language support lies in determining the appropriate vocabulary (Cui et al., 2024; Fujii et al., 2024). To address this, we conduct a quantitative analysis, assessing the impact of adding language-specific tokens from various angles: tokenization granularity, embedding quality, and the model’s inner distribution. Introducing a small number of new tokens significantly degrades existing LLM performance, while a larger new token set increases training complexity and data requirements. Surprisingly, adhering to the original vocabulary of LLMs emerges as the most cost-effective strategy for extending LLMs to 102 languages.

Another great challenge in extending language support is the scarcity of data for low-resource languages (Chang et al., 2023; Guo et al., 2024). To alleviate the scarcity of training data, we delve into dictionary-based data augmentation (Pan et al., 2021) and conduct a comprehensive analysis of various augmentation strategies. This analysis takes into consideration different dictionaries and data sources (monolingual or parallel data). We find that the optimal approach for data augmentation involves using parallel data, with the choice of dictionary correlated to the number of target language entities it covers.

Finally, we leverage the techniques discussed above to perform large-scale, multilingual continual pre-training on LLaMA series models (Touvron et al., 2023b; AI@Meta, 2024), resulting in the LLaMAX series models (LLaMAX2 and LLaMAX3). LLaMAX2, trained over 60 days using 24 A100 GPUs, significantly enhances translation capabilities and achieves performance comparable (evaluated on Flores-101) to the specialized translation model M2M-100-12B (Fan et al., 2021). Specifically, our method demonstrates an average improvement of more than 10 spBLEU compared to baseline models in low-resource-centric translation, as shown in Table 4. Furthermore, when extending our evaluation to Flores-200 (Team et al., 2022), it shows significant performance enhancements even for languages not included in the training set. None of these translation improvements compromises general task performance. Interestingly, enhancing translation capabilities also establishes a robust multilingual foundation model. When comparing results of supervised fine-tuning using task-specific English data on the X-CSQA (Lin et al., 2021a), XNLI (Conneau et al., 2018), and MGSM (Shi et al., 2023) tasks, we observe an average improvement of 5 points over LLaMA2. Our main contributions can be summarized as follows:

• A series of open-sourced LLaMAX models that enhance translation performance across more than 100 languages.
• A comprehensive analysis of the key techniques in multilingual continual pre-training of LLMs, including vocabulary extension and data augmentation.
• Extensive experiments on key technique design, comprehensive translation benchmark evaluation across various models, general task testing, and supervised fine-tuning on task-specific data, which demonstrate the superiority of LLaMAX.

2 Training Data Construction

To build powerful LLMs that support translation across a hundred languages, it is crucial to collect and construct a sufficient amount of data.

2.1 Components of Training Data

During the continual pre-training stage, the collected training data covers 102 languages (denoted $A$, all languages supported by Flores-101) and mainly consists of two parts: monolingual ($\mathcal{D}_{\mathrm{mono}}^{A}$) and parallel ($\mathcal{D}_{\mathrm{para}}^{A}$) data. For languages with limited data availability, we generate a pseudo-parallel dataset ($\mathcal{D}_{\mathrm{aug}}$) with multilingual dictionaries: MUSE (Lample et al., 2018) and PanLex (Wang et al., 2022). More details regarding the supported languages, dataset description, and data statistics can be found in Appendix B.

Monolingual Data ( $\mathcal{D}_{\mathrm{mono}}^{A}$ ).

Our monolingual training data includes 94 languages supported by Flores-101 from MC4 (Xue et al., 2021) and MADLAD (Kudugunta et al., 2024), totaling 40,000,000 sentences. To ensure efficient handling and processing of the data, we take a strategy where each piece of monolingual data is split into multiple entries, with a block size of 512.
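
The block splitting described above can be sketched as follows. This is a minimal illustration that assumes the block size of 512 is measured in tokens and that each document is already tokenized into a list of IDs; the function name `split_into_blocks` is hypothetical and does not come from the released code.

```python
def split_into_blocks(token_ids, block_size=512):
    """Split one tokenized document into consecutive blocks.

    Every block has `block_size` tokens except possibly the last,
    which holds the remainder.
    """
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids), block_size)]
```

For example, a document of 1,200 tokens would yield three entries of 512, 512, and 176 tokens.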

Parallel Data ( $\mathcal{D}_{\mathrm{para}}^{A}$ ).

Our parallel data from Lego-MT (Yuan et al., 2023a) encompasses 102 languages, forming a total of 4,737 language pairs and 9,474 translation directions. For each translation direction, from a source language ($s$) to a target language ($t$), we concatenate the two sides of each translation pair, using only a space as a delimiter, to form a single training entry. For each language pair, each translation direction ($s\rightarrow t$ or $t\rightarrow s$) occurs with a probability of 50%. During training, the gradient is computed over the entire data entry, rather than solely over the target sentence. For language pairs that have fewer than 25,000 sentence pairs (a bound set by machine resources), we replicate the original data thrice (Muennighoff et al., 2023).
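
A minimal sketch of this entry construction is given below. The function names are hypothetical, and the sketch assumes that "thrice" means three copies in total; the source does not spell out whether the original copy is counted.

```python
import random


def make_parallel_entry(sent_s, sent_t, rng):
    """Join a sentence pair with a single space, picking s->t or t->s
    with equal probability."""
    if rng.random() < 0.5:
        return f"{sent_s} {sent_t}"
    return f"{sent_t} {sent_s}"


def build_pair_entries(pairs, rng, threshold=25_000, copies=3):
    """Build training entries for one language pair.

    Pairs below `threshold` are repeated `copies` times (assumed to be
    three copies in total) before entries are formed.
    """
    if len(pairs) < threshold:
        pairs = pairs * copies
    return [make_parallel_entry(a, b, rng) for a, b in pairs]
```

Because the direction is drawn independently for each copy, the repeated copies of a low-resource pair may appear in both directions.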

Data Generated Through Augmentation ( $\mathcal{D}_{\mathrm{aug}}$ ).

Code-switched data is obtained in two steps: 1) build multilingual lexicons; 2) construct pseudo-parallel data. We show the data augmentation process in Figure 2.

Step 1: Building multilingual lexicons.

The existing multilingual dictionaries, MUSE and PanLex, each comprise multiple bilingual dictionaries, such as en-fr, en-de, and en-zh. A dictionary comprises numerous entries, each being a word or term accompanied by its definition, usage, and other relevant information. We iterate through each entry in each bilingual dictionary and reformat it as {entity}_{language}. For instance, the English word “hello” has translations in three bilingual dictionaries (en-fr, en-de, en-zh), leading us to construct the multilingual lexicon entry hello_en, Bonjour_fr, Hallo_de, 你好_zh.
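
The merging of bilingual dictionaries into one multilingual lexicon can be sketched as below. The input layout (a mapping from a language pair to a list of word pairs) and the function name are assumptions made for illustration; the actual MUSE and PanLex file formats are not reproduced here.

```python
from collections import defaultdict


def build_multilingual_lexicon(bilingual_dicts):
    """Merge bilingual dictionaries into entries of {entity}_{language}.

    bilingual_dicts: dict (src_lang, tgt_lang) -> list of
                     (src_word, tgt_word) tuples  (assumed layout)
    Returns a dict keyed by the source-side entry, e.g. "hello_en".
    """
    lexicon = defaultdict(set)
    for (src_lang, tgt_lang), entries in bilingual_dicts.items():
        for src_word, tgt_word in entries:
            key = f"{src_word}_{src_lang}"
            lexicon[key].add(key)
            lexicon[key].add(f"{tgt_word}_{tgt_lang}")
    return {key: sorted(values) for key, values in lexicon.items()}
```

Applied to the "hello" example, the three bilingual entries collapse into a single lexicon entry containing hello_en, Bonjour_fr, Hallo_de, and 你好_zh.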

# New Token	Romanian (ro)						Bengali (bn)
# New Token	fertility	cosine	R@1	shift distance	# shift token	spBLEU	fertility	cosine	R@1	shift distance	# shift token	spBLEU
0	2.25	0.39	0.37	0.4708	112	32.50	8.62	0.17	0.01	0.4689	112	20.12
100	2.19	0.36	0.34	0.4720	112	28.75	4.96	0.14	0.02	0.4680	113	14.02
800	2.02	0.35	0.36	0.4682	113	27.78	3.21	0.13	0.02	0.4706	113	10.18
1,600	1.93	0.34	0.34	0.4690	113	26.40	2.78	0.13	0.02	0.4695	113	1.82
6,400	1.74	0.31	0.31	0.4694	113	22.66	2.15	0.12	0.02	0.4712	113	1.96
12,800	1.63	0.29	0.29	0.0205	1	21.95	1.95	0.12	0.02	-	0	1.84
25,600	1.53	0.27	0.28	-	0	19.72	1.80	0.12	0.02	-	0	2.58
51,200	1.45	0.26	0.25	0.0203	1	17.79	1.70	0.12	0.03	-	0	1.14

Table 1: Building upon LLaMA2, we add varying numbers of language-specific new tokens, fully fine-tune LLaMA2, and test the translation performance of en $\rightarrow$ ro (bn) on the Flores-101 test set. Furthermore, we assess the effect of new tokens using several metrics: fertility, the cosine similarity with English sentence embeddings, the performance in the English retrieval translation task (R@1), and the distribution shift of the original embedding vectors. Our experiments demonstrate that adding new tokens significantly complicates the learning process, underscoring that integrating new tokens is a complex task.

Step 2: Constructing pseudo-parallel data.

The foundational data for construction can be either parallel or monolingual data, as shown in Figure 2. We convert each sentence to lowercase and then split it into words on spaces (for Chinese sentences, the Jieba tokenizer is used). In parallel data processing, words in a source sentence are randomly replaced with translations from a different language using the multilingual dictionary created in Step 1. During training, the loss is computed solely on the target sentence. In monolingual data processing, each word is individually replaced with a randomly chosen word from the multilingual dictionary. If no suitable replacement word in another language is found, the original word remains unchanged. Consequently, the modified sentence and the original sentence form pseudo-parallel data. During training, the loss is computed on both the source and the target sentence.
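
The word replacement step can be sketched as follows. This is a simplified illustration: the lexicon is assumed to map a lowercase word directly to a list of candidate translations, splitting is on whitespace only (the Jieba path for Chinese is omitted), and `replace_prob` is a hypothetical knob, since the source only says that words are replaced randomly.

```python
import random


def code_switch(sentence, lexicon, rng, replace_prob=0.5):
    """Replace words with dictionary translations from other languages.

    lexicon: dict lowercase word -> list of translations (assumed layout)
    Words without an entry are left unchanged.
    """
    switched = []
    for word in sentence.lower().split():
        candidates = lexicon.get(word)
        if candidates and rng.random() < replace_prob:
            switched.append(rng.choice(candidates))
        else:
            switched.append(word)
    return " ".join(switched)
```

Setting `replace_prob=1.0` corresponds to the monolingual case described above, where every word with a dictionary entry is replaced.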

We further conduct an experimental analysis in Section 3 and find that augmentation based on parallel data outperforms augmentation based on monolingual data. Therefore, data augmentation is based solely on parallel data during continual pre-training.

2.2 Training Algorithm

Given an LLM $f(\boldsymbol{x};\boldsymbol{\theta})$ and collected training data $\{\boldsymbol{x}^{\scriptscriptstyle(i)}\}_{i=1}^{n}$, where $\boldsymbol{\theta}$ denotes the pre-trained parameters, our objective is to obtain an LLM through continual pre-training, denoted as $f(\boldsymbol{x};\boldsymbol{\theta}^{\prime})$, where $\boldsymbol{\theta}^{\prime}$ denotes the updated parameters. The goal for $f(\boldsymbol{x};\boldsymbol{\theta}^{\prime})$ is to preserve the general capabilities of the model in high-resource languages while simultaneously enhancing translation performance across all translation directions among the 102 languages. The process of constructing training data is outlined in Algorithm 1. We gather monolingual data for each language and parallel data for every translation direction. In particular, there is no augmentation for translations involving high-resource languages; we augment only the translation data that is insufficient, using a trained translation model, Lego-MT. We then train $f(\boldsymbol{x};\boldsymbol{\theta})$ with the following objective:

$$\arg\max_{\boldsymbol{\theta}}\ \sum_{i=1}^{n}\sum_{t=1}^{T_{i}}\log f(x^{\scriptscriptstyle(i)}_{\scriptscriptstyle t}|\boldsymbol{x}^{\scriptscriptstyle(i)}_{<t};\boldsymbol{\theta}) \tag{1}$$

where $T_{i}$ is the total number of decoding time steps for the $i$-th training entry.
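
To make the quantity in Equation (1) concrete, the sketch below evaluates the summed token-level log-likelihood for a toy batch. It assumes the per-step probability distributions are already available as plain lists (a real model would produce them from logits), and the function names are illustrative only.

```python
import math


def sequence_log_likelihood(step_probs, token_ids):
    """Inner sum of Equation (1) for one training entry.

    step_probs[t] is the predicted distribution over the vocabulary at
    step t given the preceding tokens; token_ids[t] is the observed token.
    """
    return sum(math.log(step_probs[t][tok])
               for t, tok in enumerate(token_ids))


def corpus_objective(batch):
    """Outer sum of Equation (1) over (step_probs, token_ids) entries."""
    return sum(sequence_log_likelihood(probs, ids) for probs, ids in batch)
```

Training maximizes this value with respect to the model parameters. Every token of an entry contributes to the sum, which matches the statement that gradients are computed over the whole entry for parallel data.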

After continual pre-training, we perform instruction tuning on LLaMAX using Alpaca (Taori et al., 2023), a dataset comprising 52,000 English instruction examples. This process enhances the model’s capability to comprehend and follow instructions without introducing additional multilingual information, resulting in LLaMAX-Alpaca. In the future, we will release a more robust instruction model fine-tuned with a multilingual instruction dataset.

Setting	spBLEU			# entity				similarity
Setting	MUSE	PanLex	$\Delta$	MUSE	PanLex	$\Delta$	ratio	MUSE	PanLex	$\Delta$
en $\rightarrow$ ta	3.74	3.45	-0.29	139,134	91,652	-47,482	0.66	0.08	0.04	-0.04
en $\rightarrow$ th	5.45	6.14	0.69	21,567	297,573	276,006	13.80	0.20	0.06	-0.14
en $\rightarrow$ fr	44.03	43.85	-0.18	139,134	568,428	429,294	4.09	0.31	0.35	0.04
en $\rightarrow$ zh	14.65	16.64	1.99	139,134	1,333,762	1,194,628	9.59	0.14	0.09	-0.05
en $\rightarrow$ es	26.98	27.36	0.38	142,780	433,468	290,688	3.04	0.28	0.32	0.04

Table 2: We evaluate a specific data augmentation technique with different dictionaries. We measure translation performance (spBLEU), the number of target language entities in the dictionary (# entity), and the average cosine similarity of entities (similarity), revealing a strong correlation between performance and “# entity”.

3 Key Technique Design

In this section, we primarily analyze two key challenges related to the extension of language support: determining an appropriate vocabulary (Section 3.1) and improving the effectiveness of data augmentation (Section 3.2). For a more detailed analysis, refer to the discussions on the selection of multi-hop translation in the lexicon (see Appendix F) and the format of parallel data during continual pre-training (see Appendix G).

3.1 An Appropriate Vocabulary: The Original One

Existing Pipeline.

Adapting pre-trained LLMs to new languages without training from scratch appears to follow a concise pipeline, which has produced models such as ChineseLLaMA2 (Cui et al., 2024) and Swallow (Fujii et al., 2024). This pipeline comprises three crucial steps: 1) vocabulary expansion: extending the vocabulary of the LLM by adding new tokens specific to the target language and initializing them as the average of embeddings from the existing tokens (Dobler and de Melo, 2023); 2) continual pre-training: further pre-training the LLM on a large corpus of text in the target language; 3) instruction tuning: aligning the model with specific tasks or instructions to enhance its performance. Instead of simply following this pipeline, we question the necessity of vocabulary expansion.
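
The initialization in step 1 can be illustrated with a short sketch. It assumes one common reading of "the average of embeddings from the existing tokens", namely that a new token starts from the mean of the embeddings of the existing sub-tokens that the original tokenizer produced for the same string. The source does not specify which existing tokens are averaged, so this choice and the function name are assumptions.

```python
def init_new_token_embedding(subtoken_ids, embedding_table):
    """Initialize a new token as the mean of existing sub-token embeddings.

    subtoken_ids:    IDs the original tokenizer assigned to the new
                     token's surface string (assumed to be non-empty)
    embedding_table: list of embedding vectors indexed by token ID
    """
    vectors = [embedding_table[i] for i in subtoken_ids]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors)
            for d in range(dim)]
```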

Setting.

We conduct a series of analytical experiments on the LLaMA2 vocabulary. Our initial focus is on examining the correlation between fertility and the quality of token representation. Here, fertility refers to the ratio of the length of the token sequence produced by the LLaMA2 tokenizer to the length of the input sentence when split by spaces (Chinese and Japanese are split by character). Furthermore, we carry out experiments using 10,000 en $\rightarrow$ ro and en $\rightarrow$ bn bilingual sentence pairs from the Lego-MT dataset. In each experiment, we introduce a varying number of language-specific new tokens and evaluate each model on Flores-101.
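
Fertility as defined above can be computed with a few lines of Python. In this sketch `tokenize` is a placeholder for any callable that maps a string to a list of tokens (for instance, a wrapper around the LLaMA2 tokenizer); the toy tokenizer in the usage note is purely illustrative.

```python
def fertility(sentence, tokenize, split_by_char=False):
    """Ratio of tokenizer output length to the number of words.

    split_by_char=True counts non-space characters instead of
    space-separated words, as done for Chinese and Japanese.
    """
    if split_by_char:
        units = [ch for ch in sentence if not ch.isspace()]
    else:
        units = sentence.split()
    return len(tokenize(sentence)) / len(units)
```

With a toy tokenizer that emits one token per non-space character, `fertility("hello world", lambda s: list(s.replace(" ", "")))` gives 5.0, meaning five tokens per word on average.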

Research Question 1: Why is adding new tokens considered a straightforward method for extending language support?

We assess the quality of representation with an en $\rightarrow$ X translation retrieval task. This task identifies the translated result that best aligns with the corresponding English sentence within an extensive target dataset, and is evaluated with recall at top 1, denoted as R@1 (Kabir and Carpuat, 2021). A higher R@1 value signifies a higher-quality representation. Concurrently, we report the cosine similarity of representations generated by LLaMA2 for identical sentences in English and other languages. In experiments across 102 languages (more details in Appendix D), there is a strong correlation between fertility and the quality of representation, evidenced by a Spearman correlation coefficient of approximately -0.88 for each assessed quality metric.
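
The R@1 metric can be sketched as below, assuming sentence embeddings are given as plain lists of floats and that retrieval uses cosine similarity. The source does not state the similarity function used for retrieval, so cosine similarity is an assumption here, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(english_embs, target_embs):
    """Fraction of English sentences whose nearest target embedding
    is their own translation.

    english_embs[i] and target_embs[i] embed the same sentence.
    """
    hits = 0
    for i, query in enumerate(english_embs):
        best = max(range(len(target_embs)),
                   key=lambda j: cosine(query, target_embs[j]))
        hits += int(best == i)
    return hits / len(english_embs)
```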

Research Question 2: Does adding new tokens to reduce fertility yield prompt performance improvements?

Extending the vocabulary is a common method to reduce fertility. However, while adding new tokens indeed reduces fertility, it does not necessarily enhance the model’s ability to capture and generalize linguistic patterns across multiple languages. As shown in Table 1, the more new tokens are added, the worse the translation performance.

Research Question 3: What is the impact of adding new tokens on model performance?

As demonstrated in Table 1, even the addition of a small number (100) of new language-specific tokens can have a significant impact on the multilingual performance of LLMs. In addition, using KS-Lottery (Yuan et al., 2024), we further analyze the embedding distribution of the original tokens (32k) and the number of shifted tokens before and after adding new tokens. For more details on KS-Lottery, refer to Appendix E. As shown by the “shift distance” and “# shift token” results in Table 1, fine-tuning the entire model with a limited number of new tokens follows a pattern similar to that with the original vocabulary. However, an excessive number of new tokens can shift the model’s training focus. This holds true regardless of whether the language is well supported by the model (ro) or not (bn). The influence of these additional tokens is substantial, indicating that the process of enhancing the multilingual capabilities of LLMs is not as straightforward as simply expanding the vocabulary and training with more multilingual data.

Setting	Aug	en-centric		ta-centric		th-centric		zh-centric
Setting	Aug	en $\rightarrow$ X	X $\rightarrow$ en	ta $\rightarrow$ X	X $\rightarrow$ ta	th $\rightarrow$ X	X $\rightarrow$ th	zh $\rightarrow$ X	X $\rightarrow$ zh
LLaMA2	✗	18.31	23.61	0.99	0.49	4.83	1.15	10.02	7.35
$\mathcal{D}_{\mathrm{P_{1}}}$	✗	19.06	25.98	3.20	0.91	7.66	3.13	11.32	7.83
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}$	✗	19.46	26.40	4.17	1.76	7.28	3.02	11.65	8.82
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{M}}$	✗	19.22	25.91	3.51	1.34	7.64	2.83	11.56	7.99
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}$ + $\mathcal{D}_{\mathrm{M}}$	✗	19.36	26.47	4.35	1.82	7.78	3.49	11.44	9.14
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}^{\prime}$	✓	19.47	26.65	4.54	1.83	7.66	3.13	11.89	9.17
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{M}}^{\prime}$	✓	18.59	25.98	3.61	1.36	6.72	2.35	10.81	6.45
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}^{\prime}$ + $\mathcal{D}_{\mathrm{M}}$	✓	19.70	26.71	4.68	1.82	8.21	3.65	12.05	9.28
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}$ + $\mathcal{D}_{\mathrm{M}}^{\prime}$	✓	19.17	26.58	4.57	1.95	7.12	3.12	11.52	7.73
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}^{\prime}$ + $\mathcal{D}_{\mathrm{M}}^{\prime}$	✓	18.80	26.56	4.78	1.79	7.31	3.18	11.35	7.28

Table 3: A comprehensive analysis of data augmentation sources reveals that using a dictionary to augment parallel data alone improves performance. “Aug” is a boolean representing whether a dictionary is used for augmentation.

Finding: The original vocabulary suffices to present the multilingualism of LLMs.

The LLaMA tokenizer, which utilizes the Byte-level Byte Pair Encoding (BBPE; Wang et al., 2019) algorithm, is the foundation for multilingual language processing tasks. Its universal compatibility across all languages, together with the absence of any need for an “unknown” token, optimizes vocabulary sharing (Yuan et al., 2023b) and improves robustness. It allows the model to understand and generate responses in various languages using the same vocabulary. Meanwhile, studies have shown that LLMs trained on unbalanced English-centric datasets often use English as an internal pivot language, mapping inputs closer to English in the internal space before generating the output (Zhu et al., 2024a; Yoon et al., 2024). Maintaining the original vocabulary helps to preserve this behavior, which in turn benefits multilingual capability.

3.2 Data Augmentation

Setting.

We take a parallel subset ($\mathcal{D}_{\mathrm{P}}$) from $\mathcal{D}_{\mathrm{para}}^{A}$ that contains translations in all directions for 6 languages (en, fr, es, zh, ta, th) and a monolingual subset ($\mathcal{D}_{\mathrm{M}}$) from $\mathcal{D}_{\mathrm{mono}}^{A}$ for the same 6 languages. We then sample, without replacement, 12,500 sentence pairs from $\mathcal{D}_{\mathrm{P}}$ in each direction to generate two parallel subsets, $\mathcal{D}_{\mathrm{P_{1}}}$ and $\mathcal{D}_{\mathrm{P_{2}}}$, respectively. We keep $\mathcal{D}_{\mathrm{P_{1}}}$ unchanged and evaluate the effect of augmentation on the parallel data $\mathcal{D}_{\mathrm{P_{2}}}$ or the monolingual data $\mathcal{D}_{\mathrm{M}}$, resulting in two new datasets after augmentation, $\mathcal{D}_{\mathrm{P_{2}}}^{\prime}$ and $\mathcal{D}_{\mathrm{M}}^{\prime}$. To assess both the in-domain and out-of-domain capabilities of the model, we run inference on 10 languages (en, fr, es, pt, de, zh, ta, th, is, zu) using Flores-101.
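
The non-repetitive sampling for one translation direction can be sketched as follows. The sketch assumes that each of the two subsets receives `size` pairs and that the two subsets do not overlap; the function name is hypothetical.

```python
import random


def split_disjoint(pairs, size, rng):
    """Draw two non-overlapping samples of `size` items each.

    Sampling 2 * size items without replacement and cutting the result
    in half guarantees that no item lands in both subsets.
    """
    picked = rng.sample(pairs, 2 * size)
    return picked[:size], picked[size:]
```

Calling `split_disjoint(direction_pairs, 12_500, random.Random(0))` once per direction would produce the two halves used for the comparison, where `direction_pairs` is a placeholder for the list of sentence pairs in that direction.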

		en-X		zh-X		de-X		ne-X		ar-X		az-X		ceb-X
System	Size	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU
Encoder-Decoder Models
M2M-100^∗ Fan et al. (2021)	418M	63.76	17.26	61.41	10.13	61.62	14.10	46.98	4.03	59.97	11.52	45.75	4.17	44.23	6.13
M2M-100^∗ Fan et al. (2021)	1.2B	70.00	21.54	67.29	13.13	67.62	17.73	56.04	7.14	62.62	12.57	52.39	6.06	52.79	9.46
M2M-100^∗ Fan et al. (2021)	12B	74.19	24.74	71.56	14.91	72.07	20.34	62.19	9.68	68.91	16.36	54.78	6.24	60.09	12.48
Lego-MT^∗ Yuan et al. (2023a)	1.2B	69.49	24.96	68.23	16.28	69.20	21.42	68.37	16.98	65.57	18.38	65.69	13.51	58.21	16.83
NLLB-200 Team et al. (2022)	1.3B	81.69	31.77	78.05	19.61	79.49	25.99	81.63	23.65	78.66	24.32	78.46	19.18	76.50	23.71
MADLAD-400 Kudugunta et al. (2024)	7B	77.79	29.19	74.07	18.23	74.73	23.15	72.74	17.74	74.53	22.14	61.29	9.92	64.44	15.29
Aya-101 Üstün et al. (2024)	13B	77.26	24.30	75.29	15.50	76.17	20.86	77.78	18.65	74.82	18.44	75.36	15.46	71.90	18.76
LLM Based Decoder-Only Models
LLaMA2 Touvron et al. (2023b)	7B	43.95	4.21	44.62	0.91	45.26	2.14	38.22	0.39	39.43	0.54	47.43	0.68	33.50	1.49
LLaMA2 Touvron et al. (2023b)	13B	31.37	0.24	34.91	0.25	31.22	0.10	35.32	0.21	32.34	0.11	36.03	0.17	30.84	0.17
LLaMA3 AI@Meta (2024)	8B	45.04	3.84	45.14	3.50	42.11	3.27	44.15	2.65	39.36	2.36	43.00	1.86	36.06	2.43
LLaMA2-Alpaca Taori et al. (2023)	7B	52.83	9.44	51.29	3.80	51.47	6.82	46.59	1.31	46.76	2.84	48.63	1.36	41.02	2.69
LLaMA2-Alpaca Taori et al. (2023)	13B	57.16	11.85	53.93	6.25	54.70	9.42	51.47	3.11	50.73	5.23	50.68	2.74	47.86	4.96
LLaMA3-Alpaca Taori et al. (2023)	8B	67.97	17.23	64.65	10.14	64.67	13.62	62.95	7.96	63.45	11.27	60.61	6.98	55.26	8.52
PolyLM Wei et al. (2023)	13B	45.16	5.72	52.41	1.42	47.89	3.59	38.00	0.45	45.82	1.04	38.65	0.57	29.74	0.77
Yayi2 Luo et al. (2023)	30B	54.13	7.80	55.23	4.38	56.48	4.72	47.88	0.92	49.45	1.73	53.06	1.23	36.75	1.87
TowerInstruct Alves et al. (2024)	7B	58.69	9.41	57.75	4.15	58.31	6.79	51.42	2.07	50.76	3.35	48.01	1.79	41.69	3.36
Aya-23 Aryabumi et al. (2024)	8B	57.91	11.18	56.65	7.20	55.69	9.30	51.78	3.50	55.49	8.00	51.45	3.27	44.14	4.24
Qwen2-Instruct (Bai et al., 2023)	7B	59.64	9.61	59.70	6.84	57.44	7.69	58.62	4.40	57.22	6.35	54.49	3.83	49.61	3.76
ChineseLLaMA2-Alpaca Cui et al. (2024)	7B	-	-	49.72	2.31	-	-	-	-	-	-	-	-	-	-
LLaMAX2-Alpaca	7B	76.66	23.17	73.54	14.17	73.82	18.96	74.64	14.49	72.00	15.82	70.91	11.34	68.67	15.53
LLaMAX3-Alpaca	8B	75.52	22.77	73.16	14.43	73.47	18.95	75.13	15.32	72.29	16.42	72.06	12.41	68.88	15.85
		X-en		X-zh		X-de		X-ne		X-ar		X-az		X-ceb
System	Size	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU
Encoder-Decoder Models
M2M-100^∗ Fan et al. (2021)	418M	68.47	21.19	62.15	10.34	60.19	14.25	40.43	1.30	63.33	11.53	49.74	2.44	47.80	4.85
M2M-100^∗ Fan et al. (2021)	1.2B	73.06	26.26	67.91	12.94	67.78	19.33	42.60	1.40	60.28	8.57	55.86	4.58	55.87	6.83
M2M-100^∗ Fan et al. (2021)	12B	74.45	28.01	69.27	13.35	70.17	21.31	45.50	2.85	69.94	15.15	61.36	6.44	57.07	8.77
Lego-MT^∗ Yuan et al. (2023a)	1.2B	75.44	30.71	71.41	16.42	70.75	23.75	59.66	15.02	70.73	18.21	66.73	11.88	59.28	15.06
NLLB-200 Team et al. (2022)	1.3B	84.22	38.60	76.75	15.27	79.50	25.71	73.70	21.84	79.85	21.80	80.02	15.55	69.05	24.72
MADLAD-400 Kudugunta et al. (2024)	7B	83.05	38.14	78.49	20.48	77.50	26.79	61.94	13.93	77.84	22.25	75.41	13.85	51.33	4.24
Aya-101 Üstün et al. (2024)	13B	80.72	31.92	78.51	22.49	77.37	15.43	69.69	17.13	77.90	16.54	78.70	13.51	67.76	21.58
LLM Based Decoder-Only Models
LLaMA2 Touvron et al. (2023b)	7B	55.46	11.80	43.50	0.55	43.10	3.22	34.41	0.42	39.13	0.25	43.98	0.59	41.64	1.16
LLaMA2 Touvron et al. (2023b)	13B	38.25	0.75	37.06	0.22	31.73	0.25	30.13	0.15	33.68	0.06	33.47	0.08	37.49	0.20
LLaMA3 AI@Meta (2024)	8B	67.66	19.81	42.52	1.37	49.42	6.61	33.38	0.52	34.12	0.49	37.27	0.79	37.97	1.41
LLaMA2-Alpaca Taori et al. (2023)	7B	65.85	16.44	56.53	4.46	56.76	9.01	34.96	1.03	44.10	2.18	40.67	0.63	45.69	1.73
LLaMA2-Alpaca Taori et al. (2023)	13B	68.72	19.69	64.46	8.80	62.86	12.57	38.88	2.16	52.08	4.48	41.18	0.87	48.47	2.51
LLaMA3-Alpaca Taori et al. (2023)	8B	77.43	26.55	73.56	13.17	71.59	16.82	46.56	3.83	66.49	10.20	58.30	4.81	52.68	4.18
PolyLM Wei et al. (2023)	13B	50.98	7.75	42.60	1.20	43.95	3.69	33.69	0.36	42.27	1.67	40.24	0.44	39.29	0.96
Yayi2 Luo et al. (2023)	30B	68.06	19.37	57.81	6.07	53.82	5.62	40.95	0.48	46.61	0.52	49.29	0.71	45.50	1.71
TowerInstruct Alves et al. (2024)	7B	65.37	18.87	64.26	10.37	60.73	12.81	38.80	0.62	44.72	0.39	47.17	0.71	47.15	2.24
Aya-23 Aryabumi et al. (2024)	8B	67.53	20.57	66.11	11.20	63.09	14.09	44.33	2.69	63.59	11.84	46.97	1.19	45.17	2.29
Qwen2-Instruct (Bai et al., 2023)	7B	73.25	19.04	72.52	13.52	64.61	11.33	41.41	2.27	64.94	8.50	47.96	1.66	55.45	3.00
ChineseLLaMA2-Alpaca Cui et al. (2024)	7B	-	-	55.06	6.15	-	-	-	-	-	-	-	-	-	-
LLaMAX2-Alpaca	7B	80.55	30.63	75.52	13.53	74.47	19.26	67.36	15.47	75.40	15.32	72.03	10.27	65.05	16.11
LLaMAX3-Alpaca	8B	81.28	31.85	78.34	16.46	76.23	20.64	65.83	14.16	75.84	15.45	70.61	9.32	63.35	12.66

Table 4: Comparison across architectures, including encoder-decoder and decoder-only models, on the Flores-101 dataset, where X refers to any of the 101 languages. ^∗ indicates that model comparisons are restricted to 85 languages, denoted as |X| = 85. We make this choice because the M2M-100 baselines cover only 86 languages, as reported for Flores-101 (Goyal et al., 2022; Yuan et al., 2023a). This table compares our instruction-aligned LLaMAX2 model (LLaMAX2-Alpaca) with the instruction-aligned LLaMA2 model (LLaMA2-Alpaca) to demonstrate the benefits of our multilingual continual pre-training. Additionally, we compare LLaMAX with other open-source multilingual-focused LLMs to highlight its multilingual capabilities.

System	Size	TED (en-X)		TED (X-en)		TICO (en-X)		WMT23 (en-X)		WMT23 (X-en)
System	Size	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU
LLaMA2 Touvron et al. (2023b)	7B	52.15	3.34	61.54	8.66	39.63	3.45	51.55	2.96	65.68	14.87
LLaMA2 Touvron et al. (2023b)	13B	34.66	0.17	40.87	0.49	31.65	0.42	33.74	0.43	41.18	0.85
LLaMA3 AI@Meta (2024)	8B	44.72	2.09	53.56	6.04	40.02	4.82	47.44	2.61	55.18	7.84
LLaMA2-Alpaca Taori et al. (2023)	7B	62.04	9.15	68.62	12.67	44.73	8.60	73.17	17.23	75.82	24.97
LLaMA2-Alpaca Taori et al. (2023)	13B	65.62	11.40	70.74	14.54	48.64	10.79	77.93	21.60	77.90	28.67
LLaMA3-Alpaca Taori et al. (2023)	8B	73.20	14.13	75.03	16.83	56.73	14.49	80.05	24.11	79.22	29.76
PolyLM Wei et al. (2023)	13B	50.18	5.53	55.16	7.28	40.36	7.17	62.67	10.62	69.15	19.09
Yayi2 Luo et al. (2023)	30B	61.53	8.54	70.92	14.09	47.02	7.91	65.69	10.76	75.60	20.47
TowerInstruct Alves et al. (2024)	7B	64.83	8.22	70.91	15.29	50.48	10.14	74.03	18.42	80.08	30.03
Qwen2-Instruct (Bai et al., 2023)	7B	66.68	8.84	71.83	13.37	55.16	11.47	75.11	18.86	77.48	25.61
Aya-23 Aryabumi et al. (2024)	8B	68.06	10.69	72.87	16.44	52.44	12.98	83.29	27.15	82.00	31.21
LLaMAX2-Alpaca	7B	75.58	16.12	76.18	17.81	68.33	19.79	80.17	23.91	79.55	30.30
LLaMAX3-Alpaca	8B	74.95	15.15	76.99	18.47	67.71	20.06	79.96	24.49	79.88	30.34

Table 5: Benchmarking results on the WMT23, TED, and TICO datasets. X denotes various languages across different translation benchmarks; detailed information is available in Appendix B. Evaluation results across these benchmarks further validate the strong multilingual translation capabilities of LLaMAX.

Finding: The choice of dictionary is related to the number of entities for the language in the dictionary.

As shown in Table 2, no clear dictionary preference is observed for en/ta/th/zh-centric translation, with the best performance distributed seemingly at random across the two dictionaries. Furthermore, we conduct an in-depth analysis of the MUSE and PanLex dictionaries for translation from en to another 5 languages. We compare the end-to-end translation performance (spBLEU), the number of target-language entities in the dictionary (# entity), and the similarity of entity embeddings (a simple average of entity token embeddings) extracted from the trained model, and find a clear correlation between translation performance and # entity.
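The entity-embedding comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the toy vectors are placeholders rather than values from the trained model, and cosine similarity is an assumed choice, since the text only states that an entity embedding is a simple average of its token embeddings.

```python
import math


def entity_embedding(token_vectors):
    """Average the token embeddings of one entity (one vector per token)."""
    if not token_vectors:
        raise ValueError("an entity needs at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-d token vectors standing in for embeddings read from a model.
src_entity = entity_embedding([[1.0, 0.0], [0.0, 1.0]])  # two-token entity
trg_entity = entity_embedding([[2.0, 2.0]])              # one-token entity
print(cosine_similarity(src_entity, trg_entity))
```

Averaging makes entities with different token counts comparable in the same vector space, which is why a multi-token source entity can be compared directly with a single-token target entity.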

4 Benchmarking Results

In this section, we present multilingual benchmarking results to comprehensively demonstrate the potential of LLaMAX2. We evaluate translation quality with spBLEU (Goyal et al., 2022) and COMET-22 (Rei et al., 2020) for both LLMs and translation models. See Appendix C for training details on LLaMAX2 and descriptions of the baseline models.

	Knowledge			Commonsense Reasoning		Math Reasoning		Code		Avg.
	MMLU	BBH	NQ	HellaSwag	Winogrande	GSM8K	Math	HumanEval	MBPP	Avg.
LLaMA2-Alpaca	44.22	37.95	24.32	31.12	61.09	14.03	3.82	14.63	27.63	28.76
LLaMAX2-Alpaca	44.60	38.25	23.21	33.75	61.48	12.21	3.74	12.20	25.29	28.30

Table 6: Evaluation results, assessed by OpenCompass (Contributors, 2023), on monolingual general benchmarks.

We significantly enhance the multilingual translation capabilities of the base LLaMA2 model through massive multilingual continual pre-training.

The benefit of our continual pre-training is that it enhances the base LLM’s multilingual translation capabilities. Evaluation results on the Flores-101 benchmark are shown in Table 4. By comparing our multilingual-enhanced model with the base LLaMA2 model in instruction-tuned versions (LLaMAX2-Alpaca vs. LLaMA2-Alpaca), we consistently observe a significant performance improvement on both English-centric and non-English-centric translation. In addition to Flores-101, we also evaluate on a range of diverse translation benchmarks (Table 5). The performance enhancement brought by our multilingual continual pre-training is consistent across these benchmarks.

LLaMAX outperforms other open-source decoder-only LLMs on multilingual translation by a large margin.

Next, we compare our language-extended LLaMAX2-Alpaca model with other open-source decoder-only LLMs built for multilingual purposes (Table 4, Table 5). Compared to LLMs trained from scratch, such as PolyLM and Yayi2, LLaMAX2 consistently shows better performance across various multilingual translation benchmarks, indicating that the LLaMA2 base model provides a strong foundation for language extension. Furthermore, when compared to other LLaMA-based continual pre-trained models, such as TowerInstruct, LLaMAX2 also achieves superior performance, demonstrating the effectiveness of our optimized continual pre-training pipeline.

LLaMAX benefits unseen long-tail low-resource languages as well.

A significant challenge in multilingual enhancement is that the substantial cost of collecting scarce multilingual resources makes it prohibitive to cover a massive number of languages. While our multilingual pre-training corpus already covers 102 languages, we acknowledge that there remains a large group of long-tail, low-resource languages that are not well covered. To assess the generalization capability of LLaMAX2, we evaluate it on the Flores-200 dataset and observe its performance on these unseen languages (Figure 4). We find that for languages not encountered during training, LLaMAX2 still achieves significant improvements, demonstrating the generalization capability of our massive continual pre-training.

LLaMAX is closing the performance gap between open-source LLM translators and specialized encoder-decoder translation systems.

While LLaMAX2 has achieved state-of-the-art translation performance among open-source decoder-only LLMs, the next critical question is whether we can close the gap between LLMs and specialized encoder-decoder translation systems. Table 4 provides a comprehensive comparison, revealing that LLaMAX2 has reached the level of the M2M-100-12B model. Future work will be needed to optimize the language extension framework to match the performance of advanced translation systems, e.g., MADLAD-400.

X	LLaMA2-Alpaca		ChineseLLaMA2-Alpaca		LLaMAX2-Alpaca		X	LLaMA2-Alpaca		ChineseLLaMA2-Alpaca		LLaMAX2-Alpaca
X	$R_{\mathrm{zh}}$	$R_{\mathrm{X}}$	$R_{\mathrm{zh}}$	$R_{\mathrm{X}}$	$R_{\mathrm{zh}}$	$R_{\mathrm{X}}$	X	$R_{\mathrm{zh}}$	$R_{\mathrm{X}}$	$R_{\mathrm{zh}}$	$R_{\mathrm{X}}$	$R_{\mathrm{zh}}$	$R_{\mathrm{X}}$
af	0.20	28.36	31.32	0.10	0.30	79.84	ln	0.30	0.00	66.40	0.00	0.00	0.00
am	1.09	40.12	67.29	21.15	0.00	89.23	lo	1.38	32.71	89.03	0.10	0.00	58.30
ar	2.17	81.23	72.92	24.70	0.00	99.80	lt	1.09	14.13	50.69	24.31	0.20	96.34
as	8.40	0.59	84.39	0.30	0.00	76.78	luo	5.83	0.00	87.65	0.00	1.38	0.00
ast	0.30	0.20	18.77	0.10	0.00	33.20	lv	0.30	15.51	52.67	15.42	0.20	97.73
az	0.20	18.87	39.23	4.25	0.00	96.44	mi	0.49	0.00	59.58	0.00	0.00	0.00
be	0.10	49.11	2.96	2.87	0.00	99.70	mk	0.40	17.19	7.31	21.94	0.00	99.31
bg	2.37	44.66	29.74	30.24	0.30	98.62	ml	8.20	12.15	79.55	7.51	0.49	51.88
bn	3.95	44.96	78.75	17.79	0.10	99.60	mn	1.58	17.49	85.67	1.48	0.00	99.51
bs	0.40	2.17	8.10	1.98	0.10	4.25	mr	0.40	19.86	31.42	1.58	0.00	99.01
ca	0.30	90.12	5.14	79.84	0.00	98.91	ms	0.59	5.93	20.36	3.95	0.00	43.18
ceb	0.20	21.94	6.72	16.01	0.00	95.55	mt	0.20	63.44	29.15	25.00	0.00	97.13
cs	0.20	54.55	24.90	38.14	0.30	94.76	my	1.78	47.33	38.74	29.74	0.00	99.90
cy	0.30	19.66	20.55	44.66	0.00	98.81	ne	0.49	35.77	71.64	3.06	0.00	98.72
da	0.30	49.01	22.73	39.72	0.49	91.80	nl	0.30	65.81	4.55	65.22	0.10	94.76
de	0.79	70.55	10.97	75.69	0.30	96.94	no	0.99	32.21	22.53	28.06	0.20	88.74
el	0.69	21.25	52.67	28.26	0.00	100.00	ns	0.20	0.00	38.74	0.00	0.10	0.00
en	0.00	100.00	0.30	99.70	0.00	100.00	ny	0.59	0.00	60.08	0.00	0.20	0.00
es	0.10	96.94	4.74	93.08	0.00	99.51	oc	0.10	0.79	20.55	0.30	0.40	59.39
et	2.27	8.50	75.49	2.96	0.10	96.34	om	0.20	0.00	38.04	0.00	0.20	0.00
fa	0.40	45.95	34.49	57.61	0.00	98.12	or	1.28	37.35	62.65	1.78	0.00	99.80
ff	0.49	0.00	73.81	0.00	0.59	0.00	pa	1.28	49.41	39.62	5.43	0.00	100.00
fi	3.95	55.43	65.22	17.59	0.30	97.13	pl	0.20	64.33	12.55	58.50	0.00	98.42
fr	0.10	94.17	3.46	92.98	0.00	98.72	ps	0.99	20.16	39.03	0.49	0.00	97.83
ga	0.20	19.37	8.70	6.82	0.00	93.08	pt	0.30	84.39	5.34	79.84	0.10	98.42
gl	0.20	0.89	26.19	0.10	0.20	83.99	ro	0.10	19.57	26.98	42.39	0.20	87.15
gu	0.59	36.96	45.65	29.74	0.00	99.60	ru	0.69	79.74	46.64	37.06	0.10	99.01
ha	0.79	0.00	67.98	0.00	0.10	0.00	sd	0.89	7.41	41.70	0.20	0.00	95.16
he	1.68	58.70	65.51	31.03	0.00	100.00	sk	0.40	20.26	25.40	3.56	0.10	97.23
hi	0.79	50.79	55.83	23.81	0.00	98.91	sl	1.19	37.25	49.60	16.21	0.69	91.90
hr	0.49	41.60	20.95	20.36	0.10	69.66	sn	0.49	0.00	34.58	0.00	0.10	0.00
hu	0.40	64.33	27.47	38.74	0.10	97.13	so	0.30	8.70	58.70	0.20	0.10	57.71
hy	4.74	47.13	79.15	12.15	0.00	99.60	sr	0.59	12.45	17.89	18.87	0.20	48.02
id	0.49	81.92	16.21	60.38	0.00	95.85	sv	0.10	47.33	46.94	25.00	0.10	96.94
ig	0.20	0.00	51.48	0.00	0.10	0.00	sw	0.20	39.23	36.86	22.73	0.00	94.66
is	0.40	35.08	40.02	28.46	0.20	92.98	ta	1.48	24.41	55.24	34.09	0.00	98.62
it	0.49	79.55	3.36	77.57	0.10	98.42	te	1.38	38.93	69.47	28.56	0.00	99.60
ja	48.02	16.70	28.36	70.95	6.62	92.00	tg	1.28	2.77	44.86	7.61	0.20	97.04
jv	0.20	0.00	13.83	0.00	0.00	64.62	th	1.28	58.60	71.25	28.56	0.00	100.00
ka	3.56	31.72	70.06	4.74	0.00	99.80	tl	0.20	66.70	32.91	45.75	0.00	98.91
kam	0.99	0.00	65.51	0.00	1.58	0.00	tr	0.89	37.94	48.02	31.42	0.00	95.65
kea	0.59	0.00	35.47	0.00	0.40	0.00	uk	0.49	71.54	10.38	28.06	0.49	98.62
kk	0.99	45.95	37.06	29.45	0.00	98.32	umb	0.59	0.00	54.94	0.00	0.30	0.00
km	1.58	29.25	58.89	28.26	0.00	100.00	ur	1.68	19.86	75.49	14.82	0.10	96.54
kn	3.16	38.24	75.59	14.72	0.00	100.00	uz	0.20	30.24	58.99	2.77	0.10	89.92
ko	3.85	71.94	75.69	23.52	0.00	98.02	vi	0.10	92.69	13.44	81.13	0.00	99.70
ku	0.10	14.13	31.72	0.00	0.40	75.20	wo	0.30	0.00	56.62	0.00	0.49	0.00
ky	1.19	25.99	48.62	4.35	0.00	99.11	xh	0.20	0.00	40.51	0.00	0.10	0.00
lb	0.10	24.21	30.73	0.40	0.59	89.53	yo	0.10	3.56	57.91	0.40	0.10	15.81
lg	10.57	0.00	79.35	0.00	6.13	0.00	zhtrad	98.12	0.00	98.42	0.00	99.51	0.00
							zu	0.20	0.00	45.55	0.00	0.10	0.00

Table 7: Using langdetect (Joulin et al., 2016), we individually identify the language of the translation output in zh $\rightarrow$ X (where X represents any of the 101 languages included in Flores-101) for the LLaMA2-Alpaca, ChineseLLaMA2-Alpaca, and LLaMAX2-Alpaca models on the Flores-101 devtest. $R_{\mathrm{zh}}$ refers to the proportion of sentences in the zh $\rightarrow$ X translation output where the top predicted language is Chinese. $R_{\mathrm{X}}$, on the other hand, denotes the proportion where the top prediction corresponds to the target translated language.

LLaMAX provides a better starting point for specialized instruction-tuning on English task data.

Finally, we demonstrate the usage of our continual pre-trained model (LLaMAX2) on tasks beyond translation. While in previous experiments we use basic Alpaca instruction data to teach the LLM to follow translation instructions, we now show that our released checkpoint can be empowered to handle more multilingual tasks beyond translation. Figure 3 presents three example tasks where we use specialized instruction data to unlock LLaMAX2’s abilities on specific tasks, such as math reasoning and common sense reasoning. We find that the instruction-tuned LLaMAX2 model outperforms its LLaMA2 counterpart in non-English performance across all three tasks, demonstrating that LLaMAX2 provides a better starting point for instruction-tuning with task-specific data.

Direct	BLEU				COMET
	LLaMA3-Alpaca		LLaMAX2-Alpaca		LLaMA3-Alpaca		LLaMAX2-Alpaca
	src $\rightarrow$ trg	src $\rightarrow$ en $\rightarrow$ trg	src $\rightarrow$ trg	src $\rightarrow$ en $\rightarrow$ trg	src $\rightarrow$ trg	src $\rightarrow$ en $\rightarrow$ trg	src $\rightarrow$ trg	src $\rightarrow$ en $\rightarrow$ trg
zh $\rightarrow$ x	10.14	11.34	14.17	15.54	64.65	66.61	73.54	74.74
x $\rightarrow$ zh	13.17	15.37	13.53	15.11	73.56	75.66	75.52	77.21
de $\rightarrow$ x	13.62	14.24	18.96	19.38	64.67	65.79	73.82	74.36
x $\rightarrow$ de	16.82	18.08	19.26	20.71	71.59	73.11	74.47	76.04
ar $\rightarrow$ x	11.27	12.60	15.82	17.10	63.45	65.33	72.00	73.17
x $\rightarrow$ ar	10.20	10.88	15.32	16.00	66.49	69.54	75.40	76.32
ne $\rightarrow$ x	7.96	10.29	14.49	16.16	62.95	67.87	74.64	76.86
x $\rightarrow$ ne	3.83	7.08	15.47	16.86	46.56	58.89	67.36	69.47
az $\rightarrow$ x	6.98	9.52	11.34	13.54	60.61	65.16	70.91	73.60
x $\rightarrow$ az	4.81	6.96	10.27	11.44	58.30	67.52	72.03	75.60
ceb $\rightarrow$ x	8.52	10.69	15.53	16.98	55.26	60.71	68.67	70.76
x $\rightarrow$ ceb	4.18	7.17	16.11	18.94	52.68	59.55	65.05	66.52
Avg.	9.29	11.19	15.02	16.48	61.73	66.31	71.95	73.72

Table 8: Using English (en) as a pivot, we can convert a translation task from the source language (src) to the target language (trg), represented as src $\rightarrow$ trg, to src $\rightarrow$ en $\rightarrow$ trg. The experimental results indicate that, even with English as a powerful pivot, performance falls short of LLaMAX2-Alpaca (LLaMA3 pivot translation vs. LLaMAX2-Alpaca). Furthermore, conducting similar pivot translation experiments on LLaMAX2-Alpaca can further improve translation performance.
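The comparisons named in this caption can be read directly off the Avg. row of Table 8. The short sketch below only does that arithmetic; the dictionary layout and the helper name `gain` are illustrative choices, while the numbers are copied from the Avg. row.

```python
# Avg. row of Table 8: (direct src->trg, pivot src->en->trg) per model and metric.
AVG = {
    "BLEU": {"LLaMA3-Alpaca": (9.29, 11.19), "LLaMAX2-Alpaca": (15.02, 16.48)},
    "COMET": {"LLaMA3-Alpaca": (61.73, 66.31), "LLaMAX2-Alpaca": (71.95, 73.72)},
}


def gain(new, old):
    """Difference between two scores, rounded to the table's two decimals."""
    return round(new - old, 2)


for metric, rows in AVG.items():
    l3_direct, l3_pivot = rows["LLaMA3-Alpaca"]
    lx_direct, lx_pivot = rows["LLaMAX2-Alpaca"]
    print(metric)
    print("  pivot gain on LLaMA3-Alpaca:        ", gain(l3_pivot, l3_direct))
    print("  pivot gain on LLaMAX2-Alpaca:       ", gain(lx_pivot, lx_direct))
    print("  LLaMAX2 direct minus LLaMA3 pivot:  ", gain(lx_direct, l3_pivot))
```

On average, pivoting through English helps both models, yet direct translation with LLaMAX2-Alpaca still scores above pivot translation with LLaMA3-Alpaca on both metrics.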

LLaMAX circumvents catastrophic forgetting issue.

A common concern with continual pre-training on an additional multilingual corpus is that the process might disturb the parametric knowledge and working pattern of the original model, a phenomenon known as catastrophic forgetting (Goodfellow et al., 2013). To examine this, we compare LLaMAX2 with LLaMA2 on popular English benchmarks that measure a diverse set of core capabilities of LLMs. Experimental results in Table 6 show that the two models achieve very similar performance on these benchmarks (more details about these benchmarks are in Appendix B), demonstrating that our continual pre-training does not compromise the general capability of the base model.

Comparison between LLaMAX2-Alpaca and language-specific LLMs.
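To make the size of the differences in Table 6 concrete, the following sketch recomputes the per-benchmark deltas and the two averages from the table rows. The variable names are illustrative; the scores are copied from Table 6.

```python
BENCHMARKS = ["MMLU", "BBH", "NQ", "HellaSwag", "Winogrande",
              "GSM8K", "Math", "HumanEval", "MBPP"]
LLAMA2_ALPACA = [44.22, 37.95, 24.32, 31.12, 61.09, 14.03, 3.82, 14.63, 27.63]
LLAMAX2_ALPACA = [44.60, 38.25, 23.21, 33.75, 61.48, 12.21, 3.74, 12.20, 25.29]


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# Positive delta: LLaMAX2-Alpaca scores higher than LLaMA2-Alpaca.
deltas = {
    name: round(new - old, 2)
    for name, old, new in zip(BENCHMARKS, LLAMA2_ALPACA, LLAMAX2_ALPACA)
}
avg_llama2 = round(mean(LLAMA2_ALPACA), 2)
avg_llamax2 = round(mean(LLAMAX2_ALPACA), 2)

for name, delta in deltas.items():
    print(f"{name:<11}{delta:+.2f}")
print(f"Avg.       {avg_llama2:.2f} -> {avg_llamax2:.2f}")
```

The recomputed averages match the Avg. column (28.76 and 28.30). The individual deltas fall between -2.43 (HumanEval) and +2.63 (HellaSwag), with gains on the knowledge and commonsense benchmarks other than NQ and small drops on the math and code benchmarks.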

The comparison between LLaMAX2-Alpaca, ChineseLLaMA2-Alpaca, and Swallow (a Japanese-specific LLM) explores the difference between the traditional pipeline for enhancing specific language capabilities based on existing pre-trained models and our proposed recipe. As shown in Figure 5, we evaluate language-specific LLMs to translate from the enhanced language to any of the 101 languages on Flores-101 and find that their performance is not significantly different from the original LLaMA2 model, but there exists a notable performance gap compared to LLaMAX2-Alpaca. As we described in Section 3.1, excessively adding new language-specific tokens can shift the focus of training the LLM.

In addition, we conduct a deeper analysis of translation output to identify the factors contributing to the limited improvement in translation performance. The experimental results in Table 7 indicate that the language-specific LLM obtained through the traditional pipeline tends to output specific languages, while LLaMAX2 can accurately produce the answer with the corresponding language. We provide more analysis on Swallow in Appendix I, and arrive at the same conclusion.
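The two rates reported in Table 7 can be computed as sketched below. This is an illustration under stated assumptions: `language_rates` is a hypothetical helper, and the input is taken to be the top-1 language code that a language identifier returned for each output sentence (the identifier itself is not called here).

```python
from collections import Counter


def language_rates(predicted_langs, target_lang, source_lang="zh"):
    """Return (R_source, R_target) as percentages of output sentences.

    predicted_langs: top-1 language code predicted for each translated sentence.
    """
    if not predicted_langs:
        raise ValueError("need at least one predicted language")
    counts = Counter(predicted_langs)
    total = len(predicted_langs)
    r_source = 100.0 * counts[source_lang] / total
    r_target = 100.0 * counts[target_lang] / total
    return r_source, r_target


# Toy labels standing in for identifier output on four zh -> de translations.
predictions = ["de", "zh", "de", "en"]
r_zh, r_x = language_rates(predictions, target_lang="de")
print(f"R_zh = {r_zh:.2f}, R_X = {r_x:.2f}")
```

The two rates need not sum to 100, because an output can be identified as a third language, as in the `"en"` label above.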

Going beyond English-centric translation is more efficient and effective.

We further investigate the necessity and feasibility of multilingual augmentation for an English-centric LLM. We can effectively transform a translation task (src $\rightarrow$ trg) from the source language (src) to the target language (trg) into src $\rightarrow$ en and en $\rightarrow$ trg, which allows us to leverage the power of English as a central language, facilitating seamless communication and comprehension across various language pairs. We refer to this experimental setup as a pivot translation experiment. As shown in Table 8, the experimental results demonstrate that the pivot translation experiments effectively leverage the power of English to enhance translation performance (comparing src $\rightarrow$ en $\rightarrow$ trg to src $\rightarrow$ trg on the same model), although it still falls short of the results obtained from large-scale multilingual continual pre-trained models (LLaMA3-Alpaca src $\rightarrow$ en $\rightarrow$ trg vs. LLaMAX2-Alpaca src $\rightarrow$ trg). Interestingly, conducting pivot translation experiments based on LLaMAX2-Alpaca reveals the potential for significant improvements in translation performance (LLaMAX2-Alpaca src $\rightarrow$ en $\rightarrow$ trg vs. LLaMAX2-Alpaca src $\rightarrow$ trg).
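The pivot setup amounts to chaining two translation calls. The sketch below assumes a hypothetical `translate(text, src, trg)` callable that wraps whichever model is being evaluated; a toy lookup table stands in for the model so that the example runs on its own.

```python
def pivot_translate(text, src, trg, translate, pivot="en"):
    """Translate src -> trg through a pivot language (src -> pivot -> trg).

    translate: callable (text, src_lang, trg_lang) -> str, assumed to wrap a model.
    """
    if src == pivot or trg == pivot:
        # One side is already the pivot, so a single direct call is enough.
        return translate(text, src, trg)
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, trg)


# Toy stand-in for a model: a lookup keyed by (src, trg, text).
TOY_TABLE = {
    ("zh", "en", "你好"): "hello",
    ("en", "de", "hello"): "hallo",
}


def toy_translate(text, src, trg):
    return TOY_TABLE[(src, trg, text)]


print(pivot_translate("你好", "zh", "de", toy_translate))  # prints "hallo"
```

Each pivoted request costs two model calls instead of one, and any error made in the first hop is carried into the second.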

5 Related Work

Multilingual Large Language Models.

Large Language Models (LLMs; OpenAI, 2023; Zhang et al., 2022; Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023a, b) trained with English-centric data can also solve various non-English tasks (Hendrycks et al., 2021a, b; Srivastava et al., 2022; Kwiatkowski et al., 2019; Hendrycks et al., 2021c), but the performance gap between non-English and English is significantly large (Yuan et al., 2023b). Efforts to develop more multilingual LLMs take two different forms: retraining LLMs with diverse multilingual data from scratch (Wei et al., 2023); or continual training of pre-trained models using language-specific data, with the option to expand the vocabulary (Zhao et al., 2024a; Cui et al., 2024; Faysse et al., 2024; Alves et al., 2024). Instead of training from scratch, continual pre-training aims at updating pre-trained models with new data, making the process more efficient and cost-effective (Gupta et al., 2023; Alves et al., 2024; Xie et al., 2023).

Multilinguality in LLMs.

Recent research has shed light on the multilingual capabilities of LLMs. A comprehensive survey by Huang et al. (2024) discusses various aspects of multilingualism in LLMs, including training and inference methods, model security, and multi-domain use with language culture, and emphasizes the need for language-fair technology. Yuan et al. (2023b) analyze the multilingualism of LLMs from the vocabulary-sharing aspect. Zhao et al. (2024b) delve into the architecture of LLMs to find how LLMs handle multilingualism. Recently, Li et al. (2024) quantify the multilingual performance of LLMs. These studies provide valuable insights into the multilingual capabilities of LLMs and into the key technical design of continual pre-training for LLaMAX.

6 Conclusion

In this work, we enhance the translation performance of the LLaMA series models for 102 languages through continual pre-training, creating LLaMAX. We compare LLaMAX's translation capabilities with other decoder-only LLMs and encoder-decoder models across multiple benchmarks. LLaMAX is also assessed on general tasks and fine-tuned with task-specific instructions. Our results indicate that LLaMAX improves translation quality while maintaining general capabilities and can serve as a powerful foundation model for downstream multilingual applications.

Acknowledgments

Authors of this paper would like to thank Zixian Huang, Qiushi Sun, Fangzhi Xu, Hanxu Hu, Chuanyang Jin, Yichao Du, and Zichen Ding for giving many helpful comments on previous versions of this paper.

References

AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
Alves et al. (2024) Duarte M Alves, José Pombal, Nuno M Guerreiro, Pedro H Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, et al. 2024. Tower: An open multilingual large language model for translation-related tasks. arXiv preprint arXiv:2402.17733.
Anastasopoulos et al. (2020) Antonios Anastasopoulos, Alessandro Cattelan, Zi-Yi Dou, Marcello Federico, Christian Federmann, Dmitriy Genzel, Franscisco Guzmán, Junjie Hu, Macduff Hughes, Philipp Koehn, Rosie Lazar, Will Lewis, Graham Neubig, Mengmeng Niu, Alp Öktem, Eric Paquin, Grace Tang, and Sylwia Tur. 2020. TICO-19: the translation initiative for COvid-19. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, et al. 2024. Aya 23: Open weight releases to further multilingual progress. arXiv preprint arXiv:2405.15032.
Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity.
Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Cettolo et al. (2012) Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Annual conference of the European Association for Machine Translation, pages 261–268, Trento, Italy. European Association for Machine Translation.
Chang et al. (2023) Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K. Bergen. 2023. When is multilinguality a curse? language modeling for 250 high- and low-resource languages.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Contributors (2023) OpenCompass Contributors. 2023. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass.
Cui et al. (2024) Yiming Cui, Ziqing Yang, and Xin Yao. 2024. Efficient and effective text encoding for chinese llama and alpaca.
Dobler and de Melo (2023) Konstantin Dobler and Gerard de Melo. 2023. FOCUS: Effective embedding initialization for monolingual specialization of multilingual models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13440–13454, Singapore. Association for Computational Linguistics.
Fan et al. (2021) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond english-centric multilingual machine translation. Journal of Machine Learning Research (JMLR).
Faysse et al. (2024) Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F. T. Martins, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. Croissantllm: A truly bilingual french-english language model.
Fujii et al. (2024) Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. 2024. Continual pre-training for cross-lingual llm adaptation: Enhancing japanese language capabilities.
Goodfellow et al. (2013) Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2013. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
Goyal et al. (2022) Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
Guo et al. (2024) Ping Guo, Yubing Ren, Yue Hu, Yunpeng Li, Jiarui Zhang, Xingsheng Zhang, and Heyan Huang. 2024. Teaching large language models to translate on low-resource languages with textbook prompting. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15685–15697, Torino, Italia. ELRA and ICCL.
Gupta et al. (2023) Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. 2023. Continual pre-training of large language models: How to (re)warm your model?
Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR).
Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
Hendrycks et al. (2021c) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021c. Measuring mathematical problem solving with the math dataset. NeurIPS.
Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation.
Huang et al. (2024) Kaiyu Huang, Fengran Mo, Hongliang Li, You Li, Yuanchi Zhang, Weijian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, Jinan Xu, Jian-Yun Nie, and Yang Liu. 2024. A survey on large language models with multilingualism: Recent advances and new frontiers.
Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Kabir and Carpuat (2021) Tasnim Kabir and Marine Carpuat. 2021. The UMD submission to the explainable MT quality estimation shared task: Combining explanation models with sequence labeling. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 230–237, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Kocmi and Federmann (2023) Tom Kocmi and Christian Federmann. 2023. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Proceedings of the Eighth Conference on Machine Translation, pages 768–775, Singapore. Association for Computational Linguistics.
Kudugunta et al. (2024) Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2024. Madlad-400: A multilingual and document-level large audited dataset. Advances in Neural Information Processing Systems (NeurIPS).
Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
Lample et al. (2018) Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.
Li et al. (2024) Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ninghao Liu, and Mengnan Du. 2024. Quantifying multilingual performance of large language models across languages.
Lin et al. (2021a) Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, and Xiang Ren. 2021a. Common sense beyond english: Evaluating and improving multilingual language models for commonsense reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1274–1287. Association for Computational Linguistics.
Lin et al. (2021b) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona T. Diab, Veselin Stoyanov, and Xian Li. 2021b. Few-shot learning with multilingual language models. CoRR, abs/2112.10668.
Luo et al. (2023) Yin Luo, Qingchao Kong, Nan Xu, Jia Cao, Bao Hao, Baoyu Qu, Bo Chen, Chao Zhu, Chenyang Zhao, Donglei Zhang, et al. 2023. Yayi 2: Multilingual open-source large language models. arXiv preprint arXiv:2312.14862.
Muennighoff et al. (2023) Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. 2023. Scaling data-constrained language models.
Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2022. Crosslingual generalization through multitask finetuning.
OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
Pan et al. (2021) Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021. Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 244–258, Online. Association for Computational Linguistics.
Ponti et al. (2020) Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.
Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Shi et al. (2023) Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramón Risco Delgado, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Timothy Telleen-Lawton, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Team et al. (2022) NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. ArXiv.
Tikhonov and Ryabinin (2021) Alexey Tikhonov and Max Ryabinin. 2021. It’s all in the heads: Using attention heads as a baseline for cross-lingual transfer in commonsense reasoning.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288.
Üstün et al. (2024) Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. 2024. Aya model: An instruction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827.
Vilar et al. (2023) David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2023. Prompting PaLM for translation: Assessing strategies and performance. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15406–15427, Toronto, Canada. Association for Computational Linguistics.
Wang et al. (2019) Changhan Wang, Kyunghyun Cho, and Jiatao Gu. 2019. Neural machine translation with byte-level subwords.
Wang et al. (2022) Xinyi Wang, Sebastian Ruder, and Graham Neubig. 2022. Expanding pretrained models to thousands more languages via lexicon-based adaptation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 863–877, Dublin, Ireland. Association for Computational Linguistics.
Wei et al. (2023) Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, et al. 2023. PolyLM: An open source polyglot large language model. arXiv preprint arXiv:2307.06018.
Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
Xie et al. (2023) Yong Xie, Karan Aggarwal, and Aitzaz Ahmad. 2023. Efficient continual pre-training for building domain specific large language models.
Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Yoon et al. (2024) Dongkeun Yoon, Joel Jang, Sungdong Kim, Seungone Kim, Sheikh Shafayat, and Minjoon Seo. 2024. Langbridge: Multilingual reasoning without multilingual supervision.
Yuan et al. (2023a) Fei Yuan, Yinquan Lu, Wenhao Zhu, Lingpeng Kong, Lei Li, Yu Qiao, and Jingjing Xu. 2023a. Lego-MT: Learning detachable models for massively multilingual machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11518–11533, Toronto, Canada. Association for Computational Linguistics.
Yuan et al. (2024) Fei Yuan, Chang Ma, Shuai Yuan, Qiushi Sun, and Lei Li. 2024. KS-Lottery: Finding certified lottery tickets for multilingual language models.
Yuan et al. (2023b) Fei Yuan, Shuai Yuan, Zhiyong Wu, and Lei Li. 2023b. How multilingual is multilingual LLM?
Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models.
Zhao et al. (2024a) Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024a. LLaMA beyond English: An empirical study on language capability transfer.
Zhao et al. (2024b) Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. 2024b. How do large language models handle multilingualism?
Zhu et al. (2024a) Wenhao Zhu, Shujian Huang, Fei Yuan, Shuaijie She, Jiajun Chen, and Alexandra Birch. 2024a. Question translation training for better multilingual reasoning.
Zhu et al. (2024b) Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2024b. Multilingual machine translation with large language models: Empirical results and analysis. In Findings of the Association for Computational Linguistics: NAACL 2024.

Outline

• Section A: Discussion of the generalizability and reliability of the work.
• Section B: Comprehensive details of the training data, including monolingual and parallel data, and the evaluation benchmarks (Table 9).
• Section C: Detailed information on the compared models, including open-source Large Language Models (Section C.1) and well-trained translation models (Section C.2).
• Section D: Analysis of the correlation between the embedding quality of LLaMA2 and fertility on the Flores-101 test set (Figure 6).
• Section E: A detailed introduction to the KS-Lottery method.
• Section F: Selection of the multi-hop translation setting (Table 10 and Table 11).
• Section G: Selection of the appropriate format for parallel data during training (Table 12).
• Section H: Comparison of translation performance across all seven languages between Lego-MT and GPT-4 (Figure 7).
• Section I: Comparison between LLaMAX2-Alpaca and language-specific enhanced LLMs (Table 13).
• Section J: The complete instructions used for all LLMs (Table D).

Appendix A Limitations

This work focuses on a few key techniques, such as vocabulary extension and the choice of data augmentation scheme. It does not, however, further assess or filter the quality of the open-source data. We acknowledge a gap in the literature regarding the thorough evaluation of open-source data quality, which leaves an opportunity for future research to improve data preprocessing methods for better model training outcomes.

Appendix B Data Information

Family	ISO	Language	# Mono.	# Para.	# Direct.	Family	ISO	Language	# Mono.	# Para.	# Direct.
Afro-Asiatic	ha	Hausa	420,964	3,147,704	96	Indo-European	ne	Nepali	702,334	8,907,527	97
	om	Oromo	18,895	191,319	96		or	Odia	100,530	812,235	97
	so	Somali	697,864	3,804,551	97		pa	Punjabi	513,987	3,737,780	97
	am	Amharic	269,171	4,031,552	97		sd	Sindhi	472,217	821,996	95
	ar	Arabic	716,063	9,940,756	97		ur	Urdu	711,354	4,137,619	97
	he	Hebrew	300,000	3,928,938	96		fa	Persian	721,307	4,111,536	97
	mt	Maltese	671,716	1,518,533	94		ku	Kurdish	517,239	3,597,863	97
Austroasiatic	km	Khmer	687,690	4,044,652	97		ps	Pashto	588,340	3,717,480	97
Austroasiatic	vi	Vietnamese	760,472	4,112,089	97		tg	Tajik	700,237	4,131,709	97
Austronesian	jv	Javanese	505,619	2,799,761	97		ast	Asturian	0	1,535,714	96
	id	Indonesian	707,962	4,243,235	97		ca	Catalan	724,597	4,145,004	97
	ms	Malay	711,895	4,121,713	97		es	Spanish	706,307	4,258,477	98
	mi	Maori	180,678	3,437,702	97		fr	French	787,316	4,290,003	99
	ceb	Cebuano	418,058	2,217,926	91		gl	Galician	726,512	3,131,730	96
	tl	Tagalog	0	3,927,576	97		it	Italian	846,107	4,233,108	96
Dravidian	te	Telugu	708,459	4,219,702	97		oc	Occitan	36,379	1,752,951	95
	kn	Kannada	712,832	3,592,636	97		pt	Portuguese	795,818	4,258,604	97
	ml	Malayalam	715,387	4,516,012	97		ro	Romanian	702,002	4,219,414	97
	ta	Tamil	711,863	4,444,734	97	Japonic	ja	Japanese	726,455	4,207,728	97
Indo-European	hy	Armenian	712,835	3,677,780	97	Kartvelian	ka	Georgian	703,515	4,182,651	97
	lt	Lithuanian	718,382	3,946,735	96	Koreanic	ko	Korean	711,406	4,234,653	97
	lv	Latvian	700,889	4,011,628	97	Kra–Dai	lo	Lao	357,758	2,642,799	97
	be	Belarusian	708,288	4,169,719	95	Kra–Dai	th	Thai	707,719	4,437,476	97
	bg	Bulgarian	711,500	4,131,053	97	Mongolic	mn	Mongolian	701,304	3,894,353	97
	bs	Bosnian	300,000	2,953,912	97	Niger–Congo	wo	Wolof	871	802,521	97
	cs	Czech	711,179	4,135,944	97		ln	Lingala	3,325	159,684	96
	hr	Croatian	300,000	4,106,335	97		ns	Northern Sotho	0	96,288	88
	mk	Macedonian	702,035	4,009,787	97		lg	Luganda	13,030	216,135	95
	pl	Polish	792,829	4,200,001	98		ny	Nyanja	226,940	3,104,349	92
	ru	Russian	853,407	4,204,365	97		sn	Shona	386,588	3,140,063	97
	sk	Slovak	715,540	4,100,272	98		sw	Swahili	700,422	3,775,394	97
	sl	Slovenian	731,613	4,073,213	97		umb	Umbundu	0	54	2
	sr	Serbian	711,535	4,033,130	97		xh	Xhosa	122,720	3,955,426	97
	uk	Ukrainian	714,181	4,070,250	97		yo	Yoruba	98,281	3,364,040	96
	cy	Welsh	703,507	3,777,953	97		zu	Zulu	470,403	2,899,738	97
	ga	Irish	693,460	2,814,912	96		ig	Igbo	147,319	3,314,731	96
	is	Icelandic	704,159	4,088,886	97		kam	Kamba	0	8	1
	sv	Swedish	726,893	4,213,939	97		ff	Fulani	26	313,870	97
	da	Danish	721,543	4,194,587	97	Nilo-Saharan	luo	Dholuo	0	91	6
	no	Norwegian	721,715	4,045,571	97	Portuguese	kea	Kabuverdianu	0	0	0
	af	Afrikaans	703,546	4,143,358	98	Sino-Tibetan	zh	Chinese	726,112	14,215,583	96
	de	German	881,553	10,273,597	97		zhtrad	Chinese	0	3,747,297	96
	en	English	846,712	19,548,583	100		my	Burmese	579,160	3,887,841	97
	lb	Luxembourgish	574,166	1,035,619	94	Turkic	uz	Uzbek	723,096	2,344,375	95
	nl	Dutch	769,778	4,199,773	96		kk	Kazakh	701,849	3,836,259	97
	el	Greek	707,751	4,081,607	97		ky	Kyrgyz	704,438	3,725,583	97
	bn	Bengali	707,099	4,560,978	97		az	Azerbaijani	712,947	8,080,151	97
	as	Assamese	33,825	1,656,861	97		tr	Turkish	727,711	4,169,259	97
	gu	Gujarati	704,619	3,761,401	97	Uralic	et	Estonian	706,720	4,056,200	97
	hi	Hindi	715,691	4,186,127	97		fi	Finnish	719,416	4,076,885	97
	mr	Marathi	702,382	4,295,708	97		hu	Hungarian	731,479	4,154,132	97

Table 9: Detailed information on the collected monolingual and parallel datasets, including the translation directions for each supported language. Specifically, “# Para.” is the count of language-centric sentence pairs, while “# Mono.” is the number of individual monolingual sentences.

In this section, we introduce the sources of our training data (Section B.1) and the evaluation benchmarks (Section B.2). For translation tasks, we apply beam search to each model with a beam size of 4.
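To make the decoding setting concrete, the following is a minimal, self-contained sketch of beam search with a beam size of 4. The `toy_model` function and its tiny vocabulary are hypothetical stand-ins for a real model's next-token distribution; this is an illustration of the technique, not the evaluation code used in this work.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # hypothetical end-of-sequence symbol for this toy example

Prefix = Tuple[str, ...]


def beam_search(
    next_token_logprobs: Callable[[Prefix], Dict[str, float]],
    beam_size: int = 4,
    max_len: int = 10,
) -> List[Tuple[Prefix, float]]:
    """Return finished hypotheses sorted by cumulative log-probability."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (prefix + (token,), score + logp)
            for prefix, score in beams
            for token, logp in next_token_logprobs(prefix).items()
        ]
        # Keep only the `beam_size` best expansions at this step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_model(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities (assumption, not a real LLM)."""
    table = {
        (): {"the": math.log(0.6), "a": math.log(0.4)},
        ("the",): {"dog": math.log(0.5), "cat": math.log(0.5)},
        ("a",): {"dog": math.log(0.9), "cat": math.log(0.1)},
    }
    # Any prefix not in the table ends the sequence with probability 1.
    return table.get(prefix, {EOS: 0.0})


best_tokens, best_score = beam_search(toy_model, beam_size=4)[0]
```

In this toy distribution, greedy decoding would commit to "the" first and end with probability 0.3, whereas the beam keeps "a" alive and recovers the higher-probability sequence "a dog" (0.36).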

B.1 Training Dataset

The dataset was compiled from three distinct open-source datasets, with details on data statistics and supported languages presented in Table 9.

MC4 (Xue et al., 2021)

is a multilingual variant of the C4 dataset, comprising natural text in 101 languages sourced from the Common Crawl web scrape. It was introduced to support the training of massively multilingual pre-trained text-to-text transformers like mT5.

MADLAD-400 (Kudugunta et al., 2024)

is a manually audited, general domain monolingual dataset based on CommonCrawl, encompassing 419 languages and designed for document-level analysis. It is notable for its extensive language coverage and the rigorous auditing process involved in its creation.

Lego-MT (Yuan et al., 2023a)

is a benchmark for massively multilingual machine translation, featuring a detachable model built upon an efficient training recipe. It includes a comprehensive translation benchmark with data from OPUS, covering 433 languages and 1.3 billion parallel data points.

B.2 Evaluation Benchmark

Flores-101 (Goyal et al., 2022)

is a benchmark for machine translation evaluation, comprising a multi-way dataset derived from English Wikipedia and produced by professional translators.

Flores-200 (Team et al., 2022)

is an extension of the Flores-101 dataset and also serves as a benchmark for machine translation. This dataset contains parallel sentences for 200 languages, with each language identified by its ISO 639-3 code (e.g., "eng") and an additional code that describes the script (e.g., "eng_Latn").
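As a small illustration of this naming scheme, the sketch below splits a Flores-200 identifier into its language and script parts. The helper name `parse_flores200_code` and the length checks (three-letter language code, four-letter script code) are assumptions made for this example rather than part of any official tooling.

```python
from typing import NamedTuple


class FloresCode(NamedTuple):
    language: str  # ISO 639-3 language code, e.g. "eng"
    script: str    # four-letter script code, e.g. "Latn"


def parse_flores200_code(code: str) -> FloresCode:
    """Split an identifier such as "eng_Latn" into (language, script)."""
    language, separator, script = code.partition("_")
    if not separator or len(language) != 3 or len(script) != 4:
        raise ValueError(f"unexpected Flores-200 code: {code!r}")
    return FloresCode(language, script)


parsed = parse_flores200_code("eng_Latn")
```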

WMT-23 (Kocmi and Federmann, 2023)

is also a comprehensive translation evaluation benchmark, proposed in 2023. We incorporate this dataset into our evaluation to mitigate the risk of data leakage in LLMs. On this benchmark, we evaluate English-centric translation performance in the following directions: de $\rightarrow$ en, en $\rightarrow$ cs, en $\rightarrow$ de, en $\rightarrow$ he, en $\rightarrow$ ja, en $\rightarrow$ ru, en $\rightarrow$ uk, en $\rightarrow$ zh, he $\rightarrow$ en, ja $\rightarrow$ en, ru $\rightarrow$ en, uk $\rightarrow$ en, zh $\rightarrow$ en.

TICO (Anastasopoulos et al., 2020)

dataset represents a joint translation effort targeting COVID-19 materials, developed in collaboration with academic and industry stakeholders and Translators without Borders. It comprises translation memories and a glossary of translated COVID-19 terms, and functions as a benchmark for translation-related evaluations. The evaluated translation directions are en $\rightarrow$ {am, bn, din, fa, fuv, hi, km, ku, ln, ms, ne, om, ps, ru, so, ta, ti_ER, tl, zh, ar, ckb, es_LA, fr, ha, id, kr, lg, mr, my, nus, prs, pt_BR, rw, sw, ti, ti_ET, ur, zu}.

TED (Cettolo et al., 2012)

is a massively multilingual dataset derived from TED Talk transcripts, covering 60 languages with parallel arrays of language and text. It is designed for natural language processing tasks and filters out missing or incomplete translations. We also evaluate English-centric translation performance on this dataset. The translation directions cover all 60 languages, including en $\leftrightarrow$ {af, am, ar, arq, art-x-bork, as, ast, az, be, bg, bi, bn, bo, bs, ca, ceb, cnh, cs, da, de, el, eo, es, et, eu, fa, fi, fil, fr, fr-ca, ga, gl, gu, ha, he, hi, hr, ht, hu, hup, hy, id, ig, inh, is, it, ja, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, ltg, lv, mg, mk, ml, mn, mr, ms, mt, my, nb, ne, nl, nn, oc, pa, pl, ps, pt, pt-br, ro, ru, rup, sh, si, sk, sl, so, sq, sr, srp, sv, sw, szl, ta, te, tg, th, tl, tlh, tr, tt, ug, uk, ur, uz, vi, zh, zh-cn, zh-tw}.

X-CSQA (Lin et al., 2021a)

is a multilingual extension of the Commonsense Question Answering (CSQA) dataset, designed for commonsense reasoning research. It facilitates the evaluation and improvement of multilingual language models in commonsense reasoning tasks.

XStoryCloze (Lin et al., 2021b)

is a benchmark dataset that comprises the professionally translated English StoryCloze dataset (Spring 2016 version) into 10 non-English languages. It is designed to evaluate the zero- and few-shot learning capabilities of multilingual language models.

Setting	Dictionary	en-centric		ta-centric		th-centric		zh-centric
Setting	Dictionary	en $\rightarrow$ x	x $\rightarrow$ en	ta $\rightarrow$ x	x $\rightarrow$ ta	th $\rightarrow$ x	x $\rightarrow$ th	zh $\rightarrow$ x	x $\rightarrow$ zh
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}^{\prime}$ + $\mathcal{D}_{\mathrm{M}}^{\prime}$	MUSE: 1-hop	18.80	26.56	4.78	1.79	7.31	3.18	11.35	7.28
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}^{\prime}$ + $\mathcal{D}_{\mathrm{M}}^{\prime}$	MUSE: 2-hop	18.70	26.50	4.47	1.83	7.08	3.26	10.74	6.68
$\mathcal{D}_{\mathrm{P_{1}}}$ + $\mathcal{D}_{\mathrm{P_{2}}}^{\prime}$ + $\mathcal{D}_{\mathrm{M}}^{\prime}$	PanLex: 1-hop	19.33	26.54	4.40	1.83	7.57	3.31	10.86	8.08

Table 10: With the data augmentation technique fixed, we evaluate different dictionary configurations: 1-hop versus 2-hop translation, and different dictionaries (MUSE and PanLex).

1-hop translation		2-hop translation
Direction	Example	Direction	Example
en $\rightarrow$ fr	dog $\rightarrow$ chien	en $\rightarrow$ fr $\rightarrow$ de	dog $\rightarrow$ chien $\rightarrow$ Hund
fr $\rightarrow$ de	chien $\rightarrow$ Hund

Table 11: Examples of 1-hop and 2-hop translations.
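The distinction between the two settings can be sketched with toy bilingual dictionaries. The dictionaries below contain only the entries from the example in Table 11, and the helper names `one_hop` and `two_hop` are hypothetical; the sketch shows how a 2-hop translation chains two dictionary lookups through a pivot language.

```python
from typing import Dict, Optional

# Toy dictionaries holding only the example from Table 11.
EN_FR: Dict[str, str] = {"dog": "chien"}
FR_DE: Dict[str, str] = {"chien": "Hund"}


def one_hop(word: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Translate with a single dictionary lookup (e.g. en -> fr)."""
    return dictionary.get(word)


def two_hop(
    word: str, first: Dict[str, str], second: Dict[str, str]
) -> Optional[str]:
    """Translate through a pivot language (e.g. en -> fr -> de)."""
    pivot = first.get(word)
    if pivot is None:
        return None  # no entry for the first hop, so the chain stops
    return second.get(pivot)


fr_word = one_hop("dog", EN_FR)
de_word = two_hop("dog", EN_FR, FR_DE)
```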

XCOPA (Ponti et al., 2020)

is a benchmark dataset that assesses machine learning models’ ability to transfer commonsense reasoning across languages. It is an extension of the English COPA dataset and includes 11 languages from diverse language families and geographical regions.

XWinograd (Muennighoff et al., 2022; Tikhonov and Ryabinin, 2021)

is a benchmark dataset that consists of a multilingual collection of Winograd Schemas, designed for the evaluation of cross-lingual commonsense reasoning capabilities covering six languages.

XNLI (Conneau et al., 2018)

is a cross-lingual extension of SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), consisting of a subset of English examples translated into 14 different languages. It is used for evaluating textual entailment and classification tasks, where the goal is to determine whether one sentence entails, contradicts, or is neutral to another sentence.

MGSM (Shi et al., 2023)

is a dataset of grade-school math problems, each translated into 10 languages by human annotators. It is derived from the GSM8K (Cobbe et al., 2021) dataset and is designed to support question answering on basic mathematical problems that require multi-step reasoning.

MMLU (Hendrycks et al., 2021a, b)

is a benchmark for evaluating language models’ capabilities in language comprehension and reasoning across diverse domains. It consists of about 16,000 multiple-choice questions spanning 57 academic subjects, designed to measure knowledge acquired during pretraining in zero-shot and few-shot settings.

BBH (Srivastava et al., 2022)

is a subset of the BIG-Bench, focusing on 23 challenging tasks that current language models struggle to perform, where they do not outperform the average human-rater. It serves as a rigorous evaluation suite to test the limits of language models’ capabilities.

HellaSwag (Zellers et al., 2019)

is a dataset designed to evaluate advanced natural language understanding and commonsense reasoning. It introduces more complexity and diversity, challenging AI models to predict the ending of incomplete narratives.

WinoG (Sakaguchi et al., 2021)

is a large-scale dataset containing 44k problems inspired by the Winograd Schema Challenge, designed to improve the scale and hardness of coreference resolution tasks. It presents fill-in-the-blank questions with binary options, testing the model’s ability to understand nuanced human language.

NQ (Kwiatkowski et al., 2019)

is a dataset for question answering research, containing over 300,000 examples each consisting of a real user query and a corresponding Wikipedia page. It is designed to train and evaluate automatic question answering systems by emulating how people search for information.

HumanEval (Chen et al., 2021)

is designed to evaluate the code generation capabilities of large language models, featuring 164 hand-crafted programming challenges that include function signatures, docstrings, bodies, and unit tests. On average, each problem is accompanied by 7.7 tests to assess functional correctness.
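Functional correctness here means that a generated solution passes the problem's unit tests. The sketch below illustrates the idea on a hypothetical toy problem (`add`), which is not taken from HumanEval; a real evaluation harness would additionally run untrusted code in a sandbox with time limits.

```python
def passes_unit_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the candidate defines code that passes all assertions."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # run the unit tests against it
    except Exception:
        # Any failed assertion or runtime error counts as incorrect.
        return False
    return True


# Hypothetical toy problem, used only for illustration.
problem_tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
correct_candidate = "def add(a, b):\n    return a + b\n"
wrong_candidate = "def add(a, b):\n    return a - b\n"

correct_result = passes_unit_tests(correct_candidate, problem_tests)
wrong_result = passes_unit_tests(wrong_candidate, problem_tests)
```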

MBPP (Austin et al., 2021)

comprises approximately 1,000 crowd-sourced Python programming problems, aimed at entry-level programmers and covering programming fundamentals and standard library functionality. Each problem includes a task description, code solution, and three automated test cases.

GSM8K (Cobbe et al., 2021)

consists of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. It is designed to support question answering on basic mathematical problems that require multi-step reasoning.

Math (Hendrycks et al., 2021c)

is a collection of 12,500 intricate problems derived from competition mathematics. Every problem within the Math dataset includes a comprehensive solution with step-by-step guidance, which serves as a resource for training models to produce detailed answer justifications and explanations.

Appendix C Model Information

Setting	Translation Tasks		General Tasks			Multilingual Tasks
Setting	ceb $\rightarrow$ x	x $\rightarrow$ ceb	QNLI	QQP	MRPC	XStoryCloze	XCOPA	XWinograd
splited-parallel + mono	3.36	2.74	49.46	36.82	68.38	59.20	56.82	73.72
connected-parallel + mono	4.45	3.68	49.46	36.82	68.38	59.10	56.80	74.07
Setting	ceb $\rightarrow$ ca	ceb $\rightarrow$ de	ceb $\rightarrow$ en	ceb $\rightarrow$ es	ceb $\rightarrow$ fr	ceb $\rightarrow$ it	ceb $\rightarrow$ pt	ceb $\rightarrow$ ru
splited-parallel + mono	10.32	8.94	23.19	13.30	15.96	10.01	12.66	8.05
connected-parallel + mono	10.97	11.37	27.06	14.91	18.04	12.03	15.55	10.26
Setting	ca $\rightarrow$ ceb	de $\rightarrow$ ceb	en $\rightarrow$ ceb	es $\rightarrow$ ceb	fr $\rightarrow$ ceb	it $\rightarrow$ ceb	pt $\rightarrow$ ceb	ru $\rightarrow$ ceb
splited-parallel + mono	5.90	4.91	7.44	5.14	6.02	5.54	6.12	4.24
connected-parallel + mono	7.62	6.92	9.88	6.41	7.39	6.91	7.62	6.54

Table 12: Design for the utilization of parallel data. Taking ceb-centric data as an example, we apply two distinct approaches and find that treating parallel data as two independent monolingual datasets harms translation performance.
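The two settings compared in Table 12 can be sketched as follows. The exact training template is not specified in this section, so the newline separator and the function names below are assumptions for illustration only; the point is that the connected format keeps source and target in one training sample, while the split format discards their alignment.

```python
from typing import List, Tuple


def connected_parallel(pair: Tuple[str, str], separator: str = "\n") -> List[str]:
    """Keep source and target together in a single training sample.

    The separator is a hypothetical choice, not the template used in the paper.
    """
    source, target = pair
    return [f"{source}{separator}{target}"]


def split_parallel(pair: Tuple[str, str]) -> List[str]:
    """Treat both sides as independent monolingual samples."""
    source, target = pair
    return [source, target]


# Illustrative ceb-en sentence pair (not taken from the training data).
pair = ("Maayong buntag.", "Good morning.")
connected_samples = connected_parallel(pair)
split_samples = split_parallel(pair)
```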

This section details the baseline models used for comparison, including decoder-only large language models (LLMs) in Section C.1 and translation models with an encoder-decoder structure in Section C.2.

C.1 Large Language Models

LLaMA2 (Touvron et al., 2023b)

is a decoder-only language model that predicts the next token based on the input sequence of ordered tokens, with a collection of pre-trained and fine-tuned models ranging from 7 billion to 70 billion parameters. The LLaMA2 7B model serves as our foundational model. Unless otherwise specified, any reference to LLaMA or LLaMA2 refers to the LLaMA2 7B model. The model leverages a Byte-level Byte Pair Encoding (BBPE; Wang et al., 2019) tokenizer, an efficient subword tokenizer that tokenizes at the byte level, allowing it to handle any language and be robust to noise in the data. The BBPE tokenizer is particularly useful for languages with large vocabularies and many rare words.

LLaMAX2

follows the model architecture of LLaMA2 without vocabulary extension. We use 24 A100 80GB GPUs and continue pre-training on the collected data for over 60 days. We set the per-device training batch size to 32, the learning rate to 2e-5, and the number of epochs to 1.
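As a rough sanity check, the compute budget implied by these settings can be worked out directly. The sketch below assumes all 24 GPUs ran continuously for exactly 60 days (the text says "over 60 days", so this is a lower bound); the function name is illustrative only.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU hours, assuming every GPU is busy 24 hours a day."""
    return num_gpus * days * 24


# 24 GPUs for 60 days, the figures stated above (60 days is a lower bound).
total = gpu_hours(24, 60)
print(total)  # 34560
```

This is consistent with the roughly 35,000 A100 GPU hours reported in the abstract.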

PolyLM (Wei et al., 2023)

is an open-source multilingual Large Language Model (LLM) trained on 640 billion tokens, available in two model sizes: 1.7B and 13B. It boasts proficiency in 15 major non-English languages, employing advanced training techniques to enhance its language processing capabilities.

Yayi2 (Luo et al., 2023)

is a multilingual open-source Large Language Model pre-trained from scratch on a corpus containing 2.65 trillion tokens. It is aligned with human values through supervised fine-tuning and reinforcement learning from human feedback.

TowerInstruct (Alves et al., 2024)

is a 7B parameter language model fine-tuned on translation-related tasks, supporting multiple languages including English, Portuguese, Spanish, French, and others. It is designed for tasks such as machine translation, automatic post-editing, and paraphrase generation. In our paper, we evaluate the instruction-tuned model TowerInstruct-7B-v0.2.

Aya-23 (Aryabumi et al., 2024)

is an open weights research release of an instruction fine-tuned decoder-only model with advanced multilingual capabilities, serving 23 languages. It pairs a performant pre-trained Command family of models with the Aya Collection for robust language processing tasks.

ChineseLLaMA2-Alpaca (Cui et al., 2024)

is founded on LLaMA2 and enhanced with an extensive Chinese vocabulary that concentrates on Chinese languages. This is a fine-tuned version of ChineseLLaMA2 using Alpaca (Taori et al., 2023) data.

LLaMA2-SFT (Taori et al., 2023)

is a fine-tuned version of LLaMA2 model, leveraging a set of 52,000 diverse English instructions in Alpaca (Taori et al., 2023) to enhance the instruction-following capabilities of the model.

Qwen2-7B-Instruct (Bai et al., 2023)

is an instruction-tuned language model from the Qwen2 series. It is competitive with proprietary models across multilingual benchmarks.

Swallow (Fujii et al., 2024)

is a large language model which enhances Japanese capability based on LLaMA2. It achieves this by extending the vocabulary with Japanese characters and conducting continued pre-training on a Japanese corpus, resulting in superior performance compared to other LLMs in both English and Japanese tasks. In our paper, we evaluate the instruction-tuned model Swallow-7B-Instruct-v0.1.

C.2 Translation Models

X	LLaMA2-Alpaca		Swallow		LLaMAX2-Alpaca		X	LLaMA2-Alpaca		Swallow		LLaMAX2-Alpaca
X	$R_{\mathrm{ja}}$	$R_{\mathrm{X}}$	$R_{\mathrm{ja}}$	$R_{\mathrm{X}}$	$R_{\mathrm{ja}}$	$R_{\mathrm{X}}$	X	$R_{\mathrm{ja}}$	$R_{\mathrm{X}}$	$R_{\mathrm{ja}}$	$R_{\mathrm{X}}$	$R_{\mathrm{ja}}$	$R_{\mathrm{X}}$
af	0.20	35.28	72.23	0.00	0.59	75.69	lo	0.30	37.85	75.89	0.10	0.00	54.55
am	0.20	61.96	77.67	0.10	0.69	90.91	lt	4.74	32.41	70.85	4.55	3.66	94.76
ar	0.69	93.97	64.72	13.93	0.00	99.90	luo	0.49	0.00	71.25	0.00	0.89	0.00
as	3.66	1.38	74.01	0.00	0.10	73.22	lv	1.09	39.92	66.80	5.53	1.68	95.36
ast	0.20	1.48	71.44	0.00	0.20	34.19	mi	0.20	0.00	61.46	0.00	0.20	0.00
az	0.20	26.58	69.57	5.53	0.30	97.43	mk	0.30	17.98	78.46	0.00	0.49	98.81
be	0.40	60.18	72.92	0.00	0.20	99.11	ml	1.28	36.17	74.41	1.68	0.49	70.75
bg	1.09	60.28	77.67	0.30	0.89	98.02	mn	0.59	35.18	75.59	1.48	0.00	99.31
bn	1.78	64.62	75.69	1.78	0.00	99.90	mr	0.59	35.87	76.88	0.00	0.10	99.01
bs	0.69	1.38	73.52	0.00	1.98	3.16	ms	0.10	5.53	61.86	0.20	0.00	39.92
ca	0.40	89.92	65.02	11.07	0.49	98.12	mt	0.40	60.08	68.38	3.16	0.69	94.07
ceb	0.10	33.30	44.57	3.56	0.00	95.06	my	1.68	56.03	78.85	1.48	0.10	99.90
cs	1.19	61.46	72.13	5.24	1.68	93.38	ne	0.20	50.00	70.45	0.00	0.00	99.01
cy	0.20	30.83	66.90	2.47	0.20	98.52	nl	0.40	76.78	61.36	22.33	0.20	92.09
da	0.79	57.51	70.06	4.64	0.59	91.80	no	1.38	44.47	69.57	3.16	0.69	86.66
de	1.28	83.40	57.41	29.25	1.28	94.17	ns	1.58	0.00	62.55	0.00	1.38	0.00
el	1.09	42.00	75.20	7.41	0.00	100.00	ny	0.49	0.00	72.53	0.00	0.79	0.00
en	0.00	100.00	67.29	32.41	0.00	100.00	oc	0.20	1.09	68.97	0.00	0.59	58.10
es	0.40	97.04	57.81	20.26	0.10	99.21	om	0.30	0.00	72.53	0.00	2.57	0.00
et	0.69	14.03	68.48	8.70	4.35	89.13	or	0.69	61.86	79.45	0.00	1.09	98.52
fa	0.30	83.89	75.79	4.35	0.00	98.42	pa	0.40	77.67	72.04	1.78	0.79	98.91
ff	0.69	0.00	73.12	0.00	11.96	0.00	pl	0.79	73.32	71.54	8.40	0.49	98.02
fi	3.36	74.11	66.01	17.39	2.37	96.25	ps	0.20	43.28	75.40	0.00	0.00	98.22
fr	0.49	97.04	52.47	34.29	0.00	99.70	pt	1.09	90.71	63.14	8.20	0.20	98.22
ga	0.20	26.98	64.23	2.96	0.00	94.07	ro	0.30	45.95	68.97	4.25	0.30	89.53
gl	0.10	1.58	63.34	3.56	0.20	83.30	ru	0.30	83.10	71.44	12.45	0.20	99.41
gu	0.30	67.59	77.47	0.99	1.48	96.64	sd	0.89	2.47	74.31	0.00	0.00	92.59
ha	0.59	0.00	70.06	0.00	0.99	0.00	sk	0.49	27.27	65.42	7.81	0.59	94.57
he	1.78	76.19	63.34	16.60	0.00	100.00	sl	0.79	58.79	61.66	3.56	1.38	91.11
hi	0.69	70.75	67.98	7.91	0.00	99.90	sn	0.40	0.00	68.18	0.00	1.58	0.00
hr	0.89	54.55	69.37	1.28	1.19	66.60	so	0.10	7.71	74.31	0.20	0.99	59.19
hu	0.40	69.96	71.44	10.67	0.30	93.87	sr	1.48	15.22	75.49	1.48	1.98	44.07
hy	0.69	77.08	79.55	1.09	0.00	99.90	sv	2.57	49.90	66.01	13.34	1.68	95.16
id	0.20	84.98	70.65	7.61	0.00	97.04	sw	0.20	48.32	67.49	0.99	0.59	94.76
ig	0.10	0.00	74.80	0.00	0.20	0.00	ta	0.30	53.46	74.31	1.98	0.00	99.80
is	0.30	55.34	58.20	19.76	0.20	95.06	te	0.20	73.12	75.79	2.47	0.00	99.80
it	0.59	85.47	55.24	24.11	0.00	97.63	tg	0.69	6.23	74.01	0.00	0.40	97.33
jv	1.38	0.10	66.90	0.00	0.89	67.79	th	0.00	84.39	70.75	12.15	0.00	100.00
ka	1.28	63.14	65.91	16.01	0.00	100.00	tl	0.20	73.62	62.94	6.72	0.10	99.31
kam	0.30	0.00	73.22	0.00	3.56	0.00	tr	0.79	42.39	67.69	11.86	0.40	95.26
kea	0.20	0.00	71.25	0.00	0.99	0.00	uk	0.59	89.53	74.31	3.36	0.49	98.12
kk	0.10	55.93	76.48	0.49	0.10	99.21	umb	0.69	0.00	68.68	0.00	1.38	0.00
km	0.40	53.66	80.34	0.69	0.00	99.90	ur	1.19	25.49	76.19	2.77	0.30	97.92
kn	3.06	49.60	78.56	1.09	0.10	99.90	uz	0.40	32.71	74.51	0.20	1.78	86.36
ko	1.58	94.17	60.57	21.84	0.10	99.51	vi	0.00	95.85	56.42	13.24	0.10	99.70
ku	0.20	28.06	60.28	0.49	2.77	72.73	wo	1.09	0.00	73.32	0.00	2.96	0.00
ky	0.40	40.71	75.79	0.00	0.10	99.41	xh	0.20	0.00	70.55	0.00	0.59	0.00
lb	0.69	31.23	66.11	0.00	2.27	87.75	yo	0.10	3.95	67.00	0.00	0.10	13.93
lg	1.38	0.00	74.11	0.00	12.65	0.00	zh	23.22	70.16	37.15	35.67	5.93	93.08
ln	0.30	0.00	71.84	0.00	0.79	0.00	zhtrad	32.41	0.00	43.87	0.00	7.31	0.00
							zu	0.10	0.00	67.39	0.00	1.38	0.00

Table 13: We utilize langdetect to identify the language of the translation outputs from ja $\rightarrow$ X of the LLaMA2-Alpaca, Swallow and LLaMAX2-Alpaca models on the Flores-101 benchmark. $R_{\mathrm{ja}}$ represents the ratio of sentences in the translation output for which the top predicted language is Japanese. Conversely, $R_{\mathrm{X}}$ refers to the proportion for which the top predicted language matches the target language.

M2M-100 (Fan et al., 2021)

encompasses multilingual machine translation models designed to translate between any pair of 100 languages directly, without the need for English as an intermediary. The M2M-100 series includes models of varying sizes, specifically 418M, 1.2B, and 12B parameters. These models are part of a groundbreaking approach in the field of machine translation, aiming to enhance direct translation efficiency across a wide array of languages.

Lego-MT (Yuan et al., 2023a)

is a novel approach to massively multilingual machine translation, featuring detachable models with individual branches for each language or group of languages. This design supports plug-and-play training and inference, enhancing flexibility and efficiency in language processing tasks.

MADLAD-400 (Kudugunta et al., 2024)

is a multilingual machine translation model that leverages the T5 architecture and has been trained on a vast corpus of 250 billion tokens, covering over 450 languages.

Aya-101 (Aryabumi et al., 2024)

is an open-source, massively multilingual generative language model that operates on the mT5 (Xue et al., 2021) architecture, covering 101 languages and designed to bridge the performance gap in non-dominant languages. It incorporates a 13B parameter base and has undergone instruction-finetuning to achieve high performance across its extensive language range.

Appendix D The correlation between fertility and representation quality.

We conduct experiments on Flores-101. Fertility is defined as the ratio of $L_{s}$ to $L_{T}$, where $L_{s}$ is the number of words for space-separated languages (and the number of characters for other languages) and $L_{T}$ is the number of tokens after applying the LLaMA2 tokenizer. We estimate the representation quality of LLaMA on the Flores-101 test set with two measures. Cosine similarity captures how close LLaMA's sentence representations are for the same sentence in English and in another language. Recall@1, a metric commonly used in information retrieval, measures the quality of the representation. The experimental results, shown in Figure 6, indicate that fertility is highly correlated with representation quality.
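The three quantities above can be illustrated with a minimal sketch. It assumes that sentence representations are already available as plain lists of floats and that Recall@1 counts how often the nearest English vector (by cosine similarity) is the true translation; the function names and the toy vectors are hypothetical and not taken from the paper's code.

```python
import math


def fertility(text, tokens, space_separated=True):
    """L_s / L_T: words (or characters) divided by the number of tokens."""
    source_len = len(text.split()) if space_separated else len(text.replace(" ", ""))
    return source_len / len(tokens)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(english_vecs, other_vecs):
    """Share of sentences whose nearest English vector is the aligned one."""
    hits = 0
    for i, query in enumerate(other_vecs):
        best = max(range(len(english_vecs)), key=lambda j: cosine(query, english_vecs[j]))
        hits += int(best == i)
    return hits / len(other_vecs)


# Toy data: 3 words split into 6 tokens, and two aligned sentence pairs.
print(fertility("the cat sat", ["t", "he", "c", "at", "s", "at"]))
print(recall_at_1([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]))
```

In practice the token list would come from the LLaMA2 tokenizer and the vectors from the model's hidden states; both are replaced by toy values here.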

Model	Templates
LLaMAX-Alpaca	Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Translate the following sentences from English to Chinese Simpl ### Input: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added. ### Response: 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”
LLaMA Series Models	Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Translate the following sentences from English to Chinese Simpl ### Input: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added. ### Response: 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”
yayi2	Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Translate the following sentences from English to Chinese Simpl ### Input: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added. ### Response: 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”
polylm	"We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added. Translate this sentence English to Chinese Simpl. 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”
TowerInstruct	<|im_start|>user Translate the following text from English into Chinese. English: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added. Chinese:<|im_end|> <|im_start|>assistant 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”
aya23	<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Translate the following sentences from English to Chinese: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”<|END_OF_TURN_TOKEN|>
Qwen2 instruct	system You are a helpful assistant. user Translate the following sentences from English to Chinese Simpl: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added. assistant 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”
ChineseAlpaca-2	[INST] <<SYS>> You are a helpful assistant. 你是一个乐于助人的助手。 <</SYS>> Translate the following sentences from English to Chinese Simpl: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added. [/INST] 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”
Swallow	[INST] <<SYS>> あなたは誠実で優秀な日本人のアシスタントです。 <</SYS>> Translate the following sentences from Japanese to Chinese Simpl: 「我々が飼っている生後4か月のマウスはかつて糖尿病でしたが現在は糖尿病ではない、」と彼は付け加えました。 [/INST] 「他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”」
Madlad	’<2zh> "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.’ 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”

Table 14: Examples of the instruction templates used for all evaluated LLMs. The translation result shown, 他补充道：“我们现在有 4 个月大没有糖尿病的老鼠，但它们曾经得过该病。”, is the reference rather than the model’s output.

Appendix E Introduction to KS-Lottery.

KS-Lottery is a technique designed to identify a small, highly effective subset of parameters within LLMs for multilingual capability transfer. The core concept of this method involves utilizing the Kolmogorov-Smirnov Test to examine the distribution shift of parameters before and after fine-tuning. This approach helps in pinpointing the “winning tickets” or the most impactful parameters that contribute significantly to the model’s performance in multilingual tasks.
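The core idea, ranking parameters by how much their distribution shifts during fine-tuning, can be sketched as follows. This is only an illustration under stated assumptions: parameters are grouped into hypothetical named rows (`row_0`, `row_1`), the two-sample Kolmogorov-Smirnov statistic is computed by hand, and the selection simply takes the top-k statistics. The actual grouping and selection criterion of KS-Lottery are not described in this appendix.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def most_shifted(before, after, top_k=1):
    """Names of the parameter groups with the largest KS statistic."""
    scores = {name: ks_statistic(before[name], after[name]) for name in before}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical parameter values before and after fine-tuning.
before = {"row_0": [0.1, 0.2, 0.3, 0.4], "row_1": [0.1, 0.2, 0.3, 0.4]}
after = {"row_0": [0.1, 0.2, 0.3, 0.4], "row_1": [0.6, 0.7, 0.8, 0.9]}
print(most_shifted(before, after))  # ['row_1']
```

A full statistical test would also turn the statistic into a p-value; the sketch stops at the statistic to stay dependency-free.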

Appendix F 1-hop translation in data augmentation is enough.

We use two different multilingual dictionaries: MUSE, provided by Lample et al. (2018) (https://github.com/facebookresearch/MUSE), and PanLex (Wang et al., 2022). In the context of a multilingual dictionary, we use “1-hop” and “2-hop” to characterize the translation relationship among different languages; an example is shown in Table 10.

We use the MUSE dictionary to perform data augmentation on both parallel $\mathcal{D}_{\mathrm{P_{2}}}$ and monolingual $\mathcal{D}_{\mathrm{M}}$ data, utilizing 1-hop and 2-hop translations. As shown in Table 10, the number of hops used for augmentation does not significantly affect the final translation performance. Multi-hop translation can sometimes even result in poorer performance.
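One way to picture the difference is the following sketch. It assumes that a 1-hop translation is a direct dictionary lookup and that a 2-hop translation chains two dictionaries through an intermediate language; this reading, the function names, and the toy entries are assumptions made for illustration and may differ from the example in Table 10.

```python
def one_hop(word, dictionary):
    """Direct lookup: translations listed for the word itself."""
    return dictionary.get(word, [])


def two_hop(word, src_to_pivot, pivot_to_tgt):
    """Chained lookup: source -> pivot -> target, without duplicates."""
    results = []
    for pivot_word in src_to_pivot.get(word, []):
        for target_word in pivot_to_tgt.get(pivot_word, []):
            if target_word not in results:
                results.append(target_word)
    return results


# Toy dictionaries (hypothetical entries, not taken from MUSE or PanLex).
en_fr = {"cat": ["chat"]}
fr_de = {"chat": ["Katze"]}
print(one_hop("cat", en_fr))         # ['chat']
print(two_hop("cat", en_fr, fr_de))  # ['Katze']
```

Each extra hop can compound dictionary noise, which fits the observation that multi-hop augmentation sometimes performs worse.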

Appendix G Design of parallel format

The Usage of Parallel Data.

Parallel data can be utilized in two distinct ways: split-parallel or connected-parallel. Split-parallel: the source-language and target-language sides of the parallel data are treated as two distinct monolingual datasets, which are randomly shuffled throughout the entire training set. Connected-parallel: each pair of source and target sentences from the parallel dataset is concatenated and treated as a single data point during training.

For each form of parallel data, we conduct supervised fine-tuning (SFT) separately on ceb-centric data, using both parallel and monolingual datasets. As indicated in Table 12, the form of parallel data primarily affects translation performance: there is no significant difference on general tasks or cross-lingual general tasks, whereas the disparity in translation is pronounced. We also report several high-resource translation directions, where the gap is likewise substantial (for example, ceb $\rightarrow$ en improves from 23.19 to 27.06 with connected-parallel data).
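The two formats can be made concrete with a small sketch. It assumes that sentence pairs are given as `(source, target)` tuples and that connected pairs are joined with a newline; the separator and the function names are illustrative choices, since the text does not specify them.

```python
import random


def split_parallel(pairs, seed=0):
    """Treat both sides as independent monolingual sentences and shuffle them."""
    examples = [sentence for pair in pairs for sentence in pair]
    random.Random(seed).shuffle(examples)
    return examples


def connected_parallel(pairs, separator="\n"):
    """Concatenate each source/target pair into a single training example."""
    return [source + separator + target for source, target in pairs]


# Placeholder sentence pairs standing in for real ceb-en parallel data.
pairs = [("src sentence 1", "tgt sentence 1"), ("src sentence 2", "tgt sentence 2")]
print(len(split_parallel(pairs)))     # 4 independent examples
print(len(connected_parallel(pairs))) # 2 concatenated examples
```

In the split form the alignment between a sentence and its translation is lost, whereas the connected form keeps the translation in the same context window.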

Appendix H Comparison Results Between Our Model and GPT-4

In Figure 7, we compare the performance gap between our model and GPT-4. Considering the API cost of evaluating GPT-4, we only evaluate the mutual translation performance among seven languages (en, zh, de, ne, ar, az, ceb). Experimental results show that while our model lags behind in high-resource translation directions, it achieves on-par or even superior performance in low-resource translation.

Appendix I Comparison of LLaMAX2-Alpaca with language-specific LLMs.

We further compare LLaMAX2-Alpaca with Swallow, a Japanese-specific LLM. After using LLaMAX2-Alpaca and Swallow to generate translations from Japanese (ja) to every language in Flores-101, we apply langdetect to determine the language of each translation result and calculate the proportions of outputs detected as Japanese and as the target language, respectively. The experimental results, shown in Table 13, indicate that the Japanese-specific LLM tends to output Japanese, whereas LLaMAX2-Alpaca more reliably produces the target language.
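The two proportions can be computed as in the sketch below. It assumes that the top detected language code for each output sentence is already available as a list of strings (in practice these would come from a detector such as langdetect) and that the ratios are reported as percentages, as in Table 13; the function name and the toy labels are hypothetical.

```python
def language_ratios(detected, target_lang, source_lang="ja"):
    """Return (R_source, R_target) as percentages of detected labels."""
    total = len(detected)
    r_source = 100 * sum(lang == source_lang for lang in detected) / total
    r_target = 100 * sum(lang == target_lang for lang in detected) / total
    return r_source, r_target


# Toy labels for four ja -> de outputs (not real model predictions).
labels = ["ja", "de", "de", "en"]
print(language_ratios(labels, target_lang="de"))  # (25.0, 50.0)
```

The two ratios need not sum to 100, since an output may be detected as a third language.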

Appendix J Prompt Templates

We offer a comprehensive collection of prompt instruction templates, as illustrated in Table 14, which are utilized for all evaluated LLMs. These templates are designed based on the prompt formats of the respective LLMs and play a crucial role in obtaining accurate model results and ensuring fair comparisons. Our goal in providing these templates is to promote transparency and make it easier to reproduce our findings.
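As an illustration, the Alpaca-style template in the first rows of Table 14 could be assembled as follows. This is a sketch: the table is flattened, so the line breaks between the sections are assumed here to follow the common Alpaca layout, and `build_prompt` is a hypothetical helper name.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)


def build_prompt(src_lang, tgt_lang, sentence):
    """Fill the template with a translation instruction and the source sentence."""
    instruction = f"Translate the following sentences from {src_lang} to {tgt_lang}"
    return ALPACA_TEMPLATE.format(instruction=instruction, input=sentence)


prompt = build_prompt(
    "English",
    "Chinese Simpl",
    '"We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.',
)
print(prompt)
```

The model's continuation after `### Response:` is then taken as the translation.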
