Basic Information

Basic File Information

Name

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Home Page

https://yiyibooks.cn/arxiv/2602.00747v1/index.html

Original URL

https://arxiv.org/pdf/2602.00747

Description

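Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, because existing approaches either rely on unreliable small-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search cost from training cost, allowing an unlimited number of sampled mixtures to be evaluated without extra training, and thus supports better mixture discovery through more search trials. Extensive experiments show that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and the DeMix Corpora are available at this https URL. ...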
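
The idea of deriving a mixture proxy through weighted model merging can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that merging means a linear weighted average of parameters with the mixture ratios as weights, and the names `merge_models`, `general`, and `math_model` are hypothetical. Parameters are plain Python lists so the example has no dependencies.

```python
from typing import Dict, List

# A toy "model": parameter name -> flat list of floats (hypothetical format).
Params = Dict[str, List[float]]


def merge_models(components: List[Params], weights: List[float]) -> Params:
    """Linearly average component models using normalized mixture weights.

    Assumption: every component shares the same parameter names and shapes.
    """
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalize so the ratios sum to 1

    merged: Params = {}
    for name in components[0]:
        size = len(components[0][name])
        merged[name] = [
            sum(w * comp[name][i] for w, comp in zip(norm, components))
            for i in range(size)
        ]
    return merged


# Two toy component models, each imagined as trained on one candidate dataset.
general = {"layer.weight": [1.0, 2.0]}
math_model = {"layer.weight": [3.0, 6.0]}

# Proxy for a hypothetical 75% general / 25% math mixture, with no extra training.
proxy = merge_models([general, math_model], [0.75, 0.25])
print(proxy)
```

Under this assumption, evaluating another candidate mixture only requires calling `merge_models` again with different weights, which is the sense in which search cost no longer grows with training cost.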