Authors:
Bo-Wen Zhang, Liangdong Wang, Ye Yuan, Jijie Li, Shuhao Gu, Mengdi Zhao, Xinya Wu, Guang Liu, Chengwei Wu, Hanyu Zhao, Li Du, Yiming Ju, Quanyue Ma, Yulong Ao, Yingli Zhao, Songhe Zhu, Zhou Cao, Dong Liang, Yonghua Lin, Ming Zhang, Shunfei Wang, Yanxin Zhou, Min Ye, Xuekai Chen, Xinyang Yu, Xiangjun Huang, Jian Yang
Paper:
https://arxiv.org/abs/2408.06567
Introduction
Language models have become integral to modern natural language processing (NLP) systems, powering applications such as machine translation, conversational agents, text summarization, and question answering. Recent advancements in large language models (LLMs) like GPT-3, BERT, and T5 have demonstrated remarkable proficiency across numerous tasks, emphasizing the importance of pretraining on large-scale datasets to achieve state-of-the-art results. However, traditional dense models face significant challenges in scalability and efficiency as parameter sizes increase.
Mixture of Experts (MoE) models have emerged as a promising solution to these challenges. By dynamically selecting different subsets of model parameters (experts) for various inputs, MoE architectures can scale to a much larger number of parameters without a corresponding increase in computational cost. This selective activation mechanism allows MoE models to achieve higher performance while maintaining computational efficiency. However, training such large-scale MoE models presents significant challenges, including the vast amounts of data and computational power required.
Methodology
The EfficientScale pipeline is designed to efficiently train a large-scale Mixture of Experts (MoE) model by leveraging knowledge transfer from smaller models. The process involves three main phases: Preparation, Scale-Up, and Scale-Out. Each phase plays a crucial role in ensuring effective knowledge transfer and continuous learning, resulting in a highly optimized MoE model.
Preparation Phase
The preparation phase involves training a small dense model and preparing the datasets required for subsequent phases. This phase ensures that the initial model has sufficient transferable knowledge and that the data is ready for effective training and validation.
- Model Training: Train a small dense model from scratch on a substantial amount of tokens or use an already pre-trained small model. This step ensures the model has accumulated sufficient transferable knowledge to serve as a robust starting point.
- Data Preparation: Collect, clean, and preprocess the training and validation datasets. This step involves managing large datasets to ensure they are suitable for training and validation purposes.
- Validation Setup: Develop training and validation datasets for monitoring the model's performance during the subsequent phases. Continuously tracking the language model's loss on the validation dataset is essential to verify that the initialized models retain the transferred knowledge and can learn new information effectively (a minimal loss-tracking sketch follows this list).
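As a concrete illustration of the loss tracking described above, here is a minimal sketch assuming a HuggingFace-style causal LM interface (a model that returns a cross-entropy loss when `labels` are passed); the function name and data loader are illustrative, not from the paper.

```python
import math
import torch

@torch.no_grad()
def validation_loss(model, val_loader, device="cuda"):
    """Mean next-token cross-entropy on a held-out validation set.

    Checking this value right after weight initialization, and periodically
    during continuous pretraining, shows whether a scaled model has retained
    the knowledge transferred from the smaller model.
    """
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in val_loader:                    # batch: token ids, shape (B, T)
        input_ids = batch.to(device)
        # HuggingFace-style causal LMs compute the shifted next-token loss
        # internally when labels are supplied.
        out = model(input_ids=input_ids, labels=input_ids)
        total_loss += out.loss.item()
        num_batches += 1
    mean_loss = total_loss / max(num_batches, 1)
    return mean_loss, math.exp(mean_loss)       # loss and its perplexity
```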
Scale-Up Phase
The Scale-Up phase involves two critical steps: initializing the weights of a larger dense model using the smaller model and performing continuous pretraining to ensure effective knowledge transfer and model enhancement.
Weight Initialization Strategies
The weights of the small dense model are used to initialize a larger dense model. bert2BERT proposes two strategies for this: Function Preserving Initialization (FPI) and Advanced Knowledge Initialization (AKI). Both the original bert2BERT experiments and ours show that AKI performs better. Recent research further shows that, when expanding depth, interpolation is more stable for continuous training than stacking. Moreover, the original AKI method is not compatible with Grouped Query Attention (GQA), so we modify the transformation of the attention-block weights to fit GQA. The resulting initialization method is AKI-Pro.
- Function Preserving Initialization (FPI): This strategy expands the intermediate dimension of an MLP layer by copying its input and output weight tensors, as illustrated in Figure 1, so that the larger model reproduces the function of the smaller model and inherits its knowledge (see the sketch after this list).
- Advanced Knowledge Initialization (AKI): AKI breaks the symmetry by expanding the width based not only on the weights of the same layer but also on those of the layer above it in the smaller model. This preserves the knowledge of the smaller model while ensuring better convergence.
- AKI-Pro: Our proposed improvement on AKI refines weight initialization in two respects: the depth-growing method (interpolation instead of stacking) and compatibility with GQA.
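To make the function-preserving width expansion behind FPI concrete, here is a minimal sketch for a bias-free two-layer MLP. It simplifies FPI to uniform tiling of the intermediate units (the paper's method uses a more general mapping and, in AKI, also draws on the layer above); all names are illustrative.

```python
import torch

def fpi_expand_mlp(w_in, w_out, new_ff):
    """Function-preserving width expansion of a two-layer MLP without biases.

    w_in : (d_ff, d_model)   first projection,  h = act(w_in @ x)
    w_out: (d_model, d_ff)   second projection, y = w_out @ h
    new_ff: target intermediate size (an integer multiple of d_ff here).
    """
    d_ff, d_model = w_in.shape
    assert new_ff % d_ff == 0, "sketch assumes an integer expansion factor"
    k = new_ff // d_ff
    # Replicate each intermediate unit k times ...
    w_in_big = w_in.repeat(k, 1)          # (new_ff, d_model)
    # ... and divide the output projection by k so that summing over the
    # duplicated units reproduces the original output exactly.
    w_out_big = w_out.repeat(1, k) / k    # (d_model, new_ff)
    return w_in_big, w_out_big

# Quick check that the expanded MLP computes the same function.
d_model, d_ff, new_ff = 8, 16, 32
w_in, w_out = torch.randn(d_ff, d_model), torch.randn(d_model, d_ff)
x = torch.randn(d_model)
small_out = w_out @ torch.relu(w_in @ x)
w_in_big, w_out_big = fpi_expand_mlp(w_in, w_out, new_ff)
big_out = w_out_big @ torch.relu(w_in_big @ x)
assert torch.allclose(small_out, big_out, atol=1e-5)
```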
Continuous Pretraining Process
The scaled-up dense model undergoes continuous pretraining on a substantial amount of tokens. This phase ensures the successful transfer of knowledge and allows the model to acquire additional information from the data, enhancing its overall performance and capability.
Scale-Out Phase
The scale-out phase involves transforming the large dense model into a Mixture of Experts (MoE) model. This phase includes initializing the MoE model’s weights and performing continuous pretraining to refine the model’s knowledge and performance.
- MoE Weight Initialization: Aquila-MoE is initialized using Sparse Upcycling. The dense model checkpoint obtained from the Aquila dense model undergoes a transformation where each MLP layer is replaced by an MoE layer. The router parameters are randomly initialized following a normal distribution with a mean of 0 and a variance of 0.02.
- Continuous Pretraining of MoE: During both training and inference, two of the eight experts are activated for each token, resulting in approximately 30B activated parameters. To prevent training collapse, an additional load-balancing loss and a max z-loss are added to the final training objective (a minimal sketch of the upcycled layer and these auxiliary losses follows this list).
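The following is a minimal sketch of the scale-out step under the description above: each expert starts as a copy of the dense MLP (Sparse Upcycling), the router is drawn from a normal distribution with variance 0.02, and two of eight experts are selected per token. The class name, the loss coefficient, and the exact form of the auxiliary losses (a Switch-style load-balancing loss and a logsumexp-squared z-loss) are assumptions for illustration, not the paper's implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoELayer(nn.Module):
    """Replaces a dense MLP with an MoE layer whose experts are copies of it."""

    def __init__(self, dense_mlp: nn.Module, d_model: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Sparse Upcycling: every expert is initialized from the dense MLP.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_mlp) for _ in range(num_experts)])
        # Router weights ~ N(0, 0.02): variance 0.02, i.e. std = sqrt(0.02).
        self.router = nn.Linear(d_model, num_experts, bias=False)
        nn.init.normal_(self.router.weight, mean=0.0, std=0.02 ** 0.5)

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        gates, top_i = probs.topk(self.top_k, dim=-1)
        gates = gates / gates.sum(dim=-1, keepdim=True)   # renormalize top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return out, logits, top_i

def load_balancing_loss(logits, top_i, num_experts=8):
    """Switch-style auxiliary loss pushing the router toward uniform expert load."""
    probs = F.softmax(logits, dim=-1)                        # (tokens, E)
    # Fraction of (token, slot) assignments dispatched to each expert.
    dispatch = F.one_hot(top_i, num_experts).float().mean(dim=(0, 1))
    return num_experts * torch.sum(dispatch * probs.mean(dim=0))

def max_z_loss(logits, coef=1e-3):
    """Penalizes large logits (logsumexp squared) for numerical stability."""
    return coef * torch.logsumexp(logits, dim=-1).pow(2).mean()
```

In a transformer block, the dense MLP submodule would simply be swapped for `UpcycledMoELayer(block.mlp, d_model)`, and the two auxiliary terms added, with small coefficients, to the language-modeling loss.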
By following this structured approach, EfficientScale enables efficient training of large-scale models through systematic preparation, scaling up, and scaling out. This methodology leverages pre-trained smaller models to reduce data and computational requirements while ensuring efficient knowledge transfer and continuous learning. The result is a highly optimized MoE model capable of performing complex tasks with enhanced efficiency and performance.
Experiments
Datasets Description
A bilingual Chinese-English pretraining dataset of 4 TB of tokens was constructed. It includes webpages, arXiv papers, encyclopedic data, books, code, and QA pairs, and draws on a wide range of high-quality open-source pretraining corpora such as RedPajama-Data-V2, falcon-refinedweb, C4, Pile, WuDaoCorporaText, and ChineseWebText. The data underwent language filtering, heuristic refinement, deduplication, domain-specific filtering, data quality checks, removal of toxic and explicit content, and mixing in specified proportions.
Experimental Setups and Results
Scale-up Validation
For the scale-up experiment, a 1.3B Aquila2 architecture model was used as the baseline. This model was scaled up to a 7B model using two different methods: FPI and AKI. Additionally, a 7B model was trained from scratch to serve as a control. All three 7B models were trained using the same hyperparameters and on the same dataset for a specified number of steps. The validation loss of models with different initializations is shown in Table 1.
The loss convergence during training is shown in Figure 3. The experimental results indicate that the 7B models initialized with FPI and AKI exhibited significantly lower loss values than the 7B model trained from scratch, and converged notably faster. Consistent with the findings reported for bert2BERT, our results also show that AKI surpasses FPI after a certain number of steps.
Scale-out Validation
For the scale-out validation experiment, a 1.8B model was trained from scratch on 3.6T tokens. This model was then scaled out to an 8*1.8B configuration, followed by continuous pretraining on an additional 400B tokens. The model configurations and training hyperparameters are detailed in Table 3. The loss convergence on the training set is depicted in Figure 4.
These validation experiments verified the effectiveness of both the scale-up and scale-out approaches on smaller models. For the full-scale run, a 7B model was pretrained from scratch on 3.6T tokens, resulting in AquilaDense-7B. It was then scaled up to 16B and trained on a further 1.2T tokens, yielding AquilaDense-16B. Finally, it was scaled out to 8*16B and trained on 545B tokens, producing AquilaMoE. The model configurations and training parameters are presented in Table 3.
Model Evaluation
Evaluation of Foundation Models
Following OpenCompass, two evaluation methods were used: discriminant analysis evaluation and generative evaluation. In discriminant analysis evaluation, the question is combined with each candidate answer, the perplexity of every combination is computed, and the answer with the lowest perplexity is taken as the model's final output. In generative evaluation, the question alone is given as input and the answer area is left blank for the model to complete (a sketch of the perplexity-based scoring follows).
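Here is a minimal sketch of the discriminant (perplexity-based) scoring described above, assuming a HuggingFace-style causal LM; the checkpoint path is a placeholder, and any length normalization or prompt templates that OpenCompass applies are omitted.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; substitute the checkpoint under evaluation.
tokenizer = AutoTokenizer.from_pretrained("path/to/checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint").eval()

@torch.no_grad()
def pick_answer_by_perplexity(question: str, candidates: list[str]) -> str:
    """Concatenate the question with each candidate answer and return the
    candidate whose combination has the lowest perplexity."""
    perplexities = []
    for cand in candidates:
        ids = tokenizer(question + cand, return_tensors="pt").input_ids
        # With labels == input_ids the model returns the mean next-token
        # cross-entropy; exp(loss) is the sequence perplexity.
        loss = model(input_ids=ids, labels=ids).loss.item()
        perplexities.append(math.exp(loss))
    return candidates[perplexities.index(min(perplexities))]

print(pick_answer_by_perplexity(
    "Question: What is the capital of France? Answer: ",
    ["Paris", "Lyon", "Marseille"]))
```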
The performance of the AquilaDense-7B, AquilaDense-16B, and AquilaMoE (8*16B) models is presented in Table 4. Generally, the scores tend to improve as the model size increases, and AquilaMoE outperforms AquilaDense-16B on most tasks.
Evaluation of Fine-tuned Models
Table 5 presents the overall results of AquilaMoE-8*16B after fine-tuning across various benchmark datasets. The performance is measured using generative evaluation, and the results are expressed as percentages.
Comparison of Computational Efficiency
The details of the training process for both the scale-up + scale-out approach and the from-scratch approach are presented in Table 6. The time savings factor is the ratio of the total training time of the from-scratch approach to that of the scale-up + scale-out approach; the computational power savings factor is the analogous ratio of total GFLOPS-days.
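Stated as code, both ratios reduce to the same trivial helper (illustrative only; the totals themselves come from Table 6):

```python
def savings_factor(from_scratch_total: float, efficient_total: float) -> float:
    """Ratio of the from-scratch cost to the scale-up + scale-out cost.

    Applied to total training time it gives the time savings factor;
    applied to total GFLOPS-days it gives the computational power savings
    factor (reported below as roughly 4.12 and 3.35, respectively).
    """
    return from_scratch_total / efficient_total
```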
The method proposed in this paper significantly reduces both the computational power and the time required for training. By employing a scale-up and scale-out approach, a computational power savings factor of approximately 3.35 and a time savings factor of approximately 4.12 were achieved. Additionally, if starting with a pre-trained smaller model, the computational power and time required for the preparation phase can be further reduced. This approach not only accelerates the training process but also lowers the overall computational costs.
Conclusion and Future Work
AquilaMoE, a bilingual 8*16B mixture of experts (MoE) language model, was developed using the EfficientScale training method. EfficientScale optimizes performance while significantly reducing data requirements through its two scaling stages, Scale-Up and Scale-Out. The contributions are as follows:
- An effective training methodology that achieves knowledge transfer and continuous pretraining with significantly reduced data and computational needs.
- Innovative initialization strategies, Function Preserving Initialization (FPI) and Advanced Knowledge Initialization (AKI), which retain the transferred knowledge and achieve substantial loss reduction during continual pretraining.
- Successful training of 16B and 8*16B AquilaMoE models using these initialization strategies, enhancing performance and training efficiency.
Future work involves exploring the scalability of larger MoE models, investigating cross-linguistic knowledge transfer, developing new optimization techniques to further reduce training time and costs, fine-tuning for specific application domains, and ensuring the robustness and generalization of MoE models across diverse datasets and real-world applications.