The field of Natural Language Processing (NLP) has made significant progress in recent years due to the development of large, sophisticated deep learning models, with transformer-based language models being the most popular. These models can capture complex linguistic patterns and generalize across diverse contexts, making them suitable for a wide range of NLP tasks. However, their growing size and computational requirements present serious challenges in terms of training efficiency, memory footprint, and deployment costs.
To address these challenges, models with sparsely activated Mixture of Experts (MoEs) have been proposed, which can significantly reduce the computational cost of Large Language Models (LLMs). MoE models decompose language models into smaller, specialized sub-models, or “experts”, that focus on distinct aspects of the input data, thereby enabling more efficient computation and resource allocation.
Mixture of Experts Routing
Mixture of Experts (MoE) is a technique used in natural language processing that divides a model into specialized sub-models called experts and activates only one or a few experts for each input token. MoE routing can be sparse or dense: sparse MoE selects only a subset of experts when routing each token, reducing computational cost compared to dense MoE, which activates all of them. Recent works have implemented sparse routing via k-means clustering, linear assignment, or hashing. Google's GLaM and V-MoE advanced the state of the art in natural language processing and computer vision, respectively, using sparsely gated MoE with top-k token routing, demonstrating better performance scaling with sparsely activated MoE layers.
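The top-k token routing described above can be sketched in a few lines of numpy. This is an illustrative sketch, not any paper's implementation: `gate_weights` stands in for a hypothetical learned gating matrix, and each token independently picks its k highest-scoring experts.

```python
import numpy as np

def token_choice_routing(tokens, gate_weights, k=2):
    """Top-k token-choice routing: each token independently picks its k
    highest-scoring experts. `gate_weights` is a hypothetical learned
    (d_model, n_experts) gating matrix."""
    logits = tokens @ gate_weights                                       # (n_tokens, n_experts)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax over experts
    chosen = np.argsort(-probs, axis=-1)[:, :k]                          # k best experts per token
    return chosen, probs

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))   # 6 tokens with hidden size 4
gate_w = rng.normal(size=(4, 8))   # 8 experts
chosen, probs = token_choice_routing(tokens, gate_w)
```

Note that every token gets exactly k experts here regardless of how hard it is, which is precisely the rigidity discussed next.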
Previous sparsely gated networks introduced auxiliary losses to prevent too many tokens from being routed to a single expert, but their effectiveness was limited. As a result, token-choice routing must overprovision expert capacity by a significant margin to avoid dropping tokens when an expert's buffer overflows. Additionally, most prior works allocate a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens, which can lead to load imbalance. The proposed approach instead routes each token to a variable number of experts, conditioned on the token's importance or difficulty.
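For concreteness, one widely used form of the auxiliary loss mentioned above is the load-balancing loss from Switch Transformer; this is an illustrative formulation of that idea, not a loss from the Expert Choice paper itself:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, n_experts):
    """Switch Transformer-style auxiliary load-balancing loss (shown for
    illustration): n_experts times the dot product of the fraction of
    tokens dispatched to each expert and the mean router probability for
    that expert. It equals 1.0 under perfectly uniform routing and grows
    as routing collapses onto a few experts."""
    n_tokens = router_probs.shape[0]
    frac = np.bincount(expert_index, minlength=n_experts) / n_tokens  # dispatch fraction per expert
    mean_prob = router_probs.mean(axis=0)                             # mean gate probability per expert
    return n_experts * float(np.dot(frac, mean_prob))

# Perfectly balanced routing over 4 experts -> loss of 1.0
balanced = load_balancing_loss(np.full((8, 4), 0.25), np.arange(8) % 4, 4)
# All tokens collapsed onto expert 0 -> loss well above 1.0
probs = np.tile(np.array([0.97, 0.01, 0.01, 0.01]), (8, 1))
collapsed = load_balancing_loss(probs, np.zeros(8, dtype=int), 4)
```

Because this loss only penalizes imbalance softly, buffers can still overflow in practice, which is why token-choice systems overprovision capacity.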
Expert Choice Routing
Researchers propose a new approach to Mixture of Experts (MoE) routing called Expert Choice (EC) routing, which addresses the load imbalance and capacity overprovisioning of previous sparsely gated networks. In EC routing, instead of having tokens select the top-k experts, each expert, with a predetermined buffer capacity, selects its top-k tokens, guaranteeing even load balancing while allowing a variable number of experts per token. EC routing achieves substantial gains in training efficiency and downstream performance, speeding up training convergence by over 2x in an 8B/64E model compared to the top-1 and top-2 gating counterparts in Switch Transformer, GShard, and GLaM.
To learn token-to-expert affinity, the method produces a token-to-expert score matrix that indicates how likely each token is to be routed to each expert. A top-k function is applied along the token dimension for each expert to pick its most relevant tokens, and a permutation based on the resulting token indices produces a hidden state with an additional expert dimension. The data is thus split across experts so that all experts can execute the same computational kernel concurrently, each on its own subset of tokens. By eliminating the capacity overprovisioning that load imbalance forces on token-choice routing, EC routing reduces training and inference step time by around 20% compared to GLaM. Overall, EC routing achieves better performance scaling and faster training convergence than previous sparsely gated models.
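The score matrix, per-expert top-k along the token dimension, and gather into an expert-batched tensor can be sketched as follows. This is a minimal numpy sketch under the description above; the variable names and the capacity-factor convention are illustrative assumptions, not the authors' code:

```python
import numpy as np

def expert_choice_routing(tokens, gate_weights, capacity_factor=2):
    """Sketch of Expert Choice routing: each expert fills a fixed-size
    buffer with its own top-k tokens, so the load is perfectly balanced
    by construction."""
    n_tokens = tokens.shape[0]
    n_experts = gate_weights.shape[1]
    logits = tokens @ gate_weights                                        # (n_tokens, n_experts)
    scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # token-to-expert scores
    k = n_tokens * capacity_factor // n_experts                           # per-expert buffer size
    # top-k along the *token* dimension: each expert picks its k best tokens
    token_idx = np.argsort(-scores, axis=0)[:k].T                         # (n_experts, k)
    gates = np.take_along_axis(scores.T, token_idx, axis=1)               # gating weights (n_experts, k)
    expert_inputs = tokens[token_idx]                                     # (n_experts, k, d_model)
    return expert_inputs, gates, token_idx

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 8))   # 16 tokens, hidden size 8
w = rng.normal(size=(8, 4))    # 4 experts
inputs, gates, idx = expert_choice_routing(x, w)
```

Because every expert receives exactly k tokens, all expert batches have identical shape, which is what lets the experts run the same kernel concurrently.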
Model Architecture and Evaluation
Expert Choice Routing’s model design is based on a sparsely activated Mixture-of-Experts (MoE) technique. The approach uses a Transformer architecture and replaces the feed-forward component of every other Transformer layer with a MoE layer, which consists of a group of independent feed-forward networks called “experts”. Each MoE layer uses a gating function with a softmax activation to model a probability distribution over experts, and activates the best subset of experts using a top-k function along the token dimension.
To improve model performance and training efficiency, the proposed approach interleaves regular Transformer layers and MoE layers, replaces the standard positional embedding with a per-layer relative positional bias, and replaces the first linear projection and activation function with the Gated Linear Unit. During training, the learnable gating network in each MoE layer is trained to activate the best subset of experts for each token using a top-k function along the token dimension. To mitigate the negative effects of skipped tokens, certain components are shared between MoE layers.
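The Gated Linear Unit substitution mentioned above replaces the feed-forward block's first projection and activation with two parallel projections, one of which gates the other elementwise. The sketch below uses a sigmoid gate, which is one common GLU variant; the exact activation used by the model is an assumption here:

```python
import numpy as np

def glu_ffn(x, w_gate, w_up, w_out):
    """Feed-forward block whose first projection + activation is replaced
    by a Gated Linear Unit. The sigmoid gate is one common GLU variant
    (an assumption, not confirmed by the source)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # gating branch (sigmoid)
    up = x @ w_up                               # parallel linear branch
    return (gate * up) @ w_out                  # elementwise gate, then project back

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 8))      # 3 tokens, d_model = 8
w_g = rng.normal(size=(8, 32))   # d_ff = 32
w_u = rng.normal(size=(8, 32))
w_o = rng.normal(size=(32, 8))
y = glu_ffn(x, w_g, w_u, w_o)
```

In an MoE layer, each expert would be one such feed-forward block applied to the tokens it received from the router.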
Despite having more parameters in the MoE layer, the activated model size per token can be comparable to a dense layer, because only a limited subset of experts is activated for any given token, making the approach efficient. Compared to previous works such as Switch Transformer and GShard, the proposed approach has been shown to achieve better performance scaling and training efficiency.
This article was drafted with the assistance of A.I., with reference to the sources below:
The work described in this article was supported by InnoHK initiative, The Government of the HKSAR, and Laboratory for AI-Powered Financial Technologies.
(AIFT strives but cannot guarantee the accuracy and reliability of the content, and will not be responsible for any loss or damage caused by any inaccuracy or omission.)