
Mixture of experts

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions.[1] It differs from ensemble techniques in that typically only one or a few expert models will be run, rather than combining results from all models.

Basic theory

In mixture of experts, we always have the following ingredients, but they are constructed and combined differently.

  • There are experts $f_1, \dots, f_n$, each taking in the same input $x$ and producing outputs $f_1(x), \dots, f_n(x)$.
  • There is a single weighting function (aka gating function) $w$, which takes in $x$ and produces a vector of outputs $(w(x)_1, \dots, w(x)_n)$.
  • $\theta = (\theta_0, \theta_1, \dots, \theta_n)$ is the set of parameters. The parameter $\theta_0$ is for the weighting function.
  • Given an input $x$, the mixture of experts produces a single combined output by combining $f_1(x), \dots, f_n(x)$ according to the weights $w(x)_1, \dots, w(x)_n$ in some way.

Both the experts and the weighting function are trained by minimizing some form of loss function, generally by gradient descent. There is a lot of freedom in choosing the precise form of experts, the weighting function, and the loss function.
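As an illustration, here is a minimal NumPy sketch (not taken from any of the cited papers) of these ingredients: a few linear experts standing in for arbitrary expert networks, a linear-softmax gating function, and a weighted-sum combination. The dimensions and the choice of linear experts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                                   # input dimension, number of experts

# Experts f_1..f_n: arbitrary functions of x; here, linear maps as stand-ins.
expert_weights = [rng.normal(size=(d,)) for _ in range(n)]
def expert(i, x):
    return expert_weights[i] @ x              # f_i(x), a scalar output

# Weighting (gating) function w: maps x to a probability vector over the experts.
W_gate = rng.normal(size=(n, d))
def gate(x):
    logits = W_gate @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # w(x)_1 .. w(x)_n, sums to 1

# Combine the expert outputs according to the gate weights (here: a weighted sum).
x = rng.normal(size=(d,))
w = gate(x)
y = sum(w[i] * expert(i, x) for i in range(n))
print(w, y)
```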

Meta-pi network

The meta-pi network, reported by Hampshire and Waibel,[2] uses $f(x) = \sum_i w(x)_i f_i(x)$ as the output. The model is trained by performing gradient descent on the mean-squared error loss $L = \frac{1}{N} \sum_k \|y_k - f(x_k)\|^2$. The experts may be arbitrary functions.

In their original publication, they were solving the problem of classifying phonemes in a speech signal from 6 different Japanese speakers, 2 female and 4 male. They trained 6 experts, each being a "time-delayed neural network"[3] (essentially a multilayered convolution network over the mel spectrogram). They found that the resulting mixture of experts dedicated 5 experts to 5 of the speakers, but the 6th (male) speaker did not have a dedicated expert; instead, his voice was classified by a linear combination of the experts for the other 3 male speakers.

Adaptive mixtures of local experts

The adaptive mixtures of local experts[4][5] uses a Gaussian mixture model. Each expert simply predicts a Gaussian distribution, and totally ignores the input. Specifically, the $i$-th expert predicts that the output is $y \sim N(\mu_i, I)$, where $\mu_i$ is a learnable parameter. The weighting function is a linear-softmax function:

$$w(x)_i = \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}}$$

The mixture of experts predicts that the output is distributed according to the log-probability density function:

$$f_\theta(y \mid x) = \ln\left[\sum_i \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}} N(y \mid \mu_i, I)\right] = \ln\left[(2\pi)^{-d/2} \sum_i \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}} e^{-\frac{1}{2}\|y - \mu_i\|^2}\right]$$

It is trained by maximum likelihood estimation, that is, gradient ascent on $f_\theta(y \mid x)$. The gradient for the $i$-th expert is

$$\nabla_{\mu_i} f_\theta(y \mid x) = \frac{w(x)_i\, N(y \mid \mu_i, I)}{\sum_j w(x)_j\, N(y \mid \mu_j, I)} (y - \mu_i)$$

and the gradient for the weighting function is

$$\nabla_{k_i, b_i} f_\theta(y \mid x) = \begin{bmatrix} x \\ 1 \end{bmatrix} \left(\frac{w(x)_i\, N(y \mid \mu_i, I)}{\sum_j w(x)_j\, N(y \mid \mu_j, I)} - w(x)_i\right)$$

For each input-output pair $(x, y)$, the weighting function is changed to increase the weight on all experts that performed above average, and decrease the weight on all experts that performed below average. This encourages the weighting function to learn to select only the experts that make the right predictions for each input.

The $i$-th expert is changed to make its prediction closer to $y$, but the amount of change is proportional to $w(x)_i N(y \mid \mu_i, I)$. This has a Bayesian interpretation. Given input $x$, the prior probability that expert $i$ is the right one is $w(x)_i$, and $N(y \mid \mu_i, I)$ is the likelihood of the evidence $y$. So, $\frac{w(x)_i N(y \mid \mu_i, I)}{\sum_j w(x)_j N(y \mid \mu_j, I)}$ is the posterior probability for expert $i$, and so the rate of change for the $i$-th expert is proportional to its posterior probability.

In words, the experts that, in hindsight, seemed like the good experts to consult, are asked to learn on the example. The experts that, in hindsight, were not, are left alone.

The combined effect is that the experts become specialized: Suppose two experts are both good at predicting a certain kind of input, but one is slightly better. Then the weighting function would eventually learn to favor the better one. After that happens, the lesser expert is unable to obtain a high gradient signal, and becomes even worse at predicting that kind of input. Conversely, the lesser expert can become better at predicting other kinds of input, and is increasingly pulled away into another region. This has a positive feedback effect, causing each expert to move apart from the rest and take care of a local region alone (thus the name "local experts").
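The update rules above can be sketched in a few lines of NumPy. The sketch below assumes identity-covariance Gaussian experts and a linear-softmax gate as described; the gate update is written in the equivalent "posterior minus prior" form, and the toy data, dimensions, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n = 3, 2, 4          # input dim, output dim, number of experts
lr = 0.1

mu = rng.normal(size=(n, d_out))  # expert i predicts y ~ N(mu_i, I)
K  = rng.normal(size=(n, d_in))   # gate parameters k_i
b  = np.zeros(n)                  # gate parameters b_i

def gate(x):
    logits = K @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

def update(x, y):
    global mu, K, b
    w = gate(x)
    # Gaussian density of each expert (identity covariance, up to a shared constant).
    dens = np.exp(-0.5 * np.sum((y - mu) ** 2, axis=1))
    post = w * dens / np.sum(w * dens)        # posterior responsibility of each expert
    # Expert update: step size proportional to its posterior, as described above.
    mu += lr * post[:, None] * (y - mu)
    # Gate update: gradient ascent on the log-likelihood (posterior minus prior).
    K += lr * np.outer(post - w, x)
    b += lr * (post - w)

for _ in range(100):
    x = rng.normal(size=(d_in,))
    y = np.array([1.0, -1.0]) if x[0] > 0 else np.array([-1.0, 1.0])
    update(x, y)
print(gate(np.array([2.0, 0.0, 0.0])))   # inspect how the gate distributes weight for x[0] > 0
```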

Hierarchical MoE

Hierarchical mixtures of experts[6][7] use multiple levels of gating in a tree. Each gating is a probability distribution over the next level of gatings, and the experts are on the leaf nodes of the tree. They are similar to decision trees.

For example, a 2-level hierarchical MoE would have a first-order gating function $w_i$, second-order gating functions $w_{j\mid i}$, and experts $f_{j,i}$. The total prediction is then $\sum_i w_i(x) \sum_j w_{j\mid i}(x) f_{j,i}(x)$.
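A hypothetical sketch of this 2-level prediction, with linear scalar-output leaf experts and linear-softmax gates at both levels (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n1, n2 = 4, 2, 3                      # input dim, first-level branches, experts per branch

W1 = rng.normal(size=(n1, d))            # first-level gate
W2 = rng.normal(size=(n1, n2, d))        # one second-level gate per branch
experts = rng.normal(size=(n1, n2, d))   # leaf experts: linear maps to a scalar

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(x):
    w1 = softmax(W1 @ x)                                  # w_i(x)
    total = 0.0
    for i in range(n1):
        w2 = softmax(W2[i] @ x)                           # w_{j|i}(x)
        total += w1[i] * sum(w2[j] * (experts[i, j] @ x) for j in range(n2))
    return total

print(predict(rng.normal(size=(d,))))
```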

Variants

The mixture of experts, being similar to the gaussian mixture model, can also be trained by the expectation-maximization algorithm, just like gaussian mixture models. Specifically, during the expectation step, the "burden" for explaining each data point is assigned over the experts, and during the maximization step, the experts are trained to improve the explanations they got a high burden for, while the gate is trained to improve its burden assignment. This can converge faster than gradient ascent on the log-likelihood.[7][8]

The gating function is most often a softmax. Beyond that, [9] proposed using Gaussian distributions, and [8] proposed using exponential families.

Instead of performing a weighted sum of all the experts, in hard MoE [10] only the highest ranked expert is chosen. That is, $f(x) = f_{\arg\max_i w_i(x)}(x)$. This can accelerate training and inference time.[11]
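A minimal sketch of hard routing, with two toy experts and a fixed (non-learned) gate standing in for $w$:

```python
import numpy as np

def hard_moe(x, experts, gate_weights):
    """Route x to the single highest-ranked expert: f(x) = f_{argmax_i w_i(x)}(x)."""
    w = gate_weights(x)
    return experts[int(np.argmax(w))](x)

# Hypothetical usage with two toy experts and a fixed gate:
experts = [lambda x: x * 2.0, lambda x: x - 1.0]
gate_weights = lambda x: np.array([0.3, 0.7]) if x > 0 else np.array([0.8, 0.2])
print(hard_moe(1.5, experts, gate_weights))   # second expert is chosen: 0.5
```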

The experts can use more general forms of multivariate Gaussian distributions. For example, [6] proposed $f_i(y \mid x) = N(y \mid A_i x + b_i, \Sigma_i)$, where $A_i, b_i, \Sigma_i$ are learnable parameters. In words, each expert learns to do linear regression, with a learnable uncertainty estimate.

One can also use experts other than Gaussian distributions. For example, one can use the Laplace distribution[12] or Student's t-distribution.[13] For binary classification, logistic regression experts have also been proposed, with

$$f_i(y \mid x) = \begin{cases} \dfrac{1}{1 + e^{\beta_i^T x + \beta_{i,0}}}, & y = 0 \\ 1 - \dfrac{1}{1 + e^{\beta_i^T x + \beta_{i,0}}}, & y = 1 \end{cases}$$

where $\beta_i, \beta_{i,0}$ are learnable parameters. This was later generalized to multi-class classification, with multinomial logistic regression experts.[14]
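For illustration, a logistic-regression expert's output probability can be computed as below; the sign convention follows the formula above, and the parameter values are arbitrary.

```python
import numpy as np

def logistic_expert(y, x, beta, beta0):
    """P(y | x) for a binary logistic-regression expert with parameters (beta, beta0)."""
    p1 = 1.0 / (1.0 + np.exp(-(beta @ x + beta0)))   # probability of y = 1
    return p1 if y == 1 else 1.0 - p1

print(logistic_expert(1, np.array([0.5, -1.0]), np.array([2.0, 1.0]), 0.1))
```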

Deep learning

The previous section described MoE as it was used before the era of deep learning. With the rise of deep learning, MoE found applications in running the largest models, as a simple way to perform conditional computation: only parts of the model are used, with the parts chosen according to the input.[15]

The earliest paper that applies MoE to deep learning is [16], which proposes using a different gating network at each layer in a deep neural network. Specifically, each gating is a linear-ReLU-linear-softmax network, and each expert is a linear-ReLU network.

The key design desideratum for MoE in deep learning is to reduce computing cost. Consequently, for each query, only a small subset of the experts should be queried. This makes MoE in deep learning different from classical MoE: in classical MoE, the output for each query is a weighted sum of all experts' outputs, whereas in deep learning MoE, the output for each query involves only a few experts' outputs. Consequently, the key design choice in MoE becomes routing: given a batch of queries, how to route the queries to the best experts.

Sparsely-gated MoE layer

The sparsely-gated MoE layer,[17] published by researchers from Google Brain, uses feedforward networks as experts and linear-softmax gating. Similarly to the previously proposed hard MoE, they achieve sparsity by taking a weighted sum of only the top-k experts, instead of the weighted sum of all of them. Specifically, in an MoE layer, there are feedforward networks $f_1, \dots, f_n$ and a gating network $w$. The gating network is defined by $w(x) = \mathrm{softmax}(\mathrm{top}_k(Wx + \text{noise}))$, where $\mathrm{top}_k$ is a function that keeps the top-k entries of a vector the same, but sets all other entries to $-\infty$. The addition of noise helps with load balancing.
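A simplified NumPy sketch of this gating function is given below. The noise term in the original paper is more elaborate (a learned, softplus-scaled Gaussian noise); here it is reduced to plain Gaussian noise for illustration, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 8, 6, 2                    # input dim, number of experts, top-k

W_gate = rng.normal(size=(n, d))

def top_k_mask(v, k):
    """Keep the k largest entries of v, set the rest to -inf (so softmax zeroes them)."""
    out = np.full_like(v, -np.inf)
    idx = np.argsort(v)[-k:]
    out[idx] = v[idx]
    return out

def sparse_gate(x, noise_scale=1.0):
    logits = W_gate @ x + noise_scale * rng.normal(size=n)   # noisy gating logits
    masked = top_k_mask(logits, k)
    e = np.exp(masked - masked[np.isfinite(masked)].max())
    return e / e.sum()                                       # nonzero on only k experts

x = rng.normal(size=(d,))
w = sparse_gate(x)
print(w, np.count_nonzero(w))        # exactly k nonzero weights
```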

The choice of $k$ is a hyperparameter that is chosen according to application. Typical values are $k = 1, 2$. The $k = 1$ version is also called the Switch Transformer.[18]

As a demonstration, they trained a series of models for machine translation with alternating layers of MoE and LSTM, and compared them with deep LSTM models.[19] Table 3 shows that the MoE models used less inference-time compute, despite having 30x more parameters.

Vanilla MoE tends to have issues with load balancing: some experts are consulted often, while other experts rarely or not at all. To encourage the gate to select each expert with equal frequency (proper load balancing) within each batch, each MoE layer has two auxiliary loss functions. This was improved by [18] into a single auxiliary loss function. Specifically, let $n$ be the number of experts; then for a given batch of queries $\{x_1, x_2, \dots, x_T\}$, the auxiliary loss for the batch is

$$n \sum_{i=1}^{n} f_i P_i$$

Here, $f_i = \frac{1}{T} \#(\text{queries sent to expert } i)$ is the fraction of queries for which expert $i$ is ranked highest, and $P_i = \frac{1}{T}\sum_{j=1}^{T} w_i(x_j)$ is the fraction of weight on expert $i$. This loss is minimized at $1$, precisely when every expert has equal weight $\frac{1}{n}$ in all situations.
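A hypothetical NumPy sketch of this auxiliary loss, taking as input the gate's softmax probabilities for a batch (in practice the loss is computed inside the training graph so that it is differentiable through $P_i$):

```python
import numpy as np

def load_balancing_loss(gate_probs):
    """gate_probs: array of shape (T, n), each row the gate's softmax over n experts.

    Returns n * sum_i f_i * P_i, where f_i is the fraction of queries whose top-1
    expert is i, and P_i is the mean gate probability assigned to expert i.
    """
    T, n = gate_probs.shape
    top1 = np.argmax(gate_probs, axis=1)
    f = np.bincount(top1, minlength=n) / T      # fraction of queries routed to each expert
    P = gate_probs.mean(axis=0)                 # fraction of weight on each expert
    return n * np.sum(f * P)

# Perfectly balanced gate probabilities give the minimum value 1:
uniform = np.full((10, 4), 0.25)
print(load_balancing_loss(uniform))             # 1.0
```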

Routing

In sparsely-gated MoE, only the top-k experts are queried, and their outputs are combined by a weighted sum. There are other methods.[20]

In Hash MoE,[21] routing is performed deterministically by a hash function, fixed before learning begins. For example, if the model is a 4-layered Transformer, the input is a token for the word "eat", and the hash of "eat" is $(1, 4, 2, 3)$, then the token would be routed to the 1st expert in layer 1, the 4th expert in layer 2, etc. Despite its simplicity, it achieves performance competitive with sparsely gated MoE with $k = 1$.
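A minimal sketch of hash routing follows. The particular hash function and per-layer salting are illustrative assumptions; the point is only that the expert choice is a fixed function of the token, decided before training.

```python
import hashlib

def hash_route(token, layer, n_experts):
    """Deterministically map a token (string) and layer index to an expert index."""
    digest = hashlib.sha256(f"{token}:{layer}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_experts

# Hypothetical usage: route the token "eat" in each layer of a 4-layer model.
print([hash_route("eat", layer, n_experts=8) for layer in range(4)])
```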

In soft MoE, suppose that in each batch each expert can process $p$ queries; then there are $n \times p$ queries that can be assigned per batch. Now, for each batch of queries $x_1, x_2, \dots, x_T$, the soft MoE layer computes an array $w_{i,j,k}$, such that $(w_{i,j,1}, \dots, w_{i,j,T})$ is a probability distribution over queries, and the $i$-th expert's $j$-th query is $\sum_k w_{i,j,k} x_k$.[22] However, this does not work with autoregressive modelling, since the weights $w_{i,j,k}$ over one token depend on all other tokens.[23]
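A hypothetical sketch of the dispatch step of soft MoE: each of the $n \times p$ expert slots receives a convex combination of all queries in the batch. The per-slot parameter matrix (called `Phi` here) is an assumed name, and the combine step that maps expert outputs back to tokens is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, n, p = 5, 8, 3, 2             # queries per batch, dim, experts, slots per expert

X = rng.normal(size=(T, d))         # batch of queries x_1..x_T
Phi = rng.normal(size=(n * p, d))   # one learnable vector per expert slot

# Dispatch weights: for each slot (i, j), a softmax over the T queries.
logits = Phi @ X.T                             # shape (n*p, T)
W = np.exp(logits - logits.max(axis=1, keepdims=True))
W /= W.sum(axis=1, keepdims=True)              # each row is a distribution over queries

slots = W @ X                                  # (n*p, d): the j-th input of expert i
print(slots.shape, W.sum(axis=1))              # each row of W sums to 1
```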

Other approaches include solving the routing as a constrained linear programming problem,[24] making each expert choose the top-k queries it wants (instead of each query choosing the top-k experts for it),[25] and using reinforcement learning to train the routing algorithm (since picking an expert is a discrete action, like in RL).[26]

Capacity factor

Suppose there are $n$ experts in a layer. For a given batch of queries $\{x_1, x_2, \dots, x_T\}$, each query is routed to one or more experts. For example, if each query is routed to one expert, as in Switch Transformers, and if the experts are load-balanced, then each expert should expect on average $T/n$ queries in a batch. In practice, the experts cannot expect perfect load balancing: in some batches, one expert might be underworked, while in other batches it would be overworked.

Since the inputs cannot move through the layer until every expert in the layer has finished the queries it is assigned, load balancing is important. As a hard constraint on load balancing, there is the capacity factor: each expert is only allowed to process up to $c \cdot T/n$ queries in a batch. [20] found $c \in [1.25, 2]$ to work in practice.
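A minimal sketch of applying a capacity factor to top-1 routing decisions; what happens to overflowing queries varies by implementation (often they simply pass through the layer's residual connection), so this sketch just marks them as dropped.

```python
import math

def assign_with_capacity(routes, n_experts, capacity_factor=1.25):
    """routes: list of the chosen expert index per query (top-1 routing).

    Each expert accepts at most ceil(capacity_factor * T / n_experts) queries;
    later queries routed to a full expert are dropped (returned as None).
    """
    T = len(routes)
    capacity = math.ceil(capacity_factor * T / n_experts)
    load = [0] * n_experts
    assignment = []
    for e in routes:
        if load[e] < capacity:
            load[e] += 1
            assignment.append(e)
        else:
            assignment.append(None)      # overflow: this query skips the MoE layer
    return assignment, capacity

print(assign_with_capacity([0, 0, 0, 1, 2, 0], n_experts=3))
```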

Applications to Transformer models

MoE layers are used in very large Transformer models, for which learning and inferring over the full model is too costly. In Transformer models, the MoE layers are often used to select the feedforward layers (typically a linear-ReLU-linear network), appearing in each Transformer block after the multiheaded attention. This is because the feedforward layers take up an increasing portion of the computing cost as models grow larger. For example, 90% of parameters in PaLM 540B are in feedforward layers.[27]

A series of large language models from Google used MoE. GShard[28] uses MoE with up to top-2 experts per layer. Specifically, the top-1 expert is always selected, and the 2nd-ranked expert is selected with probability proportional to that expert's weight according to the gating function. Later, GLaM[29] demonstrated a language model with 1.2 trillion parameters, with each MoE layer using the top-2 out of 64 experts. Switch Transformers[18] use top-1 routing in all MoE layers.

The NLLB-200 by Meta AI is a machine translation model for 200 languages.[30] Each MoE layer uses a hierarchical MoE with two levels. On the first level, the gating function chooses to use either a "shared" feedforward layer, or to use the experts. If using the experts, then another gating function computes the weights and chooses the top-2 experts (see Figure 19).[31]

MoE large language models can be adapted for downstream tasks by instruction tuning.[32]

Generally, MoE is used when dense models have become too costly. As of 2023, the largest models tend to be large language models. Outside of those, Vision MoE[33] is a Transformer model with MoE layers. They demonstrated it by training a model with 15 billion parameters.

In December 2023, the French startup Mistral AI released the open-source model Mixtral 8x7B, a high-quality sparse mixture-of-experts (SMoE) model with open weights. It is licensed under Apache 2.0 and, according to the company's blog post, outperforms Llama 2 70B on most benchmarks with 6x faster inference, and also shows superior performance compared to GPT-3.5.[34] Mixtral 8x7B is noted for its cost/performance trade-offs, its handling of multiple languages (English, French, Italian, German, and Spanish), and its strong code generation performance. The model is a decoder-only model with 46.7B total parameters, but it uses only 12.9B parameters per token. Mixtral 8x7B is also available in an instruction-tuned version optimized for instruction following.

Further reading

  • Before deep learning era
    • McLachlan, Geoffrey J.; Peel, David (2000). Finite mixture models. Wiley series in probability and statistics applied probability and statistics section. New York Chichester Weinheim Brisbane Singapore Toronto: John Wiley & Sons, Inc. ISBN 978-0-471-00626-8.
    • Yuksel, S. E.; Wilson, J. N.; Gader, P. D. (August 2012). "Twenty Years of Mixture of Experts". IEEE Transactions on Neural Networks and Learning Systems. 23 (8): 1177–1193. doi:10.1109/TNNLS.2012.2200299. ISSN 2162-237X. PMID 24807516. S2CID 9922492.
    • Masoudnia, Saeed; Ebrahimpour, Reza (12 May 2012). "Mixture of experts: a literature survey". Artificial Intelligence Review. 42 (2): 275–293. doi:10.1007/s10462-012-9338-y. S2CID 3185688.
    • Nguyen, Hien D.; Chamroukhi, Faicel (July 2018). "Practical and theoretical aspects of mixture‐of‐experts modeling: An overview". WIREs Data Mining and Knowledge Discovery. 8 (4). doi:10.1002/widm.1246. ISSN 1942-4787. S2CID 49301452.
  • Deep learning era
    • Zoph, Barret; Bello, Irwan; Kumar, Sameer; Du, Nan; Huang, Yanping; Dean, Jeff; Shazeer, Noam; Fedus, William (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models". arXiv:2202.08906 [cs.CL].

See also

  • Product of experts
  • Mixture models
  • Mixture of Gaussians
  • Ensemble learning

References

  1. ^ Baldacchino, Tara; Cross, Elizabeth J.; Worden, Keith; Rowson, Jennifer (2016). "Variational Bayesian mixture of experts models and sensitivity analysis for nonlinear dynamical systems". Mechanical Systems and Signal Processing. 66–67: 178–200. Bibcode:2016MSSP...66..178B. doi:10.1016/j.ymssp.2015.05.009.
  2. ^ Hampshire, J.B.; Waibel, A. (July 1992). "The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 14 (7): 751–769. doi:10.1109/34.142911.
  3. ^ Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, Kevin J. Lang (1995). "Phoneme Recognition Using Time-Delay Neural Networks*". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation. Psychology Press. doi:10.4324/9780203763247. ISBN 978-0-203-76324-7.{{cite book}}: CS1 maint: multiple names: authors list (link)
  4. ^ Nowlan, Steven; Hinton, Geoffrey E (1990). "Evaluation of Adaptive Mixtures of Competing Experts". Advances in Neural Information Processing Systems. Morgan-Kaufmann. 3.
  5. ^ Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN 0899-7667. PMID 31141872. S2CID 572361.
  6. ^ a b Jordan, Michael; Jacobs, Robert (1991). "Hierarchies of adaptive experts". Advances in Neural Information Processing Systems. Morgan-Kaufmann. 4.
  7. ^ a b Jordan, Michael I.; Jacobs, Robert A. (March 1994). "Hierarchical Mixtures of Experts and the EM Algorithm". Neural Computation. 6 (2): 181–214. doi:10.1162/neco.1994.6.2.181. ISSN 0899-7667.
  8. ^ a b Jordan, Michael I.; Xu, Lei (1995-01-01). "Convergence results for the EM approach to mixtures of experts architectures". Neural Networks. 8 (9): 1409–1431. doi:10.1016/0893-6080(95)00014-3. hdl:1721.1/6620. ISSN 0893-6080.
  9. ^ Xu, Lei; Jordan, Michael; Hinton, Geoffrey E (1994). "An Alternative Model for Mixtures of Experts". Advances in Neural Information Processing Systems. MIT Press. 7.
  10. ^ Collobert, Ronan; Bengio, Samy; Bengio, Yoshua (2001). "A Parallel Mixture of SVMs for Very Large Scale Problems". Advances in Neural Information Processing Systems. MIT Press. 14.
  11. ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "12: Applications". Deep learning. Adaptive computation and machine learning. Cambridge, Mass: The MIT press. ISBN 978-0-262-03561-3.
  12. ^ Nguyen, Hien D.; McLachlan, Geoffrey J. (2016-01-01). "Laplace mixture of linear experts". Computational Statistics & Data Analysis. 93: 177–191. doi:10.1016/j.csda.2014.10.016. ISSN 0167-9473.
  13. ^ Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.
  14. ^ Chen, K.; Xu, L.; Chi, H. (1999-11-01). "Improved learning algorithms for mixture of experts in multiclass classification". Neural Networks. 12 (9): 1229–1252. doi:10.1016/S0893-6080(99)00043-X. ISSN 0893-6080. PMID 12662629.
  15. ^ Bengio, Yoshua; Léonard, Nicholas; Courville, Aaron (2013). "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation". arXiv:1308.3432 [cs.LG].
  16. ^ Eigen, David; Ranzato, Marc'Aurelio; Sutskever, Ilya (2013). "Learning Factored Representations in a Deep Mixture of Experts". arXiv:1312.4314 [cs.LG].
  17. ^ Shazeer, Noam; Mirhoseini, Azalia; Maziarz, Krzysztof; Davis, Andy; Le, Quoc; Hinton, Geoffrey; Dean, Jeff (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". arXiv:1701.06538 [cs.LG].
  18. ^ a b c Fedus, William; Zoph, Barret; Shazeer, Noam (2022-01-01). "Switch transformers: scaling to trillion parameter models with simple and efficient sparsity". The Journal of Machine Learning Research. 23 (1): 5232–5270. arXiv:2101.03961. ISSN 1532-4435.
  19. ^ Wu, Yonghui; Schuster, Mike; Chen, Zhifeng; Le, Quoc V.; Norouzi, Mohammad; Macherey, Wolfgang; Krikun, Maxim; Cao, Yuan; Gao, Qin; Macherey, Klaus; Klingner, Jeff; Shah, Apurva; Johnson, Melvin; Liu, Xiaobing; Kaiser, Łukasz (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv:1609.08144 [cs.CL].
  20. ^ a b Zoph, Barret; Bello, Irwan; Kumar, Sameer; Du, Nan; Huang, Yanping; Dean, Jeff; Shazeer, Noam; Fedus, William (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models". arXiv:2202.08906 [cs.CL].
  21. ^ Roller, Stephen; Sukhbaatar, Sainbayar; szlam, arthur; Weston, Jason (2021). "Hash Layers For Large Sparse Models". Advances in Neural Information Processing Systems. Curran Associates. 34: 17555–17566. arXiv:2106.04426.
  22. ^ Puigcerver, Joan; Riquelme, Carlos; Mustafa, Basil; Houlsby, Neil (2023). "From Sparse to Soft Mixtures of Experts". arXiv:2308.00951 [cs.LG].
  23. ^ Wang, Phil (2023-10-04), lucidrains/soft-moe-pytorch, retrieved 2023-10-08
  24. ^ Lewis, Mike; Bhosale, Shruti; Dettmers, Tim; Goyal, Naman; Zettlemoyer, Luke (2021-07-01). "BASE Layers: Simplifying Training of Large, Sparse Models". Proceedings of the 38th International Conference on Machine Learning. PMLR: 6265–6274. arXiv:2103.16716.
  25. ^ Zhou, Yanqi; Lei, Tao; Liu, Hanxiao; Du, Nan; Huang, Yanping; Zhao, Vincent; Dai, Andrew M.; Chen, Zhifeng; Le, Quoc V.; Laudon, James (2022-12-06). "Mixture-of-Experts with Expert Choice Routing". Advances in Neural Information Processing Systems. 35: 7103–7114. arXiv:2202.09368.
  26. ^ Bengio, Emmanuel; Bacon, Pierre-Luc; Pineau, Joelle; Precup, Doina (2015). "Conditional Computation in Neural Networks for faster models". arXiv:1511.06297 [cs.LG].
  27. ^ "Transformer Deep Dive: Parameter Counting". Transformer Deep Dive: Parameter Counting. Retrieved 2023-10-10.
  28. ^ Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv:2006.16668 [cs.CL].
  29. ^ Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao (2021). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". arXiv:2112.06905 [cs.CL].
  30. ^ "200 languages within a single AI model: A breakthrough in high-quality machine translation". ai.facebook.com. 2022-06-19. Archived from the original on 2023-01-09.
  31. ^ NLLB Team; Costa-jussà, Marta R.; Cross, James; Çelebi, Onur; Elbayad, Maha; Heafield, Kenneth; Heffernan, Kevin; Kalbassi, Elahe; Lam, Janice; Licht, Daniel; Maillard, Jean; Sun, Anna; Wang, Skyler; Wenzek, Guillaume; Youngblood, Al (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv:2207.04672 [cs.CL].
  32. ^ Shen, Sheng; Hou, Le; Zhou, Yanqi; Du, Nan; Longpre, Shayne; Wei, Jason; Chung, Hyung Won; Zoph, Barret; Fedus, William; Chen, Xinyun; Vu, Tu; Wu, Yuexin; Chen, Wuyang; Webson, Albert; Li, Yunxuan (2023). "Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for Large Language Models". arXiv:2305.14705 [cs.CL].
  33. ^ Riquelme, Carlos; Puigcerver, Joan; Mustafa, Basil; Neumann, Maxim; Jenatton, Rodolphe; Susano Pinto, André; Keysers, Daniel; Houlsby, Neil (2021). "Scaling Vision with Sparse Mixture of Experts". Advances in Neural Information Processing Systems. 34: 8583–8595. arXiv:2106.05974.
  34. ^ "200 Mixtral of experts: A high quality Sparse Mixture-of-Experts". mistral.ai. 2023-12-11.
