Artificial neural networks have emerged as the cornerstone of deep learning, offering a remarkable ability to extract valuable insights from vast amounts of data. However, the efficacy of these networks hinges heavily on their parameter count. Mixture-of-Experts (MoE) offers an efficient way to dramatically increase a model's capacity without a proportional increase in computational cost.

Originally proposed in 1991 by Robert A. Jacobs et al., MoE follows a conditional computation paradigm: only selected parts of an ensemble of sub-networks, referred to as experts, are activated depending on the input at hand. The MoE structure thus appeared long before the popularization of deep learning.
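To make the conditional-computation idea concrete, here is a minimal, illustrative sketch (not code from the original paper): a softmax gate scores the experts and only the top-scoring ones are evaluated for a given input. All names here (`Expert`, `moe_forward`, `top_k`) are assumptions chosen for illustration.

```python
# Minimal sketch of conditional computation with a gated mixture of experts.
# Illustrative only; hyperparameters and class names are assumed, not prescribed.
import numpy as np

rng = np.random.default_rng(0)

class Expert:
    """A tiny linear 'expert' sub-network."""
    def __init__(self, d_in, d_out):
        self.W = rng.normal(scale=0.1, size=(d_in, d_out))

    def __call__(self, x):
        return x @ self.W

def moe_forward(x, experts, gate_W, top_k=2):
    """Route input x to the top_k experts chosen by a softmax gate."""
    logits = x @ gate_W                      # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]      # indices of the selected experts
    # Only the selected experts are evaluated: conditional computation.
    return sum(probs[i] * experts[i](x) for i in chosen)

d_in, d_out, n_experts = 8, 4, 4
experts = [Expert(d_in, d_out) for _ in range(n_experts)]
gate_W = rng.normal(scale=0.1, size=(d_in, n_experts))
x = rng.normal(size=d_in)
print(moe_forward(x, experts, gate_W))
```

Because only `top_k` of the experts run for each input, the compute per example stays roughly constant even as more experts (and thus more parameters) are added.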

Mixture-of-Experts: The Classic Approach