Artificial neural networks have emerged as the cornerstone of deep learning, offering a remarkable ability to extract valuable insights from vast amounts of data. However, the efficacy of these networks hinges heavily on their parameter count. Mixture-of-Experts (MoE) offers an efficient way to dramatically increase a model's capacity without a proportional increase in computational cost.

Originally proposed in 1991 by Robert A. Jacobs et al., MoE follows a conditional computation paradigm: only selected parts of an ensemble of sub-networks, referred to as experts, are activated depending on the input at hand. The MoE structure thus appeared long before the popularization of deep learning.
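To make the conditional-computation idea concrete, here is a minimal, illustrative sketch (not code from the original paper): a softmax gate scores the experts and only the top-scoring ones are evaluated for a given input. All names here (`Expert`, `moe_forward`, `top_k`) are assumptions chosen for illustration.

```python
# Minimal sketch of conditional computation with a gated mixture of experts.
# Illustrative only; hyperparameters and class names are assumed, not prescribed.
import numpy as np

rng = np.random.default_rng(0)

class Expert:
    """A tiny linear 'expert' sub-network."""
    def __init__(self, d_in, d_out):
        self.W = rng.normal(scale=0.1, size=(d_in, d_out))

    def __call__(self, x):
        return x @ self.W

def moe_forward(x, experts, gate_W, top_k=2):
    """Route input x to the top_k experts chosen by a softmax gate."""
    logits = x @ gate_W                      # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]      # indices of the selected experts
    # Only the selected experts are evaluated: conditional computation.
    return sum(probs[i] * experts[i](x) for i in chosen)

d_in, d_out, n_experts = 8, 4, 4
experts = [Expert(d_in, d_out) for _ in range(n_experts)]
gate_W = rng.normal(scale=0.1, size=(d_in, n_experts))
x = rng.normal(size=d_in)
print(moe_forward(x, experts, gate_W))
```

Because only `top_k` of the experts run for each input, the compute per example stays roughly constant even as more experts (and thus more parameters) are added.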

Mixture-of-Experts: The Classic Approach