Sparsely-Gated Mixture of Experts (MoE)


In transformer models, the attention block is typically followed by a feed-forward layer (FF), which is a simple fully-connected NN with a hidden layer and nonlinearity. Here's the code for such a block that uses ReLU:

import numpy as np

def feed_forward_relu(x, W1, W2):
    """Feed-forward layer with ReLU activation."""
    x = x @ W1            # hidden layer: (B, N, D) @ (D, DH) -> (B, N, DH)
    x = np.maximum(x, 0)  # ReLU nonlinearity
    return x @ W2         # output layer: (B, N, DH) @ (DH, D) -> (B, N, D)
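For concreteness, a quick way to exercise this function; the specific shapes below are illustrative (using the common 4x hidden dimension mentioned next), not values from the post:

import numpy as np

B, N, D = 2, 16, 512               # batch, sequence length, embedding depth
DH = 4 * D                         # hidden dimension, commonly 4x D

x = np.random.randn(B, N, D)
W1 = np.random.randn(D, DH) * 0.02
W2 = np.random.randn(DH, D) * 0.02

y = feed_forward_relu(x, W1, W2)
print(y.shape)                     # (2, 16, 512) -- same shape as the input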

This layer typically holds most of the weights in the transformer, because the hidden dimension (DH in this post, hidden_dim in some papers) is large - 4x the embedding depth D is common. Transformer blocks are repeated dozens of times in a model, so the total size of these layers becomes problematic. This is the problem the MoE architecture addresses: we increase the overall model size, but keep the computational cost in check by using only a portion of the parameters for each token.
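To make that concrete, here is a minimal sketch of a sparsely-gated MoE feed-forward layer in NumPy. It is illustrative rather than any particular model's implementation: the router weights W_gate, the choice of top-k routing with k=2, and the renormalization of the selected gate weights are common conventions assumed here, not details taken from this post.

import numpy as np

def moe_feed_forward(x, W_gate, experts, k=2):
    """Sparsely-gated MoE feed-forward layer (illustrative sketch).

    x:       input tokens, shape (N, D); batch and sequence flattened for simplicity
    W_gate:  router weights, shape (D, E), where E is the number of experts
    experts: list of (W1, W2) pairs, one per expert, shapes (D, DH) and (DH, D)
    k:       number of experts each token is routed to
    """
    N, D = x.shape
    E = len(experts)

    # Router: a linear layer followed by softmax gives per-token expert scores.
    logits = x @ W_gate                                  # (N, E)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)           # (N, E)

    # Keep only the top-k experts per token; renormalize their gate weights.
    topk_idx = np.argsort(-probs, axis=-1)[:, :k]        # (N, k)
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for e in range(E):
        # Find the tokens that selected expert e among their top-k.
        token_ids, slot = np.nonzero(topk_idx == e)
        if token_ids.size == 0:
            continue
        W1, W2 = experts[e]
        h = np.maximum(x[token_ids] @ W1, 0)             # expert's hidden layer + ReLU
        y = h @ W2                                       # expert's output layer
        out[token_ids] += topk_w[token_ids, slot][:, None] * y
    return out

Each token passes through only k of the E experts, so the per-token compute scales with k (roughly k ordinary FF layers), while the parameter count scales with E - which is exactly the trade-off described above.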
