Sparsely-Gated Mixture of Experts (MoE)
In transformer models, the attention block is typically followed by a feed-forward layer (FF), which is a simple fully-connected NN with a hidden layer and a nonlinearity. Here's the code for such a block that uses ReLU:

import numpy as np

def feed_forward_relu(x, W1, W2):
    """Feed-forward layer with ReLU activation."""
    # x: (B, N, D), W1: (D, DH), W2: (DH, D)
    x = np.maximum(x @ W1, 0)   # hidden layer followed by ReLU
    return x @ W2               # project back to embedding depth D
This layer typically holds most of the weights in the transformer, because the hidden dimension (DH in this post, hidden_dim in some papers) is large - 4x the embedding depth D is common. Transformer blocks are repeated dozens of times in a model, so the total size of these layers becomes problematic. This is the problem the MoE architecture addresses - we increase the overall model size, but keep the computational cost in check by using only a portion of the parameters for each token.
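As a rough illustration of the idea, here's a minimal sketch of a sparsely-gated MoE version of the FF layer above, assuming a simple top-k softmax router; the function name moe_relu, the router weights Wg and the stacked expert weights W1s/W2s are placeholders invented for this sketch, not names from a real implementation:

import numpy as np

def moe_relu(x, Wg, W1s, W2s, top_k=2):
    """Sketch of a sparsely-gated MoE feed-forward layer (illustrative only).

    x:   (B, N, D) input tokens
    Wg:  (D, E) router weights, where E is the number of experts
    W1s: (E, D, DH) per-expert hidden-layer weights
    W2s: (E, DH, D) per-expert output-layer weights
    """
    B, N, D = x.shape
    E = Wg.shape[1]
    tokens = x.reshape(-1, D)                           # (B*N, D)

    # Router: softmax over experts, then keep only the top_k experts per token.
    logits = tokens @ Wg                                # (B*N, E)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]   # (B*N, top_k)

    out = np.zeros_like(tokens)
    for e in range(E):
        mask = (topk_idx == e).any(axis=-1)             # tokens routed to expert e
        if not mask.any():
            continue
        h = np.maximum(tokens[mask] @ W1s[e], 0)        # the same FF-with-ReLU, per expert
        out[mask] += probs[mask, e:e+1] * (h @ W2s[e])  # weighted by the gate probability
    return out.reshape(B, N, D)

Each token passes through only its top_k experts, so the per-token compute stays close to that of top_k dense FF layers even as the total parameter count grows with E. Real implementations typically also renormalize the gate probabilities over just the selected experts; this sketch simply weights by the raw softmax outputs.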