What's better: wider neural nets with fewer layers, or thinner nets with more layers?
This post details my experiments on whether Transformers with many thin layers are better than Transformers with a few wide layers. I tested 5 different configurations and concluded that a balanced ratio between depth and width is the best config; in my experiments, 4 layers with an embd_dim of 1024 worked the best.
| Config | n_head | n_layer | embd_dim | Final Train loss | Final Val loss |
|--------|--------|---------|----------|------------------|----------------|
| 1 (Purple) | 2 | 1 | 2048 | 1.59 | 1.673 |
| 2 (Blue) | 2 | 4 | 1024 | 0.84 | 0.953 |
| 3 (Pink) | 2 | 16 | 512 | 0.95 | 1.103 |
| 4 (Orange) | 4 | 64 | 256 | 1.06 | 1.245 |
| 5 (Red) | 4 | 256 | 128 | 1.37 | 1.467 |

Since the weights of a Transformer block scale roughly as $12 \times \text{embd\_dim}^2$, we can mathematically represent the relation between embd_dim and n_layer at a fixed parameter budget as

$$\text{n\_layer} \times \text{embd\_dim}^2 \approx \text{constant}.$$

Testing five different setups, each with 50 million parameters, revealed that a model with four layers and an embedding dimension of 1024 (Config 2) had the lowest final validation loss. While deeper models can give more detailed feature representations, adding too many layers, as seen in Configs 4 and 5, leads to diminishing returns and higher computational costs without much improvement.
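To see why all five configurations land on the same parameter budget, here is a minimal back-of-the-envelope sketch. It assumes the count is dominated by the per-layer attention and MLP weights (roughly 12 × embd_dim² per layer) and ignores embeddings and LayerNorms; the config names simply mirror the table above.

```python
# Rough parameter count for a GPT-style Transformer:
# attention (Q, K, V, output projections) ~ 4*d^2, MLP (4x expansion) ~ 8*d^2,
# so ~12 * n_layer * embd_dim^2 total (embeddings and LayerNorms ignored).
configs = [
    {"name": "1 (Purple)", "n_layer": 1,   "embd_dim": 2048},
    {"name": "2 (Blue)",   "n_layer": 4,   "embd_dim": 1024},
    {"name": "3 (Pink)",   "n_layer": 16,  "embd_dim": 512},
    {"name": "4 (Orange)", "n_layer": 64,  "embd_dim": 256},
    {"name": "5 (Red)",    "n_layer": 256, "embd_dim": 128},
]

for cfg in configs:
    params = 12 * cfg["n_layer"] * cfg["embd_dim"] ** 2
    # Every config lands on ~50.3M: n_layer * embd_dim^2 is held constant.
    print(f'Config {cfg["name"]}: ~{params / 1e6:.1f}M parameters')
```

Running this prints roughly 50.3M for every row, which is what lets the comparison isolate the depth-versus-width trade-off rather than raw model size.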