
What's better: wider neural nets with fewer layers, or thinner ones with more layers?


This post details my experiments on whether Transformers with more, thinner layers are better than Transformers with fewer, wider layers. I tested 5 different configurations at a fixed parameter budget and found that neither extreme wins: the best results come from balancing depth against width, and in my experiments, 4 layers with an embd_dim of 1024 worked best.
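As a concrete illustration of that winning setup (this is a sketch, not code from the post), here is a minimal PyTorch snippet for a 4-layer, embd_dim-1024 transformer stack. torch.nn.TransformerEncoder stands in for the GPT-style model actually trained; the head count follows the results table below, and the 4x feed-forward expansion is an assumption.

```python
import torch.nn as nn

# Minimal sketch (not the author's code): the best configuration from the
# post, 4 transformer layers with an embedding dimension of 1024.
# n_head = 2 follows the results table; the 4x MLP expansion is an assumption.
embd_dim, n_layer, n_head = 1024, 4, 2

layer = nn.TransformerEncoderLayer(
    d_model=embd_dim,
    nhead=n_head,
    dim_feedforward=4 * embd_dim,  # standard 4x feed-forward expansion
    batch_first=True,
)
model = nn.TransformerEncoder(layer, num_layers=n_layer)

# Parameters in the transformer blocks alone (embeddings excluded):
# roughly 12 * n_layer * embd_dim^2, i.e. about 50M for this configuration.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~50.4M
```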

Config        n_head   n_layer   embd_dim   Final Train loss   Final Val loss
1 (Purple)         2         1       2048               1.59            1.673
2 (Blue)           2         4       1024               0.84            0.953
3 (Pink)           2        16        512               0.95            1.103
4 (Orange)         4        64        256               1.06            1.245
5 (Red)            4       256        128               1.37            1.467

We can mathematically represent the relation between embd_dim and n_layer as n_layer × embd_dim² ≈ constant (equivalently, embd_dim ∝ 1/√n_layer), which holds the parameter budget fixed across configurations. Testing five different setups, each with roughly 50 million parameters, revealed that the model with four layers and an embedding dimension of 1024 (Config 2) had the lowest final validation loss. While deeper models can learn more detailed feature representations, adding too many layers, as in Configs 4 and 5, brings diminishing returns and higher computational cost without much improvement.
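The constant-budget relation is easy to sanity-check with the common 12 × embd_dim² rule of thumb for a standard transformer block (attention plus a 4x MLP). The exact constant depends on the block design, so treat the sketch below as an approximation rather than the post's own accounting:

```python
# Rough parameter estimate: a standard transformer block has about
# 12 * embd_dim^2 weights, so total size scales as 12 * n_layer * embd_dim^2.
configs = [
    ("1 (Purple)",   1, 2048),
    ("2 (Blue)",     4, 1024),
    ("3 (Pink)",    16,  512),
    ("4 (Orange)",  64,  256),
    ("5 (Red)",    256,  128),
]

for name, n_layer, embd_dim in configs:
    budget = n_layer * embd_dim ** 2   # constant across configs: 4,194,304
    approx_params = 12 * budget        # ~50.3M for every row
    print(f"Config {name}: n_layer*embd_dim^2 = {budget:,}, "
          f"~{approx_params / 1e6:.1f}M params")
```

Every configuration lands on the same n_layer × embd_dim² product, which is what keeps the comparison a pure depth-versus-width trade-off at a fixed parameter count.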
