Walkthrough of a Minimal Vision Transformer (ViT)
In this post, I explain the vision transformer (ViT) architecture, which has found its way into computer vision as a powerful alternative to Convolutional Neural Networks (CNNs).
Rather than being trained from scratch on the target dataset, ViTs are typically pretrained on very large image datasets and then fine-tuned on downstream tasks, similar to GPT-style LLMs. More concretely, fine-tuning a ViT usually means replacing the pre-training classification head with a single linear layer (an MLP with no hidden layer, i.e. a $D \times K$ projection, where $D$ is the embedding dimension and $K$ is the number of downstream classes). More advanced fine-tuning, using larger pre-trained models, learning-rate schedules, and higher image resolutions, brings this up to what has been state-of-the-art performance since 2020: roughly 99.5% top-1 accuracy (as reported on CIFAR-10 in the original ViT paper).
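To make the head-swapping step concrete, here is a minimal sketch of what this looks like in PyTorch. It assumes a pretrained ViT module that exposes its classifier as a `head` attribute and a known embedding dimension; the names and dimensions are illustrative, not tied to a specific library.

```python
import torch.nn as nn


def prepare_for_finetuning(model: nn.Module, embed_dim: int, num_classes: int) -> nn.Module:
    """Replace a pretrained ViT's classification head for downstream fine-tuning.

    Assumes the model stores its classifier in `model.head` (hypothetical
    attribute name). The pre-training head is swapped for a single D x K
    linear layer, zero-initialized as in the original ViT recipe.
    """
    model.head = nn.Linear(embed_dim, num_classes)
    nn.init.zeros_(model.head.weight)
    nn.init.zeros_(model.head.bias)
    return model


# Example usage with illustrative sizes: D = 768 (ViT-Base), K = 10 downstream classes.
# model = prepare_for_finetuning(pretrained_vit, embed_dim=768, num_classes=10)
```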