Karpathy: Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24h, $672, in llm.c
In this post we are reproducing GPT-2 in llm.c.
This is"the GPT-2", the full, 1558M parameter version that was introduced in OpenAI's blog post Better Language Models and their Implications in February 14, 2019. llm.c does so directly in C/CUDA (total of ~5,000 lines of code), without the typical training stack that would involve the Python interpreter and a significantly more complex deep learning library like PyTorch/JAX, huggingface/transformers, or etc. If you are running out of memory, decrease this value, e.g. try 8, 4, 2, all the way down to 1 potentially.-t 1024 sets the maximum sequence length to 1024, as GPT-2 did-d 1048576 asks that the total batch size be 2 to the power 20, following the GPT-3 paper hyperparameters table. The goal of llm.c remains to have a simple, minimal, clean training stack for a full-featured LLM agent, in direct C/CUDA, and companion educational materials to bring many people up to speed in this awesome field.