TScale – Distributed training on consumer GPUs
Repository: Foreseerr/TScale on GitHub.
- Optimized transformer architecture with faster convergence and ~2x reduced attention costs
- Support for fp8 and int8 precision for model weights and activations
- Optimized for consumer NVIDIA GPUs, including fast reduced-precision training without sacrificing model quality
- CPU offload to reduce GPU memory requirements during training
- Synchronous distributed training across several identically configured hosts
- 1-bit gradient compression, allowing regular Ethernet links to serve as the interconnect (see the sketch after this list)
- Asynchronous distributed training on arbitrary hosts with negligible network traffic

By combining inexpensive GPUs with asynchronous distributed mode, TScale trains LLMs quickly and affordably. To train on multiple GPU devices, set the DEVICE_COUNT variable in the train script to the number of GPUs to use.
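The reason 1-bit gradient compression makes ordinary Ethernet viable is payload size: each fp32 gradient element shrinks to a single sign bit plus a per-tensor scale, roughly a 32x reduction in traffic per step. The sketch below illustrates the general technique (sign quantization with error feedback) in NumPy; it is not TScale's implementation, and all function and variable names in it are assumptions for illustration only.

    # Illustrative sketch of 1-bit gradient compression with error feedback.
    # Not TScale's code; names and the averaging scheme are assumptions.
    import numpy as np

    def compress(grad, error):
        """Quantize a gradient to 1 bit per element, carrying the
        quantization error forward so it is corrected on later steps."""
        corrected = grad + error            # add residual from previous step
        scale = np.mean(np.abs(corrected))  # one fp32 scale per tensor
        signs = np.sign(corrected)          # +1 / -1 per element
        new_error = corrected - scale * signs
        bits = np.packbits(signs > 0)       # ~32x smaller than fp32 gradients
        return bits, scale, new_error

    def decompress(bits, scale, n):
        signs = np.unpackbits(bits, count=n).astype(np.float32) * 2.0 - 1.0
        return scale * signs

    # Toy usage: two "workers" exchange compressed gradients over a slow link.
    rng = np.random.default_rng(0)
    n = 1 << 16
    error = [np.zeros(n, dtype=np.float32) for _ in range(2)]
    grads = [rng.standard_normal(n).astype(np.float32) for _ in range(2)]

    decoded = []
    for w in range(2):
        bits, scale, error[w] = compress(grads[w], error[w])
        decoded.append(decompress(bits, scale, n))

    avg = sum(decoded) / 2.0  # each worker applies the averaged update
    print("payload per worker:", bits.nbytes + 4, "bytes vs",
          grads[0].nbytes, "bytes raw")

The error-feedback term is what keeps such aggressive quantization from hurting convergence: whatever information the sign step throws away is re-added to the next gradient rather than lost.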