Infrastructure setup and open-source scripts to train a 70B model from bare metal
We would like to thank Voltage Park, Dell, H5, and NVIDIA for their invaluable partnership and help with setting up our cluster. A special…
While basic, we found it critical to ensure that launching training was reproducible and easily inspectable, especially since intermediate abstractions such as Docker image caching or opaque secrets configurations could muddy the waters. Using this method, we caught one issue where, due to a misconfiguration in the Python threading settings, certain hosts could not properly launch the eight multi-threaded NCCL GPU processes, which hit a race condition in pre-PyTorch initialization code.

A separate symptom we observed was good performance that was a bit “noisier” than usual (high-frequency white-noise variance between 90% and 100% of expected MFU). This, too, was InfiniBand hardware related, but typically caused by moderately degraded or flapping links higher up in the network rather than at the less redundant host-to-T2 layer.
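One way to make a launch reproducible and inspectable in the sense described above is to construct each per-GPU process's environment explicitly rather than inheriting implicit state from the shell. The sketch below is illustrative only and not the actual launch scripts: `build_launch_env` is a hypothetical helper, and the specific variable choices (pinning `OMP_NUM_THREADS`, enabling `NCCL_DEBUG`) are assumptions about how one might pin threading settings and surface NCCL's initialization logs.

```python
# Hypothetical sketch: build a fully explicit, deterministic environment
# for each of the eight per-GPU training processes, so every launch can
# be logged and diffed against a known-good configuration.

def build_launch_env(local_rank: int, world_size: int = 8) -> dict:
    if not 0 <= local_rank < world_size:
        raise ValueError(f"local_rank {local_rank} out of range")
    return {
        # Pin OpenMP threading explicitly so pre-initialization code is
        # not at the mercy of whatever the host environment sets.
        "OMP_NUM_THREADS": "1",
        # Surface NCCL's setup logs so a failed launch is inspectable.
        "NCCL_DEBUG": "INFO",
        # One GPU per process; rank determines the visible device.
        "CUDA_VISIBLE_DEVICES": str(local_rank),
        "LOCAL_RANK": str(local_rank),
        "WORLD_SIZE": str(world_size),
    }

if __name__ == "__main__":
    # Print the exact environment each of the eight processes would see.
    for rank in range(8):
        print(rank, build_launch_env(rank))
```

Because the environment is built from scratch per rank, a misbehaving host can be diagnosed by comparing its logged launch environment against a healthy one, rather than by reverse-engineering inherited shell state.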