Get the latest tech news
GPU utilization can be a misleading metric
Most ML teams use GPU Utilization as their main performance metric, but we found this can be quite misleading.
A quick refresher: MFUs, or Model FLOPS (Floating point Operations Per Second) utilization, is one of the best metrics to understand GPU performance, as introduced in Google’s PaLM paper. A better definition can (surprisingly) be found on Datadog’s NVML docs, "Percent of time over the past sample period during which one or more kernels was executing on the GPU.” To determine why this is misleading, we need a quick primer on how GPUs work. Now this was already sounding the alarm for us because naive softmax is a notorious bottleneck for LLMs, with many kernel fusions such as FlashAttention coming out to address its memory-bound nature.
Or read this on Hacker News