GPU utilization can be a misleading metric


Most ML teams use GPU Utilization as their main performance metric, but we found that it can be quite misleading.

A quick refresher: MFU, or Model FLOPS (Floating point Operations Per Second) Utilization, introduced in Google's PaLM paper, is one of the best metrics for understanding GPU performance. GPU Utilization, on the other hand, is defined much more loosely; a better definition can (surprisingly) be found in Datadog's NVML docs: "Percent of time over the past sample period during which one or more kernels was executing on the GPU." To see why this is misleading, we need a quick primer on how GPUs work.

Now this was already sounding the alarm for us, because naive softmax is a notorious bottleneck for LLMs, with kernel fusions such as FlashAttention coming out specifically to address its memory-bound nature.
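To make the contrast between the two metrics concrete, here is a minimal Python sketch. The pynvml calls are the real NVML bindings behind the counter quoted above; the estimate_mfu helper, the roughly-6-FLOPs-per-parameter-per-token approximation, and the example throughput and peak-FLOPS figures are illustrative assumptions, not numbers from the article.

import pynvml


def gpu_utilization_percent(device_index=0):
    """NVML's "GPU utilization": percent of the sample period during which
    at least one kernel was executing. It says nothing about how much of
    the chip's compute those kernels actually used."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    finally:
        pynvml.nvmlShutdown()


def estimate_mfu(tokens_per_second, n_params, peak_flops):
    """Rough MFU in the spirit of the PaLM paper: achieved FLOPS divided by
    the hardware's peak FLOPS, using the common ~6 FLOPs per parameter per
    token approximation for a combined forward and backward pass."""
    achieved_flops = 6 * n_params * tokens_per_second
    return achieved_flops / peak_flops


if __name__ == "__main__":
    # Hypothetical numbers: a 7B-parameter model training at 3,000 tokens/s
    # on a GPU with roughly 989 TFLOPS of peak dense BF16 compute.
    print(f"GPU utilization: {gpu_utilization_percent()}%")
    print(f"Estimated MFU:   {estimate_mfu(3_000, 7e9, 989e12):.1%}")

The gap between the two numbers is the whole point: the NVML counter can read 100% whenever any kernel is resident on the device, even a memory-bound one doing little arithmetic, while MFU only credits the floating point work actually delivered relative to the hardware's peak.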
