Dynolog: Open-Source System Observability (2022)
The scale and complexity of advanced AI models make it necessary to distribute AI training across multiple server nodes.
While there are existing solutions for monitoring (OpenTelemetry) as well as for profiling CPUs and GPUs, it is challenging to assemble them to get a holistic view of the system. Dynolog also manages counters for microarchitecture-specific performance events related to CPU caches, TLBs, and memory controllers on Intel and AMD CPUs. We are actively implementing new features, including support for Intel Processor Trace in conjunction with our contributions to LLVM, as well as memory latency and bandwidth monitoring.