eGPU: Extending eBPF Programmability and Observability to GPUs
DOI: https://doi.org/10.1145/3723851.3726984
HCDS '25: 4th Workshop on Heterogeneous Composable and Disaggregated Systems, Rotterdam, Netherlands, March 2025

Precise GPU observability and programmability are essential for optimizing performance in AI workloads and other computationally intensive high-performance computing (HPC) applications. In this paper, we introduce eGPU, the first framework and eBPF runtime that dynamically offloads eBPF bytecode onto GPUs via dynamic PTX injection.
We detail the design and implementation of eGPU, which integrates kernel-level and user-space eBPF instrumentation hooks, runtime PTX generation, and shared-memory synchronization, providing a seamless, low-overhead observability platform for modern HPC and AI workloads.

Finally, fleet-level resource attribution dashboards break down GPU hours or FLOPs usage by model, user, or product group, ensuring that optimization efforts target the largest consumers of compute time. Collectively, these workflows and tools from Meta's observability stack demonstrate how systematic performance monitoring, automated data analysis, and a layered telemetry architecture can enable large-scale AI system efficiency, aligning closely with related research on kernels, dynamic instrumentation, and just-in-time optimization strategies in data-intensive computing environments.
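
As a rough illustration of the fleet-level attribution described above, the following Python sketch sums GPU hours per model, user, or product group and lists the largest consumers first. The record fields (`model`, `user`, `group`, `gpus`, `hours`), the helper name `gpu_hours_by`, and the sample values are all hypothetical; nothing here is taken from Meta's stack or from the eGPU implementation.

```python
from collections import defaultdict

# Hypothetical job records; field names and values are assumptions
# made for illustration, not a real telemetry schema.
JOBS = [
    {"model": "ranker", "user": "alice", "group": "ads", "gpus": 8, "hours": 2.0},
    {"model": "ranker", "user": "bob", "group": "ads", "gpus": 4, "hours": 1.5},
    {"model": "llm", "user": "carol", "group": "search", "gpus": 16, "hours": 3.0},
]


def gpu_hours_by(jobs, key):
    """Sum GPU hours (GPU count x wall-clock hours) per value of `key`."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job[key]] += job["gpus"] * job["hours"]
    # Largest consumer first, so optimization effort can start at the top.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for dimension in ("model", "user", "group"):
        print(dimension, gpu_hours_by(JOBS, dimension))
```

The same grouping would apply to FLOPs if each record carried a FLOP count instead of GPU count and duration.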