In-person
1-4 April 2025
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is displayed in British Summer Time (BST) (UTC +1). The schedule is subject to change and session seating is available on a first-come, first-served basis.
Wednesday April 2, 2025 17:45 - 18:15 BST
Big training jobs and multi-host inference need a lot of nodes and accelerators. More nodes and accelerators mean more chances for failures. How can we be sure to have enough working GPUs for our job? How can we utilize the healthy portions of a 16x16 TPU cluster if one node fails? Simple node labels won’t cut it.

Dynamic Resource Allocation (DRA) is beta in Kubernetes 1.32. Usually, it’s used for managing individual devices on a node. But did you know that DRA supports modeling resources that are accessible across many nodes? This powerful abstraction can model clusters of nodes and devices. Combining it with the alpha partitionable device model in 1.33, we can correctly model complex multi-host, multi-accelerator topologies, and schedule workloads to them as a unit! This is a real game changer for AI/ML workloads on K8s.

Come learn about these current and upcoming technologies, and how the K8s community is applying them to massive compute clusters like the NVIDIA GB200 and ultra-powerful multi-host TPU slices.
Speakers
John Belamaric

Senior Staff Software Engineer, Google
John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation...
Yash Sonthalia

Staff Software Engineer, Google
Seven years of experience as a software engineer at Google. Tech Lead for TPUs/GPUs in GKE AI.
Level 1 | Hall Entrance S10 | Room A
  AI + ML
