The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.
Come take a behind-the-scenes look at NVIDIA’s large-scale GPU deployment. NVIDIA’s GPU Cloud has taken on the challenges of day-2 maintenance for 60,000+ GPUs in production, uncovering hard truths and surprising revelations along the way, from problems we didn’t even know existed to pushing the limits of device uptime. We’ve spent years experimenting, fine-tuning, and learning what works and what doesn’t.
As Kubernetes increases its support for allocating accelerators with Dynamic Resource Allocation (DRA), day-2 device management is becoming more important. We’ll speak about:
- Techniques we use to uncover device failures
- How we keep devices healthy
- How we remediate failures with operational transparency and without impacting running workloads
Natalie is a Senior Software Engineer at NVIDIA. She works on building software for cloud infrastructure powered by Kubernetes, KubeVirt and strong coffee.