Name: Scaling GPU Clusters Without Melting Down! - Alay Patel & Ryan Hallisey, NVIDIA
Start: 2025-04-02T11:15:00+0100
End: 2025-04-02T11:45:00+0100

In-person
1-4 April 2025

Wednesday April 2, 2025 11:15 - 11:45 BST

Level 1 | Hall Entrance S10 | Room A

As GPUs become more powerful, their capacity to handle concurrent workloads increases, presenting new scaling challenges for Kubernetes clusters. In this session, we will share insights and strategies from NVIDIA’s experience right-sizing a Kubernetes control plane, while scaling up to meet business demand.

We will demonstrate how we measure control-plane resource consumption and share the techniques and configuration parameters that improved control-plane performance and scalability, such as Go tunables, the goaway-chance parameter in kube-apiserver, and some scheduler configurations. We will also cover an often-overlooked factor: the volume of YAML per API call. Finally, we will share how we use simulation tools like KWOK (Kubernetes WithOut Kubelet) to evaluate the control-plane scalability and performance of new Kubernetes features, like DRA (Dynamic Resource Allocation), before we roll them out in production.
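
As a rough illustration of two of the factors named above (not the speakers' tooling or numbers), the following Python sketch estimates how many bytes an object contributes per API call and how often a connection would be recycled for a given goaway-chance value. The helper names, the sample manifest, and the request rate are all hypothetical. The 0.02 upper bound and the 0.001 starting point reflect the documented range of the kube-apiserver `--goaway-chance` flag.

```python
import json


def payload_bytes(obj: dict) -> int:
    """Size in bytes of an object serialized as compact JSON (hypothetical helper)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def bytes_per_second(obj: dict, requests_per_second: float) -> float:
    """Serialized volume if the object is sent at the given request rate."""
    return payload_bytes(obj) * requests_per_second


def expected_requests_until_goaway(chance: float) -> float:
    """Mean number of requests on a connection up to and including the one
    that receives a GOAWAY, treating each request as an independent trial."""
    if not 0 < chance <= 0.02:  # kube-apiserver caps --goaway-chance at 0.02
        raise ValueError("chance must be in (0, 0.02]")
    return 1.0 / chance


if __name__ == "__main__":
    # Hypothetical pod manifest padded with environment variables.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "example"},
        "spec": {
            "containers": [{
                "name": "app",
                "image": "example.invalid/app:latest",
                "env": [{"name": f"VAR_{i}", "value": "x" * 64} for i in range(200)],
            }]
        },
    }
    print("bytes per call:", payload_bytes(pod))
    print("bytes per second at 500 req/s:", bytes_per_second(pod, 500))
    print("mean requests per connection at 0.001:", expected_requests_until_goaway(0.001))
```

The point of the sketch is only that per-call payload size multiplies with request rate, so trimming large manifests can matter as much as tuning server flags.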

Speakers

Ryan Hallisey

Software Engineer, NVIDIA

Ryan is a software engineer at NVIDIA. He works on building data centers powered by Kubernetes and KubeVirt for NVIDIA products.

Alay Patel

Senior Software Engineer, NVIDIA

Alay is a Senior Software Engineer at NVIDIA, where he works on a cloud gaming service, managing infrastructure for GPU workloads. He is passionate about open source, with a focus on Kubernetes and platform engineering.

AI + ML

Content Experience Level: Beginner

KubeCon + CloudNativeCon Europe 2025
