In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Wednesday April 2, 2025 17:00 - 17:30 BST
Large language models are growing in popularity, and training performance on Kubernetes at scale has become one of the biggest challenges for enterprises. How can you achieve optimal performance and linearity for a huge training job, such as one running on 100k GPUs? What are the three most critical factors that affect performance? How can performance be optimized step by step?

In this talk we will present an end-to-end analysis of the bottlenecks of LLM training on Kubernetes at scale, and then show how insufficient resource management and network topology awareness in Kubernetes affect performance. Finally, we will introduce the new resource management model, LLM-dedicated training workload, and scheduling solution that were initiated in the Volcano open source community, and demonstrate how to use them to get optimal performance and linearity.
Speakers
Peng Gu

Software Architect, Tech Startup
Peng Gu holds a PhD degree in Computer Engineering from the University of Central Florida, specializing in high-performance computing. As a tech lead and cloud software architect at an AI infrastructure startup, he designs scalable, cutting-edge solutions to support highly demanding...
Klaus Ma

Senior Software Manager, NVIDIA
Team leader, system architect, designer, and software developer with 10+ years of experience across a variety of industries and technology bases, including cloud computing, machine learning, big data, and financial services. Founding Volcano & kube-batch, Kubernetes SIG-Scheduling co-Leader...
Level 1 | Hall Entrance S10 | Room A
  AI + ML