Loading…
In-person
1-4 April 2025
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Wednesday April 2, 2025 17:00 - 17:30 BST
Large Language Models are increasing in popularity and the training performance in Kubernetes at scale has become the biggest challenges for enterprises. How to achieve the optimal performance and linearity for a huge training job, such as 100k GPUs? What are the three most critical factors that affect performance? How to optimize performance step by step?

In this talk we will present an end to end analysis of the bottleneck of LLM training in Kubernetes at scale. And then show how the insufficient resource management and network topology awareness in Kubernetes affect the performance. Finally we will introduce the new resource management model, LLM dedicated training workload and scheduling solution which are initiated in the Volcano open source community and demonstrate how to use it to get optimal performance and linearity.
Speakers
avatar for Peng Gu

Peng Gu

Software Architect, Tech Starup
Peng Gu holds a PhD degree in Computer Engineering from the University of Central Florida, specializing in high-performance computing. As a tech lead and cloud software architect at an AI infrastructure startup, he designs scalable, cutting-edge solutions to support highly demanding... Read More →
avatar for William Wang (Leibo Wang)

William Wang (Leibo Wang)

Senior software engineer, Nvidia
Cloud native architect, open-source enthusiast, technical lead and maintainer of CNCF Volcano, software developer with a decade of experience in diverse domains including cloud native technology, large-scale cluster resource management, batch scheduling, BigData, and AI acceleration... Read More →
Wednesday April 2, 2025 17:00 - 17:30 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link