In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Thursday April 3, 2025 16:45 - 17:15 BST
We explore the challenges of building and running a large-scale AI/ML cluster in the cloud that can handle high-performance ML training jobs. We will cover the benefits of using a container orchestration platform like Kubernetes for managing AI/ML workloads and how Slurm can be used to schedule and manage jobs on a cluster. We will also dive into cluster health management and meeting performance expectations.

We share lessons from building a state-of-the-art 12K-GPU HPC cluster with high-performance storage systems and an InfiniBand network fabric, hosting workloads that range from tens to thousands of GPUs and run for days to weeks.

We highlight the importance of health checks and telemetry in understanding and reacting to the various failure modes experienced in HPC clusters, and how to mitigate their impact on AI training jobs.

Finally, we share insights, pitfalls, and best practices from operating the cluster for more than six months.
Speakers
Kalyan Saladi

Software Engineer, Meta Platforms Inc.
Kalyan is a software engineering lead at Meta Platforms in the research org (FAIR). He has built and operated multiple large AI clusters, both on bare metal and in the cloud. He supported several leading large model training efforts in FAIR over the years, including LLAMA-2. Kalyan...
Chandan Avdhut

Production Engineer, Meta Platforms Inc.
As a seasoned Production Engineer with a strong background in Kubernetes, public cloud infrastructure, and large-scale AI/ML clusters, I bring a unique blend of technical expertise and real-world experience to the table. With a proven track record of designing and operating complex...
Level 1 | Hall Entrance S10 | Room B
  AI + ML
