Loading…
In-person
1-4 April 2025
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Friday April 4, 2025 14:30 - 15:00 BST
Message Passing Interface (MPI) is a foundational technology in distributed computing essential for ML frameworks like MLX, DeepSpeed, and NVIDIA NeMo. It powers efficient communication for large-scale AI workloads using high-speed interconnects via InfiniBand. However, running MPI on Kubernetes presents challenges, such as ensuring high-throughput pod-to-pod communication, managing MPI Job initialization in containerized environments, and supporting diverse MPI implementations, including OpenMPI, IntelMPI, and MPICH.

This talk will introduce the Kubeflow MPI Runtime integrated with Kubeflow TrainJob, featuring distributed training with MLX and LLMs fine-tuning with DeepSpeed on Kubernetes. Speakers will highlight SSH-based optimization to boost MPI performance. Attendees will discover how this solution simplifies, scales, and optimizes AI workloads while addressing key challenges and combining MPI's efficiency with Kubernetes' orchestration power.
Speakers
avatar for Andrey Velichkevich

Andrey Velichkevich

Senior Software Engineer, Apple
Andrey Velichkevich is a Senior Software Engineer at Apple and is a key contributor to the Kubeflow open-source project. He is a member of Kubeflow Steering Committee and a co-chair of Kubeflow AutoML and Training WG. Additionally, Andrey is an active member of the CNCF WG AI. He... Read More →
avatar for Yuki Iwai

Yuki Iwai

Software Engineer, CyberAgent, inc
Yuki is a Software Engineer at CyberAgent, Inc. He works on the internal platform for machine-learning applications and high-performance computing. He is currently a Technical Lead for Kubeflow WG AutoML / Training. He is also a Kubernetes WG Batch active member, Job API reviewer... Read More →
Friday April 4, 2025 14:30 - 15:00 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link