In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: Times in this schedule are shown in British Summer Time (BST, UTC+1). The schedule is subject to change, and session seating is available on a first-come, first-served basis.
Wednesday April 2, 2025 14:30 - 15:00 BST
As large-model technology has matured, industry-leading models can now be trained at a scale of up to 100,000 GPUs. That scale often exceeds the capacity of a single Kubernetes (K8s) cluster, so one feasible solution is joint training across multiple K8s clusters.
Multi-cluster joint training raises two key challenges: adapting training tasks written for a single K8s cluster to run in a multi-cluster environment, and keeping training parameters and checkpoint data synchronized and efficiently transmitted across clusters.
In this presentation, we will share China Mobile’s practical experience with parallel training on cross-region multi-K8s clusters, using over 10,000 GPUs with Kubeflow’s Training Operator and VolcanoJob, with no modifications required. We will also introduce optimized methods for accelerating cross-region data synchronization during training.
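The capacity argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the talk: it assumes 8 GPUs per node (a hypothetical figure) and uses the 5,000-node upper bound that the Kubernetes documentation gives for a single cluster. The helper name `min_clusters` is likewise made up for this example.

```python
# Assumption: 5,000 nodes is the documented upper bound for one Kubernetes cluster.
MAX_NODES_PER_CLUSTER = 5000


def min_clusters(total_gpus: int, gpus_per_node: int,
                 max_nodes: int = MAX_NODES_PER_CLUSTER) -> int:
    """Return the fewest clusters needed to host total_gpus by node count alone."""
    if total_gpus < 0 or gpus_per_node <= 0 or max_nodes <= 0:
        raise ValueError("GPU and node counts must be positive")
    # Integer ceiling division avoids floating-point rounding.
    nodes_needed = -(-total_gpus // gpus_per_node)
    return max(1, -(-nodes_needed // max_nodes))


if __name__ == "__main__":
    # gpus_per_node=8 is an assumed hardware shape, not a figure from the talk.
    for gpus in (10_000, 100_000):
        print(f"{gpus} GPUs -> at least {min_clusters(gpus, gpus_per_node=8)} cluster(s)")
```

Under these assumptions, 100,000 GPUs need 12,500 nodes and therefore at least three clusters, while 10,000 GPUs would fit in one. Node count is only one limit; a deployment may also be split across clusters for other reasons, such as hardware being located in different regions.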
Speakers
Rongrong Wu

China Mobile Cloud
Meng Duan

Senior Software Engineer, China Mobile Cloud
I work as a software engineer in the Cloud Native team at China Mobile Cloud, where I participate in the architectural design of China Mobile Cloud's Cloud Native infrastructure. Throughout my career, I have made contributions to the CNCF open-source community and have held positions...
Yongxi Zhang

Senior Software Engineer, China Mobile (Suzhou) Software Technology Co., Ltd.
I am a Software Engineer in the Cloud Native team at Ecloud. I work on multi-cluster Kubernetes within the Multi-cluster Kubernetes project. Throughout my career, I have made some contributions to the open-source community. In particular, I have contributed to Clusterpedia, a renowned...
Level 1 | Hall Entrance S10 | Room A
  AI + ML
  • Content Experience Level Any
