In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: Times in this schedule are shown in British Summer Time (BST, UTC+1). The schedule is subject to change, and session seating is available on a first-come, first-served basis.
Friday April 4, 2025 14:30 - 15:00 BST
While model checkpointing at the application framework level provides basic failure recovery for AI/ML training, it burdens developers with complex configuration requirements. As the scale of production workloads increases, infrastructure-level checkpointing using Checkpoint/Restore in Userspace (CRIU) can provide fault tolerance and live migration transparently to the end user. Using a Kubernetes operator, we will demonstrate how to checkpoint and restore distributed ML workloads, showcasing novel extensions across CRIU, CRI-O, and cuda-checkpoint.

Our talk focuses on implementing synchronization mechanisms that let JobSets running stateful workloads be checkpointed in unison while minimizing interruption overhead. The presentation explores how this infrastructure-level approach accelerates both recovery and workload reprioritization. Key topics include network state handling in distributed training and GPU memory checkpoint management, highlighting benefits for stateful applications that require higher resiliency.
Speakers
Bernie Wu

VP Technology Partnerships, MemVerge
Bernie is VP of Technology Partnerships and leads the Kubernetes, AI/ML, and CXL Memory initiatives for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including Conner/Seagate, Cheyenne Software, Trend...
Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, where he leads the GPU workload experience and error handling on this Kubernetes platform. He collaborates with partners in the ecosystem to support operator models for machine learning workloads...
Level 1 | Hall Entrance S10 | Room B
  Emerging + Advanced
