In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Friday April 4, 2025 14:30 - 15:00 BST
While model checkpointing at the application framework level provides basic failure recovery for AI/ML training, it is not sufficient to address scheduling and resilience issues efficiently and in an application-agnostic manner. As the scale of production workloads increases, infra-level checkpointing using Checkpoint/Restore in Userspace (CRIU) can potentially be used to address these issues. We will demonstrate a Kubernetes operator that checkpoints and hot-restarts distributed ML workloads, leveraging CRIU, CRI-O, and cuda-checkpoint.

Our talk includes an example of the synchronization mechanisms that allow JobSets running stateful workloads to be checkpointed and hot-restarted as part of a node maintenance scenario.

Key topics include use cases and limitations of using checkpoint/restore for stateful ML applications at the platform layer, an overview of how it works, and next steps needed to productionize this emerging technology.
Speakers
Bernie Wu

VP Technology Partnerships, MemVerge
Bernie is VP of Technology Partnerships and leads the Kubernetes, AI/ML, and CXL Memory initiatives for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including Conner/Seagate, Cheyenne Software, Trend...
Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, and is the lead for the GPU workload experience and error handling on this Kubernetes platform. He collaborates with partners in the ecosystem to support operator models for machine learning workloads...
Level 1 | Hall Entrance S10 | Room B
  Emerging + Advanced