In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Thursday April 3, 2025 17:30 - 18:00 BST
As long-running AI/ML workloads become more common in cloud-native environments, the need for efficient checkpointing mechanisms to provide fault tolerance becomes increasingly important. However, current state-of-the-art techniques for transparent GPU checkpointing rely on intercepting and logging device API calls (e.g., CUDA runtime) as well as capturing input data and object handles (e.g., events, streams). This approach inevitably introduces steady-state overhead and requires replaying the entire recorded execution, potentially with nondeterministic operations, to recover from failures.

This talk will cover how the Kubernetes container checkpointing functionality has been extended with recently introduced CRIU plugins to enable transparent checkpoint/restore of GPU computations without the overhead of API interception, logging, or re-execution. This talk will also discuss how these mechanisms can be utilized to improve resource utilization in large-scale GPU clusters.
Speakers
Adrian Reber

Senior Principal Software Engineer, Red Hat
Adrian is a Senior Principal Software Engineer at Red Hat and has been migrating processes since at least 2010. He started migrating processes in a high-performance computing environment, and at some point he had migrated so many processes that he got a PhD for it. Most of the time he is...
Radostin Stoyanov

PhD Student, University of Oxford
Radostin Stoyanov is a PhD student in the Scientific Computing research group at the University of Oxford and a Software Engineer on the Core Kernel Team at Red Hat. His research focuses on improving the resilience and performance of HPC and cloud computing systems.
Level 1 | Hall Entrance S10 | Room B
AI + ML
