Loading…
In-person
1-4 April 2025
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Wednesday April 2, 2025 16:15 - 17:30 BST
With GPUs being scarce and costly, multi-tenant Kubernetes clusters that can queue and prioritize complex, heterogeneous AI/ML workloads while achieving both high utilization and fair sharing, are a necessity for many organizations. This tutorial will teach the audience how to build, operate, and use an AI cluster. Starting from either a managed or on-premise Kubernetes cluster, we will demonstrate how to install and configure a number of open source projects (and only open source projects) such as Kueue, Kubeflow, PyTorch, Ray, vLLM, and Autopilot to support the full AI model lifecycle (from data preprocessing to LLM training and inference), configure teams and quotas, monitor GPUs, and to a large degree automate fault detection and recovery. By the end of the tutorial the participants will have a thorough understanding of the AI software stack refined by IBM Research over several years to effectively manage and utilize thousands of GPUs. Come to learn the recipe and try it at home!
Speakers
avatar for David Grove

David Grove

Distinguished Research Scientist, IBM Research
David Grove is a Distinguished Research Scientist at IBM T.J. Watson, NY, USA. He has been a software systems researcher at IBM since 1998, specializing in programming language implementation and scalable runtime systems. His current research focuses on cloud-related technologies... Read More →
avatar for Olivier Tardieu

Olivier Tardieu

Principal Research Scientist, Manager, IBM Research
Dr. Olivier Tardieu is a Principal Research Scientist and Manager at IBM T.J. Watson, NY, USA. He joined IBM Research in 2007. His current research focuses on cloud-related technologies, including Serverless Computing and Kubernetes, as well as their application to Machine Learning... Read More →
avatar for Claudia Misale

Claudia Misale

Staff Research Scientist, IBM Research
Claudia Misale is a Staff Research Scientist in the Hybrid Cloud Infrastructure Software group at IBM T.J. Watson Research Center (NY). Her research is focused on Kubernetes and targets monitoring, observability and scheduling for HPC and AI training workloads. She is mainly interested... Read More →
Wednesday April 2, 2025 16:15 - 17:30 BST
Level 1 | Hall Entrance N11
  Tutorials, AI + ML

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link