In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Venue: Level 1 | Hall Entrance S10 | Room A
Wednesday, April 2
 

11:15 BST

Scaling GPU Clusters Without Melting Down! - Alay Patel & Ryan Hallisey, NVIDIA
Wednesday April 2, 2025 11:15 - 11:45 BST
As GPUs become more powerful, their capacity to handle concurrent workloads increases, presenting new scaling challenges for Kubernetes clusters. In this session, we will share insights and strategies from NVIDIA’s experience right-sizing a Kubernetes control plane, while scaling up to meet business demand.

We will demonstrate how we measure control plane resource consumption and share the techniques and configuration parameters that improved control-plane performance and scalability, such as changing Go runtime tunables, the goaway-chance parameter in kube-apiserver, and some scheduler configurations. We will also cover an often-overlooked factor: the volume of YAML per API call. Finally, we will share how we use simulation tools like KWOK (Kubernetes WithOut Kubelet) to measure new Kubernetes features, such as DRA (Dynamic Resource Allocation), for control-plane scalability and performance before rolling them out in production.
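As a concrete illustration of one of these parameters: the goaway-chance flag makes kube-apiserver randomly send HTTP/2 GOAWAY frames, so long-lived client connections periodically reconnect through the load balancer and spread across replicas. A toy Python simulation of that rebalancing effect (the replica count, client count, and probability are illustrative values, not figures from the talk):

```python
import random

random.seed(7)

REPLICAS = 3
CLIENTS = 300
GOAWAY_CHANCE = 0.01   # illustrative; mirrors the idea of --goaway-chance
REQUESTS = 2000

# Worst case after a rolling restart: every client piled onto replica 0.
connection = [0] * CLIENTS

for _ in range(REQUESTS):
    for c in range(CLIENTS):
        # On each request, the server sends GOAWAY with small probability;
        # the client then reconnects through the load balancer to a
        # randomly chosen replica.
        if random.random() < GOAWAY_CHANCE:
            connection[c] = random.randrange(REPLICAS)

counts = [connection.count(r) for r in range(REPLICAS)]
print(counts)  # roughly even, around 100 connections per replica
```

With the flag unset, all 300 connections would stay pinned to replica 0 after a rolling restart; even a 1% GOAWAY chance drains that hotspot within a few thousand requests.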
Speakers

Ryan Hallisey

Software Engineer, NVIDIA
Ryan is a software engineer at NVIDIA. He works on building data centers powered by Kubernetes and KubeVirt for NVIDIA products.

Alay Patel

Senior Software Engineer, Nvidia
Alay is a Senior Software Engineer at Nvidia, where he works on a cloud gaming service, managing infrastructure for GPU workloads. He is passionate about open source, with a focus on Kubernetes and platform engineering.
Wednesday April 2, 2025 11:15 - 11:45 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

12:00 BST

Slinky: Slurm in Kubernetes, Performant AI and HPC Workload Management in Kubernetes - Marlow Warnicke (Weston) & Tim Wickberg, SchedMD
Wednesday April 2, 2025 12:00 - 12:30 BST
Kubernetes was designed for microservices. With AI rapidly advancing, Kubernetes must adapt to support both AI training and multi-node inference. It needs to improve not only at scheduling these workloads within the cluster, but also at fine-grained resource assignment on the nodes.

High Performance Computing (HPC) systems use workload managers such as Slurm. Slurm, the most widely used HPC workload manager with over two decades of development, excels at gang scheduling, fair usage, job planning, and batch scheduling.
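Gang scheduling, one of the Slurm strengths named above, means a multi-node job starts only when all of its tasks can start together; there is no partial placement. A minimal Python sketch of that admission rule (the job format and function name are illustrative, not Slurm's or Slinky's API):

```python
def try_admit(free_nodes, jobs):
    """Admit each job all-or-nothing; jobs that do not fit wait,
    though later, smaller jobs may backfill around them."""
    admitted = []
    for job in jobs:
        if job["nodes"] <= free_nodes:   # every task can start together
            free_nodes -= job["nodes"]
            admitted.append(job["name"])
        # else: the whole job waits; none of its tasks are placed
    return admitted, free_nodes

admitted, remaining = try_admit(4, [
    {"name": "train-a", "nodes": 3},
    {"name": "train-b", "nodes": 2},  # only 1 node left, so it waits
    {"name": "eval-c",  "nodes": 1},  # backfills into the last free node
])
print(admitted, remaining)  # ['train-a', 'eval-c'] 0
```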

We will show the current state of Slinky, a fully open-source toolset designed to integrate Slurm with Kubernetes and to solve the difficulty of making AI clusters run more performantly and efficiently. Slinky includes a Slurm operator, a Slurm client library, and a metrics exporter. Here, we will outline our architecture and discuss the challenges of achieving the fine-grained control needed in Kubernetes for full AI and HPC workload functionality.
Speakers

Tim Wickberg

CTO, SchedMD LLC
Tim Wickberg is the Chief Technology Officer of SchedMD, and is responsible for the technical direction and development of the open-source Slurm Workload Manager.

Marlow Warnicke (Weston)

Principal Cloud Architect, SchedMD
Marlow is a Principal Cloud Engineer working on scheduling at SchedMD. She is also a chair of the CNCF Environmental Sustainability TAG. Marlow has expertise in resource management, the AI/ML Kubernetes cloud compute ecosystem, embedded systems, high-performance compute system tools...
Wednesday April 2, 2025 12:00 - 12:30 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML
  • Content Experience Level Any

14:30 BST

Scaling To Thousands of GPUs With Ease: Multi-Region Large Model Training on Kubernetes - Yongxi Zhang, Meng Duan & Rongrong Wu, China Mobile
Wednesday April 2, 2025 14:30 - 15:00 BST
With the development of large model technology, industry-leading large models now have the capability to train at a scale of up to 100,000 GPUs. This scale often exceeds the capacity limits of a single K8s cluster. A feasible solution is to adopt a multi-K8s cluster joint training approach.
To achieve multi-K8s cluster joint training, two key challenges need to be addressed: adapting single K8s cluster training tasks to run in a multi-K8s cluster environment, and ensuring the synchronization and efficient transmission of training parameters and checkpoint data across clusters.
In this presentation, we will share China Mobile’s practical experience in achieving parallel training on cross-region multi-K8s clusters, utilizing over 10,000 GPUs with Kubeflow’s Training Operator and VolcanoJob, with no modifications required. Additionally, we will introduce optimized methods to accelerate cross-region data synchronization during training.
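The need for optimized cross-region synchronization is easy to motivate with a back-of-the-envelope transfer-time estimate. A hedged Python sketch (the checkpoint size and link speed are invented for illustration, not China Mobile's figures):

```python
def transfer_seconds(checkpoint_bytes, link_gbps):
    """Ideal transfer time for one checkpoint over a dedicated link,
    ignoring protocol overhead and congestion."""
    return checkpoint_bytes * 8 / (link_gbps * 1e9)

# A hypothetical 1 TB checkpoint over a 10 Gbps inter-region link:
t = transfer_seconds(1e12, 10)
print(f"{t:.0f} s (~{t / 60:.1f} min)")  # 800 s (~13.3 min)
```

At that rate, every checkpoint interval pays minutes of wall-clock time for synchronization alone, which is why techniques such as compression, deduplication, and parallel transfer matter at this scale.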
Speakers

Rongrong Wu

China Mobile Cloud

Meng Duan

Senior Software Engineer, China Mobile Cloud
I work as a software engineer in the Cloud Native team at China Mobile Cloud, participating in the architectural design of the cloud-native infrastructure for China Mobile Cloud. Throughout my career, I have made contributions to the CNCF open-source community and have held positions...

Yongxi Zhang

Senior Software Engineer, China Mobile (Suzhou) Software Technology Co., Ltd.
I am a Software Engineer in the Cloud Native team at Ecloud, where I work on multi-cluster Kubernetes. Throughout my career, I have made contributions to the open-source community. In particular, I have contributed to Clusterpedia, a renowned...
Wednesday April 2, 2025 14:30 - 15:00 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML
  • Content Experience Level Any

15:15 BST

Production-Ready LLMs on Kubernetes: Patterns, Pitfalls, and Performance - Priya Samuel, Elsevier & Luke Marsden, MLOps Consulting
Wednesday April 2, 2025 15:15 - 15:45 BST
Many orgs are evaluating running open source LLMs on their own infrastructure, and Kubernetes is a natural platform choice. However, running open source LLMs in production on Kubernetes is, honestly, a bit of an undocumented mess.

This technical presentation shares the experience of both speakers in deploying production-grade LLM infrastructure on Kubernetes. Through practical demonstrations, we'll explore the complete deployment lifecycle, from GPU setup to optimization techniques like Flash Attention, quantization tradeoffs and GPU sharing.

You'll learn:

* Architectural patterns for efficient LLM deployment using Ollama and vLLM
* Solutions for model weight management and context length optimization
* Techniques for GPU sharing and improving resource utilization
* Production approaches to fine-tuning with Axolotl and serving multiple models with LoRAX

You'll leave with a complete blueprint for building reliable, scalable LLM infrastructure on Kubernetes.
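To give a flavor of the quantization tradeoffs listed above: weight memory scales linearly with bits per parameter. A rough Python estimate (treating 1 GB as 1e9 bytes and counting weights only; real serving adds KV-cache and activation overhead on top):

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory needed just to hold model weights."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):  # fp16, int8, int4
    print(f"7B model @ {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
# 7B model @ 16-bit: 14.0 GB
# 7B model @ 8-bit: 7.0 GB
# 7B model @ 4-bit: 3.5 GB
```

This is the arithmetic behind GPU sharing decisions: an int4-quantized 7B model fits several-to-a-GPU on common 24 GB cards, where the fp16 version barely fits once.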
Speakers

Priya Samuel

Full Stack Engineer and Software Architect, Elsevier
Priya Samuel is a seasoned technology leader with a passion for transforming complex challenges into actionable solutions, with extensive expertise in DevOps, cloud-native technologies, and Identity and Access Management (IAM). Priya has helped organizations scale their data and...

Luke Marsden

Founder, MLOps Consulting
Technical leader and startup founder who participated in the early development of Docker and Kubernetes. Former SIG lead for SIG-cluster-lifecycle.
Wednesday April 2, 2025 15:15 - 15:45 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

16:15 BST

Orchestrating AI Models in Kubernetes: Deploying Ollama as a Native Container Runtime - Samuel Veloso, Cast AI & Lucas Fernández, Red Hat
Wednesday April 2, 2025 16:15 - 16:45 BST
Existing solutions for serving AI models in Kubernetes are often difficult to deploy and manage with complex workflows and a lack of user-friendly design. This talk introduces a custom container runtime that leverages Ollama as the serving backend, simplifying the deployment and operation of AI models in Kubernetes environments.

A custom container runtime extends the standard container execution workflow by integrating additional capabilities directly into the container lifecycle. Solutions like gVisor and Kata Containers are prominent examples, leveraging this technology to enhance container security by isolating workloads or providing lightweight virtualized environments. In our case, we apply the same principle to AI model serving, enabling native deployment of open-source AI models within Kubernetes.
Speakers

Samuel Veloso

Software Engineer, Cast AI
Samu Veloso is a Software Engineer at Cast AI where he contributes to the future of Kubernetes security.

Lucas Fernández

Senior Software Engineer, Red Hat
I'm a technology fan and I love to explore as many fields as I can, such as development, cyber-security, or artificial intelligence. You can see what I am up to on lucferbux.dev. Feel free to contact me on my LinkedIn.
Wednesday April 2, 2025 16:15 - 16:45 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

17:00 BST

Optimizing Training Performance for Large Language Models (LLMs) in Kubernetes - William Wang, Huawei Cloud Technologies Co., LTD & Peng Gu, Tech Startup
Wednesday April 2, 2025 17:00 - 17:30 BST
Large Language Models are increasing in popularity, and training performance in Kubernetes at scale has become one of the biggest challenges for enterprises. How do you achieve optimal performance and linearity for a huge training job, such as one spanning 100k GPUs? What are the three most critical factors that affect performance? How do you optimize performance step by step?

In this talk we will present an end-to-end analysis of the bottlenecks of LLM training in Kubernetes at scale, and then show how insufficient resource management and network topology awareness in Kubernetes affect performance. Finally, we will introduce the new resource management model, LLM-dedicated training workload, and scheduling solution initiated in the Volcano open source community, and demonstrate how to use them to get optimal performance and linearity.
Speakers

Peng Gu

Software Architect, Tech Startup
Peng Gu holds a PhD degree in Computer Engineering from the University of Central Florida, specializing in high-performance computing. As a tech lead and cloud software architect at an AI infrastructure startup, he designs scalable, cutting-edge solutions to support highly demanding...

William Wang (Leibo Wang)

Senior Software Engineer, Nvidia
Cloud native architect, open-source enthusiast, technical lead and maintainer of CNCF Volcano, software developer with a decade of experience in diverse domains including cloud native technology, large-scale cluster resource management, batch scheduling, BigData, and AI acceleration...
Wednesday April 2, 2025 17:00 - 17:30 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

17:45 BST

More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling With Dynamic Resource Allocation - John Belamaric & Yash Sonthalia, Google
Wednesday April 2, 2025 17:45 - 18:15 BST
Big training jobs and multi-host inference need a lot of nodes and accelerators. More nodes and accelerators mean more chances for failures. How can we be sure to have enough working GPUs for our job? How can we utilize the healthy portions of a 16x16 TPU cluster if one node fails? Simple node labels won’t cut it.

DRA is beta in Kubernetes 1.32. Usually, it’s used for managing individual devices on a node. But did you know that DRA supports modeling resources that are accessible across many nodes? This powerful abstraction can model clusters of nodes and devices. Combining it with the alpha partitionable device model in 1.33, we can correctly model complex multi-host, multi-accelerator topologies, and schedule workloads to them as a unit! This is a real game changer for AI/ML workloads on K8s.

Come learn about these current and upcoming technologies, and how the K8s community is applying them to massive compute clusters like the NVIDIA GB200 and ultra powerful multi-host TPU slices.
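The 16x16 question above is, at its core, a search for the largest all-healthy rectangular sub-slice of an accelerator grid. A brute-force Python sketch of that search (real TPU slices are constrained to specific topology shapes, which this deliberately ignores):

```python
def largest_healthy_slice(healthy):
    """Return (rows, cols, area) of the largest all-healthy rectangle
    in a 0/1 grid of accelerator hosts."""
    n, m = len(healthy), len(healthy[0])
    # 2D prefix sums of healthy cells, for O(1) rectangle checks.
    ps = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            ps[i+1][j+1] = healthy[i][j] + ps[i][j+1] + ps[i+1][j] - ps[i][j]
    best = (0, 0, 0)
    for r1 in range(n):
        for r2 in range(r1, n):
            for c1 in range(m):
                for c2 in range(c1, m):
                    cells = (r2 - r1 + 1) * (c2 - c1 + 1)
                    total = (ps[r2+1][c2+1] - ps[r1][c2+1]
                             - ps[r2+1][c1] + ps[r1][c1])
                    if total == cells and cells > best[2]:
                        best = (r2 - r1 + 1, c2 - c1 + 1, cells)
    return best

grid = [[1] * 16 for _ in range(16)]
grid[3][7] = 0  # one failed host
print(largest_healthy_slice(grid))  # (12, 16, 192): rows 4-15, all columns
```

With one failed host, the best contiguous slice drops from 256 to 192 grid positions, which is exactly the kind of topology-aware decision that simple node labels cannot express.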
Speakers

John Belamaric

Senior Staff Software Engineer, Google
John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation...

Yash Sonthalia

Staff Software Engineer, Google
7 years of experience working as a software engineer at Google. Tech Lead for TPUs/GPUs in GKE AI.
Wednesday April 2, 2025 17:45 - 18:15 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML
 
Thursday, April 3
 

11:00 BST

Development Environments on Kubernetes: Lessons From Six Years at Internet Scale - Christian Weichel & Alejandro de Brito Fontes, Gitpod
Thursday April 3, 2025 11:00 - 11:30 BST
Running dev environments at scale presents unique challenges that push Kubernetes to the limit. After 6 years of operating development environments for 1.5 million users and as long-time contributors to the Kubernetes community, we encountered fundamental limitations with our use-case that led us to rearchitect Gitpod away from Kubernetes. Our recent technical deep-dive blog ended up on Hacker News and sparked quite the intense debate (speakers are the OP).

This talk dives into our journey of kernel modifications, custom controllers, implementations of user namespaces with shiftfs for UID mapping, seccomp notify for proc masking, and custom device policies for FUSE; tackling CPU throttling with custom CFS controllers; experiments with cgroup v2; and why 1.26's dynamic resource allocation didn’t solve our challenges. We will share these hard-won insights with the community and continue the discussion around development environment infrastructure, both on and off Kubernetes.
Speakers

Alejandro de Brito Fontes

Senior Engineer, Gitpod
Alejandro is a software entrepreneur and systems architect with more than 20 years of experience designing, building, and operating mission-critical IT infrastructure.

Christian Weichel

Chief Technology Officer, Gitpod
Chris Weichel is the Chief Technology Officer at Gitpod, where he leads the engineering team that builds and maintains the cloud-native platform for software development. With over 20 years of experience in software engineering and human-computer interaction, he has a comprehensive...
Thursday April 3, 2025 11:00 - 11:30 BST
Level 1 | Hall Entrance S10 | Room A
  Operations + Performance

11:45 BST

Dancing With the Pods: Live Migration of a Database Fleet While Serving Millions of Queries - Jayme Bird & Manish Gill, ClickHouse
Thursday April 3, 2025 11:45 - 12:15 BST
At ClickHouse, we recently changed the way we orchestrate databases provisioned by customers, specifically the way we use StatefulSets. There was just one big problem: we wanted to migrate our legacy fleet of thousands of services from the old orchestration code-path to the new one without any downtime - even the queries should continue to run as they are.

If there is one thing that people hate doing - it is migrations. They are painful, have lots of corner cases, and take a long time. In our case, it took us almost 6 months to migrate the entire fleet. But we encountered lots of interesting challenges along the way. This talk will walk you through these challenges of live migrating the entire ClickHouse Cloud Fleet's orchestration while continuing to serve customer queries and ingest. The story involves our Operator, deep-dive into StatefulSets, a custom migration controller, durable execution workflows, and many, many database synchronisation challenges.
Speakers

Manish Gill

Engineering Manager, ClickHouse Inc
Manish Gill works at ClickHouse Inc, where he manages the AutoScaling team for ClickHouse Cloud. He is based out of Berlin, is deeply interested in databases and cloud challenges, and still considers himself new to Kubernetes. In a past life, he worked in an ML research team...

Jayme Bird

Senior Software Engineer, ClickHouse
Jayme Bird is a Senior Software Engineer at ClickHouse Inc, working on the development of horizontal and vertical autoscaling solutions for ClickHouse Cloud, a stateful analytics DBaaS running on Kubernetes.
Thursday April 3, 2025 11:45 - 12:15 BST
Level 1 | Hall Entrance S10 | Room A
  Operations + Performance

14:15 BST

Beyond Security: Leveraging OPA for FinOps in Kubernetes - Sathish Kumar Venkatesan, Royal Bank of Canada
Thursday April 3, 2025 14:15 - 14:45 BST
The Open Policy Agent (OPA) is widely known for enforcing security policies, but its capabilities extend far beyond compliance. This session explores how OPA can be harnessed for FinOps practices in Kubernetes. Learn how to create policies to enforce cost-efficient resource requests, limit the use of high-cost instance types, and ensure workloads adhere to budget constraints. Discover how to integrate OPA with tools like Gatekeeper and OpenCost to provide real-time cost visibility and actionable insights. Through practical examples, attendees will gain the skills to use OPA for both security and cost optimization in Kubernetes environments.
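In practice such policies are written in Rego and enforced via Gatekeeper; as a language-neutral illustration, here is the shape of the cost-gating logic in Python (the prices, budget, and denylist are made-up examples, not from the talk):

```python
# Hypothetical per-CPU-hour cost and banned instance types, for illustration.
HOURLY_COST_PER_CPU = 0.04
DENIED_INSTANCE_TYPES = {"p4d.24xlarge", "u-24tb1.metal"}
MONTHLY_BUDGET_USD = 500.0

def admit(pod):
    """Mimics an OPA deny rule: reject denylisted or over-budget workloads."""
    if pod.get("instance_type") in DENIED_INSTANCE_TYPES:
        return False, f"instance type {pod['instance_type']} is denylisted"
    monthly = pod["cpu_request"] * HOURLY_COST_PER_CPU * 730  # ~hours/month
    if monthly > MONTHLY_BUDGET_USD:
        return False, f"estimated ${monthly:.0f}/month exceeds budget"
    return True, "ok"

print(admit({"cpu_request": 4, "instance_type": "m5.xlarge"}))
print(admit({"cpu_request": 64, "instance_type": "m5.xlarge"}))
```

Fed with live pricing from a tool like OpenCost instead of hard-coded constants, the same rule becomes a real-time budget gate at admission time.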
Speakers

Sathish Kumar Venkatesan

Principal Cloud Customer Engineer, Royal Bank of Canada
A Kubestronaut with 17 years of IT experience and 8 years in cloud-native technologies. As a Cloud Engineer, DevOps practitioner, and SRE, I focus on extending CNCF projects beyond traditional uses. Currently transforming OPA from security into FinOps, combining KEDA and virtual clusters...
Thursday April 3, 2025 14:15 - 14:45 BST
Level 1 | Hall Entrance S10 | Room A
  Operations + Performance

15:00 BST

A Huge Cluster or Multi-Clusters? Identifying the Bottleneck - Paco Xu, DaoCloud & Saiyam Pathak, Loft Labs
Thursday April 3, 2025 15:00 - 15:30 BST
The increasing complexity of Kubernetes deployments has sparked a debate between scaling single clusters to enormous sizes and managing multiple clusters. At KubeCon NA24, the CNCF Tech Landscape Radar unveiled insights into multicluster application management, while Google showcased a 65000-node cluster powered by Spanner, bypassing etcd's limitations. Similarly, ByteDance has achieved multi-tenancy at scale with Kubebrain.

This talk examines the challenges of large clusters (5,000+ nodes and beyond) and the trade-offs of multicluster solutions. Key topics include API server options, etcd tuning and alternatives (e.g., Kubebrain, kine), and operational concerns such as multi-tenancy models (vCluster, kubezoo, HNC), and operator version control. In parallel, multicluster management solutions like Karmada, Clusternet, and networking challenges with tools like Submariner are explored.

Attendees will gain actionable insights into selecting the most appropriate strategy for their needs.
Speakers

Saiyam Pathak

Principal Developer Advocate, Loft Labs
Saiyam is working as Principal Developer Advocate at Loft Labs. He is the founder of Kubesimplify, focusing on simplifying cloud-native and Kubernetes technologies. Previously at Civo, Walmart Labs, Oracle, and HP, Saiyam has worked on many facets of Kubernetes, including machine...

Paco Xu

Open Source Team Leader, DaoCloud
Paco is a member of the Kubernetes Steering Committee and the lead of DaoCloud's open-source team. In the community, Paco mainly works as a kubeadm maintainer and SIG Node reviewer. He is co-chair of KubeCon China 2024, organized the Kubernetes Contributor Summit China 2023 and KCD Chengdu 2022, and spoke at KubeCon EU 2023, KubeCon China 2021 & 2023, and KCD Shanghai. In 2024, he became an LF APAC Evangelist...
Thursday April 3, 2025 15:00 - 15:30 BST
Level 1 | Hall Entrance S10 | Room A
  Operations + Performance

16:00 BST

Defusing the Kubernetes API Performance Minefield - Madhav Jivrajani, UIUC & Marek Siarkowicz, Google
Thursday April 3, 2025 16:00 - 16:30 BST
Kubernetes enables a wide landscape of CNCF projects and organisations to build upon its foundation and extend its functionality through custom controllers. But anyone who has deployed an operator at scale quickly discovers that the Kubernetes API is a performance minefield. Forget to set resourceVersion when listing pods? Your control plane explodes!

This talk delves into recent enhancements in Kubernetes designed to defuse this performance minefield. We'll explore the improved storage layer that allows caching more types of requests, effectively halving request latency and reducing the load on etcd.

Don't let your cluster fall victim to faulty controllers – join us to learn how these changes mitigate risks, boost performance, and contribute to a more stable and reliable Kubernetes experience. We'll explore how the storage layer improves API responsiveness and predictability, and you'll understand the impact of these changes on scalability, reliability, and overall user experience.
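The resourceVersion pitfall mentioned above follows from the documented LIST semantics: leaving resourceVersion unset forces a quorum read from etcd, while resourceVersion="0" may be served (possibly stale) from the apiserver watch cache. A simplified Python sketch of those rules (ignoring resourceVersionMatch and pagination):

```python
from typing import Optional

def list_request_backend(resource_version: Optional[str]) -> str:
    """Where a Kubernetes LIST is served from, per documented semantics."""
    if resource_version is None:
        # Unset: "most recent" semantics -- a quorum read against etcd,
        # the expensive path that can overload the control plane.
        return "etcd (quorum read)"
    if resource_version == "0":
        # "Any" semantics: may return stale data from the watch cache.
        return "watch cache"
    # A specific version: "not older than" semantics, servable from the
    # watch cache once it has caught up to that version.
    return "watch cache (not older than requested version)"

print(list_request_backend(None))  # etcd (quorum read)
print(list_request_backend("0"))   # watch cache
```

A controller that lists thousands of pods with resourceVersion unset pushes every request down the etcd path; the storage-layer work discussed in this session widens the set of requests the cache can answer.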
Speakers

Madhav Jivrajani

Kubernetes Maintainer, UIUC
Madhav is currently working at VMware on upstream Kubernetes. He has been a part of the Kubernetes community for about a year and mainly helps out with SIG-{ContribEx, Node, Architecture, API-Machinery}. He was also involved with the structured logging efforts in the Kubernetes project...

Marek Siarkowicz

Senior Software Engineer, Google
Marek is a Software Engineer working at Google on the etcd team. He began his career in local startups, where he loved open source and extreme programming. Currently he is an etcd maintainer and an active member of SIG Instrumentation, leading the structured logging effort in Kubernetes. In his...
Thursday April 3, 2025 16:00 - 16:30 BST
Level 1 | Hall Entrance S10 | Room A
  Operations + Performance

16:45 BST

Live Migrating Stateful Batch Containers To Decrease Cluster Cost - Chris Battarbee & Ece Kayan, Metoro
Thursday April 3, 2025 16:45 - 17:15 BST
Stateless workloads have long been able to take advantage of cluster compaction and the cost savings of spot instances, but stateful workloads present unique challenges. Unlike stateless applications, stateful workloads can’t easily restart on a new node without losing their critical state, making dynamic optimization much more difficult.

This talk explores how container snapshotting using the Kubelet Checkpoint API enables live migration of stateful workloads. By capturing and restoring the state of running containers, we can now compact stateful workloads to fewer nodes and even run them on spot instances cutting costs significantly.

We’ll cover the technical details of analyzing your cluster for consolidation opportunities, snapshotting containers, and migrating them seamlessly using open source tooling.
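For context, the Kubelet Checkpoint API mentioned above is a single HTTPS POST against the kubelet. A hedged Python sketch that only constructs the endpoint (checkpointing also requires CRIU on the node and the ContainerCheckpoint feature gate; the node, namespace, pod, and container names are placeholders):

```python
def checkpoint_url(node, namespace, pod, container, port=10250):
    """Build the kubelet Checkpoint API endpoint for one container."""
    return (f"https://{node}:{port}/checkpoint/"
            f"{namespace}/{pod}/{container}")

# POSTing here (with kubelet client credentials) writes a checkpoint tar
# under /var/lib/kubelet/checkpoints/ on the node, which can then be
# converted to an OCI image and restored on another node.
url = checkpoint_url("worker-1", "default", "redis-0", "redis")
print(url)  # https://worker-1:10250/checkpoint/default/redis-0/redis
```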
Speakers

Chris Battarbee

Software Engineer, Metoro
Chris Battarbee is the founder of Metoro and a former engineer at Palantir, where he wrote software to manage Spark workloads on Kubernetes focussing on efficiency and cost savings.

Ece Kayan

Software Engineer, Metoro
Ece Kayan, co-founder of Metoro, is a former Amazon engineer who focused on improving the resiliency and reliability of Prime Video services.
Thursday April 3, 2025 16:45 - 17:15 BST
Level 1 | Hall Entrance S10 | Room A
  Operations + Performance

17:30 BST

Chaos Engineering Practice Under Ultra-large-scale Cloud Native Edge Computing - Yue Bao, Huawei & Yue Li, DaoCloud
Thursday April 3, 2025 17:30 - 18:00 BST
Fast-growing technologies, such as 5G networks, the industrial Internet, and AI, are giving edge computing an important role in driving digital transformation. As each new technology brings benefits, it also brings challenges. First, there are massive numbers of heterogeneous edge devices, encompassing a broad range of device types. Second, edge devices are often located in unstable and complex physical and network environments, with limited bandwidth, high latency, etc. How to overcome these challenges and build a stable, large-scale edge computing platform remains to be resolved.
KubeEdge is an open source edge computing framework that extends the power of Kubernetes from the central cloud to the edge. Today, Kubernetes clusters powered by KubeEdge can stably support 100,000 edge nodes and manage more than one million pods.
In this session, we will share the key challenges of managing massive numbers of heterogeneous edge nodes and explain how using Chaos Mesh makes KubeEdge more reliable across large-scale edge node fleets.
Speakers

Yue Bao

Senior Software Engineer, Huawei
Yue Bao serves as a software engineer of Huawei Cloud. She now works 100% on open source, focusing on lightweight edge for KubeEdge. She is a maintainer of KubeEdge and also the tech lead of KubeEdge SIG Release and Node. Before that, Yue worked on Huawei Cloud Intelligent...

Yue Li

Software Quality Engineer, DaoCloud
Yue Li works at DaoCloud as Quality Director and has more than 20 years of IT industry experience across China Mobile, Siemens, HP, EMC, and startups. A newcomer to cloud native and an open-source fan, Yue would like to adopt open-source projects to improve enterprise software quality with fast releases.
Thursday April 3, 2025 17:30 - 18:00 BST
Level 1 | Hall Entrance S10 | Room A
  Operations + Performance
  • Content Experience Level Any
 
Friday, April 4
 

11:00 BST

Empowering AI-Driven Drug Discovery: Overcoming Challenges in Building a ML Platform on Kubernetes - Marius Tanawa Tsamo & Gustav Rasmussen, Novo Nordisk
Friday April 4, 2025 11:00 - 11:30 BST
In the era of AI-driven innovation, Kubernetes is fundamental for enabling medical scientists to execute machine learning tasks within a containerized environment. However, building a scalable ML platform on Kubernetes presents challenges, especially with advanced on-premise GPU-accelerated hardware optimized for large language model (LLM) training and inference.

This session will explore the obstacles faced by ML engineers and data scientists at Novo Nordisk in creating a robust platform for AI-driven drug discovery. The presentation will discuss enabling access to GPU resources at scale, orchestrating extensive data planes, efficiently running high-performance computing (HPC) jobs, and using GPU sharing strategies and different batch scheduling job software.

We will share insights from our experience with GPU sharing strategies and batch scheduling software, overcoming operational challenges, and empowering ML engineers to accelerate drug discovery.
Speakers

Gustav Rasmussen

Tech Lead, Novo Nordisk A/S
Gustav is a Tech Lead in R&ED (Research & Early Development) at Novo Nordisk in Denmark, holds an MSc in Physics, and really likes cloud and platform engineering.

Marius Tanawa Tsamo

Senior Platform Engineer, Novo Nordisk
I have a Master's degree in Systems, Network and Security and seven years of IT experience. Although I'm very passionate about container environments, I'm even more passionate about meaningful contributions. I'm French, but even if I'm fairly new to Denmark, I have been moving from...
Friday April 4, 2025 11:00 - 11:30 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

11:45 BST

Extending Kubernetes for AI | Lessons Learned From Platform Engineering - Susan Wu, Google & Lucy Sweet, Uber
Friday April 4, 2025 11:45 - 12:15 BST
Kubernetes and the open-source ecosystem are becoming the universal control plane not only for conventional app orchestration but also for building AI applications. Yet, developers and cluster operators struggle with cost optimization for the specialized compute and customizing Kubernetes.

In this session, hear from platform engineers from Morgan Stanley, Uber, and Trivago, and learn how they designed shared platforms with infrastructure across cloud providers to support both business-critical apps and accelerated workloads.

You can expect to come away with guidance, hear of pitfalls to watch out for and learn how they extended Kubernetes with custom controls and other cloud native projects and built efficient, self-service interfaces to enable developer velocity and researcher experimentation.

Panelists:
  1. Lucy Sweet, Senior Software Engineer, Uber
  2. Susan Wu, Product Manager, Google
Speakers

Lucy Sweet

Senior Software Engineer, Uber
Lucy is a Senior Software Engineer at Uber Denmark who works on platform infrastructure

Susan Wu

Outbound Product Manager, Google
Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix, and Midokura (part of Sony Group). She is a frequent speaker at conferences...
Friday April 4, 2025 11:45 - 12:15 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

13:45 BST

From Chaos To Control: Building ML Platform - George Markhulia & Steve Larkin, Volvo Cars
Friday April 4, 2025 13:45 - 14:15 BST
One of the most significant challenges facing the ML community in large organizations is the fragmentation of the data ecosystem, compounded by organizational silos and an inconsistent technology landscape. Tackling these barriers is critical to enabling efficient, scalable, and impactful machine learning solutions. At Volvo Cars, George and Steve are deeply committed to breaking silos, empowering users, and enabling collaboration through MLOps.

In this session, they will share their experience of designing and implementing an ML platform on Kubernetes that bridges these gaps. The talk will cover architectural choices, key lessons learned, and best practices to address data accessibility, streamline workflows, and ensure seamless integration across diverse teams. Attendees will also gain insights into how this cloud-native platform enables faster experimentation, greater reproducibility, knowledge sharing, and scalable deployment of ML models across the organization.
Speakers

Steve Larkin

ML Platform Engineer, Volvo Cars
With over 20 years in the software industry Steve has worked with a diverse set of technologies from creating some of the first smartphones to building data and machine learning platforms for enterprises. Originally from the UK he now lives in Malmö, Sweden with his family.

George Markhulia

Engineering Manager, Volvo Cars
With extensive experience in technical problem-solving, software engineering, and data streaming, George is a tech lead with a robust background in technology and operational excellence. His career journey includes MLOps, Android Automotive infotainment, backend systems, and analytical...
Friday April 4, 2025 13:45 - 14:15 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML
  • Content Experience Level Any

14:30 BST

From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob - Andrey Velichkevich, Apple & Yuki Iwai, CyberAgent, Inc.
Friday April 4, 2025 14:30 - 15:00 BST
Message Passing Interface (MPI) is a foundational technology in distributed computing, essential for ML frameworks like MLX, DeepSpeed, and NVIDIA NeMo. It powers efficient communication for large-scale AI workloads over high-speed interconnects such as InfiniBand. However, running MPI on Kubernetes presents challenges, such as ensuring high-throughput pod-to-pod communication, managing MPI Job initialization in containerized environments, and supporting diverse MPI implementations, including OpenMPI, IntelMPI, and MPICH.

This talk will introduce the Kubeflow MPI Runtime integrated with Kubeflow TrainJob, featuring distributed training with MLX and LLM fine-tuning with DeepSpeed on Kubernetes. Speakers will highlight SSH-based optimization to boost MPI performance. Attendees will discover how this solution simplifies, scales, and optimizes AI workloads while addressing key challenges and combining MPI's efficiency with Kubernetes' orchestration power.
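For context on the launcher/worker pattern this abstract builds on, a minimal manifest using the existing Kubeflow mpi-operator v2beta1 API might look like the sketch below. The image name, command, and replica counts are illustrative placeholders, and the TrainJob-based runtime presented in the talk may expose a different API:

```yaml
# Sketch only: an MPIJob (kubeflow.org/v2beta1) with one launcher pod
# that drives mpirun across two worker pods, one MPI slot per worker.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi-demo            # hypothetical job name
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpioperator/mpi-pi   # example image from mpi-operator docs
            command: ["mpirun", "-n", "2", "/home/mpiuser/pi"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: mpi-worker
            image: mpioperator/mpi-pi
```

The operator wires up hostfiles and (in the SSH-based mode the speakers mention) SSH connectivity between launcher and workers, so `mpirun` behaves much as it would on a bare-metal cluster.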
Speakers
Andrey Velichkevich

Senior Software Engineer, Apple
Andrey Velichkevich is a Senior Software Engineer at Apple and is a key contributor to the Kubeflow open-source project. He is a member of Kubeflow Steering Committee and a co-chair of Kubeflow AutoML and Training WG. Additionally, Andrey is an active member of the CNCF WG AI. He...
Yuki Iwai

Software Engineer, CyberAgent, Inc.
Yuki is a Software Engineer at CyberAgent, Inc. He works on the internal platform for machine-learning applications and high-performance computing. He is currently a Technical Lead for Kubeflow WG AutoML / Training. He is also a Kubernetes WG Batch active member, Job API reviewer...
Friday April 4, 2025 14:30 - 15:00 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML

15:15 BST

Green AI in Cloud Native Ecosystems: Strategies for Sustainability and Efficiency - Vincent Caldeira, Red Hat & Tamar Eilam, IBM Research
Friday April 4, 2025 15:15 - 15:45 BST
The rapid proliferation of AI is increasing focus on the environmental costs associated with large-scale model training and deployment. As cloud-native technologies form the backbone of modern AI systems, the Cloud Native Computing Foundation (CNCF) is spearheading efforts to balance AI innovation with sustainability. This session will provide an overview of the CNCF effort to identify key areas, techniques, and best practices for energy-efficient AI in cloud-native environments. Attendees will gain insights into a newly developed taxonomy that categorises remediation patterns and sustainable practices across AI lifecycle phases, deployment environments, and personas.

We will also explore real-world applications and discuss reference architectures for optimising resource use, such as GPU slicing for inference efficiency, power capping during training, and carbon-aware scheduling, while maintaining performance and scalability.
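As one concrete illustration of the GPU-slicing pattern mentioned above, the NVIDIA Kubernetes device plugin supports a time-slicing configuration that lets several inference pods share a single physical GPU. The sketch below is an assumption about how such a setup is commonly expressed, not the speakers' reference architecture; the ConfigMap name and namespace are placeholders:

```yaml
# Sketch only: time-slicing config consumed by the NVIDIA k8s-device-plugin.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu replicas,
# so up to 4 pods can time-share it instead of one pod holding it idle.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-time-slicing    # hypothetical name
  namespace: gpu-operator      # hypothetical namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Time-slicing trades isolation for utilisation; where stronger isolation is needed, MIG partitioning is the usual alternative.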
Speakers
Tamar Eilam

IBM Fellow, Chief Scientist Sustainable Computing, IBM Research
Dr. Tamar Eilam is an IBM Fellow and Chief Scientist for Sustainable Computing at the IBM T. J. Watson Research Center, New York. Tamar completed a Ph.D. in Computer Science at the Technion, Israel, in 2000. She joined the IBM T.J. Watson Research Center in New York as a Research...
Vincent Caldeira

CTO APAC, Red Hat
Vincent Caldeira, CTO of Red Hat in APAC, is responsible for strategic partnerships and technology strategy. Named a top CTO in APAC in 2023, he has 20+ years in IT, excelling in technology transformation in finance. An authority in open source, cloud computing, and digital transformation...
Friday April 4, 2025 15:15 - 15:45 BST
Level 1 | Hall Entrance S10 | Room A
  AI + ML
 
