KubeCon + CloudNativeCon Europe 2025: Full Schedule

In-person
1-4 April 2025

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in British Summer Time (BST) (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

09:26 BST

Keynote: Into the Black Box: Observability in the Age of LLMs - Christine Yen, CEO and Cofounder, Honeycomb

Wednesday April 2, 2025 09:26 - 09:41 BST

Level 0 | ICC Auditorium

LLMs can provide a quick injection of magic into an existing product (or product concept)! Most of us looking to build on LLMs aren't ML engineers or AI experts, after all, and this new wave of LLM offerings makes it easy for any of us to build something delightful.

But once that product or feature is shipped, in production, in front of users, the problems all collapse back into something that feels awfully familiar: performance challenges, questionable accuracy, and unhappy or confused users.

This talk will assert that building on LLMs is just like building on top of any other sort of black box in our architecture (APIs, DBs, etc.); this one just happens to be inherently unpredictable and probabilistic.

We'll cover how to leverage observability best practices (like SLOs!) in this highly parameterized and rapidly evolving world, with nondeterministic outputs and a bunch of perceived risks—and you'll emerge more confident and ready to deal with this new AI-driven world.

Speakers

Christine Yen

CEO/Cofounder, Honeycomb

Christine is the CEO/cofounder of Honeycomb, an observability tool for teams who build and manage software that matters. She cares deeply about bridging the gap between devs and ops with technological and cultural improvements—and thinks that observability is really just a way... Read More →

Keynote Sessions, Observability

Content Experience Level Intermediate

09:48 BST

Keynote: AI Enabled Observability Explainers - We Actually Did Something With AI! - Vijay Samuel, Principal MTS, Architect, eBay

Wednesday April 2, 2025 09:48 - 10:03 BST

Level 0 | ICC Auditorium

If folks think that this will be yet another hand-wavy AI talk, prepare to be disappointed! Over the last few quarters, the Observability platform team at eBay has embarked on the journey of building "Explainers" for telemetry signals. "So, you are just shoving data into an LLM, big deal!" - one might say. The approach that we took was slightly different. Yes, an LLM does know how to interpret an OTEL trace waterfall but does it do it predictably? No! For various reasons. This is where AI and Engineering have a beautiful marriage. For each signal, we have carefully married crafty algorithms and LLMs to create more predictable and accurate AI enabled experiences. Some of which include explaining traces, metrics and logs.

We have also combined these building-block explainers to create compound explainers that can explain dashboards. This talk describes how techniques like critical path detection, paired with LLMs, work better than simply giving entire traces to the LLMs, and more.
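As a rough illustration of the critical path idea mentioned in this abstract, the following is a minimal sketch and not eBay's implementation. The `Span` type, its fields, and the span names are hypothetical. It walks backward from the end of each span and keeps only the children that sit on the latest-finishing chain, which is the kind of reduced input one might hand to an LLM instead of a whole trace.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified trace span (times in milliseconds)."""
    name: str
    start: float
    end: float
    children: list[Span] = field(default_factory=list)


def critical_path(span: Span) -> list[str]:
    """Return span names on the critical path, latest work first."""
    path = [span.name]
    cursor = span.end
    # Visit children from latest-finishing to earliest-finishing.
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        # Keep a child only if it finished before the current cursor,
        # so overlapping (parallel, non-blocking) work is skipped.
        if child.end <= cursor:
            path.extend(critical_path(child))
            cursor = child.start
    return path


root = Span("checkout", 0, 100, [
    Span("auth", 0, 10),
    Span("db-query", 10, 60),
    Span("cache-lookup", 15, 30),  # runs in parallel with db-query
    Span("render", 60, 95),
])
print(critical_path(root))  # ['checkout', 'render', 'db-query', 'auth']
```

In this example `cache-lookup` is dropped because it overlaps `db-query` and never gates the end of the request.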

Speakers

Vijay Samuel

Principal MTS, Architect, eBay

Vijay Samuel works with eBay's Reliability Engineering as its architect. During his time at eBay Vijay has transformed eBay's observability platform into a cloud native offering that is primarily built on top of open source technologies. He loves to code in Go and play video game... Read More →

Keynote Sessions, Observability

Content Experience Level Intermediate

10:26 BST

Keynote: Empowering Accessibility Through Kubernetes: The Future of Real-Time Sign Language Interpretation - Rob Koch, Principal, Slalom Build

Wednesday April 2, 2025 10:26 - 10:41 BST

Level 0 | ICC Auditorium

Communication barriers exclude millions of people from fully participating in everyday interactions. For the deaf and hard-of-hearing community, the absence of scalable, real-time sign language interpretation remains a persistent challenge. In this session, we will demonstrate a forward-looking AI-powered application that translates sign language into spoken language, deployed and orchestrated on Kubernetes. This application leverages generative AI (LxMs) to scale for multiple users, representing a step toward a future where communication is accessible to all.
Using the sign language translation use case, the session will demonstrate how Kubernetes is well positioned to support AI workloads, how it optimizes cluster resources for video and language processing, and how it integrates seamlessly with generative AI use-cases.

Speakers

Rob Koch

Principal, Slalom Build

A tech enthusiast who thrives on steering projects from their initial spark to successful fruition, Rob Koch is Principal at Slalom Build, AWS Hero, and Co-chair of the CNCF Deaf and Hard of Hearing Working Group. His expertise in architecting event-driven systems is firmly rooted... Read More →

Keynote Sessions, AI + ML

Content Experience Level Intermediate

11:15 BST

Dapr + Score: Mixing the Perfect Cocktail for an Enhanced Developer Experience - Mathieu Benoit, Humanitec & Kendall Roden, Diagrid

Wednesday April 2, 2025 11:15 - 11:45 BST

Level 1 | Hall Entrance N10 | Room H

Developer Experience (DevEx) is an important concept in Platform Engineering and in the cloud native space, advocating for self-service and reduced cognitive load. Its primary goal is to empower developers, allowing them to focus on coding rather than fighting with infrastructure intricacies. What is the right level of abstraction? Which type of tooling is essential? How can teams identify the concepts and workflows that drive value and success?
Tools such as Dapr and Score are being used in innovative ways to make a wider range of developers more productive. On one hand, they allow the Developers to be abstracted from underlying infrastructure and dependencies. On the other hand, Platform Engineers can easily configure the building blocks and associated infrastructure, seamlessly for the Developers.
This talk demonstrates a practical blueprint combining Dapr and Score, where you will see how to deploy any containerized Dapr workload defined by Score to Docker Compose or Kubernetes.

Speakers

Mathieu Benoit

Cloud Native Ambassador & Customer Success Engineer, Humanitec

I’m passionate about Cloud Native Computing technologies driven by Open Source, Cloud, Security, SRE, Containers, DevOps, Platform Engineering and Kubernetes. Based on my past experiences as software engineer, IT consultant, solution architect and customer success engineer, I now... Read More →

Kendall Roden

Technical Product Lead, Diagrid

Kendall is a Technical Product Lead at Diagrid, helping shape the future of cloud-native development through the creation of developer-centric products. After 6+ years at Microsoft in a variety of roles in the application development space, Kendall transitioned into product management... Read More →

Application Development

Content Experience Level Intermediate

11:15 BST

Stateful Superpowers: Explore High Performance and Scaleable Stateful Workloads on K8s - Alex Chircop & Chris Milsted, Akamai

Wednesday April 2, 2025 11:15 - 11:45 BST

Level 1 | Hall Entrance S10 | Room D

There is no such thing as a stateless application - All applications need to store state somewhere!

Stateful workloads like databases and key value stores are often deployed outside of K8s, missing out on all the benefits of declarative config, scaling, failover and automatic healing.

In this talk we show that running stateful workloads in K8s is not only performant and scalable but also resilient, and that it can facilitate Disaster Recovery.

We will discuss the cloud native ecosystem and provide live demos of:
* Running a million RPS on a KV store with TiKV
* Running scalable, replicated and resilient Postgres databases with CloudNativePG
* Running analytics & ML on a distributed filesystem with CubeFS
… all in K8s, using K8s features to scale, failover and run day 2 operations. Working examples for the demos will be shared to enable the audience to run their own databases and stateful workloads in K8s.

Finally, we will end with a discussion of use cases and best practices.

Speakers

Alex Chircop

Chief Architect, Akamai

Chief Architect at Akamai. Previously a founder and CTO of Ondat (formerly StorageOS), building software defined solutions for cloud native environments. Alex is also a co-chair of the CNCF Storage TAG. Before embarking on the startup adventure he spent over 25 years engineering infrastructure... Read More →

Chris Milsted

Senior Product Architect, Akamai

Chris has been working with Kubernetes since pre 1.0 when it was the Beta for OpenShift version 3 at Red Hat. Since then he has moved, via VMware and Tanzu, to Akamai (via Ondat) as a Senior Product Architect, helping to design and deliver cloud scale technologies. Outside of work... Read More →

Data Processing + Storage

Content Experience Level Intermediate

11:15 BST

Lessons Learned From Architecting the Highest-scale Operational Systems in the World - Artur Bergman, Fastly

Wednesday April 2, 2025 11:15 - 11:45 BST

Level 0 | ICC Capital Hall | Room 2

Platform engineering for accelerating modern, resilient cloud-native systems requires a ruthless focus on the experience of both your customers and your developers. Restrictive vendor experiences, made worse by overreliance on single-point solutions, and the isolated bash script approaches from the past introduce unacceptable compromises to performance, security, and quality for continuous operations. As the founder and CTO of Fastly, Artur Bergman has spent decades optimizing the vendors in his stack and how he uses them to build a cohesive developer toolchain for Fastly’s internal teams and customer platform teams worldwide. This talk will cover: lessons learned from testing the limits of vendor systems to meet business needs, evaluating when to build versus buy platform engineering systems from first principles, and how to apply a rigorous experience design lens when architecting platforms for team success.

Speakers

Artur Bergman

Founder and CTO, Fastly

Artur Bergman currently serves as Chief Technology Officer of Fastly, Inc., a leading edge cloud platform. Artur founded Fastly in 2011 and served as its CEO until 2020, guiding the company through its IPO in 2019. Prior to becoming CTO in 2024, he held the role of Chief Architect... Read More →

Platform Engineering

Content Experience Level Intermediate

11:15 BST

Tutorial: Exploring Multi-Tenant Kubernetes APIs and Controllers With Kcp - Robert Vasek, Clyso GmbH; Nabarun Pal, Broadcom; Varsha Narsing, Red Hat; Marko Mudrinic, Kubermatic GmbH; Mangirdas Judeikis, Cast AI

Wednesday April 2, 2025 11:15 - 12:30 BST

Level 1 | Hall Entrance N11

While Kubernetes transformed container orchestration, creating multi-tenant platforms remains a significant challenge. kcp goes beyond DevOps and workload management, to reimagine how we deliver true SaaS experiences for platform engineers. Think workspaces and multi-tenancy, not namespaces in a singular cluster. Think sharding and horizontal scaling, not overly large and hard to maintain deployments. With novel approaches to well-established building blocks in Kubernetes API-Machinery, this CNCF sandbox project gives engineers a framework to host and consume any kind of API they need to support their platforms.

In this hands-on workshop, participants will learn how to extend Kubernetes with kcp, build APIs, and design controllers to tackle multi-tenancy challenges. By exploring real-world scenarios like DBaaS across clusters, attendees will gain practical skills to create scalable, multi-tenant platforms for their Kubernetes environments.

Speakers

MJ / Mangirdas Judeikis

Staff Engineer, kcp maintainer, Cast AI

Control planes, distributed systems and open source. All Kubernetes and kcp! A decade of Kubernetes experience, focused on Kubernetes-based platform engineering. Doing platform engineering before it was cool. :) I thrive on Go, Kubernetes, and an Open Source... Read More →

Marko Mudrinić

Senior Software Engineer, Kubermatic GmbH & University Union

Marko is a Senior Software Engineer at Kubermatic, working on the development of Kubernetes, kcp, and platforms for managing Kubernetes clusters at scale. He currently serves as a Subproject Lead for Kubernetes Release Engineering, a Senior Release Manager, and a Tech Lead for SIG... Read More →

Varsha Narsing

Senior Software Engineer, Red Hat

Varsha is a software engineer at Red Hat. She is passionate about solving problems by developing and leveraging various software technologies. She currently works with the Portfolio Enablement team (Operator Framework) and is an active contributor to Kubernetes SIGs projects like... Read More →

Nabarun Pal

Principal Software Engineer, Broadcom

Nabarun is a Principal Software Engineer at Broadcom, a maintainer of the Kubernetes project, a chair of Kubernetes SIG Contributor Experience and an emeritus Kubernetes Steering Committee member. He has contributed to kcp in various ways in the recent past. He is a Release Manager... Read More →

Robert Vasek

Software Engineer, Clyso GmbH

Robert is a software engineer working on storage and container technologies.

Tutorials, Platform Engineering

Content Experience Level Intermediate

12:00 BST

Streamlined Efficiency: Unshackling Kubernetes Image Volumes for Rapid AI Model and Dataset Loading - Esteban Rey, Microsoft & Yifan Yuan, AlibabaCloud

Wednesday April 2, 2025 12:00 - 12:30 BST

Level 1 | Hall Entrance S10 | Room D

In this presentation, we will introduce a novel approach to utilizing Kubernetes’ new Image Volumes for quickly and efficiently loading large language models and extensive datasets. We will explain how streaming loading and open-source technologies speed up mounting Open Container Initiative (OCI) artifacts without packaging existing object storage blobs. This ensures effective usage of storage space and faster loading times.

Packaging large models and petabyte-level datasets into OCI artifacts presents two challenges:

1. Converting existing datasets is time-consuming.
2. Pulling time and disk space usage are unacceptable.

Our approach eliminates the need to convert existing data and uses streaming loading technology to mount image volumes without pulling. It ensures high performance for accessing numerous small files and loading large models, making it practical for new and demanding scenarios.

Speakers

Yifan Yuan

Senior Software Engineer, AlibabaCloud

Yifan Yuan is a software engineer in the Alibaba Cloud storage team and is a major maintainer of the containerd/overlaybd project. He has rich experience in improving the startup efficiency of containers and large-scale data distribution. Yifan has collaborated with companies such as... Read More →

Esteban Rey

Software Engineer II, Microsoft

Esteban Rey is a Software Engineer at Azure and a maintainer of the containerd/accelerated-container-image project. Over the past four years, he has played a key role in developing the Azure Container Registry, ensuring Open Container Initiative conformance, and integrating open-source... Read More →

Data Processing + Storage

Content Experience Level Intermediate

12:00 BST

Expanding eBPF’s Reach: From Batteries-Included Auto-Instrumentation To E2E Observability Pipelines - Dom Del Nano, Cosmic

Wednesday April 2, 2025 12:00 - 12:30 BST

Level 1 | Hall Entrance N10 | Room E

Traditional monitoring and o11y were defined by the painstaking process of manual instrumentation—an inconsistent and error-prone effort, especially with the rise of cloud environments. eBPF promised a breakthrough, introducing auto-instrumentation that could eliminate these challenges. When the magic of eBPF works, it’s transformative, but there are times where its auto instrumentation comes up empty. Rigid, black box tooling is frustrating—at its best it’s magical and at its worst it’s distrusted quickly.

What if eBPF provided a “batteries included but removable” experience, enabling engineers to customize o11y to their needs? In this talk, we’ll discuss how CNCF Pixie and Inspektor Gadget provide the right abstraction for unlocking eBPF’s full potential with their powerful post-processing and k8s enrichment capabilities. We’ll also explore how this vision transformed Pixie’s data collector into a universal agent that can power observability pipelines like Fluentbit and Vector.

Speakers

Dom Delnano

Pixie core maintainer, Cosmic

Dom is a core maintainer of the Pixie open source project and founder/CEO at Cosmic. He previously worked at Crowdstrike, focusing on the eBPF Linux sensor, and at New Relic, working on Pixie full-time. Dom first began building observability tooling at Twitter, where he scaled the... Read More →

Observability

Content Experience Level Intermediate

12:00 BST

Taming 50 Billion Time Series: Operating Global-Scale Prometheus Deployments on Kubernetes - Orcun Berkem & Alan Protasio, AWS

Wednesday April 2, 2025 12:00 - 12:30 BST

Level 1 | Hall Entrance S10 | Room C

Scaling Prometheus to support 50 billion active time series across 20 regions on Kubernetes is a monumental challenge. This session delves into the architecture, processes, and tools that make it possible. We will explore the design of stateful sets and zone-aware deployments to ensure reliability and scalability, alongside deployment processes tailored for high availability and fault tolerance. Learn how cellular architecture enables granular scaling and fault isolation, and discover our approach to multi-tenancy, including protection mechanisms against noisy neighbors such as shuffle sharding and throttling with token buckets. We'll also discuss the journey of scaling each cell to 1 billion active time series, highlighting the Kubernetes challenges we faced and solved along the way. Attendees will leave with actionable insights into building resilient, efficient, and scalable systems using Kubernetes in the cloud-native ecosystem.
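For readers unfamiliar with the two noisy-neighbor protections named in this abstract, here is a minimal sketch of each. It is a generic illustration, not the speakers' or Cortex's implementation. All names (`shuffle_shard`, `TokenBucket`, the node and tenant identifiers) are hypothetical.

```python
import hashlib
import random
import time


def shuffle_shard(tenant_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Deterministically pick a small per-tenant subset of nodes.

    Two tenants rarely share the exact same subset, so one tenant
    overloading its shard affects only a fraction of the others.
    """
    digest = hashlib.sha256(tenant_id.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(sorted(nodes), shard_size))


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._now = now
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        t = self._now()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


nodes = [f"ingester-{i}" for i in range(12)]
shard = shuffle_shard("tenant-a", nodes, shard_size=3)

clock = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(rate=1.0, capacity=2.0, now=lambda: clock[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call is throttled
clock[0] += 1.0  # one second later, one token has been refilled
after_refill = bucket.allow()
```

Hashing the tenant ID into the seed keeps the shard stable across restarts; a production system would also need to handle nodes joining and leaving, which this sketch ignores.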

Speakers

Alan Protasio

Software Developer Engineer, AWS

Alan is a core contributor and maintainer of Cortex and currently serves as a Senior Software Engineer at AWS, where he works on the Amazon Managed Prometheus Service. With over 15 years of experience in the tech industry, Alan has played a pivotal role in shaping several AWS services... Read More →

Orcun Berkem

Principal Engineer, AWS

Orcun is a seasoned engineer with expertise in building scalable, resilient systems and leading large teams. As a Principal Engineer at AWS Open Source Observability, he focuses on scaling Cortex, along with working on AWS Distribution of OpenTelemetry, Grafana, and OpenSearch, and... Read More →

Operations + Performance

Content Experience Level Intermediate

12:00 BST

Day-2’000 - Migration From Kubeadm+Ansible To ClusterAPI+Talos: A Swiss Bank’s Journey - Clément Nussbaumer, PostFinance

Wednesday April 2, 2025 12:00 - 12:30 BST

Level 1 | Hall Entrance S10 | Room B

Is it even possible to migrate 35 clusters in an air-gapped environment with a custom PKI to ClusterAPI without downtime? We'll show you why and how this can be pulled off, and how you could do the same.

The journey starts with our legacy provisioning setup (a mix of kubeadm/ansible/puppet), followed by the migration path and tooling. Along the road, we'll discover a series of challenges such as loss of etcd quorum, matching legacy/new kube-apiserver configuration, mismatching etcd encryption keys, and more.

After a live demo of a migration, the session explores managing our fleet of clusters with ArgoCD (with a focus on simple Talos configuration files in our repositories thanks to a few templating tricks, and a clean ClusterAPI workload cluster overview through ArgoCD ApplicationSets).

The presentation concludes by addressing a critical puzzle: solving the chicken/egg bootstrapping problem of the first ClusterAPI management cluster(s).

Speakers

Clément Nussbaumer

Systems Engineer, PostFinance

🇨🇭 Systems Engineer living on a farm 🐄 Kubernetes Clusters during the day, helping out on the farm whenever needed, and playing music in the evening 🎺

Platform Engineering

Content Experience Level Intermediate

12:00 BST

Leveraging Internal Knowledge: Building AiKA at Spotify - Majd Salman & Jofre Mateu Matesanz, Spotify

Wednesday April 2, 2025 12:00 - 12:30 BST

Level 0 | ICC Capital Hall | Room 2

In the fast-paced world of technology, access to the right information at the right time is crucial for innovation and efficiency. Enter AiKA, Spotify's RAG-based internal “artificial intelligence knowledge assistance” platform, designed to empower our developers by providing instant access to the vast pool of internal knowledge through various surfaces. We'll cover why we developed AiKA, detailing the challenges of managing and retrieving info across a large organization. Learn about the technologies and methodologies we employed and how we integrated AiKA seamlessly into our existing infrastructure.

We'll highlight how AiKA's flexible API allows engineers to ingest their own custom knowledge, tailoring the tool to meet the unique needs of different teams. Discover how it not only enhances productivity but also fosters a culture of self-service and continuous learning.

Speakers

Jofre Mateu Matesanz

Software Engineer, Spotify

Jofre is a Senior Data Engineer at Spotify with a focus on making internal knowledge assistance and productivity tools for engineers.

Majd Salman

Senior Data Engineer, Spotify

Majd Salman is a Senior Data Engineer at Spotify with a focus on making internal knowledge assistance and productivity tools for engineers.

Platform Engineering

Content Experience Level Intermediate

13:30 BST

🪧 Poster Session: Enhancing Research and Data Delivery With the Data Delivery System (DDS) - Álvaro Revuelta M., SciLifeLab Data Centre & Valentin Georgiev, Uppsala Universitet

Wednesday April 2, 2025 13:30 - 14:30 BST

Level 1 | Hall Entrances S8 - S9, N8 - N9

The Data Delivery System (DDS) is a cloud-based platform developed by the SciLifeLab Data Centre for the secure and efficient delivery of research data from SciLifeLab Facilities to their users, specifically research groups. The application is containerized and runs in Kubernetes clusters. The deployments are synchronized with ArgoCD and use modern GitOps tools such as SealedSecrets.

This poster session will present the architecture and key features of DDS, including its use of containerization, automated deployment, and robust data management capabilities. Attendees will gain insights into how DDS facilitates fast and secure data transfers, supporting the needs of the life sciences research community.

Speakers

Valentin Georgiev

Systems developer, Uppsala Universitet

With over 10 years of experience in High-Performance Computing (HPC), I have been working with microservices architecture since 2016 and have specialized in Kubernetes (k8s) and Kubernetes application development since 2020. My expertise spans designing, deploying, and managing scalable... Read More →

Álvaro Revuelta M.

System Developer, SciLifeLab Data Centre

System Developer, working to build reliable systems that enable life sciences research

🪧 Poster Sessions, Platform Engineering

Content Experience Level Intermediate

13:30 BST

🪧 Poster Session: Extensible Kubernetes CRDs Via Inheritance for Modularity and Reuse - Nik Dijkema & Mostafa Hadadian, University of Groningen

Wednesday April 2, 2025 13:30 - 14:30 BST

Level 1 | Hall Entrances S8 - S9, N8 - N9

Maintainability and adaptability are crucial for continuous deployment in dynamic cloud environments, emphasizing the need for modularity.

Kubernetes CRDs and controllers provide declarative APIs. But extensibility and reusability limitations pose a challenge and impair custom resource modularity. Extending CRD schemas induces API changes or requires weaker schemas, control logic is not reusable for similar resource types, and many operators are complex monolithic controllers.

This work solves these limitations by implementing inheritance to enable extension and reuse of CRD schemas and controllers. Schema inheritance enables extending an existing CRD schema without changing its API, providing APIs at different levels of abstraction. This allows reuse of common controller functionality through generalisation, promoting separation of concerns in operators. Finally, inheritance enables reasoning about substitutability of custom resources, providing opportunities for adaptability.

Speakers

Nik Dijkema

Graduate Student, University of Groningen

Nik is a Master's student in Software Engineering and Distributed Systems at the University of Groningen, where he also obtained his Bachelor's degree in Computing Science. His interests lie in cloud computing and cloud-native infrastructure.

Mostafa Hadadian

AI/MLOps Innovator | Founder & CEO, University of Groningen | CAIDEL

Mostafa is Founder and CEO of CAIDEL: Continuous AI Deliver. He is also completing his PhD in Computer Science at the University of Groningen. His work lies in cloud native and machine learning development, emphasizing MLOps. Complementing his academic pursuits, he brings a wealth... Read More →

🪧 Poster Sessions, Emerging + Advanced

Content Experience Level Intermediate

14:30 BST

Enhancing Database Observability With OpenTelemetry - Marylia Gutierrez, Grafana Labs

Wednesday April 2, 2025 14:30 - 15:00 BST

Level 1 | Hall Entrance N10 | Room E

With the recent stabilization of the OpenTelemetry semantic conventions for databases, it's an excellent time for OSS libraries to provide users with the observability they've been seeking. This talk dives into how you can instrument your application with OpenTelemetry SDKs to improve observability and collect actionable telemetry data from your databases. Learn about the SDK implementations that are currently available by language and database, their current gaps and how you can contribute and develop missing instrumentation.
Whether you're an SRE, developer, or database administrator, this talk will equip you with the tools and knowledge to bring clarity and efficiency to your database systems.

Speakers

Marylia Gutierrez

Staff Software Engineer, Grafana Labs

Marylia is a Staff Software Engineer at Grafana Labs, focusing on Observability with OpenTelemetry. In the OpenTelemetry project, she is an approver for Database Semantic Conventions, JS SDK and Portuguese localization and also a maintainer for Contributor Experience. Before that... Read More →

Observability

Content Experience Level Intermediate

14:30 BST

Don't Write Controllers Like Charlie Don't Does: Avoiding Common Kubernetes Controller Mistakes - Nick Young, Isovalent at Cisco

Wednesday April 2, 2025 14:30 - 15:00 BST

Level 1 | Hall Entrance S10 | Room B

So you've learned about Custom Resource Definition (CRD) design errors, you've designed your CRD to avoid common mistakes, and now you're ready to write the controller.

Turns out there are a lot of gotchas in that process as well!

This talk explores the common pitfalls that the ever-unlucky Charlie Don't, who always makes the worst decisions, runs into when implementing a controller.

The talk should be particularly useful for anyone writing reconciliation loops that use Kubernetes objects, whether they are CRDs or not. You can expect to come away from this talk having learned about common mistakes like: straining the apiserver with too many status updates, missing updates in complex systems of CRDs, and having scaling problems from not using caching correctly.

No knowledge of the previous talks is required, so come and have a chuckle at poor old Charlie Don't's bad luck while picking up some tips for yourself.
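To make one of the mistakes listed above concrete (straining the apiserver with too many status updates), here is a minimal, library-free sketch. It is not taken from the talk. The function names and the dictionary shapes are hypothetical stand-ins for a real client, and the condition fields only loosely follow the common Kubernetes condition layout.

```python
from typing import Callable, Optional


def next_condition(existing: Optional[dict], cond_type: str,
                   status: str, reason: str, now: str) -> dict:
    """Build a condition, reusing the old one when nothing meaningful changed.

    Stamping a fresh timestamp on every reconcile makes the status differ
    each time, which forces a write on every loop iteration.
    """
    if existing and existing.get("status") == status and existing.get("reason") == reason:
        return existing
    return {"type": cond_type, "status": status,
            "reason": reason, "lastTransitionTime": now}


def sync_status(observed: dict, desired: dict,
                write: Callable[[dict], None]) -> bool:
    """Issue a status write only when the desired status differs."""
    if observed == desired:
        return False
    write(desired)
    return True


writes: list = []  # stands in for calls to the API server

# First reconcile: no status yet, so one write is expected.
c1 = next_condition(None, "Ready", "True", "Reconciled", now="t0")
first = sync_status({}, {"conditions": [c1]}, writes.append)

# Second reconcile at a later time: nothing changed, so no write.
c2 = next_condition(c1, "Ready", "True", "Reconciled", now="t1")
second = sync_status({"conditions": [c1]}, {"conditions": [c2]}, writes.append)
```

Here the second reconcile produces no write because the condition is reused rather than re-stamped.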

Speakers

Nick Young

Senior Software Engineer, Isovalent at Cisco

Nick has been working to prevent the entropic downfall of systems for 25 years, across datacenters, clouds, networking, and others. He's a Staff Engineer at Isovalent, and a maintainer on the Kubernetes Gateway API project, where he works on improving the ingress and mesh experiences... Read More →

Platform Engineering

Content Experience Level Intermediate

14:30 BST

Trust No One: Secure Storage With Confidential Containers - Aurélien Bombo, Microsoft

Wednesday April 2, 2025 14:30 - 15:00 BST

Level 0 | ICC Auditorium

If you are processing and storing sensitive data in the cloud, can you really trust anyone (including the cloud)? The answer is no. Confidential Containers (CoCo) is a CNCF project that leverages Trusted Execution Environments (TEEs) to tackle this challenge. A critical aspect in this effort is providing secure and confidential storage solutions that can be seamlessly deployed across cloud providers.

This session explores the implementation of trusted storage in CoCo, highlighting key aspects such as Kubernetes storage drivers, device virtualization, and the role of attestation in secure key release and data encryption. We also demonstrate how we prevent attackers from injecting data into the TEE using the CNCF Rego policy language.

Overall, we aim to show how cloud providers and end users can securely store and protect sensitive data, enabling the adoption of confidential computing across numerous use cases.

Speakers

Aurélien Bombo

Software Engineer, Microsoft

Aurélien is a contributor to the Confidential Containers project and serves on the Architecture Committee of sister project Kata Containers. At Microsoft, he works on the Linux confidential platform.

Security

Content Experience Level Intermediate

14:30 BST

Tutorial: "Working Code Wins": Win Big With a Cloud Native Hackathon Starter Pack - Phill Morton, The Access Group & Abby Bangser, Syntasso

Wednesday April 2, 2025 14:30 - 15:45 BST

Level 1 | Hall Entrance N11

It can be easy to look at the CNCF landscape and think that the CNCF is only focused on tools and technologies. However, the Cloud Native Maturity Model helps re-centre the conversation on the real mission: Business outcomes, people, process, and policy – and of course, also technology.

Join this workshop to learn about our experiences running a company-wide hackathon at The Access Group using only open source software, which not only launched innovative business ideas but also created a whole new awareness and adoption of cloud native technologies.

You will get hands-on with creating an effective developer experience using a Backstage Portal, managing infrastructure with OpenTofu, and everything in between. Most importantly, at the end of this session, you will have a working platform blueprint to take back, ready for hacking in your organisation.

Speakers

Abby Bangser

Principal Engineer, Syntasso

Abby is a Principal Engineer at Syntasso delivering Kratix, an open-source cloud-native framework for building internal platforms on Kubernetes. Her keen interest in supporting internal development comes from over a decade of experience in consulting and product delivery roles across...

Phill Morton

Platform Architect, The Access Group

Phill is a dedicated software engineer turned platform engineer, with a strong passion for automating processes and enabling team success. With extensive experience in app modernization, cloud engineering, performance tuning, and observability, Phill brings a wealth of knowledge...

Tutorials, Platform Engineering

Content Experience Level Intermediate

15:15 BST

Production-Ready LLMs on Kubernetes: Patterns, Pitfalls, and Performance - Priya Samuel, Elsevier & Luke Marsden, MLOps Consulting

Wednesday April 2, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance S10 | Room A

Many orgs are evaluating running open source LLMs on their own infrastructure, and Kubernetes is a natural platform choice. However, running open source LLMs in production on Kubernetes is, honestly, a bit of an undocumented mess.

This technical presentation shares the experience of both speakers in deploying production-grade LLM infrastructure on Kubernetes. Through practical demonstrations, we'll explore the complete deployment lifecycle, from GPU setup to optimization techniques like Flash Attention, quantization tradeoffs and GPU sharing.

You'll learn:

* Architectural patterns for efficient LLM deployment using Ollama and vLLM
* Solutions for model weight management and context length optimization
* Techniques for GPU sharing and improving resource utilization
* Production approaches to fine-tuning with Axolotl and serving multiple models with LoRAX

You'll leave with a complete blueprint for building reliable, scalable LLM infrastructure on Kubernetes.

Speakers

Priya Samuel

Full stack engineer, Software Architect, Elsevier

Priya Samuel is a seasoned technology leader with a passion for transforming complex challenges into actionable solutions. With extensive expertise in DevOps, cloud-native technologies, and Identity and Access Management (IAM), Priya has helped organizations scale their data and...

Luke Marsden

Founder, MLOps Consulting

Technical leader and startup founder who participated in the early development of Docker and Kubernetes. Former SIG lead for SIG-cluster-lifecycle.

AI + ML

Content Experience Level Intermediate

15:15 BST

The Great Sidecar Debate - William Morgan, Buoyant

Wednesday April 2, 2025 15:15 - 15:45 BST

Level 0 | ICC Capital Hall | Room 1

Sidecars, long the defining characteristic of the service mesh, are now the subject of its latest debate. While Kubernetes itself has recently added native support for sidecar containers, for service meshes, the question remains: does this architecture still hold water? Or, in the world of ambient and eBPF, are sidecars an antiquated approach already surpassed?

In this session, we'll take a pragmatic and engineering-focused approach to the debate. Every engineering choice is ultimately a tradeoff, so what are the tradeoffs at play here? Are there situations where sidecars provide value vs alternatives? Situations in which they suffer by comparison? We'll evaluate the practical considerations for service meshes: resource consumption, operational considerations (e.g. blast radius), security considerations (e.g. threat models), and more, and attempt to paint a comprehensive and unbiased picture of the pros and cons between approaches.

Speakers

William Morgan

CEO, Buoyant

William is the co-founder and CEO of Buoyant, the creator of the open source service mesh project Linkerd. Prior to Buoyant, he was an infrastructure engineer at Twitter, where he helped move Twitter from a failing monolithic Ruby on Rails app to a highly distributed, fault-tolerant...

Connectivity

Content Experience Level Intermediate

15:15 BST

Trino and Data Governance on Kubernetes - Sung Yun & Aki Sukegawa, Bloomberg

Wednesday April 2, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance S10 | Room D

As secure and seamless data discovery and exploration become top priorities for data science platforms and their generative AI workflows, intelligent solutions for data access, catalog management, and distributed data analytics are becoming critical for cloud platform teams. One extremely popular solution is to utilize Trino in combination with Open Policy Agent (OPA) to deliver a distributed and secure SQL solution that can answer authorization checks at runtime, in a cloud native manner.

In this talk, we will walk through how we designed various Trino CustomResources on top of Kubernetes, Envoy Proxy, and Istio to enable a self-service and scalable data exploration platform. This design, in conjunction with a granular and centralized data governance framework, enables secure data discovery at a company-wide level within Bloomberg.

Speakers

Aki Sukegawa

Principal Engineer, Bloomberg

Aki Sukegawa is a Senior Software Engineer with the Enterprise Data Science Infrastructure team at Bloomberg. He is a contributor to various open source projects and is an Apache Thrift committer and PMC member.

Sung Yun

Team Lead, Bloomberg

Sung Yun is the Team Lead of Bloomberg's Cloud Native Compute Services (CNCS) Trino & Catalog engineering team, based out of New York City. His team focuses on utilizing open source tools like Kubernetes, Trino and Apache Iceberg to build a scalable data exploration platform for the...

Data Processing + Storage

Content Experience Level Intermediate

15:15 BST

More Data Please: Hands on Green Cloud Experiments - Leonard Pahlke, BWI GmbH & Antonio Di Turi, Data Reply

Wednesday April 2, 2025 15:15 - 15:45 BST

Level 0 | ICC Capital Hall | Room 2

Sustainable cloud computing has been a topic for over a decade, but we lack concrete data on Kubernetes energy consumption. This session shares a case study of a microservice running on a k3s cluster, providing real energy metrics at every stage of Platform Engineering: Day 0 (manual setup with k3s, Cilium, microservice deployment), Day 1 (introducing ArgoCD, Falco for security), and Day 2 (adding observability with Prometheus, Grafana, OpenTelemetry, and Kepler). We use bare-metal environments to ensure clean, measurable energy data, from idle setup to fully operational.

We’ll explore how tools like Kepler estimate energy consumption for Kubernetes components and compare them to actual plug measurements. For Day 3, we’ll present experiments: changing programming languages, OS images, VPA and KEDA. By sharing practical insights and data, we aim to inspire engineers to innovate and build a more sustainable cloud-native ecosystem.

Presented by TAG Environmental Sustainability Leads.

Speakers

Antonio Di Turi

Data Engineer, Data Reply

Co-chair of WG Green review in the CNCF TAG-environmental-sustainability. I am determined and dynamic, I like the crowd and I like to be exposed to new stimuli. DevOps and Sustainability are my passions. I feel very lucky because in my job I always find some fun.

Leonard Pahlke

Senior Expert Cloud Native Engineering, BWI GmbH

Leonard is a dedicated open source contributor and leader, currently chairing the CNCF TAG Environmental Sustainability. Previously, Leonard led the K8s release team for v1.26 and served as the emeritus advisor for v1.28. With a strong focus on emerging technologies, he advocates for open...

Platform Engineering

Content Experience Level Intermediate

15:15 BST

Zero Forks Given: Minimizing Friction When Adopting OSS - Alexander Perlman & Narayanamurthi Mari, Capital One

Wednesday April 2, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance S10 | Room B

Open source software often does not meet internal requirements at large enterprises, especially those with elevated security and regulation requirements. Leveraging said projects often requires modifying or extending them to meet these internal mandates.

In this talk, we will review different patterns for “internalizing” external open source projects and discuss the pros and cons of each approach. These patterns are upstream contribution, forking, wrapping, and mutation.

We will review specific case studies using popular open source projects (including Kubeflow, Argo Workflows, Dask, and more) and how we fulfilled internal requirements using the four aforementioned approaches.

In particular, we want to highlight the comparative benefits of Kubernetes mutating admission control (with Kyverno) when adopting open source projects. We hope that audiences will walk away with concrete tools to streamline open source adoption.

Speakers

Alexander Perlman

Senior Lead Software Engineer, Capital One

Alexander Perlman is a senior lead software engineer at Capital One's Machine Learning Experience organization. His areas of focus include distributed compute and workflow orchestration. He lives in the NYC metro area (aka NJ and ashamed) with his wife and three young children. He...

Narayanamurthi Mari

Distinguished Engineer, Capital One

Moorthy is a distinguished engineer at Capital One's Machine Learning Experience organization. His areas of focus include Site Reliability, Platform Engineering and Workflow Orchestration. He lives in New Jersey with his wife and two young children.

Platform Engineering

Content Experience Level Intermediate

16:15 BST

Orchestrating AI Models in Kubernetes: Deploying Ollama as a Native Container Runtime - Samuel Veloso, Cast AI & Lucas Fernández, Red Hat

Wednesday April 2, 2025 16:15 - 16:45 BST

Level 1 | Hall Entrance S10 | Room A

Existing solutions for serving AI models in Kubernetes are often difficult to deploy and manage, with complex workflows and a lack of user-friendly design. This talk introduces a custom container runtime that leverages Ollama as the serving backend, simplifying the deployment and operation of AI models in Kubernetes environments.

A custom container runtime extends the standard container execution workflow by integrating additional capabilities directly into the container lifecycle. Solutions like gVisor and Kata Containers are prominent examples, leveraging this technology to enhance container security by isolating workloads or providing lightweight virtualized environments. In our case, we apply the same principle to AI model serving, enabling native deployment of open-source AI models within Kubernetes.

Speakers

Samuel Veloso

Software Engineer, Cast AI

Samu Veloso is a Software Engineer at Cast AI where he contributes to the future of Kubernetes security.

Lucas Fernández

Senior Software Engineer, Red Hat

I'm a technology fan and I love to explore as many fields as I can, such as Development, Cyber-Security or Artificial Intelligence. You can see what I am up to on lucferbux.dev. Feel free to contact me on my LinkedIn.

AI + ML

Content Experience Level Intermediate

16:15 BST

Unleashing the Power of Init Containers: Reducing Database Management Toil at Yelp - Muhammad Junaid Muzammil, Yelp

Wednesday April 2, 2025 16:15 - 16:45 BST

Level 1 | Hall Entrance S10 | Room D

Init containers are specialized containers that are launched during pod initialization and complete their tasks before the main containers in the pod start. But how do they unleash their potential in real-life situations, particularly when it comes to database management?

At Yelp, we run several Cassandra clusters in production on Kubernetes. Init containers have been instrumental in transforming the operational efficiency of managing these Cassandra clusters, especially during horizontal scaling, upgrades, and restoring clusters from backups. Join us to explore the strategic use of init containers by the Database Reliability Engineering team at Yelp.

Speakers

Muhammad Junaid Muzammil

Tech Lead, Yelp

Muhammad Junaid Muzammil is a Tech Lead in the Database Reliability Engineering team at Yelp. His primary focus is on distributed datastores like Cassandra and Zookeeper, including their interactions and automation. Outside of work, you'd find him playing different games with his...

Data Processing + Storage

Content Experience Level Intermediate

16:15 BST

How the SIG-Multicluster API Specifications Are Used for Real World Multicluster Management - August Simonelli, Red Hat & Ryan Zhang, Microsoft

Wednesday April 2, 2025 16:15 - 16:45 BST

Level 1 | Hall Entrance N10 | Room H

Nearly everyone touches multiple clusters today, often resorting to bespoke management systems. But did you know that Kubernetes SIG-Multicluster has published specifications covering multicluster management which are actively used in production environments today?

This talk will review real-world implementations as demonstrated in the Open Cluster Management project (OCM-io) and KubeFleet (kubernetes-fleet.io).

We'll begin with an overview of key Multicluster API concepts from SIG-Multicluster exploring how the upcoming ClusterProfile API provides a standard way to represent clusters. We'll demo how OCM-io and KubeFleet use some of these APIs, such as the Work API for workload placement across clusters and the Multicluster Services API for managing endpoints and traffic policies.

If you manage – or plan to manage – multiple Kubernetes clusters across public and private clouds, please join us to learn how these specifications can improve your multi-cluster management experience.

Speakers

August Simonelli

Principal Product Manager, Red Hat

August Simonelli is a Principal Product Manager at Red Hat. He has worked with customers around the world to help them adopt, use, improve, and implement open source technologies. Raised in Boulder, Colorado, August now lives in Sydney, Australia and is a strong advocate for using...

Ryan Zhang

Principal Software Engineering Manager, Microsoft

Dr. Ryan Zhang is a Principal Software Engineering Manager at Microsoft, working on the Azure Kubernetes Service team. Ryan has been working on Cloud Native open source projects for the past few years including CloudEvents, Open Application Model (OAM) and multi-cluster related initiati...

Platform Engineering

Content Experience Level Intermediate

16:15 BST

Tutorial: Build, Operate, and Use a Multi-Tenant AI Cluster Based Entirely on Open Source - Claudia Misale, IBM Research; Olivier Tardieu & David Grove, IBM

Wednesday April 2, 2025 16:15 - 17:30 BST

Level 1 | Hall Entrance N11

With GPUs being scarce and costly, multi-tenant Kubernetes clusters that can queue and prioritize complex, heterogeneous AI/ML workloads while achieving both high utilization and fair sharing are a necessity for many organizations. This tutorial will teach the audience how to build, operate, and use an AI cluster. Starting from either a managed or on-premises Kubernetes cluster, we will demonstrate how to install and configure a number of open source projects (and only open source projects) such as Kueue, Kubeflow, PyTorch, Ray, vLLM, and Autopilot to support the full AI model lifecycle (from data preprocessing to LLM training and inference), configure teams and quotas, monitor GPUs, and to a large degree automate fault detection and recovery. By the end of the tutorial, participants will have a thorough understanding of the AI software stack refined by IBM Research over several years to effectively manage and utilize thousands of GPUs. Come to learn the recipe and try it at home!

Speakers

David Grove

Distinguished Research Scientist, IBM Research

David Grove is a Distinguished Research Scientist at IBM T.J. Watson, NY, USA. He has been a software systems researcher at IBM since 1998, specializing in programming language implementation and scalable runtime systems. His current research focuses on cloud-related technologies...

Olivier Tardieu

Principal Research Scientist, Manager, IBM Research

Dr. Olivier Tardieu is a Principal Research Scientist and Manager at IBM T.J. Watson, NY, USA. He joined IBM Research in 2007. His current research focuses on cloud-related technologies, including Serverless Computing and Kubernetes, as well as their application to Machine Learning...

Claudia Misale

Staff Research Scientist, IBM Research

Claudia Misale is a Staff Research Scientist in the Hybrid Cloud Infrastructure Software group at IBM T.J. Watson Research Center (NY). Her research is focused on Kubernetes and targets monitoring, observability and scheduling for HPC and AI training workloads. She is mainly interested...

Tutorials, AI + ML

Content Experience Level Intermediate

17:00 BST

Optimizing Training Performance for Large Language Models (LLMs) in Kubernetes - William Wang, Huawei Cloud Technologies Co., LTD & Peng Gu, Tech Startup

Wednesday April 2, 2025 17:00 - 17:30 BST

Level 1 | Hall Entrance S10 | Room A

Large Language Models are increasing in popularity, and training performance in Kubernetes at scale has become one of the biggest challenges for enterprises. How can you achieve optimal performance and linearity for a huge training job, such as one using 100k GPUs? What are the three most critical factors that affect performance? How can you optimize performance step by step?

In this talk, we will present an end-to-end analysis of the bottlenecks of LLM training in Kubernetes at scale, then show how insufficient resource management and network topology awareness in Kubernetes affect performance. Finally, we will introduce the new resource management model, LLM-dedicated training workload, and scheduling solution initiated in the Volcano open source community, and demonstrate how to use them to achieve optimal performance and linearity.

Speakers

Peng Gu

Software Architect, Tech Startup

Peng Gu holds a PhD degree in Computer Engineering from the University of Central Florida, specializing in high-performance computing. As a tech lead and cloud software architect at an AI infrastructure startup, he designs scalable, cutting-edge solutions to support highly demanding...

William Wang (Leibo Wang)

Senior software engineer, Nvidia

Cloud native architect, open-source enthusiast, technical lead and maintainer of CNCF Volcano, software developer with a decade of experience in diverse domains including cloud native technology, large-scale cluster resource management, batch scheduling, BigData, and AI acceleration...

AI + ML

Content Experience Level Intermediate

17:00 BST

Uncharted Waters: Dynamic Resource Allocation for Networking - Miguel Duarte Barroso, Red Hat & Lionel Jouin, Ericsson Software Technology

Wednesday April 2, 2025 17:00 - 17:30 BST

Level 0 | ICC Capital Hall | Room 1

In last year’s naval engagement, the multi-network fleet launched a bold assault on Kubernetes SIG-Network’s defenses, led by the flagship proposal, the USS Pod Spec Modification. But under heavy fire from SIG-Network’s coastal batteries, the mission was repelled, leaving both sides to regroup and rethink their strategies.

Now, as the fog of war clears, the fleet has charted a new course. Instead of another frontal assault on the Pod spec stronghold, the focus shifts to the versatile and Kubernetes-native waters of Dynamic Resource Allocation (DRA). This tactical pivot could outflank SIG-Network’s defenses, introducing the DRA CNI Driver and a new era for Kubernetes networking.

Join us to explore how DRA reshapes networking in Kubernetes, what it means for your clusters, and how you can help steer this upstream effort. From strategy to implementation, we’ll unpack what’s next in the ongoing naval battle of Kubernetes networking.

Speakers

Miguel Duarte Barroso

Principal Software Engineer, Red Hat

Miguel is a Principal Software Engineer for OpenShift Virtualization at Red Hat. His main interests are SDN / NFV, functional programming, containers, and virtualization. Miguel is a member of the Network Plumbing Working Group, a maintainer of several CNI plugins (whereabouts, macvtap...

Lionel Jouin

Software Engineer, Ericsson Software Technology

Lionel Jouin is a Software Engineer at Ericsson Software Technology, based in Stockholm, Sweden. He actively contributes to Kubernetes with a focus on bringing native support for secondary networks and its ecosystem including services and policies…. His contributions span SIG Network...

Connectivity

Content Experience Level Intermediate

17:00 BST

An Exemplary Path: Leveraging eBPF and OpenTelemetry To Auto-instrument for Exemplars - Charlie Le & Kruthika Prasanna Simha, Apple

Wednesday April 2, 2025 17:00 - 17:30 BST

Level 1 | Hall Entrance N10 | Room E

Have you already adopted eBPF to unlock powerful, dynamic observability at the kernel level? Are you looking to take the next step by integrating exemplars seamlessly into your observability workflows? If so, you’ve likely encountered the challenge of manually instrumenting applications for exemplar support—an approach that’s often tedious, error-prone, and difficult to maintain. But what if you could leverage your existing eBPF setup to automate exemplar creation for your applications without touching your application code?

eBPF's in-kernel aggregation capabilities, paired with OpenTelemetry's flexible observability framework, enable automatic generation of exemplars. We’ll dive into how eBPF dynamically collects metrics and traces, processes them at the source, and works with OpenTelemetry to correlate kernel-level and application-level observability—all with minimal overhead and maximum convenience.

Speakers

Charlie Le

Software Engineer, Apple

Charlie is a software engineer at Apple, specializing in building and scaling cloud native observability solutions and infrastructure. Deeply inspired by the collaborative spirit of open source, he actively contributes to projects like Cortex and OpenTelemetry, shaping the future...

Kruthika Prasanna Simha

Machine Learning Engineer, Apple

Kruthika is a software engineer at Apple specializing in building ML enabled observability solutions. She holds a Masters in Computer Engineering and has specialized in ML. Kruthika is on a mission to identify how the ML and cloud-native worlds converge towards bigger and better ML...

Observability

Content Experience Level Intermediate

17:00 BST

The Next Generation of DaemonSet Autoscaling - Adam Bernot, Google Cloud & Bryan Boreham, Grafana Labs

Wednesday April 2, 2025 17:00 - 17:30 BST

Level 1 | Hall Entrance S10 | Room C

Imagine you have small 4-core nodes and larger 64-core nodes in the same cluster, and a DaemonSet that does much more work on the larger nodes. How do you set resource requests and limits appropriately?

Managing resources for workloads deployed as a DaemonSet in Kubernetes can be challenging when load is not evenly distributed across nodes. Static allocation can cause over/under-utilization and scheduling issues. VPA helps, but currently assumes uniform load across all pods, which is a bad assumption for certain types of workloads.

We will discuss our case studies, why this feature will be useful, and how our prototype implements per-pod VPA for DaemonSets to improve resource efficiency and stability and to eliminate the need for manual tuning. This is your chance to learn about this upcoming feature and connect with the people who are implementing it!

Speakers

Bryan Boreham

Distinguished Engineer, Grafana Labs

Bryan Boreham is a Distinguished Engineer at Grafana Labs, working on highly scalable storage for metrics, logs and traces. Bryan's career has ranged from charting pie sales at a bakery to real-time pricing of billion-dollar bond trades. A contributor to many Open Source projects...

Adam Bernot

Software Engineer, Google Cloud

Adam Bernot is a software engineer and Kubernetes enthusiast who works on scaling the Google Cloud Managed Service for Prometheus.

Operations + Performance

Content Experience Level Intermediate

17:45 BST

More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling With Dynamic Resource Allocation - John Belamaric & Yash Sonthalia, Google

Wednesday April 2, 2025 17:45 - 18:15 BST

Level 1 | Hall Entrance S10 | Room A

Big training jobs and multi-host inference need a lot of nodes and accelerators. More nodes and accelerators mean more chances for failures. How can we be sure to have enough working GPUs for our job? How can we utilize the healthy portions of a 16x16 TPU cluster if one node fails? Simple node labels won’t cut it.

DRA is beta in Kubernetes 1.32. Usually, it’s used for managing individual devices on a node. But did you know that DRA supports modeling resources that are accessible across many nodes? This powerful abstraction can model clusters of nodes and devices. Combining it with the alpha partitionable device model in 1.33, we can correctly model complex multi-host, multi-accelerator topologies, and schedule workloads to them as a unit! This is a real game changer for AI/ML workloads on K8s.

Come learn about these current and upcoming technologies, and how the K8s community is applying them to massive compute clusters like the NVIDIA GB200 and ultra powerful multi-host TPU slices.

Speakers

John Belamaric

Senior Staff Software Engineer, Google

John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation...

Yash Sonthalia

Staff Software Engineer, Google

7 years of experience working as a software engineer at Google. Tech Lead for TPUs/GPUs in GKE AI.

AI + ML

Content Experience Level Intermediate

17:45 BST

Making the Leap: What Gateway API Needs To Support Ingress-NGINX Users - Rob Scott, Google & James Strong, Isovalent at Cisco

Wednesday April 2, 2025 17:45 - 18:15 BST

Level 0 | ICC Capital Hall | Room 1

Ingress-NGINX has been the cornerstone of Kubernetes Ingress for years. As the maintainers transition to a new Gateway API-focused implementation, we face a critical question - how can we provide a seamless migration to Gateway API? What about the Ingress-NGINX features that Gateway API doesn’t support yet? To ensure a smooth transition to Gateway API, the ecosystem must address these gaps - and your input is essential.

In this talk, Rob and James will explore the critical challenges of migrating from Ingress to Gateway. They’ll highlight commonly used Ingress-NGINX features that are not yet supported in Gateway API and discuss how the community can drive the evolution of Gateway API to meet the needs of Ingress-NGINX users.

This session will provide insights into what’s needed to make Gateway API a true successor for Ingress-NGINX users, focusing on collaboration and feedback. Join us in shaping the future of ingress networking in Kubernetes.

Speakers

James Strong

Solution Architect, Isovalent at Cisco

James has been working in the cloud for 7 years. He helped build a private cloud at GE Appliances and developed and supported REST APIs in AWS on Docker. Recently, he passed the CNCF's CKA exam and helps companies migrate their applications to Kubernetes.

Rob Scott

Staff Software Engineer, Google

Rob is an open source enthusiast currently working on Kubernetes Networking at Google. He's been a maintainer of Gateway API since the very early days of the project and led the development of other Kubernetes networking APIs like EndpointSlices.

Connectivity

Content Experience Level Intermediate

17:45 BST

Flink on Karmada: Building Resilient Data Pipelines on Multi-Cluster K8s - Michas Szacillo, Bloomberg & Hongcai Ren, Huawei

Wednesday April 2, 2025 17:45 - 18:15 BST

Level 1 | Hall Entrance S10 | Room D

Karmada is an increasingly popular open source tool for deploying and managing cloud-native applications across Kubernetes clusters. It can also be used to boost workload resiliency with its existing failover support. But what happens if we need to conserve state?

Within the context of data processing (e.g., Apache Flink or Apache Spark), the state is often critical to making sure workloads are able to gracefully resume in the event of a disruption. In collaboration with the Karmada community, the Bloomberg Streaming Analytics team has worked to bridge this gap in Karmada’s existing failover features.

During this talk, we’ll use a real-life Flink on Karmada use case to discuss:
- The complexities related to intelligently scheduling stateful workloads, improving resiliency, and ensuring state consistency during failover on multi-cluster K8s
- The open source enhancements to Karmada to manage these challenges
- How to leverage Karmada to support other stateful use cases!

Speakers

Michas Szacillo

Tech Lead, Bloomberg

Michas is a senior software engineer and tech lead on Bloomberg’s Streaming Analytics engineering team. The platform, which is running on Kubernetes, serves as the foundation for many of Bloomberg's data streaming use cases. Michas is also a frequent collaborator to the CNCF community...

Hongcai Ren

Senior Software Engineer, Huawei

Hongcai Ren (@RainbowMango) is a CNCF Ambassador who has been working on Kubernetes and other CNCF projects since 2019 and is a maintainer of the Kubernetes and Karmada projects.

Data Processing + Storage

Content Experience Level Intermediate

17:45 BST

How Green Is My OpenTelemetry Collector? - Nancy Chauhan, Student & Adriana Villela, Dynatrace

Wednesday April 2, 2025 17:45 - 18:15 BST

Level 1 | Hall Entrance N10 | Room E

We live in a world heavily dependent on technology, and this comes at an environmental cost. For example, data centres consume 2% of global power. As our systems become more complex, that power consumption will only increase. We strive to understand our systems through Observability, and yet the very telemetry that our systems emit, and that our favorite Observability backends ingest, contributes to an increasing global tech carbon footprint.

How can we mitigate this? One way is via the Kepler project. The Kepler Exporter exposes statistics, including power consumption metrics, from an application running in a Kubernetes (k8s) cluster.

In this talk, attendees will learn about:
- Kepler - what it is and what it does
- How to deploy Kepler
- Demo showing Kepler tweaking the power consumption of OTel Collectors in k8s

Attendees will walk away with an understanding of how to deploy greener Collectors, thereby reducing power consumption and costs.
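
As a rough illustration of the power consumption metrics mentioned above, the sketch below turns cumulative energy readings into an average power figure. It assumes energy is reported as a monotonically increasing counter in joules, as counter-style metrics usually are; the function name and the sample values are hypothetical and not taken from the talk or from Kepler's API.

```python
from typing import List, Tuple


def average_power_watts(samples: List[Tuple[float, float]]) -> float:
    """Average power in watts from (timestamp_seconds, cumulative_joules) samples.

    Power is energy divided by time, so the average over the window is
    (last_joules - first_joules) / (last_time - first_time).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, j0), (t1, j1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be increasing")
    return (j1 - j0) / (t1 - t0)


# Hypothetical readings for one Collector pod, taken 30 seconds apart.
collector_samples = [(0.0, 1200.0), (30.0, 1350.0), (60.0, 1500.0)]
print(average_power_watts(collector_samples))
```

Comparing this figure before and after a Collector configuration change is one simple way to judge whether the change made the Collector "greener".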

Speakers

Adriana Villela

Principal Developer Advocate, Dynatrace

Adriana Villela is a Principal Developer Advocate, helping companies achieve reliability greatness through Observability, SRE, & DevOps practices. Previously, she managed a Platform Engineering team & an Observability Practices team at Tucows. Adriana has worked at various large-scale...

Nancy Chauhan

Student, Cornell University

I like hacking through software engineering problems. I have been developing solutions for software reliability and also like to break complicated concepts into easier tech content (blogs and videos). I have also worked in Dev Advocacy, amid the crossover of two things I like the most...

Observability

Content Experience Level Intermediate

17:45 BST

Scale Smarter Not Harder: How Extending Cluster Autoscaler Saves Millions - Rahul Rangith & Ben Hinthorne, Datadog

Wednesday April 2, 2025 17:45 - 18:15 BST

Level 0 | ICC Capital Hall | Room 2

“I need 100 instances with 32 CPUs and 128GB of memory each, with remote storage and up to 10GB/s of network bandwidth, and I need them now!” At Datadog, we make scaling requests like this thousands of times a day, across dozens of clusters in multiple cloud providers. At this scale, and with so many machine specifications to choose from, we realized the importance of asking the question: how do I select the best instance type in every environment?

Join us to learn how answering this question with every scale-up decision significantly reduces our cloud costs. We’ll discuss the tools we use to score instance types, and strategies to plug these recommendations into the Kubernetes Cluster Autoscaler via its gRPC expander. Whether you’re operating a single cluster or a massive Kubernetes platform, this talk will teach you how to upgrade your infrastructure to make informed instance type selections that minimize your cloud spend.
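
To make the idea of scoring instance types concrete, here is a minimal sketch that filters candidates by the requested size and picks the cheapest per CPU. It is not Datadog's scoring logic and does not implement the Cluster Autoscaler gRPC expander interface; the `InstanceType` class, the catalog entries, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpus: int
    memory_gib: int
    hourly_cost: float  # hypothetical price per instance-hour


def cost_per_cpu(option: InstanceType) -> float:
    """Score an instance type: a lower cost per CPU is better."""
    return option.hourly_cost / option.cpus


def pick_best(
    options: List[InstanceType], min_cpus: int, min_memory_gib: int
) -> Optional[InstanceType]:
    """Return the cheapest-per-CPU option that satisfies the request, if any."""
    eligible = [
        o for o in options if o.cpus >= min_cpus and o.memory_gib >= min_memory_gib
    ]
    if not eligible:
        return None
    return min(eligible, key=cost_per_cpu)


# Hypothetical catalog; names and prices are made up for illustration.
catalog = [
    InstanceType("type-a", cpus=32, memory_gib=128, hourly_cost=1.60),
    InstanceType("type-b", cpus=32, memory_gib=128, hourly_cost=1.28),
    InstanceType("type-c", cpus=16, memory_gib=64, hourly_cost=0.50),
    InstanceType("type-d", cpus=64, memory_gib=256, hourly_cost=2.88),
]
best = pick_best(catalog, min_cpus=32, min_memory_gib=128)
print(best.name if best else "no eligible instance type")
```

A real scorer would likely weigh more than price, for example availability or network bandwidth, but the filter-then-rank shape stays the same.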

Speakers

Rahul Rangith

Software Engineer, Datadog

Rahul Rangith has worked at Datadog since 2022 after graduating from the University of Waterloo. He works on Datadog’s Compute team which is responsible for the company’s Kubernetes platform. On the team, he focuses on node management and autoscaling. Rahul is active in the Kubernetes...

Ben Hinthorne

Software Engineer, Datadog

Ben Hinthorne joined Datadog’s Compute Team in 2021, which is responsible for building and scaling their Kubernetes platform. Recently, he has focused on the autoscaling ecosystem, working to optimize application performance, infrastructure cost, and resiliency through opinionated...

Platform Engineering

Content Experience Level Intermediate

17:45 BST

Securing AI Workloads: Building Zero-Trust Architecture for LLM Applications - Rohit Ghumare, Taikun & Joinal Ahmed, NTG

Wednesday April 2, 2025 17:45 - 18:15 BST

Level 0 | ICC Auditorium

As businesses increasingly rely on LLM applications for critical functions, strong security measures become essential to protect sensitive information and keep operations running smoothly. This session shows how to build a zero-trust security architecture for AI workloads using cloud native patterns. We'll explore how to implement AI Gateways with strong authentication, authorization, and audit logging. You'll also learn how to meet compliance and governance requirements while securing model artifacts, implementing runtime security, and protecting against prompt injection attacks.

Speakers

Joinal Ahmed

Head of AI, NTG

Joinal is an experienced Data Science professional with an interest in building solutions with quick prototypes, community engagements and influencing technology adoption.

Rohit Ghumare

DevRel As Service, Founder

As a Google Developer Expert specializing in Google Cloud, I am a passionate DevOps Advocate and a dedicated Community Evangelist. I lead and nurture multiple communities across diverse platforms, fostering DevOps and Developer Relations awareness. My commitment to the open-source...

Security

Content Experience Level Intermediate

11:00 BST

Debugging Envoy Tunnels: A Deep Dive - Carlos Sanchez & Alexandra Stoica, Adobe

Thursday April 3, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance N10 | Room E

Envoy is a powerful proxy for modern microservices architectures that can securely connect services using encryption and mutual authentication with certificates. However, when Envoy tunnels don't work as expected, troubleshooting can become a complex and time-consuming task.

At Adobe, we use Envoy to connect pods running in Kubernetes with customer-dedicated infrastructure, such as on-premise services and databases. This setup allows different pods to have their own dedicated egress IP, or to connect from pods to multiple customer on-premise services using VPN. This relies heavily on Envoy tunnels and mTLS, and we've encountered numerous situations where things can and do go wrong.

Join us as we challenge you through a series of interactive demos to solve various cases of tunnel failures. Are you ready to crack the case and become an Envoy troubleshooting expert?

Speakers

Carlos Sanchez

Principal Scientist, Adobe

Carlos Sanchez is a Principal Scientist at Adobe Experience Manager, specializing in software automation, from build tools to Continuous Delivery and Progressive Delivery. Involved in Open Source for over 20 years, he is the author of the Jenkins Kubernetes plugin and a member of...

Alexandra Stoica

Site Reliability Engineer, Adobe

Alexandra Stoica is a Site Reliability Engineer at Adobe, specializing in cloud infrastructure, automation, and continuous delivery. With extensive experience in building and maintaining Kubernetes Operators, Alexandra has developed tools to automate networking infrastructure provisioning...

Connectivity

Content Experience Level Intermediate

11:00 BST

A Cloud Native Workflow for Hardware-in-the-Loop Software Development - Miguel Angel Ajo, Red Hat

Thursday April 3, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance S10 | Room D

Does your organization build firmware for hardware devices on Kubernetes? Do you still test firmware on hardware manually? Jumpstarter, an open-source project started by Red Hat, connects your software factory to your hardware, modernizing embedded software development. Developed in collaboration with a leading automotive manufacturer, Jumpstarter bridges the gap between embedded and cloud-native workflows.

This session demonstrates how to automate software testing on physical devices within Kubernetes using Tekton Pipelines and GitLab, leasing devices for tasks like flashing firmware, booting, and interfacing through serial, CAN bus, audio, and video. Eclipse Che will also be showcased for developing and debugging tests.

The presentation will include a live demo and will share deployment instructions, workflow examples, and real-world use cases from Red Hat and other community projects.

Speakers

Miguel Angel Ajo Pelayo

Senior Principal Software Engineer, Red Hat

Miguel has been an upstream contributor to open-source projects throughout his career at Red Hat. He has always been interested in hardware and the low-level details of how technology works. Before joining Red Hat, he ran a small consulting startup that developed embedded systems...

Emerging + Advanced

Content Experience Level Intermediate

11:00 BST

OTel Sucks (But Also Rocks!) - Juraci Paixão Kröhling, OllyGarden & Daniel Dyla, Dynatrace

Thursday April 3, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance N10 | Room G

OpenTelemetry (OTel) has become a cornerstone of observability, but the journey hasn’t been without challenges. Inspired by the famous "Linux Sucks" format, this talk explores OTel’s pain points and highlights its successes.

We’ll cover:
* SDK Configuration: Once complex for simple scenarios, now simplified by the Config SIG.
* Collector Challenges: Tail-sampling woes and Prometheus performance issues, balanced by OTel’s ability to handle multiple signals in one binary with great performance using OTLP.
* Semantic Conventions: Painful changes, like in HTTP conventions, but with long-term benefits through unified standards.

Featuring real-world user insights, this session delivers a brutally honest yet optimistic take on OTel’s evolution. Perfect for anyone navigating OpenTelemetry’s complexities or celebrating its strengths.

Speakers

Daniel Dyla

Senior Open Source Architect / OpenTelemetry Maintainer, Dynatrace

Daniel is a Senior Architect with 9 years of experience in observability. Daniel is a member of the W3C Distributed Tracing WG, maintainer of OpenTelemetry JS, former OTel Governance Committee member, and OTel specification sponsor, in addition to working on many other areas of the...

Juraci Paixão Kröhling

Software Engineer, OllyGarden

Juraci Paixão Kröhling is a software engineer, a maintainer of the OpenTelemetry project, a member of the project's governing board and CNCF Ambassador. He has presented about distributed tracing, OpenTelemetry, and other related topics at conferences like KubeCon, OpenSource Summit...

Observability

Content Experience Level Intermediate

11:00 BST

Extending Kubernetes Resource Model (KRM) Beyond Kubernetes Workloads - Mangirdas Judeikis, Cast AI & Nabarun Pal, Broadcom

Thursday April 3, 2025 11:00 - 11:30 BST

Level 0 | ICC Capital Hall | Room 2

Writing consistent APIs is hard. The Kubernetes Resource Model (KRM) is the foundation of Kubernetes’ success because it is consistent, predictable, and easy to understand, and it provides a declarative approach to managing infrastructure and applications. But what if KRM could transcend Kubernetes itself?

This talk will explore the paradigm shift of using KRM with kcp or Kubernetes Generic control plane to provide more than just workload management. This is not a new concept: Crossplane and many other tools are already doing this. But what if we could take this further? What if each cloud API looked and felt like the Kubernetes API? We will extensively cover how “kcp + friends” in the CNCF ecosystem fulfill that vision.

At the end of the talk, the audience will walk away with knowledge of KRM++ and of approaches to building a scalable multi-tenant control plane for managing resources in their multi-cluster Kubernetes-based infrastructure, possibly across hybrid cloud.

Speakers

MJ / Mangirdas Judeikis

Staff Engineer, kcp maintainer, Cast AI

Nabarun Pal

Principal Software Engineer, Broadcom

Platform Engineering

Content Experience Level Intermediate

11:45 BST

Beyond the Limits: Scaling Kubernetes Controllers Horizontally - Tim Ebert, STACKIT

Thursday April 3, 2025 11:45 - 12:15 BST

Level 1 | Hall Entrance S10 | Room D

Do your Kubernetes controllers struggle to keep up with the demands of your growing infrastructure? As your clusters scale, traditional controller setups face increasing challenges, leading to slow reconciliation times, impacting application performance and overall cluster stability.

This session introduces sharding for Kubernetes controllers as a groundbreaking solution. By horizontally scaling controller workloads across multiple instances, it significantly improves scalability and addresses the inherent limitations of traditional leader election mechanisms.

In this session, we'll dive deep into the technical details of applying proven sharding mechanisms from distributed databases to effectively partition controller workloads. We'll explore the underlying concepts and how to implement sharding in your own Kubernetes controllers.
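
The abstract does not name the sharding mechanism, so the sketch below assumes consistent hashing, a technique commonly used by distributed databases, purely as an illustration. The `HashRing` class, the instance names, and the object keys are hypothetical and are not taken from the speaker's implementation.

```python
import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Assigns each object key to exactly one controller instance."""

    def __init__(self, instances: List[str], virtual_nodes: int = 100) -> None:
        if not instances:
            raise ValueError("at least one controller instance is required")
        # Each instance gets many virtual nodes so that keys spread evenly.
        self._ring = sorted(
            (_hash(f"{instance}#{i}"), instance)
            for instance in instances
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, object_key: str) -> str:
        """Return the instance that owns the first ring point after the key."""
        index = bisect.bisect_right(self._points, _hash(object_key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["controller-0", "controller-1", "controller-2"])
print(ring.owner("default/my-deployment"))
```

The useful property is that removing an instance only reassigns the objects that instance owned; every other object keeps its owner, which limits how much work moves when the set of controller instances changes.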

Join us to learn how to overcome the scalability challenges of your Kubernetes controllers and unlock the full potential of your infrastructure.

Speakers

Tim Ebert

Cloud Engineer, STACKIT

Tim loves designing, developing, and operating cloud native systems at STACKIT. He is knee-deep in managing infrastructure and Kubernetes clusters themselves using Kubernetes operators. Tim is a core developer of Gardener, an open source project for managing Kubernetes clusters at...

Emerging + Advanced

Content Experience Level Intermediate

11:45 BST

Dancing With the Pods: Live Migration of a Database Fleet While Serving Millions of Queries - Jayme Bird & Manish Gill, ClickHouse

Thursday April 3, 2025 11:45 - 12:15 BST

Level 1 | Hall Entrance S10 | Room A

At ClickHouse, we recently changed the way we orchestrate databases provisioned by customers, specifically the way we use StatefulSets. There was just one big problem: we wanted to migrate our legacy fleet of thousands of services from the old orchestration code-path to the new one without any downtime - even the queries should continue to run as they are.

If there is one thing that people hate doing - it is migrations. They are painful, have lots of corner cases, and take a long time. In our case, it took us almost 6 months to migrate the entire fleet. But we encountered lots of interesting challenges along the way. This talk will walk you through these challenges of live migrating the entire ClickHouse Cloud Fleet's orchestration while continuing to serve customer queries and ingest. The story involves our Operator, deep-dive into StatefulSets, a custom migration controller, durable execution workflows, and many, many database synchronisation challenges.

Speakers

Manish Gill

Engineering Manager, ClickHouse Inc

Manish Gill works at ClickHouse Inc, where he is managing the AutoScaling team for ClickHouse Cloud. He is based out of Berlin and is deeply interested in Databases and Cloud challenges and still considers himself new to Kubernetes. In a past life, he worked in an ML research team...

Jayme Bird

Senior Software Engineer, ClickHouse

Jayme Bird is a Senior Software Engineer at ClickHouse Inc, working on the development of horizontal and vertical autoscaling solutions for ClickHouse Cloud, a stateful analytics DBaaS running on Kubernetes.

Operations + Performance

Content Experience Level Intermediate

11:45 BST

IAM, Agent: Identity for Autonomous AI - Matthew Bates, Cofide

Thursday April 3, 2025 11:45 - 12:15 BST

Level 1 | Hall Entrance S10 | Room C

First there were chatbots, then LLMs and now we're beginning to hear everyone talk about "agents", where multiple AI agents collaborate and execute tasks autonomously. As AI systems evolve toward multi-agent architectures, robust identity and access management (IAM) becomes critical for security. While these share similarities with microservices, AI agents introduce unique challenges around dynamic capabilities, trust and the interplay between human and agent identities.

This talk explores applying zero trust principles to AI agent workloads using CNCF projects like SPIFFE/SPIRE and emerging IETF standards (WIMSE). We'll explore dynamic identity provisioning, agent-to-agent authentication, and cryptographic attestation. Through hands-on demonstrations, you'll learn how to implement secure, standards-compliant identity management in your multi-agent AI systems, addressing both familiar distributed systems challenges and novel security considerations.

Speakers

Matthew Bates

Founder, Cofide

Matt is the founder of Cofide, a startup focused on workload identity and access management. He was previously co-founder and CTO of Jetstack, the company behind cert-manager. Since the launch, he has contributed widely to the Kubernetes project, both to the technology and to the...

Security

Content Experience Level Intermediate

13:15 BST

🪧 Poster Session: From Pods To Petabytes: Managing Data Objects as Kubernetes Resources - Sebastian Beyvers & Jannis Hochmuth, Giessen University

Thursday April 3, 2025 13:15 - 14:15 BST

Level 1 | Hall Entrances S8 - S9, N8 - N9

As ML and data-intensive applications expand across industries, organizations face growing pressure to integrate more external and internal data sources into their data and compute ecosystems. This raises a crucial question: How do you integrate data lifecycle management in a distributed environment like Kubernetes? It turns out, there are striking parallels between orchestrating containerized applications in Kubernetes and managing datasets across various locations. From lifecycle management to replication to placement strategies, by applying Kubernetes' proven orchestration concepts to data, it is possible to deliver consistent, efficient, and scalable “data orchestration”, which can be a powerful tool for streamlining data-driven applications – all using familiar K8s interfaces. This presentation explores the benefits of rethinking distributed data management with Kubernetes-inspired strategies and showcases a prototypical data orchestration implementation.

Speakers

Sebastian Beyvers

Distributed Systems Researcher, Giessen University

Sebastian Beyvers is a distributed systems researcher in bioinformatics and a cloud-native Rust developer at Giessen University. Sebastian's current work focuses on cloud-native data storage and processing solutions that try to harmonize existing national and international data ecosystems...

Jannis Hochmuth

Data Management Enthusiast, Giessen University

Jannis Hochmuth is a research assistant at Giessen University with a strong interest in scientific data management, particularly within distributed systems. Currently engaged in the NFDI initiative, his work centers on harmonizing data ecosystems at a national level, advancing collaborative...

🪧 Poster Sessions, Data Processing + Storage

Content Experience Level Intermediate

13:15 BST

🪧 Poster Session: GitOps Reinvented: Leveraging Imperative Tools for CAPI Clusters Management - Damien Dassieu, Independent

Thursday April 3, 2025 13:15 - 14:15 BST

Level 1 | Hall Entrances S8 - S9, N8 - N9

The GitOps approach has become a cornerstone of Kubernetes workflows, offering a declarative way to manage infrastructure and applications. However, managing infrastructure like Kubernetes clusters with GitOps presents challenges. For instance, large and complex CAPI manifests can lead to misconfigurations with unintended consequences.

To address this, platform engineers can use tools like kubectl, oc, or web UIs for an imperative, user-friendly experience. These tools validate inputs before sending requests to the Kubernetes API server, reducing errors.

But how can we integrate GitOps principles while using these tools? This session explores how ArgoCD & Syngit enable GitOps workflows for CAPI cluster management, combining declarative and imperative approaches for better results.

Speakers

Damien Dassieu

Kubernetes platform engineer, Independent

I am an active contributor to Kubernetes projects (Kubebuilder, controller-runtime, ...) with a focus on enabling scalable and efficient cluster management. I worked at Orange, the largest telecom company in France, and as a tech lead I developed a solution to deliver and manage...

🪧 Poster Sessions, Platform Engineering

Content Experience Level Intermediate

13:15 BST

🪧 Poster Session: NOMADIC: How To Build a Flexible and Automated Compute Continuum From a Telco Operator’s Perspective - Xuan Du & Adam Morsman, BT

Thursday April 3, 2025 13:15 - 14:15 BST

Level 1 | Hall Entrances S8 - S9, N8 - N9

Telco cloud infrastructure is challenged on both a horizontal scale, as it extends towards the edge, and a vertical scale, given stringent KPIs from telco workloads. Thus, leveraging a blend of heterogeneous hardware such as multi-arch CPUs, GPUs, and other accelerators, and deploying them at large-scale and highly distributed locations is vital to having the most energy-efficient and cost-effective network.

What technologies from open-source and cloud-native communities can help address these challenges? NOMADIC (Network-oriented Multi-architecture Distributed Infrastructure as Code) is one answer: it applies the declarative approach, DevOps practices, and self-service principles to demonstrate automated lifecycle management of telco cloud.

However, this raises as-yet-unanswered questions about how to advertise heterogeneous resources so that intelligence-driven workload placement can be achieved. A “single pane of glass” could enable this, but what form should its implementation take?

Speakers

Adam Morsman

Research Professional, BT

Adam started his career as an apprentice in the research department at BT, studying for a Digital and Technology Solutions degree with a specialism in Data Analysis from the University of Exeter. Following completion of the apprenticeship in 2022, he began his current role of Research Professional...

Xuan Du

Senior Research Specialist, BT Group

Xuan Du is currently a Senior Research Specialist in the Cloud Infrastructure Centre of Excellence at BT Research in the UK, where he focuses on cloud-native technologies and approaches for building and running telco cloud infrastructure to host telco workloads, including radio access...

🪧 Poster Sessions, Platform Engineering

Content Experience Level Intermediate

14:15 BST

AI Workload Preemption in a Multi-Cluster Scheduling System at Bloomberg - Leon Zhou & Wei-Cheng Lai, Bloomberg

Thursday April 3, 2025 14:15 - 14:45 BST

Level 1 | Hall Entrance S10 | Room B

As Bloomberg’s usage of AI continues to grow rapidly, it is critical to ensure that those workloads with high business impact are prioritized to use the firm’s available GPU resources. As a result, Bloomberg’s Data Science Platform engineering team has implemented Karmada’s Priority and Preemption feature to efficiently manage the sequencing of machine learning (ML) workloads using a multi-cluster scheduling system.

This talk will discuss the challenges of balancing resource allocation between high-priority and lower-priority ML batch jobs, and how Karmada helps ensure that business-critical workloads are not starved of resources during periods of high contention. Attendees will gain practical insights into configuring and managing multi-cluster environments, ensuring timely execution of ML jobs while maintaining cluster efficiency. This session is ideal for Kubernetes administrators and engineers who are managing large-scale ML workloads.

Speakers

Leon Zhou

Software Engineer, Bloomberg LP.

Leon Zhou is a software engineer on the Data Science Platform engineering team at Bloomberg. With prior NLP experience, he is now building ML platforms to facilitate machine learning development. He is interested in ML infrastructure to enable large-scale training and complex pipelines...

Wei-Cheng Lai

Software Engineer, Bloomberg

Wei-Cheng Lai is a software engineer on Bloomberg's Data Science Platform engineering team. He is an open source contributor to Karmada and Kubeflow, and focuses on building ML training platforms on Kubernetes to facilitate training processes, enable large-scale training, and provide...

AI + ML

Content Experience Level Intermediate

14:15 BST

How We Moved Spotify To a Proxyless gRPC Service Mesh - Erik Lindblad & Erica Manno, Spotify

Thursday April 3, 2025 14:15 - 14:45 BST

Level 1 | Hall Entrance N10 | Room E

This talk tells the story of how Spotify transitioned its service network from a decade-old, DNS-based service discovery system to a modern service mesh built on the xDS APIs from the Envoy project. The talk covers the research and design considerations for this new system, and how it takes full advantage of native support in gRPC for both xDS and proxyless load balancing to support Spotify’s scale (2 million Kubernetes pods) without the performance impact of traditional service mesh setups. The audience will learn how this setup was used to build three important mesh capabilities at Spotify: dynamic traffic splitting, a service call graph, and zone-aware routing.
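
As a generic illustration of what traffic splitting means, the sketch below maps each request to a backend in proportion to configured weights. It is not Spotify's implementation and does not use gRPC or xDS; the `choose_backend` function, the backend names, and the weights are hypothetical.

```python
import hashlib
from typing import Dict


def choose_backend(request_id: str, weights: Dict[str, int]) -> str:
    """Deterministically map a request to a backend in proportion to integer weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the request ID into a bucket in the range [0, total).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    # Sort by name so the bucket ranges are stable across calls.
    for backend, weight in sorted(weights.items()):
        cumulative += weight
        if bucket < cumulative:
            return backend
    raise AssertionError("unreachable: bucket is always below the total weight")


# Hypothetical split: send roughly 10% of requests to a canary version.
split = {"stable": 90, "canary": 10}
print(choose_backend("request-12345", split))
```

Because the choice depends only on the request ID and the weights, changing the weights centrally shifts traffic gradually without any per-client state.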

This is a case study, so the talk will also cover operational considerations like safe rollouts using fast fallback mechanisms, and how to use gRPC’s custom load balancer support to do a centrally managed rollout that’s transparent to teams using your platform.

Speakers

Erik Lindblad

Staff Engineer, Spotify

Erik has worked as a Staff Engineer in Spotify's Infrastructure department since 2018, and at Spotify since 2013. He has led work on several major infrastructure projects, like global load balancing, service mesh and cloud cost performance.

Erica Manno

Senior Software Engineer, Spotify

I am a Senior Software Engineer at Spotify based out of Italy, working in Core Infrastructure. I am passionate about distributed systems, reliability at scale and solving infrastructure-related challenges. Prior to Spotify, I worked at Verisign as a tech lead building the registry for...

Connectivity

Content Experience Level Intermediate

14:15 BST

Beyond Security: Leveraging OPA for FinOps in Kubernetes - Sathish Kumar Venkatesan, Royal Bank of Canada

Thursday April 3, 2025 14:15 - 14:45 BST

Level 1 | Hall Entrance S10 | Room A

The Open Policy Agent (OPA) is widely known for enforcing security policies, but its capabilities extend far beyond compliance. This session explores how OPA can be harnessed for FinOps practices in Kubernetes. Learn how to create policies to enforce cost-efficient resource requests, limit the use of high-cost instance types, and ensure workloads adhere to budget constraints. Discover how to integrate OPA with tools like Gatekeeper and OpenCost to provide real-time cost visibility and actionable insights. Through practical examples, attendees will gain the skills to use OPA for both security and cost optimization in Kubernetes environments.
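
OPA policies are written in Rego, so the Python sketch below only mirrors the decision logic of a cost-oriented admission rule: reject containers that omit a CPU request or ask for more than a budgeted ceiling. The function names, the pod manifest, and the 4-core limit are hypothetical and are not taken from the talk.

```python
from typing import Any, Dict, List


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cost_violations(pod: Dict[str, Any], max_cpu_cores: float) -> List[str]:
    """List the containers that break the hypothetical cost policy."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        cpu = container.get("resources", {}).get("requests", {}).get("cpu")
        if cpu is None:
            violations.append(f"{name}: missing CPU request")
        elif parse_cpu(cpu) > max_cpu_cores:
            violations.append(f"{name}: CPU request {cpu} exceeds the allowed maximum")
    return violations


# Hypothetical pod manifest with one compliant and two non-compliant containers.
pod = {
    "spec": {
        "containers": [
            {"name": "api", "resources": {"requests": {"cpu": "500m"}}},
            {"name": "batch", "resources": {"requests": {"cpu": "8"}}},
            {"name": "sidecar"},
        ]
    }
}
for violation in cost_violations(pod, max_cpu_cores=4):
    print(violation)
```

In an actual deployment this kind of check would run as an admission policy, so an oversized request is rejected before it ever consumes cluster capacity.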

Speakers

Sathish Kumar Venkatesan

Principal Cloud Customer Engineer, Royal Bank of Canada

A Kubestronaut with 17 years of IT experience and 8 years in cloud-native technologies. As a Cloud Engineer, DevOps practitioner, and SRE, I focus on extending CNCF projects beyond traditional uses. Currently transforming OPA from security into FinOps, combining KEDA and virtual clusters...

Operations + Performance

Content Experience Level Intermediate

14:15 BST

Simplify Kubernetes Operator Development With a Modular Design Pattern - Mostafa Hadadian & Alexander Lazovik, University of Groningen

Thursday April 3, 2025 14:15 - 14:45 BST

Level 1 | Hall Entrance N10 | Room F

Kubernetes operators automate complex application management. However, building and maintaining them poses significant challenges. Custom Resource Definitions (CRDs) are painful to evolve once established, and controllers’ logic becomes increasingly complex over time. We learned these lessons the hard way over the years, but you don't have to.

We present a design pattern that simplifies Kubernetes operator development by decomposing CRDs into manageable pieces and controllers into more focused microcontrollers. This pattern decouples K8s instructions from controllers' logic by leveraging Helm charts for translating CRD specifications into Kubernetes resources. As a result, our solution reduces code and maintenance complexities, accelerates iteration, and provides an efficient development workflow.

Finally, we share a real-world implementation of our design in the Netherlands' water sector that accelerates AI stream processing application delivery.

Speakers

Mostafa Hadadian

AI/MLOps Innovator | Founder & CEO, University of Groningen | CAIDEL

Alexander Lazovik

Professor in Distributed Systems, University of Groningen

Alexander Lazovik, Professor of Distributed Systems at the University of Groningen since 2009, specializes in AI, optimization in distributed environments, cloud computing, and scalable IT infrastructures. He earned his Ph.D. from the University of Trento in 2006 on the topic of Interaction...

Platform Engineering

Content Experience Level Intermediate

14:15 BST

Mind the Gap: Bridging Supply Chain Policy With Git-less GitOps and GUAC - Michael Lieberman, Kusari & Andrew Martin, ControlPlane

Thursday April 3, 2025 14:15 - 14:45 BST

Level 1 | Hall Entrance S10 | Room C

In a live supply chain attack demo, we demonstrate the latest security features of Flux CD and OpenSSF GUAC together in a hardened, wide-scale production scenario. When the next XZ or log4shell vulnerability lands, see how to assess, respond, and prevent proliferation before or after an attacker gets a foothold in your systems.

See how to defend against an assault on your dependency tree, prevent hostile insiders from escalating their privilege, and lock down your production environment to harden it against future threats.

We use:
- OCI-first Flux CD to remove network routes to Git servers from production
- GUAC to manage dependency inventory and bring signal to the noise of CVE updates
- Timoni to reliably patch, customise, and verify deployments before release
- Flux Autopilot to roll out multi-tenancy lockdown, horizontal and vertical scaling, and persistent storage across fleets of clusters

Speakers

Michael Lieberman

CTO, Kusari

Michael Lieberman is co-founder and CTO of Kusari where he helps build transparency and security in the software supply chain. Michael is an active member of the open-source community, co-creating the GUAC and FRSCA projects and co-leading the CNCF’s Secure Software Factory Reference...

Andrew Martin

CEO, ControlPlane

Andrew has an incisive security engineering ethos gained building and destroying high-traffic web applications. Proficient in systems development, testing, and operations, he is at his happiest profiling and securing every tier of a cloud native system, and has battle-hardened experience...

Security

Content Experience Level Intermediate

15:00 BST

Autonomous AI Agents in Production: Slashing Cloud Cost Root Cause Analysis From Hours To Minutes - Ilya Lyamkin, Spotify

Thursday April 3, 2025 15:00 - 15:30 BST

Level 1 | Hall Entrance S10 | Room B

As cloud infrastructures scale, traditional cost monitoring struggles to identify root causes of spending anomalies. This technical deep-dive shows how autonomous AI agents transformed our cost observability pipeline, reducing root cause analysis time from hours to under 5 minutes. We'll examine the agent architecture including deployment patterns, distributed cost tracing, and automated analysis workflows. Learn how we engineered AI agents to correlate cost signals across cloud services, implemented real-time pattern recognition with ML models, and built resilient feedback loops. Through production examples, we'll share our journey from manual investigation to automated root cause identification, including challenges in scaling agent intelligence. Attendees will gain practical insights into building their own AI-powered cost analysis system that scales with their infrastructure.

Speakers

Ilya Lyamkin

Senior Software Engineer, Spotify

Ilya leads the cost tooling infrastructure team at Spotify, driving cloud optimization initiatives through AI-driven analysis tools. Over the past two years, his team pioneered autonomous cost optimization systems that significantly reduced anomaly detection time. With 8 years of...

Thursday April 3, 2025 15:00 - 15:30 BST
Level 1 | Hall Entrance S10 | Room B

AI + ML

Content Experience Level Intermediate

15:00 BST

Encryption, Identities, and Everything in Between; Building Secure Kubernetes Networks - Lior Lieberman, Google & Igor Velichkovich, Stealth Startup

Thursday April 3, 2025 15:00 - 15:30 BST

Level 1 | Hall Entrance N10 | Room E

As the scale of your clusters grows, so does the complexity of securing your networks. The stakes are high: inadequate encryption or identity management solutions can leave clusters vulnerable to a range of security risks.

In this session, Lior and Igor will explore the landscape of network encryption, AuthN and AuthZ solutions grounded in the principles of defense-in-depth and least privilege. Starting with the current projects in the ecosystem, they’ll highlight the principles and design requirements essential for building resilient, secure networks. The session will then dive into real-world scenarios where you’ll learn security strategies at scale. Finally, they’ll highlight how the community can work together to standardize and simplify encryption and identity management, making security more accessible and robust for all users.

Join us! We’d also love your feedback to help drive the future of Kubernetes network security.

Speakers

Igor Velichkovich

Software Engineering Lead, Stealth Startup

Igor is an engineering lead at a stealth startup focused on accelerated infrastructure and high performance compute. He has worked with sig-api-machinery (CEL) and continues work with various projects of kubernetes-sigs used in accelerated infrastructure environments.

Lior Lieberman

Site Reliability Engineer, Google

Lior is a site reliability engineer at Google working on Google Compute Engine and Cloud Service Mesh. He is a leading maintainer of ingress2gateway and an active contributor to Kubernetes SIG Network, focused on Gateway API.

Thursday April 3, 2025 15:00 - 15:30 BST
Level 1 | Hall Entrance N10 | Room E

Connectivity

Content Experience Level Intermediate

15:00 BST

Dynamic Multi-Cluster Controllers With Controller-runtime - Marvin Beckers, Kubermatic & Stefan Schimanski, Upbound

Thursday April 3, 2025 15:00 - 15:30 BST

Level 1 | Hall Entrance S10 | Room D

controller-runtime is the most popular SDK for writing controllers for individual Kubernetes clusters. But the Kubernetes landscape is changing quickly: multi-cluster is becoming ubiquitous (e.g. through Cluster API), with clusters joining and leaving dynamically. controller-runtime has had no direct support for this, making it hard to write uniform multi-cluster controllers and fracturing the emerging ecosystem.

This talk explores how to build controllers that reconcile resources across a dynamic fleet of Kubernetes clusters. A key change is the ability to plug in a dynamic cluster provider that registers new Kubernetes clusters from a specific source. While implementation internals are briefly discussed, the focus is on a hands-on walkthrough for writing your own cluster provider, event handlers and reconciler functions.

We discuss a simplistic cluster provider implementation for “kind” clusters as an example and extrapolate from that what more complex providers could look like (e.g. for CAPI or kcp).

Speakers

Stefan Schimanski

Senior Principal Engineer, Upbound

Stefan is a Senior Principal Engineer at Upbound working on control planes, Kubernetes, and kcp, and serves as a tech lead in SIG API Machinery. He contributed a major part of the CRD feature set. Stefan is a second-time Google Summer of Code mentor with CNCF, loves to teach and help people to learn...

Marvin Beckers

Team Lead, Kubermatic

Marvin started out as a sysadmin, gradually turned into a software engineer and now works as a Software Engineering Team Lead at Kubermatic. He has always had a passion for effective management of large server fleets, which turned his attention to Kubernetes in 2018. He has been...

Thursday April 3, 2025 15:00 - 15:30 BST
Level 1 | Hall Entrance S10 | Room D

Emerging + Advanced

Content Experience Level Intermediate

15:00 BST

How To Rename Metrics Without Impacting Somebody’s Observability - Bartłomiej Płotka, Google & Arianna Vespri, Independent

Thursday April 3, 2025 15:00 - 15:30 BST

Level 1 | Hall Entrance N10 | Room G

Metrics are a core aspect of modern cloud-native observability and monitoring. With the Prometheus project, it’s easy to create metrics and adopt existing ones from applications or exporters. It's easy to build layers of tools, alerts, dashboards and integrations that depend on specific metrics. Unnoticed, metric names and labels became an API contract between instrumentation and consumers.

However, second-day operations kick in! New standards, naming opinions and software versions force metrics to be changed, causing major downstream breakages. Projects like Kubernetes or OpenTelemetry started frameworks to raise awareness about this problem. Can we do more?

In this talk, Bartek (Prometheus maintainer) and Arianna (Prometheus client_golang maintainer) will explore renaming strategies for Prometheus and OpenTelemetry end users. Finally, they will discuss existing conventions and frameworks for stable metric versioning that could be adopted by the next generation of instrumentation.

Speakers

Bartłomiej Płotka

Sr Software Engineer, Google

Bartek Płotka is a Senior Software Engineer at Google. SWE at heart, with an SRE background, currently working on Cloud Observability. Previously Principal Software Engineer at Red Hat. Author of the "Efficient Go" book with O'Reilly. As the co-founder of the CNCF Thanos project and...

Arianna Vespri

Software Engineer, Self-employed

Arianna Vespri is a Go developer with a background in the music industry. Passionate about monitoring and observability, Arianna is a Prometheus contributor and a maintainer of Prometheus client_golang. Active as an electronic musician for decades under a pseudonym, Arianna is very familiar with...

Thursday April 3, 2025 15:00 - 15:30 BST
Level 1 | Hall Entrance N10 | Room G

Observability

Content Experience Level Intermediate

15:00 BST

Breaking Free From the Cloud: Banking on Self-Hosted Kubernetes - Kārlis Akots Gribulis & Per Hedegaard Christiansen, Saxo Bank

Thursday April 3, 2025 15:00 - 15:30 BST

Level 0 | ICC Capital Hall | Room 2

What drives a global investment bank to transition from a managed cloud Kubernetes service to a self-hosted on-premises solution? While managed Kubernetes in the cloud can simplify deployments, it often comes with significant trade-offs. At Saxo Bank, we made the decision to regain control by shifting to a self-hosted, on-premises Kubernetes platform.

This session will unpack our motivations, such as decreasing costs by 80%, reducing cluster creation time fifteenfold, and improving our CIS benchmark standing by 30%. We’ll dive into the architecture we adopted, the lessons learned from overcoming performance and resilience challenges, and how this change has reshaped our infrastructure, positioning Kubernetes as Saxo Bank’s cornerstone for the future.

Speakers

Per Hedegaard Christiansen

Head of Container Platform Engineering, Saxo Bank

Passionate about container technology and always eager to explore new tech stacks. With extensive experience in Docker, Kubernetes, and microservices, I design and optimize scalable, secure container environments. Constantly learning and embracing cutting-edge tools, I thrive in agile...

Kārlis Akots Gribulis

Senior Container Platform Engineer, Saxo Bank

Kārlis Akots Gribulis has hands-on experience working across various companies in the cloud-native space. Throughout his career, he has been deeply involved in deploying, managing, and optimizing Kubernetes clusters, helping organizations harness the full power of cloud-native technologies...

Thursday April 3, 2025 15:00 - 15:30 BST
Level 0 | ICC Capital Hall | Room 2

Platform Engineering

Content Experience Level Intermediate

15:00 BST

SPIFFE in Practice: Universal Identity for WebAssembly Workloads - Joonas Bergius, Cosmonic & Colin Murphy, Adobe

Thursday April 3, 2025 15:00 - 15:30 BST

Level 1 | Hall Entrance S10 | Room C

Universal Identity (or Workload Identity) is a foundational concept that underpins every secure platform. When implemented well, it provides the platform and security teams the ability to reason about the entities running on their platform and the interactions between them.

SPIFFE has become the industry standard for establishing Identity that can be used to authenticate across all major cloud providers, on various workload platforms and even to an increasing number of third-party services. As SPIFFE adoption across various CNCF projects is growing, WebAssembly workloads present some unique challenges to simply lifting and shifting from what’s been done before.

This talk will cover the journey CNCF wasmCloud underwent in adopting SPIFFE as the foundation for providing Secure Production Identity for the WebAssembly workloads running on the platform. We will share the lessons we learned along the way, from starting out with a concept to bringing it all the way to production.

Speakers

Colin Murphy

Sr Software Engineer, Adobe

Colin Murphy is a senior software engineer on the Adobe Content Authenticity Initiative team. Previous roles include frontend engineer for Adobe Express, head of infrastructure of Adobe Document Cloud microservices, including Adobe Sign and Acrobat Web. He has been responsible for...

Joonas Bergius

Senior Software Engineer, Cosmonic

Joonas Bergius is a veteran of the Cloud Native community, having been part of the Kubernetes ecosystem as a contributor and end-user since the early days (circa 2015) of Kubernetes.

Thursday April 3, 2025 15:00 - 15:30 BST
Level 1 | Hall Entrance S10 | Room C

Security

Content Experience Level Intermediate

16:00 BST

Balancing Cost and Efficiency: Day2 Optimization of Multi-Cluster AI Infrastructure - Kevin Wang, Huawei

Thursday April 3, 2025 16:00 - 16:30 BST

Level 1 | Hall Entrance S10 | Room B

Multi-cluster AI infrastructures have become the norm due to factors such as resource availability, platform scale, high availability, and business resource pool consolidation. However, managing diverse workloads across heterogeneous clusters can be challenging. In this talk, we will share our experiences and lessons learned from deploying Karmada and Volcano in real-world multi-cluster AI environments. We will delve into specific Day2 optimization techniques, including:
1) Configuring scheduling strategies to balance resource utilization and workload priorities.
2) Customizing workload management to accommodate diverse AI workloads with varying requirements.
3) Leveraging topology-aware scheduling to improve the efficiency of AI training and inference tasks.

By sharing concrete examples and results, we will demonstrate how to effectively optimize multi-cluster AI infrastructures to achieve better performance, cost efficiency, and scalability.

Speakers

Kevin Wang

Technical Expert, Lead of CloudNative Open Source, Huawei

Kevin Wang has been an outstanding contributor in the CNCF community since its beginning and is the leader of the cloud native open source team at Huawei. Kevin has contributed critical enhancements to Kubernetes, led the incubation of the KubeEdge, Volcano, and Karmada projects in CNCF...

Thursday April 3, 2025 16:00 - 16:30 BST
Level 1 | Hall Entrance S10 | Room B

AI + ML

Content Experience Level Intermediate

16:00 BST

Defusing the Kubernetes API Performance Minefield - Madhav Jivrajani, UIUC & Marek Siarkowicz, Google

Thursday April 3, 2025 16:00 - 16:30 BST

Level 1 | Hall Entrance S10 | Room A

Kubernetes enables a wide landscape of CNCF projects and organisations to build upon its foundation and extend its functionality through custom controllers. But anyone who has deployed an operator at scale quickly discovers that the Kubernetes API is a performance minefield. Forget to set resourceVersion when listing pods? Your control plane explodes! This talk delves into recent enhancements in Kubernetes designed to defuse this performance minefield. We'll explore the improved storage layer that allows caching more types of requests, effectively halving request latency and reducing the load on etcd. Don't let your cluster fall victim to faulty controllers: join us to learn how these changes mitigate risks, boost performance, and contribute to a more stable and reliable Kubernetes experience. We'll explore how the storage layer improves API responsiveness and predictability, and you'll understand the impact of these changes on scalability, reliability, and overall user experience.
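
The resourceVersion pitfall mentioned in this abstract can be illustrated with a small sketch. This is not material from the talk: the helper name `pod_list_path` is hypothetical, and the code only builds request paths rather than contacting a cluster. It relies on the documented Kubernetes list semantics, where leaving `resourceVersion` unset asks for the most recent data (traditionally a quorum read from etcd), while `resourceVersion=0` accepts any version and can be answered from the API server's watch cache.

```python
from urllib.parse import urlencode


def pod_list_path(namespace, resource_version=None):
    """Build the path of a pod LIST request (hypothetical helper, illustration only)."""
    path = f"/api/v1/namespaces/{namespace}/pods"
    params = {}
    if resource_version is not None:
        # "0" means "any version is fine", which lets the API server
        # answer from its watch cache instead of reading through to etcd.
        params["resourceVersion"] = resource_version
    return f"{path}?{urlencode(params)}" if params else path


# Unset resourceVersion: most-recent semantics, the expensive case at scale.
consistent_list = pod_list_path("default")

# resourceVersion=0: possibly stale data, but cheap for the control plane.
cached_list = pod_list_path("default", resource_version="0")

print(consistent_list)
print(cached_list)
```

The trade-off is staleness: a cached list may lag behind the latest writes, so it suits controllers that reconcile continuously and tolerate slightly old data.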

Speakers

Madhav Jivrajani

Kubernetes Maintainer, UIUC

Madhav is currently working at VMware on upstream Kubernetes. He has been a part of the Kubernetes community for about a year and mainly helps out with SIG-{Contribex, Node, Architecture, API-Machinery}. He was also involved with the structured logging efforts in the Kubernetes project...

Marek Siarkowicz

Senior Software Engineer, Google

Marek is a Software Engineer working at Google on the etcd team. He began his career in local startups where he loved open source and extreme programming. Currently he is an etcd maintainer and an active member of SIG Instrumentation, leading the structured logging effort in Kubernetes. In his...

Thursday April 3, 2025 16:00 - 16:30 BST
Level 1 | Hall Entrance S10 | Room A

Operations + Performance

Content Experience Level Intermediate

16:00 BST

How We Progressively Deliver Changes To Kubernetes Using Canary Deployments and Feature Flags - Bob Walker, Octopus Deploy

Thursday April 3, 2025 16:00 - 16:30 BST

Level 0 | ICC Capital Hall | Room 2

This is the case study of how we changed how we ship software.

With thousands of customers, each in their own Kubernetes container, deploying updates was tough. Off-hours schedules meant it took over 24 hours to push a new version. If something broke, we had to scramble. Canary deployments let us update small groups of customers at a time. We built a tool to stop rollouts fast when issues appeared, limiting the damage.

In the past, new features went to everyone at once. Rolling back wasn't an option. If something failed, it'd leave customers stuck in the mess. Now, using OpenFeature, we hide new functionality behind feature flags. We release features to small groups, gather feedback, and test internally for weeks. If things go wrong, we flip the flag off and move on.

This two-pronged approach lets us avoid risky big-bang releases. We went from deploying every 10 days to every 4, with fewer than 1% high-severity defects. Most of these are resolved before customers notice them.
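
One common way to release to small groups at a time is deterministic percentage bucketing. The sketch below is an assumption about how such cohorts could be chosen, not Octopus Deploy's tooling and not the OpenFeature API; `rollout_bucket`, `in_rollout`, the salt, and the customer IDs are all hypothetical.

```python
import hashlib


def rollout_bucket(customer_id: str, salt: str) -> int:
    """Map a customer to a stable bucket in the range 0..99 (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(customer_id: str, salt: str, percent: int) -> bool:
    """True if the customer falls inside the first `percent` buckets."""
    return rollout_bucket(customer_id, salt) < percent


# Hypothetical usage: widen the canary from 5% to 25% of customers.
customers = [f"customer-{i}" for i in range(1000)]
canary_5 = [c for c in customers if in_rollout(c, "release-flag", 5)]
canary_25 = [c for c in customers if in_rollout(c, "release-flag", 25)]
print(len(canary_5), len(canary_25))
```

Because the bucket depends only on the salt and the customer ID, raising the percentage keeps every earlier customer in the cohort, and setting it to 0 acts as the off switch.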

Speakers

Bob Walker

Field CTO, Octopus Deploy

Bob Walker is a Field CTO at Octopus Deploy. Bob started as a developer in the early days of .NET when web forms were the hottest new thing, and manual deployments were the norm. After one too many five-hour 2 AM Saturday deployments, he searched for any automation to stop that pain...

Thursday April 3, 2025 16:00 - 16:30 BST
Level 0 | ICC Capital Hall | Room 2

Platform Engineering

Content Experience Level Intermediate

16:00 BST

Practical Zombie Hunting for Kubernetes Users - Holly Cummins, Red Hat

Thursday April 3, 2025 16:00 - 16:30 BST

Level 1 | Hall Entrance N10 | Room F

Zombies? Yup, zombies. Zombies are servers which aren’t doing useful work. They’re everywhere, costing money, eating electricity, and belching carbon. And they’re useless! Sadly, the cloud has *not* helped our zombie problem, and even Kubernetes hasn't helped.

One of the reasons zombies don’t get switched off is that no one knows they’re there. So how do we get rid of our pesky zombies? In this talk, Holly will explain the underlying technical and organisational factors that lead to zombies, and introduce a range of real-world zombie-hunting strategies. These include getting to grips with elasticity and utilisation, LightSwitchOps, FinOps, and the eco-monkey (it’s like the chaos monkey, but greener). Technologies covered include absurdly simple scripts, DailyClean, Kruize Autotune, and Backstage.

Speakers

Holly Cummins

Senior Principal Software Engineer, Red Hat

Holly Cummins is a Senior Principal Software Engineer on the Red Hat Quarkus team. Before joining Red Hat, Holly was a long-time IBMer, in a range of roles from cloud consultant, full-stack JavaScript developer, WebSphere Liberty DevOps architect, JVM performance engineer, to innovation...

Thursday April 3, 2025 16:00 - 16:30 BST
Level 1 | Hall Entrance N10 | Room F

Platform Engineering

Content Experience Level Intermediate

16:00 BST

Tutorial: Unlock the Future of Kubernetes and Accelerators (and all Specialized Hardware) with DRA - Rey Lejano, Red Hat

Thursday April 3, 2025 16:00 - 17:15 BST

Level 1 | Hall Entrance N11

At the heart of the AI revolution are GPUs, and the platform that provides access to them is Kubernetes. Workloads have historically accessed GPUs and other devices through the device plugin API, but its features are lacking. The new Dynamic Resource Allocation (DRA) feature helps maximize GPU utilization across workloads with additional capabilities such as controlling device sharing across Pods, using multiple GPU models per node, handling dynamic allocation of multi-instance GPU (MIG), and more. DRA is not limited to GPUs: it applies to any specialized hardware that a Pod may use, including network-attached resources such as edge devices like IP cameras.
DRA is a new way to request resources like GPUs and gives the ability to precisely control how resources are shared between Pods.
This tutorial introduces DRA, reviews the “behind-the-scenes” of DRA in the Kubernetes cluster, and walks through multiple ways to use DRA to request a GPU and a network-attached resource.

Speakers

Rey Lejano

Solutions Architect, CNCF Ambassador, K8s SIG Docs co-chair, Red Hat

Rey Lejano is a Solutions Architect at Red Hat and is the co-chair of Kubernetes SIG Docs. He contributes to Kubernetes SIG Security, Release, & Contributor Experience. He has been a member of seven Kubernetes Release Teams, including serving as the 1.23 Release Lead and 1.25 Emeritus Adviser...

Thursday April 3, 2025 16:00 - 17:15 BST
Level 1 | Hall Entrance N11

Tutorials, AI + ML

Content Experience Level Intermediate

16:45 BST

GPU Sharing at CERN: Cutting the Cake Without Losing a Slice! - Diana Gaponcic, CERN

Thursday April 3, 2025 16:45 - 17:15 BST

Level 1 | Hall Entrance S10 | Room D

GPUs and accelerators are changing traditional High Energy Physics (HEP) deployments while also being the key to enabling efficient machine learning. However, their high cost and increasing demand oblige service managers to look into ways to maximize the HW utilization through sharing. While the existing methods are flexible and easy to use, complex use cases still require building custom components on top of the existing device plugin API.

This talk explores the new, exciting way of allocating and sharing GPUs - using Dynamic Resource Allocation (DRA). We go over the multiple options for GPU scheduling: time sharing, MPS, and MIG. We cover the features and limitations of each option and present extensive benchmark results that helped us assign each of our ML and scientific workloads to the most appropriate layout. Finally, we describe how managing GPUs in a centralized way improves resource utilization across interactive and batch workloads while optimizing costs in the long run.

Speakers

Diana Gaponcic

Computing Engineer, CERN

Diana is a Computing Engineer in the CERN IT department. After an internship at CERN focusing on containerization of ETL applications, she joined the Kubernetes team, working on the GitOps and monitoring infrastructure. Her current focus is on optimizing the usage of GPUs and...

Thursday April 3, 2025 16:45 - 17:15 BST
Level 1 | Hall Entrance S10 | Room D

Emerging + Advanced

Content Experience Level Intermediate

16:45 BST

Observability Pipeline Query Languages: Present and the Future - Jacek Migdal, Quesma

Thursday April 3, 2025 16:45 - 17:15 BST

Level 1 | Hall Entrance N10 | Room G

Many observability products have created their own query languages, starting with Splunk and followed by a parade of incompatible options (Sumo Logic, Coralogix DataPrime, Grafana LogQL, Elastic ES|QL, OpenSearch PPL, to name a few). I’ll admit I’m one of the culprits who contributed to this fragmented landscape. Even PromQL, a well-known open-source option for time-series data, hasn’t reached the universal adoption levels of good old SQL.

Is there a way to untangle this mess and march toward some standardization? In this talk, I’ll dive into a few proposals, including concepts like “pipe SQL” and ideas floating around in CNCF forums, to see if there’s a glimmer of hope for alignment.

Speakers

Jacek Migdal

CEO & Co-founder, Quesma

Jacek started his career as an engineering intern at NVIDIA CUDA and Facebook. He joined the pre-revenue startup Sumo Logic in the San Francisco Bay Area as roughly its 20th employee. He later moved back to Poland and opened an office with 80+ full-time engineers. We optimized gross margins on AWS and...

Thursday April 3, 2025 16:45 - 17:15 BST
Level 1 | Hall Entrance N10 | Room G

Observability

Content Experience Level Intermediate

16:45 BST

Live Migrating Stateful Batch Containers To Decrease Cluster Cost - Chris Battarbee & Ece Kayan, Metoro

Thursday April 3, 2025 16:45 - 17:15 BST

Level 1 | Hall Entrance S10 | Room A

Stateless workloads have long been able to take advantage of cluster compaction and the cost savings of spot instances, but stateful workloads present unique challenges. Unlike stateless applications, stateful workloads can’t easily restart on a new node without losing their critical state, making dynamic optimization much more difficult.

This talk explores how container snapshotting using the Kubelet Checkpoint API enables live migration of stateful workloads. By capturing and restoring the state of running containers, we can now compact stateful workloads onto fewer nodes and even run them on spot instances, cutting costs significantly.

We’ll cover the technical details of analyzing your cluster for consolidation opportunities, snapshotting containers, and migrating them seamlessly using open source tooling.
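
For orientation, the kubelet exposes its checkpoint endpoint as `POST /checkpoint/{namespace}/{pod}/{container}`. The following is a minimal sketch that only builds that URL; the helper name, node host, and pod names are hypothetical, 10250 is the kubelet's default port, and a real call additionally needs kubelet client credentials plus checkpoint support in the node's container runtime. It is not the speakers' tooling.

```python
from urllib.parse import quote

KUBELET_PORT = 10250  # default kubelet HTTPS port


def checkpoint_url(node_host, namespace, pod, container, port=KUBELET_PORT):
    """Build the kubelet checkpoint URL for one container (hypothetical helper)."""
    # Percent-encode each path segment so unusual names cannot break the path.
    parts = [quote(part, safe="") for part in (namespace, pod, container)]
    return f"https://{node_host}:{port}/checkpoint/" + "/".join(parts)


# Hypothetical node and workload names, for illustration only.
url = checkpoint_url("node-1.example.internal", "batch", "etl-worker-0", "worker")
print(url)
```

The request must be sent as a POST to the node that currently runs the pod, which is why the node host is a parameter rather than the API server address.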

Speakers

Chris Battarbee

Software Engineer, Metoro

Chris Battarbee is the founder of Metoro and a former engineer at Palantir, where he wrote software to manage Spark workloads on Kubernetes focussing on efficiency and cost savings.

Ece Kayan

Software Engineer, Metoro

Ece Kayan, co-founder of Metoro, is a former Amazon engineer who focused on improving the resiliency and reliability of Prime Video services.

Thursday April 3, 2025 16:45 - 17:15 BST
Level 1 | Hall Entrance S10 | Room A

Operations + Performance

Content Experience Level Intermediate

16:45 BST

A Journey To Modernizing a Regulated Cloud Control Plane - Pranita Praveen, Macquarie Group Pty Ltd & Steven Borrelli, Upbound

Thursday April 3, 2025 16:45 - 17:15 BST

Level 0 | ICC Capital Hall | Room 1

At Macquarie, we have embarked on a transformative journey to modernize our cloud control plane. Initially designed for a single-cloud environment (AWS) to facilitate our move away from data centers, it is now evolving towards a multi-cloud solution underpinned by GitOps principles and foundational tooling made possible through the CNCF ecosystem. Our focus is on Kubernetes, Crossplane, OPA, and Argo, among others, which have been instrumental in our progress.

We aim to share our successes and the lessons learned throughout this journey, built for engineers in a globally regulated environment comprising four distinct lines of business. Our experience underscores the vital role of the CNCF in our modernization efforts, and we are eager to give back to the community that has provided us with indispensable resources and support.

Speakers

Steven Borrelli

Principal Solutions Architect, Upbound

Steven is a Principal Solutions Architect for Upbound, where he helps customers adopt Crossplane.

Pranita Praveen

Head of Enterprise Multi-Cloud, Macquarie Group Pty Ltd

I am a cloud platform engineer, passionate about creating robust, simple, and easy-to-operate solutions.

Thursday April 3, 2025 16:45 - 17:15 BST
Level 0 | ICC Capital Hall | Room 1

Platform Engineering

Content Experience Level Intermediate

16:45 BST

Redefining Access Control: Scaling Policy as Code for Humans and AI Agents - Raz Cohen, Permit.io

Thursday April 3, 2025 16:45 - 17:15 BST

Level 1 | Hall Entrance S10 | Room C

As enterprises embrace AI, managing access for both human users and AI agents has become essential. Traditional access control methods can no longer meet the demands of AI-driven identities such as chatbots, AI agents, decision engines, and autonomous tools.

This talk explores how Policy as Code redefines fine-grained access control, enabling scalability for both humans and AI. Learn how to design flexible, auditable policies that support real-time decision-making and address AI-specific challenges. Tools like Open Policy Agent (OPA) and OpenFGA will be featured, along with strategies for integrating AI-driven access models into zero-trust environments.

Through real-world case studies, discover how enterprises secure billions of interactions while fostering seamless collaboration between humans and machines.

Join me to gain practical insights into implementing scalable access control for today’s AI-powered ecosystems!

Speakers

Raz Cohen

Head of Platform, Permit.io

I'm Raz Cohen, Head of Platform at Permit.io. With over eight years in Kubernetes, cloud-native solutions, open-source projects, and platform engineering, starting at the IDF's 8200 unit and continuing at Logz.io and DoubleVerify, I've become a specialist in developer tools. I've spoken at events like KubeCon...

Thursday April 3, 2025 16:45 - 17:15 BST
Level 1 | Hall Entrance S10 | Room C

Security

Content Experience Level Intermediate

17:10 BST

⚡Lightning Talk: Scaling To the Stars: Simulating Massive Clusters With KWOK - Soumya Balakrishnan, NVIDIA

Thursday April 3, 2025 17:10 - 17:15 BST

Level 0 | ICC Auditorium

At NVIDIA, we operate a large fleet of GPU clusters that run gaming and AI/ML workloads. As we expand, ensuring that we scale safely and efficiently becomes a critical challenge. Enter KWOK (Kubernetes Without Kubelet), our secret weapon for stress-testing new features before they hit production.
This talk will dive into how we integrate KWOK into our development pipeline, showcasing how it's helped us maintain stability while rapidly innovating.
1. Identifying resource utilization boundaries: Demonstrate how KWOK has helped us evaluate the resource limits that need to be set on service pods so they can operate within safe boundaries.
2. Code optimization insights: Share examples of how KWOK has helped optimize our automation tools, significantly reducing their memory footprint.
3. Performance testing at scale: Illustrate how KWOK enables us to simulate large-scale environments, allowing us to identify potential bottlenecks and optimize system performance before production deployment.

Speakers

Soumya Balakrishnan

Senior Software Engineer, NVIDIA

Soumya is a Senior DevOps Engineer at NVIDIA, specializing in cloud infrastructure and Kubernetes technologies.

Thursday April 3, 2025 17:10 - 17:15 BST
Level 0 | ICC Auditorium

⚡ Lightning Talks, Platform Engineering

Content Experience Level Intermediate

17:15 BST

⚡Lightning Talk: Scheduling Success: Precision Updates for Continuous Manufacturing Operations - Raul-Mihail Galescu, Bosch Connected Industry

Thursday April 3, 2025 17:15 - 17:20 BST

Level 0 | ICC Auditorium

Cloud-native technologies are gaining traction in manufacturing, as the industry strives for zero-downtime deployments in production systems. However, many plants rely on legacy software that doesn’t integrate smoothly with cloud-native environments. Even when containerized, these components often fail to support seamless request redirection between replicas, causing disruptions during cluster or node updates. These disruptions require precise scheduling around plant shift plans. This lightning talk will explain why maintenance windows can still be effective and how Bosch Connected Industry addresses the limitations of public cloud providers' update controls. You’ll learn a simple yet effective approach to managing cluster updates and node image promotions in production-critical environments.

Speakers

Raul Galescu

Junior DevOps Engineer, Bosch Connected Industry

Raul is a Junior DevOps Engineer at Bosch Connected Industry, specializing in optimizing cloud-native solutions. Prior to this role, he worked as a Junior System Administrator at the West University of Timisoara and provided IT solutions to public institutions at a local company in...

Thursday April 3, 2025 17:15 - 17:20 BST
Level 0 | ICC Auditorium

⚡ Lightning Talks, Platform Engineering

Content Experience Level Intermediate

17:30 BST

Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes - Radostin Stoyanov, University of Oxford & Adrian Reber, Red Hat

Thursday April 3, 2025 17:30 - 18:00 BST

Level 1 | Hall Entrance S10 | Room B

As long-running AI/ML workloads become more common in cloud-native environments, the need for efficient checkpointing mechanisms to provide fault tolerance becomes increasingly important. However, current state-of-the-art techniques for transparent GPU checkpointing rely on intercepting and logging device API calls (e.g., CUDA runtime) as well as capturing input data and object handles (e.g., events, streams). This approach inevitably introduces steady-state overhead and requires replaying the entire recorded execution, potentially with nondeterministic operations, to recover from failures.

This talk will cover how the Kubernetes container checkpointing functionality has been extended with recently introduced CRIU plugins to enable transparent checkpoint/restore of GPU computations without the overhead of API interception, logging, or re-execution. This talk will also discuss how these mechanisms can be utilized to improve resource utilization in large-scale GPU clusters.

Speakers

Adrian Reber

Senior Principal Software Engineer, Red Hat

Adrian is a Senior Principal Software Engineer at Red Hat and has been migrating processes since at least 2010. He started migrating processes in a high-performance computing environment, and at some point he had migrated so many processes that he got a PhD for it. Most of the time he is...

Radostin Stoyanov

PhD Student, University of Oxford

Radostin Stoyanov is a PhD student in the Scientific Computing research group at the University of Oxford, and a Software Engineer on the Core Kernel Team at Red Hat. His research focuses on improving the resilience and performance of HPC and cloud computing systems.

Thursday April 3, 2025 17:30 - 18:00 BST
Level 1 | Hall Entrance S10 | Room B

AI + ML

Content Experience Level Intermediate

17:30 BST

Challenges of and Solutions for Migrating Spark From Legacy Hadoop Clusters To Kubernetes - Peter Christensen & Neha Singla, Apple

Thursday April 3, 2025 17:30 - 18:00 BST

Level 1 | Hall Entrance N10 | Room H

While there are performance and security reasons for operating Spark on bare-metal Apache Hadoop clusters, cloud-based installations using Kubernetes as the cluster manager are becoming more and more mainstream due to superior scalability, flexibility, and cost-effectiveness for variable workloads. However, migrating Spark from bare-metal clusters to a cloud-based cluster environment poses a number of non-trivial challenges, both technical and human/organizational. These challenges include, but are not limited to, achieving query performance parity, differences in scheduling and resource management, security in a multi-tenancy context, and providing sufficient introspection to aid diagnostics and configuration adjustments. This case study recounts the challenges encountered and solutions implemented while migrating Spark from bare metal to a Kubernetes-managed cloud in a large corporate environment.

Speakers

Peter Christensen

Software Engineer, Apple

Senior software engineer with a multidisciplinary background in fields such as electronic design automation, materials science, large-scale distributed processing, and cloud computing.

Neha Singla

Senior Software Engineer, Apple

Neha Singla is a software engineer on the Data Platform team at Apple, where she provides Jupyter notebook solutions at scale to help data scientists and data engineers at Apple build great data products. She has been working at Apple for 2+ years and has experience building platforms at scale with... Read More →

Data Processing + Storage

Content Experience Level Intermediate

17:30 BST

Image Snapshotters for Efficient Container Execution in Particle Physics - Clemens Lange, Paul Scherrer Institute & Valentin Volkl, CERN

Thursday April 3, 2025 17:30 - 18:00 BST

Level 1 | Hall Entrance S10 | Room D

In particle physics, compute-intensive workloads often involve thousands of "embarrassingly parallel" jobs relying on multi-gigabyte container images. A large fraction of these workloads is executed using software containers. Efficient execution across large-scale computing environments demands advanced caching and image loading techniques to prevent network saturation and reduce startup times. Leveraging the industry-standard containerd runtime, we evaluate snapshotter plugins such as CVMFS (a CERN-developed distributed file system for large-scale software distribution), SOCI, and Stargz, which use "lazy" image loading to optimise performance. This talk includes an analysis of metrics such as container startup time and image data downloaded, alongside usability evaluations in a research environment. We demonstrate how these tools enhance the reusability and reproducibility of physics analyses, with insights relevant to broader high-performance computing scenarios.

Speakers

Clemens Lange

Research Physicist, Paul Scherrer Institute

Clemens is a particle physicist at Switzerland’s Paul Scherrer Institute, where he contributes to the CMS experiment at CERN’s Large Hadron Collider. He focusses on Higgs boson analysis and the development of new particle detectors, and is passionate about computing and open science... Read More →

Valentin Volkl

Systems Software Engineer, CERN

Valentin is a physicist and staff software engineer at CERN. In the past he has worked on software and simulations for the next generation of particle colliders. Since 2023 he has been the lead developer of the CernVM-FileSystem (CVMFS), which is used to distribute software for users in science... Read More →

Emerging + Advanced

Content Experience Level Intermediate

17:30 BST

Optimizing Metrics Collection & Serving When Autoscaling LLM Workloads - Vincent Hou, Bloomberg & Jiří Kremser, kedify.io

Thursday April 3, 2025 17:30 - 18:00 BST

Level 1 | Hall Entrance N10 | Room G

Balancing resource provisioning for LLM workloads is critical for maintaining both cost efficiency and service quality. Kubernetes’s horizontal autoscaling offers a cloud-native capability to address these challenges, relying on metrics to make autoscaling decisions. However, the efficiency of metrics collection affects how quickly and accurately the autoscaler responds to LLM workload demands. This session explores strategies to enhance metrics collection for autoscaling LLM workloads, covering:
1. The fundamentals of how horizontal autoscaling works in Kubernetes
2. The unique challenges of autoscaling LLM workloads
3. A comparison of existing Kubernetes autoscaling solutions for custom metrics, with their pros and cons
4. How optimizing metrics collection through push-based approaches can improve scaling responsiveness
It will demonstrate an integrated solution using KServe, the OpenTelemetry Collector, and KEDA to showcase how they can be leveraged to optimize LLM workload autoscaling.

Speakers

Vincent Hou

Senior Software Engineer, Bloomberg

Vincent Hou is a senior software engineer on Bloomberg’s Cloud Native Compute Services AI Inference engineering team, which he joined in 2023 after working for IBM for 13 years. He has been an active open source contributor since 2010. He previously was an active contributor to... Read More →

Jiří Kremser

YAML Engineer, kedify.io

whois jkremser? Software engineer and open-source enthusiast currently working on kedify.io. Previously GiantSwarm.io, ABSA, Red Hat, etc. He likes road trips and 3D printing, and he is also a proud contributor to the CNCF sandbox project k8gb.io.

Observability

Content Experience Level Intermediate

17:30 BST

Automating Kubernetes Cluster Updates: Achieving Zero Downtime Effortlessly - Haitao Zhang, CloudPilot AI; Baofa Fan, DaoCloud; Ling Ling, Independent; Wei Jiang, Huawei

Thursday April 3, 2025 17:30 - 18:00 BST

Level 0 | ICC Capital Hall | Room 1

Upgrading a Kubernetes cluster is an ongoing task. The biggest challenge for teams maintaining Kubernetes clusters is avoiding service disruptions or system crashes during the upgrade process. With Karpenter's disruption mechanism, we can now automate Kubernetes cluster upgrades on major cloud platforms such as AWS, Azure, and Alibaba Cloud in a controlled way with zero downtime. To date, Karpenter supports these cloud vendors and will expand to more platforms in the future. This mechanism makes Kubernetes cluster upgrades safe, controllable, easy, and efficient, and significantly reduces the operations and maintenance pressure on DevOps teams. In this session, we will discuss how Karpenter's disruption mechanism works, show examples of its use on major cloud platforms, and help you achieve smooth upgrades while keeping services running continuously and stably.

Speakers

Wei Jiang

Tech Leader, CloudPilot AI

Wei Jiang serves as a Tech Leader at CloudPilot AI. He primarily works on open-source projects, focusing on node scaling with Karpenter and other technologies that achieve high utilization and cost-effectiveness.

Xinxia Ling

Open Source & AI Enthusiast, CloudPilot AI Inc.

With experience in promoting cloud-native solutions like Karpenter and Rancher, Ling offers valuable insights on how developers can cut cloud costs while scaling their infrastructure efficiently.

Fan Baofa

Software Engineer, DaoCloud

Baofa Fan (GitHub @carlory) is an active reviewer in the Kubernetes, Kubernetes-sigs, and Kubernetes-csi organizations, currently working mainly on sig-storage. He is also a reviewer of the Karmada project, which focuses on the multi-cluster area.

Haitao Zhang

Software Engineer, CloudPilot AI

Haitao Zhang (GitHub @helen-frank) is a major contributor and reviewer of karpenter-provider-alibabacloud, and a member of kubernetes-sigs and karmada.

Platform Engineering

Content Experience Level Intermediate

17:30 BST

From Metal To Apps: LinkedIn’s Kubernetes-based Compute Platform - Ahmet Alp Balkan & Ronak Nathani, LinkedIn

Thursday April 3, 2025 17:30 - 18:00 BST

Level 1 | Hall Entrance N10 | Room F

What does it take to design a Kubernetes-based fleet management stack that bridges the gap between bare-metal servers in data centers and a platform capable of hosting thousands of microservices, large-scale stateful applications, and a GPU fleet running AI workloads?

At LinkedIn, we use Kubernetes as a foundational primitive in our compute platform. We run thousands of microservices, manage large stateful applications with our custom scheduler, and manage a large fleet of GPUs, all while performing regular maintenance on the bare metal hosts with no downtime or manual intervention.

In this talk, we’ll discuss how we architected and built an API-driven, Kubernetes-based compute stack with a large-scale microservices platform, a workload-agnostic stateful scheduler, and a multi-tenant ML/batch jobs platform. We’ll share insights on scaling Kubernetes for diverse workloads while maintaining tenant isolation, resilience, flexibility, and ease of use for developers.

Speakers

Ahmet Alp Balkan

Sr. Staff Software Engineer, LinkedIn

Ahmet is working on building LinkedIn's next generation compute cluster management stack using Kubernetes. In open source, he maintains projects like Krew (the kubectl plugin manager) and kubectx.

Ronak Nathani

Sr. Staff Software Engineer, LinkedIn

Ronak leads the Kubernetes team at LinkedIn, spearheading the company's transition to Kubernetes over the past few years. Prior to this role, he contributed to the development and management of LinkedIn's home-grown scheduler and internal private cloud. In addition to his day job... Read More →

Platform Engineering

Content Experience Level Intermediate

09:06 BST

Keynote: LLM-Aware Load Balancing in Kubernetes: A New Era of Efficiency - Clayton Coleman, Distinguished Engineer, Google & Jiaxin Shan, Software Engineer, ByteDance

Friday April 4, 2025 09:06 - 09:21 BST

Level 0 | ICC Auditorium

Traditional load balancing approaches, including round robin and those relying on metrics like QPS, are often ineffective when applied to LLM serving. LLM requests vary significantly in computational demands due to prompt length, model differences, and the autoregressive nature of LLMs, leading to unpredictable request running times. Moreover, the emergence of model multiplexing techniques (e.g., LoRA) introduces new complexities that necessitate LLM-aware load balancing strategies.
In this talk, we introduce a new set of Kubernetes APIs for routing to LLM workloads that allow configuration of serving objectives and priorities for each use case. These APIs integrate seamlessly with Gateway API, and an included extension means that support for these APIs can easily be plugged into many Gateway API implementations to enable turnkey LLM routing support.
This talk will show this project in action, demonstrating the significant improvements it can enable across a variety of real world examples.

Speakers

Jiaxin

Software Engineer, ByteDance

Jiaxin works at ByteDance Infrastructure Lab, focusing on serverless and AI infrastructure. As a co-chair of Kubernetes WG-Serving, Jiaxin drives innovation and contributes to the future of scalable AI systems.

Clayton Coleman

Distinguished Engineer, Google

Architect, engineer, and strategic visionary for application platforms in the cloud. Core contributor to Kubernetes and OpenShift, the open source platform as a service and the containerized cluster manager. I helped launch the shift to cloud native applications and the platforms... Read More →

Keynote Sessions, AI + ML

Content Experience Level Intermediate

10:10 BST

Keynote: Science at Light Speed: Cloud Native Infrastructure for Astronomy Workloads - Carolina Lindqvist, System Specialist, EPFL

Friday April 4, 2025 10:10 - 10:25 BST

Level 0 | ICC Auditorium

The Square Kilometre Array (SKA) project is a global collaboration for constructing the world’s largest radio telescope. This presentation shows how the Swiss SKA Regional Center (CHSRC) unit within the global SKA Regional Center Network (SRCNet) collaboration uses Kubernetes as a service management plane and leverages its ecosystem to build a novel infrastructure to support data- and compute-intensive astronomy use cases. The main focus is on an example setup of a Kubernetes cluster, showing how cloud-native tools are leveraged to interact with external storage and compute services, and demonstrating how to build infrastructure suitable for multiple sites. It is relevant both to beginners seeking guidance on where to start their cloud-native journey and to intermediate Kubernetes users who wish to see examples of cloud-native use cases from within a scientific organisation.

Speakers

Carolina Lindqvist

System Specialist, EPFL

Carolina Lindqvist is a System Specialist at the EPFL SCITAS department for Scientific Computing and High Performance Computing (HPC). She works with Kubernetes infrastructure for scientific use cases. Before joining SCITAS, Carolina worked at the Blue Brain Project, startups and... Read More →

Keynote Sessions, Emerging + Advanced

Content Experience Level Intermediate

11:00 BST

Kubernetes and AI To Protect Our Forests: A Cloud Native Infrastructure for Wildfire Prevention - Andrea Giardini, Crossover Engineering BV

Friday April 4, 2025 11:00 - 11:30 BST

Level 0 | ICC Capital Hall | Room 2

As wildfires become increasingly devastating due to climate change, leveraging technology for environmental protection is crucial. This talk focuses on the infrastructure needed to support AI-driven wildfire prevention systems using Kubernetes and cloud-native technologies. We will discuss the challenges of managing robust data pipelines for processing satellite imagery and environmental data, emphasizing the importance of GPU acceleration for AI. Additionally, we will explore strategies for efficient storage solutions to handle large datasets, ensuring scalability and performance. Attendees will gain insights into the architectural considerations and operational challenges of deploying an effective, resilient wildfire monitoring and prevention infrastructure. Join us in understanding how we can harness the power of technology to protect our forests and mitigate the impact of wildfires on our environment.

Speakers

Andrea Giardini

Cloud Native Consultant / Trainer, Crossover Engineering

Andrea is a technical consultant passionate about infrastructure, cloud, and automation. Throughout his career, he has worked in different roles, from an individual contributor building infrastructure as code to an engineering manager growing a team from the ground up. He likes... Read More →

AI + ML

Content Experience Level Intermediate

11:00 BST

"Surviving Day2: Picking the Right Tool To Secure Your Kubernetes Habitat" - Bruno Gabriel da Silva, Sysdig & Henrique Santana, AWS

Friday April 4, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance N10 | Room F

The CNCF landscape is so big that it can feel impossible to comprehend.

A jungle of tools with unique roles and capabilities, divided into several categories.

In nature, every species has its strengths. A falco(n), for instance, serves as a vigilant runtime protector, while the racoon (Trivy) hunts for vulnerabilities. Some animals are hunters, each using a unique set of skills and techniques to survive.

In this session, you'll be exposed to different fauna, like Falco, Trivy, Kyverno and others, with a fun and biological approach.

After this presentation, you’ll have the confidence to choose the right predator, or a non-poisonous fruit, ensuring your Kubernetes habitat stays secure and thriving.

Speakers

Henrique Santana

Sr. Cloud Support Engineer, AWS

I'm a Containers Specialist with over 15 years of experience in infrastructure operations. Skilled at automating workflows and solving problems through user-centered design and emerging technologies. Currently focusing on containers and container orchestration. Adept at optimizing... Read More →

Bruno Gabriel da Silva

Sr Solutions Engineer, Sysdig

I have been working as a Solutions Engineer for several years, with my passion for cloud-native technologies igniting around 2018. That year, I transitioned from a traditional IT Windows Sysadmin role to fully embracing DevOps, focusing entirely on Open Source and Cloud. My first... Read More →

Cloud Native Novice

Content Experience Level Intermediate

11:00 BST

Consistent Volume Group Snapshots, Unraveling the Magic - Leonardo Cecchi, EDB & Xing Yang, VMware by Broadcom

Friday April 4, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance S10 | Room C

Snapshotting databases running on multiple volumes is not easy because of inconsistencies due to snapshots being taken at different times.

VolumeGroupSnapshots, introduced as an alpha feature in Kubernetes 1.27 and now in the process of being promoted to beta, provide a solution by enabling write-order consistent snapshots for multiple volumes.

In this session, explore the inner workings of VolumeGroupSnapshots by discovering the key implementation components and their cooperative efforts in achieving consistent group snapshots.

Gain valuable insights to ensure proper usage of this feature and become adept at troubleshooting and debugging potential issues.

Speakers

Xing Yang

Tech Lead, VMware by Broadcom

Xing Yang is a Tech Lead in the Cloud Native Storage team at VMware by Broadcom. She is a co-chair of CNCF Storage TAG, a co-chair of the Kubernetes Storage SIG, a co-chair of the Data Protection WG, and a maintainer in Kubernetes CSI. Before joining VMware, Xing was the Lead Architect... Read More →

Leonardo Cecchi

Software Development Principal, EDB

Leonardo Cecchi, a principal in software development at EDB, plays a pivotal role as a maintainer in the CloudNativePG project and BigAnimal, EDB's DBaaS offering. With a longstanding preference for PostgreSQL dating back to 1998, his expertise in this DBMS is extensive. Before EDB... Read More →

Data Processing + Storage

Content Experience Level Intermediate

11:00 BST

Quantum-Ready Kubernetes: How Do We Get There? - Nikhita Raghunath & Natalie Fisher, Broadcom; Paul Schweigert, IBM; Ricardo Rocha, CERN; Tomas Gustavsson, Keyfactor

Friday April 4, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance S10 | Room B

As AI continues to evolve, quantum computing is poised to disrupt Kubernetes in ways we can’t ignore. By 2035, the US government will only procure quantum-safe solutions, and if our infrastructure isn’t ready soon we’ll be scrambling to catch up.

This panel brings together experts to explore:
- What quantum computing is & why it’s a game changer
- How to orchestrate quantum workloads on Kubernetes
- Middleware needed to bridge classical and quantum resources
- Redesigning infrastructure to meet NIST’s quantum-safe standards with an agile long-term strategy
- Building infrastructure for real-world use cases like scientific simulations
- How quantum machine learning can help run AI workloads

You don’t need to be a quantum expert to join! You’ll walk away with actionable insights on architectural trade-offs for running quantum workloads and learn how to implement quantum-safe security. This is your chance to spark fresh ideas & take the lead in shaping the next decade of technology!

Speakers

Ricardo Rocha

Computing Engineer, CERN

Ricardo leads the Platform Infrastructure team at CERN with a strong focus on cloud native deployments and machine learning. He has led for several years the internal effort to transition services and workloads to use cloud native technologies, as well as dissemination and training... Read More →

Nikhita Raghunath

Principal Engineer, Broadcom

Nikhita is a Principal Engineer at Broadcom, past co-chair of KubeCon and a maintainer of the Kubernetes project. She is the vice chair of the CNCF Technical Oversight Committee and won the CNCF Top Committer Award in 2021 for her technical contributions. She was also a member... Read More →

Paul Schweigert

Senior Software Engineer, IBM

Paul Schweigert works on quantum and AI technologies at IBM. He has extensive experience in open source (Knative and Kubernetes in particular) and has spoken at numerous conferences. He has also led various platform engineering and data science teams. In a previous life, he studied... Read More →

Tomas Gustavsson

Chief PKI Officer, Keyfactor

Tomas Gustavsson is the chief public key infrastructure (PKI) officer at Keyfactor. He pioneered open source public key infrastructure with EJBCA, now downloaded more than 3,000 times per month. With a background in computer science, Tomas established EJBCA to fortify trusted digital... Read More →

Natalie Fisher

Technology Product Manager, Broadcom

Natalie is a Technology Product Manager at VCF. A lifelong learner, she’s always been fascinated with emerging technology and the endless possibilities and solutions one could dream up. Having spent many years in product and working in companies ranging from e-Commerce, Data Analytics... Read More →

Emerging + Advanced

Content Experience Level Intermediate

11:00 BST

The Missing Metrics: Measuring Memory Interference in Cloud Native Systems - Jonathan Perry, PerfPod

Friday April 4, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance N10 | Room G

Your applications may be suffering from severe performance degradation without you knowing it. Memory bandwidth contention and cache interference between containers can increase tail latency by 4-13x and reduce compute efficiency by 25%, even with CPU and memory limits in place. This effect is particularly insidious as it manifests as high CPU utilization, leading operators to misdiagnose the root cause.

This session presents the latest research on detecting memory interference, including findings from Google, Alibaba, and Meta's production environments. We'll explore how modern CPU performance counters can identify noisy neighbors, examine real-world patterns that trigger interference (like garbage collection and container image decompression), and demonstrate practical approaches to measure these effects in Kubernetes environments.

Speakers

Jonathan Perry

Founder & CEO, PerfPod

Jonathan Perry is a maintainer of the OpenTelemetry eBPF network collector. His PhD research at MIT CSAIL focused on performance isolation in datacenter and cloud networks, aiming to enhance network efficiency and reduce latency. Jonathan founded Flowmill, where he developed eBPF-based... Read More →

Observability

Content Experience Level Intermediate

11:00 BST

Container Runtimes... on Lockdown: The Hidden Costs of Multi-tenant Workloads - Lewis Denham-Parry, Edera & Caleb Woodbine, ii.nz

Friday April 4, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance S10 | Room D

Container runtimes form the bedrock of Kubernetes, but running diverse workloads side-by-side introduces complex security challenges that many teams overlook. This talk peels back the layers of container isolation, starting with the fundamentals of how containers operate as Linux processes and evolving through today's runtime landscape.

We'll dive deep into the hidden costs and security implications of different container runtime choices in multi-tenant environments. Through real-world examples and performance benchmarks, we'll explore the delicate balance between isolation and efficiency. You'll learn about emerging solutions in the container runtime space and practical approaches to securing workloads without sacrificing performance.

Attendees will leave with critical security considerations for choosing container runtimes, strategies for workload isolation, and tools to evaluate isolation versus performance tradeoffs.

Speakers

Caleb

Software Engineer, calebwoodbine.nz

Open Source, software, cloud native community and distributed cloud enthusiast.

Lewis Denham-Parry

Staff Solutions Engineer, Edera

Lewis Denham-Parry orchestrates containers by day and puts them through rigorous security testing by night. As Staff Solutions Engineer at Edera, he leverages his diverse background to deliver the robust security and isolation that modern systems demand. A dynamic speaker at KubeCon... Read More →

Security

Content Experience Level Intermediate

11:00 BST

Zero Trust at Shopify Scale: Automating mTLS Across Thousands of Services - Dani Santos & Michelle Mali, Shopify

Friday April 4, 2025 11:00 - 11:30 BST

Level 1 | Hall Entrance N10 | Room H

Certificate management at scale presents critical challenges for securing service-to-service communication in zero trust architectures. We will demonstrate how Shopify automates mTLS across thousands of services, addressing certificate rotation without interruption, renewal failures, and cross-cluster distribution. Drawing from production experience, we'll explore our evolution from custom admission controllers to versatile patterns working across Kubernetes and non-Kubernetes environments, including mounting CA certificates at container startup with periodic CronJob renewals. We'll share code examples for resilient rotation mechanisms, graceful certificate rollover, and RBAC. Attendees will learn practical patterns for scaling mTLS, with examples of monitoring certificate lifecycles and troubleshooting common failure modes.

Speakers

Michelle Mali

Infrastructure Security Engineer, Shopify

Michelle Mali is an Infrastructure Security Engineer at Shopify, specializing in securing cloud-native environments. With experience in Kubernetes and container security, they hold the Certified Kubernetes Application Developer (CKAD) and Certified Kubernetes Administrator (CKA) certifications... Read More →

Dani Santos

Senior Infrastructure Security Engineer, Shopify

Dani Santos is a Senior InfraSec Engineer at Shopify, focusing on service identity and PKI infrastructure at scale in cloud-native environments. She's involved in certificate management initiatives across Shopify's internal services, developing solutions for automated mTLS flows... Read More →

Security

Content Experience Level Intermediate

11:45 BST

Extending Kubernetes for AI | Lessons Learned From Platform Engineering - Susan Wu, Google & Lucy Sweet, Uber

Friday April 4, 2025 11:45 - 12:15 BST

Level 1 | Hall Entrance S10 | Room A

Kubernetes and the open-source ecosystem are becoming the universal control plane not only for conventional app orchestration but also for building AI applications. Yet developers and cluster operators struggle with cost optimization for specialized compute and with customizing Kubernetes.

In this session, hear from platform engineers at Morgan Stanley, Uber, and Trivago and learn how they designed shared platforms with infrastructure across cloud providers to support both business-critical apps and accelerated workloads.

You can expect to come away with guidance, hear of pitfalls to watch out for and learn how they extended Kubernetes with custom controls and other cloud native projects and built efficient, self-service interfaces to enable developer velocity and researcher experimentation.

Panelists:

Lucy Sweet, Senior Software Engineer, Uber
Susan Wu, PM, Google

Speakers

Lucy Sweet

Senior Software Engineer, Uber

Lucy is a Senior Software Engineer at Uber Denmark who works on platform infrastructure.

Susan Wu

Outbound Product Manager, Google

Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix and Midokura (part of Sony Group). She is a frequent speaker at conferences... Read More →

AI + ML

Content Experience Level Intermediate

11:45 BST

Kubernetes Meets Climate Science: Building Large-scale Feature Detection From Climate Data Records - Armagan Karatosun & Roope Tervo, European Organisation for the Exploitation of Meteorological Satellites

Friday April 4, 2025 11:45 - 12:15 BST

Level 0 | ICC Capital Hall | Room 2

The exponential growth of Earth Observation (EO) data volumes in the past decade has made downloading and processing EO data locally impractical. In response, the European public space sector launched initiatives to provide private cloud infrastructure, like the European Weather Cloud (EWC), allowing users to provision computing resources close to the data.

Leveraging these new possibilities introduced by cloud services and machine learning, the hydro-meteorological community has initiated projects to identify features from remote sensing data, including satellite imagery, to enhance early weather warnings and climate science. EUMETSAT and its Member States are now developing a collaborative environment within EWC for manual annotation, model development, and analyses to provide reliable feature identification from EO data.

Join us in our session to learn more about our solution, involving an environment for data preparation, community annotation tools, and a features database.

Speakers

Roope Tervo

European Weather Cloud service coordinator, EUMETSAT

Software professional with special interests in clouds, AI, ML, open data, APIs, team management, architecture, and spatial services.

Armagan Karatosun

Cloud Data Services Expert, EUMETSAT (European Organisation for the Exploitation of Meteorological Satellites)

Armagan Karatosun (he/him) holds an MSc in High-Performance Computing from Istanbul Technical University and has 6+ years of industry experience. As a Cloud Data Services Expert at EUMETSAT, he specializes in crafting cloud-based solutions. His focus is on creating resilient and event-driven... Read More →

AI + ML

Content Experience Level Intermediate

11:45 BST

Into the Shopfloor: Moving Manufacturing Execution Systems To Kubernetes - Manuel Peuster & Andrei Traian Cucuruzac, Bosch Connected Industry

Friday April 4, 2025 11:45 - 12:15 BST

Level 1 | Hall Entrance N10 | Room E

Kubernetes is breaking boundaries, entering the manufacturing sector and powering mission-critical systems on production floors. This case study explores Bosch Connected Industry’s journey to modernize a manufacturing execution system (MES) into a cloud-native ecosystem. From containerization to the evolution from Docker Compose and Ansible-driven Kubernetes manifests to a streamlined Helm-based setup, we’ll share how we overcame challenges step by step.

The operator pattern became our secret weapon, automating workflows and enabling scalability. However, no two plants are identical, making versatile parameterization crucial. Manufacturing setups demand support for diverse environments, from public cloud to air-gapped, on-premise edge clusters, often managed by engineers with limited DevOps expertise.

This session is for DevOps engineers, architects, and tech enthusiasts eager to tackle the real-world challenges of bringing Kubernetes into diverse and demanding operational contexts.

Speakers

Andrei

Junior DevOps Engineer, Bosch Connected Industry

Andrei Cucuruzac, a Junior DevOps Engineer at Bosch and graduate of the Polytechnic University of Bucharest in Industrial Engineering, focuses on Kubernetes for container orchestration. He explores advanced features like dynamic scaling, workload automation, and multi-cluster management... Read More →

Manuel Peuster

Senior DevOps Engineer, Bosch Connected Industry

Manuel Peuster holds a PhD in computer science and his research interests include network softwarization, industrial IoT, as well as benchmarking of distributed systems. He was an active contributor to OpenSource MANO and founded several open-source projects, such as Containernet... Read More →

Application Development

Content Experience Level Intermediate

11:45 BST

Enhancing Software Composition Analysis Resilience Against Container Image Obfuscation - Agathe Blaise, Thales & Jacopo Bufalino, CNAM

Friday April 4, 2025 11:45 - 12:15 BST

Level 1 | Hall Entrance S10 | Room D

Malicious compliance has been highlighted in previous KubeCon talks as a challenge for software composition analysis, as it conceals OS and package information in container images and hides vulnerabilities. In this talk, we analyze how the landscape evolved over the past two years and propose improvements for SBOM generation. We found that open-source and cloud providers' tools remain vulnerable, which is even more visible in compressed images from public container registries. We uncover another form of malicious compliance: the lack of a standardized package identifier format, which results in inconsistencies in detected vulnerabilities between SBOM tools. To address this, we introduce an open-source methodology for layer-by-layer container image analysis, reconstructing the complete history of file modifications and retrieving package metadata and package-related content, improving file coverage and SBOM accuracy. We finally outline concrete steps for advancing SBOM resilience and accuracy.

Speakers

Agathe Blaise

Research Engineer, Thales

Agathe Blaise is currently a research engineer at Thales (Gennevilliers, France). She received the Ph.D. degree in Computer Science from LIP6, Sorbonne University (Paris, France) in 2020. Her research interests focus on cloud computing security, studying various aspects (container...

Jacopo Bufalino

Security Researcher, CNAM

I've always enjoyed breaking things, that's why I work in security. After some years in industry working as DevOps, I moved to academia, focusing on cloud network security.

Friday April 4, 2025 11:45 - 12:15 BST
Level 1 | Hall Entrance S10 | Room D

Security

Content Experience Level Intermediate

13:45 BST

Don't Let Your Kubernetes Cluster Go Wild: Ensuring Etcd Reliability - Arka Saha, VMware by Broadcom & Chun-Hung (Henry) Tseng, Google

Friday April 4, 2025 13:45 - 14:15 BST

Level 1 | Hall Entrance S10 | Room C

Have you ever encountered a perplexing Kubernetes issue that left you no choice but to recreate your cluster? As the backbone of Kubernetes, etcd stores the state and configuration at any given moment. Since any changes to this critical component can introduce instability, how can we continuously ensure that new features, improvements, or bug fixes don’t introduce data inconsistency and regression?
Join us for a deep dive into the etcd test framework and discover how we safeguard your Kubernetes clusters from catastrophic bugs. We will share the rigorous processes to guarantee correctness, consistency, and reliability with every code change for the etcd v3.6 release.
We'll share the challenges in our journey of developing, leveraging, and debugging issues caught by the robustness test framework. Whether you’re building Kubernetes or complex distributed systems, this session will equip you with invaluable knowledge and practical tools to create a more reliable and resilient infrastructure.

Speakers

Arka Saha

Software Engineer, VMware by Broadcom

Arka Saha, a Broadcom Software Engineer, leads Kubernetes releases & maintenance for Tanzu Extended Support. He manages VMware by Broadcom's Prow infrastructure, ensuring long-term support for k8s, etcd, containers, Golang & related components. Previously he managed Red Hat OpenShift...

Chun-Hung (Henry) Tseng

Software Engineer, Google

Henry is a CK*-certified Software Engineer at Google. He has been an etcd contributor since 2024.

Friday April 4, 2025 13:45 - 14:15 BST
Level 1 | Hall Entrance S10 | Room C

Data Processing + Storage

Content Experience Level Intermediate

13:45 BST

Thousands of Virtual Kubelets: 1-to-1 Mapping a Supercomputer To Kubernetes With Supernetes - Dennis Marttinen, Aalto University

Friday April 4, 2025 13:45 - 14:15 BST

Level 1 | Hall Entrance S10 | Room B

Bridging the gap between High-Performance Computing (HPC) and the cloud is an ongoing challenge in the cloud-native ecosystem. Most projects migrate some parts of the batch job scheduling from Slurm to Kubernetes. However, with many HPC systems rigidly tied to Slurm and its features, where is the integration limit?

Introducing Supernetes: an open source HPC-to-cloud bridge that bidirectionally reconciles all Slurm tasks to v1/Pods, and all Slurm nodes to v1/Nodes, 1-to-1. Supernetes tolerates the strictest HPC limitations: tight firewalls, no root, no fakeroot, no namespaces, no slurmrestd API. If you can run sbatch and scontrol, you can run Supernetes.

In this session, Dennis presents his quest to integrate LUMI, a global top-10 supercomputer, with Kubernetes. Starting from HPC-to-cloud bridge basics, the talk evolves into running thousands of virtual kubelet instances and hacking FluxCD to reconcile from a gRPC tunnel. The session concludes with a live demo of Supernetes on LUMI.

Speakers

Dennis Marttinen

Security and Cloud Computing (SECCLO) Master's Student, Aalto University

Dennis is a Security and Cloud Computing (SECCLO) double-degree master's student with a broad background in Kubernetes, supercomputing/HPC, networking and cloud security. He is the co-author of Weave Ignite, a container-to-microVM solution, and Racklet, a scale model rack project presented...

Friday April 4, 2025 13:45 - 14:15 BST
Level 1 | Hall Entrance S10 | Room B

Emerging + Advanced

Content Experience Level Intermediate

13:45 BST

Smooth Scaling With the OpAMP Supervisor: Managing Thousands of OpenTelemetry Collectors - Evan Bradley, Dynatrace & Andy Keller, observIQ

Friday April 4, 2025 13:45 - 14:15 BST

Level 1 | Hall Entrance N10 | Room G

The OpAMP protocol has become a powerful solution for managing OpenTelemetry Collectors, offering seamless remote configuration and control. Until recently, only a limited number of Collector distributions supported OpAMP. However, with the introduction of the OpAMP Extension and Supervisor, it is now easy to include OpAMP support in any Collector distribution.

This session will explore how to utilize OpAMP in upstream Collector distributions and outline the simple steps to make your own distribution OpAMP-compatible. Attendees will gain insights into the architecture and features of the OpAMP Supervisor and its role in enhancing Collector management. The talk will also include a demonstration of how the OpAMP Supervisor enables centralized remote configuration, monitoring, and updates for your Collectors.

Speakers

Andy Keller

Principal Engineer, observIQ

Andy is a Principal Engineer at observIQ where he is responsible for the architecture and implementation of BindPlane OP, an observability agent management and configuration platform. Andy has worked in the observability space for over 8 years and is a maintainer of the OpAMP...

Evan Bradley

Senior Software Engineer, Dynatrace

Evan helps maintain the OpenTelemetry Collector, where he is a primary contributor to the OpenTelemetry Transformation Language (OTTL), and helps drive adoption of the OpenTelemetry Agent Management Protocol (OpAMP) to enable users to manage fleets of Collectors. Evan has a background...

Friday April 4, 2025 13:45 - 14:15 BST
Level 1 | Hall Entrance N10 | Room G

Observability

Content Experience Level Intermediate

13:45 BST

Resilient Multi-Cloud Strategies: Harnessing Kubernetes, Cluster API, and Cell-Based Architecture - Tasdik Rahman & Javi Mosquera, New Relic

Friday April 4, 2025 13:45 - 14:15 BST

Level 0 | ICC Auditorium

In today's multi-cloud world, resilience and high availability at scale are crucial. This session will cover how we utilized Kubernetes with Cluster API and other cloud native components, to deploy a cell-based architecture across multiple cloud providers, scaling to 270+ clusters and 18,000+ nodes, creating independent, isolated cells that limit failures and improve uptime, thus simplifying compliance, cost management, and disaster recovery planning.

We'll explore how Cluster API facilitates seamless automation of cluster creation, management, and upgrades across our multi-cloud setup, enhancing autonomy and resilience. Moreover, we'll highlight real-world use cases, sharing our learnings from automation built for efficient management of k8s clusters while limiting operational overhead.

End users will learn how they can use Cluster API to automate their multi-cloud cluster lifecycle management and leverage cellular architecture to build a highly available setup.

Speakers

Javier Mosquera Sanchez

Principal Software Engineer, New Relic

I am a Principal Software Engineer at New Relic, where I work as the multicloud architect for the initiative to integrate our offering into the main three cloud service providers (AWS, Azure, and GCP). I also serve as the Kubernetes architect for our Container Fabric team, which is...

Tasdik Rahman

Senior Software Engineer, New Relic

A generalist developer, with a focus on the infrastructure side of things. Past ClusterAPI release 1.9 team member, Past Contributor to oVirt.

Friday April 4, 2025 13:45 - 14:15 BST
Level 0 | ICC Auditorium

Operations + Performance

Content Experience Level Intermediate

14:30 BST

From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob - Andrey Velichkevich, Apple & Yuki Iwai, CyberAgent, Inc.

Friday April 4, 2025 14:30 - 15:00 BST

Level 1 | Hall Entrance S10 | Room A

Message Passing Interface (MPI) is a foundational technology in distributed computing essential for ML frameworks like MLX, DeepSpeed, and NVIDIA NeMo. It powers efficient communication for large-scale AI workloads using high-speed interconnects via InfiniBand. However, running MPI on Kubernetes presents challenges, such as ensuring high-throughput pod-to-pod communication, managing MPI Job initialization in containerized environments, and supporting diverse MPI implementations, including OpenMPI, IntelMPI, and MPICH.

This talk will introduce the Kubeflow MPI Runtime integrated with Kubeflow TrainJob, featuring distributed training with MLX and LLM fine-tuning with DeepSpeed on Kubernetes. Speakers will highlight SSH-based optimization to boost MPI performance. Attendees will discover how this solution simplifies, scales, and optimizes AI workloads while addressing key challenges and combining MPI's efficiency with Kubernetes' orchestration power.

Speakers

Andrey Velichkevich

Senior Software Engineer, Apple

Andrey Velichkevich is a Senior Software Engineer at Apple and is a key contributor to the Kubeflow open-source project. He is a member of the Kubeflow Steering Committee and a co-chair of Kubeflow AutoML and Training WG. Additionally, Andrey is an active member of the CNCF WG AI. He...

Yuki Iwai

Software Engineer, CyberAgent, Inc.

Yuki is a Software Engineer at CyberAgent, Inc. He works on the internal platform for machine-learning applications and high-performance computing. He is currently a Technical Lead for Kubeflow WG AutoML / Training. He is also a Kubernetes WG Batch active member, Job API reviewer...

Friday April 4, 2025 14:30 - 15:00 BST
Level 1 | Hall Entrance S10 | Room A

AI + ML

Content Experience Level Intermediate

14:30 BST

Are You Covered? Falling in Love With E2E Testing - Scott McAllister, ngrok

Friday April 4, 2025 14:30 - 15:00 BST

Level 0 | ICC Capital Hall | Room 1

Automated testing aims to give us confidence that our code will run as expected in every situation, especially when we push changes. Good tests will increase your team's velocity of developing new features and reduce the headache of bugs and outages.

As more applications shift to containerized environments, testing them becomes more complex. Not only does the application code need to be tested, but so do the Kubernetes manifests. This session will clarify setting up and running automated tests in these environments. We'll discuss organizing tests in containers, handling dependencies, and maintaining consistent testing throughout the deployment process.

The session will cover setting up containers for replicable test environments, Argo CD for GitOps automation, and utilizing k3s to manage complex, interdependent test workflows, ensuring consistent, reliable end-to-end testing.

Speakers

Scott McAllister

Principal Developer Advocate, ngrok

Scott McAllister is a Developer Advocate at ngrok. He has been building web applications in several industries for over a decade. Now he's helping others learn about a wide range of web and infrastructure technologies. When he's not coding, writing or speaking he enjoys long walks...

Friday April 4, 2025 14:30 - 15:00 BST
Level 0 | ICC Capital Hall | Room 1

Application Development

Content Experience Level Intermediate

14:30 BST

Transparent, Infra-Level Checkpoint and Restore for Resilient AI/ML Workloads at Scale - Ganeshkumar Ashokavardhanan, Microsoft & Bernie Wu, MemVerge

Friday April 4, 2025 14:30 - 15:00 BST

Level 1 | Hall Entrance S10 | Room B

While model checkpointing at the application framework level provides basic failure recovery for AI/ML training, it burdens developers with complex config requirements. As the scale of production workload increases, infra-level checkpointing using Checkpoint/Restore in Userspace (CRIU) can provide fault-tolerance and live migration transparently to the end user. We will demonstrate with a k8s operator how to checkpoint and restore distributed ML workloads, showcasing novel extensions across CRIU, CRI-O, and cuda-checkpoint.

Our talk focuses on implementing synchronization mechanisms for JobSets running stateful workloads to be checkpointed in unison, while minimizing interruption overhead. The presentation explores how this infra-level approach accelerates recovery and workload reprioritization. Key topics include network state handling in distributed training and GPU memory checkpoint management, highlighting benefits for stateful applications requiring higher resiliency.

Speakers

Bernie Wu

VP Technology Partnerships, MemVerge

Bernie is VP of Technology Partnerships and leads the Kubernetes, AI/ML, and CXL Memory initiatives for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including Conner/Seagate, Cheyenne Software, Trend...

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft

Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, and is the lead for the GPU workload experience and error handling on this Kubernetes platform. He collaborates with partners in the ecosystem to support operator models for machine learning workloads...

Friday April 4, 2025 14:30 - 15:00 BST
Level 1 | Hall Entrance S10 | Room B

Emerging + Advanced

Content Experience Level Intermediate

14:30 BST

C.A.L.L.I.N.G. Now I'm Calling You, Calling You Now - Mario Macías & Terra Tauri, Grafana Labs

Friday April 4, 2025 14:30 - 15:00 BST

Level 0 | ICC Auditorium

The Kubernetes API is awesome and so tempting to use, especially when building Observability Solutions. Nobody wants to just get raw IP addresses and ports in their network or request telemetry; it’s much better to see your pod and service metadata. But what’s even better is that getting information about all the nodes in your cluster can help you produce amazing service graphs.

This talk is a story of how we took down the Kubernetes API in our biggest production cluster at Grafana, by deploying observability tools which make heavy use of the Kubernetes API. We’ll show you the techniques we used to avoid repeating our mistakes, by applying configuration changes and building services which helped us shield the Kubernetes API from the information-thirsty observability tools, while keeping the functionality intact.

Speakers

Mario Macías

Staff Software Engineer, Grafana Labs

I have loved programming since I was 12 years old. I’m a software engineer with 20 years of experience. During that time, I’ve been a scientific researcher, Ph.D. student, university teacher, backend developer, and book writer. During the last 7 years I've focused on monitoring and observability...

Terra Tauri

Staff Software Engineer, Grafana Labs

terra is a Platform Network Engineer at Grafana Labs measuring beeps and boops for software that measures beeps and boops. Grafana ingests petabytes of data every single day and the Platform Networking squad is responsible for ensuring every one of those o11y packets makes it into...

Friday April 4, 2025 14:30 - 15:00 BST
Level 0 | ICC Auditorium

Operations + Performance

Content Experience Level Intermediate

14:30 BST

Compliance at the Speed of Innovation: Leveraging AI-Driven Automation for Real-Time Regulatory Read - Larry Carvalho, RobustCloud LLC; Simon Metson, EnterpriseDB; Robert Ficcaglia, Sunstone Secure, LLC; Anca Sailer, Red Hat / IBM; Yuji Watanabe, IBM Japa

Friday April 4, 2025 14:30 - 15:00 BST

Level 1 | Hall Entrance N10 | Room G

Due to upcoming regulations, the increased time organizations need to meet compliance requirements is slowing down their ability to innovate rapidly. Businesses are transitioning from periodic compliance assessments to continuous compliance monitoring, which offers constant, real-time visibility into an enterprise's ability to meet regulatory guidelines. With the rapid evolution of regulatory requirements and the surge in recent data breaches, it is evident that customers need a continuously updated and comprehensive understanding of their compliance status and risk exposure. In this session, attendees will learn how adopting a code-based approach to compliance—powered by agentic AI—can accelerate their go-to-market strategy by automating the creation of compliance artifacts. Catalog, controls, and automatic assessments will be discussed. As a use case, the new DORA regulations will be discussed along with the workflow this technology can enable to help organizations adhere to DORA.

Speakers

Larry Carvalho

Principal Consultant, RobustCloud LLC

Larry Carvalho of RobustCloud LLC provides strategy and insight into the adaption of Edge and Cloud Computing technologies. He provides advisory services and works closely with customers and vendors to help all parts of the ecosystem understand cloud computing, map business goals...

Anca Sailer

Distinguished Engineer, Red Hat / IBM

Dr. Anca Sailer is an IBM Distinguished Engineer at the T. J. Watson Research Center where she transforms clients' compliance processes into an engineering practice. Dr. Sailer received her Ph.D. in CS from Sorbonne Universités, France and applied her Ph.D. work to Bell Labs before...

Robert Ficcaglia

CTO and CISO, Sunstone Secure, LLC

Robert is leading the CNCF Compliance WG, helps Kubernetes Audit in SIG-Security, and is the emeritus chair of wg-policy and an active lead in the project assessments for CNCF Security TAG. He also participates in LF efforts related to AI security and safety. As CTO for SunStone...

Yuji Watanabe

Senior Technical Staff Member, IBM

Yuji Watanabe is a Senior Technical Staff Member at IBM Research who lives in Tokyo, Japan. He leads a research team on cloud native security and has been delivering new integrity monitoring and enforcement technology to the open-source community and products. His current focus is...

Simon Metson

SVP Engineering, EnterpriseDB

Simon Metson is SVP for EDB’s Hybrid Cloud products. Throughout his career he’s worked on data problems on distributed systems; whether 100's of 1000+ node batch farms for physics experiments processing petabytes of data, first generation Cloud DBaaS products or bringing automation...

Friday April 4, 2025 14:30 - 15:00 BST
Level 1 | Hall Entrance N10 | Room G

Security

Content Experience Level Intermediate

14:30 BST

Fresh Secrets From the Docks: Lessons Learnt From Analyzing 180,000 Public DockerHub Images - Guillaume Valadon, GitGuardian

Friday April 4, 2025 14:30 - 15:00 BST

Level 1 | Hall Entrance S10 | Room D

Hardcoded secrets remain a common practice in containerized environments, often used for convenience during testing or deployment, despite their significant, well-known security risks.

Docker images are not immune and can inadvertently leak secrets through Dockerfiles, configuration files, or image layers. Once pushed to registries such as DockerHub, these secrets become discoverable to attackers, putting environments at risk.

In this session, we will share insights from an extensive analysis of 180,000 public Docker images retrieved from DockerHub, uncovering a staggering number of 35,000 secrets from 18,000 images. More than 6,000 of these secrets were valid when the study was conducted in late 2024, including AWS keys, GCP keys, OpenAI tokens, and GitHub tokens belonging to Fortune 500 companies.

Finally, we will discuss common misuses and pitfalls in Dockerfiles that lead to secrets being leaked, and describe best practices for handling secrets in Docker images.

Speakers

Guillaume Valadon

Staff CyberSecurity Researcher, GitGuardian

Guillaume is a Cybersecurity Researcher at GitGuardian. He holds a PhD in networking. He likes looking at data and crafting packets. He co-maintains Scapy. And he still remembers what AT+MS=V34 means!

Friday April 4, 2025 14:30 - 15:00 BST
Level 1 | Hall Entrance S10 | Room D

Security

Content Experience Level Intermediate

15:15 BST

Green AI in Cloud Native Ecosystems: Strategies for Sustainability and Efficiency - Vincent Caldeira, Red Hat & Tamar Eilam, IBM Research

Friday April 4, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance S10 | Room A

The rapid proliferation of AI is increasing focus on the environmental costs associated with large-scale model training and deployment. As cloud-native technologies form the backbone of modern AI systems, the Cloud Native Computing Foundation (CNCF) is spearheading efforts to balance AI innovation with sustainability. This session will provide an overview of the CNCF effort to identify key areas, techniques, and best practices for energy-efficient AI in cloud-native environments. Attendees will gain insights into a newly developed taxonomy that categorises remediation patterns and sustainable practices across AI lifecycle phases, deployment environments, and personas.

We will also explore real-world applications and discuss reference architectures that provide means to optimise resource use, such as GPU slicing for inference efficiency, power capping during training, and carbon-aware scheduling, while maintaining performance and scalability.

Speakers

Tamar Eilam

IBM Fellow, Chief Scientist Sustainable Computing, IBM Research

Dr. Tamar Eilam is an IBM Fellow and Chief Scientist for Sustainable Computing in the IBM T. J. Watson Research Center, New York. Tamar completed a Ph.D. degree in Computer Science at the Technion, Israel, in 2000. She joined the IBM T.J. Watson Research Center in New York as a Research...

Vincent Caldeira

CTO APAC, Red Hat

Vincent Caldeira, CTO of Red Hat in APAC, is responsible for strategic partnerships and technology strategy. Named a top CTO in APAC in 2023, he has 20+ years in IT, excelling in technology transformation in finance. An authority in open source, cloud computing, and digital transformation...

Friday April 4, 2025 15:15 - 15:45 BST
Level 1 | Hall Entrance S10 | Room A

AI + ML

Content Experience Level Intermediate

15:15 BST

How To Supercharge AI/ML Observability With OpenTelemetry and Fluent Bit - Celalettin Calis, Chronosphere

Friday April 4, 2025 15:15 - 15:45 BST

Level 0 | ICC Capital Hall | Room 2

Keeping AI/ML models performant and reliable in production is no small task—especially when running on Kubernetes. Effective monitoring and observability are key to ensuring these systems deliver results at scale.

This session explores how to build an advanced open source observability stack tailored for AI/ML workloads using Fluent Bit and OpenTelemetry. We’ll cover:

- Logging and debugging popular models like GPT, BERT, and custom LLMs.
- Tracking prompts and their results to gain actionable insights.
- Monitoring agent performance in production environments.

Complementing OpenTelemetry’s robust tracing and error stack trace capabilities with Fluent Bit’s resource-efficient log processing, live tail, and metrics scraping creates a comprehensive observability solution tailored for AI/ML workloads. If you’re an AI/ML practitioner working with Kubernetes, this talk will equip you with the strategies and tools you need to enhance your system’s reliability and performance.

Speakers

Celalettin Calis

Member of Technical Staff, Chronosphere

Celalettin Calis is a Member of Technical Staff at Chronosphere. His career includes significant roles at Calyptia and SAP, where he focused on Kubernetes platform engineering, developing CI/CD pipelines, and managing containerized environments. As a cloud-native expert, he has extensive...

Friday April 4, 2025 15:15 - 15:45 BST
Level 0 | ICC Capital Hall | Room 2

AI + ML

Content Experience Level Intermediate

15:15 BST

Stateful Connections in Kubernetes: The Scaling Secrets Nobody Talks About - André Mocke & Rodrigo Fior Kuntzer, Miro

Friday April 4, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance N10 | Room H

Dive into how Miro scales real-time collaboration with long-living TCP connections at its core. Learn how we built and deployed a custom WebSocket manager in Kubernetes, leveraging connection rebalancing, draining, and graceful shutdown techniques, while maintaining enterprise-level compliance. Discover the k8s operators that made it possible, the design decisions we nailed (and the ones we regretted), and how we tackled unforeseen challenges. This is your backstage pass to engineering the intelligent canvas!

Speakers

Rodrigo Fior Kuntzer

Staff Site Reliability Engineer, Miro

A Software Engineer and Cloud Native Specialist with 20 years of experience, currently serving as Staff Site Reliability Engineer at Miro. Specializing in building high-performance platforms and ensuring system reliability, I leverage extensive experience with Docker, Kubernetes...

André Mocke

Software Engineer, Miro

I'm a full-stack engineer with north of a decade of experience in a variety of industries, from agriculture to finance and, now, multiplayer online games where we get sued if we lose data (Miro). More recently I've taken the opportunity to dive deeper into developing platforms for Infrastructure...

Friday April 4, 2025 15:15 - 15:45 BST
Level 1 | Hall Entrance N10 | Room H

Platform Engineering

Content Experience Level Intermediate

15:15 BST

Taming the Beast: Advanced Resource Management With Kubernetes - Lucy Sweet, Uber & Dawn Chen, Google

Friday April 4, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance N10 | Room F

Are you struggling to optimize resource utilization for demanding workloads like databases?

Kubernetes 1.30 to 1.32 introduced a list of powerful new features to help you tame resource-hungry applications and achieve peak cluster efficiency. In this session, Dawn Chen (Software Engineer at Google & Tech Lead SIG Node) and Lucy Sweet (Software Engineer at Uber) will guide you through the latest advancements in pod resource management, including in-place pod resizing, pod-level resource limits, and node swap memory.

Learn how to leverage these features to reduce infrastructure costs, improve application performance, and prevent resource contention in your clusters. Discover best practices for resource allocation, QoS configuration, and troubleshooting, and get a glimpse into the future of pod resource management in Kubernetes.

Speakers

Dawn Chen

Principal Software Engineer, Google

Dawn Chen is a principal software engineer at Google. Dawn has worked on Kubernetes and Google Container Engine (GKE) since before the project was founded. She has been one of the tech leads in both Kubernetes and GKE. Prior to Kubernetes, she was one of the tech leads for Google internal...

Lucy Sweet

Senior Software Engineer, Uber

Lucy is a Senior Software Engineer at Uber Denmark who works on platform infrastructure.

Friday April 4, 2025 15:15 - 15:45 BST
Level 1 | Hall Entrance N10 | Room F

Platform Engineering

Content Experience Level Intermediate

15:15 BST

EVAPorating Kubernetes Security Risk: Adopting Validating Admission Policy at Scale - Kaitlyn Lee & Jordan Conard, Datadog

Friday April 4, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance S10 | Room D

Is the cost and operational toil of security policy enforcement raining on your parade? Learn how Datadog is simplifying its internal security policies across its dozens of clusters using Validating Admission Policy. We’ll cover our motivations for adopting VAP, detailing its features and contrasts with webhook-based admission controllers, like OPA Gatekeeper.

We will dive into the design of our policy that restricts the use of additional capabilities on containers, sharing tips on Common Expression Language, the use of multiple types of VAP parameters, and how we provide helpful validation error messages to our engineers. Lastly, we will outline our migration from OPA and how we ensure the health and reliability of our API servers by monitoring metrics and validation cost budgets.

Discover VAP’s features, scalable policy design, and our migration insights to help enhance your security posture, streamline policy enforcement, and safeguard your environments against abuse and bypass.

Speakers

Kaitlyn Lee

Software Engineer, Datadog

Kaitlyn Lee is a software engineer at Datadog. She works in the Compute team which is responsible for running the company’s Kubernetes platform. She focuses on workload autoscaling and node lifecycle automation.

Jordan Conard

Security Engineer, Datadog

Jordan joined Datadog in 2022 as a Security Engineer and is currently focused on securing its Kubernetes infrastructure through admission policies and secure-by-default initiatives. Jordan’s decade of industry experience runs the gamut from managing hybrid cloud environments to...

Friday April 4, 2025 15:15 - 15:45 BST
Level 1 | Hall Entrance S10 | Room D

Security

Content Experience Level Intermediate

15:15 BST

From Chaos To Control: Migrating Access Control To OpenFGA in a Multi-Tenant World - Jo Guerreiro, Grafana Labs & Poovamraj Thanganadar Thiagarajan, Okta

Friday April 4, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance N10 | Room G

Designing access control that works seamlessly for individuals and scales to millions of resources is a complex challenge.
From lackluster search performance to feature inconsistency and multi-tenant schema discrepancies, there’s no shortage of issues to face.
Join the Grafana Access squad’s journey through the ups and downs of how we’re tackling these issues using OpenFGA, a CNCF sandbox project, by porting our existing access control schema and rethinking our resource search strategy.
If you’ve ever wondered what it takes as a platform engineer to support access control on a multi-tenant system with millions of resources, this is your opportunity to learn how to orchestrate a migration from your current access control system and hear about the peculiar challenges of developing security-critical systems.

Speakers

Jo Guerreiro

Engineering Manager, Grafana Labs

Jo Guerreiro is a Staff Engineer turned Engineering Manager at Grafana Labs. As part of the Identity and Access team at Grafana, Jo’s focus has been on developing Grafana’s access control system and making it accessible to both users wanting to configure their access rules and...

Poovamraj Thanganadar Thiagarajan

Senior Software Engineer, Okta

Poovamraj Thanganadar Thiagarajan is a Senior Software Engineer at Okta. As part of the FGA team, he focuses on developing resilient infrastructure for FGA projects, including setting up and scaling systems for high-traffic environments. Poovamraj also plays a key role in data-driven... Read More →

Security

Content Experience Level Intermediate

15:15 BST

Why Don’t We Have Both? Track Build- and Run-time Information for Security With Kubescape and GUAC - Jeff Mendoza, Kusari & Ben Hirschberg, ARMO

Friday April 4, 2025 15:15 - 15:45 BST

Level 1 | Hall Entrance S10 | Room B

The best way to secure your software is to know what’s in it. But do you use software bills of materials (SBOMs) at build time or do you scan what’s actually running? Build-time analysis lets you know what’s in your application before you deploy it. Run-time analysis tells you what’s actually in use right now. With GUAC’s Kubescape integration, you can have both.

GUAC, an OpenSSF incubating project, creates a graph database of your supply chain information from many sources and supports querying to derive insights. It now supports collecting cluster scan data from Kubescape, a CNCF sandbox project that provides comprehensive security coverage. Used together, they provide a powerful tool for consuming, storing, managing, and analyzing software supply chain information that reflects what software is used, not just what is compiled into the environment.

Speakers

Ben Hirschberg

Co-founder and CTO, ARMO

Ben is a veteran cybersecurity and DevOps professional, as well as a computer science lecturer. Today, he is a co-founder of ARMO, with a vision of making end-to-end Kubernetes security simple for everyone, and a core maintainer of the open source Kubescape project. He teaches advanced... Read More →

Jeff Mendoza

Software Engineer, Kusari

Jeff is a maintainer of GUAC, an OpenSSF incubating project. Also within the OpenSSF, Jeff is a maintainer of Allstar, a member of the Scorecard steering committee, and a Co-Chair of the Securing Critical Projects WG. As a software engineer at Kusari, he is focused on Open Source, Cloud Native... Read More →

Security

Content Experience Level Intermediate