Name: Optimizing Model Serving on Kubernetes With Model Streaming - Ekin Karabulut & Ronen Dar, Run:ai
Start: 2025-04-04T13:45:00+0100
End: 2025-04-04T14:15:00+0100

In-person
1-4 April 2025

Friday April 4, 2025 13:45 - 14:15 BST

Level 0 | ICC Capital Hall | Room 2

Deploying large language models in Kubernetes environments faces a critical challenge: the cold start problem. When auto-scaling workloads with tools like Knative, the latency from loading large model weights into GPU memory slows response times, degrades performance, and increases costs. Traditional methods rely on loading weights sequentially into CPU memory and then to the GPU, which is slow and inefficient. This talk introduces Run:ai Model Streamer, an open-source tool that mitigates cold starts by streaming model weights to GPU memory while reading them from storage in parallel. It integrates seamlessly into inference engine containers and Kubernetes workflows, enabling parallelized weight streaming without modifying weight formats, making it an easy-to-adopt solution for Kubernetes-based AI deployments. We’ll share benchmarking results comparing storage backends like GP3 SSDs, IO2 SSDs, and S3, highlighting performance improvements, cost savings, and best practices from these experiments.
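The idea of overlapping storage reads with the copy to the device can be illustrated with a minimal sketch. This is not the Run:ai Model Streamer API; the function names (`read_chunk`, `stream_weights`), the chunk size, and the use of a plain `bytearray` as a stand-in for GPU memory are all assumptions made for illustration. Several reader threads fetch byte ranges concurrently while a single copier thread moves each chunk to its destination as soon as it arrives, rather than waiting for the whole file to land in CPU memory first.

```python
import os
import queue
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB; hypothetical tuning knob, not taken from the talk


def read_chunk(path, offset, size):
    """Read one byte range from storage (stand-in for a disk or object-store read)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return offset, f.read(size)


def stream_weights(path, device_buffer, num_readers=4, chunk_size=CHUNK_SIZE):
    """Read `path` with several concurrent readers and copy each chunk into
    `device_buffer` as soon as it arrives. Returns the number of bytes copied."""
    total = os.path.getsize(path)
    # Bounded queue: caps how much data sits in host memory at any moment.
    chunks = queue.Queue(maxsize=2 * num_readers)

    def producer(offset):
        chunks.put(read_chunk(path, offset, min(chunk_size, total - offset)))

    def consumer():
        # Stand-in for the host-to-GPU copy; a real loader would write
        # into device memory here instead of a bytearray.
        copied = 0
        while copied < total:
            offset, data = chunks.get()
            device_buffer[offset:offset + len(data)] = data
            copied += len(data)

    copier = threading.Thread(target=consumer, daemon=True)
    copier.start()
    with ThreadPoolExecutor(max_workers=num_readers) as pool:
        # Reads run in parallel; the copier drains the queue concurrently.
        list(pool.map(producer, range(0, total, chunk_size)))
    copier.join()
    return total


if __name__ == "__main__":
    payload = os.urandom(5 * CHUNK_SIZE + 123)  # fake "weights" file
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        gpu_stand_in = bytearray(len(payload))
        stream_weights(tmp.name, gpu_stand_in)
        assert bytes(gpu_stand_in) == payload
    finally:
        os.remove(tmp.name)
```

Because chunks carry their own offsets, they can arrive in any order and the on-disk weight format does not need to change, which mirrors the abstract's point about leaving weight formats unmodified. The bounded queue is a design choice of this sketch: it keeps only a few chunks in host memory at once.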

Speakers

Ekin Karabulut

Data Scientist & Developer Advocate, Run:ai

Ekin is a data scientist at Run:ai. She specialized in the privacy implications of federated learning with DNNs. Through her journey, she focused on distributed training techniques and observed inefficiencies in GPU usage both in research and industry settings. She thus established...

Ronen Dar

CTO and Co-Founder, Run:ai

Ronen Dar, PhD, is the co-founder and CTO of Run:ai. Ronen has been responsible for building the Run:ai Atlas platform and the technology that powers the platform, from GPU API-level virtualization to advanced K8s-based scheduling capabilities.

AI + ML

Content Experience Level: Any

KubeCon + CloudNativeCon Europe 2025
