In-person
19-22 March

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Central European Standard Time (UTC +1). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Wednesday, March 20 • 11:15 - 11:50
Advanced Resource Management for Running AI/ML Workloads with Kueue - Michał Woźniak, Google & Yuki Iwai, CyberAgent, Inc.



Kueue is a Job-level queueing manager that addresses the challenges of managing computational resources for batch workloads on Kubernetes. We walk you through its architecture and demonstrate how it can be used to set up quota- and priority-based sharing of resources between multiple teams. We describe how Kueue's scheduler decides when to start or stop (preempt) a job. We showcase Kueue through its production use at CyberAgent, where it is a building block of a multi-tenant system that supports multiple engineers and ML research teams and uses multiple types of CPUs and GPUs. There, Kueue manages various types of Jobs (batch Job, MPIJob, or in-house Jobs) that use various ML frameworks (TensorFlow, PyTorch, or DeepSpeed). Finally, we discuss the challenge of running ML training jobs that require all of their pods to be scheduled. We show how this is solved with Kueue at CyberAgent, and how it can be solved with Kueue in autoscaling environments using the new ProvisioningRequest API.
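
To make the ideas of quota, priority-based preemption, and all-or-nothing admission in the abstract more concrete, here is a minimal toy sketch in Python. It is not Kueue's implementation or API: the names ToyQueue, Job, and admit are hypothetical, the model tracks a single resource (GPUs) in a single queue, and the victim-selection rule (lowest priority first) is an assumption made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    team: str
    gpus: int      # GPUs needed by all pods of the job combined
    priority: int  # higher value means more important


@dataclass
class ToyQueue:
    """Hypothetical single-resource queue; not Kueue's real data model."""
    quota: int                                   # total GPUs in the quota
    running: list = field(default_factory=list)  # currently admitted jobs

    def used(self) -> int:
        return sum(job.gpus for job in self.running)

    def admit(self, job: Job) -> list:
        """Admit `job` all-or-nothing and return the jobs preempted for it.

        Raises ValueError, leaving the queue unchanged, if the job does not
        fit even after preempting every strictly lower-priority job.
        """
        free = self.quota - self.used()
        # Candidate victims: strictly lower priority, lowest priority first.
        candidates = sorted(
            (j for j in self.running if j.priority < job.priority),
            key=lambda j: j.priority,
        )
        victims = []
        for victim in candidates:
            if free >= job.gpus:
                break
            victims.append(victim)
            free += victim.gpus
        if free < job.gpus:
            raise ValueError(f"{job.name} does not fit; it stays pending")
        for victim in victims:
            self.running.remove(victim)
        self.running.append(job)
        return victims


if __name__ == "__main__":
    queue = ToyQueue(quota=8)
    queue.admit(Job("research-sweep", "research", gpus=6, priority=1))
    preempted = queue.admit(Job("prod-training", "ml-eng", gpus=4, priority=10))
    print([j.name for j in preempted])  # ['research-sweep']
```

The job is admitted only if its whole GPU request fits, which mirrors the "all pods must be scheduled" requirement in a very simplified way; a real system also has to handle multiple resource types, sharing between teams, and cluster autoscaling.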

Speakers

Michał Woźniak

Software Engineer, Google
Michał is a software engineer with a background in computer science, a PhD in computational biology, and 5+ years of professional experience. In his current role, he focuses on enhancing the support for batch workloads in the Kubernetes ecosystem. Outside of work he enjoys playing...

Yuki Iwai

Software Engineer, CyberAgent, Inc.
Yuki is a Software Engineer at CyberAgent, Inc. He works on an internal platform for machine-learning applications and high-performance computing. He is currently a maintainer of some Kubeflow WG AutoML / Training sub-projects. He is also a WG Batch member and a Kubernetes' Kueue...



Wednesday March 20, 2024 11:15 - 11:50 CET
Pavilion 7 | Level 7.3 | S02