In-person
19-22 March

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is displayed in Central European Standard Time (UTC +1). The schedule is subject to change and session seating is available on a first-come, first-served basis.
Wednesday, March 20 • 09:25 - 09:40
Keynote: Accelerating AI Workloads with GPUs in Kubernetes - Kevin Klues, Distinguished Engineer & Sanjay Chatterjee, Engineering Manager, NVIDIA

As AI and machine learning become ubiquitous, GPU acceleration is essential for model training and inference at scale. However, effectively leveraging GPUs in Kubernetes brings challenges around efficiency, configuration, extensibility, and scalability.

This talk provides an overview of the capabilities needed to address these challenges, enabling seamless support for next-generation AI applications on Kubernetes.

- GPU resource-sharing mechanisms such as MPS (Multi-Process Service), Time-Slicing, MIG (Multi-Instance GPU), and GPU virtualization

- Flexible accelerator configuration using the traditional device plugin and the upcoming Dynamic Resource Allocation (DRA) feature

- Advanced scheduling and resource management techniques, including gang scheduling, topology awareness, fault tolerance, and more

- Key learnings (and areas for improvement) necessary to scale multi-node AI/ML jobs in large production clusters

Some of these capabilities are supported today; others are not yet. By addressing the remaining challenges, Kubernetes is poised to emerge as the go-to platform for accelerated AI/ML in the cloud, mirroring Linux's pervasive dominance in the datacenter.

Speakers

Kevin Klues

Distinguished Engineer, NVIDIA
Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA...

Sanjay Chatterjee

Engineering Manager, NVIDIA
Sanjay Chatterjee is an engineering manager at NVIDIA. He works on GPU compute infrastructure with a focus on GPU scheduling to enable DL/AI and HPC workloads to scale on Kubernetes. Previously he worked on multiple DoE/DARPA-funded advanced technology projects towards designing the...



Wednesday March 20, 2024 09:25 - 09:40 CET
Pavilion 7 | Level 7.3 | Paris Room