Supercharging LLM serving with Dynamo
- Track: AI Plumbers
- Room: UD2.120 (Chavanne)
- Day: Saturday
- Start: 15:45
- End: 16:05
The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk will share the key innovations NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations on top of inference engines such as vLLM, SGLang, and TRT-LLM:
- Smart Scheduling that routes requests based on the KV cache hit rate and load, intelligently autoscales, and disaggregates the prefill and decode phases (see the routing sketch after this list).
- Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
- Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
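As a rough illustration of the first bullet, the sketch below shows how KV-cache-aware routing can work in principle: each worker advertises which prompt prefix blocks it already has cached, and the router favors the worker with the highest expected cache hit rate after discounting its current load. This is a minimal, hypothetical sketch; the names (`Worker`, `route`, `BLOCK_SIZE`, the `load_weight` penalty) are illustrative assumptions, not Dynamo's actual API.

```python
# Minimal sketch of KV-cache-aware routing (hypothetical, not Dynamo's API).
# Each worker advertises the hashes of the prefix blocks it has cached and its
# current load; the router scores workers by expected cache hit rate minus a
# load penalty and picks the best one.
from dataclasses import dataclass, field
from typing import List, Set

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

@dataclass
class Worker:
    name: str
    cached_blocks: Set[int] = field(default_factory=set)  # hashes of cached prefix blocks
    active_requests: int = 0

def prefix_block_hashes(token_ids: List[int]) -> List[int]:
    """Hash each prefix block so identical prompt prefixes map to identical hashes."""
    hashes, prefix = [], ()
    for i in range(0, len(token_ids), BLOCK_SIZE):
        prefix = prefix + tuple(token_ids[i:i + BLOCK_SIZE])
        hashes.append(hash(prefix))
    return hashes

def route(request_tokens: List[int], workers: List[Worker],
          load_weight: float = 0.1) -> Worker:
    """Pick the worker with the best cache-overlap / load trade-off."""
    blocks = prefix_block_hashes(request_tokens)
    def score(w: Worker) -> float:
        hit_rate = sum(b in w.cached_blocks for b in blocks) / max(len(blocks), 1)
        return hit_rate - load_weight * w.active_requests
    return max(workers, key=score)

# Example: a request whose prompt prefix is already cached on "worker-a"
# is routed there even though it is slightly busier.
prompt = list(range(48))
a = Worker("worker-a", cached_blocks=set(prefix_block_hashes(prompt)), active_requests=2)
b = Worker("worker-b", active_requests=0)
print(route(prompt, [a, b]).name)  # -> worker-a
```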
This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:
- Find the best configuration for disaggregated serving offline.
- Tune performance automatically based on real-time traffic (see the planner sketch after this list).
- Dynamically scale prefill and decode workers via topology-aware gang scheduling.
- Leverage LLM-specific fault tolerance.
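As a rough companion to the traffic-tuning and scaling bullets, the sketch below shows one way a traffic-driven planner could split GPUs between prefill and decode workers based on observed queue depths. The planner, its `Traffic` input, and the `prefill_cost` weight are illustrative assumptions under this sketch, not Dynamo's actual interface.

```python
# Minimal sketch of traffic-driven scaling for disaggregated serving
# (hypothetical planner, not Dynamo's actual interface). The idea: watch
# prefill and decode queue depths and adjust the worker split so neither
# phase becomes the bottleneck.
from dataclasses import dataclass

@dataclass
class Traffic:
    prefill_queue: int   # requests waiting for prompt processing
    decode_queue: int    # requests waiting for token generation

def plan_workers(traffic: Traffic, total_gpus: int,
                 prefill_cost: float = 2.0) -> dict:
    """Split GPUs between prefill and decode proportionally to pending work.

    prefill_cost models that a prefill step is heavier than a single decode step.
    """
    prefill_work = traffic.prefill_queue * prefill_cost
    decode_work = traffic.decode_queue
    total_work = max(prefill_work + decode_work, 1.0)
    prefill_gpus = max(1, round(total_gpus * prefill_work / total_work))
    prefill_gpus = min(prefill_gpus, total_gpus - 1)  # keep at least one decode worker
    return {"prefill_workers": prefill_gpus,
            "decode_workers": total_gpus - prefill_gpus}

# Example: heavy prompt traffic shifts GPUs toward prefill.
print(plan_workers(Traffic(prefill_queue=40, decode_queue=10), total_gpus=8))
# -> {'prefill_workers': 7, 'decode_workers': 1}
```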
Speakers
- Harry Kim
- Anish Maddipoti