Supercharging LLM serving with Dynamo
- Track: AI Plumbers
- Room: UD2.120 (Chavanne)
- Day: Saturday
- Start: 15:45
- End: 16:05
The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk will share the key innovations NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations on top of inference engines such as vLLM, SGLang, and TRT-LLM:
- Smart Scheduling that routes requests based on the KV cache hit rate and load, intelligently autoscales, and disaggregates the prefill and decode phases (see the routing sketch after this list).
- Hierarchical Memory Management that utilizes HBM, host memory, local disk, and remote storage.
- Low-Latency Transfer of the KV cache across nodes and the memory hierarchy.
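As a rough illustration of the first bullet, the sketch below shows how KV-cache-aware routing can work in principle: each worker advertises which prompt prefix blocks it already has cached, and the router favors the worker with the highest expected cache hit rate after discounting its current load. This is a minimal, hypothetical sketch; the names (`Worker`, `route`, `BLOCK_SIZE`, the `load_weight` penalty) are illustrative assumptions, not Dynamo's actual API.

```python
# Minimal sketch of KV-cache-aware routing (hypothetical, not Dynamo's API).
# Each worker advertises the hashes of the prefix blocks it has cached and its
# current load; the router scores workers by expected cache hit rate minus a
# load penalty and picks the best one.
from dataclasses import dataclass, field
from typing import List, Set

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

@dataclass
class Worker:
    name: str
    cached_blocks: Set[int] = field(default_factory=set)  # hashes of cached prefix blocks
    active_requests: int = 0

def prefix_block_hashes(token_ids: List[int]) -> List[int]:
    """Hash each prefix block so identical prompt prefixes map to identical hashes."""
    hashes, prefix = [], ()
    for i in range(0, len(token_ids), BLOCK_SIZE):
        prefix = prefix + tuple(token_ids[i:i + BLOCK_SIZE])
        hashes.append(hash(prefix))
    return hashes

def route(request_tokens: List[int], workers: List[Worker],
          load_weight: float = 0.1) -> Worker:
    """Pick the worker with the best cache-overlap / load trade-off."""
    blocks = prefix_block_hashes(request_tokens)
    def score(w: Worker) -> float:
        hit_rate = sum(b in w.cached_blocks for b in blocks) / max(len(blocks), 1)
        return hit_rate - load_weight * w.active_requests
    return max(workers, key=score)

# Example: a request whose prompt prefix is already cached on "worker-a"
# is routed there even though it is slightly busier.
prompt = list(range(48))
a = Worker("worker-a", cached_blocks=set(prefix_block_hashes(prompt)), active_requests=2)
b = Worker("worker-b", active_requests=0)
print(route(prompt, [a, b]).name)  # -> worker-a
```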
This talk will also introduce production-grade LLM serving features of Dynamo that enable users to:
- Find the best configuration for disaggregated serving offline.
- Tune performance automatically based on real-time traffic (see the planner sketch after this list).
- Dynamically scale prefill and decode workers via topology-aware gang scheduling.
- Leverage LLM-specific fault tolerance.
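As a rough companion to the traffic-tuning and scaling bullets, the sketch below shows one way a traffic-driven planner could split GPUs between prefill and decode workers based on observed queue depths. The planner, its `Traffic` input, and the `prefill_cost` weight are illustrative assumptions under this sketch, not Dynamo's actual interface.

```python
# Minimal sketch of traffic-driven scaling for disaggregated serving
# (hypothetical planner, not Dynamo's actual interface). The idea: watch
# prefill and decode queue depths and adjust the worker split so neither
# phase becomes the bottleneck.
from dataclasses import dataclass

@dataclass
class Traffic:
    prefill_queue: int   # requests waiting for prompt processing
    decode_queue: int    # requests waiting for token generation

def plan_workers(traffic: Traffic, total_gpus: int,
                 prefill_cost: float = 2.0) -> dict:
    """Split GPUs between prefill and decode proportionally to pending work.

    prefill_cost models that a prefill step is heavier than a single decode step.
    """
    prefill_work = traffic.prefill_queue * prefill_cost
    decode_work = traffic.decode_queue
    total_work = max(prefill_work + decode_work, 1.0)
    prefill_gpus = max(1, round(total_gpus * prefill_work / total_work))
    prefill_gpus = min(prefill_gpus, total_gpus - 1)  # keep at least one decode worker
    return {"prefill_workers": prefill_gpus,
            "decode_workers": total_gpus - prefill_gpus}

# Example: heavy prompt traffic shifts GPUs toward prefill.
print(plan_workers(Traffic(prefill_queue=40, decode_queue=10), total_gpus=8))
# -> {'prefill_workers': 7, 'decode_workers': 1}
```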
Speakers
- Harry Kim
- Anish Maddipoti