Accelerating vLLM Inference with Quantization and Speculative Decoding
- Track: Open Research
- Room: AW1.120
- Day: Sunday
- Start: 11:00
- End: 11:30
vLLM (https://github.com/vllm-project/vllm) has rapidly become a community-standard open-source engine for LLM inference, backed by a large and growing contributor base and widely adopted for production serving. This talk offers a practical blueprint for scaling inference in vLLM using two complementary techniques: quantization (https://github.com/vllm-project/llm-compressor) and speculative decoding (https://github.com/vllm-project/speculators). Drawing on extensive evaluations across language and vision-language models, we examine the real accuracy–performance trade-offs of each method and, crucially, how they interact in end-to-end deployments. We highlight configurations that substantially cut memory footprint while preserving model quality, and show when the resulting speedups translate best to low-latency versus high-throughput serving. Attendees will leave with data-backed guidance, deployment-ready settings, and a clear roadmap for using quantization and speculative decoding to accelerate vLLM inference in real-world pipelines.
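As a concrete illustration of how the two techniques combine, the sketch below loads a quantized checkpoint and enables draft-model speculative decoding in vLLM. It is a minimal example only, assuming a recent vLLM release that accepts a `speculative_config` dictionary; the model names are illustrative placeholders, not the deployment-ready settings presented in the talk.

```python
from vllm import LLM, SamplingParams

# Minimal sketch (assumptions): a recent vLLM release exposing the
# `speculative_config` dict, and a target checkpoint quantized offline
# with llm-compressor (compressed-tensors format). Model names are
# illustrative placeholders.
llm = LLM(
    # Quantized target model; vLLM picks up the quantization scheme
    # from the checkpoint's config, so no extra flag is needed.
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
    # Draft-model speculative decoding: a small draft model proposes
    # `num_speculative_tokens` tokens per step, which the target model
    # then verifies in a single forward pass.
    speculative_config={
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "num_speculative_tokens": 5,
    },
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Whether a pairing like this pays off more for latency or for throughput depends on draft acceptance rates and batch size, which is exactly the trade-off space the talk maps out.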
Speakers
- Eldar Kurtić