Accelerating vLLM Inference with Quantization and Speculative Decoding
- Track: Open Research
- Room: AW1.120
- Day: Sunday
- Start: 11:00
- End: 11:30
vLLM (https://github.com/vllm-project/vllm) has rapidly become a community-standard open-source engine for LLM inference, backed by a large and growing contributor base and widely adopted for production serving. This talk offers a practical blueprint for scaling inference in vLLM using two complementary techniques: quantization (https://github.com/vllm-project/llm-compressor) and speculative decoding (https://github.com/vllm-project/speculators). Drawing on extensive evaluations across language and vision-language models, we examine the real accuracy–performance trade-offs of each method and, crucially, how they interact in end-to-end deployments. We highlight configurations that substantially cut memory footprint while preserving model quality, and show when the resulting speedups translate best to low-latency versus high-throughput serving. Attendees will leave with data-backed guidance, deployment-ready settings, and a clear roadmap for using quantization and speculative decoding to accelerate vLLM inference in real-world pipelines.
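As a concrete illustration of how the two techniques combine, the sketch below loads a quantized checkpoint and enables draft-model speculative decoding in vLLM. It is a minimal example only, assuming a recent vLLM release that accepts a `speculative_config` dictionary; the model names are illustrative placeholders, not the deployment-ready settings presented in the talk.

```python
from vllm import LLM, SamplingParams

# Minimal sketch (assumptions): a recent vLLM release exposing the
# `speculative_config` dict, and a target checkpoint quantized offline
# with llm-compressor (compressed-tensors format). Model names are
# illustrative placeholders.
llm = LLM(
    # Quantized target model; vLLM picks up the quantization scheme
    # from the checkpoint's config, so no extra flag is needed.
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
    # Draft-model speculative decoding: a small draft model proposes
    # `num_speculative_tokens` tokens per step, which the target model
    # then verifies in a single forward pass.
    speculative_config={
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "num_speculative_tokens": 5,
    },
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Whether a pairing like this pays off more for latency or for throughput depends on draft acceptance rates and batch size, which is exactly the trade-off space the talk maps out.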
Speakers
- Eldar Kurtić