Adventures in Model Quantization
- Track: AI Plumbers
- Room: UD2.120 (Chavanne)
- Day: Saturday
- Start: 14:30
- End: 14:50
- Video only: ud2120
"Adventures in Model Quantization" continues the quest to run high-quality models with minimal hardware resources. In this edition, community quantizer John Leimgruber ("ubergarm" on Hugging Face) tells the story of how a single-line change to llama.cpp enabled the 1000B open-weights model Kimi-K2-Thinking to maintain full quality while using only half the memory!
This talk presents an overview and visualizations of llama.cpp quantization types and discusses how Quantization Aware Training (QAT) affects mapping models across ecosystems, from transformers' safetensors into llama.cpp GGUF.
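To make the idea of a llama.cpp quantization type concrete, here is a simplified sketch (not the actual llama.cpp implementation, which is in C and packs blocks into binary structs) of blockwise symmetric int8 quantization in the style of Q8_0: each block of 32 weights stores one fp16 scale and 32 int8 quants.

```python
import numpy as np

QK8_0 = 32  # Q8_0 block size: 32 weights per block

def quantize_q8_0(x: np.ndarray):
    """Blockwise symmetric int8 quantization, Q8_0-style.

    Each block of 32 float weights gets one scale d = max(|x|) / 127
    and 32 int8 quants q; the block is reconstructed as d * q.
    """
    blocks = x.reshape(-1, QK8_0)
    d = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    d[d == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / d).astype(np.int8)
    return d.astype(np.float16), q

def dequantize_q8_0(d: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Reconstruct the float weights from scales and quants."""
    return (d.astype(np.float32) * q.astype(np.float32)).reshape(-1)

# Round-trip a random weight row and measure the quantization error
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
d, q = quantize_q8_0(w)
err = np.abs(dequantize_q8_0(d, q) - w).max()
print(f"max abs error: {err:.4f}")
```

Lower-bit types such as Q4_K trade a smaller per-weight footprint for coarser scales and sub-block structure, which is where the quality/size trade-offs discussed in the talk come from.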
If you're interested in running the best open-weights LLMs and AI models on gaming rigs, home-lab servers, or privately for your organization, then come learn how to benchmark both the quality and speed of all the Hugging Face quants available for ik/llama.cpp.
This is an updated presentation expanding upon a recent AI Plumbers talk given in October 2025 in San Francisco:
- https://blog.aifoundry.org/p/adventures-in-model-quantization
- https://ubergarm.com/images/AI-Plumbers-Conference-2025-SF.pdf
Speakers
- ubergarm