Quantizing your GGUF models using iterative refinement of the importance matrix
- Track: Low-level AI Engineering and Hacking
- Room: UB2.252A (Lameere)
- Day: Sunday
- Start: 11:00
- End: 11:20
Presenting Llama-gguf-optimize, the result of work and research into creating high-quality quantizations for multilingual models, specifically the Salamandra series. With a focus on preserving language diversity, the project leverages llama.cpp’s importance matrix approach to minimize quantization loss across distinct language domains. This presentation outlines the toolkit’s scripts as a systematic approach to quantizing your models through iterative refinement of the importance matrix (I-matrix) and to evaluating quantization quality with KL-divergence metrics.
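The KL-divergence evaluation the abstract mentions compares the quantized model's output distribution against the full-precision model's, token position by token position (llama.cpp exposes this kind of comparison through its perplexity tooling). As a minimal sketch of the metric itself, independent of any particular toolkit, the following computes the mean KL divergence between two sets of per-token logits; the toy logit values are invented for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats, where P is the reference (full-precision) model."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy per-token logits: baseline = full-precision model, quantized = candidate.
baseline  = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quantized = [[1.9, 1.1, 0.2], [0.6, 2.4, 0.2]]

mean_kld = sum(kl_divergence(b, q)
               for b, q in zip(baseline, quantized)) / len(baseline)
print(f"mean KL divergence: {mean_kld:.4f} nats")
```

A lower mean KL divergence means the quantized model's token distributions stay closer to the original's, which is a more direct quality signal than perplexity alone when comparing quantization schemes across language domains.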
Speakers
Robert Collins