Multimodal support in llama.cpp - Achievements and Future Directions
- Track: AI Plumbers
- Room: UD2.120 (Chavanne)
- Day: Saturday
- Start: 10:35
- End: 10:55
llama.cpp has become a key tool for running LLMs efficiently on a wide range of hardware. This talk explores how multimodal features have grown within the project, focusing on libmtmd, a library added in April 2025 that makes multimodal support in llama.cpp easier to use and maintain.
We will first cover the main achievements, including the consolidation of separate per-model CLI tools into a single tool, llama-mtmd-cli. Next, we will discuss how libmtmd integrates with llama-server and show real examples of low-latency OCR applications. We will also cover the addition of audio support, which lets newer models summarize audio inputs, as well as the challenges of handling legacy code while keeping the project flexible for future models.
Looking ahead, the talk will share plans for new features such as video input, text-to-speech support, and image generation. Attendees will also learn how to contribute to these multimodal tools and how to use them in their own projects.
Speakers
- Xuan-Son Nguyen