GPUStack: Building a Simple and Scalable Management Experience for Diverse AI Models
- Track: Low-level AI Engineering and Hacking
- Room: UB2.252A (Lameere)
- Day: Sunday
- Start: 12:20
- End: 12:40
Outstanding tools like llama.cpp, Ollama, and LM Studio have made life significantly easier for developers: running large language models (LLMs) on a laptop has become remarkably convenient. However, inference engines and their wrappers don't address the following challenges:
1. Scaling your solution as your team grows.
2. Supporting models beyond LLMs, such as diffusion models for role-playing applications, TTS models for NotebookLM equivalents, rerankers and embedding models for retrieval-augmented generation (RAG), and more.
Today, both models and inference engines are highly diverse and rapidly evolving, while GPU resources remain fragmented and heterogeneous. In this talk, we will share our experience building GPUStack — a platform designed to help developers abstract away these complexities and focus solely on building APIs for AI applications.
Speakers
Lawrence Li