One GPU, Many Models: What Works and What Segfaults
- Track: AI Plumbers
- Room: UD2.120 (Chavanne)
- Day: Saturday
- Start: 14:10
- End: 14:30
Serving multiple models on a single GPU sounds great until something segfaults.
Two approaches dominate parallel inference on a single NVIDIA GPU: MIG (Multi-Instance GPU, hardware partitioning) and MPS (Multi-Process Service, software sharing). Both promise efficient GPU sharing, both come with trade-offs, and both behave differently depending on GPU architecture.
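To make the distinction concrete, here is a minimal sketch (mine, not from the talk; it assumes the nvidia-ml-py package) that asks NVML whether GPU 0 is in MIG mode and, if so, enumerates the MIG devices it exposes:

```python
# Minimal sketch, assuming the nvidia-ml-py package (import name: pynvml).
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# Raises NVMLError_NotSupported on GPUs without MIG (e.g. consumer cards).
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
print("MIG enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)

if current == pynvml.NVML_DEVICE_MIG_ENABLE:
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except pynvml.NVMLError_NotFound:
            continue  # slot not populated by the current partitioning
        mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
        print(f"MIG device {i}: {mem.total / 2**30:.1f} GiB of isolated memory")

pynvml.nvmlShutdown()
```

MIG gives each instance its own memory and compute slice, which is exactly the isolation boundary this talk probes; MPS leaves the device undivided and relies on cooperative sharing.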
I tested both strategies on Hopper and Blackwell, running diffusion, MoE, and TTS workloads in parallel. Some setups survived. Others didn't.
This talk digs into what actually happened: where memory isolation falls apart, which configs crash, and what survives under load.
By the end, you'll know:
- How to utilize unused GPU capacity.
- How to set up MIG and MPS (a setup sketch follows this list).
- How MIG and MPS behave under actual load.
- Which memory issues, crashes, and failure modes to expect.
- Which configuration best suits your AI workload.
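As a rough preview of that setup (my sketch, not the speaker's exact steps), the snippet below wraps the real tools, nvidia-smi for MIG and nvidia-cuda-mps-control for MPS, in Python; the GPU index and the MIG profile IDs are placeholders you would read from your own card:

```python
# Rough setup sketch (assumptions: root access, an idle data-center GPU 0,
# placeholder MIG profile IDs). nvidia-smi and nvidia-cuda-mps-control are
# the real tools; everything else here is illustrative glue.
import os
import subprocess

def run(cmd: list[str]) -> None:
    """Echo and run a command, raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

MODE = "mig"  # "mig" or "mps": alternatives for a given device

if MODE == "mig":
    # Hardware partitioning: enable MIG mode, inspect the supported
    # instance profiles, then carve the GPU into two instances.
    run(["nvidia-smi", "-i", "0", "-mig", "1"])
    run(["nvidia-smi", "mig", "-lgip"])  # pick real profile IDs from this output
    run(["nvidia-smi", "mig", "-cgi", "9,9", "-C"])  # "9,9" is a placeholder
else:
    # Software sharing: start the MPS control daemon; CUDA processes that
    # see the same pipe directory share the GPU through it.
    os.environ["CUDA_MPS_PIPE_DIRECTORY"] = "/tmp/nvidia-mps"
    run(["nvidia-cuda-mps-control", "-d"])
```

The two modes are alternatives for a given device: you either partition it with MIG or share it whole through MPS (MPS can also run inside a single MIG instance). Which one holds up with diffusion, MoE, and TTS running side by side is what the talk measures.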
Speakers
- Yash Panchal