One GPU, Many Models: What Works and What Segfaults
- Track: AI Plumbers
- Room: UD2.120 (Chavanne)
- Day: Saturday
- Start: 13:55
- End: 14:15
Serving multiple models on a single GPU sounds great until something segfaults.
Two approaches dominate for parallel inference: MIG (Multi-Instance GPU, hardware partitioning) and MPS (Multi-Process Service, software sharing). Both promise efficient GPU sharing.
I tested both strategies for running different AI workloads in parallel.
This talk digs into what actually happened: where things worked, where memory isolation fell apart, which configs crashed, and what survived under load.
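To make the distinction concrete (an illustrative sketch, not material from the talk): with MIG enabled, each partition shows up as a separate device with its own memory slice. A minimal Python check using the NVIDIA pynvml bindings, assuming a MIG-capable GPU at index 0:

```python
# Minimal sketch: inspect MIG state with pynvml (pip install nvidia-ml-py).
# Assumes a MIG-capable GPU at index 0.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# Returns (current, pending); a pending mode takes effect after a GPU reset.
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
print(f"MIG mode: current={current} pending={pending}")

if current == pynvml.NVML_DEVICE_MIG_ENABLE:
    # Each MIG instance is its own device with a dedicated memory slice.
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except pynvml.NVMLError:
            continue  # unpopulated slot
        mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
        print(f"MIG device {i}: {mem.total // 2**20} MiB total")

pynvml.nvmlShutdown()
```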
By the end, you'll know:
- How to use idle GPU capacity.
- How to set up MIG and MPS (a rough setup sketch follows this list).
- How MIG and MPS behave under load.
- Which memory issues, crashes, and failures to expect.
- Which config best suits your AI workload.
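For a taste of the MPS setup step (again an illustrative sketch, not the speaker's exact configuration): the MPS daemon is started once, and client processes attach to it via environment variables. `serve_model.py` and the model names below are placeholders for your own serving stack:

```python
# Minimal sketch: share one GPU between two model servers via MPS.
import os
import subprocess

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"
env["CUDA_MPS_PIPE_DIRECTORY"] = "/tmp/nvidia-mps"     # client/daemon rendezvous
env["CUDA_MPS_LOG_DIRECTORY"] = "/tmp/nvidia-mps-log"

# Start the MPS control daemon in the background (-d).
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

# Soft-cap each client's share of SMs; unlike MIG, this is not isolation.
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "50"

# Both processes attach to the same daemon and share the GPU.
procs = [subprocess.Popen(["python", "serve_model.py", "--model", m], env=env)
         for m in ("model-a", "model-b")]
for p in procs:
    p.wait()

# Tell the daemon to shut down.
subprocess.run(["nvidia-cuda-mps-control"], input=b"quit\n", env=env, check=True)
```

Note that the thread-percentage cap only limits compute share, and error containment between MPS clients is weaker than MIG's hardware partitioning, which is exactly where the failure stories in this talk come in.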
Speakers
- Yash Panchal