GPU Virtualization with MIG: Multi-Tenant Isolation for AI Inference Workloads
- Track: Virtualization and Cloud Infrastructure
- Room: H.2213
- Day: Saturday
- Start: 18:00
- End: 18:30
Serving large video diffusion models to many concurrent users sounds daunting until you partition a GPU correctly.
This talk is a deep technical exploration of running large-scale video-generation inference on modern NVIDIA GPUs, from Hopper to Blackwell, using Multi-Instance GPU (MIG) isolation.
We'll explore:
- GPU MIG topology: Memory hierarchy, interconnect partitioning, and leveraging high-bandwidth memory effectively
- Memory profiling for inference: Tracking GPU memory allocation across the generation pipeline
- MIG profile selection: Choosing partition sizes—when isolation beats raw throughput
- Request scheduling: Fair queuing for heterogeneous workloads and batch sizes
- Failure modes: OOM recovery, MIG instance health checks, and graceful degradation strategies
- Monitoring at scale: Per-instance GPU metrics and detecting performance bottlenecks
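To make the profile-selection trade-off above concrete, here is a minimal sketch of choosing the tightest-fitting MIG slice for a model's memory footprint. The profile table lists the standard A100 40GB MIG profiles; the `pick_profile` helper and any specific memory figures are illustrative assumptions, not material from the talk (other GPUs expose different tables, which `nvidia-smi mig -lgip` reports).

```python
# Standard A100 40GB MIG profiles: name -> (memory in GiB, max instances per GPU).
# Hopper and Blackwell parts have their own tables; query `nvidia-smi mig -lgip`.
A100_40GB_PROFILES = {
    "1g.5gb":  (5, 7),
    "2g.10gb": (10, 3),
    "3g.20gb": (20, 2),
    "4g.20gb": (20, 1),
    "7g.40gb": (40, 1),
}

def pick_profile(required_gib: float, profiles=A100_40GB_PROFILES) -> str:
    """Return the smallest profile whose memory covers the requirement.

    Smaller profiles mean more isolated tenants per GPU; larger ones
    trade tenant count for raw throughput on a single workload.
    """
    # Walk profiles from smallest to largest so we find the tightest fit first.
    for name, (mem_gib, _max_count) in sorted(profiles.items(),
                                              key=lambda kv: kv[1][0]):
        if mem_gib >= required_gib:
            return name
    raise ValueError(f"No single MIG profile fits {required_gib} GiB")

print(pick_profile(8))   # an 8 GiB working set fits in a 2g.10gb slice
print(pick_profile(35))  # only the full 7g.40gb profile fits
```

The "isolation beats raw throughput" question then becomes: seven 1g.5gb tenants with hard memory fences, or one 7g.40gb instance running larger batches.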
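The fair-queuing point can be illustrated with a per-tenant round-robin scheduler. This is a sketch of the general technique, not the speaker's implementation; the class and tenant names are hypothetical.

```python
from collections import OrderedDict, deque

class FairQueue:
    """Per-tenant round-robin queue: each tenant gets its own FIFO, and
    dequeue cycles across tenants so one tenant's backlog cannot starve
    the others (a minimal sketch of fair queuing)."""

    def __init__(self):
        self._queues = OrderedDict()  # tenant -> deque of pending requests
        self._cycle = deque()         # rotation of tenants with pending work

    def submit(self, tenant: str, request):
        if tenant not in self._queues:
            self._queues[tenant] = deque()
            self._cycle.append(tenant)
        self._queues[tenant].append(request)

    def next_request(self):
        """Pop one request from the next tenant in the rotation."""
        if not self._cycle:
            return None
        tenant = self._cycle.popleft()
        queue = self._queues[tenant]
        request = queue.popleft()
        if queue:                      # tenant still has work: back of the line
            self._cycle.append(tenant)
        else:
            del self._queues[tenant]
        return tenant, request

# Tenant A floods the queue; tenant B is still served in the next slot.
fq = FairQueue()
for i in range(3):
    fq.submit("A", f"a{i}")
fq.submit("B", "b0")
order = [fq.next_request() for _ in range(4)]
print(order)  # [('A', 'a0'), ('B', 'b0'), ('A', 'a1'), ('A', 'a2')]
```

A production scheduler would weight the rotation by batch cost (e.g. deficit round-robin) so heterogeneous batch sizes still share GPU time fairly.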
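One common shape of the OOM-recovery strategy mentioned above is retry-with-reduced-batch. The sketch below assumes a hypothetical `run_batch` callable and uses a plain exception class as a stand-in for a real allocator error (e.g. `torch.cuda.OutOfMemoryError` in a PyTorch stack).

```python
class GPUOutOfMemory(RuntimeError):
    """Stand-in for a real allocator OOM such as torch.cuda.OutOfMemoryError."""

def run_with_degradation(run_batch, requests, min_batch=1):
    """Run `requests` through `run_batch`, halving the batch size on OOM.

    `run_batch` is a hypothetical callable taking a list of requests;
    a real service would also clear the allocator cache between retries.
    """
    results, batch_size = [], len(requests)
    i = 0
    while i < len(requests):
        chunk = requests[i:i + batch_size]
        try:
            results.extend(run_batch(chunk))
            i += len(chunk)
        except GPUOutOfMemory:
            if batch_size <= min_batch:
                raise  # cannot shrink further: surface the failure
            batch_size = max(min_batch, batch_size // 2)
    return results

# Fake backend that OOMs on batches larger than 2, to exercise the retry path.
def fake_run_batch(chunk):
    if len(chunk) > 2:
        raise GPUOutOfMemory(f"batch of {len(chunk)} too large")
    return [f"done:{r}" for r in chunk]

print(run_with_degradation(fake_run_batch, ["r1", "r2", "r3", "r4", "r5"]))
```

The same loop gives graceful degradation across MIG instances: when one slice keeps OOMing at `min_batch`, the error propagates and a health check can drain and recreate that instance.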
Whether you're building a multi-tenant inference platform, optimizing GPU utilization for your team, or exploring how to serve video diffusion models cost-effectively, this talk provides practical configurations for your AI workloads.
Speakers
- Yash Panchal