Dedup for S3: Smarter Storage, Zero Duplicates
- Track: Software Defined Storage
- Room: UB4.136
- Day: Saturday
- Start: 18:05
- End: 18:35
- Video only: ub4136
Modern S3 workloads, from backup chains to model checkpoints, generate large amounts of duplicate data that quietly consume petabytes of capacity. Ceph’s new S3 data deduplication feature addresses this by splitting objects into chunks, identifying identical chunks through cryptographic hashing, storing each unique chunk only once, and tracking references with a lightweight dedup index.
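To make the idea concrete, here is a minimal sketch of chunk-level deduplication in Python. It is an illustration only, not Ceph RGW code: the `ChunkStore` class, the fixed 4-byte chunk size, and the use of SHA-256 as the fingerprint are all assumptions chosen to keep the example small.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; tiny on purpose so the demo is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}     # fingerprint -> chunk bytes
        self.manifests = {}  # object key -> ordered list of fingerprints

    def put(self, key: str, data: bytes) -> int:
        """Store an object and return how many new chunks were written."""
        fingerprints = []
        new_chunks = 0
        for chunk in split_chunks(data):
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:  # unseen content: store it
                self.chunks[fp] = chunk
                new_chunks += 1
            fingerprints.append(fp)    # seen or not, the object references it
        self.manifests[key] = fingerprints
        return new_chunks

    def get(self, key: str) -> bytes:
        """Rebuild the object from its chunk list."""
        return b"".join(self.chunks[fp] for fp in self.manifests[key])


store = ChunkStore()
first = store.put("backup-1", b"AAAABBBBCCCC")   # three new chunks
second = store.put("backup-2", b"AAAABBBBDDDD")  # only "DDDD" is new
```

The second object shares two of its three chunks with the first, so only one additional chunk is stored, and both objects still read back unchanged.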
This talk explains how dedup works inside Ceph RGW: how chunks are created, how reference counts (refcounts) stay consistent under parallel writes, versioning, and deletes, and how the system avoids corruption using atomic metadata updates and safe garbage collection. We’ll also share early performance insights from large-scale tests and show how dedup can significantly reduce capacity usage, I/O, and network overhead, all without requiring any changes to S3 applications.
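The refcount and garbage-collection concerns can be sketched in a few lines of Python. This is a simplified, single-process illustration under stated assumptions, not how Ceph RGW implements it: the `RefIndex` class is hypothetical, and a `threading.Lock` stands in for whatever atomic metadata update the real system uses.

```python
import threading


class RefIndex:
    """Toy refcount index with deferred (two-step) garbage collection."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for an atomic metadata update
        self._refs = {}                # fingerprint -> reference count
        self._pending_gc = set()       # fingerprints that dropped to zero

    def add_ref(self, fp: str) -> None:
        with self._lock:
            self._refs[fp] = self._refs.get(fp, 0) + 1
            self._pending_gc.discard(fp)  # chunk was revived before GC ran

    def drop_ref(self, fp: str) -> None:
        with self._lock:
            self._refs[fp] -= 1
            if self._refs[fp] == 0:
                self._pending_gc.add(fp)  # do not delete yet, only mark

    def count(self, fp: str) -> int:
        with self._lock:
            return self._refs.get(fp, 0)

    def collect(self) -> set:
        """Remove chunks that are still unreferenced; return what was removed."""
        with self._lock:
            dead = {fp for fp in self._pending_gc if self._refs.get(fp, 0) == 0}
            for fp in dead:
                del self._refs[fp]
            self._pending_gc.clear()
            return dead


index = RefIndex()


def writer():
    for _ in range(1000):
        index.add_ref("chunk-a")


# Eight parallel writers reference the same chunk.
threads = [threading.Thread(target=writer) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = index.count("chunk-a")

index.add_ref("chunk-b")
index.drop_ref("chunk-b")    # count hits zero: marked, not deleted
index.add_ref("chunk-b")     # a new write references it again
revived = index.collect()    # nothing is removed
index.drop_ref("chunk-b")
collected = index.collect()  # now the chunk is safe to remove
```

Separating "mark as unreferenced" from "actually delete" is what lets a concurrent write re-reference a chunk without racing against its removal.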
If you're interested in building efficient, scalable, open-source object storage, this session shows how Ceph makes S3 smarter with zero duplicates.
Speakers
- Vidushi Mishra