Friday afternoon. Load on our Frankfurt S3 cluster was climbing—organic traffic, normal usage patterns. We decided to deploy additional RadosGW instances to spread the load. The new instances came up with incomplete configuration. Starter tier customers lost access to objects they uploaded during a 26-minute window. No data was lost, but full remediation took around 24 hours.
Background
ZERO-Z3 has two storage classes: HDD-backed (100–200ms latency, lower cost) and NVMe-backed (2–10ms latency, higher cost). Starter tier customers only have access to HDD storage.
Some S3 clients only accept STANDARD as a storage class. For compatibility, we map requests from Starter customers so that STANDARD routes to HDD storage.
This mapping is configured per account in RadosGW. If that configuration isn't present, requests go to actual NVMe storage.
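The mapping, and the failure mode when it is missing, can be pictured as a small lookup. This is an illustrative sketch, not our actual RadosGW configuration; the pool names and the `PLACEMENT_OVERRIDES` table are hypothetical stand-ins for per-account placement config:

```python
# Hypothetical per-account overrides, synced to each RGW instance
# after startup. Pool names are illustrative.
PLACEMENT_OVERRIDES = {
    "starter": {"STANDARD": "hdd-pool", "STANDARD_IA": "hdd-pool"},
}

def resolve_pool(account_tier: str, storage_class: str) -> str:
    """Resolve which backing pool a request lands on."""
    overrides = PLACEMENT_OVERRIDES.get(account_tier, {})
    # The failure mode: if the override isn't synced yet, STANDARD
    # falls through to the literal NVMe-backed default.
    defaults = {"STANDARD": "nvme-pool", "STANDARD_IA": "hdd-pool"}
    return overrides.get(storage_class, defaults[storage_class])
```

With the override present, a Starter `STANDARD` request resolves to HDD; with the override table empty (as on the freshly deployed instances), the same request resolves to NVMe.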
Timeline
All times in CET (UTC+1).
- New RGW instances deployed. Instances pass health checks and begin advertising the VIP prefix into the ECMP fabric.
- Alert: unusual throughput increase on the NVMe storage pool.
- Placement policies pulled to the new instances. NVMe load returns to normal.
- Alert: operations success rate drops to ~98%. Investigation begins.
- First customer ticket: "Can't access some random objects."
- Issue identified. First objects fixed.
- Early morning: majority of objects fixed.
- Final remediation complete. All objects uploaded during the 26-minute window verified accessible.
What Happened
Our RadosGW instances announce a shared VIP prefix via BGP into our ECMP fabric. The network hashes incoming connections based on Layer 3+4 (source IP, destination IP, source port, destination port) and distributes flows across all instances advertising the prefix. When we add instances, they start receiving a share of new flows.
The new instances came up before their configuration was complete. The placement policies for Starter accounts weren't there yet; they were only synced after the instances had started and passed health checks.
- Health checks passed (they check basic S3 operations, not placement policies)
- Instances announced VIP prefix via BGP
- ECMP started hashing flows to the new instances
- ~25% of Starter account requests hit instances without placement policies
- Requests went to NVMe storage instead of HDD
For 26 minutes, roughly 25% of Starter tier PUT requests—those whose L3+L4 hash landed on the new instances—wrote objects to NVMe storage.
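The flow distribution can be sketched as a toy model. Real fabrics use a vendor hardware hash over the same 4-tuple, not Python's `hashlib`, and the instance counts below are illustrative (if 2 of 8 advertising instances are new, about a quarter of flows hash to them):

```python
import hashlib

def ecmp_pick(instances: list, src_ip: str, dst_ip: str,
              src_port: int, dst_port: int) -> str:
    """Toy L3+L4 ECMP: hash the 4-tuple, pick one next hop.
    Deterministic, so a given flow always lands on the same instance."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return instances[digest % len(instances)]
```

Because the hash is per-flow, the damage was evenly spread: any Starter client whose new connection hashed to an unconfigured instance wrote to NVMe, regardless of which bucket or object it touched.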
The objects were stored correctly, replicated properly. But once the placement policies were in place, Starter accounts could no longer access those objects—the policies routed their requests to HDD, where the objects didn't exist.
How We Detected It
Two alerts fired within minutes of deployment.
Alert 1: NVMe throughput anomaly. Our NVMe pool saw write throughput that didn't match expected patterns. The Starter tier shouldn't generate NVMe writes. Something was wrong.
Alert 2: Success rate degradation. Operations success rate dropped from ~99.9% to ~98%. That's a lot of failures for a storage service. Most were AccessDenied errors on read operations.
We started investigating immediately. We assumed two separate issues: the success rate drop looked like another DDoS attack, and the NVMe spike seemed unrelated. A customer ticket helped connect the dots.
Resolution
- Identified all objects written to NVMe during the 26-minute window
- Cross-referenced with Starter tier customer accounts
- Migrated affected objects to HDD storage
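The identification step amounts to a filter over object metadata. A minimal sketch, assuming hypothetical record and pool names; the real audit walked the NVMe pool listings against our account database:

```python
def affected_objects(objects: list, starter_accounts: set,
                     window_start: float, window_end: float) -> list:
    """Objects written to the NVMe pool by Starter accounts during
    the deployment window; these are the ones needing migration."""
    return [
        o for o in objects
        if o["pool"] == "nvme-pool"            # illustrative pool name
        and o["account"] in starter_accounts
        and window_start <= o["mtime"] <= window_end
    ]
```

Note that the cross-reference matters: non-Starter objects on NVMe during the window were correctly placed and must not be touched.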
Initial migration on January 16th fixed most affected customers within 2 hours of the first ticket. Full audit took until January 18th.
Migration is a metadata-heavy operation—each object requires individual handling. Customers who uploaded large objects were fixed quickly. Customers who uploaded millions of small objects during that window took longer. No data was lost.
Root Cause Analysis
The root cause isn't "configuration sync was slow." That's a symptom. The root cause is our deployment process.
Our RGW deployment has a gap between "instance is healthy" and "instance is fully configured." During that gap, the instance handles production traffic with incomplete configuration.
The health check tests basic S3 functionality. It doesn't verify customer-specific configuration. It can't—that would require test credentials for every customer tier.
We deployed assuming health checks meant "ready for production." They don't. They mean "S3 API works." Those aren't the same thing.
Contributing Factors
- Configuration sync happens after health checks pass
- BGP announcement (and thus ECMP participation) happens when health checks pass, not when config is complete
- No staging period between "healthy" and "receiving production traffic"
- Alerts detected the problem but didn't provide actionable context
What We're Changing
Immediate Changes (Completed)
- Storage class access: All Starter tier customers now have access to both storage classes. Both route to the same HDD backend. We encourage customers to use STANDARD_IA where their S3 client supports it. For clients that require it, STANDARD still works and routes to HDD.
Short-Term Changes (In Progress)
- Deployment process: New RGW instances require manual approval before announcing VIP prefix via BGP. Health checks alone no longer trigger ECMP advertisement.
- Testing procedure: Verify customer-specific configuration is loaded and functional before instance enters production.
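The new gating can be sketched as an explicit readiness check that must pass before the instance announces the VIP prefix. The `placement_for` accessor and pool name are illustrative, not a real RadosGW API; the point is that the check fails closed when customer-specific config is absent:

```python
def config_complete(rgw) -> bool:
    """Verify customer-specific placement config is loaded, not just
    that the S3 API answers. Fails closed if the Starter STANDARD
    mapping is missing."""
    try:
        return rgw.placement_for("starter", "STANDARD") == "hdd-pool"
    except KeyError:
        return False

def may_announce_vip(health_ok: bool, rgw, operator_approved: bool) -> bool:
    # BGP announcement now requires all three gates, not health alone.
    return health_ok and config_complete(rgw) and operator_approved
```

Under this scheme, the incident's trigger (healthy instance, missing placement policies) leaves `may_announce_vip` false and the instance out of the ECMP fabric.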
To Our Customers
We're sorry. You trust us with your data, and we made it inaccessible. Even temporarily, even without data loss—that's not the service you signed up for.
The objects were always safe. But "safe" and "accessible" aren't the same thing, and you needed both.
We've fixed this specific failure mode. We've also identified a gap in how we handle deployment readiness. The changes listed above address both.
If you were affected by this incident and have questions, please open a support ticket.