Friday afternoon. Load on our Frankfurt S3 cluster was climbing—organic traffic, normal usage patterns. We decided to deploy additional RadosGW instances to spread the load. The new instances came up with incomplete configuration. Starter tier customers lost access to objects they uploaded during a 26-minute window. No data was lost, but full remediation took around 24 hours.
Background
ZERO-Z3 has two storage classes: HDD-backed (100–200ms latency, lower cost) and NVMe-backed (2–10ms latency, higher cost). Starter tier customers only have access to HDD storage.
Some S3 clients only accept STANDARD as a storage class. For compatibility, we map requests from Starter customers so that STANDARD routes to HDD storage.
This mapping is configured per account in RadosGW. If that configuration isn't present, requests go to actual NVMe storage.
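The mapping, and the failure mode when it is missing, can be pictured as a small lookup. This is an illustrative sketch, not our actual RadosGW configuration; the pool names and the `PLACEMENT_OVERRIDES` table are hypothetical stand-ins for per-account placement config:

```python
# Hypothetical per-account overrides, synced to each RGW instance
# after startup. Pool names are illustrative.
PLACEMENT_OVERRIDES = {
    "starter": {"STANDARD": "hdd-pool", "STANDARD_IA": "hdd-pool"},
}

def resolve_pool(account_tier: str, storage_class: str) -> str:
    """Resolve which backing pool a request lands on."""
    overrides = PLACEMENT_OVERRIDES.get(account_tier, {})
    # The failure mode: if the override isn't synced yet, STANDARD
    # falls through to the literal NVMe-backed default.
    defaults = {"STANDARD": "nvme-pool", "STANDARD_IA": "hdd-pool"}
    return overrides.get(storage_class, defaults[storage_class])
```

With the override present, a Starter `STANDARD` request resolves to HDD; with the override table empty (as on the freshly deployed instances), the same request resolves to NVMe.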
Timeline
All times in CET (UTC+1).
- New RGW instances deployed. Instances pass health checks and begin advertising the VIP prefix into the ECMP fabric.
- Alert: unusual throughput increase on the NVMe storage pool.
- Placement policies pulled to the new instances. NVMe load returns to normal.
- Alert: operations success rate drops to ~98%. Investigation begins.
- First customer ticket: "Can't access some random objects."
- Issue identified. First objects fixed.
- Early morning: majority of objects fixed.
- Final remediation complete. All objects uploaded during the 26-minute window verified accessible.
What Happened
Our RadosGW instances announce a shared VIP prefix via BGP into our ECMP fabric. The network hashes incoming connections based on Layer 3+4 (source IP, destination IP, source port, destination port) and distributes flows across all instances advertising the prefix. When we add instances, they start receiving a share of new flows.
The new instances came up before their configuration was complete. The placement policies for Starter accounts weren't there yet; they were only synced after the instances had started and passed health checks.
- Health checks passed (they check basic S3 operations, not placement policies)
- Instances announced VIP prefix via BGP
- ECMP started hashing flows to the new instances
- ~25% of Starter account requests hit instances without placement policies
- Requests went to NVMe storage instead of HDD
For 26 minutes, roughly 25% of Starter tier PUT requests—those whose L3+L4 hash landed on the new instances—wrote objects to NVMe storage.
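The flow distribution can be sketched as a toy model. Real fabrics use a vendor hardware hash over the same 4-tuple, not Python's `hashlib`, and the instance counts below are illustrative (if 2 of 8 advertising instances are new, about a quarter of flows hash to them):

```python
import hashlib

def ecmp_pick(instances: list, src_ip: str, dst_ip: str,
              src_port: int, dst_port: int) -> str:
    """Toy L3+L4 ECMP: hash the 4-tuple, pick one next hop.
    Deterministic, so a given flow always lands on the same instance."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return instances[digest % len(instances)]
```

Because the hash is per-flow, the damage was evenly spread: any Starter client whose new connection hashed to an unconfigured instance wrote to NVMe, regardless of which bucket or object it touched.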
The objects were stored correctly, replicated properly. But once the placement policies were in place, Starter accounts could no longer access those objects—the policies routed their requests to HDD, where the objects didn't exist.
How We Detected It
Two alerts fired within minutes of deployment.
Alert 1: NVMe throughput anomaly. Our NVMe pool saw write throughput that didn't match expected patterns. The Starter tier shouldn't generate NVMe writes. Something was wrong.
Alert 2: Success rate degradation. Operations success rate dropped from ~99.9% to ~98%. That's a lot of failures for a storage service. Most were AccessDenied errors on read operations.
We started investigating immediately. We assumed two separate issues: the success rate drop looked like another DDoS attack, and the NVMe spike seemed unrelated. A customer ticket helped connect the dots.
Resolution
- Identified all objects written to NVMe during the 26-minute window
- Cross-referenced with Starter tier customer accounts
- Migrated affected objects to HDD storage
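The identification step amounts to a filter over object metadata. A minimal sketch, assuming hypothetical record and pool names; the real audit walked the NVMe pool listings against our account database:

```python
def affected_objects(objects: list, starter_accounts: set,
                     window_start: float, window_end: float) -> list:
    """Objects written to the NVMe pool by Starter accounts during
    the deployment window; these are the ones needing migration."""
    return [
        o for o in objects
        if o["pool"] == "nvme-pool"            # illustrative pool name
        and o["account"] in starter_accounts
        and window_start <= o["mtime"] <= window_end
    ]
```

Note that the cross-reference matters: non-Starter objects on NVMe during the window were correctly placed and must not be touched.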
Initial migration on January 16th fixed most affected customers within 2 hours of the first ticket. Full audit took until January 18th.
Migration is a metadata-heavy operation—each object requires individual handling. Customers who uploaded large objects were fixed quickly. Customers who uploaded millions of small objects during that window took longer. No data was lost.
Root Cause Analysis
The root cause isn't "configuration sync was slow." That's a symptom. The root cause is our deployment process.
Our RGW deployment has a gap between "instance is healthy" and "instance is fully configured." During that gap, the instance handles production traffic with incomplete configuration.
The health check tests basic S3 functionality. It doesn't verify customer-specific configuration. It can't—that would require test credentials for every customer tier.
We deployed assuming health checks meant "ready for production." They don't. They mean "S3 API works." Those aren't the same thing.
Contributing Factors
- Configuration sync happens after health checks pass
- BGP announcement (and thus ECMP participation) happens when health checks pass, not when config is complete
- No staging period between "healthy" and "receiving production traffic"
- Alerts detected the problem but didn't provide actionable context
What We're Changing
Immediate Changes (Completed)
- Storage class access: All Starter tier customers now have access to both storage classes. Both route to the same HDD backend. We encourage customers to use STANDARD_IA where their S3 client supports it. For clients that require it, STANDARD still works and routes to HDD.
Short-Term Changes (In Progress)
- Deployment process: New RGW instances require manual approval before announcing VIP prefix via BGP. Health checks alone no longer trigger ECMP advertisement.
- Testing procedure: Verify customer-specific configuration is loaded and functional before instance enters production.
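The new gating can be sketched as an explicit readiness check that must pass before the instance announces the VIP prefix. The `placement_for` accessor and pool name are illustrative, not a real RadosGW API; the point is that the check fails closed when customer-specific config is absent:

```python
def config_complete(rgw) -> bool:
    """Verify customer-specific placement config is loaded, not just
    that the S3 API answers. Fails closed if the Starter STANDARD
    mapping is missing."""
    try:
        return rgw.placement_for("starter", "STANDARD") == "hdd-pool"
    except KeyError:
        return False

def may_announce_vip(health_ok: bool, rgw, operator_approved: bool) -> bool:
    # BGP announcement now requires all three gates, not health alone.
    return health_ok and config_complete(rgw) and operator_approved
```

Under this scheme, the incident's trigger (healthy instance, missing placement policies) leaves `may_announce_vip` false and the instance out of the ECMP fabric.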
To Our Customers
We're sorry. You trust us with your data, and we made it inaccessible. Even temporarily, even without data loss—that's not the service you signed up for.
The objects were always safe. But "safe" and "accessible" aren't the same thing, and you needed both.
We've fixed this specific failure mode. We've also identified a gap in how we handle deployment readiness. The changes listed above address both.
If you were affected by this incident and have questions, please open a support ticket.