
Your Cache Hit Rate Looked Fine Until the Hour Mark

Redis did its job on every miss — your application just sent two hundred loaders to Postgres at once.

6 min read · June 15, 2026

#Programming #Databases #Redis #Caching #BackendEngineering

One expiry. Two hundred loaders. Zero coordination.

You shipped cache-aside, set a sensible TTL, watched hit rate climb in staging. Production traffic arrived and the graph still looked healthy. But the top ten keys all carry the same EXPIRE 3600 from the nightly warm job. The hour mark arrives, and Postgres active_connections spikes while Redis reports nothing wrong.

That is a cache stampede — thundering herd, coordinated miss, whatever label you prefer. Not a broken Redis cluster. Not a mystery eviction bug on the first pass. One hot key expired and every concurrent reader treated the miss as a private invitation to rebuild it.

The mature version of cache-aside is not "store JSON with a TTL." It is a contract: who rebuilds on miss, how waiters behave, and which keys earn coordination tax.

Stampede Metrics — What the Cliff Looks Like

The signature is boring on dashboards until you know the shape. Hit rate drops on one key class — or across a namespace — within seconds. Primary query latency p99 jumps while Redis command latency stays flat. Connection pool saturation follows, often on a table that one cached aggregate touches.

Redis returns nil correctly. The failure is fan-out: in plain cache-aside, as the Redis cache-aside docs describe it, every application thread that misses calls the loader itself, so N concurrent misses become N loader calls. The herd isn't a Redis failure; it's an uncoordinated miss handler. The cache layer scheduled a coordinated outage by doing exactly what you asked.
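To make the fan-out concrete, here is a minimal sketch of an uncoordinated miss handler. It assumes a plain dict as a stand-in for Redis and a hypothetical `load_from_primary` that sleeps to imitate the Postgres query; none of these names come from a real library.

```python
import threading
import time

cache = {}                      # stand-in for Redis; a dict is enough to show the shape
loader_calls = 0
calls_guard = threading.Lock()


def load_from_primary(key):
    """Hypothetical loader: imagine the expensive Postgres aggregate here."""
    global loader_calls
    with calls_guard:
        loader_calls += 1
    time.sleep(0.05)            # simulated query latency
    return f"row-for-{key}"


def get_naive(key):
    value = cache.get(key)      # a miss comes back as None (nil in Redis)
    if value is None:
        value = load_from_primary(key)   # every reader that missed runs this
        cache[key] = value
    return value


def run_readers(n):
    start = threading.Barrier(n)         # release all readers at the same instant

    def reader():
        start.wait()
        get_naive("cache:product:p-001")

    threads = [threading.Thread(target=reader) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


run_readers(50)
print(f"readers=50 loader_calls={loader_calls}")
```

The exact count depends on thread scheduling, but with a loader slower than the time it takes the readers to check the cache, the number of loader calls tends to land at or near the number of readers.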

The Synchronized Expiry Trap — Jitter Buys You a Spread, Not a Shield

Identical TTLs on related keys are a time bomb with a known detonation time. Batch warm scripts love uniformity: EXPIRE 3600 on every catalog row because one constant is easy to reason about in a cron job.

Worked example — catalog warm at 02:00 UTC

A nightly job repopulates fifty thousand product keys at 02:00 and sets EXPIRE 3600 on each. By 03:00, every key in that namespace expires together. Morning traffic is not enormous — but it is concurrent. Two hundred readers hit cache:product:* in the same second. Two hundred cache misses. Two hundred identical aggregation queries against Postgres unless something coordinates rebuilds.

TTL jitter is the cheap first fix: when batch warming, set each key's TTL to base_ttl + random(0, jitter_window) at write time. You trade perfect simultaneity for a spread window. You still need miss coordination on the hottest keys: jitter spreads the cliff, but it does not remove it for a key that gets ten thousand reads per second.
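A small sketch of the jitter calculation, applied to the fifty thousand keys from the worked example. The function name `jittered_ttl` and the 600-second window are assumptions chosen for illustration; the right window depends on how much extra staleness the catalog tolerates.

```python
import random
from collections import Counter


def jittered_ttl(base_ttl_s, jitter_window_s, rng=random):
    """Return base TTL plus a uniform random offset in [0, jitter_window_s]."""
    return base_ttl_s + rng.randint(0, jitter_window_s)


# Simulate the nightly warm job: 50,000 keys, one-hour base TTL,
# ten-minute jitter window (the window size is an assumption).
rng = random.Random(42)
ttls = [jittered_ttl(3600, 600, rng) for _ in range(50_000)]

# Count how many keys expire in each second of the spread window.
per_second = Counter(ttls)
print("keys sharing the busiest expiry second:", max(per_second.values()))
print("keys sharing one second without jitter:", 50_000)
```

In a real warm job the returned value would be passed as the EXPIRE argument for each key. With a 601-second spread, the fifty thousand expiries average about 83 per second instead of all landing on one tick.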

Single-Flight Locking — One Loader, Many Waiters

For hot keys with expensive rebuilds, mutex locking is the pattern that Redis's own cache-aside documentation presents among its stampede prevention patterns. It is not aspirational pseudocode: the official guides show Lua-backed acquire and release.

On miss:

  1. Attempt SET lock:{cache_key} {unique_token} NX PX {lock_ttl_ms}, a single atomic command
  2. Winner runs the loader, writes the cache (HSET + EXPIRE or your serialization shape), releases the lock
  3. Losers poll the cache briefly and return what the winner wrote — not their own primary round-trip

Release must be token-safe. A slow loader can outlive PX, at which point another client acquires the lock, and a blind DEL on the lock key then deletes someone else's lock. The standard fix is a token-checked release in Lua: delete the lock key only if GET lock_key still returns your token.
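The failure mode is easier to see when it is replayed step by step. The sketch below is not a Redis client: `FakeLockStore` is a hypothetical in-memory stand-in that imitates `SET ... NX PX` and the compare-and-delete script, with an injected clock so the expiry can be forced. The Lua text is the widely used compare-and-delete shape, included for reference only.

```python
import time
import uuid

# Compare-and-delete script in its commonly used form (reference only,
# it is not executed by this sketch).
RELEASE_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""


class FakeLockStore:
    """Single-threaded, in-memory stand-in for the two operations the pattern needs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}    # lock key -> (token, expires_at)

    def set_nx_px(self, key, token, ttl_ms):
        """Imitates SET key token NX PX ttl_ms: succeeds only if no live lock exists."""
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return False
        self._locks[key] = (token, now + ttl_ms / 1000.0)
        return True

    def release_if_owner(self, key, token):
        """Imitates RELEASE_LUA: delete only when the stored token matches."""
        held = self._locks.get(key)
        if held is not None and held[1] > self._clock() and held[0] == token:
            del self._locks[key]
            return 1
        return 0


# Replay: client A's loader outlives its 100 ms lock, client B takes over.
now = [0.0]
store = FakeLockStore(clock=lambda: now[0])
lock_key = "lock:cache:product:p-001"
token_a, token_b = uuid.uuid4().hex, uuid.uuid4().hex

got_a = store.set_nx_px(lock_key, token_a, 100)
now[0] = 0.180                                   # A stalls past PX
got_b = store.set_nx_px(lock_key, token_b, 100)  # B acquires the expired lock
released_by_a = store.release_if_owner(lock_key, token_a)  # A's late release
released_by_b = store.release_if_owner(lock_key, token_b)

print(got_a, got_b, released_by_a, released_by_b)
```

A's late release finds B's token under the key and does nothing, so B's lock survives until B releases it. A blind DEL at the same point would have removed it.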

Worked example — 200 concurrent readers on cache:product:p-001

Primary read latency is ~150ms in the happy path. You set lock_ttl_ms to 2000, comfortably longer than p99 loader time. Two hundred goroutines miss together. One acquires the lock, queries Postgres once, writes the hash, and releases. The other 199 increment stampedes_suppressed, poll HGETALL for up to two seconds, and return the populated entry. Primary read count: one. Without single-flight: up to two hundred.
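This worked example can be rehearsed without a Redis server. The sketch below is an in-process simulation under stated assumptions: threads stand in for goroutines, `FakeRedis` is a hypothetical in-memory class whose single guard imitates Redis's atomic command execution, a plain get stands in for HGETALL, and `load_product` sleeps 150 ms to imitate the primary read. The re-check after acquiring the lock is an addition of this sketch; it covers a reader that missed just before the winner finished.

```python
import threading
import time
import uuid


class FakeRedis:
    """Thread-safe in-memory stand-in; not a real client."""

    def __init__(self):
        self._guard = threading.Lock()
        self._data = {}
        self._locks = {}    # lock key -> (token, expires_at)

    def get(self, key):
        with self._guard:
            return self._data.get(key)

    def set(self, key, value):
        with self._guard:
            self._data[key] = value

    def acquire(self, key, token, ttl_ms):   # imitates SET key token NX PX ttl_ms
        with self._guard:
            now = time.monotonic()
            held = self._locks.get(key)
            if held and held[1] > now:
                return False
            self._locks[key] = (token, now + ttl_ms / 1000)
            return True

    def release(self, key, token):           # imitates token-checked Lua release
        with self._guard:
            held = self._locks.get(key)
            if held and held[0] == token:
                del self._locks[key]


stats = {"primary_reads": 0, "stampedes_suppressed": 0}
stats_guard = threading.Lock()


def bump(name):
    with stats_guard:
        stats[name] += 1


def load_product(key):
    bump("primary_reads")
    time.sleep(0.15)                          # ~150 ms primary read
    return {"id": key, "name": "example"}


def get_single_flight(r, key, lock_ttl_ms=2000, poll_s=0.01):
    value = r.get(key)
    if value is not None:
        return value
    token, lock_key = uuid.uuid4().hex, f"lock:{key}"
    if r.acquire(lock_key, token, lock_ttl_ms):
        try:
            value = r.get(key)                # re-check: a winner may have just finished
            if value is None:
                value = load_product(key)
                r.set(key, value)
            return value
        finally:
            r.release(lock_key, token)
    bump("stampedes_suppressed")              # lost the race: wait instead of loading
    deadline = time.monotonic() + lock_ttl_ms / 1000
    while time.monotonic() < deadline:
        value = r.get(key)
        if value is not None:
            return value
        time.sleep(poll_s)
    return load_product(key)                  # fallback once the wait budget is spent


redis_stub = FakeRedis()
results = []
start = threading.Barrier(200)


def reader():
    start.wait()
    results.append(get_single_flight(redis_stub, "cache:product:p-001"))


threads = [threading.Thread(target=reader) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stats)
```

The suppressed count varies with scheduling, since some readers arrive after the value is already cached and never contend for the lock. The primary read count is the number to watch.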

If lock_ttl_ms is 100ms and the loader stalls at 180ms, the lock expires mid-flight. Token-safe release prevents the wrong client from deleting the new holder's lock — but you can still get a redundant primary read. Tune lock TTL against loader p99, not median.
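The gap between median and p99 is where the lock TTL goes wrong. Here is a rough sketch of deriving the TTL from observed loader durations; the sample latencies are invented for illustration, and the 2x headroom factor is an assumption rather than a recommendation from any guide.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples_ms)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # integer ceiling
    return ordered[rank - 1]


def lock_ttl_ms(samples_ms, headroom=2.0):
    """Lock TTL sized from loader p99 with a safety multiplier (assumed value)."""
    return math.ceil(percentile(samples_ms, 99) * headroom)


# Invented loader timings: mostly 150 ms, with a slow tail.
samples = [150] * 97 + [180, 400, 900]

median = percentile(samples, 50)
p99 = percentile(samples, 99)
print("median:", median, "p99:", p99)
print("TTL sized from median:", math.ceil(median * 2.0))
print("TTL sized from p99:", lock_ttl_ms(samples))
```

With these samples, a TTL sized from the median would expire under the 400 ms loader run, while the p99-based value covers it.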

XFetch — Early Refresh Before the Cliff

Mutex locking works until lock polling itself becomes the bottleneck on keys with extreme read rates. Probabilistic early expiration (XFetch) refreshes before physical expiry so the key rarely goes cold under load. The approach is analyzed in a VLDB paper and appears among Redis's stampede prevention patterns.

Store recompute duration delta with the cached value. On read, recompute if:

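current_time - delta * beta * log(random()) >= expiry_time

Because random() is drawn from (0, 1), log(random()) is negative, so the subtracted term pushes the effective clock forward by a random amount scaled by delta and beta.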

As expiry approaches, the probability that one request triggers refresh rises. Usually one "lucky" reader rebuilds while everyone else serves slightly stale data. The VLDB paper that introduced XFetch analyzes this exponential variant as optimal for stampede prevention. beta = 1 is the practical default; values above 1 favor earlier refresh.
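A minimal sketch of the check in its commonly published form, `now - delta * beta * log(rand) >= expiry`, where the log of a number in (0, 1] is never positive. The function and parameter names are assumptions for illustration. The sketch draws `1 - random()` so the argument to `log` can never be zero.

```python
import math
import random


def should_refresh_early(now, expiry_time, delta, beta=1.0, rand=random.random):
    """XFetch check: True when this reader should recompute before expiry.

    delta is the last observed recompute duration, in the same unit as now.
    """
    u = 1.0 - rand()            # rand() is in [0, 1), so u is in (0, 1]
    return now - delta * beta * math.log(u) >= expiry_time


# Estimate how often a read triggers a refresh at different distances
# from expiry, for a 150 ms recompute (delta = 0.15 s).
rng = random.Random(7)


def refresh_rate(seconds_left, delta=0.15, trials=10_000):
    hits = sum(
        should_refresh_early(0.0, seconds_left, delta, 1.0, rng.random)
        for _ in range(trials)
    )
    return hits / trials


for seconds_left in (1.5, 0.6, 0.15, 0.0):
    print(f"{seconds_left:>5} s before expiry: refresh rate {refresh_rate(seconds_left):.4f}")
```

With this form, the chance that a single read refreshes with s seconds left works out to exp(-s / (delta * beta)): negligible ten recompute-times out, about 37% at one recompute-time out, and certain at expiry.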

XFetch costs metadata and accepts brief staleness. Internet Archive's stampede test harness compares the fetch, locked, xfetch, and xlocked strategies side by side. Production teams often combine both, as Redis's pattern guidance suggests: XFetch for steady-state smoothing, mutex fallback on hard misses.

Stale-While-Revalidate — Controlled Old Data

Redis does not ship stale-while-revalidate as a command flag. You implement logical expiry at the application layer: serve the cached blob while a background task rebuilds it, or serve the last known value while the lock holder loads.

Freshness theater — blocking two hundred threads until the rebuild finishes — loses to one controlled recompute plus slightly old JSON for a few hundred milliseconds. User-facing product pages tolerate that trade more often than engineers admit in design reviews.

The pattern needs a version or timestamp in the payload so you know what "stale" means. Pair it with single-flight so only one background refresh runs.
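One way this could look in application code, sketched with an in-memory dict instead of Redis. `SwrCache`, `Entry`, and `fresh_until` are hypothetical names; the logical expiry travels with the payload, and a per-key marker keeps background refreshes to one at a time. Hard misses load inline here for brevity; a production version would put single-flight around that path too.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class Entry:
    value: str
    fresh_until: float          # logical expiry stored alongside the payload


class SwrCache:
    def __init__(self, loader, soft_ttl, clock=time.monotonic):
        self._loader = loader
        self._soft_ttl = soft_ttl
        self._clock = clock
        self._entries = {}
        self._refreshing = {}   # key -> background thread currently rebuilding it
        self._guard = threading.Lock()

    def get(self, key):
        with self._guard:
            entry = self._entries.get(key)
            if entry is not None:
                stale = self._clock() >= entry.fresh_until
                if stale and key not in self._refreshing:
                    worker = threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    )
                    self._refreshing[key] = worker
                    worker.start()
                return entry.value      # fresh or stale, the reader never blocks
        return self._refresh(key)       # hard miss: nothing to serve, load inline

    def _refresh(self, key):
        value = self._loader(key)
        with self._guard:
            self._entries[key] = Entry(value, self._clock() + self._soft_ttl)
            self._refreshing.pop(key, None)
        return value

    def wait_for_refresh(self, key):
        """Demo helper: block until any in-flight refresh for key finishes."""
        with self._guard:
            worker = self._refreshing.get(key)
        if worker is not None:
            worker.join()


# Demo with a controllable clock and a loader that numbers its results.
now = [0.0]
calls = []


def loader(key):
    calls.append(key)
    return f"v{len(calls)}"


swr = SwrCache(loader, soft_ttl=5.0, clock=lambda: now[0])
first = swr.get("cache:product:p-001")     # hard miss, loads inline
now[0] = 10.0                              # logical expiry has passed
second = swr.get("cache:product:p-001")    # serves stale, starts one refresh
swr.wait_for_refresh("cache:product:p-001")
third = swr.get("cache:product:p-001")     # refreshed value
print(first, second, third, len(calls))
```

The second read returns the old value immediately while the rebuild runs in the background; only the third read sees the new one.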

The Hot-Key Decision Tree

Not every key deserves a lock.

  • Cold or cheap lookups — miss, load, cache. Coordination tax exceeds database pain. Locks aren't for every key — stampede protection is a key-class policy, not middleware wallpaper.
  • Hot + expensive rebuild — single-flight mutex. Start here per Redis pattern guidance.
  • Extreme read rate + brief staleness OK — XFetch or jittered TTL plus mutex fallback.
  • User-facing + hard latency SLO — stale-while-revalidate with background refresh.

Martin Fowler's cache-aside bliki is clear that in cache-aside the application owns population on miss. Stampede policy is part of that ownership, not a Redis server configuration toggle.

Eviction Surprises — Stampede Without Expiry

TTL expiry is not the only cold-miss storm. When maxmemory is tight and the eviction policy drops hot keys under pressure, you get repeated misses that look like stampedes in metrics but trace to memory sizing. Fix the memory floor and policy before adding fancier locking to keys that Redis keeps throwing away.
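One way to tell the two storms apart is to compare counters from the stats section of Redis INFO across the incident window. The field names `expired_keys`, `evicted_keys`, and `keyspace_misses` are the ones INFO reports; the classification rule and the sample numbers below are assumptions for illustration, and the snapshots are plain dicts rather than live client output.

```python
FIELDS = ("expired_keys", "evicted_keys", "keyspace_misses")


def counter_deltas(before, after):
    """INFO counters are cumulative, so the incident is the difference."""
    return {f: int(after[f]) - int(before[f]) for f in FIELDS}


def classify_miss_storm(before, after):
    """Rough heuristic (an assumption): blame whichever removal path dominated."""
    d = counter_deltas(before, after)
    if d["keyspace_misses"] == 0:
        return "no misses"
    if d["evicted_keys"] > d["expired_keys"]:
        return "eviction"       # look at maxmemory and the eviction policy first
    if d["expired_keys"] > 0:
        return "expiry"         # look at TTL alignment and miss coordination
    return "other"


# Invented snapshots taken before and after two different incidents.
baseline = {"expired_keys": 1000, "evicted_keys": 0, "keyspace_misses": 500}
after_hour_mark = {"expired_keys": 51000, "evicted_keys": 0, "keyspace_misses": 700}
after_memory_squeeze = {"expired_keys": 1010, "evicted_keys": 4200, "keyspace_misses": 9000}

print(classify_miss_storm(baseline, after_hour_mark))
print(classify_miss_storm(baseline, after_memory_squeeze))
```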

AWS's ElastiCache guidance clusters around the same read pattern: cache-aside with disciplined TTL and invalidation, plus awareness that hot keys in under-provisioned memory behave like a cache that never warmed.

What to Ship Monday

Audit your top ten keys by read rate. Mark which rebuilds cost real primary work. Add jitter to any batch warm that sets identical EXPIRE. Implement single-flight on the expensive hot set using token-safe Lua release — copy the pattern from current redis.io cache-aside docs, not a decade-old SETNX snippet.

Measure stampedes_suppressed or equivalent. If the counter stays zero under load tests that expire a hot key on purpose, you have not tested concurrency — you tested politeness.

TTL without jitter schedules the cliff. Miss handling without coordination funds the database spike. Redis stores bytes. Your loader policy decides whether one expiry becomes one query or two hundred.

Coordinate the miss. Jitter the expiry. Protect the keys that can drown the pool.

Everything else can miss like an adult.
