
BullMQ Background Jobs That Survive Production

Retries with an error taxonomy, deduplication that survives cleanup, and a dead-letter queue someone actually inspects — not a five-minute `Queue` demo.

Serhii Malyshev · Software architect and tech lead · 6 min read · Jun 6, 2026

#Programming #Node.js #Redis #BackgroundJobs #SoftwareEngineering

Retries stack left. Exhausted jobs turn right.

Most Node.js services ship background work the same way they ship health checks: copy a BullMQ tutorial, paste attempts: 3, call it resilient, and discover the gaps on the first Redis memory alert or the second Stripe charge on the same invoice.

The failure mode isn't ignorance. You treated the queue as transport. Production needs policy — what to retry, what to dedupe, what to bury, and which Redis connection gets to hang your HTTP thread while the cluster reboots.

BullMQ's primitives are solid. The runbook isn't in the box.

That's where the walkthrough starts.

The demo queue is not production — what breaks first

A fresh BullMQ install feels finished because jobs move. Redis accepts keys. The dashboard shows waiting and active counts. None of that proves your queue survives a deploy, a memory cap, or a worker that blocks the event loop for forty seconds.

Two failures show up before retry policy even matters.

Redis eviction. BullMQ stores job payloads in Redis. If your instance runs maxmemory-policy anything other than noeviction, keys can disappear under pressure — and BullMQ will not always surface that as a polite validation error (connections guide). You lose jobs quietly. Fix the Redis policy before you fix the worker code.
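One way to turn the eviction policy into a startup check rather than a wiki note. This is a minimal sketch: `parseEvictionPolicy` and `assertNoEviction` are hypothetical helpers, and the input is assumed to be the flat `[name, value]` reply of `CONFIG GET maxmemory-policy` (for example from ioredis via `redis.config('GET', 'maxmemory-policy')`). Some managed Redis services disable `CONFIG`; in that case, check the policy in the provider console instead.

```typescript
// Hypothetical startup guard: refuse to boot workers on an evicting Redis.
// `reply` is assumed to be the flat [name, value] array from CONFIG GET.
function parseEvictionPolicy(reply: readonly string[]): string | undefined {
  const index = reply.indexOf('maxmemory-policy');
  return index >= 0 ? reply[index + 1] : undefined;
}

function assertNoEviction(reply: readonly string[]): void {
  const policy = parseEvictionPolicy(reply);
  if (policy !== 'noeviction') {
    throw new Error(
      `Redis maxmemory-policy is "${policy ?? 'unknown'}"; BullMQ needs "noeviction"`,
    );
  }
}
```

Failing at boot is a design choice: a worker that refuses to start is louder than a queue that quietly loses keys.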

Workers colocated with the API. Running new Worker() in the same Node process that serves HTTP is convenient until one PDF render or image resize blocks the event loop. BullMQ assumes workers heartbeat while processing; stop heartbeating long enough and the job stalls — then retries, then fails on maxStalledCount (stalled jobs). Run workers as separate processes or containers. Keep the API thin: enqueue and return.

Baseline queue options — attempts, backoff, cleanup

Start every queue with explicit defaultJobOptions. Defaults are a demo, not a contract.

const emailQueue = new Queue('email', {
  connection: producerRedis,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: { age: 3600, count: 1000 },
    removeOnFail: { age: 86400, count: 5000 },
  },
});

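Attempts and backoff. attempts: 1 is the silent default that leaves poison in the failed set until someone pokes at it (retrying jobs). Set it above 1. Exponential backoff spaces retries by doubling the wait after each failure, 2^(attemptsMade - 1) × delay: with the 2s base above, the four retries wait 2s, 4s, 8s, and 16s. The curve steepens fast if you raise attempts; a seventh failure with the same base would wait 2^6 × 2s = 128s, a little over two minutes.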
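To see what a given attempts and delay pair costs in wall-clock time, the schedule can be computed up front. This sketch assumes BullMQ's built-in exponential strategy follows 2^(attemptsMade - 1) × delay; `exponentialWaitsMs` is a hypothetical helper, not a BullMQ function.

```typescript
// Assumed formula for the built-in exponential backoff:
// wait = 2^(attemptsMade - 1) * delay, where attemptsMade counts failures so far.
function exponentialWaitsMs(attempts: number, delayMs: number): number[] {
  const waits: number[] = [];
  // A job with N attempts has at most N - 1 retries, hence N - 1 waits.
  for (let failures = 1; failures < attempts; failures++) {
    waits.push(Math.pow(2, failures - 1) * delayMs);
  }
  return waits;
}

const waits = exponentialWaitsMs(5, 2000);
// waits -> [2000, 4000, 8000, 16000]
const totalMs = waits.reduce((sum, ms) => sum + ms, 0);
// totalMs -> 30000
```

Thirty seconds of total backoff is short for a provider outage; if the upstream typically recovers in minutes, raise the base delay rather than the attempt count.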

Queue backoff spaces job attempts. Your outbound HTTP client still needs decorrelated jitter when multiple jobs fail into the same API — different layer, same storm.
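For the HTTP-client layer, decorrelated jitter can be sketched in a few lines. The function name, the base of 200 ms, and the 10 s ceiling are illustrative assumptions; the random source is injectable so the behavior can be checked deterministically.

```typescript
// Decorrelated jitter: each sleep is drawn uniformly between the base delay
// and three times the previous sleep, then clamped to a ceiling.
function nextJitteredDelayMs(
  previousMs: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const upper = Math.max(baseMs, previousMs * 3);
  const drawn = baseMs + random() * (upper - baseMs);
  return Math.min(capMs, Math.round(drawn));
}

// Usage inside an HTTP client retry loop (separate from the queue's backoff):
let sleepMs = 200;
for (let attempt = 1; attempt <= 3; attempt++) {
  sleepMs = nextJitteredDelayMs(sleepMs, 200, 10000);
  // a real client would sleep for sleepMs here before the next request
}
```

Because each delay depends on the previous one plus randomness, jobs that failed at the same instant drift apart instead of retrying in lockstep.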

Cleanup. Completed and failed jobs accumulate in Redis sets. Production needs removeOnComplete and removeOnFail with both age and count caps so a traffic spike cannot exhaust memory (auto-removal). Removal is lazy — it runs when new jobs finalize — so do not treat the caps as real-time garbage collection.

Retries that know when to quit

Not every thrown Error deserves five more tries. BullMQ distinguishes retryable failures from terminal ones.

Use UnrecoverableError when the business case is done — bad payload, unknown customer, validation that will never pass on attempt six (stop retrying). BullMQ moves the job to failed immediately, ignoring attempts.

RateLimitError is the weird cousin: BullMQ retries without incrementing attemptsMade. Fine for transient throttling. Dangerous if you loop forever. Check job.attemptsStarted against job.opts.attempts before re-throwing — or you'll rate-limit past your own cap.
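The guard described above fits in one predicate. This is a sketch with a structural type standing in for the BullMQ job; `mayRateLimitAgain` is a hypothetical name, and the processor wiring in the comments assumes the `worker.rateLimit()` and `Worker.RateLimitError()` calls from BullMQ's rate-limiting guide.

```typescript
// Minimal structural type: only the two job fields this guard reads.
interface RateLimitedJob {
  attemptsStarted: number;
  opts: { attempts?: number };
}

// True while the job may still be handed back to the rate limiter.
// attemptsStarted keeps counting even when attemptsMade does not.
function mayRateLimitAgain(job: RateLimitedJob): boolean {
  const cap = job.opts.attempts ?? 1;
  return job.attemptsStarted < cap;
}

// Intended use inside a processor when the upstream answers 429:
//   if (mayRateLimitAgain(job)) {
//     await worker.rateLimit(retryAfterMs);
//     throw Worker.RateLimitError();
//   }
//   throw new UnrecoverableError('rate limited past the attempt cap');
```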

Custom backoff strategies on the worker offer a third escape hatch: return -1 from backoffStrategy to fail without retrying (custom backoff). Handy when failure classification lives in one module and you do not want every processor importing UnrecoverableError.
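A sketch of what one classification module might look like. The taxonomy (HTTP status attached to the error, 4xx terminal except 408 and 429) is an assumption for illustration, not a rule from BullMQ. The strategy is shaped like a custom backoff callback and is assumed to be registered through the worker's `settings: { backoffStrategy }` with jobs using `backoff: { type: 'custom' }`.

```typescript
type FailureKind = 'retryable' | 'terminal';

// Hypothetical taxonomy keyed on an HTTP status attached to the error.
function classifyFailure(err: { status?: number }): FailureKind {
  const status = err.status;
  if (status === undefined) return 'retryable'; // network error, timeout
  if (status === 408 || status === 429) return 'retryable';
  if (status >= 400 && status < 500) return 'terminal'; // bad payload, unknown customer
  return 'retryable'; // 5xx and anything else
}

// Returns -1 to fail the job without retrying, otherwise the wait in milliseconds.
function backoffStrategy(
  attemptsMade: number,
  _type: string | undefined,
  err: (Error & { status?: number }) | undefined,
): number {
  if (err && classifyFailure(err) === 'terminal') return -1;
  // Exponential wait with a 2s base, capped at one minute.
  return Math.min(60000, Math.pow(2, attemptsMade - 1) * 2000);
}
```

Keeping the decision in one function means processors just throw; whether a failure deserves another attempt is answered in a single place.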

Deduplication — works until cleanup deletes your guard

Duplicate jobs are how you send three invoice emails, run three payout transfers, or enqueue the same nightly report twelve times because a webhook retried.

BullMQ offers two families of protection: jobId and the deduplication API.

`jobId` — simple throttle while the job exists

Pass a stable jobId when adding work. Identical IDs are ignored while a job with that ID remains in the queue (job IDs). Scope is per queue. Prefix numeric business IDs — 8842 alone throws; use invoice-8842.

The throttle pattern is the same idea: one ID, one slot, duplicates dropped until the job leaves the queue.
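A small builder keeps the ID rules in one place. `toJobId` is a hypothetical helper; it assumes two constraints on custom IDs: they must not be bare integers, and (in recent BullMQ versions) they must not contain a colon.

```typescript
// Hypothetical helper: build a custom job ID from a kind and a business ID.
// The non-empty, non-numeric prefix guarantees the result is never a bare integer.
function toJobId(kind: string, businessId: string | number): string {
  if (kind.length === 0 || /^\d+$/.test(kind)) {
    throw new Error('kind must be a non-empty, non-numeric prefix');
  }
  const id = `${kind}-${String(businessId)}`;
  if (id.indexOf(':') !== -1) {
    throw new Error(`jobId "${id}" must not contain ":"`);
  }
  return id;
}

const invoiceJobId = toJobId('invoice', 8842);
// invoiceJobId -> 'invoice-8842'
```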

Deduplication modes — bursts, debounce, active coalescing

For profile-update emails or search-index refreshes, raw jobId is blunt. Deduplication modes add TTL and behavior (deduplication guide):

Throttle: same deduplication ID inside the TTL window is ignored.
Debounce: setting extend and replace together keeps the latest payload and resets the window.
keepLastIfActive: while a job runs, store the newest payload and enqueue a follow-up when the active job finishes.

Listen for the deduplicated event if you need to tell the user their click was coalesced, not lost.
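The first two modes map to small option objects. The field names below (`id`, `ttl`, `extend`, `replace`) are assumed from the deduplication guide and arrived over several BullMQ releases, so verify them against the version you run. `throttle` and `debounce` are hypothetical helper names; keepLastIfActive is left out because its option shape is not shown here.

```typescript
// Assumed shape of the `deduplication` job option.
interface DeduplicationOptions {
  id: string;
  ttl?: number; // window in milliseconds
  extend?: boolean; // restart the window on every duplicate
  replace?: boolean; // swap the stored payload for the newest one
}

// Throttle: the first job wins, duplicates inside the window are ignored.
function throttle(id: string, ttlMs: number): { deduplication: DeduplicationOptions } {
  return { deduplication: { id, ttl: ttlMs } };
}

// Debounce: the latest job wins, and each duplicate restarts the window.
function debounce(id: string, ttlMs: number): { deduplication: DeduplicationOptions } {
  return { deduplication: { id, ttl: ttlMs, extend: true, replace: true } };
}

// Intended use (queue name and payload are placeholders):
//   await searchQueue.add('reindex', { userId }, debounce(`reindex-${userId}`, 5000));
```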

Worked example — invoice email burst

Webhook fires payment.succeeded three times in two seconds.

await emailQueue.add(
  'invoice-receipt',
  { invoiceId: 'inv_8842', to: 'customer@example.com' },
  { jobId: 'invoice-inv_8842' },
);

While invoice-inv_8842 is waiting or active, duplicates are dropped. Good.

Now add aggressive cleanup:

removeOnComplete: true,

The moment the first receipt job completes and Redis removes it, a fourth webhook retry can enqueue invoice-inv_8842 again — auto-removal breaks the job ID guard.

You did not break BullMQ — you broke your idempotency window.

Fix paths: keep a bounded removeOnComplete: { age: 3600, count: 100 }, use deduplication with a TTL that outlasts your webhook retry horizon, or align jobId with your payment provider's idempotency key so a duplicate job still hits an idempotent API.
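The window problem is easier to see in a toy model. The class below is not BullMQ; it is an in-memory stand-in that mirrors the two rules described here: a jobId blocks duplicates only while the job record exists, and a deduplication TTL is assumed to keep blocking after the job is gone.

```typescript
// Toy in-memory model of the two guards. NOT BullMQ.
class GuardModel {
  private jobs = new Set<string>();
  private dedupUntil = new Map<string, number>();

  // Returns true if the job was accepted, false if dropped as a duplicate.
  add(jobId: string, nowMs: number, dedupTtlMs?: number): boolean {
    if (this.jobs.has(jobId)) return false;
    const until = this.dedupUntil.get(jobId);
    if (until !== undefined && nowMs < until) return false;
    this.jobs.add(jobId);
    if (dedupTtlMs !== undefined) this.dedupUntil.set(jobId, nowMs + dedupTtlMs);
    return true;
  }

  // Models removeOnComplete: true, where the job record vanishes at once.
  completeAndRemove(jobId: string): void {
    this.jobs.delete(jobId);
  }
}

const jobIdOnly = new GuardModel();
jobIdOnly.add('invoice-inv_8842', 0); // accepted
jobIdOnly.add('invoice-inv_8842', 1000); // dropped: job still exists
jobIdOnly.completeAndRemove('invoice-inv_8842');
const fourthRetry = jobIdOnly.add('invoice-inv_8842', 3000);
// fourthRetry -> true: the guard left with the job

const withTtl = new GuardModel();
withTtl.add('invoice-inv_8842', 0, 3600000);
withTtl.completeAndRemove('invoice-inv_8842');
const blockedRetry = withTtl.add('invoice-inv_8842', 3000, 3600000);
// blockedRetry -> false: the one-hour window is still open
```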

Dead-letter queue — BullMQ gives you events, not a bucket

No DLQ checkbox. Failed jobs land in a failed set most teams never open.

You wire the dead-letter path from events yourself.

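import { UnrecoverableError } from 'bullmq';

worker.on('failed', async (job, err) => {
  // job can be undefined, for example when its record was already removed
  if (!job) return;

  // Terminal errors skip the remaining attempts, so attemptsMade alone
  // would not flag them as exhausted.
  const terminal = err instanceof UnrecoverableError;
  const exhausted =
    job.attemptsMade >= (job.opts.attempts ?? 1);

  if (!terminal && !exhausted) return;

  await dlqQueue.add('exhausted', {
    originalQueue: job.queueName,
    originalId: job.id,
    name: job.name,
    data: job.data,
    failedReason: err.message,
    attemptsMade: job.attemptsMade,
    timestamp: Date.now(),
  });
});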

UnrecoverableError lands here on the first failure — that is the point (stop retrying). Retryable errors only arrive after attempts exhaust.

Worked example: billing-dlq. Name the DLQ queue explicitly (billing-dlq, not a vague failed-jobs), and avoid colons in the name, since recent BullMQ versions reject queue names that contain one. Store enough context to replay manually: original payload, error string, attempt count, upstream idempotency key. Alert on DLQ depth, not on every transient failure. removeOnFail: { count: 500 } is retention policy; it is not triage.

Replay scripts should call the same external APIs with the same idempotency keys the worker used — otherwise DLQ replay becomes a second outage.
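A replay loop under that rule might look like the sketch below. The record shape, the `idempotencyKey` field, and the `Sender` signature are all hypothetical; the point is only that the key travels with the dead letter and is reused verbatim.

```typescript
// Hypothetical DLQ record; idempotencyKey is whatever the worker originally sent upstream.
interface DeadLetter {
  originalId: string;
  data: { invoiceId: string; idempotencyKey: string };
}

// Stand-in for the real upstream call (payment provider, email API, ...).
type Sender = (payload: { invoiceId: string }, idempotencyKey: string) => Promise<void>;

// Replays entries one at a time, reusing the stored key so the upstream
// can recognize work it has already done.
async function replayDeadLetters(
  entries: readonly DeadLetter[],
  send: Sender,
): Promise<string[]> {
  const replayed: string[] = [];
  for (const entry of entries) {
    await send({ invoiceId: entry.data.invoiceId }, entry.data.idempotencyKey);
    replayed.push(entry.originalId);
  }
  return replayed;
}
```

Replaying sequentially is deliberate: a DLQ usually fills because an upstream was struggling, and a parallel replay can push it over again.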

Connections, stalls, and shutdown

Redis connections are not interchangeable between producers and consumers.

Workers need maxRetriesPerRequest: null on ioredis so commands keep retrying through Redis blips (connections). BullMQ throws if you forget — treat that as a favor.

HTTP producers adding jobs should fail fast, because the queue is not a synchronous dependency. By default ioredis retries each command up to 20 times, so during a Redis outage your REST handler hangs and the caller hangs with it. Use a separate connection with maxRetriesPerRequest: 1 or enableOfflineQueue: false so queue.add throws and the client gets a 503 it can retry later (failing fast). Do not reuse the worker's maxRetriesPerRequest: null connection in your Express route.
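The split can be written down as two option objects plus a thin wrapper on the HTTP side. Host, port, and `enqueueOrUnavailable` are placeholders for illustration; the text says either producer setting is enough, and both are shown here only for completeness.

```typescript
// ioredis option fragments (host and port are placeholders).
const workerRedisOptions = {
  host: 'localhost',
  port: 6379,
  maxRetriesPerRequest: null, // keep retrying through Redis blips
};

const producerRedisOptions = {
  host: 'localhost',
  port: 6379,
  maxRetriesPerRequest: 1, // give up quickly
  enableOfflineQueue: false, // reject commands while disconnected
};

// Hypothetical HTTP-side wrapper: `add` stands in for () => queue.add(...).
// Returns the status code the route should answer with.
async function enqueueOrUnavailable(add: () => Promise<unknown>): Promise<202 | 503> {
  try {
    await add();
    return 202;
  } catch (err) {
    return 503;
  }
}
```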

Never set ioredis keyPrefix on BullMQ connections — use BullMQ's own prefix option instead.

Stalls. If the event loop blocks longer than the stalled check interval (~30s by default), BullMQ moves the job back to waiting and emits stalled (stalled jobs). Default maxStalledCount is 1, so a job is recovered from a stall once and failed permanently if it stalls again; a long CPU-bound job can burn through that allowance quickly. Prefer sandboxed processors for CPU-heavy work instead of cranking stall tolerance.

Shutdown. On SIGTERM, await worker.close() stops accepting new jobs and waits for in-flight work to finish (graceful shutdown). Add your own timeout in orchestration — close() does not bail out for you. Ungraceful exits still recover via stalled-job pickup on another worker.
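The missing timeout can be added with a race. In this sketch `close` stands in for `() => worker.close()` and `closeWithTimeout` is a hypothetical helper; what to do on 'timed-out' (log, exit non-zero, let stalled-job recovery take over) is left to the caller.

```typescript
// Races a close function against a deadline and reports which one won.
async function closeWithTimeout(
  close: () => Promise<void>,
  timeoutMs: number,
): Promise<'closed' | 'timed-out'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<'timed-out'>((resolve) => {
    timer = setTimeout(() => resolve('timed-out'), timeoutMs);
  });
  try {
    return await Promise.race([close().then(() => 'closed' as const), deadline]);
  } finally {
    // Clear the timer so a fast close does not keep the process alive.
    clearTimeout(timer);
  }
}

// Intended use in a SIGTERM handler, with the grace period set below the
// orchestrator's kill timeout:
//   const outcome = await closeWithTimeout(() => worker.close(), 25000);
```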

When not to queue

Sub-50ms work often loses to enqueue overhead — Redis round-trips, serialization, another process waking up. If the task is a cache write or a single-row update you would await inline, measure before queuing.

Cron-overlap problems sometimes look like deduplication problems. If only one nightly export may run, use jobId: 'nightly-export' or deduplication TTL across the schedule window — not a second cron daemon "just in case."

Retries without taxonomy are noise.

Deduplication without cleanup discipline is a duplicate charge waiting to happen.