ADR 0012: Background jobs via arq + shared Redis

Status: Accepted
Date: 2026-05-01

Context

SCALR apps will inevitably need work that doesn't belong in the request/response cycle: sending invoice emails, regenerating reports, cleaning up stale records, retrying failed webhooks, anything triggered on a schedule. Until now the platform had no convention for this — the first app that needed one would either have bolted something on ad hoc or stood up parallel Celery infrastructure in a hurry.

We want a convention agreed in writing before the first app needs it, so that:

  1. Every app uses the same queue, Redis, and worker shape.
  2. Adding a worker to a new app is a small docker-compose stub, not a research project.
  3. We don't end up with 3 different queue libraries across 5 apps.

Decision

Adopt arq as the background-job library, deployed against a dedicated shared Redis at services/redis/ (separate from Authentik's Redis). Wrap arq in a tiny packages/queue-py so apps write from scalr_queue import enqueue rather than depending on arq directly.
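The wrapper's public surface can be sketched in a few lines. Everything here — the SCALR_REDIS_URL variable, the lazy pool caching, the module shape — is an assumed sketch, not the actual packages/queue-py code; the create_pool / enqueue_job calls are arq's real API:

```python
import os
from typing import Any

# Default assumed from this ADR's convention; real package may differ.
REDIS_URL = os.environ.get("SCALR_REDIS_URL", "redis://scalr-redis:6379/0")

_pool = None  # cached arq connection pool, created on first enqueue


async def enqueue(job_name: str, *args: Any, **kwargs: Any):
    """Enqueue a job by name; the worker registers a function under that name."""
    global _pool
    if _pool is None:
        # Imported lazily so importing scalr_queue stays cheap for apps
        # that never enqueue anything.
        from arq.connections import RedisSettings, create_pool

        _pool = await create_pool(RedisSettings.from_dsn(REDIS_URL))
    return await _pool.enqueue_job(job_name, *args, **kwargs)
```

App code then calls await enqueue("send_invoice_email", invoice_id) and never sees arq, which is what makes a later library swap mechanical.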

A SCALR app that uses background jobs runs two processes:

  • The backend (FastAPI, existing) calls await enqueue(...).
  • The worker is a separate container running python -m scalr_queue.worker <module>. Same image as the backend (so it can import the app's models, db helpers, etc.) but a different command.

The template-app's docker-compose.yml ships a commented-out worker stub. New apps scaffolded from the template inherit it; uncommenting it is the activation gesture when an app actually needs background work.
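For concreteness, the stub could look roughly like this once uncommented (service names, the APP_IMAGE variable, and the app.tasks module path are illustrative assumptions, not the template's actual contents):

```yaml
worker:
  # Same image as the backend, so the worker imports the app's models and helpers.
  image: ${APP_IMAGE}
  # Different command: run the queue worker against the app's task module.
  command: python -m scalr_queue.worker app.tasks
  environment:
    SCALR_REDIS_URL: redis://scalr-redis:6379/0
  networks:
    - scalr-edge
```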

Why arq, not Celery / RQ / Dramatiq / Temporal

We considered five options:

  • arq (chosen). Async-native, FastAPI-aligned, ~1k lines, single Redis dependency, built-in retries, cron-style scheduling, job results stored in Redis. Pythonic, modern.
  • Celery. The default for Django, mature. But: heavy, sync-by-default, many moving parts (broker + result backend + beat + flower for observability), kombu's serialization quirks. The sync model conflicts with FastAPI's async stack. Configuration burden is the largest of the four.
  • RQ. Tiny, sync, simpler than Celery. But sync-only — every task runs in its own thread/process; we'd give up the asyncio benefits the rest of the stack uses.
  • Dramatiq. Middle ground. Less momentum than Celery, less asyncio-friendly than arq.
  • Temporal. Workflow engine, much bigger scope than "background jobs." Right answer for complex stateful workflows; massively oversized for "send this email."

Three downsides of arq we accept:

  • Smaller ecosystem than Celery; fewer Stack Overflow answers.
  • No first-party admin UI (Celery has Flower; arq has none beyond arq health-check). When we adopt centralized logs (Tier 2 plan #5), Loki + Grafana panels cover the observability gap.
  • Less battle-tested than Celery at scale. Not relevant at our scale; if we outgrow it, the migration to Celery is mechanical (similar function-based task model).

Why a dedicated Redis, not reusing Authentik's

Authentik already runs a Redis at services/auth/redis as its Celery broker. We could share it via a different db index. We don't, because:

  • Isolation. A flooded app queue shouldn't be able to starve Authentik's task pipeline.
  • Tuning differences. Authentik's Redis is tuned for transient session caching. App jobs need durable AOF persistence. Different defaults.
  • Replaceability. If Authentik gets swapped wholesale (or moved to a managed provider), app jobs shouldn't be affected.
  • Operational clarity. "scalr-redis" being the only Redis a SCALR app talks to is one less thing to explain in the developer onboarding doc.

Cost is negligible — a second Redis container is ~6MB resident. AOF persistence with everysec fsync trades a tiny window of potential loss (one second of enqueued tasks on hard host crash) for vastly better throughput. Acceptable: any task that must survive any failure should persist its inputs to Postgres before enqueuing.
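The persistence tuning above maps to two standard Redis directives (the exact file layout under services/redis/ is an assumption):

```
# services/redis/redis.conf (path assumed)
appendonly yes        # durable AOF persistence for queued jobs
appendfsync everysec  # fsync once per second: ~1s loss window on hard host crash
```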

Why scalr-edge for queue traffic

Same reasoning as Authentik-behind-gateway (ADR 0011): backend ↔ Redis traffic is internal, doesn't need TLS, doesn't need to traverse the gateway. Backends and workers reach Redis at redis://scalr-redis:6379/0 over the scalr-edge Docker network.

If we ever federate or open scalr-edge to less-trusted code, we'll need to add Redis ACLs / TLS. Not relevant at single-host scale.

Consequences

Positive:

  • One queue convention across every app. New apps activate the pattern by uncommenting one stub.
  • from scalr_queue import enqueue is the only API most code touches. The arq dependency is encapsulated; swapping queue libraries in the future doesn't ripple.
  • Shared Redis means future cross-app patterns (rate limiters, locks, pub/sub) have somewhere to land without standing up new infra.

Negative:

  • One more service to keep up. Mitigated by the smoke-test pass catching a dead Redis (any enqueue() call fails fast).
  • arq's smaller ecosystem means we'll occasionally hit edges we'd have known about with Celery. An acceptable trade for the simplicity.
  • No admin UI (yet). Until centralized logs land, debugging a stuck worker means docker logs <app>-worker. Fine at single-app scale.

Future work

  • Cron / scheduled jobs. arq supports cron_jobs in its settings; add to WorkerSettings when an app needs them. No platform change.
  • Dead-letter handling. arq automatically retries failed jobs (default 5 attempts; a task can defer its next attempt by raising arq's Retry exception). When an app's tasks need a different retry policy, override it per task via arq's func() wrapper.
  • Observability. Once Loki is in place (Tier 2 #5), worker logs go to a Grafana panel. A few key metrics — queue depth, oldest pending job age, recent failures — would be a small follow-up.