ADR 0012: Background jobs via arq + shared Redis¶
Status: Accepted
Date: 2026-05-01
Context¶
SCALR apps will inevitably need work that doesn't belong in the request/response cycle: sending invoice emails, regenerating reports, cleaning up stale records, retrying failed webhooks, anything triggered on a schedule. Up to now the platform had no convention for this — the first app that needed it would have either bolted something on ad-hoc, or stood up a parallel Celery infrastructure in a hurry.
We want a convention agreed in writing before the first app needs it, so that:
- Every app uses the same queue, Redis, and worker shape.
- Adding a worker to a new app is a small docker-compose stub, not a research project.
- We don't end up with 3 different queue libraries across 5 apps.
Decision¶
Adopt arq as the background-job library, deployed against a
dedicated shared Redis at services/redis/ (separate from
Authentik's Redis). Wrap arq in a tiny packages/queue-py so apps
import from scalr_queue import enqueue rather than knowing about arq.
A SCALR app that uses background jobs runs two processes:
- The backend (FastAPI, existing) calls await enqueue(...).
- The worker is a separate container running python -m scalr_queue.worker <module>. It uses the same image as the backend (so it can import the app's models, db helpers, etc.) but a different command.
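The wrapper in packages/queue-py can be very small. A minimal sketch of what scalr_queue's enqueue might look like, assuming a SCALR_REDIS_URL environment variable (the variable name and module layout are illustrative, not decided in this ADR):

```python
# scalr_queue/__init__.py -- minimal sketch of the enqueue wrapper
import os

_pool = None  # module-level arq pool, created on first use


async def enqueue(task_name: str, *args, **kwargs):
    """Enqueue a job on the shared Redis; the only API most app code touches."""
    global _pool
    if _pool is None:
        # Lazy import, so app code that never enqueues never loads arq.
        from arq import create_pool
        from arq.connections import RedisSettings

        dsn = os.environ.get("SCALR_REDIS_URL", "redis://scalr-redis:6379/0")
        _pool = await create_pool(RedisSettings.from_dsn(dsn))
    return await _pool.enqueue_job(task_name, *args, **kwargs)
```

Backend code then calls await enqueue("send_invoice_email", invoice_id=42) without importing arq anywhere, which is what keeps a future queue-library swap contained to this one package.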
The template-app's docker-compose.yml ships a commented-out worker
stub. New apps scaffolded from the template inherit it; uncommenting
it is the activation gesture when an app actually needs background
work.
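Shown uncommented for readability, the stub might look something like this (service name, build context, and worker module are illustrative; the template's actual stub may differ):

```yaml
# template-app docker-compose.yml -- worker stub (normally shipped commented out)
services:
  worker:
    build: .                                      # same image as the backend
    command: python -m scalr_queue.worker app.worker
    environment:
      SCALR_REDIS_URL: redis://scalr-redis:6379/0
    networks:
      - scalr-edge

networks:
  scalr-edge:
    external: true
```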
Why arq, not Celery / RQ / Dramatiq / Temporal¶
We considered five families:
- arq (chosen). Async-native, FastAPI-aligned, ~1k lines, single Redis dependency, built-in retries with exponential backoff, cron-style scheduling, dead-letter queue. Pythonic, modern.
- Celery. The default for Django, mature. But: heavy, sync-by-default, many moving parts (broker + result backend + beat + flower for observability), kombu's serialization quirks. The sync model conflicts with FastAPI's async stack. Configuration burden is the largest of the four.
- RQ. Tiny, sync, simpler than Celery. But sync-only — every task runs in its own thread/process; we'd give up the asyncio benefits the rest of the stack uses.
- Dramatiq. Middle ground. Less momentum than Celery, less asyncio-friendly than arq.
- Temporal. Workflow engine, much bigger scope than "background jobs." Right answer for complex stateful workflows; massively oversized for "send this email."
arq's three downsides we accept:
- Smaller ecosystem than Celery; fewer Stack Overflow answers.
- No first-party admin UI (Celery has Flower; arq has none beyond its health check). When we adopt centralized logs (Tier 2 plan #5), Loki + Grafana panels cover the observability gap.
- Less battle-tested than Celery at scale. Not relevant at our scale; if we outgrow it, the migration to Celery is mechanical (similar function-based task model).
Why a dedicated Redis, not reusing Authentik's¶
Authentik already runs a Redis at services/auth/redis as its Celery
broker. We could share it via a different db index. We don't,
because:
- Isolation. A flooded app queue shouldn't be able to starve Authentik's task pipeline.
- Tuning differences. Authentik's Redis is tuned for transient session caching. App jobs need durable AOF persistence. Different defaults.
- Replaceability. If Authentik gets swapped wholesale (or moved to a managed provider), app jobs shouldn't be affected.
- Operational clarity. "scalr-redis" being the only Redis a SCALR app talks to is one less thing to explain in the developer onboarding doc.
Cost is negligible — a second Redis container is ~6MB resident. AOF
persistence with everysec fsync trades a tiny window of potential
loss (one second of enqueued tasks on hard host crash) for vastly
better throughput. Acceptable: any task that must survive any
failure should persist its inputs to Postgres before enqueuing.
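The durability-relevant part of the dedicated Redis config is two lines. A sketch of what services/redis/redis.conf would contain (settings per the trade-off above; the rest of the file is omitted):

```
# services/redis/redis.conf -- durability settings for the job queue
appendonly yes          # AOF persistence: enqueued jobs survive container restarts
appendfsync everysec    # fsync once per second: at most ~1s of loss on hard host crash
```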
Why scalr-edge for queue traffic¶
Same reasoning as Authentik-behind-gateway (ADR 0011): backend ↔ Redis
traffic is internal, doesn't need TLS, doesn't need to traverse the
gateway. Backends and workers reach Redis at
redis://scalr-redis:6379/0 over the scalr-edge Docker network.
If we ever federate or open scalr-edge to less-trusted code, we'll need to add Redis ACLs / TLS. Not relevant at single-host scale.
Consequences¶
Positive:
- One queue convention across every app. New apps activate the
pattern by uncommenting one stub.
- from scalr_queue import enqueue is the only API most code touches.
The arq dependency is encapsulated; swapping queue libs in the
future doesn't ripple.
- Shared Redis means future cross-app patterns (rate-limiters, locks,
pub/sub) have somewhere to land without standing up new infra.
Negative:
- One more service to keep running. Mitigated by the smoke-test pass
catching a dead Redis (any enqueue() call fails fast).
- arq's smaller ecosystem means we'll occasionally hit edges we'd
have known about with Celery. Acceptable trade for the simplicity.
- No admin UI (yet). Until centralized logs land, debugging a stuck
worker means docker logs <app>-worker. Fine at single-app scale.
Future work¶
- Cron / scheduled jobs. arq supports cron_jobs in its settings; add them to WorkerSettings when an app needs them. No platform change.
- Dead-letter handling. arq automatically retries with exponential backoff (default 5 attempts). When one app's tasks need a different retry policy, override per-task via the arq.func decorator.
- Observability. Once Loki is in place (Tier 2 #5), worker logs go to a Grafana panel. A few key metrics (queue depth, oldest pending job age, recent failures) would be a small follow-up.