ADR 0012: Background jobs via arq + shared Redis

Status: Accepted
Date: 2026-05-01

Context

SCALR apps will inevitably need work that doesn't belong in the request/response cycle: sending invoice emails, regenerating reports, cleaning up stale records, retrying failed webhooks, anything triggered on a schedule. Until now the platform had no convention for this — the first app that needed one would either have bolted something on ad hoc or stood up parallel Celery infrastructure in a hurry.

We want a convention agreed in writing before the first app needs it, so that:

  1. Every app uses the same queue, Redis, and worker shape.
  2. Adding a worker to a new app is a small docker-compose stub, not a research project.
  3. We don't end up with 3 different queue libraries across 5 apps.

Decision

Adopt arq as the background-job library, deployed against a dedicated shared Redis at services/redis/ (separate from Authentik's Redis). Wrap arq in a tiny packages/queue-py so apps write from scalr_queue import enqueue rather than depending on arq directly.
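The wrapper's public surface can be sketched in a few lines. Everything here — the SCALR_REDIS_URL variable, the lazy pool caching, the module shape — is an assumed sketch, not the actual packages/queue-py code; the create_pool / enqueue_job calls are arq's real API:

```python
import os
from typing import Any

# Default assumed from this ADR's convention; real package may differ.
REDIS_URL = os.environ.get("SCALR_REDIS_URL", "redis://scalr-redis:6379/0")

_pool = None  # cached arq connection pool, created on first enqueue


async def enqueue(job_name: str, *args: Any, **kwargs: Any):
    """Enqueue a job by name; the worker registers a function under that name."""
    global _pool
    if _pool is None:
        # Imported lazily so importing scalr_queue stays cheap for apps
        # that never enqueue anything.
        from arq.connections import RedisSettings, create_pool

        _pool = await create_pool(RedisSettings.from_dsn(REDIS_URL))
    return await _pool.enqueue_job(job_name, *args, **kwargs)
```

App code then calls await enqueue("send_invoice_email", invoice_id) and never sees arq, which is what makes a later library swap mechanical.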

A SCALR app that uses background jobs runs two processes:

  • The backend (FastAPI, existing) calls await enqueue(...).
  • The worker is a separate container running python -m scalr_queue.worker <module>. Same image as the backend (so it can import the app's models, db helpers, etc.) but a different command.

The template-app's docker-compose.yml ships a commented-out worker stub. New apps scaffolded from the template inherit it; uncommenting it is the activation gesture when an app actually needs background work.
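For concreteness, the stub could look roughly like this once uncommented (service names, the APP_IMAGE variable, and the app.tasks module path are illustrative assumptions, not the template's actual contents):

```yaml
worker:
  # Same image as the backend, so the worker imports the app's models and helpers.
  image: ${APP_IMAGE}
  # Different command: run the queue worker against the app's task module.
  command: python -m scalr_queue.worker app.tasks
  environment:
    SCALR_REDIS_URL: redis://scalr-redis:6379/0
  networks:
    - scalr-edge
```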

Why arq, not Celery / RQ / Dramatiq / Temporal

We considered five options:

  • arq (chosen). Async-native, FastAPI-aligned, ~1k lines, single Redis dependency, built-in retries, cron-style scheduling, job results stored in Redis. Pythonic, modern.
  • Celery. The default for Django, mature. But: heavy, sync-by-default, many moving parts (broker + result backend + beat + flower for observability), kombu's serialization quirks. The sync model conflicts with FastAPI's async stack. Configuration burden is the largest of the four.
  • RQ. Tiny, sync, simpler than Celery. But sync-only — every task runs in its own thread/process; we'd give up the asyncio benefits the rest of the stack uses.
  • Dramatiq. Middle ground. Less momentum than Celery, less asyncio-friendly than arq.
  • Temporal. Workflow engine, much bigger scope than "background jobs." Right answer for complex stateful workflows; massively oversized for "send this email."

Three downsides of arq we accept:

  • Smaller ecosystem than Celery; fewer Stack Overflow answers.
  • No first-party admin UI (Celery has Flower; arq has none beyond arq health-check). When we adopt centralized logs (Tier 2 plan #5), Loki + Grafana panels cover the observability gap.
  • Less battle-tested than Celery at scale. Not relevant at our scale; if we outgrow it, the migration to Celery is mechanical (similar function-based task model).

Why a dedicated Redis, not reusing Authentik's

Authentik already runs a Redis at services/auth/redis as its Celery broker. We could share it via a different db index. We don't, because:

  • Isolation. A flooded app queue shouldn't be able to starve Authentik's task pipeline.
  • Tuning differences. Authentik's Redis is tuned for transient session caching. App jobs need durable AOF persistence. Different defaults.
  • Replaceability. If Authentik gets swapped wholesale (or moved to a managed provider), app jobs shouldn't be affected.
  • Operational clarity. "scalr-redis" being the only Redis a SCALR app talks to is one less thing to explain in the developer onboarding doc.

Cost is negligible — a second Redis container is ~6MB resident. AOF persistence with everysec fsync trades a tiny window of potential loss (one second of enqueued tasks on hard host crash) for vastly better throughput. Acceptable: any task that must survive any failure should persist its inputs to Postgres before enqueuing.
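The persistence tuning above maps to two standard Redis directives (the exact file layout under services/redis/ is an assumption):

```
# services/redis/redis.conf (path assumed)
appendonly yes        # durable AOF persistence for queued jobs
appendfsync everysec  # fsync once per second: ~1s loss window on hard host crash
```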

Why scalr-edge for queue traffic

Same reasoning as Authentik-behind-gateway (ADR 0011): backend ↔ Redis traffic is internal, doesn't need TLS, doesn't need to traverse the gateway. Backends and workers reach Redis at redis://scalr-redis:6379/0 over the scalr-edge Docker network.

If we ever federate or open scalr-edge to less-trusted code, we'll need to add Redis ACLs / TLS. Not relevant at single-host scale.

Consequences

Positive:

  • One queue convention across every app. New apps activate the pattern by uncommenting one stub.
  • from scalr_queue import enqueue is the only API most code touches. The arq dependency is encapsulated; swapping queue libraries in the future doesn't ripple.
  • Shared Redis means future cross-app patterns (rate limiters, locks, pub/sub) have somewhere to land without standing up new infra.

Negative:

  • One more service to keep up. Mitigated by the smoke-test pass catching a dead Redis (any enqueue() call fails fast).
  • arq's smaller ecosystem means we'll occasionally hit edges we'd have known about with Celery. An acceptable trade for the simplicity.
  • No admin UI (yet). Until centralized logs land, debugging a stuck worker means docker logs <app>-worker. Fine at single-app scale.

Future work

  • Cron / scheduled jobs. arq supports cron_jobs in its settings; add to WorkerSettings when an app needs them. No platform change.
  • Dead-letter handling. arq automatically retries failed jobs (default 5 attempts; a task can defer its next attempt by raising arq's Retry exception). When an app's tasks need a different retry policy, override it per task via arq's func() wrapper.
  • Observability. Once Loki is in place (Tier 2 #5), worker logs go to a Grafana panel. A few key metrics — queue depth, oldest pending job age, recent failures — would be a small follow-up.