ADR 0010: Postgres backups via restic to S3-compatible storage

Status: Accepted
Date: 2026-05-01

Context

Every SCALR app has a single Postgres container with one Docker volume on the deployment host. Authentik's user database is the same shape. There are no snapshots, no replicas, no off-host copies. One docker volume rm typo, one disk failure, one host loss — every customer's data is gone, including the auth database needed to sign in and assess the damage.

This is the largest single existential risk on the platform and PLAN.md Milestone 8 has called it out as outstanding since the first cut.

We need backups that are:

  1. Off-host. A copy on the same disk only protects against rm, not hardware failure or ransomware.
  2. Encrypted at rest. Whatever third-party storage we land on, we shouldn't be trusting them with cleartext customer data.
  3. Cheap. Solo operator at zero-revenue stage. Backups should cost well under $5/month for the foreseeable future.
  4. Easy to restore from. A backup you've never restored is folklore. The restore command must be one short script.
  5. Auto-discovering. New apps added via make new-app should get backed up the next night with no edit to the backup config.

Decision

Adopt restic as the backup engine, deployed as a new services/backup/ container. Persist to S3-compatible object storage with Cloudflare R2 as the recommended provider. Run on a daily cron inside the container; retain 30 daily + 4 weekly + 6 monthly snapshots.

Discovery walks docker ps for any running container whose name matches scalr-*-db plus the explicit scalr-auth-postgres. Each Postgres is dumped via docker exec ... pg_dump -Fc and streamed straight into a restic snapshot tagged with the container name — no cleartext on disk.
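A minimal sketch of that discovery-and-dump loop (function names are illustrative, and the postgres user/db defaults are assumptions — the deployed script reads the real values from each container's env):

```shell
discover_dbs() {
  # Running app DB containers by naming convention, plus the auth DB.
  docker ps --format '{{.Names}}' | grep -E '^scalr-.*-db$' || true
  echo scalr-auth-postgres
}

backup_one() {
  # User/db assumed here; the real script reads POSTGRES_USER and
  # POSTGRES_DB from the container env before dumping.
  docker exec "$1" pg_dump -Fc -U postgres postgres \
    | restic backup --stdin --stdin-filename "$1.dump" --tag "$1"
}

backup_all() {
  discover_dbs | while read -r name; do backup_one "$name"; done
}
```

Because pg_dump's stdout is piped straight into restic's --stdin, the cleartext dump never exists as a file anywhere.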

Restore is a single script: infra/scripts/restore.sh <db-container> <snapshot-id-or-latest>. It pipes restic's output back through pg_restore into the live container after dropping and recreating the target DB.
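A hedged sketch of what the core of restore.sh might look like (argument handling, the confirmation prompt, and credential lookup are omitted; the user/db defaults are assumptions):

```shell
restore_db() {
  container="$1"
  snapshot="${2:-latest}"
  # Assumed defaults; the real script derives these from the container env.
  user="${3:-postgres}"
  db="${4:-app}"
  docker exec "$container" psql -U "$user" -d postgres \
    -c "DROP DATABASE IF EXISTS $db"
  docker exec "$container" psql -U "$user" -d postgres \
    -c "CREATE DATABASE $db"
  # The snapshot filename matches the --stdin-filename used at backup time.
  restic dump "$snapshot" "$container.dump" \
    | docker exec -i "$container" pg_restore -U "$user" -d "$db"
}
```

As with backup, the restore path is a pure pipe: restic dump writes to stdout and pg_restore reads stdin, so no intermediate files touch disk.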

Why restic

  • Single static binary (~25MB). Drops into alpine cleanly. No language runtime, no daemon, no central server.
  • Native S3-compatible support. R2, B2, AWS S3, MinIO, even SFTP and local paths — same backend interface, swap by changing one env var.
  • Encryption at rest by default. Everything is encrypted with a passphrase before it leaves the container. The bucket sees only opaque blobs.
  • Content-addressed deduplication. Daily snapshots of a slowly-changing DB cost almost nothing — 95%+ of blocks are shared with yesterday's snapshot. A 1GB DB doesn't cost 30GB after 30 days; it costs ~1.2GB.
  • Built-in retention. restic forget --keep-daily 30 --keep-weekly 4 --keep-monthly 6 --prune is one command, idempotent, runs at the end of every backup pass. No bucket-side lifecycle policies needed.
  • Streamable restore. restic dump <id> <file> writes to stdout, so the restore pipeline is restic dump | pg_restore with no intermediate files.
  • Active maintenance. Repository at github.com/restic/restic, last release within recent months, large user base.

Why not the alternatives

  • pg_dump to a host volume only. Doesn't survive disk loss. Was briefly considered as a stepping stone; dismissed because the marginal cost of also pushing to S3 is one env var.
  • pgBackRest. Heavyweight; designed for WAL streaming and point-in-time recovery on multi-instance Postgres clusters. Overkill at our scale and the configuration burden is real.
  • pg_basebackup + WAL archiving. Same story — gives us PITR which we don't need yet, in exchange for orchestration complexity that outweighs the value.
  • Managed Postgres (Neon, Supabase, RDS). Would solve backups but forces a much larger architectural change (per-app DB → external host → network policy → cost) that's out of scope for this ADR. When we outgrow single-host Postgres, that's the right time to revisit.
  • Bucket-side snapshots only. R2 has versioning; relying on it would protect against deletion but not against a corrupted dump being written. Restic's content-addressed model is more robust.
  • Borg. Similar feature set to restic but only natively supports SSH/local repos; needs rclone or similar for S3, which adds complexity restic gives us natively.

R2's free tier covers our needs at zero scale:

  • 10 GB stored for free, $0.015/GB-month after.
  • No egress fees ever. This is the killer feature: when we eventually need to restore, we don't pay per-byte to pull the dump back.
  • S3-compatible API. Restic talks to it identically.

B2 is a close second: same shape and 10GB stored free, but egress is only free up to 3× the stored amount per month (still far more than we need). AWS S3 is fine but has egress costs that bite during restore drills.

The compose file is provider-agnostic — switching to B2 means changing RESTIC_REPOSITORY from s3:https://...r2.cloudflarestorage.com/... to s3:s3.<region>.backblazeb2.com/.... No code change.
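For illustration, the relevant environment might look like this (bucket name, account id, and region are placeholders, not real values):

```shell
# R2 (current provider).
export RESTIC_REPOSITORY='s3:https://ACCOUNT_ID.r2.cloudflarestorage.com/scalr-backups'
# B2 would be the same shape — swap one variable, change nothing else:
# export RESTIC_REPOSITORY='s3:s3.REGION.backblazeb2.com/scalr-backups'
export AWS_ACCESS_KEY_ID='...'       # provider access key
export AWS_SECRET_ACCESS_KEY='...'   # provider secret key
export RESTIC_PASSWORD='...'         # losing this means losing every backup
```

Restic's S3 backend reads the standard AWS credential variables regardless of which S3-compatible provider sits behind the endpoint.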

Why docker exec for discovery, not networked pg_dump

Two options were considered for reaching each app's Postgres:

  1. Network: backup container joins each app's DB network and runs pg_dump -h <db-host> against it.
  2. Docker exec: backup container bind-mounts /var/run/docker.sock and runs pg_dump inside each DB container.

We chose (2) because:

  • No per-app compose edits. Option (1) requires every db service to join a shared backup network, which means a compose edit in every scaffolded app from now on (and a sweep of existing apps).
  • No DB credential plumbing. docker inspect reads the POSTGRES_USER/POSTGRES_DB from each container's env at backup time — no need to also wire those into the backup container.
  • No new attack surface. Apps' DB ports stay private to their own network. The backup container's privilege (Docker socket) is comparable to the gateway's socket-proxy, which we already accept.
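The no-credential-plumbing point can be sketched with one hypothetical helper (the name is assumed, not the shipped script's):

```shell
# Read a single variable out of a container's environment via docker
# inspect, so DB credentials never need to be duplicated into the
# backup container's own config.
container_env() {
  docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' "$1" \
    | sed -n "s/^$2=//p"
}

# usage: user=$(container_env scalr-myapp-db POSTGRES_USER)
```

The credentials are read fresh at backup time, so rotating a DB password in an app's compose file requires no change on the backup side.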

The trade-off is that the backup container has root-equivalent access to the host (Docker socket = root). That's documented in the README's "Failure modes" section. If that bothers us later, we can put a tecnativa/docker-socket-proxy in front of it, limiting it to container list/inspect/exec only — the same pattern Traefik uses.

Consequences

Positive:

  • One existential failure mode closed for ~$0/month and half a day of work.
  • Auto-discovers new apps; future scaffolds need zero backup config.
  • Encryption-at-rest by default, retention by default, dedup by default.
  • Same code path in dev (local repo) and prod (S3) — the test surface matches reality.

Negative:

  • Container has Docker socket access. Mitigation: documented; can be socket-proxied later if the threat model tightens.
  • Restic password loss = backups gone forever. Mitigation: documented in multiple places; the password is generated long, stored in .env.shared (which is treated as a secret), and we recommend a password-manager copy.
  • docker exec pg_dump runs inside the app's DB container, briefly using its CPU/memory. At 03:00 UTC this is fine; if any DB has a heavy nightly job it should run after backups complete.
  • Restore is a destructive operation (drops the target DB). The script prompts for confirmation by default. Documented.
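The confirmation gate on the destructive restore might look like this (a sketch of the shape, not the shipped script):

```shell
# Require the operator to type the exact container name before the
# drop/recreate proceeds; anything else aborts the restore.
confirm_destructive() {
  printf 'This will DROP and recreate the DB in %s. Type the container name to continue: ' "$1" >&2
  read -r answer
  [ "$answer" = "$1" ]
}
```

Typing the full container name (rather than y/n) makes it harder to confirm the wrong target out of muscle memory.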

Migration / future work

  • Add an alerting hook so failed backups page someone (failures currently surface only in docker logs scalr-backup and via the container exit code).
  • Monthly automated restore-drill (a CI job that pulls the latest snapshot into a throwaway container and asserts row counts) — closes the "backup you've never restored" loophole.
  • When we adopt centralized logs (Tier 2 plan item #5), backup container logs go to Loki and a Grafana panel surfaces backup health.
  • When/if we move to managed Postgres, this layer is no longer needed for those apps; tagging snapshots by container name makes the cutover easy (migrated apps drop out of discovery once their containers are gone).