ADR 0010: Postgres backups via restic to S3-compatible storage

Status: Accepted
Date: 2026-05-01

Context

Every SCALR app has a single Postgres container with one Docker volume on the deployment host. Authentik's user database is the same shape. There are no snapshots, no replicas, no off-host copies. One docker volume rm typo, one disk failure, one host loss — every customer's data is gone, including the auth database needed to sign in and assess the damage.

This is the largest single existential risk on the platform and PLAN.md Milestone 8 has called it out as outstanding since the first cut.

We need backups that are:

  1. Off-host. A copy on the same disk only protects against rm, not hardware failure or ransomware.
  2. Encrypted at rest. Whatever third-party storage we land on, we shouldn't be trusting them with cleartext customer data.
  3. Cheap. Solo operator at zero-revenue stage. Backups should cost well under $5/month for the foreseeable future.
  4. Easy to restore from. A backup you've never restored is folklore. The restore command must be one short script.
  5. Auto-discovering. New apps added via make new-app should get backed up the next night with no edit to the backup config.

Decision

Adopt restic as the backup engine, deployed as a new services/backup/ container. Persist to S3-compatible object storage with Cloudflare R2 as the recommended provider. Run on a daily cron inside the container; retain 30 daily + 4 weekly + 6 monthly snapshots.

Discovery walks docker ps for any running container whose name matches scalr-*-db plus the explicit scalr-auth-postgres. Each Postgres is dumped via docker exec ... pg_dump -Fc and streamed straight into a restic snapshot tagged with the container name — no cleartext on disk.
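A minimal sketch of that discovery-and-dump loop (function names are illustrative, and the postgres user/db defaults are assumptions — the deployed script reads the real values from each container's env):

```shell
discover_dbs() {
  # Running app DB containers by naming convention, plus the auth DB.
  docker ps --format '{{.Names}}' | grep -E '^scalr-.*-db$' || true
  echo scalr-auth-postgres
}

backup_one() {
  # User/db assumed here; the real script reads POSTGRES_USER and
  # POSTGRES_DB from the container env before dumping.
  docker exec "$1" pg_dump -Fc -U postgres postgres \
    | restic backup --stdin --stdin-filename "$1.dump" --tag "$1"
}

backup_all() {
  discover_dbs | while read -r name; do backup_one "$name"; done
}
```

Because pg_dump's stdout is piped straight into restic's --stdin, the cleartext dump never exists as a file anywhere.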

Restore is a single script: infra/scripts/restore.sh <db-container> <snapshot-id-or-latest>. It pipes restic's output back through pg_restore into the live container after dropping and recreating the target DB.
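A hedged sketch of what the core of restore.sh might look like (argument handling, the confirmation prompt, and credential lookup are omitted; the user/db defaults are assumptions):

```shell
restore_db() {
  container="$1"
  snapshot="${2:-latest}"
  # Assumed defaults; the real script derives these from the container env.
  user="${3:-postgres}"
  db="${4:-app}"
  docker exec "$container" psql -U "$user" -d postgres \
    -c "DROP DATABASE IF EXISTS $db"
  docker exec "$container" psql -U "$user" -d postgres \
    -c "CREATE DATABASE $db"
  # The snapshot filename matches the --stdin-filename used at backup time.
  restic dump "$snapshot" "$container.dump" \
    | docker exec -i "$container" pg_restore -U "$user" -d "$db"
}
```

As with backup, the restore path is a pure pipe: restic dump writes to stdout and pg_restore reads stdin, so no intermediate files touch disk.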

Why restic

  • Single static binary (~25MB). Drops into alpine cleanly. No language runtime, no daemon, no central server.
  • Native S3-compatible support. R2, B2, AWS S3, MinIO, even SFTP and local paths — same backend interface, swap by changing one env var.
  • Encryption at rest by default. Everything is encrypted with a passphrase before it leaves the container. The bucket sees only opaque blobs.
  • Content-addressed deduplication. Daily snapshots of a slowly-changing DB cost almost nothing — 95%+ of blocks are shared with yesterday's snapshot. A 1GB DB doesn't cost 30GB after 30 days; it costs ~1.2GB.
  • Built-in retention. restic forget --keep-daily 30 --keep-weekly 4 --keep-monthly 6 --prune is one command, idempotent, runs at the end of every backup pass. No bucket-side lifecycle policies needed.
  • Streamable restore. restic dump <id> <file> writes to stdout, so the restore pipeline is restic dump | pg_restore with no intermediate files.
  • Active maintenance. Repository at github.com/restic/restic, last release within recent months, large user base.

Why not the alternatives

  • pg_dump to a host volume only. Doesn't survive disk loss. Was briefly considered as a stepping stone; dismissed because the marginal cost of also pushing to S3 is one env var.
  • pgBackRest. Heavyweight; designed for WAL streaming and point-in-time recovery on multi-instance Postgres clusters. Overkill at our scale and the configuration burden is real.
  • pg_basebackup + WAL archiving. Same story — gives us PITR which we don't need yet, in exchange for orchestration complexity that outweighs the value.
  • Managed Postgres (Neon, Supabase, RDS). Would solve backups but forces a much larger architectural change (per-app DB → external host → network policy → cost) that's out of scope for this ADR. When we outgrow single-host Postgres, that's the right time to revisit.
  • Bucket-side snapshots only. R2 has versioning; relying on it would protect against deletion but not against a corrupted dump being written. Restic's content-addressed model is more robust.
  • Borg. Similar feature set to restic but only natively supports SSH/local repos; needs rclone or similar for S3, which adds complexity restic gives us natively.

R2's free tier covers our needs at zero scale:

  • 10 GB stored for free, $0.015/GB-month after.
  • No egress fees ever. This is the killer feature: when we eventually need to restore, we don't pay per-byte to pull the dump back.
  • S3-compatible API. Restic talks to it identically.

B2 is a close second: same shape and 10GB stored free, but egress is only free up to 3× the stored amount per month (still far more than we need). AWS S3 is fine but has egress costs that bite during restore drills.

The compose file is provider-agnostic — switching to B2 means changing RESTIC_REPOSITORY from s3:https://...r2.cloudflarestorage.com/... to s3:s3.<region>.backblazeb2.com/.... No code change.
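For illustration, the relevant environment might look like this (bucket name, account id, and region are placeholders, not real values):

```shell
# R2 (current provider).
export RESTIC_REPOSITORY='s3:https://ACCOUNT_ID.r2.cloudflarestorage.com/scalr-backups'
# B2 would be the same shape — swap one variable, change nothing else:
# export RESTIC_REPOSITORY='s3:s3.REGION.backblazeb2.com/scalr-backups'
export AWS_ACCESS_KEY_ID='...'       # provider access key
export AWS_SECRET_ACCESS_KEY='...'   # provider secret key
export RESTIC_PASSWORD='...'         # losing this means losing every backup
```

Restic's S3 backend reads the standard AWS credential variables regardless of which S3-compatible provider sits behind the endpoint.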

Why docker exec for discovery, not networked pg_dump

Two options were considered for reaching each app's Postgres:

  1. Network: backup container joins each app's DB network and runs pg_dump -h <db-host> against it.
  2. Docker exec: backup container bind-mounts /var/run/docker.sock and runs pg_dump inside each DB container.

We chose (2) because:

  • No per-app compose edits. Option (1) requires every db service to join a shared backup network, which means a compose edit in every scaffolded app from now on (and a sweep of existing apps).
  • No DB credential plumbing. docker inspect reads the POSTGRES_USER/POSTGRES_DB from each container's env at backup time — no need to also wire those into the backup container.
  • No new attack surface. Apps' DB ports stay private to their own network. The backup container's privilege (Docker socket) is comparable to the gateway's socket-proxy, which we already accept.
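The no-credential-plumbing point can be sketched with one hypothetical helper (the name is assumed, not the shipped script's):

```shell
# Read a single variable out of a container's environment via docker
# inspect, so DB credentials never need to be duplicated into the
# backup container's own config.
container_env() {
  docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' "$1" \
    | sed -n "s/^$2=//p"
}

# usage: user=$(container_env scalr-myapp-db POSTGRES_USER)
```

The credentials are read fresh at backup time, so rotating a DB password in an app's compose file requires no change on the backup side.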

The trade-off is that the backup container has root-equivalent access to the host (Docker socket = root). That's documented in the README's "Failure modes" section. If that bothers us later, we can put a tecnativa/docker-socket-proxy in front of it, limiting it to container list/inspect/exec only — the same pattern Traefik uses.

Consequences

Positive:

  • One existential failure mode closed for ~$0/month and half a day of work.
  • Auto-discovers new apps; future scaffolds need zero backup config.
  • Encryption-at-rest by default, retention by default, dedup by default.
  • Same code path in dev (local repo) and prod (S3) — the test surface matches reality.

Negative:

  • Container has Docker socket access. Mitigation: documented; can be socket-proxied later if the threat model tightens.
  • Restic password loss = backups gone forever. Mitigation: documented in multiple places; the password is generated long, stored in .env.shared (which is treated as a secret), and we recommend a password-manager copy.
  • docker exec pg_dump runs inside the app's DB container, briefly using its CPU/memory. At 03:00 UTC this is fine; if any DB has a heavy nightly job it should run after backups complete.
  • Restore is a destructive operation (drops the target DB). The script prompts for confirmation by default. Documented.
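The confirmation gate on the destructive restore might look like this (a sketch of the shape, not the shipped script):

```shell
# Require the operator to type the exact container name before the
# drop/recreate proceeds; anything else aborts the restore.
confirm_destructive() {
  printf 'This will DROP and recreate the DB in %s. Type the container name to continue: ' "$1" >&2
  read -r answer
  [ "$answer" = "$1" ]
}
```

Typing the full container name (rather than y/n) makes it harder to confirm the wrong target out of muscle memory.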

Migration / future work

  • Add an alerting hook so failed backups page someone (failures currently surface only in docker logs scalr-backup and via the container exit code).
  • Monthly automated restore-drill (a CI job that pulls the latest snapshot into a throwaway container and asserts row counts) — closes the "backup you've never restored" loophole.
  • When we adopt centralized logs (Tier 2 plan item #5), backup container logs go to Loki and a Grafana panel surfaces backup health.
  • When/if we move to managed Postgres, this layer is no longer needed for those apps; tagging snapshots by container name makes the cutover easy (migrated apps drop out of discovery once their containers are gone).