# ADR 0010: Postgres backups via restic to S3-compatible storage
Status: Accepted
Date: 2026-05-01
## Context
Every SCALR app has a single Postgres container with one Docker volume on
the deployment host. Authentik's user database is the same shape. There
are no snapshots, no replicas, no off-host copies. One `docker volume rm`
typo, one disk failure, one host loss — every customer's data is gone,
including the auth database needed to sign in and assess the damage.
This is the largest single existential risk on the platform, and PLAN.md
Milestone 8 has called it out as outstanding since the first cut.
We need backups that are:
- Off-host. A copy on the same disk only protects against `rm`, not hardware failure or ransomware.
- Encrypted at rest. Whatever third-party storage we land on, we shouldn't be trusting them with cleartext customer data.
- Cheap. Solo operator at zero-revenue stage. Backups should cost well under $5/month for the foreseeable future.
- Easy to restore from. A backup you've never restored is folklore. The restore command must be one short script.
- Auto-discovering. New apps added via `make new-app` should get backed up the next night with no edit to the backup config.
## Decision
Adopt restic as the backup engine, deployed as a new
`services/backup/` container. Persist to S3-compatible object storage,
with Cloudflare R2 as the recommended provider. Run on a daily cron
inside the container; retain 30 daily + 4 weekly + 6 monthly snapshots.

Discovery walks `docker ps` for any running container whose name matches
`scalr-*-db`, plus the explicit `scalr-auth-postgres`. Each Postgres is
dumped via `docker exec ... pg_dump -Fc` and streamed straight into a
restic snapshot tagged with the container name — no cleartext on disk.
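As a sketch, the nightly pass could look like the following (the entrypoint script, the `is_backup_target`/`backup_one` helper names, and the restic env wiring are illustrative assumptions, not the shipped implementation):

```shell
#!/usr/bin/env sh
# Illustrative sketch only; assumes RESTIC_REPOSITORY and RESTIC_PASSWORD
# (or RESTIC_PASSWORD_FILE) are already set in the container's environment.
set -eu

# Does this container name qualify for backup? (pure, testable)
is_backup_target() {
  case "$1" in
    scalr-*-db|scalr-auth-postgres) return 0 ;;
    *) return 1 ;;
  esac
}

# Dump one Postgres container and stream it into restic: no cleartext on disk.
backup_one() {
  user=$(docker exec "$1" printenv POSTGRES_USER)
  db=$(docker exec "$1" printenv POSTGRES_DB)
  docker exec "$1" pg_dump -Fc -U "$user" "$db" \
    | restic backup --stdin --stdin-filename "$1.dump" --tag "$1"
}

# Only touch Docker when a daemon is actually reachable.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  docker ps --format '{{.Names}}' | while read -r name; do
    if is_backup_target "$name"; then backup_one "$name"; fi
  done
  # Retention runs at the end of every pass; idempotent.
  restic forget --keep-daily 30 --keep-weekly 4 --keep-monthly 6 --prune
fi
```

The `--stdin-filename`/`--tag` pair is what lets retention and restore later address snapshots per container.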
Restore is a single script: `infra/scripts/restore.sh <db-container>
<snapshot-id-or-latest>`. It pipes restic's output back through
`pg_restore` into the live container after dropping and recreating the
target DB.
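A hedged sketch of what `infra/scripts/restore.sh` does (the prompt wording and helper names are assumptions; `pg_restore --no-owner` is one plausible way to sidestep role-ownership mismatches, not necessarily what the real script passes):

```shell
#!/usr/bin/env sh
# Illustrative sketch of restore.sh, not the shipped script.
set -eu

# Pure helper: SQL to drop and recreate the target database.
recreate_sql() {
  printf 'DROP DATABASE IF EXISTS "%s"; CREATE DATABASE "%s";' "$1" "$1"
}

restore() {
  container="$1"; snapshot="${2:-latest}"
  user=$(docker exec "$container" printenv POSTGRES_USER)
  db=$(docker exec "$container" printenv POSTGRES_DB)
  # Destructive: confirm before dropping the live DB.
  printf 'About to DROP and recreate %s on %s. Continue? [y/N] ' "$db" "$container"
  read -r answer; [ "$answer" = y ] || exit 1
  # Drop/recreate via the maintenance DB, then stream the dump back in.
  recreate_sql "$db" | docker exec -i "$container" psql -U "$user" -d postgres
  restic dump "$snapshot" "$container.dump" \
    | docker exec -i "$container" pg_restore -U "$user" -d "$db" --no-owner
}

if [ "$#" -ge 1 ] && command -v docker >/dev/null 2>&1 \
    && docker info >/dev/null 2>&1; then
  restore "$@"
fi
```

Because `restic dump` writes to stdout, there is no intermediate file at any point in the pipeline.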
## Why restic
- Single static binary (~25 MB). Drops into Alpine cleanly. No language runtime, no daemon, no central server.
- Native S3-compatible support. R2, B2, AWS S3, MinIO, even SFTP and local paths — same backend interface, swap by changing one env var.
- Encryption at rest by default. Everything is encrypted with a passphrase before it leaves the container. The bucket sees only opaque blobs.
- Content-addressed deduplication. Daily snapshots of a slowly-changing DB cost almost nothing — 95%+ of blocks are shared with yesterday's snapshot. A 1 GB DB doesn't cost 30 GB after 30 days; it costs ~1.2 GB.
- Built-in retention. `restic forget --keep-daily 30 --keep-weekly 4 --keep-monthly 6 --prune` is one command, idempotent, and runs at the end of every backup pass. No bucket-side lifecycle policies needed.
- Streamable restore. `restic dump <id> <file>` writes to stdout, so the restore pipeline is `restic dump | pg_restore` with no intermediate files.
- Active maintenance. Repository at github.com/restic/restic, last release within recent months, large user base.
## Why not the alternatives
- `pg_dump` to a host volume only. Doesn't survive disk loss. Was briefly considered as a stepping stone; dismissed because the marginal cost of also pushing to S3 is one env var.
- pgBackRest. Heavyweight; designed for WAL streaming and point-in-time recovery on multi-instance Postgres clusters. Overkill at our scale, and the configuration burden is real.
- `pg_basebackup` + WAL archiving. Same story — gives us PITR, which we don't need yet, in exchange for orchestration complexity that outweighs the value.
- Managed Postgres (Neon, Supabase, RDS). Would solve backups but forces a much larger architectural change (per-app DB → external host → network policy → cost) that's out of scope for this ADR. When we outgrow single-host Postgres, that's the right time to revisit.
- Bucket-side snapshots only. R2 has versioning; relying on it would protect against deletion but not against a corrupted dump being written. Restic's content-addressed model is more robust.
- Borg. Similar feature set to restic, but it natively supports only SSH/local repos; it needs `rclone` or similar for S3, which adds complexity restic handles natively.
## Why Cloudflare R2 (recommended) over Backblaze B2 / AWS S3
R2's free tier covers our needs at zero scale:
- 10 GB stored for free, $0.015/GB-month after.
- No egress fees ever. This is the killer feature: when we eventually need to restore, we don't pay per-byte to pull the dump back.
- S3-compatible API. Restic talks to it identically.
B2 is a close second: same shape, slightly worse terms (10 GB free, egress free only up to 3× the stored amount per month — still more than enough for us). AWS S3 is fine, but its egress costs bite during restore drills.
The compose file is provider-agnostic — switching to B2 means changing
`RESTIC_REPOSITORY` from `s3:https://...r2.cloudflarestorage.com/...`
to `s3:s3.<region>.backblazeb2.com/...`. No code change.
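For illustration, the swap is an env-file edit along these lines (bucket name, placeholders, and the secrets path are hypothetical; the elided parts of the real URLs are intentionally not filled in here):

```shell
# Hypothetical .env fragment — all values below are placeholders.
# R2 (current):
export RESTIC_REPOSITORY="s3:https://<account-id>.r2.cloudflarestorage.com/<bucket>"
# B2 (swap target — uncomment and comment the line above; rotate the keys too):
# export RESTIC_REPOSITORY="s3:s3.<region>.backblazeb2.com/<bucket>"

# restic reads S3 credentials from the standard AWS env vars.
export AWS_ACCESS_KEY_ID="<key-id>"
export AWS_SECRET_ACCESS_KEY="<secret>"
export RESTIC_PASSWORD_FILE="/run/secrets/restic-password"
```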
## Why `docker exec` for discovery, not networked `pg_dump`
Two options were considered for reaching each app's Postgres:

1. Network: the backup container joins each app's DB network and runs `pg_dump -h <db-host>` against it.
2. Docker exec: the backup container bind-mounts `/var/run/docker.sock` and runs `pg_dump` inside each DB container.
We chose (2) because:
- No per-app compose edits. Option (1) requires every `db` service to publish itself on a shared backup network, which means touching every scaffolded app from now on (and a sweep of existing apps).
- No DB credential plumbing. `docker inspect` reads `POSTGRES_USER`/`POSTGRES_DB` from each container's env at backup time — no need to also wire those into the backup container.
- No new attack surface. Apps' DB ports stay private to their own network. The backup container's privilege (Docker socket) is comparable to the gateway's socket-proxy, which we already accept.
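As an illustration of that credential read, the extraction is a one-liner over `docker inspect` output (the container name below is a made-up example):

```shell
# Sketch: pull a KEY=value out of a container's env without wiring
# credentials into the backup container. Pure helper is testable offline.
set -eu

# Extract the value for KEY from env lines on stdin.
env_value() { sed -n "s/^$1=//p"; }

# Against a live container (runs only when a Docker daemon is reachable;
# "scalr-widgets-db" is a hypothetical container name):
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' scalr-widgets-db \
    | env_value POSTGRES_USER
fi
```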
The trade-off is that the backup container has root-equivalent on the
host (Docker socket = root). That's documented in the README's "Failure
modes" section. If that bothers us later, we can put a
`tecnativa/docker-socket-proxy` in front of it, limiting it to
container list/inspect/exec only — same pattern Traefik uses.
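If we do tighten it, a compose sketch could look like this (service names and values are assumptions, and the proxy's permission flags should be verified against the image's documentation before relying on them):

```yaml
# Hypothetical hardening fragment, not the current compose file.
services:
  backup-socket-proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      CONTAINERS: 1   # allow listing/inspecting containers
      EXEC: 1         # allow the exec endpoints for pg_dump
      POST: 1         # exec creation is a POST request
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
  backup:
    environment:
      DOCKER_HOST: tcp://backup-socket-proxy:2375
```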
## Consequences
Positive:

- One existential failure mode closed for ~$0/month and half a day of work.
- Auto-discovers new apps; future scaffolds need zero backup config.
- Encryption at rest by default, retention by default, dedup by default.
- Same code path in dev (local repo) and prod (S3) — the test surface matches reality.
Negative:
- Container has Docker socket access. Mitigation: documented; can be
socket-proxied later if threat model tightens.
- Restic password loss = backups gone forever. Mitigation: documented in multiple places; the password is generated long, stored in `.env.shared` (which is treated as a secret), and we recommend a password-manager copy.
- `docker exec pg_dump` runs inside the app's DB container, briefly using its CPU/memory. At 03:00 UTC this is fine; if any DB has a heavy nightly job, it should run after backups complete.
- Restore is a destructive operation (drops the target DB). The script
prompts for confirmation by default. Documented.
## Migration / future work
- Add an alerting hook so failed backups page someone (currently failures surface only in `docker logs scalr-backup` and via the container exit code).
- Monthly automated restore drill (a CI job that pulls the latest snapshot into a throwaway container and asserts row counts) — closes the "backup you've never restored" loophole.
- When we adopt centralized logs (Tier 2 plan item #5), backup container logs go to Loki and a Grafana panel surfaces backup health.
- When/if we move to managed Postgres, this layer is no longer needed for those apps; tagging snapshots by container name makes the cutover easy (just stop tagging migrated apps).
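A sketch of what that restore drill could assert (the container name, snapshot file, and the `users` table are all assumptions for illustration):

```shell
# Hypothetical monthly restore-drill sketch, not an existing CI job.
set -eu

# Pure helper: fail unless the restored row count is a positive integer.
assert_rows() {
  case "$1" in
    0|''|*[!0-9]*) echo "drill FAILED: row count '$1'" >&2; return 1 ;;
    *) return 0 ;;
  esac
}

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  docker run -d --rm --name drill-pg -e POSTGRES_PASSWORD=drill postgres:16
  sleep 5   # crude wait for Postgres to accept connections
  restic dump latest scalr-widgets-db.dump \
    | docker exec -i drill-pg pg_restore -U postgres -d postgres --no-owner
  rows=$(docker exec drill-pg psql -U postgres -Atc 'SELECT count(*) FROM users;')
  assert_rows "$rows"
  docker rm -f drill-pg
fi
```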