Content-Addressed Dedup

Dropbear stores file contents in the bucket keyed by their SHA-256 hash, not by their path. The path-to-content mapping lives in the manifest; the blob itself is just bytes at roots/<rootID>/blobs/<sha256>. That single design choice buys deduplication for free and gives every download a built-in integrity check.

How it works

When sync decides a local file needs uploading, it hashes the file, checks whether roots/<rootID>/blobs/<sha256> already exists in the bucket, and skips the upload if it does. Two devices syncing the same root that both hold the same 4 GB ISO upload it exactly once. Rename a 100 MB video from vacation/clip.mov to archive/2025/clip.mov and the next sync rewrites the manifest entry but does not re-upload the bytes; only the manifest (small JSON) and the head (tiny pointer) change.
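The upload decision can be sketched in a few lines of Python. This is an illustration under assumptions, not Dropbear's code: the bucket is modelled as a plain dict, the manifest as a path-to-hash dict, and upload_if_missing and blob_key are hypothetical names. A real sync would stream the file through the hash and ask the bucket whether the key exists.

```python
import hashlib


def blob_key(root_id: str, data: bytes) -> str:
    """Key a blob by the SHA-256 of its contents, under the root's namespace."""
    return f"roots/{root_id}/blobs/{hashlib.sha256(data).hexdigest()}"


def upload_if_missing(bucket: dict, root_id: str, data: bytes) -> bool:
    """Store data unless its blob already exists. Returns True if bytes were uploaded."""
    key = blob_key(root_id, data)
    if key in bucket:  # stands in for an existence check against the real bucket
        return False
    bucket[key] = data
    return True


bucket = {}    # hypothetical in-memory stand-in for the object store
manifest = {}  # hypothetical manifest: path -> content hash
clip = b"pretend this is 100 MB of video"

# First sync: the blob is new, so the bytes go up.
uploaded_first = upload_if_missing(bucket, "r1", clip)
manifest["vacation/clip.mov"] = hashlib.sha256(clip).hexdigest()

# Rename: only the manifest entry moves; the next sync finds the blob already there.
manifest["archive/2025/clip.mov"] = manifest.pop("vacation/clip.mov")
uploaded_again = upload_if_missing(bucket, "r1", clip)
```

After the rename the bucket still holds a single blob, and the second call reports that nothing was uploaded.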

Same story across files: ten copies of the same PDF scattered across a root share one blob. Dedup stops at the root boundary, though. Each root lives under its own roots/<rootID>/blobs/ namespace, so identical contents in two roots that share a bucket are currently stored once per root. Within a root, dedup is total.

What it gets you

  • Renames and moves are free. They're a manifest edit, nothing more.
  • Restore is cheap when devices overlap. Bootstrapping a second laptop from the same bucket never pulls the same blob twice in one restore, and if you've staged files locally it skips any blob whose hash already matches content on disk.
  • Integrity on download is automatic. The key is the hash. After fetching blobs/<sha256>, Dropbear re-hashes the bytes and refuses to write the file out if they don't match. Bit-rot in the bucket, a truncated transfer, or a malicious mid-flight swap all fail loudly.
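  • Backups are obvious. Copying roots/<rootID>/blobs/ to a second bucket gives you a complete, verifiable archive of the content: every blob can be checked against its own key. Manifests reference blobs by hash, so a copied manifest resolves against the copied blobs with no rewriting.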
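The integrity check on download amounts to re-hashing the fetched bytes and comparing against the key. A minimal sketch, assuming the bytes are already in memory; verify_blob and BlobIntegrityError are hypothetical names, not part of Dropbear:

```python
import hashlib


class BlobIntegrityError(Exception):
    """Raised when downloaded bytes do not hash to the key they were fetched under."""


def verify_blob(expected_sha256: str, data: bytes) -> bytes:
    """Return data unchanged if it hashes to expected_sha256, otherwise raise."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise BlobIntegrityError(f"expected {expected_sha256}, got {actual}")
    return data


payload = b"quarterly-report.pdf contents"
key = hashlib.sha256(payload).hexdigest()

ok = verify_blob(key, payload)  # intact download passes through

try:
    verify_blob(key, payload[:-1])  # simulate a truncated transfer
    truncated_rejected = False
except BlobIntegrityError:
    truncated_rejected = True
```

The same comparison works for auditing a backup copy of the blobs directory, since each object's name states what its contents must hash to.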

What it costs

Content addressing doesn't garbage-collect itself. When a file is deleted or modified, the old blob stays in the bucket, possibly forever, until a GC pass walks every reachable manifest, builds the live-blob set, and deletes the rest with a safety delay. Dropbear doesn't ship GC yet (see the garbage-collection idea in the wiki). For now, the bucket is a superset of what's reachable, and the bill follows.
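Since GC isn't implemented, the following is only a sketch of the mark-and-sweep shape described above, not a planned design. It assumes manifests are simple path-to-hash dicts and that each blob has a last-modified timestamp in seconds; find_garbage and safety_delay are hypothetical names. The delay is presumably there so that a blob uploaded moments ago, whose manifest hasn't been published yet, isn't deleted out from under a sync in progress.

```python
def find_garbage(manifests, blobs, now, safety_delay):
    """Return hashes of blobs that no manifest references and that are old enough to delete.

    manifests: iterable of dicts mapping path -> sha256 (simplified manifest shape)
    blobs: dict mapping sha256 -> last-modified time in seconds
    """
    # Mark: collect every hash referenced by any reachable manifest.
    live = set()
    for manifest in manifests:
        live.update(manifest.values())

    # Sweep: unreferenced blobs qualify only once they are older than the safety delay.
    return sorted(
        sha for sha, modified in blobs.items()
        if sha not in live and now - modified >= safety_delay
    )


manifests = [{"a.txt": "h1", "b.txt": "h2"}, {"a.txt": "h1"}]
blobs = {"h1": 0, "h2": 0, "h3": 0, "h4": 950}

garbage = find_garbage(manifests, blobs, now=1000, safety_delay=100)
# h3 is unreferenced and old; h4 is unreferenced but too recent to touch.
```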

Whole-file hashing also means a one-byte edit to a 10 GB file uploads 10 GB of new blob. Chunked hashing is on the roadmap but not implemented — see the chunked-large-files idea. If your workload is "tiny edits to huge files," Dropbear in its current shape is the wrong tool.
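To make the cost concrete, the sketch below flips one byte in a 1 MB buffer and compares whole-file hashing against a hypothetical fixed-size chunking scheme. The 64 KiB chunk size and the chunk_hashes helper are assumptions for illustration only; nothing here says what form Dropbear's chunking would take.

```python
import hashlib

original = bytes(1_000_000)   # 1 MB of zeros standing in for a large file
edited = bytearray(original)
edited[500_000] = 1           # the one-byte edit
edited = bytes(edited)

# Whole-file hashing: the key changes, so the entire file is a new blob.
key_before = hashlib.sha256(original).hexdigest()
key_after = hashlib.sha256(edited).hexdigest()
whole_file_upload = len(edited) if key_before != key_after else 0


def chunk_hashes(data: bytes, chunk_size: int) -> list:
    """Hash each fixed-size chunk independently (hypothetical scheme)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


CHUNK = 64 * 1024
before = chunk_hashes(original, CHUNK)
after = chunk_hashes(edited, CHUNK)
changed_chunks = sum(a != b for a, b in zip(before, after))
```

With whole-file hashing all 1,000,000 bytes are re-uploaded; under the assumed chunking only one of the 16 chunks would change.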

Worth knowing

  • The hash is the content hash, not a hash of the filename or any metadata. A file's mode, mtime, owner, and path are all manifest-side concerns.
  • Symlinks aren't blobs. Their target string lives in the manifest entry directly; there's nothing to deduplicate.
  • Empty files have a well-defined SHA-256 (e3b0c44...), so every empty file in a root shares a single blob, and any non-trivial root will have it. This is fine and expected.
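The first and last points are easy to check locally. The sketch below writes the same bytes to two files with different names, modes, and mtimes, then hashes an empty file; file_sha256 is a hypothetical helper that streams the file rather than loading it whole.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "a.txt")
    b = os.path.join(d, "renamed.bin")
    for path in (a, b):
        with open(path, "wb") as f:
            f.write(b"same bytes")

    # Change metadata only: mode and mtime differ, contents do not.
    os.chmod(b, 0o600)
    os.utime(b, (0, 0))
    same_key = file_sha256(a) == file_sha256(b)

    empty = os.path.join(d, "empty")
    open(empty, "wb").close()
    empty_hash = file_sha256(empty)
```

Both files produce the same key despite the differing metadata, and the empty file's hash begins with e3b0c44.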