Brad Fitzpatrick (brad) wrote,

incompatible brackup change?

Brackup, when it backs up to a target, stores the files/chunks on the remote side keyed on each chunk's digest. The great thing about this is that incremental backups have cost only proportional to the amount of data changed.
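A minimal sketch of that idea, in Python rather than Brackup's own Perl. The `Target` class, the `sha1:` key prefix, and the `backup` helper are made up for illustration and are not Brackup's actual API:

```python
import hashlib


class Target:
    """Hypothetical in-memory stand-in for a remote store keyed by digest."""

    def __init__(self):
        self.blobs = {}

    def has_chunk(self, key):
        return key in self.blobs

    def store_chunk(self, key, data):
        self.blobs[key] = data


def chunk_key(data):
    # The storage key is just the digest of the chunk's bytes.
    return "sha1:" + hashlib.sha1(data).hexdigest()


def backup(target, chunks):
    """Upload only chunks the target lacks; return how many were sent."""
    sent = 0
    for data in chunks:
        key = chunk_key(data)
        if not target.has_chunk(key):
            target.store_chunk(key, data)
            sent += 1
    return sent
```

Running `backup` a second time with one chunk changed uploads only that one chunk, which is where the "cost proportional to the data changed" property comes from.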

Where it gets tricky is when you encrypt those chunks first. In that case, the encrypted output changes each time you encrypt, even when the input is identical. Thus the encrypted chunks don't have a consistent digest, and thus don't have a consistent "key" or "filename" on the storage side (Amazon, remote filesystem, etc.).
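To see the effect without involving gpg, here is a toy randomized cipher in Python. It is an assumption-laden stand-in (a fresh random nonce plus a SHA-256 keystream) and is not secure or related to how gpg works internally; it only shows that fresh randomness per encryption means a fresh digest per encryption:

```python
import hashlib
import os


def _keystream(passphrase, nonce, length):
    # Toy keystream: SHA-256 over passphrase + nonce + counter. NOT secure.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(passphrase + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(passphrase, plaintext):
    nonce = os.urandom(16)  # new randomness on every call
    stream = _keystream(passphrase, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(passphrase, ciphertext):
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(passphrase, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def digest(data):
    return hashlib.sha1(data).hexdigest()
```

Encrypting the same chunk twice gives two ciphertexts that both decrypt to the original bytes but have different digests, so a digest-of-ciphertext key is different on every run.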

To make incremental backups cheap again, Brackup maintains a "digest database" (see notes) that caches this information. That is, it caches what the digest of that encrypted chunk was the first time. And it's stored in the backup metafile, so you can get it back.

But I've always kinda hated it, and here's why...

If you lose the digest database, incremental backups are now a full backup. That means my few hundred GB of data in Amazon (say, $10/month), is now $20/month. Or I wipe Amazon to get costs down, but lose all old snapshots. Lame.

Worse, when I was just adding a --dry-run mode, I found a major race:

-- before you store_chunk, you ask the target if it has_chunk
-- has_chunk asks the Chunk for its digest.
-- Chunk says, "I don't know, not in my cache... let me encrypt file and tell you!"
-- Chunk then caches that digest, but never persists the encrypted data (it just sits in memory, waiting to be used)
-- dry-run mode doesn't need actual data (it skips store_chunk), so it throws it away too.
-- program exits, and nobody ever touched the encrypted data, but its digest is in cache now
-- ....
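The steps above can be modeled roughly like this. It is a Python simulation with hypothetical names (`Chunk`, `fake_encrypt`, `dry_run`), not Brackup's code, and it only covers the dry-run path described in the list:

```python
import hashlib
import os


def fake_encrypt(data):
    # Stand-in for gpg: a random prefix makes every output different.
    return os.urandom(16) + data


class Chunk:
    def __init__(self, raw, digest_cache):
        self.raw = raw
        self.digest_cache = digest_cache  # persists across runs
        self.encrypted = None             # lives only in memory

    def digest(self):
        cache_key = hashlib.sha1(self.raw).hexdigest()
        if cache_key not in self.digest_cache:
            # "Not in my cache... let me encrypt the file and tell you!"
            self.encrypted = fake_encrypt(self.raw)
            self.digest_cache[cache_key] = hashlib.sha1(self.encrypted).hexdigest()
        return self.digest_cache[cache_key]


def dry_run(raw_chunks, digest_cache, target):
    """Ask has_chunk for each chunk, but never call store_chunk."""
    for raw in raw_chunks:
        chunk = Chunk(raw, digest_cache)
        _ = chunk.digest() in target  # the has_chunk check
        # store_chunk is skipped; chunk.encrypted is dropped with the chunk
```

After `dry_run` returns, the cache holds a digest for the chunk, the target is still empty, and the only ciphertext that ever had that digest has been discarded.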

And it kinda just gets worse from there.

That was the final nail in its coffin. I want to change it now.

I have two options, one of which I know sucks, but I'll list it for completeness:

1) consistent encryption. trick out gpg by changing its ~/.gnupg/random_seed. seems like a terrible idea and people would hate it forever. or I'd be rationalizing it forever.

2) for encrypted files, don't make the key be the digest of the encrypted contents, but the digest of something else unique+secure.

The "unique" constraint, ignoring security, is easy: it could be, say:

DIGEST(unencrypted-contents) + ".encrypted"

But now you're exposing to others/authorities/etc. that you have the file with those "unencrypted-contents" backed up: anyone who has the same file can compute the same key and look for it. You want to mix in something to the key that others don't have.
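A small Python sketch of that leak, assuming SHA-1 as the DIGEST and a plain set of key names as the listing of the storage side (both assumptions; the function names are hypothetical):

```python
import hashlib


def naive_key(plaintext):
    # DIGEST(unencrypted-contents) + ".encrypted"
    return hashlib.sha1(plaintext).hexdigest() + ".encrypted"


def observer_can_confirm(known_file, stored_keys):
    """Anyone holding the same file can recompute the key and look for it."""
    return naive_key(known_file) in stored_keys
```

The observer never needs to decrypt anything. Seeing the key names and having a copy of the file is enough to confirm that it was backed up.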

What about your GPG secret key? The key could be:

HMAC_DIGEST(DIGEST(gpg-secretkey), unencrypted-contents) + ".encrypted"
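In Python that formula might look like the following. SHA-1 for both DIGEST and the HMAC hash is an assumption, and so is treating the GPG secret key as a byte string already in hand; the post doesn't specify either:

```python
import hashlib
import hmac


def storage_key(gpg_secret_key, plaintext):
    # DIGEST(gpg-secretkey): derive the MAC key from the secret key bytes.
    mac_key = hashlib.sha1(gpg_secret_key).digest()
    # HMAC_DIGEST(mac_key, unencrypted-contents)
    mac = hmac.new(mac_key, plaintext, hashlib.sha1)
    return mac.hexdigest() + ".encrypted"
```

The same secret key and the same contents always give the same name, while someone who has the file but not the secret key can't reproduce it.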

I think this solves it. Now:

1) we don't need digestdatabase anymore to make incremental backups cheap. just fast. so i can rename it back to digestcache, its original name.
2) the key doesn't change each time i encrypt something, so it solves the --dry-run race.

The remaining problem is that it breaks existing users' encrypted backups. It's not 1.00 yet. Do I just document this and tell people they have to restore old backups using Brackup-0.91?

Thoughts? Complaints?
Tags: brackup, tech