brad's life

incompatible brackup change? [Oct. 1st, 2006|07:16 pm]
Brad Fitzpatrick

Brackup, when it backs up to a target, stores the files/chunks on the remote side keyed on each chunk's digest. The great thing about this is that incremental backups have cost only proportional to the amount of data changed.
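The digest-keyed scheme can be sketched roughly like this (Brackup itself is Perl; this is an invented toy model, not its actual API):

```python
import hashlib

class Target:
    """Toy stand-in for a Brackup target (Amazon S3, remote filesystem, ...)."""
    def __init__(self):
        self.store = {}

    def has_chunk(self, key):
        return key in self.store

    def store_chunk(self, key, data):
        self.store[key] = data

def backup(target, chunks):
    """Upload only chunks the target doesn't already have; return upload count."""
    uploaded = 0
    for data in chunks:
        # The key is the digest of the chunk's own bytes, so unchanged
        # chunks always map to the same key and are never re-uploaded.
        key = hashlib.sha1(data).hexdigest() + ".chunk"
        if not target.has_chunk(key):
            target.store_chunk(key, data)
            uploaded += 1
    return uploaded

target = Target()
first = backup(target, [b"chunk-a", b"chunk-b"])               # full backup
second = backup(target, [b"chunk-a", b"chunk-b", b"chunk-c"])  # incremental
```

The second run uploads only the one new chunk even though the backup spans three; that's the "cost proportional to data changed" property.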

Where it gets tricky is when you encrypt those chunks first. In that case, the encrypted files change each time you encrypt them. Thus they don't have a consistent digest, and thus don't have a consistent "key" or "filename" on the storage side (Amazon, remote filesystem, etc.).

To make incremental backups cheap again, Brackup maintains a "digest database" (see notes) that caches this information. That is, it caches what the digest of that encrypted chunk was the first time. And it's stored in the backup metafile, so you can get it back.

But I've always kinda hated it, and here's why...

If you lose the digest database, your next incremental backup is now a full backup. That means my few hundred GB of data in Amazon (say, $10/month) is now $20/month. Or I wipe Amazon to get costs down, but lose all old snapshots. Lame.

Worse, when I was just adding a --dry-run mode, I found a major race:

-- before you store_chunk, you ask the target if it has_chunk
-- has_chunk asks the Chunk for its digest.
-- Chunk says, "I don't know, not in my cache... let me encrypt file and tell you!"
-- Chunk then caches that digest, but the encrypted data is never written anywhere (it just sits in memory, waiting to be used)
-- dry-run mode doesn't need actual data (it skips store_chunk), so it throws it away too.
-- program exits, and nobody ever touched the encrypted data, but its digest is in cache now
-- ....
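The race can be modeled with a toy sketch (names invented, not Brackup code; gpg's run-to-run nondeterminism is simulated with a nonce prefix):

```python
import hashlib
import itertools

_nonce = itertools.count()

def encrypt(plaintext: bytes) -> bytes:
    # gpg output differs on every run; model that with a fresh nonce prefix.
    return bytes([next(_nonce)]) + plaintext

digest_cache = {}

def chunk_digest(plaintext: bytes) -> str:
    # has_chunk asks the Chunk for its digest: encrypt on demand, cache
    # the digest, and throw the ciphertext away (as in the race above).
    if plaintext not in digest_cache:
        digest_cache[plaintext] = hashlib.sha1(encrypt(plaintext)).hexdigest()
    return digest_cache[plaintext]

# Dry run: digest computed and cached, ciphertext never stored anywhere.
dry_run_digest = chunk_digest(b"data")

# A later real encryption produces new ciphertext with a new digest, which
# no longer matches what the cache claims is already on the target.
real_digest = hashlib.sha1(encrypt(b"data")).hexdigest()
```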

And it kinda just gets worse from there.

That was the final nail in its coffin. I want to change it now.

I have two options, one of which I know sucks, but I'll list it for completeness:

1) Consistent encryption: trick out gpg by changing its ~/.gnupg/random_seed. Seems like a terrible idea, and people would hate it forever. Or I'd be rationalizing it forever.

2) For encrypted files, don't make the key the digest of the encrypted contents, but the digest of something else that's unique+secure.

The "unique" constraint, ignoring security, is easy: the key could be, say:

DIGEST(unencrypted-contents) + ".encrypted"

But now you're exposing to others/authorities/etc. that you have the file with those "unencrypted-contents" backed up. You want to mix something into the key that others don't have.

What about your GPG secret key? The key could be:

HMAC_DIGEST(DIGEST(gpg-secretkey), unencrypted-contents) + ".encrypted"
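That keying can be sketched like so (a hypothetical model: Python's hmac/hashlib stand in for whatever digest Brackup actually uses, and a placeholder byte string stands in for the gpg secret key material):

```python
import hashlib
import hmac

def chunk_key(secret_key_material: bytes, plaintext: bytes) -> str:
    # Key the MAC with a digest of the secret key material, then MAC the
    # unencrypted contents; the result is stable across runs even though
    # gpg produces different ciphertext every time.
    mac_key = hashlib.sha1(secret_key_material).digest()
    return hmac.new(mac_key, plaintext, hashlib.sha1).hexdigest() + ".encrypted"

k1 = chunk_key(b"fake-gpg-secret-key", b"file contents")
k2 = chunk_key(b"fake-gpg-secret-key", b"file contents")
```

Because the key is deterministic, has_chunk checks work without caching per-run ciphertext digests; and without the secret material, an observer can't confirm which plaintext a key corresponds to.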

I think this solves it. Now:

1) We don't need the digest database anymore to make incremental backups cheap, just fast. So I can rename it back to "digestcache", its original name.
2) The key doesn't change each time I encrypt something, so it solves the --dry-run race.

The remaining problem is that it breaks existing users' encrypted backups. It's not 1.00 yet. Do I just document this and tell people they have to restore old backups using Brackup-0.91?

Thoughts? Complaints?

From: xlerb
2006-10-02 04:57 am (UTC)
I suppose you've seen the famous quote from the author of make, about how it wasn't changed to accept spaces instead of a tab a zillion years ago because at the time it had 12! whole! users! or something like that.

(I think there's another semi-well-known one along those lines, but I can't remember it.)

And then there's the one about Larry Wall and perl regexp: how, when he changed parentheses and braces and such to being live metacharacters back in prehistoric times, it broke a few people's scripts, and they lamented, but just think of how many backslash keys have been saved from early deaths as a result.

IOW, despite being a non-user of Brackup, I feel that history supports the “change it now, while the pain is still relatively slight” approach.
From: robbat2
2006-10-02 06:23 am (UTC)
Couple of comments here.
Why not make the first or last block of the backup a superblock that contains the digests etc. for the rest of the data? That way you can rebuild your digest database if it does get lost. Might be worth doing just to make the digestcache rebuild faster.

Consistent encryption like you suggest still isn't doable, because an attacker with knowledge of your random_seed contents (you would HAVE to be storing them somewhere, and encrypting them would break the brackup policy of not needing human input for backups) could decrypt your backups.
In short, the goals of public-key encryption are incompatible with consistent encryption. Even GPG's public-key encryption of large chunks of data really isn't: it's public-key around a session key, and then symmetric encryption using that session key.

One problem here:
HMAC_DIGEST(DIGEST(gpg-secretkey), unencrypted-contents) + ".encrypted"
This would require decoding the secret key, so I don't see how that fits in with unattended public-key-encrypted backups.

In spite of the recent advanced hash attacks, why not DIGEST(unencrypted-contents + "some public per-user datachunk")?
They'd have to be really good to recover DIGEST(unencrypted-contents) from that.
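The salted-digest idea sketched as a toy (hypothetical names; the public per-user datum could be, e.g., a gpg key ID):

```python
import hashlib

def salted_chunk_key(plaintext: bytes, public_user_datum: bytes) -> str:
    # Mix a public per-user value into the digest, so two users' keys for
    # the same file differ and the raw DIGEST(contents) isn't exposed.
    return hashlib.sha1(plaintext + public_user_datum).hexdigest() + ".encrypted"

brad_key = salted_chunk_key(b"file contents", b"KEYID-AAAA")
other_key = salted_chunk_key(b"file contents", b"KEYID-BBBB")
```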
From: avatraxiom
2006-10-02 06:50 am (UTC)
Per-user datachunk could be their Key ID. That's pretty public, and definitely available to Brackup.

From: brad
2006-10-02 07:08 am (UTC)
Ah, true!
From: brad
2006-10-02 07:18 am (UTC)
Re: superblocks, etc.: that goes against the spirit of keeping the data structures as understandable as possible. I want people to understand it enough to trust it. Simple is more verifiable, too.

Consistent encryption, yeah, just had to mention it.

"This would require decoding the secret key"

Oh yeah. On my machine I was playing with, I just happened to have my secret key also installed.

I like your last idea, along with Max's idea of public fingerprint.

From: legolas
2006-10-02 07:05 pm (UTC)
[note: I'm not a brackup user]
How is:

DIGEST((unencrypted-contents) + publicly known value) + ".encrypted"

any different from:

DIGEST(unencrypted-contents) + ".encrypted"


With this, you are still 'exposing to others/authorities/etc, that you have the file with those "unencrypted-contents" backed up', as you said in the top post. Not sure what level of security you are trying to achieve, but the public key ID acts as mere salting here. It does help against "quickly find everyone that has a copy of illegal.mp3", but not against "find out if brad has a copy of illegal.mp3".

As you said: "You want to mix in something to the key that others *don't have*." (emphasis mine, obviously ;-)

At least, if I understand the discussion correctly. I'm not familiar with gpg at all.
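This objection can be made concrete with a sketch (hypothetical names and a toy salted-key function, not Brackup code):

```python
import hashlib

def salted_chunk_key(plaintext: bytes, public_salt: bytes) -> str:
    return hashlib.sha1(plaintext + public_salt).hexdigest() + ".encrypted"

# What the storage server would hold for this user:
stored_keys = {salted_chunk_key(b"illegal.mp3 contents", b"brads-public-keyid")}

# An authority holding the same file and the (public) key ID can still
# recompute the key and run a targeted check against the stored keys:
probe = salted_chunk_key(b"illegal.mp3 contents", b"brads-public-keyid")
found = probe in stored_keys
```

The public salt defeats bulk everyone-with-this-file lookups, but not a targeted does-this-user-have-this-file test.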
From: brad
2006-10-02 07:14 pm (UTC)
Yeah, I'm still not that happy about it, which is why I haven't implemented it.

It does raise the bar: an attacker has to test for a specific file, rather than presenting a list of index keys to authorities which they could quickly map to known files.

So now I'm going back to thinking about parallel, encrypted ".meta" files on each key, which describe that encrypted chunk's original contents, with the .meta encrypted itself. So if you lose the digest db, you can re-download all the meta files and rebuild it ... but then you need your private key around, so it can't be automated!

Tradeoffs, tradeoffs .....

Just haven't decided which route to go.
From: legolas
2006-10-02 08:50 pm (UTC)
I think you were/are aiming for something impossible:

- you want to identify a file (through a digest) with only public information
- you don't want anybody else to be able to do the same thing

It seems storing the database on the server is the only way out of that, although you already thought of that on 20/3/2006 and rejected it due to 'race conditions if multiple backups are running' and the pain of implementing it.

About not being automated: true, but if you lost that file on your computer, certainly typing your gpg passphrase is a small price to pay? Or as you said before: 'restores may prompt for user input ("What's your Amazon S3 password?" and "Enter your GPG passphrase."), because they won't be automated or common'.
From: brad
2006-10-02 10:52 pm (UTC)
Yeah, like I said: tradeoffs.

I'll have to sacrifice some ideals in some situations, but I have to decide which I care about and which situations are most likely and should be optimized for.
From: avatraxiom
2006-10-02 06:45 am (UTC)
Couldn't you just leave in a backwards-compatibility mode for restores? Perhaps just leave the old digest mode as a plugin, or something, that can be activated with a command-line option.

From: xb95
2006-10-02 07:57 am (UTC)
Someone using brackup probably has some version of $clue. I'd say go for it.

Of course, the backwards-compatible restore is an option too, and might be nice if anybody's running a combined nightly auto-update of their CPAN + an automated brackup restore, which would break under said new scheme. Don't think the odds of that are very high yet; not sure brackup has been adopted very widely.
From: mendel
2006-10-06 01:41 am (UTC)
Stupid brackup question: Are backups always incremental? I don't see anything that ever removes a chunk.

I think checkpointing to a certain date is just a matter of reading through the metafiles from that date to the present and then removing any chunks not listed in any of them, but I'm still getting my head around everything.

If all goes well I should have at least a basic Brackup::Target::SFTP soon.
From: brad
2006-10-06 01:47 am (UTC)
Yes, always incremental. I haven't done the code to remove chunks that don't exist in the set of backups (snapshots) you care about ... but it's basically like you describe.

Let me know if you want svn commit access, or if you want to do separate cpan releases of your target, that works too.
From: mendel
2006-10-06 03:15 am (UTC)
One more "I'm not on crack" confirmation: When restoring, does brackup-restore expect you to have already retrieved the metafile from the backup yourself?

Also, have you done a big encrypted restore? The volume of passphrase prompts I'm getting is... impractical. Enough so that I'm wondering if that's not what's meant to happen.

(And svn access is good -- separate CPAN releases is just inconvenience for all involved. I'm still a cvs-head, though, so tell me what credentials you need from me to set me up. For that matter I can just fire you off the module if you want once it's shiny.)
From: brad
2006-10-06 04:09 am (UTC)
Good questions!

When restoring, does brackup-restore expect you to have already retrieved the metafile from the backup yourself?

Currently, but only because I forgot we stored it to the target too. So I imagine a future mode to list the meta files on the server and retrieve them would be good.

The volume of passphrase prompts I'm getting is... impractical.

I figured there was a gpg-agent thing like ssh-agent. I don't know gpg, though, so I punted on that. So "no", I haven't. That's a TODO item. (Maybe you want to fix?)

As for svn, email brad@danga.com w/ the output of "htdigest" for realm "Danga" and the svn username you want. And that's it. Then use "svn" instead of "cvs" and it pretty much just works the same, except it doesn't suck. ;-)

BTW, if you haven't found it, the trunk is:

I'd love to hack on this with you, so lay on the questions.