brad's life - Project Idea: Distributed, Encrypted Backup
Brad Fitzpatrick

Project Idea: Distributed, Encrypted Backup [Apr. 22nd, 2005|09:08 pm]

I've been thinking about writing a backup system lately.

Think MogileFS + git + GnuPG. Definitely GnuPG, likely parts of git, and perhaps only MogileFS in concept. We'll see.

Notes:

-- client/server. client would be pretty dumb (basically: "give me stuff to store!") and would most definitely have to run on windows (as well as unix). (either Perl w/ ActiveState on Windows, or C# w/ Mono on Linux) it'd also be able to report what it has, verify the integrity of what it has (server gives the SHA1 the file must keep over time), and manage the disk quota.
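
A minimal sketch of the integrity check mentioned above, assuming the server hands the client a mapping of blob name to expected SHA1. The function names and the shape of that mapping are illustrative assumptions, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 of a file, reading it in blocks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_store(store_dir: Path, expected: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored blobs against the SHA-1 values the server supplied.

    `expected` maps blob file name -> hex SHA-1 (a hypothetical format).
    """
    report = {"ok": [], "corrupt": [], "missing": []}
    for name, want in sorted(expected.items()):
        blob = store_dir / name
        if not blob.is_file():
            report["missing"].append(name)
        elif sha1_of_file(blob) == want:
            report["ok"].append(name)
        else:
            report["corrupt"].append(name)
    return report
```

The client never needs a key for this: it only reports which opaque blobs still match, and the server decides what to re-send.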

-- client machines (friend/family's computers) would throw the backup client into their "Scheduled Tasks" on Windows, or cron on Unix, and it'd connect out to the backup server, getting incremental updates, throttling its bandwidth. it'd also be able to delete backup files that are older than the backup policy says to keep. (if the client has too many old revisions of a file)
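
One possible reading of the "too many old revisions" rule, sketched under the assumption that whoever holds the index (presumably the server) knows which blobs are revisions of the same file. `Revision`, `file_id`, and `keep` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Revision(NamedTuple):
    file_id: str    # opaque identifier for one logical file (hypothetical)
    blob_name: str  # name of the stored blob
    stored_at: int  # Unix timestamp of the backup run


def blobs_to_delete(revisions: list[Revision], keep: int) -> list[str]:
    """Return blob names beyond the newest `keep` revisions of each file."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    by_file = defaultdict(list)
    for rev in revisions:
        by_file[rev.file_id].append(rev)
    doomed = []
    for revs in by_file.values():
        # Newest first, so everything past index `keep` is expired.
        revs.sort(key=lambda r: r.stored_at, reverse=True)
        doomed.extend(r.blob_name for r in revs[keep:])
    return sorted(doomed)
```

The sketch ignores blobs shared between files; a real policy would have to check that a blob is unreferenced before deleting it.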

-- server would be perl for sure. server would have to keep track of new/updated files/trees (the "git" part), and encrypt them, and keep track of what copies all the clients have (the "MogileFS" part).

-- clients should be assumed to be controlled by hostile parties, or at least incredibly prone to being owned. (again: think windows boxes admined by your family). as such:

* files are encrypted
* shouldn't even get to see filenames (files are stored named by their hashes)
* clients know nothing about encryption. they get access to no keys.
* deniability: it shouldn't be possible to say a backup contains a certain file. the filenames aren't the hash of the cleartext content, but of the encrypted file itself. also, contents aren't encrypted w/ a public (at least known) key.
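
The naming rule in the list above can be sketched as follows. The sketch assumes the encryption step (for example GnuPG on the server) has already happened and only shows that the name comes from the encrypted bytes; `blob_name` and `store_blob` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def blob_name(ciphertext: bytes) -> str:
    """Name a blob by the SHA-1 of its encrypted bytes, never the cleartext."""
    return hashlib.sha1(ciphertext).hexdigest()


def store_blob(store_dir: Path, ciphertext: bytes) -> Path:
    """Write already-encrypted bytes under their hash name and return the path."""
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / blob_name(ciphertext)
    if not target.exists():
        target.write_bytes(ciphertext)
    return target
```

Because the name is derived from the ciphertext, someone holding a copy of a known cleartext file cannot hash it and look for a matching blob name.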

-- server might have to split up huge files into manageable chunks. for instance I have vmware images I'd want backed up. because they're block devices, each incremental backup wouldn't spray 5GB around the network, just the dirty chunks.
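
A rough sketch of dirty-chunk detection, assuming fixed-size chunks and a stored list of per-chunk hashes from the previous run. The 4 MiB chunk size and the function names are arbitrary choices for illustration.

```python
import hashlib
from typing import BinaryIO

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, an arbitrary size for illustration


def chunk_hashes(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Hash a (buffered) binary stream chunk by chunk without loading it whole."""
    hashes = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes


def dirty_chunks(old: list[str], new: list[str]) -> list[int]:
    """Indexes of chunks in `new` that changed or did not exist in `old`."""
    return [
        index
        for index, digest in enumerate(new)
        if index >= len(old) or old[index] != digest
    ]
```

Fixed offsets suit disk images, where writes happen in place; data inserted in the middle of a file would shift every later chunk and mark them all dirty.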

-- config file should be able to let me define multiple repositories w/ different properties: notably retention policies, but also what directories are included/excluded in that repository.
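
To make the idea concrete, here is one way the repository definitions might look once loaded, written as Python data rather than any particular file format. Every name, path, and retention number below is a made-up example.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Repository:
    name: str
    keep_revisions: int                 # retention policy: revisions kept per file
    include: list[str]                  # directory trees that belong to this repository
    exclude: list[str] = field(default_factory=list)

    def covers(self, path: str) -> bool:
        """True if `path` is under an included tree and not under an excluded one."""
        target = PurePosixPath(path)

        def under(root: str) -> bool:
            base = PurePosixPath(root)
            return target == base or base in target.parents

        return any(under(r) for r in self.include) and not any(
            under(r) for r in self.exclude
        )


# Hypothetical configuration: two repositories with different retention.
REPOSITORIES = [
    Repository("home", keep_revisions=10, include=["/home/brad"],
               exclude=["/home/brad/tmp", "/home/brad/vmware"]),
    Repository("vmware", keep_revisions=2, include=["/home/brad/vmware"]),
]
```

Matching on path components rather than string prefixes keeps `/home/bradley` from being swept into a repository that includes `/home/brad`.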

-- in case of restore, client connects (at regularly scheduled time, or when you call up your family to run it by hand) and server asks it to send certain files.

I'd imagine friends partnering with each other to automatically store their opaque backup blobs. If I'd written this months ago, evan wouldn't have lost so much stuff.

First off, does this exist?

Comments:
From: bloonail
2005-04-23 04:47 am (UTC)

This is a dangerous thing to create, but it's the only way to store information.
From: granting
2005-04-23 04:49 am (UTC)

Would the server enforce a quota on size as well as bandwidth? Is that client adjustable?
From: brad
2005-04-23 05:51 am (UTC)

Owners of client machines would be able to choose how much disk quota is allocated to each user. I figure friends would give each other equal disk quota on each other's machines.
From: granting
2005-04-23 06:07 am (UTC)

So only one client/server relationship?
From: brad
2005-04-23 07:49 am (UTC)

Of course not.
From: bitwise
2005-04-23 05:18 am (UTC)

Aren't the storage requirements pretty intimidating? If every user wants to backup 10GB, they also need to donate more than 10GB to the distributed pool, right? 20GB if you want every backup to live on two separate machines. Actually more unless you can prevent new members from uploading their backups before their donated storage has been used by others.

It would be really cool if there was some way to detect common file hashes (i.e. common binaries) so the network wouldn't need to store additional copies, but as far as I can tell, that's incompatible with a fully encrypted system.
From: brad
2005-04-23 05:52 am (UTC)

It would be really cool if there was some way to detect common file hashes (i.e. common binaries) so the network wouldn't need to store additional copies, but as far as I can tell, that's incompatible with a fully encrypted system.

I figure this is for backing up /home/, not /usr/, so dups between separate users won't really happen. And if they did, you certainly wouldn't want to know about it.

Now, if you have dups in /home/, you'll only be storing them once on your friends' machines.
From: scsi
2005-04-23 05:47 am (UTC)

Question/Comment

I assume this is only for backing up select files, not like entire machines (sorta like backuppc)..

Since the client has no keys, it means the public/private pair are on the server... Wouldn't that defeat the purpose of encryption if someone gets into the server and just decrypts the hashes? Not unless the server accepts the files, encrypts them, and splatters them back out (distributed) to all of the clients participating..

If this is distributed, I would feel much safer if each client held the private key, and encrypted everything on the fly, calculated the hashes and sent it to the server for distribution.... If the client doesn't hold any of the keys, you are trusting everyone's files to a central location, where (worst case) if the box gets nabbed by feds, it would be very easy for them to see exactly what everyone has uploaded.. Of course this blows deniability out of the water since you are the sole owner of the public key...

Or I could be completely missing the point, which is mostly the case.
From: maxvt
2005-04-23 07:01 am (UTC)

Re: Question/Comment

How about client holds own pair of keys, server gets the content already encrypted?
From: legolas
2005-04-27 11:21 pm (UTC)

Re: Question/Comment

I would feel much safer if each client held the private key

Wouldn't that defeat the whole purpose? If your client goes kaput, bye bye data if the key can't be recovered?

Then again, how does the server know if the requesting client is who he says he is (esp. after the original client machine has, say, burned (literally))?
From: jes5199
2005-04-23 05:58 am (UTC)

http://www.pbs.org/cringely/pulpit/pulpit20040909.html
´Cause Backing-up is Hard to Do
Introducing Baxter, a Peer-to-Peer Backup Network
From: brad
2005-04-23 07:50 am (UTC)

Has anybody built it?
From: jes5199
2005-04-23 05:13 pm (UTC)

Apparently not. Cringely got distracted and forgot to ever talk about it again:
Following last week's column about Baxter, my idea for a distributed kinda sorta peer-to-peer Internet data back-up scheme, I expected this week to write about all the problems readers found with the idea, and all the existing Baxter-like services none of us had heard about. Well, things change, and I'll be doing that column next week


the slashdot kids (ignoring the ones who went on and on about freenet) suggested:
Distributed Internet Backup System
and
The OceanStore Project

those both appear to be one-man efforts, sort of half-baked.
From: eqe
2005-04-23 06:36 am (UTC)

Something like duplicity, no?
From: brad
2005-04-23 07:51 am (UTC)

Thanks for the link. I'll look into it.
From: greg
2005-04-23 07:53 am (UTC)

Back in '99, I worked at undoo, which turned into avamar and this is sort of what we were working on. At the time we used hashing to break down the big files and look for commonality to avoid redundancy.
From: node
2005-04-23 07:55 am (UTC)

First off, does this exist?

I described a similar way to do it a couple of days ago, with links to a company that wants to sell such a system for intranets.

From: warrend
2005-04-23 08:13 am (UTC)

...client machines (friend/family's computers) would throw the backup client into their "Scheduled Tasks" on Windows, or cron on Unix...

Pycron works great for cron on Windows. The "Scheduled Tasks" functionality never works quite right for me for kicking off scripts, etc.

What I do to back up my Windows machines at work is to just rsync everything over to a *nix machine using a really really long include / deny file, and then have the backup server comb through changes and pack it into an encrypted history file for that day.

However, looking at duplicity (as posted above), it looks like I can now retire my custom scripts and just use that, as it is doing exactly what I rolled myself (but with more utilities, obviously).
From: jwz
2005-04-23 08:33 am (UTC)

You also don't want to expose the number/sizes of files; e.g., you don't want to be able to look at it and say "this is mp3s, and this is a maildir" just by size/grouping statistics. So I think really you want one big file, or a bunch of files of equal size.
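
The "bunch of files of equal size" idea could look roughly like this: blobs are packed into fixed-size containers behind a length prefix, and the leftover space is filled with random bytes. The record layout and function names are assumptions made up for this sketch, and a container would still need to be encrypted as a whole for the length fields to stay hidden.

```python
import os
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian length prefix per record


def _pad(payload: bytearray, container_size: int) -> bytes:
    """Terminate the record list if there is room, then fill with random bytes."""
    room = container_size - len(payload)
    if room >= HEADER.size:
        payload = payload + HEADER.pack(0)  # zero length marks the end of records
        room -= HEADER.size
    return bytes(payload) + os.urandom(room)


def pack_fixed(blobs: list[bytes], container_size: int) -> list[bytes]:
    """Pack blobs into containers that are all exactly `container_size` bytes."""
    containers = []
    current = bytearray()
    for blob in blobs:
        record_len = HEADER.size + len(blob)
        if not blob or record_len > container_size:
            raise ValueError("blob must be non-empty and fit in one container")
        if len(current) + record_len > container_size:
            containers.append(_pad(current, container_size))
            current = bytearray()
        current += HEADER.pack(len(blob)) + blob
    if current:
        containers.append(_pad(current, container_size))
    return containers


def unpack_fixed(container: bytes) -> list[bytes]:
    """Recover the blobs from one container, ignoring the padding."""
    blobs, offset = [], 0
    while offset + HEADER.size <= len(container):
        (length,) = HEADER.unpack_from(container, offset)
        if length == 0:
            break
        offset += HEADER.size
        blobs.append(container[offset:offset + length])
        offset += length
    return blobs
```

The sketch refuses blobs larger than a container; splitting large files across containers is left out to keep it short.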
From: mart
2005-04-23 09:06 am (UTC)

I like this idea. It sounds a bit like a miniature Freenet. I'd be a little concerned, though, about losing whatever indexes the server is retaining and being unable to recover the data. The indexes need to be backed up as well, but how do you back them up? It's a bit chicken-and-egg.

It might also be interesting to have proxies which act like clients but which hand off their storage to other clients, thus merging a bunch of different datasets together. The use I have in mind for this is (say) a family all individually backing up to each other, but also handing off to people outside the family as one big lump where it's not obvious who created what. Of course, more degrees of separation between you and your data increases the possibility that you won't be able to get it back from that source again later. A hybrid, though, where a client essentially backs up its own backups could cause your files to end up on the disks of people you've never even met.

From: brad
2005-04-23 05:15 pm (UTC)

The indexes need to be backed up as well, but how do you back them up? It's a bit chicken-and-egg.

I was thinking every backup client gets a full (encrypted) copy of the global index. Compressed, it's not that big. (and I have a fair number of files)
From: youngoat
2005-04-23 05:10 pm (UTC)

Distributed Internet Backup System

DIBS (Distributed Internet Backup System) seems pretty similar to what you describe.

Unfortunately, it doesn't seem to have a very friendly UI. Everything is command-line. It would probably need to be wrapped in a friendly GUI in order to be usable by average users...

I haven't actually used it... I just browsed the FAQ and Documentation.
From: brad
2005-04-23 05:17 pm (UTC)

Re: Distributed Internet Backup System

That looks damn close if not exactly what I want. Thanks! I'll read up on it more before I ramble about this whole idea any more.
From: visions
2005-04-25 02:08 pm (UTC)

just curious, but why would you want to use Git in this? Despite the fact that the backup is distributed, modify 1K of data in a 1M file and you get a 2M resulting storage file.

what aspect of Git were you planning on using?
From: matt_trout
2005-05-06 01:51 am (UTC)

Have you looked at Steve Traugott's ideas for ISFS?

the infrastructures mailing list (http://www.infrastructures.org/) has been discussing a reliable peer-to-peer filesystem for some time; I suggested MogileFS a couple times and was told "yes, but it needs some extra stuff" (more accurate description better gathered from list archive). Might be worth popping up on there, maybe you can get some of the features you'd need for the storage layer from the infrastructures posters. I'm certainly a potential contributor for Bad and Wrong reasons of my own.
From: gadlen
2005-05-08 08:24 am (UTC)

Boxbackup

Boxbackup is an interesting functional implementation of a similar idea. It doesn't have a distributed component but does use encryption and blobs well.

From: brad
2005-05-08 10:48 am (UTC)

Re: Boxbackup

Thanks for the link! That looks interesting.