brad's life
Brad Fitzpatrick

wsbackup -- encrypted, over-the-net, multi-versioned backup [Mar. 19th, 2006|12:50 pm]

There are lots of ways to store files on the net lately:

-- Amazon S3 is the most interesting
-- Google's rumored GDrive is surely coming soon
-- Apple has .Mac

I want to back up to them, and to more than one of them. So first off, abstract out net-wide storage: my backup tool (wsbackup) doesn't target any single one. They're all just providers.
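
Roughly, a provider is just something that can put, get, and check for a file. Here's a sketch of the shape of that layer (the class and method names are illustrative, not necessarily what wsbackup will actually ship with):

  # Sketch of the provider abstraction. Names here are illustrative only.
  package WSBackup::StorageProvider;
  use strict;
  use warnings;

  sub new {
      my ($class, %opts) = @_;
      return bless { %opts }, $class;
  }

  # Each backend (S3, GDrive, .Mac, ...) overrides these:
  sub put_file { die "subclass must implement put_file(\$key, \$dataref)" }
  sub get_file { die "subclass must implement get_file(\$key)" }
  sub has_file { die "subclass must implement has_file(\$key)" }

  1;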

Also, I don't trust sending my data in cleartext, or having it stored in cleartext, so public key encryption is a must. Then I can run automated backups from many hosts without much fear of keys being compromised, since those hosts only ever need the public key.

I don't want people being able to do size analysis, and huge files are a pain anyway, so big files are cut into chunks.
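
The chunk-and-encrypt step is conceptually just this (a hand-wavy sketch: encrypt_with_gpg() is a hypothetical stand-in for shelling out to "gpg --encrypt -r <me>", and the $provider object follows the provider sketch above):

  # Sketch: split a file into <= 20MB pieces, public-key encrypt each
  # piece, and name the result by the SHA1 of the *encrypted* bytes.
  use strict;
  use warnings;
  use Digest::SHA1 qw(sha1_hex);

  sub store_file_in_chunks {
      my ($path, $provider) = @_;          # $provider: see sketch above
      my $chunk_size = 20 * 1024 * 1024;   # arbitrary cap
      my @chunk_keys;

      open(my $fh, "<", $path) or die "open $path: $!";
      binmode($fh);
      while (read($fh, my $plain, $chunk_size)) {
          # encrypt_with_gpg() is a hypothetical helper; only the private
          # key, kept elsewhere, can decrypt what it produces.
          my $enc = encrypt_with_gpg($plain);
          my $key = sha1_hex($enc) . ".chunk";
          $provider->put_file($key, \$enc) unless $provider->has_file($key);
          push @chunk_keys, $key;
      }
      close($fh);
      return \@chunk_keys;                 # these land in the metafile
  }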

Files stored on Amazon/Google are of the form:

-- meta files: backup_rootname-yyyymmddnn.meta, an encrypted (YAML?) file mapping relative paths from the backup directory root to the stat() information, the original SHA1, and an array of chunk keys (SHA1s of the encrypted chunks) that comprise the file. (See the example sketch after this list.)

-- [sha1ofencryptedchunk].chunk -- the content being a <=, say, 20MB chunk of encrypted data.
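
To make that concrete, a decrypted metafile entry might look vaguely like this (completely made-up example; the exact field names are still up in the air):

  photos/2006/dsc_0012.jpg:
    mode:   0644
    uid:    1000
    gid:    1000
    size:   31457280
    mtime:  1142726400
    sha1:   3f786850e387550fdab836ed7e6dc881de23001b
    chunks:
      - 8843d7f92416211de9ebb963ff4ce28125932878.chunk
      - f1d2d2f924e986ac86fdf7b36c94bcdf32beec15.chunk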

Then every night, the different hosts/laptops recurse their directory trees, consult a stat() cache (keyed on, say, inode number, mtime, size, whatever), do SHA1 calculations on changed files and look up the rest from the cache, build the metafile, upload any new chunks, encrypt the metafile, and upload it.
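
In code, the scan loop is roughly this (another rough sketch, reusing store_file_in_chunks() from the chunking sketch above; the provider subclass and load_stat_cache() are hypothetical, and the cache key fields are just the ones mentioned):

  # Sketch of the nightly scan: skip SHA1/upload work for files whose
  # stat() signature (inode, mtime, size) hasn't changed since last run.
  use strict;
  use warnings;
  use File::Find;
  use Digest::SHA1;

  my $provider   = WSBackup::StorageProvider::S3->new;  # hypothetical subclass
  my %stat_cache = load_stat_cache();   # hypothetical: path => { sig, meta }
  my %meta;                             # becomes this run's metafile

  find(sub {
      return unless -f $_;
      my $path = $File::Find::name;
      my @st   = stat(_);
      my $sig  = join(",", $st[1], $st[9], $st[7]);   # inode, mtime, size

      if (my $c = $stat_cache{$path}) {
          if ($c->{sig} eq $sig) {
              $meta{$path} = $c->{meta};               # unchanged; reuse
              return;
          }
      }

      open(my $fh, "<", $_) or die "open $_: $!";
      binmode($fh);
      my $ctx = Digest::SHA1->new;
      $ctx->addfile($fh);
      my $sha1 = $ctx->hexdigest;
      close($fh);

      my $chunks = store_file_in_chunks($_, $provider);   # from earlier sketch
      $meta{$path} = { stat => [@st], sha1 => $sha1, chunks => $chunks };
      $stat_cache{$path} = { sig => $sig, meta => $meta{$path} };
  }, $ENV{HOME});

  # then: serialize %meta (YAML), encrypt it, upload it, save the stat cache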

Result:

-- I can restore any host from any point in time, with Amazon/Google storing all my data, and I only pay $0.15/GB-month.

Nice.

I'm partway through writing it. Will open source it soon. Ideally tonight.

Comments:
(Deleted comment)
From: brad
2006-03-19 09:13 pm (UTC)

GIT: yes, I'm a big fan of content-addressed filesystems. Not for everything, but they have their uses.

And yes, public key so I only have to protect my private key in one place. (even though it'll have a good passphrase)
From: kvance
2006-03-19 09:19 pm (UTC)

Sweet! I'm really terrible with backups, and I've been waiting for a service that does this, almost exactly.
From: brad
2006-03-19 09:23 pm (UTC)

I started thinking of this when evan lost all his data a year back (he's recovered it since, he was just lazy), but at the time Amazon S3 didn't exist, so the storage provider layer was a lot hairier, since we were all scheming up P2P solutions which weren't exactly trust-inspiring (or fast).

Now that hurdle is over, this should be a breeze.
From: (Anonymous)
2006-03-19 09:22 pm (UTC)

Really cool

This is awesome -- really, really cool. Can't wait to see it. I'd actually like to see it implemented for Openomy (http://www.openomy.com), too (selfish, yes, because Openomy is my project). Then you'd really have a full-on solution: you wrote this, and you wrote a lot of the actual storage infrastructure for Openomy (we use MogileFS). ;)

Can't wait to see the code.

Ian
From: brad
2006-03-19 09:24 pm (UTC)

Re: Really cool

Whoa, neat. Hadn't heard of Openomy.

Write a StorageProvider subclass when I release the code, and then I'll include it in future releases.
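
For a rough idea of the shape such a subclass would take (illustrative only -- the method names follow the provider sketch in the post above, and no real Openomy API calls are shown):

  package WSBackup::StorageProvider::Openomy;
  use strict;
  use warnings;
  use base 'WSBackup::StorageProvider';   # base class from the post's sketch

  sub put_file {
      my ($self, $key, $dataref) = @_;
      die "TODO: push $key to Openomy via its API";
  }

  sub get_file {
      my ($self, $key) = @_;
      die "TODO: fetch $key from Openomy and return its contents";
  }

  sub has_file {
      my ($self, $key) = @_;
      die "TODO: ask Openomy whether $key exists";
  }

  1;
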
From: brad
2006-03-19 09:27 pm (UTC)

Re: Really cool

P.S. Go get a TypeKey or MyOpenID or Level9 account or something that can do OpenID. Anonymous comments are painful, since I can't trust you to add future comments without moderation.

Or a LiveJournal.
From: iseff [typekey.com]
2006-03-19 10:41 pm (UTC)

Re: Really cool

I'll definitely write the subclass when it's released. Should be cool. :)

Also: got myself a TypeKey.

Ian
From: mollyblack
2006-03-19 10:10 pm (UTC)

Thank you for sharing this!

My husband and I look forward to this and figuring out how to use it for ourselves. Especially since we're planning on setting up our own full-time business server for our various sites. Having an off-site encrypted-nightly backup will help out a TON.
From: xlerb
2006-03-19 10:10 pm (UTC)

It looks like, if you have lots of tiny files, you'll get a lot of tiny files on the server; they might not like that, aside from any efficiency issues. I guess the metadata could store an offset into the given data segment — but then deleting backups will get “interesting”.

Obligatory nitpicking: ctime, not mtime, is what's traditionally used for incremental backups, because mtime can be arbitrarily changed (and is often preserved by archive-extraction and file-transfer utilities). (ISTR hearing that ctime was originally going to be “creation time”, and then the Unix people realized they needed something for incremental dumps and made it be “change time” instead.)

More obligatory nitpicking: Uh, I hope there's a plan for migrating away from SHA1 when NIST blesses a successor in a few years, or when someone publishes an actual collision, whichever comes first.
From: brad
2006-03-19 10:14 pm (UTC)

I was originally aggregating little files into bigger files and doing the offset stuff but it got complicated/ugly really quickly. I still plan to add it later, but not for initial release.

ctime: thanks.

SHA1: yes. I'll probably make the files be [hashname]-[hash].chunk. But then again, you're backing up your own data: you're not really going to be attacking yourself, are you?
From: evan
2006-03-19 10:49 pm (UTC)

To be honest, this sounds really similar to git. Likely because they're sorta addressing the same need (consider: git manages new revisions as well...).
I wonder if it would be useful to support git directly?

Check out the text following the "Discussion" heading at:
http://www.kernel.org/pub/software/scm/git/docs/

File blobs are addressed by SHA-1, then you have "tree" blobs that correspond to directories (containing filenames and hashes, etc.), and then "commit" blobs that point to a tree and previous commits. Then git itself is a bunch of utilities that interact with your local stat cache, syncing/pulling it from a commit id.

(Of course, this is also nearly the same as monotone, as it inspired git's design and it's also content-addressable.)
From: evan
2006-03-19 10:53 pm (UTC)

Oh, and the main point where these designs diverge is that git has a tree of directories while you have a flat index. The advantage of a tree is that if I modify a file, it only has to rewrite its directory (and the directories above, to point at the new directory) instead of the entire index. But it also means more file futzing. Dunno which way is better.
From: brad
2006-03-19 10:58 pm (UTC)

I debated that a bunch. In the end I decided I wanted to carry my backup YAML index around on a USB stick or whatever, without a tree of dependent files. Also makes purging old backups and chunks with zero reference counts easier, which you don't do in a source control system.
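
A rough sketch of that purge pass (not real wsbackup code; list_files() and delete_file() are assumed provider methods used here for illustration): walk every live metafile, count references to each chunk key, and delete any stored chunk nobody references.

  # Sketch: purge chunks with zero reference counts across all kept backups.
  use strict;
  use warnings;

  sub purge_unreferenced_chunks {
      my ($provider, @metafiles) = @_;   # @metafiles: decrypted, parsed indexes
      my %refcount;

      for my $meta (@metafiles) {
          for my $file (values %$meta) {
              $refcount{$_}++ for @{ $file->{chunks} };
          }
      }

      # list_files() / delete_file() are assumed provider methods.
      for my $key (grep { /\.chunk$/ } $provider->list_files) {
          $provider->delete_file($key) unless $refcount{$key};
      }
  }
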
From: brad
2006-03-19 10:56 pm (UTC)

I'm adequately familiar with git. The problems with using it are:

-- it's in C, so adding new providers for GDrive/etc. means writing C when there's no reason it has to be C, and that will just slow down development

-- the encryption stuff

-- the chunking stuff.

-- git's in heavy development, so it wouldn't be fun trying to get patches merged, especially for things that don't make any sense for the kernel.

In the end, the idea of "a content addressable filesystem" isn't that hard to re-implement, especially when:

-- it's in Perl

-- I don't have to deal with all the merging/branching stuff that an SCM needs to do

If you want to do it in git, though, I'll race you. I know your C foo is good, but that good? :-)
From: evan
2006-03-19 10:58 pm (UTC)

That's about the answer I expected, and I agree with your reasoning.

(And jesus do I not enjoy writing C. Well, I enjoy it about as much as javascript, in that the whole time I'm simultaneously thinking "ouch ouch this is so painful, yet amusing in a brainfuck sort of way" and "this is just what you have to go through to achieve goal [x]".)
From: xlerb
2006-03-20 05:42 am (UTC)

Plan9's Venti is another content-addressed archival storage thing, if anyone's interested in yet more related work.
From: brad
2006-03-20 04:43 pm (UTC)

Thanks for the link!
From: edm
2006-03-20 12:46 am (UTC)

ar, tar, zip, et al

I suspect the simplest way to handle small file aggregation is to push them into a known archive format (ar, tar, zip, war, whatever) "on the fly" and then treat that as a file to back up (and the same in reverse). Ideally you want a format that is byte-for-byte identical given the same input (a lot of things have archive-creation dates and the like which would fool that sort of check).

Some people will probably require small file aggregation (eg, my mail is in MH format, in _lots_ of 1kb-4kb files) in order for it to be useful as a backup. But getting things going first, then adding that sort of support is obviously the sensible way to go.

Ewen
From: quelrod
2006-03-19 10:16 pm (UTC)

Isn't the successor SHA-2?
From: crucially
2006-03-20 12:29 am (UTC)

Uhm

ctime is completely useless, it is inode change time, not data change time

From: xlerb
2006-03-20 05:30 am (UTC)

The Open Group believes otherwise: “Upon successful completion, where nbyte is greater than 0, write() shall mark for update the st_ctime and st_mtime fields of the file….”
From: ydna
2006-03-19 10:33 pm (UTC)

Plus $0.20/GB transfer on S3.

It's just amazing how quick you crank out from idea to implementation. Someone should nominate you for some award for all this work. I know you're just scratching all those itches you have, but the benefit to the community has been spectacular.
From: mart
2006-03-19 10:44 pm (UTC)

I preferred your idea from ages ago for co-operative backups between friends who have lots of disk space. It had the nice advantage that it's a co-op rather than a paid service, so you can easily add more backups just by striking deals with your friends.

A combination of both would be killer, though.

From: evan
2006-03-19 10:51 pm (UTC)

And really, those ideas together cascade into that guy's distributed filesystem discussed on the memcached list -- you plug in n computers and everything sorta syncs around.
From: brad
2006-03-19 10:51 pm (UTC)

Write a storageprovider subclass for P2P?
From: codetoad
2006-03-19 10:53 pm (UTC)

I was going to do exactly this, convinced the world needed it, and then I found out that someone implemented exactly what I wanted!
From: mart
2006-03-20 08:11 am (UTC)

I thought that was brilliant until I saw that it requires Python. Python's easy on a UNIX-like system, but many of my peers run Windows. Although there's a Python distro for Windows, it's ugly and I doubt any of my peers — or even me, for that matter — would want to install it just to do this. If it were Perl I'd use ActiveState's dev kit tools to bundle it up as a service or a standalone app, but I've no idea how to do a comparable thing for Python. Also, if it were Perl I'd probably hack it into a storage provider for Brad's thingy. ;)

I guess I have another project on my todo list: reinvent the wheel!

From: codetoad
2006-03-20 08:55 am (UTC)

I'm a Python guy, so it's good news for me :)

As far as Windows goes, there's something called py2exe that creates independent, stand-alone applications (and a similar py2app). I'll see about trying to put something together for DIBS once I start hacking on it.
From: dossy
2006-03-19 11:38 pm (UTC)

If you make it easy enough to write a storage provider, I'll use DreamHost as my storage provider. $16/mo for 60 GB disk and 1.6 TB of bandwidth ... if I'm doing my math right, means it'll cost roughly $0.12/GB for the disk and $0.005/GB for the bandwidth. And, as DreamHost increases your limits weekly, it just drives the price down over time.

Definitely announce when wsbackup is available for playing with. :-)
From: octal
2006-03-20 01:07 am (UTC)

S3 pricing is 1-10x the bulk cost, but available in arbitrarily small quantities.

The thing it lacks, though, is secure computation. Someone needs to add a very restricted server-side execution environment, rented per unit per month, and available in a similarly small unit. Right now, the smallest cost-effective unit is a cheap 1U server, which isn't as reliable as S3 -- you'd need a cluster of several 1Us distributed across each S3 site, so maybe $1k/month minimum cost including hw and bw.
From: brad
2006-03-20 02:37 am (UTC)

What are you talking about?

How does secure computation have anything to do with backups?

And while one might get disk/bandwidth cheaper than Amazon, Most Users can't.
And while one might build something as big/reliable/fast as Amazon, almost no users can.
From: octal
2006-03-21 03:47 pm (UTC)

1-10x pricing is excellent -- that is minimal markup for retailing something like that. They actually undercut anyone spending less than $100k/mo, due to salaries.

I think their target userbase is application developers developing for the web, NOT end-users doing storage. For that, they need some way to execute server side code, and it needs to be some kind of sandbox to keep user A from messing with user B.
From: scsi
2006-03-20 01:21 am (UTC)

This is distributed, right?
From: brad
2006-03-20 01:54 am (UTC)

Did you read the post? :-)

It depends on your target. I don't give a shit about distributed targets because I don't trust my peers. I trust Google and Amazon to hold my data and stay online. (after I encrypt it)
From: dan_erat
2006-03-20 06:08 am (UTC)

Sounds neat. I've been hacking on something with a similar purpose, except that mine's just a wrapper around tar and (soon) gpg, with S3 as the backend (I don't mind several-hundred-meg files for the full backups and want to be able to do restores with standard tools).
From: mulix
2006-03-20 06:09 am (UTC)

Correct me if I'm wrong, but if a single byte in a chunk changes, you're going to have to retransmit the entire chunk, correct?

You might be interested in incorporating some of the ideas behind rsyncrypto (http://sourceforge.net/projects/rsyncrypto) to work around it.
From: brad
2006-03-20 04:44 pm (UTC)

Yes, but I'm not backing up databases.... I'm backing up my $HOME. So not a huge deal.
From: endquote
2006-03-20 08:27 am (UTC)

Strongspace is pretty cool too. I rsync my important (client) stuff there.

I hadn't heard of the S3 thing though, that's cool.