wsbackup -- encrypted, over-the-net, multi-versioned backup - brad's life
Brad Fitzpatrick


wsbackup -- encrypted, over-the-net, multi-versioned backup [Mar. 19th, 2006|12:50 pm]

There are lots of ways to store files on the net lately:

-- Amazon S3 is the most interesting
-- Google's rumored GDrive is surely coming soon
-- Apple has .Mac

I want to back up to them, and to more than one. So first off, abstract out net-wide storage: my backup tool (wsbackup) isn't targeting any one of them. They're all just providers.
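A minimal sketch of what "they're all just providers" could look like, in Python. The method names (put, get, exists), the LocalDirProvider stand-in, and put_everywhere are all hypothetical; wsbackup's real interface isn't shown in this post.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageProvider(ABC):
    """Hypothetical interface: any net-wide store that can hold named blobs."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...

    @abstractmethod
    def exists(self, name: str) -> bool: ...


class LocalDirProvider(StorageProvider):
    """Stand-in backend that writes to a local directory (for trying things out)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, name, data):
        (self.root / name).write_bytes(data)

    def get(self, name):
        return (self.root / name).read_bytes()

    def exists(self, name):
        return (self.root / name).exists()


def put_everywhere(providers, name, data):
    """Store a blob on every provider that lacks it; return how many uploads happened."""
    uploaded = 0
    for provider in providers:
        if not provider.exists(name):
            provider.put(name, data)
            uploaded += 1
    return uploaded
```

An S3 or GDrive backend would be one more subclass; the backup logic only ever sees the three methods.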

Also, I don't trust sending my data in cleartext, or having it stored in cleartext, so public key encryption is a must. Then I can run automated backups from many hosts, each holding only the public key, without much fear of keys being compromised.

I don't want people to be able to do size analysis, and huge files are a pain anyway, so big files are cut into chunks.
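The cutting itself is simple. A rough Python sketch, assuming a fixed 20MB cap; iter_chunks and CHUNK_SIZE are made-up names, and the real tool may split differently:

```python
import io

CHUNK_SIZE = 20 * 1024 * 1024  # assumed cap of 20MB per chunk


def iter_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield successive blocks of at most chunk_size bytes from a binary file object."""
    while True:
        block = fileobj.read(chunk_size)
        if not block:
            break
        yield block


# Tiny chunk size so the split is visible.
pieces = list(iter_chunks(io.BytesIO(b"abcdefg"), chunk_size=3))
```

Each piece would then be encrypted on its own before upload. Note that a file smaller than the cap still ends up as a single chunk of roughly its own size.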

Files stored on Amazon/Google are of the form:

-- meta files: backup_rootname-yyyymmddnn.meta, an encrypted (YAML?) file mapping relative paths from the backup directory root to the stat() information, the original SHA1, and the array of chunk keys (SHA1s of encrypted chunks) that make up the file.

-- [sha1ofencryptedchunk].chunk -- content being a chunk of encrypted data of at most, say, 20MB.
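The two kinds of names above could be produced roughly like this. It's a Python sketch: the helper names and the dict layout of a meta entry are illustrative, and reading "nn" as a two-digit per-day sequence number is an assumption. Encryption is left out, so the functions take chunks that are already encrypted.

```python
import hashlib
from datetime import date


def chunk_name(encrypted_chunk: bytes) -> str:
    """Name a chunk after the SHA1 of its *encrypted* bytes."""
    return hashlib.sha1(encrypted_chunk).hexdigest() + ".chunk"


def meta_name(rootname: str, day: date, seq: int) -> str:
    """backup_rootname-yyyymmddnn.meta, with nn assumed to be a per-day sequence."""
    return f"{rootname}-{day:%Y%m%d}{seq:02d}.meta"


def meta_entry(stat_info: dict, original: bytes, encrypted_chunks: list) -> dict:
    """One file's record: stat() info, SHA1 of the plaintext, and its chunk keys."""
    return {
        "stat": stat_info,
        "sha1": hashlib.sha1(original).hexdigest(),
        "chunks": [hashlib.sha1(c).hexdigest() for c in encrypted_chunks],
    }


# The metafile body maps relative paths to entries before it is encrypted.
meta = {"docs/notes.txt": meta_entry({"size": 5, "mode": 0o644}, b"hello", [b"<ciphertext>"])}
```

Because a chunk key depends only on the encrypted bytes, a provider that already holds a chunk with that name doesn't need it uploaded again.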

Then every night different hosts/laptops recurse directory trees, consult a stat() cache (on, say, inode number, mtime, size, whatever) and do SHA1 calculations on changed files, look up the rest from the cache, and build the metafile, upload any new chunks, encrypt the metafile, upload the metafile.
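The stat() cache part of that nightly run might look something like the following. This is a Python sketch under assumptions: the cache is a plain dict keyed by relative path, the signature is (inode, mtime, size), and scan and file_sha1 are invented names.

```python
import hashlib
import os


def file_sha1(path, bufsize=1 << 20):
    """SHA1 of a file's contents, read in blocks so big files don't fill memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def scan(root, cache):
    """Walk root and return ({relpath: sha1}, [changed relpaths]).

    cache maps relpath -> (signature, sha1) and is updated in place. A file is
    rehashed only when its signature differs from the cached one.
    """
    digests, changed = {}, []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            if not os.path.isfile(full):  # skip broken symlinks, sockets, etc.
                continue
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            sig = (st.st_ino, st.st_mtime_ns, st.st_size)
            cached = cache.get(rel)
            if cached is None or cached[0] != sig:
                cache[rel] = (sig, file_sha1(full))
                changed.append(rel)
            digests[rel] = cache[rel][1]
    return digests, changed
```

Only the paths in the changed list need chunking and uploading; everything else goes into the new metafile straight from the cache.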


-- I can restore any host from any point in time, with Amazon/Google storing all my data, and only paying $0.15/GB-month.
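Point-in-time restore then comes down to picking the right metafile. A sketch in Python, assuming the yyyymmddnn stamp in the name is exactly ten digits so that plain string comparison orders backups correctly; pick_meta is a hypothetical helper:

```python
def pick_meta(names, rootname, cutoff):
    """Return the newest '<rootname>-yyyymmddnn.meta' dated at or before cutoff.

    cutoff is a 'yyyymmdd' string; returns None when no backup is old enough.
    """
    prefix, suffix = rootname + "-", ".meta"
    candidates = []
    for name in names:
        if not (name.startswith(prefix) and name.endswith(suffix)):
            continue
        stamp = name[len(prefix):-len(suffix)]
        # Require exactly ten digits so "home" never matches "home-work-...".
        if len(stamp) == 10 and stamp.isdigit() and stamp[:8] <= cutoff:
            candidates.append(name)
    return max(candidates, default=None)


listing = [
    "home-2006031801.meta",
    "home-2006031901.meta",
    "home-2006031902.meta",
    "home-2006032001.meta",
    "laptop-2006031901.meta",
]
chosen = pick_meta(listing, "home", "20060319")
```

From the chosen metafile, each file is rebuilt by fetching its chunks in order, decrypting them, and checking the result against the stored original SHA1.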


I'm partway through writing it. Will open source it soon. Ideally tonight.

(Deleted comment)
From: brad
2006-03-19 09:13 pm (UTC)
GIT: yes, I'm a big fan of content-addressed filesystems. Not for everything, but they have their uses.

And yes, public key so I only have to protect my private key in one place. (even though it'll have a good passphrase)
From: kvance
2006-03-19 09:19 pm (UTC)
Sweet! I'm really terrible with backups, and I've been waiting for a service that does this, almost exactly.
From: brad
2006-03-19 09:23 pm (UTC)
I started thinking of this when evan lost all his data a year back (he's recovered it since, he was just lazy), but at the time Amazon S3 didn't exist, so the storage provider layer was a lot hairier, since we were all scheming up P2P solutions which weren't exactly trust-inspiring (or fast).

Now that hurdle is over, this should be a breeze.
From: (Anonymous)
2006-03-19 09:22 pm (UTC)

Really cool

This is awesome -- really, really cool. Can't wait to see it. I'd actually like to see it implemented for Openomy (http://www.openomy.com), too (selfish, yes, because Openomy is my project). Then you'd really have a full-on solution: you wrote this, and you wrote a lot of the actual storage infrastructure for Openomy (we use MogileFS). ;)

Can't wait to see the code.

From: brad
2006-03-19 09:24 pm (UTC)

Re: Really cool

Whoa, neat. Hadn't heard of Openomy.

Write a StorageProvider subclass when I release the code, then I'll include it in future releases.
From: mollyblack
2006-03-19 10:10 pm (UTC)

Thank you for sharing this!

My husband and I look forward to this and figuring out how to use it for ourselves. Especially since we're planning on setting up our own full-time business server for our various sites. Having an off-site encrypted-nightly backup will help out a TON.
From: xlerb
2006-03-19 10:10 pm (UTC)
It looks like, if you have lots of tiny files, you'll get a lot of tiny files on the server; they might not like that, aside from any efficiency issues. I guess the metadata could store an offset into the given data segment — but then deleting backups will get “interesting”.

Obligatory nitpicking: ctime, not mtime, is what's traditionally used for incremental backups, because mtime can be arbitrarily changed (and is often preserved by archive-extraction and file-transfer utilities). (ISTR hearing that ctime was originally going to be “creation time”, and then the Unix people realized they needed something for incremental dumps and made it be “change time” instead.)

More obligatory nitpicking: Uh, I hope there's a plan for migrating away from SHA1 when NIST blesses a successor in a few years, or when someone publishes an actual collision, whichever comes first.
From: brad
2006-03-19 10:14 pm (UTC)
I was originally aggregating little files into bigger files and doing the offset stuff but it got complicated/ugly really quickly. I still plan to add it later, but not for initial release.

ctime: thanks.

SHA1: yes. I'll probably make the files be [hashname]-[hash].chunk. But then again, you're backing up your own data: not really going to be attacking yourself, are you?
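For illustration, the [hashname]-[hash].chunk naming could be sketched like this in Python. The helper names are made up, and this assumes the hash name is one that hashlib knows about:

```python
import hashlib


def chunk_name(encrypted_chunk: bytes, hashname: str = "sha1") -> str:
    """Build '[hashname]-[hash].chunk' from the encrypted chunk's bytes."""
    digest = hashlib.new(hashname, encrypted_chunk).hexdigest()
    return f"{hashname}-{digest}.chunk"


def verify(name: str, encrypted_chunk: bytes) -> bool:
    """Recompute the digest with whichever hash the name declares."""
    hashname, _, _rest = name.partition("-")
    return chunk_name(encrypted_chunk, hashname) == name


old_style = chunk_name(b"")            # "sha1-....chunk"
new_style = chunk_name(b"", "sha256")  # same data, different hash
```

Old sha1- chunks and chunks named with a newer hash can sit side by side, since each name says how to check itself.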
From: ydna
2006-03-19 10:33 pm (UTC)
Plus $0.20/GB transfer on S3.

It's just amazing how quickly you go from idea to implementation. Someone should nominate you for some award for all this work. I know you're just scratching all those itches you have, but the benefit to the community has been spectacular.
From: mart
2006-03-19 10:44 pm (UTC)

I preferred your idea from ages ago for co-operative backups between friends who have lots of disk space. It had the nice advantage that it's a co-op rather than a paid service, so you can easily add more backups just by striking deals with your friends.

A combination of both would be killer, though.

From: evan
2006-03-19 10:51 pm (UTC)
And really, those ideas together cascade into that guy's distributed filesystem discussed on the memcached list -- you plug in n computers and everything sorta syncs around.
From: dossy
2006-03-19 11:38 pm (UTC)
If you make it easy enough to write a storage provider, I'll use DreamHost as my storage provider. $16/mo for 60 GB disk and 1.6 TB of bandwidth ... if I'm doing my math right, means it'll cost roughly $0.12/GB for the disk and $0.005/GB for the bandwidth. And, as DreamHost increases your limits weekly, it just drives the price down over time.

Definitely announce when wsbackup is available for playing with. :-)
From: octal
2006-03-20 01:07 am (UTC)
S3 pricing is 1-10x the bulk cost, even for arbitrarily small quantities.

The thing it lacks, though, is secure computation. Someone needs to add a very restricted server-side execution environment, rented per unit per month, and available in a similarly small unit. Right now, the smallest cost-effective unit is a cheap 1U server, which isn't as reliable as S3 -- you'd need a cluster of several 1Us distributed across each S3 site, so maybe $1k/month minimum cost including hw and bw.
From: brad
2006-03-20 02:37 am (UTC)
What are you talking about?

How does secure computation have anything to do with backups?

And while one might get disk/bandwidth cheaper than Amazon, most users can't.
And while one might build something as big/reliable/fast as Amazon, almost no users can.
From: scsi
2006-03-20 01:21 am (UTC)
This is distributed, right?
From: brad
2006-03-20 01:54 am (UTC)
Did you read the post? :-)

It depends on your target. I don't give a shit about distributed targets because I don't trust my peers. I trust Google and Amazon to hold my data and stay online. (after I encrypt it)
From: dan_erat
2006-03-20 06:08 am (UTC)
Sounds neat. I've been hacking on something with a similar purpose, except that mine's just a wrapper around tar and (soon) gpg, with S3 as the backend (I don't mind several-hundred-meg files for the full backups and want to be able to do restores with standard tools).
From: mulix
2006-03-20 06:09 am (UTC)
Correct me if I'm wrong, but if a single byte in a chunk changes, you're going to have to retransmit the entire chunk, correct?

You might be interested in incorporating some of the ideas behind rsyncrypto (http://sourceforge.net/projects/rsyncrypto) to work around it.
From: brad
2006-03-20 04:44 pm (UTC)
Yes, but I'm not backing up databases.... I'm backing up my $HOME. So not a huge deal.
From: endquote
2006-03-20 08:27 am (UTC)
Strongspace is pretty cool too. I rsync my important (client) stuff there.

I hadn't heard of the S3 thing though, that's cool.