Treearrange: a compliment to rsync - brad's life
Brad Fitzpatrick

Treearrange: a compliment to rsync [Aug. 27th, 2006|10:59 pm]
Brad Fitzpatrick

My quick hack of the evening is treearrange, which rearranges a directory tree to match a description of how another tree is laid out. The tool generates that description, too.

What problem does this solve? Here's my typical photo-uploading workflow:

-- bring GBs of unorganized photos to work
-- upload GBs of photos from work to my personal server at 100 Mbps.
-- go home
-- rsync down from my server all photos at 6 Mbps. still pretty fast.
-- rearrange/rename. instead of DCIM/nnnCANON/, I rearrange into, say, "Day5-Paris/".

Now, how do I get my photos online? Two choices:

1) upload them from home.
2) upload them from my server (not my home server)

But the problems with the above are, respectively:

1) slow upstream. Not 100 Mbps. More like 1. GBs would take forever.
2) the files aren't in the right places on the server. I only rearranged them locally.

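Rsync won't do. Rsync doesn't deal with files changing directories: a file that moved looks like one deleted file plus one brand-new file, so it gets sent all over again.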

Enter treearrange:

Here's my server, where it all begins:
bradfitz@personal_web:~/honeymoon_pics$ find -type d
.
./DCIM
./DCIM/179CANON
./DCIM/180CANON
./DCIM/181CANON
./DCIM/182CANON
./DCIM/183CANON
./DCIM/184CANON
./DCIM/185CANON
./DCIM/186CANON
./Elph
./Elph/DCIM
./Elph/DCIM/135CANON
./Elph/DCIM/136CANON
./Elph/DCIM/CANONMSC
I rsync them down to my house (pretty fast), and rearrange them:
sammy:Sorted $ find -type d
.
./Barcelona_Airport_Hell
./Barcelona-1
./Barcelona-2
./Boat
./Lisa
./Malta
./Marseille
./Midnight_Buffet
./Naples-Vesuvius-Pompeii
./Palma_de_Mallorca
./Rome
./Stockholm-1
Now, using treearrange, I snapshot where the files are supposed to live:
sammy:Sorted $ ./treearrange --to=arrange.dat

$ head arrange.dat 
945fc334853b4c5edfca34c9908258eacfc86823        Barcelona_Airport_Hell/IMG_8675.JPG
fe5551ad173e425c1c8f40c4f06e72389df7c2ab        Barcelona_Airport_Hell/IMG_8676.JPG
c9f0589a24a8de4a65e8670b8bbb4f570a4452ca        Barcelona_Airport_Hell/IMG_8677.JPG
b244692481c84857d2e7824ec310ca074eee5e6c        Barcelona_Airport_Hell/IMG_8678.JPG
20c6dd346021689b32702c28ec62cde6a2c3a7be        Barcelona_Airport_Hell/IMG_8679.JPG
f1fdd495d10aee11a1cb96019b7b6c0a11e5465f        Barcelona_Airport_Hell/IMG_8680.JPG
f429010fe9a906c8bf513016e03e371ee711f3f6        Barcelona_Airport_Hell/IMG_8681.JPG
7885c7b71cd21a28c11985a591e69e81a12ee316        Barcelona_Airport_Hell/IMG_8683.JPG
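
The post doesn't show treearrange's source, so the following is only a minimal Python sketch of what a snapshot step like `--to` could look like, not the author's implementation. It assumes the 40-hex-digit values above are SHA-1 digests of file contents and that each line is `digest`, whitespace, relative path. The names `file_digest` and `snapshot` are made up for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, read in chunks to bound memory use."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root, manifest_path):
    """Write one 'digest<TAB>relative/path' line per regular file under root.

    Hypothetical helper: the separator and the SHA-1 choice are assumptions.
    """
    manifest_abs = os.path.abspath(manifest_path)
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.abspath(full) == manifest_abs:
                continue  # don't list the manifest itself
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, file_digest(full)))
    entries.sort()  # stable, path-ordered output
    with open(manifest_path, "w", encoding="utf-8") as out:
        for rel, digest in entries:
            out.write(f"{digest}\t{rel}\n")
    return entries
```

Calling `snapshot(".", "arrange.dat")` from inside the sorted tree would produce a file shaped like the listing above. The manifest is tiny compared to the photos, which is the whole point: only the description has to cross the slow link.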
Next I upload the arrange.dat and treearrange to my server, and do the opposite:
bradfitz@personal_web:~/honeymoon_pics$ ./treearrange --from=arrange.dat
file 1 / 738...
file 2 / 738...
  DCIM/179CANON/IMG_7977.JPG -> Barcelona-1/IMG_7977.JPG
file 3 / 738...
  DCIM/179CANON/IMG_7978.JPG -> Barcelona-1/IMG_7978.JPG
file 4 / 738...
  DCIM/179CANON/IMG_7979.JPG -> Barcelona-1/IMG_7979.JPG
file 5 / 738...
  DCIM/179CANON/IMG_7980.JPG -> Barcelona-1/IMG_7980.JPG
file 6 / 738...
  DCIM/179CANON/IMG_7981.JPG -> Barcelona-1/IMG_7981.JPG
.....
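
The reverse step can be sketched the same way. This is a guess at the technique, in Python, under the same assumptions as the manifest format shown earlier (SHA-1 of file contents, then whitespace, then the target path); `apply_manifest` is a hypothetical name. It hashes every file in the tree, then moves each one to the path the manifest assigns to its digest. Unlike the real run shown here, this sketch does not clean up directories that end up empty, and it refuses to overwrite an existing file.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, read in chunks to bound memory use."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def apply_manifest(root, manifest_path):
    """Move files under root to the paths the manifest gives their digests."""
    manifest_abs = os.path.abspath(manifest_path)

    # digest -> relative paths currently holding that content
    have = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.abspath(full) == manifest_abs:
                continue
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            have.setdefault(file_digest(full), []).append(rel)

    # First pass: tick off files that are already where they belong.
    pending = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            digest, target = line.rstrip("\n").split(None, 1)
            sources = have.get(digest, [])
            if target in sources:
                sources.remove(target)
            else:
                pending.append((digest, target))

    # Second pass: move a remaining file with the right content into place.
    moves = []
    for digest, target in pending:
        sources = have.get(digest, [])
        dest = os.path.join(root, target)
        if not sources or os.path.exists(dest):
            continue  # nothing to move, or refuse to overwrite
        source = sources.pop()
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        os.rename(os.path.join(root, source), dest)
        moves.append((source, target))
    return moves
```

Keying on content rather than on file name means the two trees don't have to start out with the same structure, and two files that share a name but come from different cameras can't be confused.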

bradfitz@personal_web:~/honeymoon_pics$ find -type d
.
./Barcelona-1
./Boat
./Marseille
./Lisa
./Rome
./Naples-Vesuvius-Pompeii
./Malta
./Midnight_Buffet
./Palma_de_Mallorca
./Barcelona-2
./Barcelona_Airport_Hell
./Stockholm-1
Tada!

(Then I can rsync to pick up any rotations, adjustments, etc. that I did locally and that weren't just a directory move.)

Comments:
From: gaal
2006-08-28 06:55 am (UTC)
Oh, excellent, I've been wanting something like this for music. I wonder though if this can't be made better by knowing more about tags. The problem is syncing files several ways, when sometimes the updates are to metadata. Unfortunately the filenames can sometimes change too, so there's no key for this!
From: evan
2006-08-28 07:03 am (UTC)
I have exactly this problem! I guess you could fingerprint the files minus the tags -- the one part that doesn't change is the music data itself.
From: gaal
2006-08-28 07:24 am (UTC)
But then how can resume work?

I start syncing by pulling a new file from remotehost to my localhost. Then the download is interrupted, and I resume it. What identifies the partial file on localhost?
From: gaal
2006-08-28 07:26 am (UTC)
Hm, maybe the syncer should pre-tag all files with their own fingerprint and make sure that gets transmitted early?
From: brad
2006-08-28 05:14 pm (UTC)
Sure there is. The digest of the non-tag part of the file is the key. Screw audio fingerprinting.... ignoring the ID3 stuff of the mp3/ogg when digesting is easy enough. Then just fix the ID3 up on the other side, if the modtime is older.
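
A rough Python sketch of that idea, for MP3 files only: hash the bytes between a leading ID3v2 tag and a trailing ID3v1 tag, so a retagged file keeps the same key. The function name `audio_digest` is made up, SHA-1 is an assumption, and this does not handle Ogg (which stores its comments differently), APE tags, or tags appended in unusual places. It also reads the whole file into memory, which is fine for songs but not for everything.

```python
import hashlib


def audio_digest(path):
    """SHA-1 of an MP3's bytes, skipping a leading ID3v2 and a trailing ID3v1 tag."""
    with open(path, "rb") as f:
        data = f.read()
    start, end = 0, len(data)

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, then a 4-byte
    # "synchsafe" size (only the low 7 bits of each byte are used).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer flag: another 10 bytes after the tag
            start += 10

    # ID3v1: a fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return hashlib.sha1(data[start:end]).hexdigest()
```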
From: goldfischegirl
2006-08-28 07:52 am (UTC)
I just have to say, that is sure one homosexual icon.
Lovin' it.
From: jwz
2006-08-28 08:20 am (UTC)
You know the file names are unique, so you don't need to hash: I do this kind of thing on the fly with a keyboard macro that generates "mv" commands...
From: iamo
2006-08-28 08:29 am (UTC)
However, by using a fingerprint it's somewhat more flexible. It'll work even when the original structure of the two trees was not the same.
From: ciphergoth
2006-08-28 09:18 am (UTC)
I would say that makes it less flexible, as well as somewhat slower.

I'd rather something that was the same "shape" as rsync (ie runs on both ends at once) that tries to move files around to make "rsync" work, based on a number of heuristics (file name, size, last modified date, first bytes, last bytes...) applied in order.
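
One way to read "heuristics applied in order" is as a cascade of cheap checks that narrows a list of candidate files before anything expensive happens. The Python below is only a sketch of that reading, run between two local paths rather than across two hosts; `best_match`, `HEURISTICS`, and the 4096-byte probe size are all assumptions made for illustration.

```python
import os


def _name(path):
    return os.path.basename(path)


def _size(path):
    return os.path.getsize(path)


def _mtime(path):
    return int(os.path.getmtime(path))


def _head(path, n=4096):
    with open(path, "rb") as f:
        return f.read(n)


def _tail(path, n=4096):
    with open(path, "rb") as f:
        f.seek(max(0, os.path.getsize(path) - n))
        return f.read(n)


# Cheapest checks first, so most candidates drop out before any file is read.
HEURISTICS = [_name, _size, _mtime, _head, _tail]


def best_match(path, candidates):
    """Return the single candidate that agrees with path on every check, else None."""
    for check in HEURISTICS:
        expected = check(path)
        candidates = [c for c in candidates if check(c) == expected]
        if not candidates:
            return None
    return candidates[0] if len(candidates) == 1 else None
```

The trade-off against hashing whole files is speed versus certainty: this never reads more than a few kilobytes per file, but it can be fooled by files that differ only in the middle, and it gives up when a file was renamed as well as moved.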
From: brad
2006-08-28 04:32 pm (UTC)
Yeah, that's what I wanted too, but I realized it was only a few minute problem if I did the minimal work first.

Later I can add an rsync-ish interface, maybe just ssh'ing to the remote host, running Perl, and piping the original script into it, so the other side doesn't even need treearrange.
From: brad
2006-08-28 04:32 pm (UTC)
Not necessarily. I was shooting with two cameras. Both Canons.
From: yuval_kogman
2006-08-28 09:13 am (UTC)

unison

unison can do this, albeit not intelligently.

It doesn't have smart tracking of mv's for 3-way merging, but it does have the xferbycopying feature which will check if a remote file with the same checksum exists.
From: edm
2006-08-28 09:35 pm (UTC)

Re: unison

Excellent. I was going to suggest that this feature should really be part of something like rsync (which already walks the directories, takes checksums, compares them, etc). Having a separate script is useful, but it'd be even nicer if it "just worked" in the face of running something like rsync (ie, copied/moved files that have moved, and uploaded any new ones).

Ewen
From: awwaiid
2006-08-29 06:05 am (UTC)

Re: unison

Yes, I adore unison... use it to sync my home directory between about six machines, many of them using ssh rsa keys to do it unattended. Work on a project on my machine here tonight, tomorrow I go in to work and it's there waiting.
From: lakeguy
2006-08-28 11:18 am (UTC)
you could always just sort at work :P
From: herbie
2006-08-28 06:01 pm (UTC)
That's pretty sweet. How long does it typically take, though, if it has to generate all those hashes on large data?

(Also, complement?)
From: brad
2006-08-28 06:04 pm (UTC)
Faster than uploading them.
From: fanf
2006-08-28 11:33 pm (UTC)
The idea sounds a bit like magicmirror, but I expect that's too dependent on the public FTP archive model.