Detecting equivalence of audio files - brad's life
Brad Fitzpatrick


Detecting equivalence of audio files [Dec. 29th, 2006|01:44 pm]

After several days of hacking on MogileFS, I decided to switch gears today and work on personal stuff. I boiled my personal problems down to:
  • getting all my DVDs ripped
  • consolidating three computers' mp3s into one unified, tagged collection
  • finishing brackup
Somehow I decided that the answer to all the above was more disk space, so I bought the Norco DS-1220, which gholam had recommended (after deciding that the Port Multiplier support in Linux was basically mature). I also researched the best disks to populate it with (Western Digital 500GB WD-5000YS), but I can get those later.

After all that, I remembered I had a 300 GB external drive that I could use for mp3 consolidation scratch space. So on to that project....

Previously, my canonical location for mp3s was on my home server. Then my laptop kinda became the new place (and where I had everything auto-tagged, using a tagger I wrote on an airplane). Then I have some mp3s on my desktop at home. In other words: fucking gross.

Problem statement: How to merge all my music together?

Sub-problem: How to tell if two files are the same, if their paths are different, and their checksums are different (because, say, one's been tagged)?

Answer: decode the mp3/ogg to stdout, rather than a soundcard, and checksum the audio stream! (source: audmd5)

Demo:

$ md5sum "sammy/Weezer/02 Pinkerton/01 Tired of Sex.mp3"
b0298cdf1c2135f13d788863cb221ca3
$ md5sum "laptop/Weezer/1996-Pinkerton/Weezer - Pinkerton - 01 - Tired of Sex.mp3"
aa637f841945da67c2aad1f8c2b4ce16
$ audmd5 "sammy/Weezer/02 Pinkerton/01 Tired of Sex.mp3"
8c0952de1e8d13c3ab079adc4a21a400
$ audmd5 "laptop/Weezer/1996-Pinkerton/Weezer - Pinkerton - 01 - Tired of Sex.mp3"
8c0952de1e8d13c3ab079adc4a21a400
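
The decode-and-checksum idea can be sketched in a few lines of Python. This is not the audmd5 script itself, just a minimal illustration. The decoder command lines (`mpg123 -q -s` and `ogg123 -q -d raw -f -`) are assumptions about the installed tools, and the names `DECODERS`, `stream_md5`, and `audio_md5` are hypothetical.

```python
import hashlib
import os
import subprocess

# Assumed decoder command lines that write raw PCM to stdout.
# Verify the flags against your installed mpg123/ogg123 before relying on them.
DECODERS = {
    ".mp3": ["mpg123", "-q", "-s"],
    ".ogg": ["ogg123", "-q", "-d", "raw", "-f", "-"],
}


def stream_md5(cmd, chunk_size=65536):
    """Run cmd and return the MD5 hex digest of everything it writes to stdout."""
    digest = hashlib.md5()
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL) as proc:
        # Read the decoded audio in chunks so large files never sit in memory.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
            digest.update(chunk)
    if proc.returncode != 0:
        raise RuntimeError(f"decoder exited with status {proc.returncode}: {cmd}")
    return digest.hexdigest()


def audio_md5(path):
    """Checksum the decoded audio of path, ignoring tags and other metadata."""
    ext = os.path.splitext(path)[1].lower()
    return stream_md5(DECODERS[ext] + [path])
```

Two copies of a file that differ only in their tags should produce the same digest this way, provided both are decoded by the same decoder build.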

... one more tool in my arsenal to fight my mp3 collection.

Comments:
From: reizar
2006-12-29 10:43 pm (UTC)
I've been half-tempted before to just throw all my MP3s into one great big folder, no subdirectories, then load them all up in WinAmp (or XMMS if I'm using Linux), turn on the Shuffle option, hit Play, and never look back.

Sorting music is SUCH a pain when you get past the 200 mark. Just imagine the people who go through their entire hard drive and clean it out every month.

Uh-oh, time to defrag...
From: scsi
2006-12-29 10:46 pm (UTC)
apt-get install fdupes
From: scsi
2006-12-29 10:48 pm (UTC)
Ack, forgot about diff's b/c of tagging.. :(
From: midendian
2006-12-29 10:51 pm (UTC)
I boiled my personal problems down to:

I envy the depth of your personal problems!
From: brad
2006-12-29 10:54 pm (UTC)
I'm ignoring my house problems, bills, medical things, etc.

Just things I can solve sitting on my ass at the computer. (which probably includes bills, but fuck it...)
From: muerte
2006-12-29 10:55 pm (UTC)
Genius! That's such a kickass and simple solution to a semi-complex problem. I updated the script to handle multiple file inputs (/mnt/mp3/*.mp3) like how md5sum works.
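
Handling many files at once, md5sum-style, also makes it easy to group paths by digest and spot duplicates. The following is a minimal sketch, not the updated script mentioned above; `digest_lines`, `group_by_digest`, and `duplicates` are hypothetical names, and `digest_func` stands in for whatever function computes the audio checksum of a path.

```python
from collections import defaultdict


def digest_lines(paths, digest_func):
    """Return md5sum-style lines: digest, two spaces, path."""
    return [f"{digest_func(path)}  {path}" for path in paths]


def group_by_digest(paths, digest_func):
    """Map each digest to the list of paths that produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[digest_func(path)].append(path)
    return dict(groups)


def duplicates(groups):
    """Keep only the digests shared by more than one path."""
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Passing the result of `group_by_digest` to `duplicates` leaves only the sets of files whose audio is identical.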
From: scsi
2006-12-29 11:04 pm (UTC)
I demand this script be updated with some sort of progress twirly baton so I have something to look at while I wait.. :)
From: evan
2006-12-29 11:06 pm (UTC)
Your solution is cool, but it will probably be much quicker (no expensive mp3 decoding) to just strip the tags before hashing. You can do it without mutating the file -- for id3v1 it's a fixed number of bytes at the end of the file (trivial), while for id3v2 I don't know the spec but surely you could skip over the tags while streaming the file into an md5 summer.
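
As a rough illustration of the tag-stripping approach, the sketch below hashes a file's bytes after dropping a trailing ID3v1 tag (128 bytes starting with "TAG") and a leading ID3v2 tag (10-byte header with a synchsafe size field). It is deliberately simplified: it assumes at most one tag of each kind in the usual position and ignores appended ID3v2 tags, APE or Lyrics3 tags, and padding quirks. The names `id3v2_size` and `tagless_md5` are hypothetical.

```python
import hashlib

ID3V1_SIZE = 128  # fixed-size ID3v1 tag: last 128 bytes, starting with b"TAG"


def id3v2_size(header):
    """Return the total size of a leading ID3v2 tag, or 0 if there is none."""
    if len(header) < 10 or header[:3] != b"ID3":
        return 0
    flags = header[5]
    # The tag size is a 28-bit "synchsafe" integer: 7 useful bits per byte.
    size = 0
    for byte in header[6:10]:
        size = (size << 7) | (byte & 0x7F)
    footer = 10 if flags & 0x10 else 0  # optional 10-byte footer (ID3v2.4)
    return 10 + size + footer


def tagless_md5(data):
    """MD5 of data with a leading ID3v2 tag and trailing ID3v1 tag removed."""
    start = id3v2_size(data[:10])
    end = len(data)
    tag_at = end - ID3V1_SIZE
    if tag_at >= start and data[tag_at:tag_at + 3] == b"TAG":
        end = tag_at
    return hashlib.md5(data[start:end]).hexdigest()
```

This version takes the whole file as bytes for brevity; a streaming variant would seek past the leading tag and stop before the trailing one.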
From: brad
2006-12-29 11:17 pm (UTC)
I looked into that first. Yes, id3v1 was easy, but id3v2 got ugly quickly. With extensive caching of stat() info -> raw_md5 and raw_md5 -> music_md5, I won't have to do the mp3 decoding often (or even the raw md5 often), once I do it the first time.

I actually was tempted to do the id3v2 parsing for "fun" but then I kicked myself and moved on, remembering the real goal.
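
The two-level caching described here (stat() info -> raw_md5, then raw_md5 -> music_md5) might look roughly like the sketch below. It is an in-memory illustration only; the choice of path, size, and modification time as the stat key is an assumption, and `DigestCache` and `decode_md5` are hypothetical names. `decode_md5` stands for the expensive function that decodes a file and checksums its audio.

```python
import hashlib
import os


class DigestCache:
    """Two-level cache: stat() info -> raw_md5, then raw_md5 -> music_md5."""

    def __init__(self, decode_md5):
        self.decode_md5 = decode_md5  # expensive function: path -> music digest
        self.raw_by_stat = {}         # (path, size, mtime_ns) -> raw_md5
        self.music_by_raw = {}        # raw_md5 -> music_md5

    def raw_md5(self, path):
        """MD5 of the file bytes, recomputed only when the stat key changes."""
        st = os.stat(path)
        key = (path, st.st_size, st.st_mtime_ns)
        if key not in self.raw_by_stat:
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            self.raw_by_stat[key] = digest.hexdigest()
        return self.raw_by_stat[key]

    def music_md5(self, path):
        """Audio digest, decoded at most once per distinct raw file content."""
        raw = self.raw_md5(path)
        if raw not in self.music_by_raw:
            self.music_by_raw[raw] = self.decode_md5(path)
        return self.music_by_raw[raw]
```

Because the second level is keyed by the raw checksum, byte-identical copies at different paths trigger only one decode. Persisting the two dictionaries between runs is left out of the sketch.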
From: kvance
2006-12-29 11:18 pm (UTC)
I was about to say something about MPlayer, and then I remembered that it still can't dump to stdout. If your music collection was full of weird shit like flacs and mp4s, it would probably still be worth it to make the FIFO.
From: valiskeogh
2006-12-29 11:41 pm (UTC)
i highly recommend those wd5000ys drives. i've got five of them in raid 5 at the moment and hoping to add more in the near future. i got 4 of them from newegg, one from zipzoomfly.com. zipzoom has them for 170 right now. a point tho, they work great IF they work. in my initial order of 4, two were DOA and the newegg review page for that drive is full of the same story. it seems as long as they work when you first hook them up you'll be good, just don't lose any receipts ;)

Valis
From: erik
2006-12-30 12:42 am (UTC)
The only problem with this that I see (and I don't know much about how checksums work, so feel free to correct me) is that two MP3 files of the same song that you got from two different sources, and that were ripped in two different ways, could be slightly different. One could be a second longer than the other, for instance. In this case wouldn't a checksum consider them two different songs?

Smarter would be to analyze the waveform and look for macro-scale similarities.
From: brad
2006-12-30 12:48 am (UTC)
In my case, dups are because I copied the same files around.

But yes, there's audio fingerprinting stuff too. I don't think I'll need to go that far, though.
From: gaal
2006-12-30 08:05 am (UTC)
I think you can add an option to only look at the first n bytes of each stream, to speed up your total scan time w/o risking broken files too much. If you weren't looking at streams, I'd have said look at the last n bytes too, but then it's not dead simple any more.
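
Hashing only the first n bytes of a stream could be sketched as below. This is a minimal illustration with a hypothetical `prefix_md5` helper; `stream` is any binary file-like object, such as the stdout pipe of a decoder process.

```python
import hashlib


def prefix_md5(stream, limit, chunk_size=65536):
    """MD5 of at most the first `limit` bytes read from a binary stream."""
    digest = hashlib.md5()
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # stream ended before reaching the limit
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.hexdigest()
```

When the stream comes from a decoder process, the caller would still need to terminate that process after the prefix has been read. Matching prefixes only suggest that two files are equal, so a full checksum is a sensible confirmation step before deleting anything.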
From: brad
2006-12-30 08:27 am (UTC)
Ah, good call.

But really I should just take mpg123 and ogg123 and combine them into a tool that can parse the files quickly, and emit to stdout the audio frames, and throw out the rest. Then I can just do:

$ cataudio song.ogg | md5sum

That'd be both fast and reliable.
From: joshuak
2006-12-30 08:39 am (UTC)
You should consider using MusicBrainz audio fingerprinting (PUID, TRM) on all your files. Encoding differences might cause different md5sums even in cases where the songs are the same.
From: brad
2006-12-30 07:05 pm (UTC)
Not the case for me. Dups are because I copied the same file around.
From: ex_swined
2007-01-01 05:57 am (UTC)
how about different bitrates? it seems to me that two files with the same song, but different bitrates, would not be detected as duplicates.
From: sweetjannette
2007-01-11 01:45 pm (UTC)
Hi brad! Thanks for sharing. I like the guarantee that you give ;)