Detecting equivalence of audio files - brad's life
Brad Fitzpatrick

Detecting equivalence of audio files [Dec. 29th, 2006|01:44 pm]
Brad Fitzpatrick

After several days of hacking on MogileFS, I decided to switch gears today and work on personal stuff. I boiled my personal problems down to:
  • getting all my DVDs ripped
  • consolidating three computers' mp3s into one unified, tagged collection
  • finishing brackup
Somehow I decided that the answer to all the above was more disk space, so I bought the Norco DS-1220, which gholam had recommended (after deciding that the Port Multiplier support in Linux was basically mature). I also researched the best disks to populate it with (Western Digital 500GB WD-5000YS), but I can get those later.

After all that, I remembered I had a 300 GB external drive that I could use for mp3 consolidation scratch space. So on to that project....

Previously, my canonical location for mp3s was on my home server. Then my laptop kinda became the new place (and it's where I had everything auto-tagged, with a tagger I wrote on an airplane). Then I have some mp3s on my desktop at home. In other words: fucking gross.

Problem statement: How to merge all my music together?

Sub-problem: How to tell if two files are the same, if their paths are different, and their checksums are different (because, say, one's been tagged)?

Answer: decode the mp3/ogg to stdout, rather than a soundcard, and checksum the audio stream! (source: audmd5)

Demo:

$ md5sum "sammy/Weezer/02 Pinkerton/01 Tired of Sex.mp3"
b0298cdf1c2135f13d788863cb221ca3
$ md5sum "laptop/Weezer/1996-Pinkerton/Weezer - Pinkerton - 01 - Tired of Sex.mp3"
aa637f841945da67c2aad1f8c2b4ce16
$ audmd5 "sammy/Weezer/02 Pinkerton/01 Tired of Sex.mp3"
8c0952de1e8d13c3ab079adc4a21a400
$ audmd5 "laptop/Weezer/1996-Pinkerton/Weezer - Pinkerton - 01 - Tired of Sex.mp3"
8c0952de1e8d13c3ab079adc4a21a400
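
The idea can be sketched in a few lines of Python. This is not the audmd5 source, just a minimal illustration under assumptions: mpg123 and ogg123 are installed, mp3s are decoded with the same `-q -s` flags that show up in the thread below, and `ogg123 -q -d raw -f -` is assumed to be the Ogg equivalent. The names `DECODERS`, `md5_of_stream`, and `audio_md5` are made up for this example.

```python
import hashlib
import os
import subprocess

# Assumed decoder command lines; the real audmd5 may invoke them differently.
DECODERS = {
    ".mp3": ["mpg123", "-q", "-s"],                    # quiet, raw audio to stdout
    ".ogg": ["ogg123", "-q", "-d", "raw", "-f", "-"],  # raw device, output to stdout
}

def md5_of_stream(stream, chunk_size=1 << 16):
    """MD5 a binary stream chunk by chunk, so nothing is held in memory whole."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def audio_md5(path, decoder=None):
    """Decode `path` into a pipe and return the MD5 of the decoded audio."""
    if decoder is None:
        ext = os.path.splitext(path)[1].lower()
        decoder = DECODERS[ext]  # KeyError for formats this sketch doesn't know
    proc = subprocess.Popen(list(decoder) + [path],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    try:
        result = md5_of_stream(proc.stdout)
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("decoder failed on %r" % path)
    return result
```

Calling `audio_md5` on two differently tagged copies of the same file should give the same digest, as in the demo above. The digests are only comparable when they come from the same decoder with the same output settings, since the raw output format can vary between decoders.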

... one more tool in my arsenal to fight my mp3 collection.

Comments:
From: reizar
2006-12-29 10:43 pm (UTC)
I've been half-tempted before to just throw all my MP3s into one great big folder, no subdirectories, then load them all up in WinAmp (or XMMS if I'm using Linux), turn on the Shuffle option, hit Play, and never look back.

Sorting music is SUCH a pain when you get past the 200 mark. Just imagine the people who go through their entire hard drive and clean it out every month.

Uh-oh, time to defrag...
(Reply) (Thread)
[User Picture]From: scsi
2006-12-29 10:46 pm (UTC)
apt-get install fdupes
(Reply) (Thread)
[User Picture]From: scsi
2006-12-29 10:48 pm (UTC)
Ack, forgot about diff's b/c of tagging.. :(
(Reply) (Parent) (Thread)
[User Picture]From: brad
2006-12-29 10:50 pm (UTC)
Did you bother reading my post?

Did you miss even the pretty colorful sections I so artfully styled for you?

The checksums of the "duplicate" files are different! Nothing on the fdupes website, debian page, or wikipedia entry says it's smart about the contents of audio files.
(Reply) (Parent) (Thread)
[User Picture]From: scsi
2006-12-29 10:54 pm (UTC)
Artfully styling is sorta hard when you're browsing via lynx.
Next time be more considerate for us graphically challenged.. :) heh
(Reply) (Parent) (Thread)
[User Picture]From: brad
2006-12-29 10:54 pm (UTC)
Oh, heh, you're working today! I have today off.
(Reply) (Parent) (Thread)
[User Picture]From: scsi
2006-12-29 11:05 pm (UTC)
Lucky ass. I wish i could even have a chance to sit down.. :(
(Reply) (Parent) (Thread)
[User Picture]From: midendian
2006-12-29 10:51 pm (UTC)
I boiled my personal problems down to:

I envy the depth of your personal problems!
(Reply) (Thread)
[User Picture]From: brad
2006-12-29 10:54 pm (UTC)
I'm ignoring my house problems, bills, medical things, etc.

Just things I can solve sitting on my ass at the computer. (which probably includes bills, but fuck it...)
(Reply) (Parent) (Thread)
[User Picture]From: fweebles
2006-12-30 12:28 am (UTC)
I wouldn't be surprised if I tuned in next month to find you'd finished billsd, the automatic bill-reading and -paying daemon.
(Reply) (Parent) (Thread)
[User Picture]From: mart
2006-12-30 11:04 pm (UTC)

What's wrong with standing orders and direct debit?

(Reply) (Parent) (Thread)
[User Picture]From: fweebles
2006-12-30 11:59 pm (UTC)
Some people (not me) are paranoid of direct debit.
(Reply) (Parent) (Thread)
[User Picture]From: scsi
2006-12-30 12:57 am (UTC)
Medical? :\
(Reply) (Parent) (Thread)
[User Picture]From: brad
2006-12-30 01:52 am (UTC)
Just routine bullshit. No real problem. (like: SixApart just changed all their healthcare, 401k, payroll, etc, so have to deal with paperwork in that regard....)
(Reply) (Parent) (Thread)
[User Picture]From: muerte
2006-12-29 10:55 pm (UTC)
Genius! That's such a kickass and simple solution to a semi-complex problem. I updated the script to handle multiple file inputs (/mnt/mp3/*.mp3), like md5sum does.
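
Once every file has an audio checksum, finding the duplicates is just a matter of grouping paths by digest. A minimal sketch, assuming a dict of path to digest has already been built; `group_duplicates` is a hypothetical helper, not part of audmd5 or of the updated script mentioned here.

```python
from collections import defaultdict

def group_duplicates(digests):
    """Return {digest: sorted paths} for every digest shared by 2+ files."""
    by_digest = defaultdict(list)
    for path, digest in digests.items():
        by_digest[digest].append(path)
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Each value in the result is one set of files with identical decoded audio, which is the list you'd review before deleting anything.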
(Reply) (Thread)
[User Picture]From: scsi
2006-12-29 11:04 pm (UTC)
I demand this script be updated with some sort of progress twirly baton so I have something to look at while I wait.. :)
(Reply) (Parent) (Thread)
From: evan
2006-12-29 11:06 pm (UTC)
Your solution is cool, but it will probably be much quicker (no expensive mp3 decoding) to just strip the tags before hashing. You can do it without mutating the file -- for id3v1 it's a fixed number of bytes at the end of the file (trivial), while for id3v2 I don't know the spec but surely you could skip over the tags while streaming the file into an md5 summer.
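
A rough sketch of that tag-skipping idea, under stated assumptions: it only handles an ID3v1 tag (the last 128 bytes, starting with `TAG`) and a single ID3v2 tag at the very start of the file (10-byte header with a 4-byte syncsafe size, plus a 10-byte footer when the footer flag is set). Appended ID3v2 tags, APE tags, and similar are not handled, so this is an illustration, not a complete stripper. `id3v2_size` and `tagless_md5` are names invented for the example.

```python
import hashlib

def id3v2_size(header):
    """Total length in bytes of a leading ID3v2 tag, or 0 if there is none."""
    if len(header) < 10 or header[:3] != b"ID3":
        return 0
    flags = header[5]
    size_bytes = header[6:10]
    if any(b & 0x80 for b in size_bytes):
        return 0  # not a valid syncsafe integer, so don't trust it
    size = 0
    for b in size_bytes:
        size = (size << 7) | b   # 7 useful bits per byte
    footer = 10 if flags & 0x10 else 0
    return 10 + size + footer

def tagless_md5(data):
    """MD5 of `data` (bytes) with a leading ID3v2 and trailing ID3v1 tag skipped."""
    start = id3v2_size(data[:10])
    end = len(data)
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return hashlib.md5(data[start:end]).hexdigest()
```

This version takes the whole file as bytes for simplicity; a streaming variant would seek past the header and stop 128 bytes early instead.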
(Reply) (Thread)
[User Picture]From: brad
2006-12-29 11:17 pm (UTC)
I looked into that first. Yes, id3v1 was easy, but id3v2 got ugly quickly. With extensive caching of stat() info -> raw_md5 and raw_md5 -> music_md5, I won't have to do the mp3 decoding often (or even the raw md5 often), once I do it the first time.

I actually was tempted to do the id3v2 parsing for "fun" but then I kicked myself and moved on, remembering the real goal.
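
The two-level caching described above could look roughly like this. It is a sketch under assumptions: the caches are plain in-memory dicts (a real run would want them persisted to disk), the stat key is path, size, and mtime, and `decode_md5` is a hypothetical callable that does the slow decode-and-checksum step.

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 16):
    """Plain MD5 of the file's bytes, tags and all."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

class Md5Cache:
    def __init__(self, decode_md5):
        self.decode_md5 = decode_md5  # hypothetical: path -> MD5 of decoded audio
        self.stat_to_raw = {}         # (path, size, mtime) -> raw_md5
        self.raw_to_music = {}        # raw_md5 -> music_md5

    def music_md5(self, path):
        st = os.stat(path)
        key = (path, st.st_size, st.st_mtime_ns)
        raw = self.stat_to_raw.get(key)
        if raw is None:               # unseen or changed file: read it once
            raw = self.stat_to_raw[key] = file_md5(path)
        music = self.raw_to_music.get(raw)
        if music is None:             # byte content never decoded before
            music = self.raw_to_music[raw] = self.decode_md5(path)
        return music
```

With this layout, byte-identical copies in different directories cost one decode in total, and an unchanged file costs only a `stat()` on later runs.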
(Reply) (Parent) (Thread)
From: evan
2006-12-29 11:20 pm (UTC)
In that case, you can copy the file to a tmpfs dir, run one of the existing tag-strippers on it, then md5 that.

(Of course, I'm not recommending you do anything now that you have a workable solution. I just know how slow mp3 decoding is on my home machine and wouldn't be able to wait that long.)
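
For illustration, the copy-strip-hash approach might be sketched like this. The tag-stripper command is supplied by the caller and is an assumption here (something like `["id3convert", "-s"]`, if that matches your installed id3lib), as is using `/dev/shm` as the tmpfs directory. `stripped_md5` is a made-up name.

```python
import hashlib
import os
import shutil
import subprocess
import tempfile

def stripped_md5(path, strip_cmd, scratch_dir=None):
    """Copy `path` to a scratch dir, strip tags from the copy, MD5 the result.

    strip_cmd is an assumed command that edits the file named by its last
    argument in place. The original file is never modified.
    """
    with tempfile.TemporaryDirectory(dir=scratch_dir) as tmp:
        copy = os.path.join(tmp, os.path.basename(path))
        shutil.copyfile(path, copy)
        subprocess.run(list(strip_cmd) + [copy], check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        with open(copy, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()
```

Passing `scratch_dir="/dev/shm"` on a typical Linux box keeps the temporary copy in RAM, so the only disk I/O is the initial read.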
(Reply) (Parent) (Thread)
[User Picture]From: brad
2006-12-29 11:31 pm (UTC)
tmpfs + stripping tags is clever. If I get impatient later when I run this all, I might resort to that... thanks.
(Reply) (Parent) (Thread)
(Deleted comment)
[User Picture]From: muerte
2006-12-30 03:20 am (UTC)
Seems like with multicore processors this would scale pretty well. You could power through an entire directory of MP3s overnight. 5 seconds per MP3, 10000 MP3s, you're looking at ~14 hours. Throw a couple cores at that and you can cut that down pretty quick.
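
A sketch of both the arithmetic and the fan-out, assuming the per-file work is a callable `hash_one` (hypothetical) that launches an external decoder. Threads are enough in that case, since the heavy lifting happens in the child processes. The estimate assumes perfect scaling across workers, which real disks won't quite deliver.

```python
from concurrent.futures import ThreadPoolExecutor

def estimate_hours(n_files, seconds_each, workers=1):
    """Wall-clock estimate, assuming the work splits evenly across workers."""
    return n_files * seconds_each / workers / 3600

def hash_all(paths, hash_one, workers=4):
    """Run hash_one over every path on a thread pool; returns {path: digest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(hash_one, paths)))
```

`estimate_hours(10000, 5)` works out to about 13.9 hours, matching the figure above, and `estimate_hours(10000, 5, workers=4)` to about 3.5.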
(Reply) (Parent) (Thread)
[User Picture]From: waider
2006-12-30 12:17 pm (UTC)
This is pretty much part of what my previously mentioned MP3 retagger/sorter script does. If it finds that it's putting two files into the same place (on account of their tagging, or the command line options you've fed it) it'll copy both to /tmp, use id3convert to rip out the tags, then compare the files. There's also some other garbage in there like "if the files *aren't* identical, check if one's got a higher bitrate and if so ditch the lesser one" and that sort of thing.
(Reply) (Parent) (Thread)
[User Picture]From: kvance
2006-12-29 11:18 pm (UTC)
I was about to say something about MPlayer, and then I remembered that it still can't dump to stdout. If your music collection was full of weird shit like flacs and mp4s, it would probably still be worth it to make the FIFO.
(Reply) (Thread)
[User Picture]From: valiskeogh
2006-12-29 11:41 pm (UTC)
i highly recommend those wd5000ys drives. i've got five of them in raid 5 at the moment and hoping to add more in the near future. i got 4 of them from newegg, one from zipzoomfly.com , zipzoom has them for 170 right now. a point tho, they work great IF they work. in my initial order of 4, two were DOA and the newegg review page for that drive is full of the same story. it seems as long as they work when you first hook them up you'll be good, just don't lose any receipts ;)

Valis
(Reply) (Thread)
[User Picture]From: erik
2006-12-30 12:42 am (UTC)
The only problem with this that I see (and I don't know much about how checksums work, so feel free to correct me), is that two MP3 files of the same song that you got from two different sources, and were ripped in two different ways, could be slightly different. One could be a second longer than the other, for instance. In this case wouldn't a checksum consider them two different songs?

Smarter would be to analyze the waveform and look for macro-scale similarities.
(Reply) (Thread)
[User Picture]From: brad
2006-12-30 12:48 am (UTC)
In my case, dups are because I copied the same files around.

But yes, there's audio fingerprinting stuff too. I don't think I'll need to go that far, though.
(Reply) (Parent) (Thread)
[User Picture]From: supersat
2006-12-30 01:02 am (UTC)
I've been planning on doing something like this with audio fingerprinting for a while, and trying to do something intelligent when it detects a match (e.g. keep the one with the higher bit-rate). Of course, I've been saying this for, oh, five years now. Maybe someday. ;)
(Reply) (Parent) (Thread)
[User Picture]From: maxvt
2006-12-30 01:57 pm (UTC)
You assume both files are MP3s, but what if you have an MP3 and an OGG or a FLAC of the same song? The checksum is quite useless in this case, as even if you md5 the uncompressed waveform the compression artifacts differ between the formats.

However, the only case of one version of an MP3 song being longer than the other would be if they've come from different albums, as the time-length of the audio file shouldn't differ no matter how the song is encoded. Of course, this is assuming MP3 encoders are sane and don't strip audio data left and right :)
(Reply) (Parent) (Thread)
[User Picture]From: brad
2006-12-30 07:05 pm (UTC)
Not the case for me. Dups are because I copied the same file around.
(Reply) (Parent) (Thread)
From: chrislightfoot
2006-12-31 07:06 pm (UTC)
Manber hashing and a Shazam-style summarisation of each track? (The latter I think uses something really simple, along the lines of a spectrogram reduced to eight frequency buckets with one byte recorded per window period in the input, but I can't find the reference here.)
(Reply) (Parent) (Thread)
(Deleted comment)
[User Picture]From: gaal
2006-12-30 08:05 am (UTC)
I think you can add an option to only look at the first n bytes of each stream, to speed up your total scan time w/o risking broken files too much. If you weren't looking at streams, I'd have said look at the last n bytes too, but then it's not dead simple any more.
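
The "first n bytes" option could be sketched as below: hash only a bounded prefix of the decoded stream and stop reading after that. `md5_of_prefix` is a made-up helper name, and the stream is assumed to be any binary file-like object, such as a decoder's stdout pipe.

```python
import hashlib

def md5_of_prefix(stream, limit, chunk_size=1 << 16):
    """MD5 of at most the first `limit` bytes read from a binary stream."""
    digest = hashlib.md5()
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:          # stream ended before the limit
            break
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.hexdigest()
```

The trade-off is that two files which only differ after the first `limit` bytes (a truncated copy, say) will hash the same, so a prefix match is a candidate duplicate, not proof.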
(Reply) (Thread)
[User Picture]From: brad
2006-12-30 08:27 am (UTC)
Ah, good call.

But really I should just take mpg123 and ogg123 and combine them into a tool that can parse the files quickly, and emit to stdout the audio frames, and throw out the rest. Then I can just do:

$ cataudio song.ogg | md5sum

That'd be both fast and reliable.
(Reply) (Parent) (Thread)
[User Picture]From: brad
2006-12-30 08:39 am (UTC)
Btw, mpg123 has a --skip N option (skip N frames), which I was initially scared might count id3 frames as frames, rendering it entirely useless....

But I compared a file with id3v1 tags against one with id3v2 tags, and mpg123 doesn't count the leading id3 frames as frames:

$ file a.mp3
MPEG ADTS, layer III, v1, 128 kBits, 44.1 kHz, JntStereo
$ file b.mp3
MP3 file with ID3 version 2.3.0 tag

And indeed, they start different:

$ dd if=a.mp3 count=2 bs=1024 | md5sum
1a42f5c7e71b6f66c6b5f8f8eeecd390 -
$ dd if=b.mp3 count=2 bs=1024 | md5sum
620914747096d641bfb4b86b5e567a31 -

But it works, thanks!

$ mpg123 -n 500 -q -s a.mp3 | md5sum
ccccb9398fee97f7fe5b44246528f83d -
$ mpg123 -n 500 -q -s b.mp3 | md5sum
ccccb9398fee97f7fe5b44246528f83d -
(Reply) (Parent) (Thread)
[User Picture]From: gaal
2006-12-30 09:34 am (UTC)
This is worth checking, but it could be that the lead-in / first second or two of the audio isn't very reliably unique, so the start hash can skip a few frames as well rather than just make sure it's "big enough". Then again I don't know how much of a difference this makes.
(Reply) (Parent) (Thread)
[User Picture]From: joshuak
2006-12-30 08:39 am (UTC)
You should consider using MusicBrainz audio fingerprinting (PUID, TRM) on all your files. Encoding differences might cause different md5sums even in cases where the songs are the same.
(Reply) (Thread)
[User Picture]From: brad
2006-12-30 07:05 pm (UTC)
Not the case for me. Dups are because I copied the same file around.
(Reply) (Parent) (Thread)
[User Picture]From: photomatt
2006-12-31 01:10 am (UTC)
Second the recommendation for MusicBrainz, it's very intelligent at matching things and can also clean up and normalize your metadata and file naming. (Which is my biggest annoyance.)
(Reply) (Parent) (Thread)
From: ex_swined
2007-01-01 05:57 am (UTC)
how about different bitrates? it seems to me, that two files with the same song, but different bitrates would not be detected as a duplicate.
(Reply) (Thread)
[User Picture]From: sweetjannette
2007-01-11 01:45 pm (UTC)
Hi brad! Thanks for sharing. I like the guarantee that you give ;)
(Reply) (Thread)