I've been half-tempted before to just throw all my MP3s into one great big folder, no subdirectories, then load them all up in WinAmp (or XMMS if I'm using Linux), turn on the Shuffle option, hit Play, and never look back.
Sorting music is SUCH a pain when you get past the 200 mark. Just imagine the people who go through their entire hard drive and clean it out every month.
Uh-oh, time to defrag...
From: scsi 2006-12-29 10:46 pm (UTC)
apt-get install fdupes
From: scsi 2006-12-29 10:48 pm (UTC)
Ack, forgot about diff's b/c of tagging.. :(
From: brad 2006-12-29 10:50 pm (UTC)
Did you bother reading my post?
Did you miss even the pretty colorful sections I so artfully styled for you?
The checksums of the "duplicate" files are different! Nothing on the fdupes website, debian page, or wikipedia entry says it's smart about the contents of audio files.
From: scsi 2006-12-29 10:54 pm (UTC)
Artfully styling is sorta hard when you're browsing via lynx. Next time be more considerate of us graphically challenged.. :) heh
From: brad 2006-12-29 10:54 pm (UTC)
Oh, heh, you're working today! I have today off.
From: scsi 2006-12-29 11:05 pm (UTC)
Lucky ass. I wish I could even have a chance to sit down.. :(
I boiled my personal problems down to:
I envy the depth of your personal problems!
From: brad 2006-12-29 10:54 pm (UTC)
I'm ignoring my house problems, bills, medical things, etc.
Just things I can solve sitting on my ass at the computer. (which probably includes bills, but fuck it...)
I wouldn't be surprised if I tuned in next month to find you'd finished billsd, the automatic bill-reading and -paying daemon.
From: mart 2006-12-30 11:04 pm (UTC)
What's wrong with standing orders and direct debit?
Some people (not me) are paranoid about direct debit.
From: scsi 2006-12-30 12:57 am (UTC)
Medical? :\
From: brad 2006-12-30 01:52 am (UTC)
Just routine bullshit. No real problem. (like: SixApart just changed all their healthcare, 401k, payroll, etc, so have to deal with paperwork in that regard....)
Genius! That's such a kickass and simple solution to a semi-complex problem. I updated the script to handle multiple file inputs (/mnt/mp3/*.mp3) like how md5sum works.
From: scsi 2006-12-29 11:04 pm (UTC)
I demand this script be updated with some sort of progress twirly baton so I have something to look at while I wait.. :)
From: evan 2006-12-29 11:06 pm (UTC)
Your solution is cool, but it will probably be much quicker (no expensive mp3 decoding) to just strip the tags before hashing. You can do it without mutating the file -- for id3v1 it's a fixed number of bytes at the end of the file (trivial), while for id3v2 I don't know the spec but surely you could skip over the tags while streaming the file into an md5 summer.
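For what it's worth, the skip-the-tags idea could look roughly like the Python sketch below. It assumes only two cases: an ID3v1 block (the last 128 bytes, starting with "TAG") and a single ID3v2 tag at the very start of the file, whose 10-byte header carries a "syncsafe" size. The function names are made up for illustration, and anything else (APE or Lyrics3 tags, padding a tagger left behind outside the tag) is not handled, so treat it as a starting point rather than a robust stripper.

```python
import hashlib
import os

ID3V1_SIZE = 128  # ID3v1 is a fixed 128-byte block starting with b"TAG"


def audio_span(path):
    """Return (start, end) byte offsets of the data between the ID3 tags."""
    size = os.path.getsize(path)
    start, end = 0, size
    with open(path, "rb") as f:
        header = f.read(10)
        # ID3v2 header: b"ID3", 2 version bytes, 1 flags byte, 4 size bytes.
        if len(header) == 10 and header[:3] == b"ID3":
            # The size is "syncsafe": only 7 bits of each byte are used.
            tag_size = 0
            for byte in header[6:10]:
                tag_size = (tag_size << 7) | (byte & 0x7F)
            start = 10 + tag_size
            if header[5] & 0x10:  # footer flag (ID3v2.4) adds 10 more bytes
                start += 10
        # Look for an ID3v1 block in the last 128 bytes.
        if size - start >= ID3V1_SIZE:
            f.seek(size - ID3V1_SIZE)
            if f.read(3) == b"TAG":
                end = size - ID3V1_SIZE
    return min(start, end), end


def untagged_md5(path, chunk_size=1 << 16):
    """MD5 of the file contents with leading ID3v2 / trailing ID3v1 skipped."""
    start, end = audio_span(path)
    digest = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()
```

Two copies of the same audio that differ only in those tags should then give the same `untagged_md5`, and the file itself is never modified.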
From: brad 2006-12-29 11:17 pm (UTC)
I looked into that first. Yes, id3v1 was easy, but id3v2 got ugly quickly. With extensive caching of stat() info -> raw_md5 and raw_md5 -> music_md5, I won't have to do the mp3 decoding often (or even the raw md5 often), once I do it the first time.
I actually was tempted to do the id3v2 parsing for "fun" but then I kicked myself and moved on, remembering the real goal.
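A minimal sketch of that two-level caching, in Python, might look like this. The details are assumptions, not the actual script: the stat() key here is (absolute path, size, mtime), the caches are plain in-memory dicts rather than anything persistent, and `decode_md5` is a hypothetical callable standing in for the expensive decode-and-hash step.

```python
import hashlib
import os


def raw_md5(path, chunk_size=1 << 16):
    """MD5 of the file's bytes exactly as stored on disk (tags included)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class MusicHashCache:
    """Two-level cache: stat() info -> raw_md5, and raw_md5 -> music_md5."""

    def __init__(self, decode_md5):
        # decode_md5(path) is the slow step: a hypothetical callable that
        # decodes the audio and returns its MD5 as a hex string.
        self.decode_md5 = decode_md5
        self.stat_to_raw = {}
        self.raw_to_music = {}

    def music_md5(self, path):
        st = os.stat(path)
        # Assumed cache key: same path, size and mtime means "unchanged".
        stat_key = (os.path.abspath(path), st.st_size, st.st_mtime_ns)
        raw = self.stat_to_raw.get(stat_key)
        if raw is None:
            raw = self.stat_to_raw[stat_key] = raw_md5(path)
        music = self.raw_to_music.get(raw)
        if music is None:
            # Only decode when these exact bytes have never been seen.
            music = self.raw_to_music[raw] = self.decode_md5(path)
        return music
```

With this shape, a byte-identical copy at a different path costs one raw MD5 but no decode, and an unchanged file costs only a stat().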
From: evan 2006-12-29 11:20 pm (UTC)
In that case, you can copy the file to a tmpfs dir, run one of the existing tag-strippers on it, then md5 that.
(Of course, I'm not recommending you do anything now that you have a workable solution. I just know how slow mp3 decoding is on my home machine and wouldn't be able to wait that long.)
From: brad 2006-12-29 11:31 pm (UTC)
tmpfs + stripping tags is clever. If I get impatient later when I run this all, I might resort to that... thanks.
Stripping non-MPEG data is probably a little more robust - some tagging operations will insert blocks of zeroes. It's not particularly difficult to do, either - the hardest part is calculating frame size.
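To give an idea of the frame-size part, here is a small Python sketch for the most common case only, MPEG-1 Layer III, where the length is 144 * bitrate / sample rate, plus one byte if the padding bit is set. The function name is made up, and MPEG-2/2.5, the other layers and free-format streams are deliberately left out, so a real stripper would need more than this.

```python
# MPEG-1 Layer III lookup tables. Bitrate index 0 means "free format" and
# index 15 is invalid; sample-rate index 3 is reserved.
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLE_RATES_HZ = [44100, 48000, 32000]


def mpeg1_layer3_frame_size(header):
    """Return the frame length in bytes for a 4-byte header, or None."""
    if len(header) < 4:
        return None
    b0, b1, b2 = header[0], header[1], header[2]
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
        return None  # missing the 11-bit frame sync
    if (b1 >> 3) & 0x03 != 0x03 or (b1 >> 1) & 0x03 != 0x01:
        return None  # not MPEG-1, or not Layer III
    bitrate_index = b2 >> 4
    rate_index = (b2 >> 2) & 0x03
    if bitrate_index in (0, 15) or rate_index == 3:
        return None  # free-format or reserved values are not handled here
    padding = (b2 >> 1) & 0x01
    bitrate = BITRATES_KBPS[bitrate_index] * 1000
    return 144 * bitrate // SAMPLE_RATES_HZ[rate_index] + padding
```

Walking a file then amounts to reading a header, jumping ahead by the returned size, and treating anything that doesn't parse as non-MPEG data to skip.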
Seems like with multicore processors this would scale pretty well. You could power through an entire directory of MP3s overnight. At 5 seconds per MP3 and 10000 MP3s, you're looking at ~14 hours. Throw a couple cores at that and you can cut that down pretty quick.
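The arithmetic is easy to check: 5 s × 10000 = 50000 s, which is about 13.9 hours on one core and about 3.5 hours if it splits evenly over four. Below is a rough Python sketch of both the estimate and a parallel map. The helper names are invented, and the thread pool only helps under an assumption: that each call spends its time in an external decoder process or in hashing code that releases the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def estimated_hours(n_files, seconds_per_file, workers=1):
    """Wall-clock estimate assuming the work divides evenly across workers."""
    return n_files * seconds_per_file / workers / 3600


def map_parallel(func, paths, workers=2):
    """Apply func to each path on a pool of worker threads, keeping order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, paths))
```

In practice disk throughput may become the limit before the cores do, so the even-split estimate is best read as a lower bound.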
This is pretty much part of what my previously mentioned MP3 retagger/sorter script does. If it finds that it's putting two files into the same place (on account of their tagging, or the command line options you've fed it) it'll copy both to /tmp, use id3convert to rip out the tags, then compare the files. There's also some other garbage in there like "if the files *aren't* identical, check if one's got a higher bitrate and if so ditch the lesser one" and that sort of thing.
I was about to say something about MPlayer, and then I remembered that it still can't dump to stdout. If your music collection was full of weird shit like flacs and mp4s, it would probably still be worth it to make the FIFO.
i highly recommend those wd5000ys drives. i've got five of them in raid 5 at the moment and hoping to add more in the near future. i got 4 of them from newegg, one from zipzoomfly.com , zipzoom has them for 170 right now. a point tho, they work great IF they work. in my initial order of 4, two were DOA and the newegg review page for that drive is full of the same story. it seems as long as they work when you first hook them up you'll be good, just don't lose any receipts ;)
Valis
From: erik 2006-12-30 12:42 am (UTC)
The only problem with this that I see (and I don't know much about how checksums work, so feel free to correct me), is that two MP3 files of the same song that you got from two different sources, and were ripped in two different ways, could be slightly different. One could be a second longer than the other, for instance. In this case wouldn't a checksum consider them two different songs?
Smarter would be to analyze the waveform and look for macro-scale similarities.
From: brad 2006-12-30 12:48 am (UTC)
In my case, dups are because I copied the same files around.
But yes, there's audio fingerprinting stuff too. I don't think I'll need to go that far, though.
I've been planning on doing something like this with audio fingerprinting for a while, and trying to do something intelligent when it detects a match (e.g. keep the one with the higher bit-rate). Of course, I've been saying this for, oh, five years now. Maybe someday. ;)
You assume both files are MP3s, but what if you have an MP3 and an OGG or a FLAC of the same song? The checksum is quite useless in this case, as even if you md5 the uncompressed waveform the compression artifacts differ between the formats.
However, the only case of one version of an MP3 song being longer than the other would be if they've come from different albums, as the time-length of the audio file shouldn't differ no matter how the song is encoded. Of course, this is assuming MP3 encoders are sane and don't strip audio data left and right :)
From: brad 2006-12-30 07:05 pm (UTC)
Not the case for me. Dups are because I copied the same file around.
Manber hashing and a Shazam-style summarisation of each track? (The latter I think uses something really simple, along the lines of a spectrogram reduced to eight frequency buckets with one byte recorded per window period in the input, but I can't find the reference here.)
Ooh, good job. I've been meaning to do a similar project to this too. And you've done the work for me!
From: gaal 2006-12-30 08:05 am (UTC)
I think you can add an option to only look at the first n bytes of each stream, to speed up your total scan time w/o risking broken files too much. If you weren't looking at streams, I'd have said look at the last n bytes too, but then it's not dead simple any more.
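Something like the following Python sketch is what I have in mind for the "first n bytes" option. It works on any binary stream with a `read()` method, such as a decoder's stdout pipe; the function name and chunk size are just placeholders.

```python
import hashlib


def prefix_md5(stream, n_bytes, chunk_size=1 << 16):
    """MD5 of at most the first n_bytes read from a binary stream."""
    digest = hashlib.md5()
    remaining = n_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # stream ended before n_bytes were available
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.hexdigest()
```

The trade-off is that two streams sharing a prefix but differing later hash the same, so a match is a strong hint rather than proof; a full hash could confirm the candidates.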
From: brad 2006-12-30 08:27 am (UTC)
Ah, good call.
But really I should just take mpg123 and ogg123 and combine them into a tool that can parse the files quickly, and emit to stdout the audio frames, and throw out the rest. Then I can just do:
$ cataudio song.ogg | md5sum
That'd be both fast and reliable.
From: brad 2006-12-30 08:39 am (UTC)
Btw, mpg123 has a --skip N option (skip N frames), which I was initially scared might count id3 frames as frames, rendering it entirely useless....
But I compared a file with id3v1 tags against one with id3v2 tags, and mpg123 doesn't count the leading tag frames as skipped frames:
$ file a.mp3
MPEG ADTS, layer III, v1, 128 kBits, 44.1 kHz, JntStereo
$ file b.mp3
MP3 file with ID3 version 2.3.0 tag
And indeed, they start different:
$ dd if=a.mp3 count=2 bs=1024 | md5sum
1a42f5c7e71b6f66c6b5f8f8eeecd390 -
$ dd if=b.mp3 count=2 bs=1024 | md5sum
620914747096d641bfb4b86b5e567a31 -
But it works, thanks!
$ mpg123 -n 500 -q -s a.mp3 | md5sum
ccccb9398fee97f7fe5b44246528f83d -
$ mpg123 -n 500 -q -s b.mp3 | md5sum
ccccb9398fee97f7fe5b44246528f83d -
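The same check could be scripted; here's a rough Python sketch. The mpg123 flags are copied straight from the commands above, and everything else (the helper names, the 500-frame default, raising on a non-zero exit status) is an assumption for illustration.

```python
import hashlib
import subprocess


def mpg123_command(path, frames=500):
    """Build the mpg123 invocation used above: -n frames, -q, -s (to stdout)."""
    return ["mpg123", "-n", str(frames), "-q", "-s", path]


def md5_of_output(command, chunk_size=1 << 16):
    """Run a command and return the MD5 of everything it writes to stdout."""
    digest = hashlib.md5()
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
            digest.update(chunk)
    # Leaving the with-block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"{command[0]} exited with status {proc.returncode}")
    return digest.hexdigest()
```

Two files would then count as the same audio when `md5_of_output(mpg123_command("a.mp3"))` equals `md5_of_output(mpg123_command("b.mp3"))`, which needs mpg123 installed and on the PATH.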
From: gaal 2006-12-30 09:34 am (UTC)
This is worth checking, but it could be that the lead-in / first second or two of the audio isn't very reliably unique, so the start hash can skip a few frames as well rather than just make sure it's "big enough". Then again I don't know how much of a difference this makes.
Have you considered using MusicBrainz audio fingerprinting (PUID, TRM) on all your files? Encoding differences might cause different md5sums even in cases where the songs are the same.
From: brad 2006-12-30 07:05 pm (UTC)
Not the case for me. Dups are because I copied the same file around.
Second the recommendation for MusicBrainz, it's very intelligent at matching things and can also clean up and normalize your metadata and file naming. (Which is my biggest annoyance.)
how about different bitrates? it seems to me that two files with the same song, but different bitrates, would not be detected as duplicates.
Hi brad! Thanks for sharing. I like the guarantee that you give ;)