Log in

No account? Create an account
brad's life [entries|archive|friends|userinfo]
Brad Fitzpatrick

[ website | bradfitz.com ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

Parallel compression [Jan. 11th, 2005|01:24 pm]
Brad Fitzpatrick
Are there any multi-threaded compression algorithms, or at least wrappers/formats for interleaved compression, with variable interleave size?

It'd be nice to take advantage of multiple processors when gzipping 380 GB, while still doing sequential reads, even if the resultant file was non-standard.

From: evan
2005-01-11 09:45 pm (UTC)
If CPU is your bottleneck, then it seems obvious to me that you could take a block-based system like gzip or bzip2 and send every other block to a second thread.
(Reply) (Thread)
From: evan
2005-01-11 09:46 pm (UTC)
(And by "obvious" I mean that I wouldn't be surprised if either it hadn't already been implemented or it doesn't work for some reason.)
(Reply) (Parent) (Thread)
[User Picture]From: brad
2005-01-11 09:48 pm (UTC)
I think it'd only achieve good results in certain situations, but I think passing off every other cluster of 16 (or 32?) 16k innodb tablespace page to a different thread should be good, since the segments of pages are largely self-contained.
(Reply) (Parent) (Thread)
From: evan
2005-01-11 10:38 pm (UTC)
As I understood it, these systems are stream-based. There's no central dictionary created, so regardless of whether the data is broken into pieces the compressed blocks are completely independent.

(From here: bzip2 compresses blocks of data independent from each other. For each block, a checksum is created and stored. That's why all non-corrupted compressed data blocks can be easily detected and recovered. The tool bzip2recover which is available from the bzip2 homepage can do this.)
(Reply) (Parent) (Thread)
[User Picture]From: brad
2005-01-11 09:46 pm (UTC)
That's what I was thinking. Just asking around if that already exists before I write one.
(Reply) (Parent) (Thread)
[User Picture]From: j_b
2005-01-11 10:55 pm (UTC)

lzma? LGPL, multithreaded?

(Reply) (Thread)
[User Picture]From: brad
2005-01-11 10:58 pm (UTC)

Re: lzma? LGPL, multithreaded?

Thanks for the links!
(Reply) (Parent) (Thread)
[User Picture]From: bitwise
2005-01-11 11:07 pm (UTC)
You could do it with some sort of script (posting again, corrected):

(somehow know I am thread i of n, file size is z 1k blocks)
dd if=inputfile bs=1k skip=$STARTBLOCK count=$CHUNKSIZE \
|gzip >of=outputfile.gz.chunk_$i

This would work across multiple machines as well as SMP.
(Reply) (Thread)
[User Picture]From: bitwise
2005-01-12 12:02 am (UTC)
Interestingly, you don't need an equivalent decompression script. You can simply cat all the .gz.chunk files together and still have a legal gzip file.

I think.
(Reply) (Parent) (Thread)
[User Picture]From: gaal
2005-01-12 06:26 am (UTC)
Yes, and gzip >> logfile works too (not in this case, obviously, because you need to synchronize the parts)
(Reply) (Parent) (Thread)
[User Picture]From: bitwise
2005-01-12 06:33 pm (UTC)
Has been tested to work correctly on a 400MB ISO file.

# widezip compresses files in parallel.
# usage: widezip [my-thread#] [total-thread-count] [inputfile] [outputfile]

# example: to run on three separate machines:
# pluto> widezip 1 3 bigfile bigfile.gz
# earth> widezip 2 3 bigfile bigfile.gz
# venus> widezip 3 3 bigfile bigfile.gz
# ... results in three files, bigfile.gz.chunk1 ..chunk2 ..chunk3.
# with gzip you can cat all three files together to make one archive.


[ $# -ne 4 ] && {
  echo "Usage: $0 my-thread total-thread input output"
  exit 1

INPUTSIZE=`stat -c "%s" $3` || exit 1
[ $INPUTSIZE -le 0 ] && exit 1
let STARTBLOCK=($1-1)*CHUNKSIZE # note -1 so #1 gets offset 0, etc.
echo "File size $INPUTSIZE, chunksize $CHUNKSIZE blocks, startblock $STARTBLOCK."
[ $1 -ne $2 ] && DDOPTS=count=$CHUNKSIZE #only last thread runs to end.

dd if=$3 bs=$BLOCKSIZE skip=$STARTBLOCK $DDOPTS |gzip >$4.chunk$1
(Reply) (Parent) (Thread)
[User Picture]From: brad
2005-01-12 08:44 pm (UTC)
Heh, nice.
(Reply) (Parent) (Thread)
[User Picture]From: bitwise
2005-01-12 09:03 pm (UTC)
Hm. Look what I just stumbled on:


PBZIP2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2.</a>
(Reply) (Parent) (Thread)
[User Picture]From: brad
2005-01-12 09:09 pm (UTC)
(Reply) (Parent) (Thread)