<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.51
     from manual.texi on 23 August 1998 -->
<TITLE>bzip2 and libbzip2 - Miscellanea</TITLE>
</HEAD>
<BODY>
Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC33" HREF="manual_toc.html#TOC33">Miscellanea</A></H1>
<P>
These are just some random thoughts of mine.  Your mileage may
vary.
</P>
<H2><A NAME="SEC34" HREF="manual_toc.html#TOC34">Limitations of the compressed file format</A></H2>
<P>
<CODE>bzip2-0.9.0</CODE> uses exactly the same file format as the previous
version, <CODE>bzip2-0.1</CODE>.  This decision was made in the interests of
stability.  Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
</P>
<P>
Nevertheless, this is not a painless decision.  Development
work since the release of <CODE>bzip2-0.1</CODE> in August 1997
has shown complexities in the file format which slow down
decompression and, in retrospect, are unnecessary.  These are:
<UL>
<LI>The run-length encoder, which is the first of the
compression transformations, is entirely irrelevant.
The original purpose was to protect the sorting algorithm
from the very worst case input: a string of repeated
symbols.  But algorithm steps Q6a and Q6b in the original
Burrows-Wheeler technical report (SRC-124) show how
repeats can be handled without difficulty in block
sorting.
<LI>The randomisation mechanism doesn't really need to be
there.  Udi Manber and Gene Myers published a suffix
array construction algorithm a few years back, which
can be employed to sort any block, no matter how
repetitive, in O(N log N) time.  Subsequent work by
Kunihiko Sadakane has produced a derivative O(N (log N)^2)
algorithm which usually outperforms the Manber-Myers
algorithm.

I could have changed to Sadakane's algorithm, but I find
it to be slower than <CODE>bzip2</CODE>'s existing algorithm for
most inputs, and the randomisation mechanism protects
adequately against bad cases.  I didn't think it was
a good tradeoff to make.  Partly this is due to the fact
that I was not flooded with email complaints about
<CODE>bzip2-0.1</CODE>'s performance on repetitive data, so
perhaps it isn't a problem for real inputs.

Probably the best long-term solution
is to use the existing sorting
algorithm initially, and fall back to an O(N (log N)^2)
algorithm if the standard algorithm gets into difficulties.
This can be done without much difficulty; I made
a prototype implementation of it some months ago.
<LI>The compressed file format was never designed to be
handled by a library, and I have had to jump through
some hoops to produce an efficient implementation of
decompression.  It's a bit hairy.  Try passing
<CODE>decompress.c</CODE> through the C preprocessor
and you'll see what I mean.  Much of this complexity
could have been avoided if the compressed size of
each block of data was recorded in the data stream.
<LI>An Adler-32 checksum, rather than a CRC32 checksum,
would be faster to compute; a sketch follows this list.
</UL>
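<P>
For reference, Adler-32 (as defined in RFC 1950, and as used by
<CODE>zlib</CODE>) needs only two running sums and an occasional modulus,
with no table lookups.  This is a minimal sketch of the checksum, not
code taken from <CODE>bzip2</CODE>:
</P>

<PRE>
/* Adler-32: two 16-bit sums, modulo 65521 (the largest prime
   below 2^16).  A couple of adds per byte make it cheaper to
   compute than a table-driven CRC32. */
unsigned long adler32 (const unsigned char *buf, unsigned long len)
{
   unsigned long a = 1, b = 0, i;
   for (i = 0; i &lt; len; i++) {
      a = (a + buf[i]) % 65521;
      b = (b + a) % 65521;
   }
   return (b &lt;&lt; 16) | a;
}
</PRE>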
<P>
It would be fair to say that the <CODE>bzip2</CODE> format was frozen
before I properly and fully understood the performance
consequences of doing so.
</P>
<P>
Improvements which I have been able to incorporate into
0.9.0, despite using the same file format, are:
<UL>
<LI>Single array implementation of the inverse BWT.  This
significantly speeds up decompression, presumably
because it reduces the number of cache misses;
see the sketch after this list.
<LI>Faster inverse MTF transform for large MTF values.  The
new implementation is based on the notion of sliding blocks
of values.
<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE>
and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>.
Duh!  I'm embarrassed at my own moronicness (moronicity?) on this
one.
</UL>
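<P>
To illustrate the single-array idea: the inverse BWT walk normally
consults two arrays per step, one holding the transformed bytes and one
holding the successor links.  Packing both into one word means each step
of the walk reads a single array, which presumably is where the
cache-miss saving comes from.  This is a sketch of the technique, not
the code <CODE>bzip2</CODE> actually ships; the names, and the implied
limit of 2^24 bytes per block (harmless, since blocks are at most
900000 bytes), are mine:
</P>

<PRE>
#include &lt;stdlib.h&gt;

/* Inverse BWT with one combined array: each entry keeps the byte in
   its low 8 bits and the successor index in the bits above, so the
   decode loop touches a single array instead of two. */
void inverse_bwt (const unsigned char *bwt, unsigned int n,
                  unsigned int orig_ptr, unsigned char *out)
{
   unsigned int cftab[256] = { 0 };
   unsigned int *tt = malloc (n * sizeof *tt);
   unsigned int i, c, p, sum = 0;

   for (i = 0; i &lt; n; i++) cftab[bwt[i]]++;
   for (c = 0; c &lt; 256; c++) {          /* make counts cumulative */
      unsigned int t = cftab[c]; cftab[c] = sum; sum += t;
   }
   for (i = 0; i &lt; n; i++) tt[i] = bwt[i];      /* byte in low 8 bits */
   for (i = 0; i &lt; n; i++)                      /* link in upper bits */
      tt[cftab[bwt[i]]++] |= i &lt;&lt; 8;

   p = tt[orig_ptr] >> 8;
   for (i = 0; i &lt; n; i++) {
      unsigned int v = tt[p];
      out[i] = (unsigned char)(v &amp; 0xff);       /* emit the byte  */
      p = v >> 8;                               /* follow the link */
   }
   free (tt);
}
</PRE>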
<P>
Further ahead, it would be nice
to be able to do random access into files.  This will
require some careful design of compressed file formats.
</P>


<H2><A NAME="SEC35" HREF="manual_toc.html#TOC35">Portability issues</A></H2>
<P>
After some consideration, I have decided not to use
GNU <CODE>autoconf</CODE> to configure 0.9.0.
</P>
<P>
<CODE>autoconf</CODE>, admirable and wonderful though it is,
mainly assists with portability problems between Unix-like
platforms.  But <CODE>bzip2</CODE> doesn't have much in the way
of portability problems on Unix; most of the difficulties appear
when porting to the Mac, or to Microsoft's operating systems.
<CODE>autoconf</CODE> doesn't help in those cases, and brings in a
whole load of new complexity.
</P>
<P>
Most people should be able to compile the library and program
under Unix straight out-of-the-box, so to speak, especially
if you have a version of GNU C available.
</P>
<P>
There are a couple of <CODE>__inline__</CODE> directives in the code.  GNU C
(<CODE>gcc</CODE>) should be able to handle them.  If your compiler doesn't
like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be null.  One
easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>,
which should be understood by most Unix compilers.
</P>
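<P>
Concretely, something along these lines (a sketch; where you put the
definition is up to you) keeps the directives for GNU C and makes them
vanish everywhere else:
</P>

<PRE>
/* Keep __inline__ for GNU C; define it away for other compilers. */
#ifndef __GNUC__
#define __inline__   /* nothing */
#endif
</PRE>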
<P>
If you still have difficulties, try compiling with the macro
<CODE>BZ_STRICT_ANSI</CODE> defined.  This should enable you to build the
library in a strictly ANSI compliant environment.  Building the program
itself like this is dangerous and not supported, since you remove
<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links,
devices, and other not-really-a-file entities.  This could cause
filesystem corruption!
</P>
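<P>
For example (assuming your <CODE>make</CODE> accepts flag overrides on the
command line; adjust the flags to suit your compiler):
</P>

<PRE>
make CFLAGS="-O2 -DBZ_STRICT_ANSI -D__inline__="
</PRE>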
<P>
One other thing: if you create a <CODE>bzip2</CODE> binary for public
distribution, please try and link it statically (<CODE>gcc -static</CODE>).
This avoids all sorts of library-version issues that others may encounter
later on.
</P>


<H2><A NAME="SEC36" HREF="manual_toc.html#TOC36">Reporting bugs</A></H2>
<P>
I tried pretty hard to make sure <CODE>bzip2</CODE> is
bug free, both by design and by testing.  Hopefully
you'll never need to read this section for real.
</P>
<P>
Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation
fault, a bus error or an internal assertion failure, it
will ask you to email me a bug report.  Experience with
version 0.1 shows that almost all these problems can
be traced to either compiler bugs or hardware problems.
<UL>
<LI>
Recompile the program with no optimisation, and see if it
works.  And/or try a different compiler.
I heard all sorts of stories about various flavours
of GNU C (and other compilers) generating bad code for
<CODE>bzip2</CODE>, and I've run across two such examples myself.

2.7.X versions of GNU C are known to generate bad code from
time to time, at high optimisation levels.
If you get problems, try using the flags
<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>.
You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>.

You may notice that the Makefile runs four tests as part of
the build process.  If the program passes all of these, it's
a pretty good (but not 100%) indication that the compiler has
done its job correctly.
<LI>
If <CODE>bzip2</CODE> crashes randomly, and the crashes are not
repeatable, you may have a flaky memory subsystem.  <CODE>bzip2</CODE>
really hammers your memory hierarchy, and if it's a bit marginal,
you may get these problems.  Ditto if your disk or I/O subsystem
is slowly failing.  Yup, this really does happen.

Try using a different machine of the same type, and see if
you can repeat the problem.
<LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells
you your file is corrupted on decompression, and you
obtained the file via FTP, there is a possibility that you
forgot to tell FTP to do a binary mode transfer.  That absolutely
will cause the file to be non-decompressible.  You'll have to transfer
it again.
</UL>
<P>
If you've incorporated <CODE>libbzip2</CODE> into your own program
and are getting problems, please, please, please, check that the
parameters you are passing in calls to the library are
correct, and in accordance with what the documentation says
is allowable.  I have tried to make the library robust against
such problems, but I'm sure I haven't succeeded.
</P>
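<P>
As an illustration of the sort of checking I mean, using the
<CODE>bzBuffToBuffCompress</CODE> utility function described in the
previous section (check <CODE>bzlib.h</CODE> for the exact prototype;
the helper name and comments here are my own):
</P>

<PRE>
#include &lt;stdio.h&gt;
#include "bzlib.h"

/* Compress srcLen bytes from src into dest.  The point is the
   parameter hygiene: every argument stays inside its documented
   range, and the return code is always inspected. */
int compress_buffer (char *dest, unsigned int *destLen,
                     char *src, unsigned int srcLen)
{
   int blockSize100k = 9;    /* documented range: 1 to 9 */
   int verbosity     = 0;    /* documented range: 0 to 4 */
   int workFactor    = 30;   /* the documented default   */
   int r;

   /* *destLen must hold the true size of dest on entry; srcLen
      plus 1% plus 600 bytes is the documented safe allocation. */
   r = bzBuffToBuffCompress (dest, destLen, src, srcLen,
                             blockSize100k, verbosity, workFactor);
   if (r != BZ_OK)
      fprintf (stderr, "bzBuffToBuffCompress: error %d\n", r);
   return r;
}
</PRE>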
<P>
Finally, if the above comments don't help, you'll have to send
me a bug report.  Now, it's just amazing how many people will
send me a bug report saying something like

<PRE>
bzip2 crashed with segmentation fault on my machine
</PRE>

<P>
and absolutely nothing else.  Needless to say, such a report
is <EM>totally, utterly, completely and comprehensively 100% useless;
a waste of your time, my time, and net bandwidth</EM>.
With no details at all, there's no way I can possibly begin
to figure out what the problem is.
</P>
<P>
The rules of the game are: facts, facts, facts.  Don't omit
them because "oh, they won't be relevant".  At the bare
minimum:

<PRE>
Machine type.  Operating system version.
Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>).
Exact version of the compiler used.
Flags passed to the compiler.
</PRE>
<P>
However, the most important single thing that will help me is
the file that you were trying to compress or decompress at the
time the problem happened.  Without that, my ability to do anything
more than speculate about the cause is limited.
</P>
<P>
Please remember that I connect to the Internet with a modem, so
you should contact me before mailing me huge files.
</P>


<H2><A NAME="SEC37" HREF="manual_toc.html#TOC37">Did you get the right package?</A></H2>

<P>
<CODE>bzip2</CODE> is a resource hog.  It soaks up large amounts of CPU cycles
and memory.  Also, it gives very large latencies.  In the worst case, you
can feed many megabytes of uncompressed data into the library before
getting any compressed output, so this probably rules out applications
requiring interactive behaviour.
</P>
<P>
These aren't faults of my implementation, I hope, but more
an intrinsic property of the Burrows-Wheeler transform (unfortunately).
Maybe this isn't what you want.
</P>
<P>
If you want a compressor and/or library which is faster, uses less
memory but gets pretty good compression, and has minimal latency,
consider Jean-loup
Gailly's and Mark Adler's work, <CODE>zlib-1.1.2</CODE> and
<CODE>gzip-1.2.4</CODE>.  Look for them at
<CODE>http://www.cdrom.com/pub/infozip/zlib</CODE> and
<CODE>http://www.gzip.org</CODE> respectively.
</P>
<P>
For something faster and lighter still, you might try Markus F X J
Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at
<BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>.
</P>
<P>
If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks
of data, 64k bytes or smaller, for example in an on-the-fly disk
compressor, you'd be well advised not to use this library.  Instead,
I've made a special library tuned for that kind of use.  It's part of
<CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux
<CODE>ext2</CODE> filesystem.  Look at
<CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>.
</P>


<H2><A NAME="SEC38" HREF="manual_toc.html#TOC38">Testing</A></H2>

<P>
A record of the tests I've done.
</P>
<P>
First, some data sets:

<UL>
<LI>B: a directory containing 6001 files, one for every length in the
range 0 to 6000 bytes.  The files contain random lowercase
letters.  18.7 megabytes.
<LI>H: my home directory tree.  Documents, source code, mail files,
compressed data.  H contains B, and also a directory of
files designed as boundary cases for the sorting; mostly very
repetitive, nasty files.  445 megabytes.
<LI>A: directory tree holding various applications built from source:
<CODE>egcs-1.0.2</CODE>, <CODE>gcc-2.8.1</CODE>, KDE Beta 4, GTK, Octave, etc.
827 megabytes.
<LI>P: directory tree holding large amounts of source code (<CODE>.tar</CODE>
files) of the entire GNU distribution, plus a couple of
Linux distributions.  2400 megabytes.
</UL>
<P>
The tests conducted are as follows.  Each test means compressing
(a copy of) each file in the data set, decompressing it and
comparing it against the original.
</P>
<P>
First, a bunch of tests with block sizes, internal buffer
sizes and randomisation lengths set very small,
to detect any problems with the
blocking, buffering and randomisation mechanisms.
This required modifying the source code so as to try to
break it.

<OL>
<LI>Data set H, with
buffer size of 1 byte, and block size of 23 bytes.
<LI>Data set B, buffer sizes 1 byte, block size 1 byte.
<LI>As (2) but small-mode decompression (first 1700 files).
<LI>As (2) with block size 2 bytes.
<LI>As (2) with block size 3 bytes.
<LI>As (2) with block size 4 bytes.
<LI>As (2) with block size 5 bytes.
<LI>As (2) with block size 6 bytes and small-mode decompression.
<LI>H with normal buffer sizes (5000 bytes), normal block
size (up to 900000 bytes), but with randomisation
mechanism running intensely (randomising approximately every
third byte).
<LI>As (9) with small-mode decompression.
</OL>
<P>
Then some tests with unmodified source code.

<OL>
<LI>H, all settings normal.
<LI>As (1), with small-mode decompress.
<LI>H, compress with flag <CODE>-1</CODE>.
<LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>.
<LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing,
<CODE>bzip2-0.9.0</CODE> decompressing, all settings normal.
<LI>Backwards compatibility: H, <CODE>bzip2-0.9.0</CODE> compressing,
<CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
<LI>Bigger tests: A, all settings normal.
<LI>P, all settings normal.
<LI>Misc test: about 100 megabytes of <CODE>.tar</CODE> files with
<CODE>bzip2</CODE> compiled with Purify.
<LI>Misc tests to make sure it builds and runs ok on non-Linux/x86
platforms.
</OL>

<P>
These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running
Linux 2.0.32.  They represent nearly a week of continuous computation.
All tests completed successfully.
</P>


<H2><A NAME="SEC39" HREF="manual_toc.html#TOC39">Further reading</A></H2>
<P>
<CODE>bzip2</CODE> is not research work, in the sense that it doesn't present
any new ideas.  Rather, it's an engineering exercise based on existing
ideas.
</P>
<P>
Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>:

<PRE>
Michael Burrows and D. J. Wheeler:
  "A block-sorting lossless data compression algorithm"
   10th May 1994.
   Digital SRC Research Report 124.
   ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
   If you have trouble finding it, try searching at the
   New Zealand Digital Library, http://www.nzdl.org.

Daniel S. Hirschberg and Debra A. LeLewer
  "Efficient Decoding of Prefix Codes"
   Communications of the ACM, April 1990, Vol 33, Number 4.
   You might be able to get an electronic copy of this
   from the ACM Digital Library.

David J. Wheeler
   Program bred3.c and accompanying document bred3.ps.
   This contains the idea behind the multi-table Huffman
   coding scheme.
   ftp://ftp.cl.cam.ac.uk/pub/user/djw3/

Jon L. Bentley and Robert Sedgewick
  "Fast Algorithms for Sorting and Searching Strings"
   Available from Sedgewick's web page,
   www.cs.princeton.edu/~rs
</PRE>
<P>
The following paper gives valuable additional insights into the
algorithm, but is not immediately the basis of any code
used in bzip2.

<PRE>
Peter Fenwick:
   Block Sorting Text Compression
   Proceedings of the 19th Australasian Computer Science Conference,
     Melbourne, Australia.  Jan 31 - Feb 2, 1996.
   ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
</PRE>
<P>
Kunihiko Sadakane's sorting algorithm, mentioned above,
is available from:

<PRE>
http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
</PRE>

<P>
The Manber-Myers suffix array construction
algorithm is described in a paper
available from:

<PRE>
http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
</PRE>
<P><HR><P>
Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
</BODY>
</HTML>