2102 lines
82 KiB
Plaintext
2102 lines
82 KiB
Plaintext
\input texinfo @c -*- Texinfo -*-
|
|
@setfilename bzip2.info
|
|
|
|
@ignore
|
|
$NetBSD: manual.texi,v 1.2 1998/09/14 03:13:53 ross Exp $
|
|
|
|
This file documents bzip2 version 0.9.0, and associated library
|
|
libbzip2, written by Julian Seward (jseward@acm.org).
|
|
|
|
Copyright (C) 1996-1998 Julian R Seward
|
|
|
|
Permission is granted to make and distribute verbatim copies of
|
|
this manual provided the copyright notice and this permission notice
|
|
are preserved on all copies.
|
|
|
|
Permission is granted to copy and distribute translations of this manual
|
|
into another language, under the above conditions for verbatim copies.
|
|
@end ignore
|
|
|
|
@ifinfo
|
|
@format
|
|
START-INFO-DIR-ENTRY
|
|
* Bzip2: (bzip2). A program and library for data compression.
|
|
END-INFO-DIR-ENTRY
|
|
@end format
|
|
|
|
@end ifinfo
|
|
|
|
@iftex
|
|
@c @finalout
|
|
@settitle bzip2 and libbzip2
|
|
@titlepage
|
|
@title bzip2 and libbzip2
|
|
@subtitle a program and library for data compression
|
|
@subtitle copyright (C) 1996-1998 Julian Seward
|
|
@subtitle version 0.9.0 of 23 August 1998
|
|
@author Julian Seward
|
|
|
|
@end titlepage
|
|
@end iftex
|
|
|
|
|
|
@parindent 0mm
|
|
@parskip 2mm
|
|
|
|
|
|
This program, @code{bzip2},
|
|
and associated library @code{libbzip2}, are
|
|
Copyright (C) 1996-1998 Julian R Seward. All rights reserved.
|
|
|
|
Redistribution and use in source and binary forms, with or without
|
|
modification, are permitted provided that the following conditions
|
|
are met:
|
|
@itemize @bullet
|
|
@item
|
|
Redistributions of source code must retain the above copyright
|
|
notice, this list of conditions and the following disclaimer.
|
|
@item
|
|
The origin of this software must not be misrepresented; you must
|
|
not claim that you wrote the original software. If you use this
|
|
software in a product, an acknowledgment in the product
|
|
documentation would be appreciated but is not required.
|
|
@item
|
|
Altered source versions must be plainly marked as such, and must
|
|
not be misrepresented as being the original software.
|
|
@item
|
|
The name of the author may not be used to endorse or promote
|
|
products derived from this software without specific prior written
|
|
permission.
|
|
@end itemize
|
|
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
|
|
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
|
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
|
|
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
|
|
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
|
|
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
|
|
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
|
|
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
|
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
Julian Seward, Guildford, Surrey, UK.
|
|
|
|
@code{jseward@@acm.org}
|
|
|
|
@code{http://www.muraroa.demon.co.uk}
|
|
|
|
@code{bzip2}/@code{libbzip2} version 0.9.0 of 23 August 1998.
|
|
|
|
PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
|
|
algorithms. However, I do not have the resources available to carry out
|
|
a full patent search. Therefore I cannot give any guarantee of the
|
|
above statement.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@node Overview, Implementation, Top, Top
|
|
@chapter Introduction
|
|
|
|
@code{bzip2} compresses files using the Burrows-Wheeler
|
|
block-sorting text compression algorithm, and Huffman coding.
|
|
Compression is generally considerably better than that
|
|
achieved by more conventional LZ77/LZ78-based compressors,
|
|
and approaches the performance of the PPM family of statistical compressors.
|
|
|
|
@code{bzip2} is built on top of @code{libbzip2}, a flexible library
|
|
for handling compressed data in the @code{bzip2} format. This manual
|
|
describes both how to use the program and
|
|
how to work with the library interface. Most of the
|
|
manual is devoted to this library, not the program,
|
|
which is good news if your interest is only in the program.
|
|
|
|
Chapter 2 describes how to use @code{bzip2}; this is the only part
|
|
you need to read if you just want to know how to operate the program.
|
|
Chapter 3 describes the programming interfaces in detail, and
|
|
Chapter 4 records some miscellaneous notes which I thought
|
|
ought to be recorded somewhere.
|
|
|
|
|
|
@chapter How to use @code{bzip2}
|
|
|
|
This chapter contains a copy of the @code{bzip2} man page,
|
|
and nothing else.
|
|
@example
|
|
NAME
|
|
bzip2, bunzip2 - a block-sorting file compressor, v0.9.0
|
|
bzcat - decompresses files to stdout
|
|
bzip2recover - recovers data from damaged bzip2 files
|
|
|
|
|
|
SYNOPSIS
|
|
bzip2 [ -cdfkstvzVL123456789 ] [ filenames ... ]
|
|
bunzip2 [ -fkvsVL ] [ filenames ... ]
|
|
bzcat [ -s ] [ filenames ... ]
|
|
bzip2recover filename
|
|
|
|
|
|
DESCRIPTION
|
|
bzip2 compresses files using the Burrows-Wheeler block-
|
|
sorting text compression algorithm, and Huffman coding.
|
|
Compression is generally considerably better than that
|
|
achieved by more conventional LZ77/LZ78-based compressors,
|
|
and approaches the performance of the PPM family of sta-
|
|
tistical compressors.
|
|
|
|
The command-line options are deliberately very similar to
|
|
those of GNU Gzip, but they are not identical.
|
|
|
|
bzip2 expects a list of file names to accompany the com-
|
|
mand-line flags. Each file is replaced by a compressed
|
|
version of itself, with the name "original_name.bz2".
|
|
Each compressed file has the same modification date and
|
|
permissions as the corresponding original, so that these
|
|
properties can be correctly restored at decompression
|
|
time. File name handling is naive in the sense that there
|
|
is no mechanism for preserving original file names, per-
|
|
missions and dates in filesystems which lack these con-
|
|
cepts, or have serious file name length restrictions, such
|
|
as MS-DOS.
|
|
|
|
bzip2 and bunzip2 will by default not overwrite existing
|
|
files; if you want this to happen, specify the -f flag.
|
|
|
|
If no file names are specified, bzip2 compresses from
|
|
standard input to standard output. In this case, bzip2
|
|
will decline to write compressed output to a terminal, as
|
|
this would be entirely incomprehensible and therefore
|
|
pointless.
|
|
|
|
bunzip2 (or bzip2 -d ) decompresses and restores all spec-
|
|
ified files whose names end in ".bz2". Files without this
|
|
suffix are ignored. Again, supplying no filenames causes
|
|
decompression from standard input to standard output.
|
|
|
|
bunzip2 will correctly decompress a file which is the con-
|
|
catenation of two or more compressed files. The result is
|
|
the concatenation of the corresponding uncompressed files.
|
|
Integrity testing (-t) of concatenated compressed files is
|
|
also supported.
|
|
|
|
You can also compress or decompress files to the standard
|
|
output by giving the -c flag. Multiple files may be com-
|
|
pressed and decompressed like this. The resulting outputs
|
|
are fed sequentially to stdout. Compression of multiple
|
|
files in this manner generates a stream containing multi-
|
|
ple compressed file representations. Such a stream can be
|
|
decompressed correctly only by bzip2 version 0.9.0 or
|
|
later. Earlier versions of bzip2 will stop after decom-
|
|
pressing the first file in the stream.
|
|
|
|
bzcat (or bzip2 -dc ) decompresses all specified files to
|
|
the standard output.
|
|
|
|
Compression is always performed, even if the compressed
|
|
file is slightly larger than the original. Files of less
|
|
than about one hundred bytes tend to get larger, since the
|
|
compression mechanism has a constant overhead in the
|
|
region of 50 bytes. Random data (including the output of
|
|
most file compressors) is coded at about 8.05 bits per
|
|
byte, giving an expansion of around 0.5%.
|
|
|
|
As a self-check for your protection, bzip2 uses 32-bit
|
|
CRCs to make sure that the decompressed version of a file
|
|
is identical to the original. This guards against corrup-
|
|
tion of the compressed data, and against undetected bugs
|
|
in bzip2 (hopefully very unlikely). The chances of data
|
|
corruption going undetected is microscopic, about one
|
|
chance in four billion for each file processed. Be aware,
|
|
though, that the check occurs upon decompression, so it
|
|
can only tell you that that something is wrong. It can't
|
|
help you recover the original uncompressed data. You can
|
|
use bzip2recover to try to recover data from damaged
|
|
files.
|
|
|
|
Return values: 0 for a normal exit, 1 for environmental
|
|
problems (file not found, invalid flags, I/O errors, &c),
|
|
2 to indicate a corrupt compressed file, 3 for an internal
|
|
consistency error (eg, bug) which caused bzip2 to panic.
|
|
|
|
|
|
MEMORY MANAGEMENT
|
|
Bzip2 compresses large files in blocks. The block size
|
|
affects both the compression ratio achieved, and the
|
|
amount of memory needed both for compression and decom-
|
|
pression. The flags -1 through -9 specify the block size
|
|
to be 100,000 bytes through 900,000 bytes (the default)
|
|
respectively. At decompression-time, the block size used
|
|
for compression is read from the header of the compressed
|
|
file, and bunzip2 then allocates itself just enough memory
|
|
to decompress the file. Since block sizes are stored in
|
|
compressed files, it follows that the flags -1 to -9 are
|
|
irrelevant to and so ignored during decompression.
|
|
|
|
Compression and decompression requirements, in bytes, can
|
|
be estimated as:
|
|
|
|
Compression: 400k + ( 7 x block size )
|
|
|
|
Decompression: 100k + ( 4 x block size ), or
|
|
100k + ( 2.5 x block size )
|
|
|
|
Larger block sizes give rapidly diminishing marginal
|
|
returns; most of the compression comes from the first two
|
|
or three hundred k of block size, a fact worth bearing in
|
|
mind when using bzip2 on small machines. It is also
|
|
important to appreciate that the decompression memory
|
|
requirement is set at compression-time by the choice of
|
|
block size.
|
|
|
|
For files compressed with the default 900k block size,
|
|
bunzip2 will require about 3700 kbytes to decompress. To
|
|
support decompression of any file on a 4 megabyte machine,
|
|
bunzip2 has an option to decompress using approximately
|
|
half this amount of memory, about 2300 kbytes. Decompres-
|
|
sion speed is also halved, so you should use this option
|
|
only where necessary. The relevant flag is -s.
|
|
|
|
In general, try and use the largest block size memory con-
|
|
straints allow, since that maximises the compression
|
|
achieved. Compression and decompression speed are virtu-
|
|
ally unaffected by block size.
|
|
|
|
Another significant point applies to files which fit in a
|
|
single block -- that means most files you'd encounter
|
|
using a large block size. The amount of real memory
|
|
touched is proportional to the size of the file, since the
|
|
file is smaller than a block. For example, compressing a
|
|
file 20,000 bytes long with the flag -9 will cause the
|
|
compressor to allocate around 6700k of memory, but only
|
|
touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the
|
|
decompressor will allocate 3700k but only touch 100k +
|
|
20000 * 4 = 180 kbytes.
|
|
|
|
Here is a table which summarises the maximum memory usage
|
|
for different block sizes. Also recorded is the total
|
|
compressed size for 14 files of the Calgary Text Compres-
|
|
sion Corpus totalling 3,141,622 bytes. This column gives
|
|
some feel for how compression varies with block size.
|
|
These figures tend to understate the advantage of larger
|
|
block sizes for larger files, since the Corpus is domi-
|
|
nated by smaller files.
|
|
|
|
Compress Decompress Decompress Corpus
|
|
Flag usage usage -s usage Size
|
|
|
|
-1 1100k 500k 350k 914704
|
|
-2 1800k 900k 600k 877703
|
|
-3 2500k 1300k 850k 860338
|
|
-4 3200k 1700k 1100k 846899
|
|
-5 3900k 2100k 1350k 845160
|
|
-6 4600k 2500k 1600k 838626
|
|
-7 5400k 2900k 1850k 834096
|
|
-8 6000k 3300k 2100k 828642
|
|
-9 6700k 3700k 2350k 828642
|
|
|
|
|
|
OPTIONS
|
|
-c --stdout
|
|
Compress or decompress to standard output. -c will
|
|
decompress multiple files to stdout, but will only
|
|
compress a single file to stdout.
|
|
|
|
-d --decompress
|
|
Force decompression. bzip2, bunzip2 and bzcat are
|
|
really the same program, and the decision about
|
|
what actions to take is done on the basis of which
|
|
name is used. This flag overrides that mechanism,
|
|
and forces bzip2 to decompress.
|
|
|
|
-z --compress
|
|
The complement to -d: forces compression, regard-
|
|
less of the invokation name.
|
|
|
|
-t --test
|
|
Check integrity of the specified file(s), but don't
|
|
decompress them. This really performs a trial
|
|
decompression and throws away the result.
|
|
|
|
-f --force
|
|
Force overwrite of output files. Normally, bzip2
|
|
will not overwrite existing output files.
|
|
|
|
-k --keep
|
|
Keep (don't delete) input files during compression
|
|
or decompression.
|
|
|
|
-s --small
|
|
Reduce memory usage, for compression, decompression
|
|
and testing. Files are decompressed and tested
|
|
using a modified algorithm which only requires 2.5
|
|
bytes per block byte. This means any file can be
|
|
decompressed in 2300k of memory, albeit at about
|
|
half the normal speed.
|
|
|
|
During compression, -s selects a block size of
|
|
200k, which limits memory use to around the same
|
|
figure, at the expense of your compression ratio.
|
|
In short, if your machine is low on memory (8
|
|
megabytes or less), use -s for everything. See
|
|
MEMORY MANAGEMENT above.
|
|
|
|
-v --verbose
|
|
Verbose mode -- show the compression ratio for each
|
|
file processed. Further -v's increase the ver-
|
|
bosity level, spewing out lots of information which
|
|
is primarily of interest for diagnostic purposes.
|
|
|
|
-L --license -V --version
|
|
Display the software version, license terms and
|
|
conditions.
|
|
|
|
-1 to -9
|
|
Set the block size to 100 k, 200 k .. 900 k when
|
|
compressing. Has no effect when decompressing.
|
|
See MEMORY MANAGEMENT above.
|
|
|
|
--repetitive-fast
|
|
bzip2 injects some small pseudo-random variations
|
|
into very repetitive blocks to limit worst-case
|
|
performance during compression. If sorting runs
|
|
into difficulties, the block is randomised, and
|
|
sorting is restarted. Very roughly, bzip2 persists
|
|
for three times as long as a well-behaved input
|
|
would take before resorting to randomisation. This
|
|
flag makes it give up much sooner.
|
|
|
|
--repetitive-best
|
|
Opposite of --repetitive-fast; try a lot harder
|
|
before resorting to randomisation.
|
|
|
|
|
|
RECOVERING DATA FROM DAMAGED FILES
|
|
bzip2 compresses files in blocks, usually 900kbytes long.
|
|
Each block is handled independently. If a media or trans-
|
|
mission error causes a multi-block .bz2 file to become
|
|
damaged, it may be possible to recover data from the
|
|
undamaged blocks in the file.
|
|
|
|
The compressed representation of each block is delimited
|
|
by a 48-bit pattern, which makes it possible to find the
|
|
block boundaries with reasonable certainty. Each block
|
|
also carries its own 32-bit CRC, so damaged blocks can be
|
|
distinguished from undamaged ones.
|
|
|
|
bzip2recover is a simple program whose purpose is to
|
|
search for blocks in .bz2 files, and write each block out
|
|
into its own .bz2 file. You can then use bzip2 -t to test
|
|
the integrity of the resulting files, and decompress those
|
|
which are undamaged.
|
|
|
|
bzip2recover takes a single argument, the name of the dam-
|
|
aged file, and writes a number of files "rec0001file.bz2",
|
|
"rec0002file.bz2", etc, containing the extracted blocks.
|
|
The output filenames are designed so that the use of
|
|
wildcards in subsequent processing -- for example, "bzip2
|
|
-dc rec*file.bz2 > recovered_data" -- lists the files in
|
|
the "right" order.
|
|
|
|
bzip2recover should be of most use dealing with large .bz2
|
|
files, as these will contain many blocks. It is clearly
|
|
futile to use it on damaged single-block files, since a
|
|
damaged block cannot be recovered. If you wish to min-
|
|
imise any potential data loss through media or transmis-
|
|
sion errors, you might consider compressing with a smaller
|
|
block size.
|
|
|
|
|
|
PERFORMANCE NOTES
|
|
The sorting phase of compression gathers together similar
|
|
strings in the file. Because of this, files containing
|
|
very long runs of repeated symbols, like "aabaabaabaab
|
|
..." (repeated several hundred times) may compress
|
|
extraordinarily slowly. You can use the -vvvvv option to
|
|
monitor progress in great detail, if you want. Decompres-
|
|
sion speed is unaffected.
|
|
|
|
Such pathological cases seem rare in practice, appearing
|
|
mostly in artificially-constructed test files, and in low-
|
|
level disk images. It may be inadvisable to use bzip2 to
|
|
compress the latter. If you do get a file which causes
|
|
severe slowness in compression, try making the block size
|
|
as small as possible, with flag -1.
|
|
|
|
bzip2 usually allocates several megabytes of memory to
|
|
operate in, and then charges all over it in a fairly ran-
|
|
dom fashion. This means that performance, both for com-
|
|
pressing and decompressing, is largely determined by the
|
|
speed at which your machine can service cache misses.
|
|
Because of this, small changes to the code to reduce the
|
|
miss rate have been observed to give disproportionately
|
|
large performance improvements. I imagine bzip2 will per-
|
|
form best on machines with very large caches.
|
|
|
|
|
|
CAVEATS
|
|
I/O error messages are not as helpful as they could be.
|
|
Bzip2 tries hard to detect I/O errors and exit cleanly,
|
|
but the details of what the problem is sometimes seem
|
|
rather misleading.
|
|
|
|
This manual page pertains to version 0.9.0 of bzip2. Com-
|
|
pressed data created by this version is entirely forwards
|
|
and backwards compatible with the previous public release,
|
|
version 0.1pl2, but with the following exception: 0.9.0
|
|
can correctly decompress multiple concatenated compressed
|
|
files. 0.1pl2 cannot do this; it will stop after decom-
|
|
pressing just the first file in the stream.
|
|
|
|
Wildcard expansion for Windows 95 and NT is flaky.
|
|
|
|
bzip2recover uses 32-bit integers to represent bit posi-
|
|
tions in compressed files, so it cannot handle compressed
|
|
files more than 512 megabytes long. This could easily be
|
|
fixed.
|
|
|
|
|
|
AUTHOR
|
|
Julian Seward, jseward@@acm.org.
|
|
|
|
The ideas embodied in bzip2 are due to (at least) the fol-
|
|
lowing people: Michael Burrows and David Wheeler (for the
|
|
block sorting transformation), David Wheeler (again, for
|
|
the Huffman coder), Peter Fenwick (for the structured cod-
|
|
ing model in the original bzip, and many refinements), and
|
|
Alistair Moffat, Radford Neal and Ian Witten (for the
|
|
arithmetic coder in the original bzip). I am much
|
|
indebted for their help, support and advice. See the man-
|
|
ual in the source distribution for pointers to sources of
|
|
documentation. Christian von Roques encouraged me to look
|
|
for faster sorting algorithms, so as to speed up compres-
|
|
sion. Bela Lubkin encouraged me to improve the worst-case
|
|
compression performance. Many people sent patches, helped
|
|
with portability problems, lent machines, gave advice and
|
|
were generally helpful.
|
|
@end example
|
|
|
|
|
|
|
|
|
|
|
|
@chapter Programming with @code{libbzip2}
|
|
|
|
This chapter describes the programming interface to @code{libbzip2}.
|
|
|
|
For general background information, particularly about memory
|
|
use and performance aspects, you'd be well advised to read Chapter 2
|
|
as well.
|
|
|
|
@section Top-level structure
|
|
|
|
@code{libbzip2} is a flexible library for compressing and decompressing
|
|
data in the @code{bzip2} data format. Although packaged as a single
|
|
entity, it helps to regard the library as three separate parts: the low
|
|
level interface, and the high level interface, and some utility
|
|
functions.
|
|
|
|
The structure of @code{libbzip2}'s interfaces is similar to
|
|
that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib}
|
|
library.
|
|
|
|
@subsection Low-level summary
|
|
|
|
This interface provides services for compressing and decompressing
|
|
data in memory. There's no provision for dealing with files, streams
|
|
or any other I/O mechanisms, just straight memory-to-memory work.
|
|
In fact, this part of the library can be compiled without inclusion
|
|
of @code{stdio.h}, which may be helpful for embedded applications.
|
|
|
|
The low-level part of the library has no global variables and
|
|
is therefore thread-safe.
|
|
|
|
Six routines make up the low level interface:
|
|
@code{bzCompressInit}, @code{bzCompress}, and @* @code{bzCompressEnd}
|
|
for compression,
|
|
and a corresponding trio @code{bzDecompressInit}, @* @code{bzDecompress}
|
|
and @code{bzDecompressEnd} for decompression.
|
|
The @code{*Init} functions allocate
|
|
memory for compression/decompression and do other
|
|
initialisations, whilst the @code{*End} functions close down operations
|
|
and release memory.
|
|
|
|
The real work is done by @code{bzCompress} and @code{bzDecompress}.
|
|
These compress/decompress data from a user-supplied input buffer
|
|
to a user-supplied output buffer. These buffers can be any size;
|
|
arbitrary quantities of data are handled by making repeated calls
|
|
to these functions. This is a flexible mechanism allowing a
|
|
consumer-pull style of activity, or producer-push, or a mixture of
|
|
both.
|
|
|
|
|
|
|
|
@subsection High-level summary
|
|
|
|
This interface provides some handy wrappers around the low-level
|
|
interface to facilitate reading and writing @code{bzip2} format
|
|
files (@code{.bz2} files). The routines provide hooks to facilitate
|
|
reading files in which the @code{bzip2} data stream is embedded
|
|
within some larger-scale file structure, or where there are
|
|
multiple @code{bzip2} data streams concatenated end-to-end.
|
|
|
|
For reading files, @code{bzReadOpen}, @code{bzRead}, @code{bzReadClose}
|
|
and @code{bzReadGetUnused} are supplied. For writing files,
|
|
@code{bzWriteOpen}, @code{bzWrite} and @code{bzWriteFinish} are
|
|
available.
|
|
|
|
As with the low-level library, no global variables are used
|
|
so the library is per se thread-safe. However, if I/O errors
|
|
occur whilst reading or writing the underlying compressed files,
|
|
you may have to consult @code{errno} to determine the cause of
|
|
the error. In that case, you'd need a C library which correctly
|
|
supports @code{errno} in a multithreaded environment.
|
|
|
|
To make the library a little simpler and more portable,
|
|
@code{bzReadOpen} and @code{bzWriteOpen} require you to pass them file
|
|
handles (@code{FILE*}s) which have previously been opened for reading or
|
|
writing respectively. That avoids portability problems associated with
|
|
file operations and file attributes, whilst not being much of an
|
|
imposition on the programmer.
|
|
|
|
|
|
|
|
@subsection Utility functions summary
|
|
For very simple needs, @code{bzBuffToBuffCompress} and
|
|
@code{bzBuffToBuffDecompress} are provided. These compress
|
|
data in memory from one buffer to another buffer in a single
|
|
function call. You should assess whether these functions
|
|
fulfill your memory-to-memory compression/decompression
|
|
requirements before investing effort in understanding the more
|
|
general but more complex low-level interface.
|
|
|
|
Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} /
|
|
@code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to
|
|
give better @code{zlib} compatibility. These functions are
|
|
@code{bzopen}, @code{bzread}, @code{bzwrite}, @code{bzflush},
|
|
@code{bzclose},
|
|
@code{bzerror} and @code{bzlibVersion}. You may find these functions
|
|
more convenient for simple file reading and writing, than those in the
|
|
high-level interface. These functions are not (yet) officially part of
|
|
the library, and are not further documented here. If they break, you
|
|
get to keep all the pieces. I hope to document them properly when time
|
|
permits.
|
|
|
|
Yoshioka also contributed modifications to allow the library to be
|
|
built as a Windows DLL.
|
|
|
|
|
|
@section Error handling
|
|
|
|
The library is designed to recover cleanly in all situations, including
|
|
the worst-case situation of decompressing random data. I'm not
|
|
100% sure that it can always do this, so you might want to add
|
|
a signal handler to catch segmentation violations during decompression
|
|
if you are feeling especially paranoid. I would be interested in
|
|
hearing more about the robustness of the library to corrupted
|
|
compressed data.
|
|
|
|
The file @code{bzlib.h} contains all definitions needed to use
|
|
the library. In particular, you should definitely not include
|
|
@code{bzlib_private.h}.
|
|
|
|
In @code{bzlib.h}, the various return values are defined. The following
|
|
list is not intended as an exhaustive description of the circumstances
|
|
in which a given value may be returned -- those descriptions are given
|
|
later. Rather, it is intended to convey the rough meaning of each
|
|
return value. The first five actions are normal and not intended to
|
|
denote an error situation.
|
|
@table @code
|
|
@item BZ_OK
|
|
The requested action was completed successfully.
|
|
@item BZ_RUN_OK
|
|
@itemx BZ_FLUSH_OK
|
|
@itemx BZ_FINISH_OK
|
|
In @code{bzCompress}, the requested flush/finish/nothing-special action
|
|
was completed successfully.
|
|
@item BZ_STREAM_END
|
|
Compression of data was completed, or the logical stream end was
|
|
detected during decompression.
|
|
@end table
|
|
|
|
The following return values indicate an error of some kind.
|
|
@table @code
|
|
@item BZ_SEQUENCE_ERROR
|
|
When using the library, it is important to call the functions in the
|
|
correct sequence and with data structures (buffers etc) in the correct
|
|
states. @code{libbzip2} checks as much as it can to ensure this is
|
|
happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which
|
|
complies precisely with the function semantics, as detailed below,
|
|
should never receive this value; such an event denotes buggy code
|
|
which you should investigate.
|
|
@item BZ_PARAM_ERROR
|
|
Returned when a parameter to a function call is out of range
|
|
or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR},
|
|
this denotes a bug in the client code. The distinction between
|
|
@code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth
|
|
making.
|
|
@item BZ_MEM_ERROR
|
|
Returned when a request to allocate memory failed. Note that the
|
|
quantity of memory needed to decompress a stream cannot be determined
|
|
until the stream's header has been read. So @code{bzDecompress} and
|
|
@code{bzRead} may return @code{BZ_MEM_ERROR} even though some of
|
|
the compressed data has been read. The same is not true for
|
|
compression; once @code{bzCompressInit} or @code{bzWriteOpen} have
|
|
successfully completed, @code{BZ_MEM_ERROR} cannot occur.
|
|
@item BZ_DATA_ERROR
|
|
Returned when a data integrity error is detected during decompression.
|
|
Most importantly, this means when stored and computed CRCs for the
|
|
data do not match. This value is also returned upon detection of any
|
|
other anomaly in the compressed data.
|
|
@item BZ_DATA_ERROR_MAGIC
|
|
As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to
|
|
know when the compressed stream does not start with the correct
|
|
magic bytes (@code{'B' 'Z' 'h'}).
|
|
@item BZ_IO_ERROR
|
|
Returned by @code{bzRead} and @code{bzRead} when there is an error
|
|
reading or writing in the compressed file, and by @code{bzReadOpen}
|
|
and @code{bzWriteOpen} for attempts to use a file for which the
|
|
error indicator (viz, @code{ferror(f)}) is set.
|
|
On receipt of @code{BZ_IO_ERROR}, the caller should consult
|
|
@code{errno} and/or @code{perror} to acquire operating-system
|
|
specific information about the problem.
|
|
@item BZ_UNEXPECTED_EOF
|
|
Returned by @code{bzRead} when the compressed file finishes
|
|
before the logical end of stream is detected.
|
|
@item BZ_OUTBUFF_FULL
|
|
Returned by @code{bzBuffToBuffCompress} and
|
|
@code{bzBuffToBuffDecompress} to indicate that the output data
|
|
will not fit into the output buffer provided.
|
|
@end table
|
|
|
|
|
|
|
|
@section Low-level interface
|
|
|
|
@subsection @code{bzCompressInit}
|
|
@example
|
|
typedef
|
|
struct @{
|
|
char *next_in;
|
|
unsigned int avail_in;
|
|
unsigned int total_in;
|
|
|
|
char *next_out;
|
|
unsigned int avail_out;
|
|
unsigned int total_out;
|
|
|
|
void *state;
|
|
|
|
void *(*bzalloc)(void *,int,int);
|
|
void (*bzfree)(void *,void *);
|
|
void *opaque;
|
|
@}
|
|
bz_stream;
|
|
|
|
int bzCompressInit ( bz_stream *strm,
|
|
int blockSize100k,
|
|
int verbosity,
|
|
int workFactor );
|
|
|
|
@end example
|
|
|
|
Prepares for compression. The @code{bz_stream} structure
|
|
holds all data pertaining to the compression activity.
|
|
A @code{bz_stream} structure should be allocated and initialised
|
|
prior to the call.
|
|
The fields of @code{bz_stream}
|
|
comprise the entirety of the user-visible data. @code{state}
|
|
is a pointer to the private data structures required for compression.
|
|
|
|
Custom memory allocators are supported, via fields @code{bzalloc},
|
|
@code{bzfree},
|
|
and @code{opaque}. The value
|
|
@code{opaque} is passed to as the first argument to
|
|
all calls to @code{bzalloc} and @code{bzfree}, but is
|
|
otherwise ignored by the library.
|
|
The call @code{bzalloc ( opaque, n, m )} is expected to return a
|
|
pointer @code{p} to
|
|
@code{n * m} bytes of memory, and @code{bzfree ( opaque, p )}
|
|
should free
|
|
that memory.
|
|
|
|
If you don't want to use a custom memory allocator, set @code{bzalloc},
|
|
@code{bzfree} and
|
|
@code{opaque} to @code{NULL},
|
|
and the library will then use the standard @code{malloc}/@code{free}
|
|
routines.
|
|
|
|
Before calling @code{bzCompressInit}, fields @code{bzalloc},
|
|
@code{bzfree} and @code{opaque} should
|
|
be filled appropriately, as just described. Upon return, the internal
|
|
state will have been allocated and initialised, and @code{total_in} and
|
|
@code{total_out} will have been set to zero.
|
|
These last two fields are used by the library
|
|
to inform the caller of the total amount of data passed into and out of
|
|
the library, respectively. You should not try to change them.
|
|
|
|
Parameter @code{blockSize100k} specifies the block size to be used for
|
|
compression. It should be a value between 1 and 9 inclusive, and the
|
|
actual block size used is 100000 x this figure. 9 gives the best
|
|
compression but takes most memory.
|
|
|
|
Parameter @code{verbosity} should be set to a number between 0 and 4
|
|
inclusive. 0 is silent, and greater numbers give increasingly verbose
|
|
monitoring/debugging output. If the library has been compiled with
|
|
@code{-DBZ_NO_STDIO}, no such output will appear for any verbosity
|
|
setting.
|
|
|
|
Parameter @code{workFactor} controls how the compression phase behaves
|
|
when presented with worst case, highly repetitive, input data.
|
|
If compression runs into difficulties caused by repetitive data,
|
|
some pseudo-random variations are inserted into the block, and
|
|
compression is restarted. Lower values of @code{workFactor}
|
|
reduce the tolerance of compression to repetitive data.
|
|
You should set this parameter carefully; too low, and
|
|
compression ratio suffers, too high, and your average-to-worst
|
|
case compression times can become very large.
|
|
The default value of 30
|
|
gives reasonable behaviour over a wide range of circumstances.
|
|
|
|
Allowable values range from 0 to 250 inclusive. 0 is a special
|
|
case, equivalent to using the default value of 30.
|
|
|
|
Note that the randomisation process is entirely transparent.
|
|
If the library decides to randomise and restart compression on a
|
|
block, it does so without comment. Randomised blocks are
|
|
automatically de-randomised during decompression, so data
|
|
integrity is never compromised.
|
|
|
|
Possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{strm} is @code{NULL}
|
|
or @code{blockSize} < 1 or @code{blockSize} > 9
|
|
or @code{verbosity} < 0 or @code{verbosity} > 4
|
|
or @code{workFactor} < 0 or @code{workFactor} > 250
|
|
@code{BZ_MEM_ERROR}
|
|
if not enough memory is available
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
Allowable next actions:
|
|
@display
|
|
@code{bzCompress}
|
|
if @code{BZ_OK} is returned
|
|
no specific action needed in case of error
|
|
@end display
|
|
|
|
@subsection @code{bzCompress}
|
|
@example
|
|
int bzCompress ( bz_stream *strm, int action );
|
|
@end example
|
|
Provides more input and/or output buffer space for the library. The
|
|
caller maintains input and output buffers, and calls @code{bzCompress} to
|
|
transfer data between them.
|
|
|
|
Before each call to @code{bzCompress}, @code{next_in} should point at
|
|
the data to be compressed, and @code{avail_in} should indicate how many
|
|
bytes the library may read. @code{bzCompress} updates @code{next_in},
|
|
@code{avail_in} and @code{total_in} to reflect the number of bytes it
|
|
has read.
|
|
|
|
Similarly, @code{next_out} should point to a buffer in which the
|
|
compressed data is to be placed, with @code{avail_out} indicating how
|
|
much output space is available. @code{bzCompress} updates
|
|
@code{next_out}, @code{avail_out} and @code{total_out} to reflect the
|
|
number of bytes output.
|
|
|
|
You may provide and remove as little or as much data as you like on each
|
|
call of @code{bzCompress}. In the limit, it is acceptable to supply and
|
|
remove data one byte at a time, although this would be terribly
|
|
inefficient. You should always ensure that at least one byte of output
|
|
space is available at each call.
|
|
|
|
A second purpose of @code{bzCompress} is to request a change of mode of the
|
|
compressed stream.
|
|
|
|
Conceptually, a compressed stream can be in one of four states: IDLE,
|
|
RUNNING, FLUSHING and FINISHING. Before initialisation
|
|
(@code{bzCompressInit}) and after termination (@code{bzCompressEnd}), a
|
|
stream is regarded as IDLE.
|
|
|
|
Upon initialisation (@code{bzCompressInit}), the stream is placed in the
|
|
RUNNING state. Subsequent calls to @code{bzCompress} should pass
|
|
@code{BZ_RUN} as the requested action; other actions are illegal and
|
|
will result in @code{BZ_SEQUENCE_ERROR}.
|
|
|
|
At some point, the calling program will have provided all the input data
|
|
it wants to. It will then want to finish up -- in effect, asking the
|
|
library to process any data it might have buffered internally. In this
|
|
state, @code{bzCompress} will no longer attempt to read data from
|
|
@code{next_in}, but it will want to write data to @code{next_out}.
|
|
Because the output buffer supplied by the user can be arbitrarily small,
|
|
the finishing-up operation cannot necessarily be done with a single call
|
|
of @code{bzCompress}.
|
|
|
|
Instead, the calling program passes @code{BZ_FINISH} as an action to
|
|
@code{bzCompress}. This changes the stream's state to FINISHING. Any
|
|
remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and
|
|
transferred to the output buffer. To do this, @code{bzCompress} must be
|
|
called repeatedly until all the output has been consumed. At that
|
|
point, @code{bzCompress} returns @code{BZ_STREAM_END}, and the stream's
|
|
state is set back to IDLE. @code{bzCompressEnd} should then be
|
|
called.
|
|
|
|
Just to make sure the calling program does not cheat, the library makes
|
|
a note of @code{avail_in} at the time of the first call to
|
|
@code{bzCompress} which has @code{BZ_FINISH} as an action (ie, at the
|
|
time the program has announced its intention to not supply any more
|
|
input). By comparing this value with that of @code{avail_in} over
|
|
subsequent calls to @code{bzCompress}, the library can detect any
|
|
attempts to slip in more data to compress. Any calls for which this is
|
|
detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a
|
|
programming mistake which should be corrected.
|
|
|
|
Instead of asking to finish, the calling program may ask
|
|
@code{bzCompress} to take all the remaining input, compress it and
|
|
terminate the current (Burrows-Wheeler) compression block. This could
|
|
be useful for error control purposes. The mechanism is analogous to
|
|
that for finishing: call @code{bzCompress} with an action of
|
|
@code{BZ_FLUSH}, remove output data, and persist with the
|
|
@code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned. As
|
|
with finishing, @code{bzCompress} detects any attempt to provide more
|
|
input data once the flush has begun.
|
|
|
|
Once the flush is complete, the stream returns to the normal RUNNING
|
|
state.
|
|
|
|
This all sounds pretty complex, but isn't really. Here's a table
|
|
which shows which actions are allowable in each state, what action
|
|
will be taken, what the next state is, and what the non-error return
|
|
values are. Note that you can't explicitly ask what state the
|
|
stream is in, but nor do you need to -- it can be inferred from the
|
|
values returned by @code{bzCompress}.
|
|
@display
|
|
IDLE/@code{any}
|
|
Illegal. IDLE state only exists after @code{bzCompressEnd} or
|
|
before @code{bzCompressInit}.
|
|
Return value = @code{BZ_SEQUENCE_ERROR}
|
|
|
|
RUNNING/@code{BZ_RUN}
|
|
Compress from @code{next_in} to @code{next_out} as much as possible.
|
|
Next state = RUNNING
|
|
Return value = @code{BZ_RUN_OK}
|
|
|
|
RUNNING/@code{BZ_FLUSH}
|
|
Remember current value of @code{next_in}. Compress from @code{next_in}
|
|
to @code{next_out} as much as possible, but do not accept any more input.
|
|
Next state = FLUSHING
|
|
Return value = @code{BZ_FLUSH_OK}
|
|
|
|
RUNNING/@code{BZ_FINISH}
|
|
Remember current value of @code{next_in}. Compress from @code{next_in}
|
|
to @code{next_out} as much as possible, but do not accept any more input.
|
|
Next state = FINISHING
|
|
Return value = @code{BZ_FINISH_OK}
|
|
|
|
FLUSHING/@code{BZ_FLUSH}
|
|
Compress from @code{next_in} to @code{next_out} as much as possible,
|
|
but do not accept any more input.
|
|
If all the existing input has been used up
|
|
Next state = RUNNING; Return value = @code{BZ_RUN_OK}
|
|
else
|
|
Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK}
|
|
|
|
FLUSHING/other
|
|
Illegal.
|
|
Return value = @code{BZ_SEQUENCE_ERROR}
|
|
|
|
FINISHING/@code{BZ_FINISH}
|
|
Compress from @code{next_in} to @code{next_out} as much as possible,
|
|
but to not accept any more input.
|
|
If all the existing input has been used up and all compressed
|
|
output has been removed
|
|
Next state = IDLE; Return value = @code{BZ_STREAM_END}
|
|
else
|
|
Next state = FINISHING; Return value = @code{BZ_FINISHING}
|
|
|
|
FINISHING/other
|
|
Illegal.
|
|
Return value = @code{BZ_SEQUENCE_ERROR}
|
|
@end display
|
|
|
|
That still looks complicated? Well, fair enough. The usual sequence
|
|
of calls for compressing a load of data is:
|
|
@itemize @bullet
|
|
@item Get started with @code{bzCompressInit}.
|
|
@item Shovel data in and shlurp out its compressed form using zero or more
|
|
calls of @code{bzCompress} with action = @code{BZ_RUN}.
|
|
@item Finish up.
|
|
Repeatedly call @code{bzCompress} with action = @code{BZ_FINISH},
|
|
copying out the compressed output, until @code{BZ_STREAM_END} is returned.
|
|
@item Close up and go home. Call @code{bzCompressEnd}.
|
|
@end itemize
|
|
If the data you want to compress fits into your input buffer all
|
|
at once, you can skip the calls of @code{bzCompress ( ..., BZ_RUN )} and
|
|
just do the @code{bzCompress ( ..., BZ_FINISH )} calls.
|
|
|
|
All required memory is allocated by @code{bzCompressInit}. The
|
|
compression library can accept any data at all (obviously). So you
|
|
shouldn't get any error return values from the @code{bzCompress} calls.
|
|
If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in
|
|
your programming.
|
|
|
|
Trivial other possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL}
|
|
@end display
|
|
|
|
@subsection @code{bzCompressEnd}
|
|
@example
|
|
int bzCompressEnd ( bz_stream *strm );
|
|
@end example
|
|
Releases all memory associated with a compression stream.
|
|
|
|
Possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
|
|
@code{BZ_OK} otherwise
|
|
@end display
|
|
|
|
|
|
@subsection @code{bzDecompressInit}
|
|
@example
|
|
int bzDecompressInit ( bz_stream *strm, int verbosity, int small );
|
|
@end example
|
|
Prepares for decompression. As with @code{bzCompressInit}, a
|
|
@code{bz_stream} record should be allocated and initialised before the
|
|
call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be
|
|
set if a custom memory allocator is required, or made @code{NULL} for
|
|
the normal @code{malloc}/@code{free} routines. Upon return, the internal
|
|
state will have been initialised, and @code{total_in} and
|
|
@code{total_out} will be zero.
|
|
|
|
For the meaning of parameter @code{verbosity}, see @code{bzCompressInit}.
|
|
|
|
If @code{small} is nonzero, the library will use an alternative
|
|
decompression algorithm which uses less memory but at the cost of
|
|
decompressing more slowly (roughly speaking, half the speed, but the
|
|
maximum memory requirement drops to around 2300k). See Chapter 2 for
|
|
more information on memory management.
|
|
|
|
Note that the amount of memory needed to decompress
|
|
a stream cannot be determined until the stream's header has been read,
|
|
so even if @code{bzDecompressInit} succeeds, a subsequent
|
|
@code{bzDecompress} could fail with @code{BZ_MEM_ERROR}.
|
|
|
|
Possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{(small != 0 && small != 1)}
|
|
or @code{(verbosity < 0 || verbosity > 4)}
|
|
@code{BZ_MEM_ERROR}
|
|
if insufficient memory is available
|
|
@end display
|
|
|
|
Allowable next actions:
|
|
@display
|
|
@code{bzDecompress}
|
|
if @code{BZ_OK} was returned
|
|
no specific action required in case of error
|
|
@end display
|
|
|
|
|
|
|
|
@subsection @code{bzDecompress}
|
|
@example
|
|
int bzDecompress ( bz_stream *strm );
|
|
@end example
|
|
Provides more input and/out output buffer space for the library. The
|
|
caller maintains input and output buffers, and uses @code{bzDecompress}
|
|
to transfer data between them.
|
|
|
|
Before each call to @code{bzDecompress}, @code{next_in}
|
|
should point at the compressed data,
|
|
and @code{avail_in} should indicate how many bytes the library
|
|
may read. @code{bzDecompress} updates @code{next_in}, @code{avail_in}
|
|
and @code{total_in}
|
|
to reflect the number of bytes it has read.
|
|
|
|
Similarly, @code{next_out} should point to a buffer in which the uncompressed
|
|
output is to be placed, with @code{avail_out} indicating how much output space
|
|
is available. @code{bzCompress} updates @code{next_out},
|
|
@code{avail_out} and @code{total_out} to reflect
|
|
the number of bytes output.
|
|
|
|
You may provide and remove as little or as much data as you like on
|
|
each call of @code{bzDecompress}.
|
|
In the limit, it is acceptable to
|
|
supply and remove data one byte at a time, although this would be
|
|
terribly inefficient. You should always ensure that at least one
|
|
byte of output space is available at each call.
|
|
|
|
Use of @code{bzDecompress} is simpler than @code{bzCompress}.
|
|
|
|
You should provide input and remove output as described above, and
|
|
repeatedly call @code{bzDecompress} until @code{BZ_STREAM_END} is
|
|
returned. Appearance of @code{BZ_STREAM_END} denotes that
|
|
@code{bzDecompress} has detected the logical end of the compressed
|
|
stream. @code{bzDecompress} will not produce @code{BZ_STREAM_END} until
|
|
all output data has been placed into the output buffer, so once
|
|
@code{BZ_STREAM_END} appears, you are guaranteed to have available all
|
|
the decompressed output, and @code{bzDecompressEnd} can safely be
|
|
called.
|
|
|
|
If case of an error return value, you should call @code{bzDecompressEnd}
|
|
to clean up and release memory.
|
|
|
|
Possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
|
|
or @code{strm->avail_out < 1}
|
|
@code{BZ_DATA_ERROR}
|
|
if a data integrity error is detected in the compressed stream
|
|
@code{BZ_DATA_ERROR_MAGIC}
|
|
if the compressed stream doesn't begin with the right magic bytes
|
|
@code{BZ_MEM_ERROR}
|
|
if there wasn't enough memory available
|
|
@code{BZ_STREAM_END}
|
|
if the logical end of the data stream was detected and all
|
|
output in has been consumed, eg @code{s->avail_out > 0}
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
Allowable next actions:
|
|
@display
|
|
@code{bzDecompress}
|
|
if @code{BZ_OK} was returned
|
|
@code{bzDecompressEnd}
|
|
otherwise
|
|
@end display
|
|
|
|
|
|
@subsection @code{bzDecompressEnd}
|
|
@example
|
|
int bzDecompressEnd ( bz_stream *strm );
|
|
@end example
|
|
Releases all memory associated with a decompression stream.
|
|
|
|
Possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
Allowable next actions:
|
|
@display
|
|
None.
|
|
@end display
|
|
|
|
|
|
@section High-level interface
|
|
|
|
This interface provides functions for reading and writing
|
|
@code{bzip2} format files. First, some general points.
|
|
|
|
@itemize @bullet
|
|
@item All of the functions take an @code{int*} first argument,
|
|
@code{bzerror}.
|
|
After each call, @code{bzerror} should be consulted first to determine
|
|
the outcome of the call. If @code{bzerror} is @code{BZ_OK},
|
|
the call completed
|
|
successfully, and only then should the return value of the function
|
|
(if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR},
|
|
there was an error
|
|
reading/writing the underlying compressed file, and you should
|
|
then consult @code{errno}/@code{perror} to determine the
|
|
cause of the difficulty.
|
|
@code{bzerror} may also be set to various other values; precise details are
|
|
given on a per-function basis below.
|
|
@item If @code{bzerror} indicates an error
|
|
(ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}),
|
|
you should immediately call @code{bzReadClose} (or @code{bzWriteClose},
|
|
depending on whether you are attempting to read or to write)
|
|
to free up all resources associated
|
|
with the stream. Once an error has been indicated, behaviour of all calls
|
|
except @code{bzReadClose} (@code{bzWriteClose}) is undefined.
|
|
The implication is that (1) @code{bzerror} should
|
|
be checked after each call, and (2) if @code{bzerror} indicates an error,
|
|
@code{bzReadClose} (@code{bzWriteClose}) should then be called to clean up.
|
|
@item The @code{FILE*} arguments passed to
|
|
@code{bzReadOpen}/@code{bzWriteOpen}
|
|
should be set to binary mode.
|
|
Most Unix systems will do this by default, but other platforms,
|
|
including Windows and Mac, will not. If you omit this, you may
|
|
encounter problems when moving code to new platforms.
|
|
@item Memory allocation requests are handled by
|
|
@code{malloc}/@code{free}.
|
|
At present
|
|
there is no facility for user-defined memory allocators in the file I/O
|
|
functions (could easily be added, though).
|
|
@end itemize
|
|
|
|
|
|
|
|
@subsection @code{bzReadOpen}
|
|
@example
|
|
typedef void BZFILE;
|
|
|
|
BZFILE *bzReadOpen ( int *bzerror, FILE *f,
|
|
int small, int verbosity,
|
|
void *unused, int nUnused );
|
|
@end example
|
|
Prepare to read compressed data from file handle @code{f}. @code{f}
|
|
should refer to a file which has been opened for reading, and for which
|
|
the error indicator (@code{ferror(f)})is not set. If @code{small} is 1,
|
|
the library will try to decompress using less memory, at the expense of
|
|
speed.
|
|
|
|
For reasons explained below, @code{bzRead} will decompress the
|
|
@code{nUnused} bytes starting at @code{unused}, before starting to read
|
|
from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be
|
|
supplied like this. If this facility is not required, you should pass
|
|
@code{NULL} and @code{0} for @code{unused} and n@code{Unused}
|
|
respectively.
|
|
|
|
For the meaning of parameters @code{small} and @code{verbosity},
|
|
see @code{bzDecompressInit}.
|
|
|
|
The amount of memory needed to decompress a file cannot be determined
|
|
until the file's header has been read. So it is possible that
|
|
@code{bzReadOpen} returns @code{BZ_OK} but a subsequent call of
|
|
@code{bzRead} will return @code{BZ_MEM_ERROR}.
|
|
|
|
Possible assignments to @code{bzerror}:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{f} is @code{NULL}
|
|
or @code{small} is neither @code{0} nor @code{1}
|
|
or @code{(unused == NULL && nUnused != 0)}
|
|
or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))}
|
|
@code{BZ_IO_ERROR}
|
|
if @code{ferror(f)} is nonzero
|
|
@code{BZ_MEM_ERROR}
|
|
if insufficient memory is available
|
|
@code{BZ_OK}
|
|
otherwise.
|
|
@end display
|
|
|
|
Possible return values:
|
|
@display
|
|
Pointer to an abstract @code{BZFILE}
|
|
if @code{bzerror} is @code{BZ_OK}
|
|
@code{NULL}
|
|
otherwise
|
|
@end display
|
|
|
|
Allowable next actions:
|
|
@display
|
|
@code{bzRead}
|
|
if @code{bzerror} is @code{BZ_OK}
|
|
@code{bzClose}
|
|
otherwise
|
|
@end display
|
|
|
|
|
|
@subsection @code{bzRead}
|
|
@example
|
|
int bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
|
|
@end example
|
|
Reads up to @code{len} (uncompressed) bytes from the compressed file
|
|
@code{b} into
|
|
the buffer @code{buf}. If the read was successful,
|
|
@code{bzerror} is set to @code{BZ_OK}
|
|
and the number of bytes read is returned. If the logical end-of-stream
|
|
was detected, @code{bzerror} will be set to @code{BZ_STREAM_END},
|
|
and the number
|
|
of bytes read is returned. All other @code{bzerror} values denote an error.
|
|
|
|
@code{bzRead} will supply @code{len} bytes,
|
|
unless the logical stream end is detected
|
|
or an error occurs. Because of this, it is possible to detect the
|
|
stream end by observing when the number of bytes returned is
|
|
less than the number
|
|
requested. Nevertheless, this is regarded as inadvisable; you should
|
|
instead check @code{bzerror} after every call and watch out for
|
|
@code{BZ_STREAM_END}.
|
|
|
|
Internally, @code{bzRead} copies data from the compressed file in chunks
|
|
of size @code{BZ_MAX_UNUSED} bytes
|
|
before decompressing it. If the file contains more bytes than strictly
|
|
needed to reach the logical end-of-stream, @code{bzRead} will almost certainly
|
|
read some of the trailing data before signalling @code{BZ_SEQUENCE_END}.
|
|
To collect the read but unused data once @code{BZ_SEQUENCE_END} has
|
|
appeared, call @code{bzReadGetUnused} immediately before @code{bzReadClose}.
|
|
|
|
Possible assignments to @code{bzerror}:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
|
|
@code{BZ_SEQUENCE_ERROR}
|
|
if @code{b} was opened with @code{bzWriteOpen}
|
|
@code{BZ_IO_ERROR}
|
|
if there is an error reading from the compressed file
|
|
@code{BZ_UNEXPECTED_EOF}
|
|
if the compressed file ended before the logical end-of-stream was detected
|
|
@code{BZ_DATA_ERROR}
|
|
if a data integrity error was detected in the compressed stream
|
|
@code{BZ_DATA_ERROR_MAGIC}
|
|
if the stream does not begin with the requisite header bytes (ie, is not
|
|
a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}.
|
|
@code{BZ_MEM_ERROR}
|
|
if insufficient memory was available
|
|
@code{BZ_STREAM_END}
|
|
if the logical end of stream was detected.
|
|
@code{BZ_OK}
|
|
otherwise.
|
|
@end display
|
|
|
|
Possible return values:
|
|
@display
|
|
number of bytes read
|
|
if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END}
|
|
undefined
|
|
otherwise
|
|
@end display
|
|
|
|
Allowable next actions:
|
|
@display
|
|
collect data from @code{buf}, then @code{bzRead} or @code{bzReadClose}
|
|
if @code{bzerror} is @code{BZ_OK}
|
|
collect data from @code{buf}, then @code{bzReadClose} or @code{bzReadGetUnused}
|
|
if @code{bzerror} is @code{BZ_SEQUENCE_END}
|
|
@code{bzReadClose}
|
|
otherwise
|
|
@end display
|
|
|
|
|
|
|
|
@subsection @code{bzReadGetUnused}
|
|
@example
|
|
void bzReadGetUnused ( int* bzerror, BZFILE *b,
|
|
void** unused, int* nUnused );
|
|
@end example
|
|
Returns data which was read from the compressed file but was not needed
|
|
to get to the logical end-of-stream. @code{*unused} is set to the address
|
|
of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will
|
|
be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive.
|
|
|
|
This function may only be called once @code{bzRead} has signalled
|
|
@code{BZ_STREAM_END} but before @code{bzReadClose}.
|
|
|
|
Possible assignments to @code{bzerror}:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{b} is @code{NULL}
|
|
or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL}
|
|
@code{BZ_SEQUENCE_ERROR}
|
|
if @code{BZ_STREAM_END} has not been signalled
|
|
or if @code{b} was opened with @code{bzWriteOpen}
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
Allowable next actions:
|
|
@display
|
|
@code{bzReadClose}
|
|
@end display
|
|
|
|
|
|
@subsection @code{bzReadClose}
|
|
@example
|
|
void bzReadClose ( int *bzerror, BZFILE *b );
|
|
@end example
|
|
Releases all memory pertaining to the compressed file @code{b}.
|
|
@code{bzReadClose} does not call @code{fclose} on the underlying file
|
|
handle, so you should do that yourself if appropriate.
|
|
@code{bzReadClose} should be called to clean up after all error
|
|
situations.
|
|
|
|
Possible assignments to @code{bzerror}:
|
|
@display
|
|
@code{BZ_SEQUENCE_ERROR}
|
|
if @code{b} was opened with @code{bzOpenWrite}
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
Allowable next actions:
|
|
@display
|
|
none
|
|
@end display
|
|
|
|
|
|
|
|
@subsection @code{bzWriteOpen}
|
|
@example
|
|
BZFILE *bzWriteOpen ( int *bzerror, FILE *f,
|
|
int blockSize100k, int verbosity,
|
|
int workFactor );
|
|
@end example
|
|
Prepare to write compressed data to file handle @code{f}.
|
|
@code{f} should refer to
|
|
a file which has been opened for writing, and for which the error
|
|
indicator (@code{ferror(f)})is not set.
|
|
|
|
For the meaning of parameters @code{blockSize100k},
|
|
@code{verbosity} and @code{workFactor}, see
|
|
@* @code{bzCompressInit}.
|
|
|
|
All required memory is allocated at this stage, so if the call
|
|
completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a
|
|
subsequent call to @code{bzWrite}.
|
|
|
|
Possible assignments to @code{bzerror}:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{f} is @code{NULL}
|
|
or @code{blockSize100k < 1} or @code{blockSize100k > 9}
|
|
@code{BZ_IO_ERROR}
|
|
if @code{ferror(f)} is nonzero
|
|
@code{BZ_MEM_ERROR}
|
|
if insufficient memory is available
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
Possible return values:
|
|
@display
|
|
Pointer to an abstract @code{BZFILE}
|
|
if @code{bzerror} is @code{BZ_OK}
|
|
@code{NULL}
|
|
otherwise
|
|
@end display
|
|
|
|
Allowable next actions:
|
|
@display
|
|
@code{bzWrite}
|
|
if @code{bzerror} is @code{BZ_OK}
|
|
(you could go directly to @code{bzWriteClose}, but this would be pretty pointless)
|
|
@code{bzWriteClose}
|
|
otherwise
|
|
@end display
|
|
|
|
|
|
|
|
@subsection @code{bzWrite}
|
|
@example
|
|
void bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
|
|
@end example
|
|
Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be
|
|
compressed and written to the file.
|
|
|
|
Possible assignments to @code{bzerror}:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
|
|
@code{BZ_SEQUENCE_ERROR}
|
|
if b was opened with @code{bzReadOpen}
|
|
@code{BZ_IO_ERROR}
|
|
if there is an error writing the compressed file.
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
|
|
|
|
|
|
@subsection @code{bzWriteClose}
|
|
@example
|
|
int bzWriteClose ( int *bzerror, BZFILE* f,
|
|
int abandon,
|
|
unsigned int* nbytes_in,
|
|
unsigned int* nbytes_out );
|
|
@end example
|
|
|
|
Compresses and flushes to the compressed file all data so far supplied
|
|
by @code{bzWrite}. The logical end-of-stream markers are also written, so
|
|
subsequent calls to @code{bzWrite} are illegal. All memory associated
|
|
with the compressed file @code{b} is released.
|
|
@code{fflush} is called on the
|
|
compressed file, but it is not @code{fclose}'d.
|
|
|
|
If @code{bzWriteClose} is called to clean up after an error, the only
|
|
action is to release the memory. The library records the error codes
|
|
issued by previous calls, so this situation will be detected
|
|
automatically. There is no attempt to complete the compression
|
|
operation, nor to @code{fflush} the compressed file. You can force this
|
|
behaviour to happen even in the case of no error, by passing a nonzero
|
|
value to @code{abandon}.
|
|
|
|
If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the
|
|
total volume of uncompressed data handled. Similarly, @code{nbytes_out}
|
|
will be set to the total volume of compressed data written.
|
|
|
|
Possible assignments to @code{bzerror}:
|
|
@display
|
|
@code{BZ_SEQUENCE_ERROR}
|
|
if @code{b} was opened with @code{bzReadOpen}
|
|
@code{BZ_IO_ERROR}
|
|
if there is an error writing the compressed file
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
@subsection Handling embedded compressed data streams
|
|
|
|
The high-level library facilitates use of
|
|
@code{bzip2} data streams which form some part of a surrounding, larger
|
|
data stream.
|
|
@itemize @bullet
|
|
@item For writing, the library takes an open file handle, writes
|
|
compressed data to it, @code{fflush}es it but does not @code{fclose} it.
|
|
The calling application can write its own data before and after the
|
|
compressed data stream, using that same file handle.
|
|
@item Reading is more complex, and the facilities are not as general
|
|
as they could be since generality is hard to reconcile with efficiency.
|
|
@code{bzRead} reads from the compressed file in blocks of size
|
|
@code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot
|
|
the logical end of compressed stream.
|
|
To recover this data once decompression has
|
|
ended, call @code{bzReadGetUnused} after the last call of @code{bzRead}
|
|
(the one returning @code{BZ_STREAM_END}) but before calling
|
|
@code{bzReadClose}.
|
|
@end itemize
|
|
|
|
This mechanism makes it easy to decompress multiple @code{bzip2}
|
|
streams placed end-to-end. As the end of one stream, when @code{bzRead}
|
|
returns @code{BZ_STREAM_END}, call @code{bzReadGetUnused} to collect the
|
|
unused data (copy it into your own buffer somewhere).
|
|
That data forms the start of the next compressed stream.
|
|
To start uncompressing that next stream, call @code{bzReadOpen} again,
|
|
feeding in the unused data via the @code{unused}/@code{nUnused}
|
|
parameters.
|
|
Keep doing this until @code{BZ_STREAM_END} return coincides with the
|
|
physical end of file (@code{feof(f)}). In this situation
|
|
@code{bzReadGetUnused}
|
|
will of course return no data.
|
|
|
|
This should give some feel for how the high-level interface can be used.
|
|
If you require extra flexibility, you'll have to bite the bullet and get
|
|
to grips with the low-level interface.
|
|
|
|
@subsection Standard file-reading/writing code
|
|
Here's how you'd write data to a compressed file:
|
|
@example @code
|
|
FILE* f;
|
|
BZFILE* b;
|
|
int nBuf;
|
|
char buf[ /* whatever size you like */ ];
|
|
int bzerror;
|
|
int nWritten;
|
|
|
|
f = fopen ( "myfile.bz2", "w" );
|
|
if (!f) @{
|
|
/* handle error */
|
|
@}
|
|
b = bzWriteOpen ( &bzerror, f, 9 );
|
|
if (bzerror != BZ_OK) @{
|
|
bzWriteClose ( b );
|
|
/* handle error */
|
|
@}
|
|
|
|
while ( /* condition */ ) @{
|
|
/* get data to write into buf, and set nBuf appropriately */
|
|
nWritten = bzWrite ( &bzerror, b, buf, nBuf );
|
|
if (bzerror == BZ_IO_ERROR) @{
|
|
bzWriteClose ( &bzerror, b );
|
|
/* handle error */
|
|
@}
|
|
@}
|
|
|
|
bzWriteClose ( &bzerror, b );
|
|
if (bzerror == BZ_IO_ERROR) @{
|
|
/* handle error */
|
|
@}
|
|
@end example
|
|
And to read from a compressed file:
|
|
@example
|
|
FILE* f;
|
|
BZFILE* b;
|
|
int nBuf;
|
|
char buf[ /* whatever size you like */ ];
|
|
int bzerror;
|
|
int nWritten;
|
|
|
|
f = fopen ( "myfile.bz2", "r" );
|
|
if (!f) @{
|
|
/* handle error */
|
|
@}
|
|
b = bzReadOpen ( &bzerror, f, 0, NULL, 0 );
|
|
if (bzerror != BZ_OK) @{
|
|
bzReadClose ( &bzerror, b );
|
|
/* handle error */
|
|
@}
|
|
|
|
bzerror = BZ_OK;
|
|
while (bzerror == BZ_OK && /* arbitrary other conditions */) @{
|
|
nBuf = bzRead ( &bzerror, b, buf, /* size of buf */ );
|
|
if (bzerror == BZ_OK) @{
|
|
/* do something with buf[0 .. nBuf-1] */
|
|
@}
|
|
@}
|
|
if (bzerror != BZ_STREAM_END) @{
|
|
bzReadClose ( &bzerror, b );
|
|
/* handle error */
|
|
@} else @{
|
|
bzReadClose ( &bzerror );
|
|
@}
|
|
@end example
|
|
|
|
|
|
|
|
@section Utility functions
|
|
@subsection @code{bzBuffToBuffCompress}
|
|
@example
|
|
int bzBuffToBuffCompress( char* dest,
|
|
unsigned int* destLen,
|
|
char* source,
|
|
unsigned int sourceLen,
|
|
int blockSize100k,
|
|
int verbosity,
|
|
int workFactor );
|
|
@end example
|
|
Attempts to compress the data in @code{source[0 .. sourceLen-1]}
|
|
into the destination buffer, @code{dest[0 .. *destLen-1]}.
|
|
If the destination buffer is big enough, @code{*destLen} is
|
|
set to the size of the compressed data, and @code{BZ_OK} is
|
|
returned. If the compressed data won't fit, @code{*destLen}
|
|
is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
|
|
|
|
Compression in this manner is a one-shot event, done with a single call
|
|
to this function. The resulting compressed data is a complete
|
|
@code{bzip2} format data stream. There is no mechanism for making
|
|
additional calls to provide extra input data. If you want that kind of
|
|
mechanism, use the low-level interface.
|
|
|
|
For the meaning of parameters @code{blockSize100k}, @code{verbosity}
|
|
and @code{workFactor}, @* see @code{bzCompressInit}.
|
|
|
|
To guarantee that the compressed data will fit in its buffer, allocate
|
|
an output buffer of size 1% larger than the uncompressed data, plus
|
|
fifty bytes.
|
|
|
|
@code{bzBuffToBuffDecompress} will not write data at or
|
|
beyond @code{dest[*destLen]}, even in case of buffer overflow.
|
|
|
|
Possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
|
|
or @code{blockSize100k < 1} or @code{blockSize100k > 9}
|
|
or @code{verbosity < 0} or @code{verbosity > 4}
|
|
or @code{workFactor < 0} or @code{workFactor > 250}
|
|
@code{BZ_MEM_ERROR}
|
|
if insufficient memory is available
|
|
@code{BZ_OUTBUFF_FULL}
|
|
if the size of the compressed data exceeds @code{*destLen}
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
|
|
|
|
@subsection @code{bzBuffToBuffDecompress}
|
|
@example
|
|
int bzBuffToBuffDecompress ( char* dest,
|
|
unsigned int* destLen,
|
|
char* source,
|
|
unsigned int sourceLen,
|
|
int small,
|
|
int verbosity );
|
|
@end example
|
|
Attempts to decompress the data in @code{source[0 .. sourceLen-1]}
|
|
into the destination buffer, @code{dest[0 .. *destLen-1]}.
|
|
If the destination buffer is big enough, @code{*destLen} is
|
|
set to the size of the uncompressed data, and @code{BZ_OK} is
|
|
returned. If the compressed data won't fit, @code{*destLen}
|
|
is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
|
|
|
|
@code{source} is assumed to hold a complete @code{bzip2} format
|
|
data stream. @code{bzBuffToBuffDecompress} tries to decompress
|
|
the entirety of the stream into the output buffer.
|
|
|
|
For the meaning of parameters @code{small} and @code{verbosity},
|
|
see @code{bzDecompressInit}.
|
|
|
|
Because the compression ratio of the compressed data cannot be known in
|
|
advance, there is no easy way to guarantee that the output buffer will
|
|
be big enough. You may of course make arrangements in your code to
|
|
record the size of the uncompressed data, but such a mechanism is beyond
|
|
the scope of this library.
|
|
|
|
@code{bzBuffToBuffDecompress} will not write data at or
|
|
beyond @code{dest[*destLen]}, even in case of buffer overflow.
|
|
|
|
Possible return values:
|
|
@display
|
|
@code{BZ_PARAM_ERROR}
|
|
if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
|
|
or @code{small != 0 && small != 1}
|
|
or @code{verbosity < 0} or @code{verbosity > 4}
|
|
@code{BZ_MEM_ERROR}
|
|
if insufficient memory is available
|
|
@code{BZ_OUTBUFF_FULL}
|
|
if the size of the compressed data exceeds @code{*destLen}
|
|
@code{BZ_DATA_ERROR}
|
|
if a data integrity error was detected in the compressed data
|
|
@code{BZ_DATA_ERROR_MAGIC}
|
|
if the compressed data doesn't begin with the right magic bytes
|
|
@code{BZ_UNEXPECTED_EOF}
|
|
if the compressed data ends unexpectedly
|
|
@code{BZ_OK}
|
|
otherwise
|
|
@end display
|
|
|
|
|
|
|
|
@section Using the library in a @code{stdio}-free environment
|
|
|
|
@subsection Getting rid of @code{stdio}
|
|
|
|
In a deeply embedded application, you might want to use just
|
|
the memory-to-memory functions. You can do this conveniently
|
|
by compiling the library with preprocessor symbol @code{BZ_NO_STDIO}
|
|
defined. Doing this gives you a library containing only the following
|
|
eight functions:
|
|
|
|
@code{bzCompressInit}, @code{bzCompress}, @code{bzCompressEnd} @*
|
|
@code{bzDecompressInit}, @code{bzDecompress}, @code{bzDecompressEnd} @*
|
|
@code{bzBuffToBuffCompress}, @code{bzBuffToBuffDecompress}
|
|
|
|
When compiled like this, all functions will ignore @code{verbosity}
|
|
settings.
|
|
|
|
@subsection Critical error handling
|
|
@code{libbzip2} contains a number of internal assertion checks which
|
|
should, needless to say, never be activated. Nevertheless, if an
|
|
assertion should fail, behaviour depends on whether or not the library
|
|
was compiled with @code{BZ_NO_STDIO} set.
|
|
|
|
For a normal compile, an assertion failure yields the message
|
|
@example
|
|
bzip2/libbzip2, v0.9.0: internal error number N.
|
|
This is a bug in bzip2/libbzip2, v0.9.0. Please report
|
|
it to me at: jseward@@acm.org. If this happened when
|
|
you were using some program which uses libbzip2 as a
|
|
component, you should also report this bug to the author(s)
|
|
of that program. Please make an effort to report this bug;
|
|
timely and accurate bug reports eventually lead to higher
|
|
quality software. Thx. Julian Seward, 27 June 1998.
|
|
@end example
|
|
where @code{N} is some error code number. @code{exit(3)}
|
|
is then called.
|
|
|
|
For a @code{stdio}-free library, assertion failures result
|
|
in a call to a function declared as:
|
|
@example
|
|
extern void bz_internal_error ( int errcode );
|
|
@end example
|
|
The relevant code is passed as a parameter. You should supply
|
|
such a function.
|
|
|
|
In either case, once an assertion failure has occurred, any
|
|
@code{bz_stream} records involved can be regarded as invalid.
|
|
You should not attempt to resume normal operation with them.
|
|
|
|
You may, of course, change critical error handling to suit
|
|
your needs. As I said above, critical errors indicate bugs
|
|
in the library and should not occur. All "normal" error
|
|
situations are indicated via error return codes from functions,
|
|
and can be recovered from.
|
|
|
|
|
|
@section Making a Windows DLL
|
|
Everything related to Windows has been contributed by Yoshioka Tsuneo
|
|
@* (@code{QWF00133@@niftyserve.or.jp} /
|
|
@code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to
|
|
him (but perhaps Cc: me, @code{jseward@@acm.org}).
|
|
|
|
My vague understanding of what to do is: using Visual C++ 5.0,
|
|
open the project file @code{libbz2.dsp}, and build. That's all.
|
|
|
|
If you can't
|
|
open the project file for some reason, make a new one, naming these files:
|
|
@code{blocksort.c}, @code{bzlib.c}, @code{compress.c},
|
|
@code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
|
|
@code{randtable.c} and @code{libbz2.def}. You might also need
|
|
to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
|
|
|
|
If you don't use VC++, you may need to define the proprocessor symbol
|
|
@code{_WIN32}.
|
|
|
|
Finally, @code{dlltest.c} is a sample program using the DLL. It has a
|
|
project file, @code{dlltest.dsp}.
|
|
|
|
I haven't tried any of this stuff myself, but it all looks plausible.
|
|
|
|
|
|
|
|
@chapter Miscellanea
|
|
|
|
These are just some random thoughts of mine. Your mileage may
|
|
vary.
|
|
|
|
@section Limitations of the compressed file format
|
|
@code{bzip2-0.9.0} uses exactly the same file format as the previous
|
|
version, @code{bzip2-0.1}. This decision was made in the interests of
|
|
stability. Creating yet another incompatible compressed file format
|
|
would create further confusion and disruption for users.
|
|
|
|
Nevertheless, this is not a painless decision. Development
|
|
work since the release of @code{bzip2-0.1} in August 1997
|
|
has shown complexities in the file format which slow down
|
|
decompression and, in retrospect, are unnecessary. These are:
|
|
@itemize @bullet
|
|
@item The run-length encoder, which is the first of the
|
|
compression transformations, is entirely irrelevant.
|
|
The original purpose was to protect the sorting algorithm
|
|
from the very worst case input: a string of repeated
|
|
symbols. But algorithm steps Q6a and Q6b in the original
|
|
Burrows-Wheeler technical report (SRC-124) show how
|
|
repeats can be handled without difficulty in block
|
|
sorting.
|
|
@item The randomisation mechanism doesn't really need to be
|
|
there. Udi Manber and Gene Myers published a suffix
|
|
array construction algorithm a few years back, which
|
|
can be employed to sort any block, no matter how
|
|
repetitive, in O(N log N) time. Subsequent work by
|
|
Kunihiko Sadakane has produced a derivative O(N (log N)^2)
|
|
algorithm which usually outperforms the Manber-Myers
|
|
algorithm.
|
|
|
|
I could have changed to Sadakane's algorithm, but I find
|
|
it to be slower than @code{bzip2}'s existing algorithm for
|
|
most inputs, and the randomisation mechanism protects
|
|
adequately against bad cases. I didn't think it was
|
|
a good tradeoff to make. Partly this is due to the fact
|
|
that I was not flooded with email complaints about
|
|
@code{bzip2-0.1}'s performance on repetitive data, so
|
|
perhaps it isn't a problem for real inputs.
|
|
|
|
Probably the best long-term solution
|
|
is to use the existing sorting
|
|
algorithm initially, and fall back to a O(N (log N)^2)
|
|
algorithm if the standard algorithm gets into difficulties.
|
|
This can be done without much difficulty; I made
|
|
a prototype implementation of it some months now.
|
|
@item The compressed file format was never designed to be
|
|
handled by a library, and I have had to jump though
|
|
some hoops to produce an efficient implementation of
|
|
decompression. It's a bit hairy. Try passing
|
|
@code{decompress.c} through the C preprocessor
|
|
and you'll see what I mean. Much of this complexity
|
|
could have been avoided if the compressed size of
|
|
each block of data was recorded in the data stream.
|
|
@item An Adler-32 checksum, rather than a CRC32 checksum,
|
|
would be faster to compute.
|
|
@end itemize
|
|
It would be fair to say that the @code{bzip2} format was frozen
|
|
before I properly and fully understood the performance
|
|
consequences of doing so.
|
|
|
|
Improvements which I have been able to incorporate into
|
|
0.9.0, despite using the same file format, are:
|
|
@itemize @bullet
|
|
@item Single array implementation of the inverse BWT. This
|
|
significantly speeds up decompression, presumably
|
|
because it reduces the number of cache misses.
|
|
@item Faster inverse MTF transform for large MTF values. The
|
|
new implementation is based on the notion of sliding blocks
|
|
of values.
|
|
@item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
|
|
and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
|
|
Duh! I'm embarrassed at my own moronicness (moronicity?) on this
|
|
one.
|
|
|
|
@end itemize
|
|
Further ahead, it would be nice
|
|
to be able to do random access into files. This will
|
|
require some careful design of compressed file formats.
|
|
|
|
|
|
|
|
@section Portability issues
|
|
After some consideration, I have decided not to use
|
|
GNU @code{autoconf} to configure 0.9.0.
|
|
|
|
@code{autoconf}, admirable and wonderful though it is,
|
|
mainly assists with portability problems between Unix-like
|
|
platforms. But @code{bzip2} doesn't have much in the way
|
|
of portability problems on Unix; most of the difficulties appear
|
|
when porting to the Mac, or to Microsoft's operating systems.
|
|
@code{autoconf} doesn't help in those cases, and brings in a
|
|
whole load of new complexity.
|
|
|
|
Most people should be able to compile the library and program
|
|
under Unix straight out-of-the-box, so to speak, especially
|
|
if you have a version of GNU C available.
|
|
|
|
There are a couple of @code{__inline__} directives in the code. GNU C
|
|
(@code{gcc}) should be able to handle them. If your compiler doesn't
|
|
like them, just @code{#define} @code{__inline__} to be null. One
|
|
easy way to do this is to compile with the flag @code{-D__inline__=},
|
|
which should be understood by most Unix compilers.
|
|
|
|
If you still have difficulties, try compiling with the macro
|
|
@code{BZ_STRICT_ANSI} defined. This should enable you to build the
|
|
library in a strictly ANSI compliant environment. Building the program
|
|
itself like this is dangerous and not supported, since you remove
|
|
@code{bzip2}'s checks against compressing directories, symbolic links,
|
|
devices, and other not-really-a-file entities. This could cause
|
|
filesystem corruption!
|
|
|
|
One other thing: if you create a @code{bzip2} binary for public
|
|
distribution, please try and link it statically (@code{gcc -s}). This
|
|
avoids all sorts of library-version issues that others may encounter
|
|
later on.
|
|
|
|
|
|
@section Reporting bugs
|
|
I tried pretty hard to make sure @code{bzip2} is
|
|
bug free, both by design and by testing. Hopefully
|
|
you'll never need to read this section for real.
|
|
|
|
Nevertheless, if @code{bzip2} dies with a segmentation
|
|
fault, a bus error or an internal assertion failure, it
|
|
will ask you to email me a bug report. Experience with
|
|
version 0.1 shows that almost all these problems can
|
|
be traced to either compiler bugs or hardware problems.
|
|
@itemize @bullet
|
|
@item
|
|
Recompile the program with no optimisation, and see if it
|
|
works. And/or try a different compiler.
|
|
I heard all sorts of stories about various flavours
|
|
of GNU C (and other compilers) generating bad code for
|
|
@code{bzip2}, and I've run across two such examples myself.
|
|
|
|
2.7.X versions of GNU C are known to generate bad code from
|
|
time to time, at high optimisation levels.
|
|
If you get problems, try using the flags
|
|
@code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}.
|
|
You should specifically @emph{not} use @code{-funroll-loops}.
|
|
|
|
You may notice that the Makefile runs four tests as part of
|
|
the build process. If the program passes all of these, it's
|
|
a pretty good (but not 100%) indication that the compiler has
|
|
done its job correctly.
|
|
@item
|
|
If @code{bzip2} crashes randomly, and the crashes are not
|
|
repeatable, you may have a flaky memory subsystem. @code{bzip2}
|
|
really hammers your memory hierarchy, and if it's a bit marginal,
|
|
you may get these problems. Ditto if your disk or I/O subsystem
|
|
is slowly failing. Yup, this really does happen.
|
|
|
|
Try using a different machine of the same type, and see if
|
|
you can repeat the problem.
|
|
@item This isn't really a bug, but ... If @code{bzip2} tells
|
|
you your file is corrupted on decompression, and you
|
|
obtained the file via FTP, there is a possibility that you
|
|
forgot to tell FTP to do a binary mode transfer. That absolutely
|
|
will cause the file to be non-decompressible. You'll have to transfer
|
|
it again.
|
|
@end itemize
|
|
|
|
If you've incorporated @code{libbzip2} into your own program
|
|
and are getting problems, please, please, please, check that the
|
|
parameters you are passing in calls to the library, are
|
|
correct, and in accordance with what the documentation says
|
|
is allowable. I have tried to make the library robust against
|
|
such problems, but I'm sure I haven't succeeded.
|
|
|
|
Finally, if the above comments don't help, you'll have to send
|
|
me a bug report. Now, it's just amazing how many people will
|
|
send me a bug report saying something like
|
|
@display
|
|
bzip2 crashed with segmentation fault on my machine
|
|
@end display
|
|
and absolutely nothing else. Needless to say, a such a report
|
|
is @emph{totally, utterly, completely and comprehensively 100% useless;
|
|
a waste of your time, my time, and net bandwidth}.
|
|
With no details at all, there's no way I can possibly begin
|
|
to figure out what the problem is.
|
|
|
|
The rules of the game are: facts, facts, facts. Don't omit
|
|
them because "oh, they won't be relevant". At the bare
|
|
minimum:
|
|
@display
|
|
Machine type. Operating system version.
|
|
Exact version of @code{bzip2} (do @code{bzip2 -V}).
|
|
Exact version of the compiler used.
|
|
Flags passed to the compiler.
|
|
@end display
|
|
However, the most important single thing that will help me is
|
|
the file that you were trying to compress or decompress at the
|
|
time the problem happened. Without that, my ability to do anything
|
|
more than speculate about the cause, is limited.
|
|
|
|
Please remember that I connect to the Internet with a modem, so
|
|
you should contact me before mailing me huge files.
|
|
|
|
|
|
@section Did you get the right package?
|
|
|
|
@code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles
|
|
and memory. Also, it gives very large latencies. In the worst case, you
|
|
can feed many megabytes of uncompressed data into the library before
|
|
getting any compressed output, so this probably rules out applications
|
|
requiring interactive behaviour.
|
|
|
|
These aren't faults of my implementation, I hope, but more
|
|
an intrinsic property of the Burrows-Wheeler transform (unfortunately).
|
|
Maybe this isn't what you want.
|
|
|
|
If you want a compressor and/or library which is faster, uses less
|
|
memory but gets pretty good compression, and has minimal latency,
|
|
consider Jean-loup
|
|
Gailly's and Mark Adler's work, @code{zlib-1.1.2} and
|
|
@code{gzip-1.2.4}. Look for them at
|
|
@code{http://www.cdrom.com/pub/infozip/zlib} and
|
|
@code{http://www.gzip.org} respectively.
|
|
|
|
For something faster and lighter still, you might try Markus F X J
|
|
Oberhumer's @code{LZO} real-time compression/decompression library, at
|
|
@* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}.
|
|
|
|
If you want to use the @code{bzip2} algorithms to compress small blocks
|
|
of data, 64k bytes or smaller, for example on an on-the-fly disk
|
|
compressor, you'd be well advised not to use this library. Instead,
|
|
I've made a special library tuned for that kind of use. It's part of
|
|
@code{e2compr-0.40}, an on-the-fly disk compressor for the Linux
|
|
@code{ext2} filesystem. Look at
|
|
@code{http://www.netspace.net.au/~reiter/e2compr}.
|
|
|
|
|
|
|
|
@section Testing
|
|
|
|
A record of the tests I've done.
|
|
|
|
First, some data sets:
|
|
@itemize @bullet
|
|
@item B: a directory containing a 6001 files, one for every length in the
|
|
range 0 to 6000 bytes. The files contain random lowercase
|
|
letters. 18.7 megabytes.
|
|
@item H: my home directory tree. Documents, source code, mail files,
|
|
compressed data. H contains B, and also a directory of
|
|
files designed as boundary cases for the sorting; mostly very
|
|
repetitive, nasty files. 445 megabytes.
|
|
@item A: directory tree holding various applications built from source:
|
|
@code{egcs-1.0.2}, @code{gcc-2.8.1}, KDE Beta 4, GTK, Octave, etc.
|
|
827 megabytes.
|
|
@item P: directory tree holding large amounts of source code (@code{.tar}
|
|
files) of the entire GNU distribution, plus a couple of
|
|
Linux distributions. 2400 megabytes.
|
|
@end itemize
|
|
The tests conducted are as follows. Each test means compressing
|
|
(a copy of) each file in the data set, decompressing it and
|
|
comparing it against the original.
|
|
|
|
First, a bunch of tests with block sizes, internal buffer
|
|
sizes and randomisation lengths set very small,
|
|
to detect any problems with the
|
|
blocking, buffering and randomisation mechanisms.
|
|
This required modifying the source code so as to try to
|
|
break it.
|
|
@enumerate
|
|
@item Data set H, with
|
|
buffer size of 1 byte, and block size of 23 bytes.
|
|
@item Data set B, buffer sizes 1 byte, block size 1 byte.
|
|
@item As (2) but small-mode decompression (first 1700 files).
|
|
@item As (2) with block size 2 bytes.
|
|
@item As (2) with block size 3 bytes.
|
|
@item As (2) with block size 4 bytes.
|
|
@item As (2) with block size 5 bytes.
|
|
@item As (2) with block size 6 bytes and small-mode decompression.
|
|
@item H with normal buffer sizes (5000 bytes), normal block
|
|
size (up to 900000 bytes), but with randomisation
|
|
mechanism running intensely (randomising approximately every
|
|
third byte).
|
|
@item As (9) with small-mode decompression.
|
|
@end enumerate
|
|
Then some tests with unmodified source code.
|
|
@enumerate
|
|
@item H, all settings normal.
|
|
@item As (1), with small-mode decompress.
|
|
@item H, compress with flag @code{-1}.
|
|
@item H, compress with flag @code{-s}, decompress with flag @code{-s}.
|
|
@item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
|
|
@code{bzip2-0.9.0} decompressing, all settings normal.
|
|
@item Backwards compatibility: H, @code{bzip2-0.9.0} compressing,
|
|
@code{bzip2-0.1pl2} decompressing, all settings normal.
|
|
@item Bigger tests: A, all settings normal.
|
|
@item P, all settings normal.
|
|
@item Misc test: about 100 megabytes of @code{.tar} files with
|
|
@code{bzip2} compiled with Purify.
|
|
@item Misc tests to make sure it builds and runs ok on non-Linux/x86
|
|
platforms.
|
|
@end enumerate
|
|
These tests were conducted on a 205 MHz Cyrix 6x86MX machine, running
|
|
Linux 2.0.32. They represent nearly a week of continuous computation.
|
|
All tests completed successfully.
|
|
|
|
|
|
@section Further reading
|
|
@code{bzip2} is not research work, in the sense that it doesn't present
|
|
any new ideas. Rather, it's an engineering exercise based on existing
|
|
ideas.
|
|
|
|
Four documents describe essentially all the ideas behind @code{bzip2}:
|
|
@example
|
|
Michael Burrows and D. J. Wheeler:
|
|
"A block-sorting lossless data compression algorithm"
|
|
10th May 1994.
|
|
Digital SRC Research Report 124.
|
|
ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
|
|
If you have trouble finding it, try searching at the
|
|
New Zealand Digital Library, http://www.nzdl.org.
|
|
|
|
Daniel S. Hirschberg and Debra A. LeLewer
|
|
"Efficient Decoding of Prefix Codes"
|
|
Communications of the ACM, April 1990, Vol 33, Number 4.
|
|
You might be able to get an electronic copy of this
|
|
from the ACM Digital Library.
|
|
|
|
David J. Wheeler
|
|
Program bred3.c and accompanying document bred3.ps.
|
|
This contains the idea behind the multi-table Huffman
|
|
coding scheme.
|
|
ftp://ftp.cl.cam.ac.uk/pub/user/djw3/
|
|
|
|
Jon L. Bentley and Robert Sedgewick
|
|
"Fast Algorithms for Sorting and Searching Strings"
|
|
Available from Sedgewick's web page,
|
|
www.cs.princeton.edu/~rs
|
|
@end example
|
|
The following paper gives valuable additional insights into the
|
|
algorithm, but is not immediately the basis of any code
|
|
used in bzip2.
|
|
@example
|
|
Peter Fenwick:
|
|
Block Sorting Text Compression
|
|
Proceedings of the 19th Australasian Computer Science Conference,
|
|
Melbourne, Australia. Jan 31 - Feb 2, 1996.
|
|
ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
|
|
@end example
|
|
Kunihiko Sadakane's sorting algorithm, mentioned above,
|
|
is available from:
|
|
@example
|
|
http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
|
|
@end example
|
|
The Manber-Myers suffix array construction
|
|
algorithm is described in a paper
|
|
available from:
|
|
@example
|
|
http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
|
|
@end example
|
|
|
|
|
|
|
|
@contents
|
|
|
|
@bye
|
|
|