Fast lossless compression of numbers stored as text

If large amounts of data have to be read and written at maximum speed and with good compression at the same time, an algorithm tailored to the problem is recommended (see e.g. fc16). For some applications, however, it is advantageous if the data can still be read by text-based programs such as grep, sed or awk.

fc4 (fast compression 4 bit)

Since numbers coded as text occupy on average slightly more than twice as much memory as numbers stored as binary data, large amounts of data should be compressed before they are written to the hard disk. The program fc4 from 256.systems compresses files that mainly contain numbers in ASCII format at over 4 GB/s per core and decompresses them at 8 GB/s per core (on a notebook with an Intel i5 processor). The compression rates achieved are similar to those of gzip, lzma and zstd.
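The source does not describe how fc4 works internally. The name only hints at 4-bit codes, so the following is a minimal sketch of that general idea and not the fc4 algorithm: if a file contains only characters from a 16-symbol alphabet, each character fits into half a byte. The alphabet, the function names pack_nibbles and unpack_nibbles, and the error handling are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-symbol alphabet; the index of a character is its 4-bit code. */
static const char NIBBLE_ALPHABET[] = "0123456789.-+e \n";

/* Packs n characters from text into out (out needs (n + 1) / 2 bytes).
   Returns the number of bytes written, or -1 if a character is not
   part of the alphabet. */
long pack_nibbles(const char *text, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        const char *p = (const char *)memchr(NIBBLE_ALPHABET, text[i], 16);
        if (p == NULL)
            return -1;
        uint8_t code = (uint8_t)(p - NIBBLE_ALPHABET);
        if (i % 2 == 0)
            out[i / 2] = (uint8_t)(code << 4); /* high nibble starts a new byte */
        else
            out[i / 2] |= code;                /* low nibble completes the byte */
    }
    return (long)((n + 1) / 2);
}

/* Reverses pack_nibbles: expands n symbols from in into text (no terminator). */
void unpack_nibbles(const uint8_t *in, size_t n, char *text)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t byte = in[i / 2];
        uint8_t code = (i % 2 == 0) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
        text[i] = NIBBLE_ALPHABET[code];
    }
}
```

In this sketch a 9-character line such as "3.14 -42\n" is stored in 5 bytes. A real tool would also need an escape mechanism for characters outside the alphabet, which is omitted here.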

Benchmark 1: Compression and decompression speeds of different programs

Left: Compression speed of gzip, lzma and zstd for different types of numbers stored as ASCII text.
Right: Decompression speed.

Benchmark 2: Compression rate and overall performance of different programs

Left: Compression rate of gzip, lzma and zstd for different types of numbers stored as ASCII text.
Right: Combined compression and decompression speed v = (v_c + v_d) / (2 * ratio), i.e. the mean of the compression speed v_c and the decompression speed v_d, divided by the compression ratio. It thus shows the mean read and write rate of uncompressed data.
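As a small worked example, the combined speed from the caption can be computed as follows. The function name combined_speed is an assumption. The source does not state how the ratio is defined, so the ratio value used in the note below is invented purely to show the arithmetic.

```c
/* v = (v_c + v_d) / (2 * ratio), as given in the caption above.
   v_c and v_d must use the same speed unit, e.g. GB/s. */
double combined_speed(double v_c, double v_d, double ratio)
{
    return (v_c + v_d) / (2.0 * ratio);
}
```

With v_c = 4 GB/s, v_d = 8 GB/s and an invented ratio of 0.25, the function returns 24.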

Fast conversion from text to numbers and back

In order to save numbers from memory as text, they have to be converted from binary format to decimal format and then to ASCII format. On Linux, the standard library offers several printf functions for this purpose.
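A minimal sketch of this conversion with the standard library, using snprintf to write into a caller-supplied buffer. The wrapper names int64_to_text and double_to_text are hypothetical and only serve the example.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes value as decimal ASCII text into buf.
   Returns the number of characters snprintf produced (without the
   terminating NUL), or a negative value on error. */
int int64_to_text(int64_t value, char *buf, size_t size)
{
    return snprintf(buf, size, "%" PRId64, value);
}

/* Same for a double; 17 significant digits are enough to read an
   IEEE 754 double back without loss. */
int double_to_text(double value, char *buf, size_t size)
{
    return snprintf(buf, size, "%.17g", value);
}
```

A return value greater than or equal to size means the output was truncated, so callers should size the buffer generously (32 bytes suffice for both functions here).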

To convert numbers from text files back to binary format, the standard library provides atoi (ASCII to 32 bit integer), atol (ASCII to 64 bit integer on 64-bit Linux) and atof (ASCII to 64 bit float). Although these functions are highly optimized, they can often become the bottleneck of a program.
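The three standard library functions can be used as in the following sketch. The struct record and the function parse_record are hypothetical names, and the fields are assumed to be already split into separate strings.

```c
#include <stdlib.h>

/* Hypothetical record with one field per conversion function. */
struct record {
    int id;
    long count;
    double value;
};

/* Parses three already separated text fields with the standard library. */
struct record parse_record(const char *id, const char *count, const char *value)
{
    struct record r;
    r.id = atoi(id);       /* text -> int */
    r.count = atol(count); /* text -> long */
    r.value = atof(value); /* text -> double */
    return r;
}
```

These functions do not report errors: input that does not start with a number simply yields 0. Where validation matters, strtol and strtod are the usual alternatives.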

256.systems provides the functions fatoi, fatof and print_int, with print_float to follow soon. They are 4 to 6 times faster than the standard library functions (see benchmark results below).
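The implementation of fatoi and print_int is not shown in the source. As a rough illustration of why hand-written converters can beat the general-purpose library routines, the following sketch uses plain digit loops and deliberately skips whitespace handling, locale support and overflow checks. The names sketch_atoi and sketch_print_int are hypothetical, and this is not the 256.systems code.

```c
#include <stdint.h>

/* Parses an optional sign followed by decimal digits and stops at the
   first non-digit. Assumes the value fits into 32 bits (no overflow check). */
int32_t sketch_atoi(const char *s)
{
    int negative = (*s == '-');
    if (negative || *s == '+')
        s++;
    int64_t value = 0;
    while (*s >= '0' && *s <= '9') {
        value = value * 10 + (*s - '0');
        s++;
    }
    return (int32_t)(negative ? -value : value);
}

/* Writes value as decimal text into buf (at least 12 bytes) and
   returns the number of characters written, not counting the NUL. */
int sketch_print_int(int32_t value, char *buf)
{
    char tmp[11];
    int n = 0;
    /* Unsigned arithmetic avoids overflow for the most negative value. */
    uint32_t u = value < 0 ? 0u - (uint32_t)value : (uint32_t)value;
    do {
        tmp[n++] = (char)('0' + u % 10u); /* digits come out in reverse order */
        u /= 10u;
    } while (u != 0);
    int len = 0;
    if (value < 0)
        buf[len++] = '-';
    while (n > 0)
        buf[len++] = tmp[--n];
    buf[len] = '\0';
    return len;
}
```

Dropping features that number-only files do not need is one common source of speedup. Whether the 256.systems functions take this or a different approach is not stated.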

Benchmark: Conversion of numbers: text --> binary; C Standard Library vs. fc4

Left: Number of conversions per second from ASCII text to 32 bit integer (left), 64 bit integer (center) and 64 bit float (right). Right: Like the left plot, but here the amount of converted ASCII text in megabytes per second is shown.

Benchmark: Conversion of numbers: binary --> text; C Standard Library vs. fc4

Left: Number of conversions per second from 32 bit integer (left) and 64 bit integer (right) to ASCII text. Right: Like the left plot, but here the amount of converted ASCII text in megabytes per second is shown.

Contact

For questions about compressing numbers stored as ASCII text, fast conversion routines or similar algorithms, please contact 256.systems.