Tutorials¶
Using python-blosc (or just blosc, since we will always be talking about using it from a Python environment) is pretty easy. It basically mimics the API of the zlib module included in the standard Python library.
Here are some examples of how to use it. For the full documentation, please refer to the Command Reference section.
Most of the timings in this tutorial have been obtained on a VM with 2 cores on top of an Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz.
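Before diving in, it may help to check which Blosc version you are running and how many threads it will use (the VM above has 2 cores). A minimal sketch using a few introspection helpers from the python-blosc API:
>>> import blosc
>>> blosc.print_versions()              # prints python-blosc / C-Blosc versions (output omitted)
>>> blosc.detect_number_of_cores()      # 2 on the VM used for these timings
2
>>> nthreads_before = blosc.set_nthreads(2)   # use both cores; returns the previous setting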
Compressing and decompressing with blosc¶
Let’s start by creating a NumPy array with 80 MB of data:
>>> import numpy as np
>>> a = np.linspace(0, 100, int(1e7))
>>> bytes_array = a.tobytes()  # get a bytes stream
and let’s compare how Blosc performs against zlib (note that we are using IPython to leverage its timing capabilities):
>>> import zlib
>>> %time zpacked = zlib.compress(bytes_array)
CPU times: user 5.17 s, sys: 14 ms, total: 5.19 s
Wall time: 5.2 s # ~ 15 MB/s
>>> import blosc
>>> %time bpacked = blosc.compress(bytes_array, typesize=8)
CPU times: user 125 ms, sys: 0 ns, total: 125 ms
Wall time: 38.8 ms # ~ 2.0 GB/s and 130x faster than zlib
>>> %time acp = a.copy() # a direct copy using memcpy() behind the scenes
CPU times: user 15 ms, sys: 8 ms, total: 23 ms
Wall time: 22.6 ms # ~ 3.5 GB/s, just a 1.7x faster than Blosc
Now, let’s look at the compression ratios:
>>> len(zpacked)
52994692
>>> len(bytes_array) / float(len(zpacked))
1.5095851486409242 # zlib achieves a 1.5x compression ratio
>>> len(bpacked)
7641156
>>> len(bytes_array) / float(len(bpacked))
10.469620041784253 # blosc reaches more than 10x compression ratio
Wow, it looks like Blosc is very efficient at compressing binary data. How do we decompress? Well, it works exactly the same way as with zlib:
>>> %time bytes_array2 = zlib.decompress(zpacked)
CPU times: user 345 ms, sys: 9 ms, total: 354 ms
Wall time: 354 ms # ~ 225 MB/s
>>> %time bytes_array2 = blosc.decompress(bpacked)
CPU times: user 82 ms, sys: 10 ms, total: 92 ms
Wall time: 36.3 ms # ~ 2.2 GB/s and ~ 10x times faster than zlib
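If you are not running under IPython, you can get similar measurements with the standard library alone. The helper below is only a sketch (the bench function and its output format are invented for illustration); it times a single call and reports throughput relative to the size of the input buffer:

import time
import zlib
import blosc
import numpy as np

def bench(label, func, buf):
    # Illustrative helper: time one call of func(buf) and print wall time and MB/s.
    t0 = time.perf_counter()
    result = func(buf)
    elapsed = time.perf_counter() - t0
    print("%-18s %7.3f s  (%8.1f MB/s)" % (label, elapsed, len(buf) / 2**20 / elapsed))
    return result

a = np.linspace(0, 100, int(1e7))
bytes_array = a.tobytes()

zpacked = bench("zlib.compress", zlib.compress, bytes_array)
bpacked = bench("blosc.compress", lambda b: blosc.compress(b, typesize=8), bytes_array)
bench("zlib.decompress", zlib.decompress, zpacked)
bench("blosc.decompress", blosc.decompress, bpacked)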
Using different compressors inside Blosc¶
Since Blosc 1.3.0, you can use different compressors inside it, which allows these compressors to leverage Blosc’s powerful multi-threading and shuffling machinery.
The examples above were using the default ‘blosclz’ compressor. Here is another example using ‘zlib’:
>>> %time bpacked = blosc.compress(bytes_array, typesize=8, cname='zlib')
CPU times: user 1.09 s, sys: 15 ms, total: 1.1 s
Wall time: 290 ms # ~ 275 MB/s and 18x faster than plain zlib
So, by using Zlib inside Blosc we can make it work at speeds up to 18x faster than plain Zlib. How can that be? Well, as said before, Blosc has efficient machinery for dealing with binary data (shuffling) and for leveraging multithreading. In addition, it compresses data in block sizes that are typically smaller than Zlib’s, so the cost of compressing is further reduced.
In terms of compression ratio, ‘zlib’ inside Blosc behaves very well too:
>>> len(bpacked)
1011304 # ~ 7.5x smaller than blosclz and ~ 50x smaller than plain zlib
So, ‘zlib’ here does a much better compression job than ‘blosclz’, although at the expense of being slower (~7.5x).
Decompression speed is pretty good too:
>>> %time bytes_array2 = blosc.decompress(bpacked)
CPU times: user 209 ms, sys: 9 ms, total: 218 ms
Wall time: 67.6 ms # ~ 1.2 GB/s and 5x faster than plain zlib
So, when mixing Zlib and Blosc, we can easily achieve decompression speeds above 1 GB/s, which is quite impressive for a relatively slow compressor like Zlib.
You can play with other compressors too, like ‘lz4’, ‘lz4hc’ and ‘snappy’. ‘lz4’ and ‘snappy’ are in the same class as ‘blosclz’, so you can expect similar results. However, ‘lz4hc’ is a variation of ‘lz4’ that typically spends more time compressing in exchange for a better compression ratio, so it is very good for read-only data.
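If you want to explore these trade-offs yourself, you can simply loop over the codecs your Blosc build ships with. A small sketch (the exact list returned by blosc.compressor_list() depends on how C-Blosc was compiled):

import time
import blosc
import numpy as np

a = np.linspace(0, 100, int(1e7))
bytes_array = a.tobytes()

for cname in blosc.compressor_list():    # e.g. ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
    t0 = time.perf_counter()
    packed = blosc.compress(bytes_array, typesize=8, cname=cname)
    elapsed = time.perf_counter() - t0
    # report wall time and achieved compression ratio for each codec
    print("%-8s %6.3f s  ratio: %5.1fx" % (cname, elapsed, len(bytes_array) / len(packed)))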
Packaging NumPy arrays¶
Want to use blosc to compress and decompress NumPy objects without having to worry about passing the typesize for optimal compression, or having to create the final container for decompression? blosc comes with the pack_array and unpack_array functions to perform this in a handy way:
>>> a = np.linspace(0, 100, int(1e7))
>>> %time packed = blosc.pack_array(a)
CPU times: user 172 ms, sys: 84 ms, total: 256 ms
Wall time: 151 ms
>>> %time a2 = blosc.unpack_array(packed)
CPU times: user 116 ms, sys: 60 ms, total: 176 ms
Wall time: 104 ms
>>> np.all(a == a2)
True
Although this is a convenient way of compressing/decompressing NumPy arrays, this method uses pickle/unpickle behind the scenes. This step implies additional copies, which take both memory and time.
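On the plus side, because the array metadata travels with the pickle, pack_array round-trips shape and dtype for you, so it also works for multidimensional or non-float arrays without extra bookkeeping. A quick sketch:
>>> b = np.arange(24, dtype=np.int32).reshape(4, 6)
>>> b2 = blosc.unpack_array(blosc.pack_array(b))
>>> b2.shape, b2.dtype
((4, 6), dtype('int32'))
>>> (b == b2).all()
True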
Compressing from a data pointer¶
To avoid the data copy problem described in the previous section, blosc comes with a couple of lower-level functions: compress_ptr and decompress_ptr. Here they are in action:
>>> %time c = blosc.compress_ptr(a.__array_interface__['data'][0], a.size,
...                              a.dtype.itemsize, 9, True)
CPU times: user 144 ms, sys: 0 ns, total: 144 ms
Wall time: 37.2 ms
>>> a2 = np.empty(a.size, dtype=a.dtype)
>>> %time blosc.decompress_ptr(c, a2.__array_interface__['data'][0])
CPU times: user 80 ms, sys: 0 ns, total: 80 ms
Wall time: 24.9 ms
80000000
>>> (a == a2).all()
True
As you see, these are really low-level functions: you have to pass actual pointers to where the data is, as well as the size and itemsize (for compression). Needless to say, it is very easy to cause a segfault by passing incorrect parameters to these functions (a wrong pointer or a wrong size).
On the other hand, and contrary to the pack_array / unpack_array method, the compress_ptr / decompress_ptr functions do not need to make internal copies of the data buffers, so they are extremely fast (as fast as the C-Blosc library can be), but you have to provide a container when doing the de-serialization.
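If you want the speed of the pointer-based functions with a bit less risk, it is easy to wrap them for NumPy arrays. The helpers below are purely illustrative (compress_ndarray and decompress_ndarray are not part of the blosc API); note that the caller still has to remember the shape and dtype, because compress_ptr does not store them:

import blosc
import numpy as np

def compress_ndarray(arr):
    # Illustrative wrapper: compress a C-contiguous NumPy array via compress_ptr
    # (clevel=9 and shuffling enabled, as in the example above).
    assert arr.flags['C_CONTIGUOUS'], "compress_ptr needs a contiguous buffer"
    return blosc.compress_ptr(arr.__array_interface__['data'][0],
                              arr.size, arr.dtype.itemsize, 9, True)

def decompress_ndarray(packed, shape, dtype):
    # Illustrative wrapper: decompress into a freshly allocated array.
    out = np.empty(shape, dtype=dtype)
    blosc.decompress_ptr(packed, out.__array_interface__['data'][0])
    return out

a = np.linspace(0, 100, int(1e7))
c = compress_ndarray(a)
a2 = decompress_ndarray(c, a.shape, a.dtype)
assert (a == a2).all()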
Packing NumPy arrays with Bloscpack¶
While pack_array / unpack_array have been designed for convenience and compress_ptr / decompress_ptr have been designed for speed, there is also a third option that combines the best of both worlds: Bloscpack. Since version 0.4.0, Bloscpack is able to natively de/serialize NumPy arrays:
>>> import bloscpack as bp
>>> %time bp_packed = bp.pack_ndarray_str(a)
CPU times: user 152 ms, sys: 20 ms, total: 172 ms
Wall time: 76.8 ms
>>> %time bp_unpacked = bp.unpack_ndarray_str(bp_packed)
CPU times: user 100 ms, sys: 8 ms, total: 108 ms
Wall time: 58 ms
>>> (a == bp_unpacked).all()
True
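Bloscpack can also serialize arrays straight to disk. The file-based helpers are assumed below to be named pack_ndarray_file / unpack_ndarray_file; depending on your Bloscpack release they may carry slightly different names, so check its documentation:
>>> bp.pack_ndarray_file(a, 'a.blp')        # write the compressed array to a file
>>> a3 = bp.unpack_ndarray_file('a.blp')
>>> (a == a3).all()
True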