2011-04-22

Compression & Decompression Of A Stream

So far in Python I had not found a good method / module for compressing and decompressing data as streams; most tools require whole files, which has some obvious limitations. But then I saw a mention of pyLZMA roll by. It supports compression and decompression of streams using the Lempel–Ziv–Markov chain algorithm (LZMA). The module is licensed under LGPL-2.1; not MIT, but at least it is the "Lesser" GPL. I've taken it for a spin and it has successfully compressed and decompressed all the data I've thrown at it (remember to always checksum your data).

import pylzma, hashlib

# Calculate the SHA checksum for our input file
i = open('Brighton.jpg', 'rb')
h1 = hashlib.sha1()
while True:
    tmp = i.read(1024)
    if not tmp: break
    h1.update(tmp)
h1 = h1.hexdigest()
print 'Input SHA Checksum: {0}'.format(h1)
    
# Compress the input file (as a stream) to a file (as a stream)
o = open('compressed.lzma', 'wb')
i.seek(0)
s = pylzma.compressfile(i)
while True:
    tmp = s.read(1)
    if not tmp: break
    o.write(tmp)
o.close()
i.close()

# Decompress the file (as a stream) to a file (as a stream)
i = open('compressed.lzma', 'rb')
o = open('decompressed.raw', 'wb')
s = pylzma.decompressobj()
while True:
    tmp = i.read(1)
    if not tmp: break
    o.write(s.decompress(tmp))
o.close()
i.close()

# Check the decompressed file
i = open('decompressed.raw', 'rb')
h2 = hashlib.sha1()
while True:
    tmp = i.read(1024)
    if not tmp: break
    h2.update(tmp)
h2 = h2.hexdigest()
print 'Result SHA Checksum: {0}'.format(h2)
if (h1 == h2): print 'OK!'

Of course a JPEG file doesn't compress much, but that makes it an even better test case.
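
For comparison, here is a rough sketch of the same round trip using the `lzma` module from the Python 3 standard library rather than pyLZMA. This is a swapped-in technique, not the pyLZMA API: it writes the .xz container by default, and the helper names (`sha1_of`, `compress_stream`, `decompress_stream`, `round_trip`) and the chunk size are made up for illustration. It works on any file-like objects, so in-memory buffers stand in for the files here.

```python
import hashlib
import io
import lzma

CHUNK = 64 * 1024  # read size; an arbitrary choice for this sketch


def sha1_of(stream):
    """Return the SHA-1 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(CHUNK), b''):
        digest.update(chunk)
    return digest.hexdigest()


def compress_stream(src, dst):
    """Compress src into dst one chunk at a time."""
    compressor = lzma.LZMACompressor()
    for chunk in iter(lambda: src.read(CHUNK), b''):
        dst.write(compressor.compress(chunk))
    # The compressor buffers internally; flush() emits whatever is left.
    dst.write(compressor.flush())


def decompress_stream(src, dst):
    """Decompress src into dst one chunk at a time."""
    decompressor = lzma.LZMADecompressor()
    for chunk in iter(lambda: src.read(CHUNK), b''):
        dst.write(decompressor.decompress(chunk))


def round_trip(data):
    """Compress and decompress data, then compare checksums."""
    original = io.BytesIO(data)
    packed = io.BytesIO()
    unpacked = io.BytesIO()

    compress_stream(original, packed)
    packed.seek(0)
    decompress_stream(packed, unpacked)

    original.seek(0)
    unpacked.seek(0)
    return sha1_of(original) == sha1_of(unpacked)
```

Calling `round_trip(b'stream me ' * 10000)` should return True if the data survives the trip; swapping the `io.BytesIO` objects for files opened in 'rb' / 'wb' mode gives the file-to-file version.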

2011-04-20

block_dump logging

There are lots of tools for studying a system's use of CPU and memory, but I/O is generally harder to track down.  A useful trick is available via block_dump.  Setting the value to "1" turns on block access logging to the kernel ring-buffer [aka dmesg] and a value of "0" turns it back off.  This means it can be turned on by a simple:
echo "1" > /proc/sys/vm/block_dump
This logs the accesses to the block storage as:
[ 2032.934178] postmaster(11528): READ block 5058592 on dm-3 (16 sectors)
[ 2032.934200] postmaster(11528): READ block 5058624 on dm-3 (32 sectors)
[ 2032.934240] postmaster(11528): READ block 3172800 on dm-3 (16 sectors)
[ 2032.945328] banshee-1(11267): dirtied inode 1051864 (banshee.db-journal) on dm-0
[ 2032.945336] banshee-1(11267): dirtied inode 1051864 (banshee.db-journal) on dm-0
[ 2033.042671] python(11518): READ block 9017928 on dm-2 (32 sectors)
[ 2033.055771] python(11518): dirtied inode 267260 (expatbuilder.pyc) on dm-2
[ 2033.055808] python(11518): READ block 9017960 on dm-2 (40 sectors)
[ 2033.412972] nautilus(11078): dirtied inode 410492 (dav:host=127.0.0.1,port=8080,ssl=false) on dm-0
[ 2033.413001] nautilus(11078): READ block 50855560 on dm-0 (40 sectors)
[ 2033.431011] nautilus(11078): dirtied inode 410596 (dav:host=127.0.0.1,port=8080,ssl=false-ab9de673.log) on dm-0
[ 2033.431044] nautilus(11078): READ block 50855736 on dm-0 (64 sectors)
[ 2034.221831] jbd2/dm-2-8(386): WRITE block 21261800 on dm-2 (8 sectors)
[ 2034.221887] jbd2/dm-2-8(386): WRITE block 21261808 on dm-2 (8 sectors)
Handy.
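
Once the log gets long it helps to summarise it. Below is a minimal sketch in Python 3 that totals the sectors read and written per process from lines in the format shown above. The regular expression is only inferred from that sample output, so it may need adjusting on other kernels, and the names `IO_LINE`, `SAMPLE` and `sectors_by_process` are invented for this example. The "dirtied inode" lines are simply skipped.

```python
import re
from collections import Counter

# Matches the READ / WRITE lines from the sample output above, e.g.
# [ 2032.934178] postmaster(11528): READ block 5058592 on dm-3 (16 sectors)
IO_LINE = re.compile(
    r'^\[\s*\d+\.\d+\]\s+'
    r'(?P<proc>.+?)\((?P<pid>\d+)\):\s+'
    r'(?P<op>READ|WRITE) block \d+ on (?P<dev>\S+) '
    r'\((?P<sectors>\d+) sectors\)'
)

# A few lines copied from the log above.
SAMPLE = [
    '[ 2032.934178] postmaster(11528): READ block 5058592 on dm-3 (16 sectors)',
    '[ 2032.934200] postmaster(11528): READ block 5058624 on dm-3 (32 sectors)',
    '[ 2032.945328] banshee-1(11267): dirtied inode 1051864 (banshee.db-journal) on dm-0',
    '[ 2034.221831] jbd2/dm-2-8(386): WRITE block 21261800 on dm-2 (8 sectors)',
    '[ 2034.221887] jbd2/dm-2-8(386): WRITE block 21261808 on dm-2 (8 sectors)',
]


def sectors_by_process(lines):
    """Total the sectors per (process name, operation) pair."""
    totals = Counter()
    for line in lines:
        match = IO_LINE.match(line.strip())
        if match:  # anything that is not a READ / WRITE line is ignored
            key = (match.group('proc'), match.group('op'))
            totals[key] += int(match.group('sectors'))
    return totals


totals = sectors_by_process(SAMPLE)
for (proc, op), sectors in totals.most_common():
    print('{0:<15} {1:<5} {2:>6} sectors'.format(proc, op, sectors))
```

Feeding it the lines from dmesg instead of `SAMPLE` gives a quick ranking of which processes are generating the most block I/O.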