Stream Peekaboo With Python

The Python standard library provides a reasonably adequate module for reading delimited data streams, and there are modules available for reading everything from XLS and DIF documents to MARC data. One deficiency of many of these modules is that they cannot gracefully deal with whack data; in the real world data is never clean, never correctly structured, and you are lucky if it is accurate even on the rare occasion that it is correctly encoded.

For example, when Python's CSV reader meets a garbled line in a file it throws an exception and stops, and you're done. The exception does not tell you which record it could not parse; all you have is a traceback. Perhaps you can look at the last record in the output and guess that the error lies one record beyond that... maybe.
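To make the failure concrete, here is a small sketch. The sample data and the `read_until_error` helper are made up for illustration, and `strict=True` is my assumption to force the parser to reject the malformed quoting (the default dialect is more forgiving). The reader's `line_num` attribute does give a line count, but the raw text of the offending line is not part of the error.

```python
import csv
import io

# Hypothetical sample: the second line has stray text after a closing quote.
SAMPLE = 'id,name\n1,"Ann"x\n2,Bob\n'

def read_until_error(text):
    """Return (rows_read, error_message, line_num); error_message is None on success."""
    # strict=True is an assumption made here so that bad quoting raises csv.Error.
    reader = csv.reader(io.StringIO(text), strict=True)
    rows = []
    try:
        for row in reader:
            rows.append(row)
    except csv.Error as exc:
        # The message describes the problem but does not contain the raw line.
        return rows, str(exc), reader.line_num
    return rows, None, reader.line_num

rows, error, line_num = read_until_error(SAMPLE)
print(rows, error, line_num)
```

Everything after the bad line is lost, and recovering the line itself means going back to the file by hand.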

Fortunately most of these modules work with file-like objects: as long as the object they receive properly implements iteration, they will work. Using this strength it is possible to implement a Peekaboo on the input stream, which lets us see the unit of work currently being processed, or even pre-mangle that chunk.
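As a rough illustration of the iteration point: `csv.reader` accepts any iterable that yields strings, so even a plain generator can sit between the file and the parser. The `cleaned_lines` name and the NUL-stripping rule below are assumptions chosen for the sketch, not something the CSV module requires.

```python
import csv
import io

def cleaned_lines(handle):
    """Yield each line from handle with NUL characters removed (a sample pre-mangle)."""
    for line in handle:
        yield line.replace('\x00', '')

# Hypothetical input containing a stray NUL character in the second field.
source = io.StringIO('a,b\x00,c\n1,2,3\n')
rows = list(csv.reader(cleaned_lines(source)))
print(rows)
```

A generator like this can mangle the stream, but it cannot be asked afterwards which line it last handed over; that is what the class-based wrapper adds.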

Aside: The hardest part, at least for data that is not line-oriented, is defining the unit of work.
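One possible way to picture a unit of work that is not a single line: treat each blank-line-separated block as one record. This is only a sketch under that assumption; the `blocks` generator is a hypothetical name, and real formats will need their own rule for where a record ends.

```python
import io

def blocks(handle):
    """Yield runs of non-blank lines, joined into one string per block."""
    block = []
    for line in handle:
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the block collected so far.
            yield ''.join(block)
            block = []
    if block:
        # Emit the final block if the stream did not end with a blank line.
        yield ''.join(block)

# Hypothetical input: two records separated by a blank line.
units = list(blocks(io.StringIO('a\nb\n\nc\n')))
print(units)
```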

For example, here is a simple Peekaboo that allows for easy reporting of the line read by the CSV reader whenever that line does not contain the expected data:

import csv

class Peekaboo(object):

    def __init__(self, handle):
        self._h = handle
        self._h.seek(0)
        self._c = None

    def __iter__(self):
        for row in iter(self._h):
            self._c = row
            yield self._c

    @property
    def current(self):
        return self._c

class RecordFormatException(Exception):
    pass

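def import_record(record):
    # Verify record data: check field types, field count, etc.
    # Placeholder check (an assumption for illustration): reject empty records.
    valid = len(record) > 0
    if not valid:
        raise RecordFormatException()
    return record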

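if __name__ == '__main__':

    # On Python 3 the csv module wants a text-mode file opened with newline='';
    # on Python 2 open the file with 'rb' instead.
    with open('testfile.csv', newline='') as rfile:
        peekaboo = Peekaboo(rfile)
        # Hand the wrapper, not the raw file, to the CSV reader.
        for record in csv.reader(peekaboo):
            try:
                data = import_record(record)
            except RecordFormatException as exc:
                print('Format Exception Processing Record:\n{0}'.format(peekaboo.current))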
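For anyone who wants to try this without a file on disk, the following sketch runs the same idea against an in-memory stream. The sample data and the three-field rule are assumptions made up for the demonstration; the class is repeated only so the snippet runs on its own.

```python
import csv
import io

class Peekaboo(object):
    # Same wrapper as in the listing, repeated so this sketch is self-contained.

    def __init__(self, handle):
        self._h = handle
        self._h.seek(0)
        self._c = None

    def __iter__(self):
        for row in self._h:
            self._c = row  # remember the raw line before the parser sees it
            yield self._c

    @property
    def current(self):
        return self._c

# Hypothetical input: the second line is missing a field.
source = io.StringIO('1,Ann,ann@example.com\n2,Bob\n3,Cy,cy@example.com\n')
peekaboo = Peekaboo(source)
rejected = []
for record in csv.reader(peekaboo):
    if len(record) != 3:  # assumed rule: exactly three fields per record
        rejected.append(peekaboo.current)

print(rejected)
```

Because the reader pulls one line at a time, `current` still holds the raw text of the record being examined inside the loop. A quoted field that spans several physical lines would leave only its last line in `current`.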

Another use for a Peekaboo and the CSV reader is reading a delimited file that contains comments, where lines starting with a hash ("#") are to be ignored when reading the file.

class Peekaboo(object):

    def __init__(self, handle, comment_prefix=None):
        self._h = handle
        self._h.seek(0)
        self._c = None
        self._comment_prefix = comment_prefix

    def __iter__(self):
        for row in iter(self._h):
            self._c = row
            if self._comment_prefix and self._c.startswith(self._comment_prefix):
                # skip the line
                continue
            yield self._c

    @property
    def current(self):
        return self._c

if __name__ == '__main__':

    # Text mode, so that startswith("#") compares str with str on Python 3.
    rfile = open('testfile.csv', newline='')
    peekaboo = Peekaboo(rfile, comment_prefix="#")
    ...
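A quick way to check the comment-skipping behaviour is to feed the wrapper an in-memory stream. The sample lines below are invented for the sketch, and the class is repeated only so the snippet stands alone.

```python
import csv
import io

class Peekaboo(object):
    # Comment-aware wrapper from the listing, repeated so this sketch is self-contained.

    def __init__(self, handle, comment_prefix=None):
        self._h = handle
        self._h.seek(0)
        self._c = None
        self._comment_prefix = comment_prefix

    def __iter__(self):
        for row in self._h:
            self._c = row
            if self._comment_prefix and self._c.startswith(self._comment_prefix):
                continue  # skip comment lines; the parser never sees them
            yield self._c

    @property
    def current(self):
        return self._c

# Hypothetical input with two comment lines mixed into the data.
source = io.StringIO('# header comment\nid,name\n# another comment\n1,Ann\n')
rows = list(csv.reader(Peekaboo(source, comment_prefix='#')))
print(rows)
```

Only lines that begin with the prefix are dropped; a hash inside a field, or after leading whitespace, passes through untouched.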

The Peekaboo is nothing revolutionary; to experienced developers it is likely just obvious. But I've introduced it to enough Python developers to believe it worthy of a mention.
