You are here

python

Converting MBOX to ZIP

Task: Converting an MBOX format export of a mailbox into a ZIP file containing each message as a file named after the message-id of the email message. Every e-mail client worth a pinch of salt can export messages, or a mailbox, to an MBOX file.

Dropping An Element In An Iterative Parse

Using lxml's etree to iteratively parse an XML document and I wanted to drop a specific element from the stream...

        for event, element in etree.iterparse(self.rfile, events=("end",)):
            if (event == 'end') and (element.tag == 'row'):
                self.wfile.write(etree.tostring(element))
            elif (event == 'end') and (element.tag == name_of_element_to_drop):
                element.getparent().remove(element) # drop element

The secret sauce is: element.getparent().remove(element)

Reading BYTE Fields From An Informix Unload

Exporting records from an Informix table is simple using the UNLOAD TO command. This creates a delimited text file with a row for each record and the fields of the record delimited by the specified delimiter. Useful for data archive the files can easily be restored or processed with a Python script.

Informix Dialect With CASE Derived Polymorphism

I ran into an interesting issue when using SQLAlchemy 0.7.7 with the Informix dialect. In a rather ugly database (which dates back to the late 1980s) there is a table called "xrefr" that contains two types of records: "supersede" and "cross". What those signify doesn't really matter for this issue so I'll skip any further explanation. But the really twisted part is that while a single field distinquishes between these two record types - it does not do so based on a consistent value.

Failure to apply LDAP pages results control.

On a particular instance of OpenGroupware Coils the switch from an OpenLDAP server to an Active Directory service - which should be nearly seamless - resulted in "Failure to apply LDAP pages results control.". Interesting, as Active Directory certainly supports paged results - the 1.2.840.113556.1.4.319 control.

But there is a caveat! Of course.

LDAP Search For Object By SID

All the interesting objects in an Active Directory DSA have an objectSID which is used throughout the Windows subsystems as the reference for the object. When using a Samba4 (or later) domain controller it is possible to simply query for an object by its SID, as one would expect - like "(&(objectSID=S-1-...))". However, when using a Microsoft DC searching for an object by its SID is not as straight-forward; attempting to do so will only result in an invalid search filter error.

Deduplicating with group_by, func.min, and having

You have a text file with four million records and you want to load this data into a table in an SQLite database. But some of these records are duplicates (based on certain fields) and the file is not ordered. Due to the size of the data loading the entire file into memory doesn't work very well. And due to the number of records doing a check-at-insert when loading the data is also prohibitively slow. But what does work pretty well is just to load all the data and then deduplicate it.

Concerning Decorators

Decorators are a powerful as well as arcanely constructed feature of Python. Over at his BLOG Graham Dumpleton has written a series of [8 so far] articles covering Python decorator's in-depth. Lots of information here along with abundant examples.

Tags: 

Stream Peekaboo With Python

The Python standard libary provides a reasonably adequate module for reading delimited data streams and there are modules available for reading everything from XLS and DIF documents to MARC data. One definiciency of many of these modules is the ability to gracefully deal with whack data; in the real world data is never clean, never correctly structured, and you are lucky if it is accurate even on the rare occasion that it is correctly encoded.

For example, when Python's CSV reader meets a garbled line in a file it throws an exception and stops, and you're done. And it does not report what record it could not parse, all you have is a traceback. Perhaps in the output you can look at the last record and guess that the error lies one record beyond that... maybe.

Fortunately most of these modules work with file-like objects. As long as the object they receive properly implements iteration they will work. Using this strength it is possible to implement a Peekaboo on the input stream which allows us to see what the current unit of work being currently processed is, or even to pre-mangle that chunk.

Complex Queries With SQLAlchemy (Example#1)

There are lots of examples of how to use SQLAlchemy to provide your Python application with a first-rate ORM. But most of these examples tend to model very trivial queries; yet the real power of SQLAlchemy, unlike many ORM solutions, is that it doesn't hide / bury the power of the RDBMS - and if you aren't going to use that power why bother with an RDBMS at all [Aren't NoSQL solutions the IT fad of the year? You could be so hip!].

Pages

Theme by Danetsoft and Danang Probo Sayekti inspired by Maksimer