Python 2 and Unicode

I’ve been meaning to write this up for a while, so let’s see how we go.

One of the problems I have with Python 2 at the moment is that Unicode support is a bit hit and miss. The trouble is that Unicode was a bolt-on extra in Python 2.x. In the brave new world of Python 3, they’ve actually fixed this up properly: Unicode strings are the default, and there is a new type, bytes, to represent byte strings. Unless you’re doing raw I/O, you really do want Unicode (hint: what do you expect len("£") to be? As a UTF-8 encoded byte string it is 2; as a unicode object it is 1). To give an example:

a = "foo"

In Python 2, this creates a str, which is a byte string. If you want a unicode object, you either need to mark the literal explicitly with a u prefix:

a = u"foo"

Or, if you’re using Python 2.6 or later:

from __future__ import unicode_literals

a = "foo"
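To make the earlier len("£") hint concrete, here is a minimal sketch of the difference between counting characters and counting bytes. The variable names are purely illustrative, and the pound sign is written as an escape so the result doesn’t depend on the source file’s encoding; it assumes UTF-8 as the byte encoding.

```python
# "£" written as a Unicode escape, so this is a text (unicode) object
# on both Python 2 and Python 3.
pound = u"\u00a3"

# The same character encoded as UTF-8 takes two bytes (0xC2 0xA3).
encoded = pound.encode("utf-8")

text_length = len(pound)    # counts characters
byte_length = len(encoded)  # counts bytes

print(text_length)  # 1
print(byte_length)  # 2
```

A plain "£" literal in a UTF-8 encoded Python 2 source file is the byte-string case, which is why its length comes out as 2 there.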

Which is all nice, except that libraries are where it falls down, as per usual. The situation on 2.x is basically a mess, which is not entirely surprising given the origins of the unicode type in Python 2.

Here’s a small selection of the unicode support in Python 2.x libraries:

csv – str only for input and output
ElementTree (behaviour faithfully reproduced in lxml) – Depends. Accepts unicode, but on return it tries to coerce everything to ASCII-encoded str objects and, if that fails, returns the original internal unicode object.
PyGTK – Accepts unicode, always returns UTF-8 encoded str
PyQt – either QString, or if you switch on the v2 API, unicode objects.
Django – Returns unicode objects
psycopg2 – Returns str in the client encoding, unless you specify you want unicode objects (globally or per connection – Django sets this globally)

As for other libraries, I can’t really speak, but I would guess the situation is not much improved there either.

So what are the solutions?

1) Close your eyes, put your fingers in your ears and pretend no-one uses anything but ASCII.

2) Try to only use libraries that actually have proper Unicode support. Oh, and don’t forget to declare every string literal as unicode while you’re there, or put this at the top of each module:

from __future__ import unicode_literals

3) Use Python 3 (though quite a few libraries still haven’t been ported to it yet, so it is perhaps no better than option 2).
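For anyone stuck on option 2 with a library that still hands back byte strings, one common workaround is to decode at the boundary and keep everything inside your own code as unicode. The following is only a rough sketch: to_text and to_bytes are hypothetical helper names, not part of any library mentioned above, and it assumes the bytes are UTF-8 (for psycopg2 it would really be whatever the client encoding is).

```python
# Hypothetical helpers, not part of any library mentioned in this post.
text_type = type(u"")    # unicode on Python 2, str on Python 3
binary_type = type(b"")  # str on Python 2, bytes on Python 3


def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass everything else through unchanged."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Encode text to a byte string; pass everything else through unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


# A UTF-8 byte string, standing in for what a str-returning library hands back.
raw = b"\xc2\xa3100"
label = to_text(raw)          # text: the pound sign is now a single character
round_trip = to_bytes(label)  # bytes again, e.g. for a str-only API such as csv
```

Because both helpers pass through values that are already the right type, they can be wrapped around a library call without first checking whether that call happened to return str or unicode.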

Posted on October 28, 2010 at 9:31 pm by Carlos Corbacho
In: Python
