Psycopg2 and large result sets

Psycopg2 has a bit of a gotcha when it comes to fetching result sets that can catch out the unsuspecting developer. By default, it does not actually use server-side cursors, but simply emulates them. In practical terms, this means that the entire result set of your query is fetched into the client's memory.

It is documented these days, but buried quite far down if you’re not looking for it:

http://initd.org/psycopg/docs/usage.html#server-side-cursors

Practically speaking though, what does this actually mean? From a DB API point of view, there is no difference memory-wise between:

.fetchone()

and

.fetchall()

The entire result set has already been fetched into memory; all you are doing is controlling how much of it you read into Python at a time.
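Conceptually, the default (client-side) cursor behaves like this toy sketch. This is hypothetical illustrative code, not psycopg2's actual implementation: the point is that execute() has already pulled every row into a local buffer, and fetchone()/fetchall() merely read from that buffer.

```python
class ClientSideCursor:
    """Toy model of a client-side cursor (illustrative only)."""

    def __init__(self, all_rows):
        # By the time you can call fetch*(), execute() has already
        # pulled *every* row into client memory.
        self._buffer = list(all_rows)
        self._pos = 0

    def fetchone(self):
        # Reads one row from the local buffer; memory use is unchanged.
        if self._pos >= len(self._buffer):
            return None
        row = self._buffer[self._pos]
        self._pos += 1
        return row

    def fetchall(self):
        # Hands over the rest of the same buffer, all at once.
        rows = self._buffer[self._pos:]
        self._pos = len(self._buffer)
        return rows


cur = ClientSideCursor([(i,) for i in range(20000)])
first = cur.fetchone()   # one row, but 20,000 are already resident
rest = cur.fetchall()    # the remaining 19,999, from the same buffer
```

Either way, the full 20,000 rows were sitting in client memory the moment the query ran.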

By and large, this isn't actually a bad thing, as long as you don't execute queries that return huge result sets. Since you generally don't need to do that, the key thing is to write your client code so that as much filtering as possible happens at the SQL layer, returning as small a result set as possible.

As an example, consider something like this:

cursor = connection.cursor()
cursor.execute("SELECT * FROM cars")
for row in cursor:
    if row[1] == "blue":
        return row

Using an ORM such as the Django one, the equivalent would look something like:

for car in Car.objects.all():
    if car.colour == "blue":
        return car

In the above, we’re trying to find the first car that is blue. (It’s rather contrived that we’re calling .all(), but you could also imagine some other filter that returns a large number of car records). Now, let’s say that our ‘Car’ table has 20,000 cars in it. In both cases, it naively appears that we’re only reading in one record at a time, but this is not quite the case.

As soon as we executed the query, psycopg2 loaded the entire result set, which in our case is the entire table, into memory. In the Django example, the only saving grace is that the Car objects are created lazily from each row, but that's it – the entire result set is still in memory!
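Here is the contrived example rewritten with the filter pushed down into SQL. This sketch uses the standard library's sqlite3 module purely so it is self-contained and runnable; the same query shape works in psycopg2 (with %s placeholders instead of ?).

```python
import sqlite3

# Build a stand-in "cars" table with 20,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER, colour TEXT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [(i, "blue" if i % 2 == 0 else "red") for i in range(20000)],
)

# Filter in SQL: the database hands back a single row, instead of
# shipping all 20,000 to Python to be filtered client-side.
row = conn.execute(
    "SELECT * FROM cars WHERE colour = ? ORDER BY id LIMIT 1", ("blue",)
).fetchone()
```

Only one row ever crosses into the client's result set, so it doesn't matter that the cursor isn't server-side.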

Whilst you could use named cursors (even with things like Django there are various ways to force PostgreSQL to use them), it’s generally not necessary. Simply try to do as much of your filtering as possible in SQL to keep the size of the result set small, rather than filtering in your Python code.
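If you do need to stream a genuinely large result set, a named (server-side) cursor might look like the following sketch. The function name, cursor name, and batch size here are illustrative assumptions; what matters is that passing name= to connection.cursor() makes psycopg2 create a server-side cursor, and itersize controls how many rows are fetched per network round trip.

```python
def stream_blue_cars(conn, batch_size=2000):
    """Iterate over a large result set without buffering it all client-side.

    Assumes `conn` is an open psycopg2 connection. The cursor name
    'car_stream' is arbitrary; naming the cursor is what makes
    psycopg2 declare a server-side cursor for it.
    """
    cur = conn.cursor(name="car_stream")
    cur.itersize = batch_size  # rows fetched per network round trip
    cur.execute("SELECT * FROM cars WHERE colour = %s", ("blue",))
    for row in cur:
        yield row
    cur.close()
```

(On the Django side, QuerySet.iterator() at least avoids instantiating every model object up front, and in more recent Django versions it also uses server-side cursors on PostgreSQL.)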

Posted on April 15, 2013 at 11:02 pm by Carlos Corbacho · Permalink
In: Linux, Python
