Performance Benchmarking: Pandas DataFrame vs Python List of Dictionaries

Problem

In the initial stages of a project, we sometimes have to choose between storing data in Pandas DataFrames or in native Python lists of dictionaries. The two data structures look similar enough to perform the same tasks; we can even view a list of dictionaries as a less complex Pandas DataFrame, in which each dictionary in the list corresponds to a row of the DataFrame.

The question then arises: given the increased complexity and overhead of a Pandas DataFrame, should we always default to Python lists of dictionaries when performance is the primary consideration?

The answer, it would seem, is no. We demonstrate this by examining one use case: element-wise assignment*.

Setup

To run our experiment on real data, we will use a dataset containing the coordinates of all New York hotels. We will do the comparison using two different functions: a simple summation and a Haversine function.

The dataset and the Haversine function we will use are the same ones used by Sofia Heisler in her PyCon 2017 presentation.

You can also find the source code for this blog post on GitHub.

Comparison 1 - Summation

For the list, we will utilise a straightforward looping construct.
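As a rough sketch, the two summation implementations might look like the following. The column names `latitude` and `longitude`, the sample rows, and the choice to add the two coordinates together are assumptions made for illustration; they are not taken from the original source code.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has one row per hotel.
records = [
    {"latitude": 40.75, "longitude": -73.99},
    {"latitude": 40.71, "longitude": -74.01},
    {"latitude": 40.76, "longitude": -73.98},
]
df = pd.DataFrame(records)


def sum_dataframe(frame):
    # Vectorised: a single column-wise addition over the whole DataFrame.
    frame["total"] = frame["latitude"] + frame["longitude"]
    return frame


def sum_list(rows):
    # Element-wise: loop over each dictionary and assign the result back onto it.
    for row in rows:
        row["total"] = row["latitude"] + row["longitude"]
    return rows


sum_dataframe(df)
sum_list(records)
```

Both versions leave the result attached to the original structure: as a new `total` column on the DataFrame, and as a new `total` key on each dictionary.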

Running both implementations with timeit, at 10 runs of 100 repeats each, returns the following result:

DataFrame: 0.019s for best run.
List:      0.021s for best run.
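The timing procedure can be sketched with the standard library's `timeit.repeat`. Reading "10 runs of 100 repeats each" as `repeat=10` and `number=100` is an assumption, and `best_run` and `workload` are hypothetical names used only for this illustration.

```python
import timeit


def best_run(func, repeat=10, number=100):
    # timeit.repeat returns one total time per run;
    # each run calls func `number` times.
    timings = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)


def workload():
    # Hypothetical stand-in for the function being benchmarked.
    return sum(range(1000))


print(f"best run: {best_run(workload):.4f}s")
```

Reporting the best run rather than the mean follows the usual `timeit` advice, since slower runs mostly reflect interference from other processes rather than the code under test.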

Comparison 2 - Haversine

The Haversine function is much more complicated, and for our DataFrame computation we will use the DataFrame-optimised version provided by Sofia Heisler.
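A vectorised Haversine computation along these lines might look like the sketch below. It implements the standard Haversine formula with NumPy and is not a verbatim copy of the presentation's code; the Earth radius of 3959 miles, the reference point, the sample rows, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3959  # assumed mean Earth radius


def haversine_vectorised(lat1, lon1, lat2, lon2):
    # Accepts scalars, NumPy arrays, or pandas Series; NumPy broadcasts over them.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return EARTH_RADIUS_MILES * 2 * np.arcsin(np.sqrt(a))


# Hypothetical sample rows and reference point.
df = pd.DataFrame({"latitude": [40.75, 40.71], "longitude": [-73.99, -74.01]})
df["distance"] = haversine_vectorised(
    40.671, -73.985, df["latitude"], df["longitude"]
)
```

Passing whole columns means the trigonometric functions each run once over all rows, rather than once per row in a Python loop.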

For our list of dictionaries, I have modified that Haversine function, speeding it up by replacing the numpy functions with their equivalents from the built-in math library.
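A scalar variant built on the `math` library, applied to each dictionary in turn, might look like this sketch. As before, the Earth radius, reference point, sample rows, and key names are assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3959  # assumed mean Earth radius


def haversine_scalar(lat1, lon1, lat2, lon2):
    # The math functions work on plain floats, avoiding NumPy's per-call overhead.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return EARTH_RADIUS_MILES * 2 * math.asin(math.sqrt(a))


# Hypothetical sample rows and reference point.
hotels = [
    {"latitude": 40.75, "longitude": -73.99},
    {"latitude": 40.71, "longitude": -74.01},
]
for hotel in hotels:
    # Element-wise assignment: compute per dictionary, store on the same dictionary.
    hotel["distance"] = haversine_scalar(
        40.671, -73.985, hotel["latitude"], hotel["longitude"]
    )
```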

Again, we run both implementations with timeit, at 10 runs of 100 repeats each. This returns the following result:

DataFrame: 0.027s for best run.
List:      0.330s for best run.

Results

From the above, we can see that for summation, the DataFrame implementation is only slightly faster than the list implementation (0.019s vs 0.021s, roughly 10%). The difference is much more pronounced for the more complicated Haversine function, where the DataFrame implementation is about 12 times faster than the list implementation (0.027s vs 0.330s).

This is surprising.

Some further digging establishes the reasons for this: Pandas implements additional optimisations for many use cases, some of them in C code. Optimisations such as vectorisation add a level of power to Pandas DataFrames that would be hard and/or time-consuming to achieve with native data structures such as a list of dictionaries.

Additional notes

*Element-wise assignment here refers to iterating over a list of dictionaries, running a computation on the values of each dictionary, and assigning the result of that computation back onto the same dictionary.
