
Introduction

In this post, we evaluate the memory footprint in Python 3 of data stored in various tabular formats. In particular, we compare Pandas DataFrames with two JSON-like data structures: a list of dictionaries and a dictionary of lists.

These are three different ways to store table-like data, that is, data represented by rows and columns. In this examination, we ignore any questions regarding efficient read/write or lookups. We are concerned with one question only: which approach uses the least memory?

Dataset

We generate a nonsense (but very large) test dataset for this experiment, using a list of some popular dog breeds. This list is definitely not biased at all, and all of them are definitely dogs.

To keep the test general enough for most use cases, the dataset contains at least three primitive data types: str, int and float.
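The generation script itself is not shown, so here is a minimal sketch of how such a dataset could be built. The breed list, the value ranges and the helper name make_rows are assumptions for illustration. The column names match the ones used later in the post, and their types (str, int, float) are guessed from the paragraph above.

```python
import random

# Hypothetical breed list; the post does not show the one it used.
BREEDS = ["Labrador", "Poodle", "Beagle", "Corgi", "Shiba Inu"]


def make_rows(n_rows, seed=0):
    """Build n_rows records, each with one str, one int and one float field."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    return [
        {
            "breed": rng.choice(BREEDS),    # str
            "count": rng.randint(0, 1000),  # int
            "barks": rng.random() * 100.0,  # float
        }
        for _ in range(n_rows)
    ]


rows = make_rows(1_000)
print(rows[0])
```

Raising n_rows into the millions gives a dataset large enough for the differences between the three formats to show clearly.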

Experiments

We run simple calculations for each data structure variant.

DataFrames

Measuring the memory of a DataFrame is relatively simple, and can be done with a built-in method: DataFrame.memory_usage.

This gives the following result:

84,613,093 B
84.61 MB
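For reference, a measurement of this kind can be sketched on a tiny frame as follows. The three sample rows are made up, and whether the figure above was taken with deep=True is not stated, so treat that flag as an assumption. It matters because, for object-dtype columns such as Python strings, memory_usage only counts the string contents when deep=True is passed.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this sketch.
df = pd.DataFrame(
    {
        "breed": ["Labrador", "Poodle", "Beagle"],
        "count": [3, 5, 8],
        "barks": [1.5, 2.25, 4.0],
    }
)

# memory_usage returns a Series of bytes per column (plus the index).
per_column = df.memory_usage(index=True, deep=True)
total_bytes = int(per_column.sum())

print(per_column)
print(f"{total_bytes:,} B")
print(f"{total_bytes / 1e6:.2f} MB")  # decimal megabytes, as in the post
```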

List of Dictionaries

Measuring lists of dictionaries is not as straightforward as the above.

To get the size of a native Python data structure, we can use the function sys.getsizeof. However, this only gives us the size (in bytes) of the object itself, without including the size of the objects nested inside it.

For example, given a list of integers [1, 2, 3], calling sys.getsizeof([1, 2, 3]) returns only the size of the list object itself: its header plus the memory allocated for the references to its elements. This size does not include the integer objects 1, 2 or 3. For a deep dive into this problem, see this very informative Stack Overflow question.

As such, one has to iterate over each object within the list, and each key-value pair within each dictionary, to get the cumulative size of the data structure.
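A short sketch makes the shallow behaviour visible. The variable names are illustrative, and the exact byte counts depend on the Python build, so none are quoted here.

```python
import sys

numbers = [1, 2, 3]
shallow = sys.getsizeof(numbers)                    # the list object only
elements = sum(sys.getsizeof(n) for n in numbers)   # the int objects it refers to

print(f"list only: {shallow} B, elements: {elements} B")

# Swapping in a much larger element leaves the list's own size untouched,
# because the list still stores just one reference per slot.
numbers[0] = "x" * 10_000
unchanged = sys.getsizeof(numbers) == shallow
print(unchanged)
```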

This gives the following result:

515,041,877 B
515.04 MB
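The traversal described above could look roughly like the sketch below. It is not the code used to produce the figure; deep_sizeof and the sample rows are hypothetical. One assumption to note: objects shared between rows (for example, interned strings or small integers) are counted once, whereas a naive sum over every key and value would count them repeatedly and report a larger total.

```python
import sys


def deep_sizeof(obj, seen=None):
    """Sum sys.getsizeof over obj and every object nested inside it."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count each distinct object only once
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(
            deep_sizeof(key, seen) + deep_sizeof(value, seen)
            for key, value in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


# Made-up sample in the list-of-dictionaries layout.
rows = [
    {"breed": "Labrador", "count": 3, "barks": 1.5},
    {"breed": "Poodle", "count": 5, "barks": 2.25},
]
print(f"shallow: {sys.getsizeof(rows)} B, deep: {deep_sizeof(rows)} B")
```

Each row carries its own dictionary object, which goes a long way towards explaining why this layout is the heaviest of the three.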

Dictionary of Lists

An alternative way of representing tabular data in a JSON-like format is the dictionary of lists. It consists of a dictionary in which each key represents a column and points to a list. An index into each list then corresponds to a row index.

An example of such a data structure is as follows:

{
    "breed": [ ... ],
    "count": [ ... ],
    "barks": [ ... ],
}

Fortunately for us, we do not need to do any complex manipulation to arrive at the above data structure. Instead, we can use the to_dict method provided by Pandas on the DataFrame we created earlier, with the option orient="list".

This gives the following result:

136,593,711 B
136.59 MB
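As a small illustration of the conversion, the sketch below builds a three-row frame (the values are invented) and turns it into a dictionary of lists. It also shows how one row can be read back by index.

```python
import pandas as pd

# Invented sample values, using the column names from the example above.
df = pd.DataFrame(
    {
        "breed": ["Labrador", "Poodle", "Beagle"],
        "count": [3, 5, 8],
        "barks": [1.5, 2.25, 4.0],
    }
)

# One key per column, each pointing to a plain Python list of values.
columns = df.to_dict(orient="list")
print(columns)

# Row i is recovered by taking element i from every column list.
second_row = {name: values[1] for name, values in columns.items()}
print(second_row)
```

Compared with the list of dictionaries, this layout stores the column names once and needs only one list per column rather than one dictionary per row.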

Conclusion

We can see that the Pandas DataFrame, despite its added complexity, has a significantly smaller footprint than a list of dictionaries, and a smaller one than even a dictionary of lists. The latter two are roughly 6.1 times and 1.6 times the size of the DataFrame, respectively.
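The ratios follow directly from the three byte counts reported above. A quick check, with variable names chosen for illustration:

```python
# Byte counts reported in the experiments above.
sizes = {
    "DataFrame": 84_613_093,
    "list of dictionaries": 515_041_877,
    "dictionary of lists": 136_593_711,
}

baseline = sizes["DataFrame"]
ratios = {name: size / baseline for name, size in sizes.items()}

for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.2f} MB ({ratios[name]:.2f}x the DataFrame)")
```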

We can hence conclude that the use of DataFrames can be a useful non-trivial optimisation in certain use cases, especially when RAM capacity is an issue.