Dependency Reproducibility in Python 3

Introduction

Python dependency management can be a big headache. Many a time I have returned to a past Python project with its varied dependencies and attempted to reinstall all the libraries it depends upon, only to be faced with a raft of compatibility issues.

Worse, ensuring that the same dependencies are loaded in development and in production can be another challenge in and of itself.

This highlights important problems when it comes to environment reproducibility.

What is Reproducibility

My understanding of reproducibility is rooted in scientific epistemology.

In that context, reproducibility refers to the ability of others to successfully replicate the results of your scientific experiments. The idea is that if someone (or your future self) takes your experiment and replicates all of its processes exactly, in the exact same context, then ceteris paribus – all else being equal – they should get the exact same result.

From a software engineering perspective, we can think of reproducibility in similar terms. The whole idea boils down to this: If someone else takes your code, sets it up from scratch, and runs it, it should always work in the same way, without fail. And by extension, if code is run in development, it should perform exactly the same way in production.

Think of it in functional programming or unit testing terms: we treat our code as a black box, give it a specific input, and always expect to receive back a specific kind of output. In a sense, we are concerned with guarantees of determinism. If this is not the case, our software is unreliable, and its claims and assertions are not reproducible.
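As a trivial sketch of that black-box view in Python (the function and values here are purely illustrative):

def double(x):
    return 2 * x

assert double(21) == 42  # same input, same output, every time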

The Problem in Python

This header is a bit of hyperbole. This isn’t really so much a problem with Python itself, but rather with how it is used – for machine learning and data science, which is what a good chunk of Python users, myself included, use it for.

The guarantee of reproducibility is especially important in such projects. The AI black box problem suggests that the inner workings of predictive models in machine learning are usually so obscure that humans can hardly understand how variables are being combined to make predictions.

Combined with the butterfly effect, the above can lead to some very undesirable extrapolations from input data, given slight perturbations. A perfect storm.

The above makes functional reproducibility in Python that much more important.

There are methods to reduce the impact of this problem at the data and algorithmic level. Models can be trained to be more generalisable and robust. For example, while training neural network models, strategies like L1/L2 regularisation can serve to reduce overfitting on training data to generate more robust models.
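As a minimal illustration, assuming a PyTorch setup like the one in the requirements.txt shown later, L2 regularisation can be applied through the optimiser's weight_decay parameter; the toy model and hyperparameters here are purely illustrative:

import torch
from torch import nn

model = nn.Linear(10, 1)  # toy model standing in for a real network

# weight_decay adds an L2 penalty on the weights, discouraging overfitting
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)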

But what if the perturbation stems from a project’s dependencies itself?

Given the non-deterministic nature of many machine learning models, we always want to reduce the impact of randomness to improve reproducibility. And this is one of the things that keeps me up at night – how do I make sure that the models I build in development have the exact same setup and context in production?
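For the randomness our own code controls, the usual first step is to pin the random seeds – a minimal sketch, assuming the NumPy/PyTorch stack used later in this post:

import random
import numpy as np
import torch

SEED = 42                 # the value is arbitrary; fixing it everywhere is what matters

random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy's global RNG
torch.manual_seed(SEED)   # PyTorch's RNG (CUDA adds further sources of nondeterminism)

Pinning the dependency context is the other half of the problem, and that is what the rest of this post is about.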

Solution

Dependency management system of choice

Python has a good many tools for managing dependencies. My choice here is venv, which comes pre-packaged with Python (with pip doing the actual installs). Where possible I like to keep things simple, so my preference is to use built-in utilities and libraries.

As a bonus, it is also easy to use with Docker, which will be important for production use.

We also initialise a localised environment at the project root and only install project-related dependencies there. This gives us dependency isolation, preventing pollution from other Python packages installed on the same machine.

mkdir path/to/project             # create the project folder
cd path/to/project                # go into the project folder root
python3 -m venv path/to/venv      # create the isolated Python environment in the specified folder
source path/to/venv/bin/activate  # activate the isolated environment

Example:

mkdir <project-name> 
cd <project-name>
python3 -m venv ./venv            # packages will be installed at the project root
source venv/bin/activate

I tend to use venv as the name of the environment folder for two reasons: (1) it makes it obvious where my dependencies are, and (2) it lets me activate the environment with the same alias in all my Python projects:

alias p="source venv/bin/activate"
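When I am done working in the environment, I drop back out of it with the built-in deactivate command:

deactivate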

Also, if you use some kind of version control system (and you should!), don’t forget to exclude your venv environment from it. For example, when using Git, I would add venv to my .gitignore file.
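One quick way to do this from the shell (assuming the venv folder name used above):

echo "venv/" >> .gitignore   # keep the environment out of version control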

Handcraft top-level dependencies

The typical approach to saving a list of dependencies is to do a data dump with pip freeze. I have found this approach to be problematic. The pain is most pronounced when version numbers get out of sync, and you find yourself hunting down and updating obscure transitive packages, building and rebuilding environments, all while trying not to pull your hair out.

Not fun at all.
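For context, the freeze-based dump is a one-liner, which is part of its appeal, but it pins every transitive dependency at whatever exact version happens to be installed:

pip freeze > requirements.txt   # dumps every installed package, pinned to exact versions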

What I prefer to do is maintain a list of high-level dependencies and let the package manager handle dependency resolution. We do this by handcrafting a requirements.txt file and including in it only the top-level dependencies. The reason for this is two-fold:

  1. We really only care about the top-level dependencies, because these are the ones we use directly. At the end of the day it makes much more economic sense to delegate dependency management down to the modules that require those dependencies. This means trusting, to a certain extent, the module and package authors to ensure proper versioning and robust dependency management at their respective layers.
  2. The fewer dependencies we need to pay attention to, the easier it is to debug issues if reinstalling our dependencies does fail or cause problems.
We create this file at the project root:

touch requirements.txt

An example of a requirements.txt file:

pandas==1.4.2
seaborn==0.11.2
torch==1.11.0

To install the packages it lists, run the following command.

pip install -r requirements.txt

This is a trick I learned a while ago, through a Stack Overflow question that I unfortunately can no longer find.

In hindsight, I realise that this approach is similar to the Node.js way of handling dependencies, which I find extremely elegant. In that approach we simply specify the packages we want, either in the package.json file or through a CLI command like yarn add <package-name>, and these top-level dependencies are added to the local set of packages. The dependency manager then handles dependency resolution for us.

Moving to production: Containerisation with Docker

The use of a requirements.txt file also makes it easier to dockerise our Python applications. The following barebones Dockerfile shows how this can be done.

FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

First, we pull the Python base image that we want and set the working directory. Then it is a simple matter of copying the requirements.txt file over and running pip install.

Note the --no-cache-dir option. It prevents pip from caching downloaded packages, which helps keep the resulting Docker image small.
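A fuller Dockerfile would typically go on to COPY the application source and define a CMD; with that in place, building and running the image follows the standard Docker workflow (the image name below is just an example):

docker build -t my-python-app .   # build the image from the Dockerfile
docker run --rm my-python-app     # run a container from it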

Conclusion

The above is my own approach to dependency management for reproducibility across environments. Hope it helps!