When to use Python notebooks
In both data engineering and data science, the use of notebooks is widespread. It is very easy and intuitive to open a Jupyter/Databricks notebook in your favourite browser and start executing commands. Pretty much anyone can train a machine learning model in 3 or 4 lines. Writing code to migrate 100 tables from one storage account to another can be done in a matter of minutes. This ease of use is a positive aspect of notebooks, but they can also be a double-edged sword if you don't know when to stop using them.
As a data engineer, I have come across many large projects that are purely developed with notebooks. Notebooks are a powerful tool, do not get me wrong, but they are not suitable for all use cases. In this document I will go through the main reasons why notebooks are not the best option when we want to create a serious product or a library/framework that can be used across multiple projects.
This post is mainly focused on Python and Databricks and, of course, is very opinionated. I will be happy to discuss any of the points presented here with anyone who disagrees.
What is the alternative?
OK, notebooks are not good enough for you, so what is the alternative? Very good question! The standard way to produce and distribute Python software is using packages. A package is nothing more than a series of files with code that is somehow related. In other words, a compressed file containing code created for a specific purpose.
You have probably used packages before. For example, when you run pip install pandas, you are telling
pip to look for the pandas package in the public package
index and install it in your environment.
Instead of writing notebooks, the alternative proposed in this post is to create one or more Python packages and maintain them as normal Python projects. Software engineering has evolved a lot in the last 50 years, and the development of a Python package is compatible with the best practices that came out of it. Notebooks, on the other hand, are a step backwards in many of these aspects. Keep reading to find out what I mean.
Although this post is mainly about Python, the same ideas apply to other languages. For example, Scala can be used in notebooks, but you can also create JAR packages (the equivalent of Python's wheel) using sbt.
Products and libraries/frameworks
Products should be reliable and well tested, and they need to follow the best software engineering practices.
Libraries/frameworks should be easy to use across different projects, and more than one version should exist at the same time to support old projects that are not up to date. Of course, the more projects use a library, the more important it is to have something well tested and maintained.
Boilerplate code VS Python package
We do not usually start new projects from an empty directory. In the best case we have a template that we clone; other times we just copy an existing project and remove the parts that are not needed for our new project.
If you are building some kind of internal framework for your company, we need to clarify the differences between your framework and the projects that are going to use your framework.
Having a Python package does not mean that we no longer need a project template which we have to clone when we want to start new projects. It is still absolutely necessary. The main difference is that the boilerplate code you need should be as minimal as possible. Project templates should call functions defined in the library and there should be no business logic in them, just a skeleton ready to be completed by the developer.
CookieCutter is a nice tool for building templates :)
Magic and DBUtils
Python magic and Databricks DBUtils are cool, but they only work on
Jupyter/Databricks. When we work in notebooks it is really easy to end up using
these kinds of tools. If you avoid them
because they are not available in a Python package, you will get much more portable code.
Run magic VS Python imports
Databricks has a Python magic used to run other notebooks. In addition to
making the code less portable, the use of
%run can lead to other problems.
The %run magic works like a C/C++ preprocessor include: it copies all the code from a
different notebook into the current notebook and runs it. Why is this bad? You
can overwrite existing functions, imported modules and local variables. Here is
a real example I found some months ago. The notebook runs
import datetime first, and later the
notebook1.py code is in effect copied and pasted into cell 2. This runs
from datetime import datetime and
overwrites the existing datetime module. Cell 3, which still expects datetime to be the module, fails.
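The same clash can be reproduced in plain Python, outside any notebook. This is a minimal sketch that assumes notebook1.py contains only the from datetime import datetime line; the cell numbers in the comments mirror the description above.

```python
import datetime  # cell 1: the name "datetime" refers to the module

# cell 2: what %run would in effect paste in (assumed content of notebook1.py)
from datetime import datetime  # the name "datetime" now refers to the class

# cell 3: code that still expects the module
try:
    now = datetime.datetime.now()
    error = None
except AttributeError as exc:
    error = exc

print(type(error).__name__)  # prints "AttributeError"
```

With a regular import of notebook1, the from datetime import datetime line would only rebind the name inside that module's own namespace, so cell 3 would keep working.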
Besides that, after running more than one notebook with
%run it is not easy
to know where a function has been defined. It is equivalent to Python's
import *, something strongly discouraged
because it makes the code very difficult to follow and can overwrite
existing local variables. It is especially difficult in notebooks, where we
cannot use the Go to definition option that we usually have in powerful IDEs.
Moreover, sometimes even the powerful IDEs and their linters cannot tell where
functions come from if the fearsome
import * is used.
Finally, if we execute
%run ./notebook1 twice in a row in the same notebook, its code
runs twice. This is a problem especially if there is any code other than
function definitions. Python imports take this into account and do not re-run
code that has already been imported.
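The difference can be checked with a small experiment. The sketch below writes a throwaway module (demo_module and runs.log are hypothetical names) whose body appends a line to a log file, imports it twice, and then forces a re-execution with importlib.reload, which is the closest standard-library analogue to running %run a second time.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module whose body has a side effect: it appends a line to a log file.
MODULE_SOURCE = (
    "from pathlib import Path\n"
    "with Path(__file__).with_name('runs.log').open('a') as log:\n"
    "    log.write('module body executed\\n')\n"
)


def count_module_runs():
    """Return how often the module body ran after two imports and after a reload."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_module.py").write_text(MODULE_SOURCE)
        log = Path(tmp, "runs.log")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            import demo_module  # first import: the module body runs
            import demo_module  # second import: served from sys.modules, no re-run
            after_imports = len(log.read_text().splitlines())
            importlib.reload(demo_module)  # explicit re-execution of the body
            after_reload = len(log.read_text().splitlines())
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("demo_module", None)
    return after_imports, after_reload


print(count_module_runs())  # prints (1, 2)
```

The second import is a cache lookup, so side effects such as writing data or creating tables happen once unless you explicitly ask for a reload.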
In Databricks we can also use dbutils.notebook.run
to run notebooks, but they are executed in a different process, so we cannot
compare that with Python imports.
Shared folders VS Python Package Index
A common way to deploy projects made with notebooks is to copy the files to a
shared directory such as
/Shared in Databricks. This brings two main problems.
First, it complicates the deployment process if we want to maintain more than one version at the same time. Imagine that you are working on an internal framework for your company. This kind of framework is used in more than one project at the same time, and older but still active projects may require an older version of the utility library. If only the latest version is deployed, each project ends up keeping its own copy of the utility library in its repository to make sure that a new version does not break the existing code. A folder structure like
/Shared/my_package/v0.2... can be created, but you need to check that all the
%run calls in your project point to the same required version.
Second, as the name suggests, the
/Shared directory is normally accessible by everyone, so we can make accidental modifications while reading those files in the Databricks Workspace. Even worse, hot fixes applied directly to the deployed files are never applied in the repository, so they are lost after the next deployment. Once a version is deployed it should never be modified.
Artifactory or any server that allows us to
deploy Python packages can solve this problem. Every deployment should have a
different version; you cannot replace existing deployed versions. In your
cluster or in your job definition you can specify the version you want to use
in your project and you can upgrade it when you decide. No more errors due to
breaking changes in your dependencies. Besides that, if the update does not
contain breaking changes, you do not need to modify the source files at all.
We do not need to go through all the import statements and update a path, because
imports do not use paths like
%run does. And we do not need to modify the
PYTHONPATH environment variable at all, just write a normal Python import.
In Python packages you can include dependencies and specify the minimum and maximum compatible versions. You can also use tools like Poetry or pip-compile to lock the versions of the dependencies. This is difficult to achieve in notebooks.
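To make the idea of minimum and maximum compatible versions concrete: with setuptools, a bounded dependency is usually written as a requirement string such as pandas>=1.3,<2.0. The sketch below shows what such a range accepts. It is a deliberate simplification that only handles plain numeric versions; real tools follow PEP 440, and the range itself is a hypothetical example.

```python
def parse_version(text, width=3):
    """Turn '1.5' into (1, 5, 0). Simplified: plain numeric releases only."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def within_bounds(installed, lower, upper):
    """Check lower <= installed < upper, the meaning of 'pkg>=lower,<upper'."""
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)


# Hypothetical range for illustration: pandas>=1.3,<2.0
for candidate in ["1.2.5", "1.3", "1.5.3", "2.0.0"]:
    print(candidate, within_bounds(candidate, "1.3", "2.0"))
```

The lower bound protects you from versions that lack features you rely on, and the upper bound from a future major release with breaking changes.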
Locking your dependencies is especially useful when creating Docker images for
our products. We must be able to reproduce the build of our images exactly. If
instead of creating our images with the already tested dependencies (the ones
in our lock file) we rely on the latest versions installed by
pip, there is
no guarantee that the code will work.
For libraries/frameworks we should ship our requirements with our package.
I have seen libraries that contain a
requirements.txt file with pinned
versions of their dependencies, but the package is built without any information
about those dependencies. If you are using Poetry this is unlikely to happen, but
with setuptools it is common. Remember to add your dependencies with lower and
upper bounds on the versions.
Using Python libraries you can build, run and test your code locally, saving costs in cloud computing. Most of the time we do not need to work on big clusters, or even on small ones. Why pay for a head node plus a worker node while we are just writing a complex query for 30 minutes? Write the code locally, test it with some sample data, deploy it once you are sure it is working, and then run it on production data.
No more excuses like I couldn't develop because my internet was down!
With local development comes the use of well-known Python tools like:
- Type checkers:
- Testing frameworks:
- Style checkers/formatters:
- Your favourite IDE:
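As an illustration of this workflow, here is a minimal sketch of a package function with type hints and a test that runs on a laptop against a few sample rows. The function keep_latest, the column names and the sample data are all hypothetical; the test is written in the plain-assert style that test runners such as pytest can collect.

```python
from typing import Dict, Iterable, List

Row = Dict[str, int]


def keep_latest(rows: Iterable[Row], key: str, version: str) -> List[Row]:
    """Keep, for each value of `key`, the row with the highest `version`."""
    latest: Dict[int, Row] = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    return list(latest.values())


def test_keep_latest() -> None:
    # Sample data small enough to live next to the code.
    sample = [
        {"id": 1, "version": 1},
        {"id": 2, "version": 1},
        {"id": 1, "version": 3},
    ]
    assert keep_latest(sample, key="id", version="version") == [
        {"id": 1, "version": 3},
        {"id": 2, "version": 1},
    ]
```

Because the logic lives in an ordinary function, it can be type checked, tested and debugged without a cluster, and only the final run needs production data.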
Python packages downsides
Nothing is perfect. With the use of Python packages come several problems:
The learning curve is steeper. Especially in the world of data, many people have only worked with notebooks and understanding how Python packages work can take some extra time for them. I am not saying it is difficult, but it does require some time which may delay the project.
Libraries encourage abstracting concepts with classes. Like notebooks, object-oriented programming (OOP) is very powerful, but it needs to be used well. Unnecessarily complex design can lead to unmaintainable projects, so the right balance between abstraction and complexity must be found. I know it is possible to use classes in notebooks but, in my experience, people get crazier with classes when they work on libraries, especially if their goal is to create a super flexible framework.
Performance tuning is a bit more difficult. A performance issue is usually detected in the performance environment at best, or in the production environment at worst. To work on such cases we need to test different queries on a large collection of data that we cannot have locally. Copying and pasting functions/queries from the package into a notebook in an environment with a lot of data may be needed. Another option is to build and deploy the package each time a change is made, which can be awkward.
Most companies work with private code, so publishing to the public package index is not an option. We need a private package index, or we can just deploy wheel files to a storage location. The recommended option is to have our own private repository; otherwise, deploying wheels to a storage folder becomes very similar to (and as risky as) deploying notebooks to a directory. Setting up a package index server is not difficult but, again, it needs to be done by someone.
When are notebooks useful?
- When we want to run ad-hoc queries/visualizations.
- When prototyping.
- When we want to experiment with data that takes some time to load in memory.
When should notebooks be avoided?
- When we want to deliver or reuse the solution as a whole (wheel package).
- When we want to have better control over versions (including parallel versions).
- When proper testing is needed, including coverage and test reports.
- When we want to develop locally.
- When we want to write cloud-agnostic code (
dbutils and some of the magic work only on Databricks).
- When you want to use linters like
flake8, type checkers like
pyright, your favourite IDE (VIM), auto formatting tools like
Black... In general, any tool that works with normal Python projects but does not integrate well with notebooks.
As can be seen, there are many more reasons for not using them. Experiment as much as you want with notebooks, but please do not try to build a reliable and quality product around them. And if you do, please don't call me to maintain it.