If you run two builds with the same source code and the same commit but on two different machines, do you expect to get the same result?
Well, in most cases you will not!
In this article, we'll identify sources of non-determinism in most build processes and look at how Bazel can be used to create reproducible, hermetic builds. We'll then create a reproducible Flask application that can be built with Bazel so that the Python interpreter and all dependencies are hermetic.
Reproducible Builds
According to the Reproducible Builds project, "a build is reproducible if given the same source code, build environment, and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts". This means that in order to achieve a reproducible build you must remove all sources of non-determinism. Although this can be difficult, there are several benefits:
- It can drastically speed up the build time thanks to caching of intermediate build artifacts in large build graphs.
- You can reliably determine the origin of a binary artifact, such as the sources it was built from.
- Reproducible builds are more secure and reduce the attack surface.
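To make "bit-by-bit identical" concrete, here is a minimal sketch that compares two artifacts by their SHA-256 digests. The function names and the temporary files are hypothetical and only stand in for the outputs of two independent builds.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifacts_identical(first: Path, second: Path) -> bool:
    """True when both files hash to the same digest, i.e. their bytes match."""
    return sha256_of_file(first) == sha256_of_file(second)


if __name__ == "__main__":
    # Hypothetical stand-ins for the same artifact built on two machines.
    with tempfile.TemporaryDirectory() as workdir:
        build_a = Path(workdir, "artifact_machine_a.bin")
        build_b = Path(workdir, "artifact_machine_b.bin")
        build_a.write_bytes(b"compiled output")
        build_b.write_bytes(b"compiled output")
        print(artifacts_identical(build_a, build_b))
```

A single differing byte, such as an embedded timestamp, is enough to change the digest, which is why every source of non-determinism has to go.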
Hermeticity
One of the most common causes of non-determinism is the inputs to the build. By that, I mean everything that's not source code: compilers, build tools, third-party libraries, and any other inputs that might influence the build. For your builds to be hermetic, all references must be unambiguous, either as fully resolved version numbers or hashes. Hermetic information should be checked in as part of the source code.
Hermetic builds enable cherry-picking. Let's say you want to fix a bug in an older release that's running in production. If you have a hermetic build process, you can check out the old revision, fix the bug, and then rebuild the code. Thanks to hermeticity, all the build tools are versioned in the source code repository, so a project built two months ago won't use today's version of the compiler, which may be incompatible with the two-month-old source code.
Internal Randomness
Internal randomness is an issue you have to tackle before you can achieve a reproducible build, and it can be sneaky to track down.
Timestamps are a common source of internal randomness. They are often used to keep track of when the build was done. Get rid of them. With reproducible builds, timestamps are irrelevant since you're already tracking your build environment with source control.
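As a small illustration of removing a timestamp, the sketch below uses Python's gzip module, whose file header contains a modification-time field. Passing mtime=0 (available in Python 3.8+) fixes that field, so compressing the same bytes yields the same output every time. The helper name and payload are hypothetical.

```python
import gzip


def deterministic_gzip(data: bytes) -> bytes:
    """Compress data with the gzip header timestamp pinned to zero."""
    return gzip.compress(data, mtime=0)


if __name__ == "__main__":
    payload = b"print('hello from a reproducible build')\n"
    first = deterministic_gzip(payload)
    second = deterministic_gzip(payload)
    # With the timestamp pinned, both archives are byte-for-byte equal.
    print(first == second)
```

The same idea applies to other archive formats and build stamps: either drop the timestamp or set it to a fixed, agreed-upon value.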
In languages that don't initialize values automatically, you need to initialize them explicitly; otherwise, your build can capture random bytes from memory.
There's no easy way around it -- you must inspect your code!
In some situations GCC uses random numbers, so you may need to use the -frandom-seed option to produce reproducibly identical object files.
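As a rough sketch of how a build script might pass that option, the helper below assembles a gcc command line that uses the source path as the seed string. The helper and the file names are hypothetical, and using the source path is just one way to keep the seed stable per file and different between files.

```python
from typing import List


def gcc_compile_command(source: str, output: str) -> List[str]:
    """Build a gcc argv that replaces GCC's random seed with the source path."""
    return ["gcc", f"-frandom-seed={source}", "-c", source, "-o", output]


if __name__ == "__main__":
    # Hypothetical paths; the command is only printed, not executed.
    print(" ".join(gcc_compile_command("src/main.c", "out/main.o")))
```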
For more on the causes of internal randomness, check out the Documentation section from the Reproducible Builds project.
Reproducible Builds with Bazel
All this may sound a bit overwhelming, but it's actually not as complex as it sounds. Bazel makes this process much easier.
We'll now go through an example of using Bazel to compile and distribute a Flask application.
Installation
Bazel is one of the best solutions available for creating reproducible, hermetic builds. It supports many languages like Python, Java, C, C++, Go, and more. Start by installing Bazel.
To build our Flask application, we need to instruct Bazel to use Python 3.8.3 hermetically. This means that we can't rely on the Python version installed on the host machine -- we must compile it from scratch!
Workspace
After creating a folder to hold your project, start by setting up a workspace, which holds your project's source code and Bazel’s build outputs.
Create a file called WORKSPACE:
workspace(name = "my_flask_app")

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

_configure_python_based_on_os = """
if [[ "$OSTYPE" == "darwin"* ]]; then
    ./configure --prefix=$(pwd)/bazel_install --with-openssl=$(brew --prefix openssl)
else
    ./configure --prefix=$(pwd)/bazel_install
fi
"""

# Fetch Python and build it from scratch
http_archive(
    name = "python_interpreter",
    build_file_content = """
exports_files(["python_bin"])
filegroup(
    name = "files",
    srcs = glob(["bazel_install/**"], exclude = ["**/* *"]),
    visibility = ["//visibility:public"],
)
""",
    patch_cmds = [
        "mkdir $(pwd)/bazel_install",
        _configure_python_based_on_os,
        "make",
        "make install",
        "ln -s bazel_install/bin/python3 python_bin",
    ],
    sha256 = "dfab5ec723c218082fe3d5d7ae17ecbdebffa9a1aea4d64aa3a2ecdd2e795864",
    strip_prefix = "Python-3.8.3",
    urls = ["https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tar.xz"],
)

# Fetch official Python rules for Bazel
http_archive(
    name = "rules_python",
    sha256 = "b6d46438523a3ec0f3cead544190ee13223a52f6a6765a29eae7b7cc24cc83a0",
    url = "https://github.com/bazelbuild/rules_python/releases/download/0.1.0/rules_python-0.1.0.tar.gz",
)

load("@rules_python//python:repositories.bzl", "py_repositories")

py_repositories()
We first fetched the Python source code archive with the http_archive rule and then built it from scratch.
With this, we can be sure to have control over the Python binary and the version. Remember: You don't want to use the Python version installed on the host machine or your build will not be reproducible. The hermeticity here is ensured by the urls field, which tells Bazel where to find the dependency, and the sha256 field, which is the unique identifier for it. Every build will use the same unambiguous Python version.
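Conceptually, pinning a sha256 amounts to the check sketched below: hash the fetched bytes and refuse to continue if the digest differs from the one checked into the repository. This is not Bazel's implementation; verify_sha256 and ChecksumMismatch are hypothetical names, and the byte string stands in for a downloaded archive.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when fetched bytes do not match the pinned digest."""


def verify_sha256(data: bytes, expected_sha256: str) -> None:
    """Raise ChecksumMismatch unless data hashes to the pinned SHA-256 digest."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise ChecksumMismatch(f"expected {expected_sha256}, got {actual}")


if __name__ == "__main__":
    archive = b"stand-in for the bytes of a downloaded archive"
    pinned = hashlib.sha256(archive).hexdigest()  # in practice, stored in the repo
    verify_sha256(archive, pinned)  # matches, so nothing happens
    try:
        verify_sha256(archive + b"tampered", pinned)
    except ChecksumMismatch as error:
        print(f"rejected: {error}")
```

Because the digest identifies the content rather than the location, it does not matter which mirror serves the archive as long as the bytes are the same.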
Next, we fetched the official Python Bazel rules. Here, the sha256 is used as the identifier. We'll use the rules later on to create the build and test targets. Before that, we must define our toolchain.
Toolchain
Next, we'll configure a BUILD file.
Create a file called BUILD:
load("@rules_python//python:defs.bzl", "py_runtime", "py_runtime_pair")

py_runtime(
    name = "python3_runtime",
    files = ["@python_interpreter//:files"],
    interpreter = "@python_interpreter//:python_bin",
    python_version = "PY3",
    visibility = ["//visibility:public"],
)

py_runtime_pair(
    name = "py_runtime_pair",
    py2_runtime = None,
    py3_runtime = ":python3_runtime",
)

toolchain(
    name = "py_3_toolchain",
    toolchain = ":py_runtime_pair",
    toolchain_type = "@bazel_tools//tools/python:toolchain_type",
)
This config will create a Python runtime from the Python interpreter that we defined in the workspace, which will then be used in a toolchain.
Finally, to register the toolchain, add the following line to the end of the WORKSPACE file:
# The Python toolchain must be registered ALWAYS at the end of the file
register_toolchains("//:py_3_toolchain")
Nice! You now have a hermetic Bazel build environment set up. Don't just take my word for it; let's write a test.
Test
For writing tests in Python, we'll need pytest, so let's add a requirements.txt file:
attrs==20.3.0 --hash=sha256:31b2eced602aa8423c2aea9c76a724617ed67cf9513173fd3a4f03e3a929c7e6
more-itertools==8.2.0 --hash=sha256:5dd8bcf33e5f9513ffa06d5ad33d78f31e1931ac9a18f33d37e77a180d393a7c
packaging==20.3 --hash=sha256:82f77b9bee21c1bafbf35a84905d604d5d1223801d639cf3ed140bd651c08752
pluggy==0.13.1 --hash=sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d
py==1.8.1 --hash=sha256:c20fdd83a5dbc0af9efd622bee9a5564e278f6380fffcacc43ba6f43db2813b0
pyparsing==2.0.2 --hash=sha256:17e43d6b17588ed5968735575b3983a952133ec4082596d214d7090b56d48a06
pytest==5.4.1 --hash=sha256:0e5b30f5cb04e887b91b1ee519fa3d89049595f428c1db76e73bd7f17b09b172
six==1.15.0 --hash=sha256:8b74bedcbbbaca38ff6d7491d76f2b06b3592611af620f8426e82dddb04a5ced
wcwidth==0.1.9 --hash=sha256:cafe2186b3c009a04067022ce1dcd79cb38d8d65ee4f4791b8888d6599d1bbe1
Along with pytest, we added all the child dependencies as well. We also added the versions and the sha256 hash as an identifier for hermeticity.
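A quick way to catch an entry that is missing its pin or hash is a small lint over the file. The sketch below is an assumption-laden simplification: it only understands the one-requirement-per-line "name==version --hash=sha256:..." layout used here, and unpinned_requirements is a hypothetical helper, not part of pip or rules_python.

```python
import re
from typing import List

# An exact pin: package name, "==", and a version with no spaces.
_PIN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==[^\s=]+$")
# A sha256 hash option with a full 64-character hex digest.
_HASH = re.compile(r"^--hash=sha256:[0-9a-f]{64}$")


def unpinned_requirements(text: str) -> List[str]:
    """Return the requirement lines that lack an exact version or a sha256 hash."""
    offenders = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        spec, *options = line.split()
        has_pin = bool(_PIN.match(spec))
        has_hash = any(_HASH.match(option) for option in options)
        if not (has_pin and has_hash):
            offenders.append(line)
    return offenders


if __name__ == "__main__":
    sample = "\n".join([
        "pytest==5.4.1 --hash=sha256:0e5b30f5cb04e887b91b1ee519fa3d89049595f428c1db76e73bd7f17b09b172",
        "six>=1.15.0",  # deliberately unpinned to show a failure
    ])
    print(unpinned_requirements(sample))
```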
Now we can modify the workspace again by adding the pip_install rule for handling dependencies. Add the following just before register_toolchains:
# Third party libraries
load("@rules_python//python:pip.bzl", "pip_install")

pip_install(
    name = "py_deps",
    python_interpreter_target = "@python_interpreter//:python_bin",
    requirements = "//:requirements.txt",
)
You should now have:
workspace(name = "my_flask_app")

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

_configure_python_based_on_os = """
if [[ "$OSTYPE" == "darwin"* ]]; then
    ./configure --prefix=$(pwd)/bazel_install --with-openssl=$(brew --prefix openssl)
else
    ./configure --prefix=$(pwd)/bazel_install
fi
"""

# Fetch Python and build it from scratch
http_archive(
    name = "python_interpreter",
    build_file_content = """
exports_files(["python_bin"])
filegroup(
    name = "files",
    srcs = glob(["bazel_install/**"], exclude = ["**/* *"]),
    visibility = ["//visibility:public"],
)
""",
    patch_cmds = [
        "mkdir $(pwd)/bazel_install",
        _configure_python_based_on_os,
        "make",
        "make install",
        "ln -s bazel_install/bin/python3 python_bin",
    ],
    sha256 = "dfab5ec723c218082fe3d5d7ae17ecbdebffa9a1aea4d64aa3a2ecdd2e795864",
    strip_prefix = "Python-3.8.3",
    urls = ["https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tar.xz"],
)

# Fetch official Python rules for Bazel
http_archive(
    name = "rules_python",
    sha256 = "b6d46438523a3ec0f3cead544190ee13223a52f6a6765a29eae7b7cc24cc83a0",
    url = "https://github.com/bazelbuild/rules_python/releases/download/0.1.0/rules_python-0.1.0.tar.gz",
)

load("@rules_python//python:repositories.bzl", "py_repositories")

py_repositories()

# Third party libraries
load("@rules_python//python:pip.bzl", "pip_install")

pip_install(
    name = "py_deps",
    python_interpreter_target = "@python_interpreter//:python_bin",
    requirements = "//:requirements.txt",
)

# The Python toolchain must be registered ALWAYS at the end of the file
register_toolchains("//:py_3_toolchain")
We're now ready to write the test.
Create a new folder called "test" and add a new test file called compiler_version_test.py:
import os
import platform
import sys

import pytest


class TestPythonVersion:
    def test_version(self):
        # The running interpreter must be the one Bazel built, not the host's.
        expected_interpreter = os.path.abspath(
            os.path.join(os.getcwd(), "..", "python_interpreter", "python_bin")
        )
        assert expected_interpreter in sys.executable
        assert platform.python_version() == "3.8.3"


if __name__ == "__main__":
    raise SystemExit(pytest.main([__file__]))
This tests that the running interpreter is the one Bazel built and that its version is 3.8.3.
To include it in the build process, add a BUILD file to the "test" folder:
load("@rules_python//python:defs.bzl", "py_test")
load("@py_deps//:requirements.bzl", "requirement")

py_test(
    name = "compiler_version_test",
    srcs = ["compiler_version_test.py"],
    deps = [
        requirement("attrs"),
        requirement("more-itertools"),
        requirement("packaging"),
        requirement("pluggy"),
        requirement("py"),
        requirement("pytest"),
        requirement("wcwidth"),
    ],
)
Here we defined a py_test rule called compiler_version_test, the source files, and the dependencies needed. Everything is explicit.
At this point you should have something like this:
├── BUILD
├── WORKSPACE
├── requirements.txt
└── test
    ├── BUILD
    └── compiler_version_test.py
With that, we can run our first "bazelized" Python test!
From the project root, run:
$ bazel test //test:compiler_version_test
Output:
Starting local Bazel server and connecting to it...
INFO: Analyzed target //test:compiler_version_test (31 packages loaded, 8550 targets configured).
INFO: Found 1 test target...
Target //test:compiler_version_test up-to-date:
bazel-bin/test/compiler_version_test
INFO: Elapsed time: 172.459s, Critical Path: 3.10s
INFO: 2 processes: 2 darwin-sandbox.
INFO: Build completed successfully, 2 total actions
//test:compiler_version_test PASSED in 0.6s
Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 2 total actions
At this point you have a working Python environment configured hermetically.
Flask
We're now ready to develop the Flask application.
Create a "src" folder. Then, add a file called flask_app.py to it:
import platform
import subprocess
import sys

from flask import Flask


def cmd(args):
    process = subprocess.Popen(args, stdout=subprocess.PIPE)
    out, _ = process.communicate()
    return out.decode('ascii').strip()


app = Flask(__name__)


@app.route('/')
def python_versions():
    bazel_python_path = f'Python executable used by Bazel is: {sys.executable} <br/><br/>'
    bazel_python_version = f'Python version used by Bazel is: {platform.python_version()} <br/><br/>'
    host_python_path = f'Python executable on the HOST machine is: {cmd(["which", "python3"])} <br/><br/>'
    host_python_version = f'Python version on the HOST machine is: {cmd(["python3", "-c", "import platform; print(platform.python_version())"])}'
    python_string = (
        bazel_python_path
        + bazel_python_version
        + host_python_path
        + host_python_version
    )
    return python_string


if __name__ == '__main__':
    app.run()
This is a simple Flask application that shows the path and version of the Python interpreter used by Bazel alongside those of the interpreter on the host machine.
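The way the app asks another interpreter for its version can be tried on its own, without Flask. The sketch below is a variation on the cmd helper that uses subprocess.run and, as a stand-in for the host's python3, queries whichever interpreter is running the script. The interpreter_version helper is a hypothetical name.

```python
import platform
import subprocess
import sys


def cmd(args):
    """Run a command and return its stripped standard output."""
    completed = subprocess.run(args, stdout=subprocess.PIPE, check=True)
    return completed.stdout.decode("ascii").strip()


def interpreter_version(python_path):
    """Ask the interpreter at python_path to report its own version."""
    return cmd([python_path, "-c", "import platform; print(platform.python_version())"])


if __name__ == "__main__":
    # Querying the current interpreter in a subprocess should agree with
    # what the platform module reports in-process.
    print(sys.executable, interpreter_version(sys.executable))
    print(interpreter_version(sys.executable) == platform.python_version())
```

In the Flask app, the in-process values come from the Bazel-built interpreter, while the subprocess calls resolve python3 through the host's PATH, which is why the two can differ.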
To build it, we need to add a BUILD file to "src":
load("@rules_python//python:defs.bzl", "py_binary")
load("@py_deps//:requirements.bzl", "requirement")

py_binary(
    name = "flask_app",
    srcs = ["flask_app.py"],
    python_version = "PY3",
    deps = [
        requirement("flask"),
        requirement("jinja2"),
        requirement("werkzeug"),
        requirement("itsdangerous"),
        requirement("click"),
    ],
)
We also need to extend the requirements.txt file with the following:
click==5.1 --hash=sha256:0c22a2cd5a1d741e993834df99133de07eff6cc1bf06f137da2c5f3bab9073a6
flask==1.1.2 --hash=sha256:8a4fdd8936eba2512e9c85df320a37e694c93945b33ef33c89946a340a238557
itsdangerous==0.24 --hash=sha256:cbb3fcf8d3e33df861709ecaf89d9e6629cff0a217bc2848f1b41cd30d360519
Jinja2==2.10.0 --hash=sha256:74c935a1b8bb9a3947c50a54766a969d4846290e1e788ea44c1392163723c3bd
MarkupSafe==0.23 --hash=sha256:a4ec1aff59b95a14b45eb2e23761a0179e98319da5a7eb76b56ea8cdc7b871c3
Werkzeug==0.15.5 --hash=sha256:87ae4e5b5366da2347eb3116c0e6c681a0e939a33b2805e2c0cbd282664932c4
Full file:
attrs==20.3.0 --hash=sha256:31b2eced602aa8423c2aea9c76a724617ed67cf9513173fd3a4f03e3a929c7e6
click==5.1 --hash=sha256:0c22a2cd5a1d741e993834df99133de07eff6cc1bf06f137da2c5f3bab9073a6
flask==1.1.2 --hash=sha256:8a4fdd8936eba2512e9c85df320a37e694c93945b33ef33c89946a340a238557
itsdangerous==0.24 --hash=sha256:cbb3fcf8d3e33df861709ecaf89d9e6629cff0a217bc2848f1b41cd30d360519
Jinja2==2.10.0 --hash=sha256:74c935a1b8bb9a3947c50a54766a969d4846290e1e788ea44c1392163723c3bd
MarkupSafe==0.23 --hash=sha256:a4ec1aff59b95a14b45eb2e23761a0179e98319da5a7eb76b56ea8cdc7b871c3
more-itertools==8.2.0 --hash=sha256:5dd8bcf33e5f9513ffa06d5ad33d78f31e1931ac9a18f33d37e77a180d393a7c
packaging==20.3 --hash=sha256:82f77b9bee21c1bafbf35a84905d604d5d1223801d639cf3ed140bd651c08752
pluggy==0.13.1 --hash=sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d
py==1.8.1 --hash=sha256:c20fdd83a5dbc0af9efd622bee9a5564e278f6380fffcacc43ba6f43db2813b0
pyparsing==2.0.2 --hash=sha256:17e43d6b17588ed5968735575b3983a952133ec4082596d214d7090b56d48a06
pytest==5.4.1 --hash=sha256:0e5b30f5cb04e887b91b1ee519fa3d89049595f428c1db76e73bd7f17b09b172
six==1.15.0 --hash=sha256:8b74bedcbbbaca38ff6d7491d76f2b06b3592611af620f8426e82dddb04a5ced
wcwidth==0.1.9 --hash=sha256:cafe2186b3c009a04067022ce1dcd79cb38d8d65ee4f4791b8888d6599d1bbe1
Werkzeug==0.15.5 --hash=sha256:87ae4e5b5366da2347eb3116c0e6c681a0e939a33b2805e2c0cbd282664932c4
Then, to run the application, run:
$ bazel run //src:flask_app
You should see:
INFO: Analyzed target //src:flask_app (10 packages loaded, 184 targets configured).
INFO: Found 1 target...
Target //src:flask_app up-to-date:
bazel-bin/src/flask_app
INFO: Elapsed time: 7.430s, Critical Path: 1.12s
INFO: 4 processes: 4 internal.
INFO: Build completed successfully, 4 total actions
INFO: Build completed successfully, 4 total actions
* Serving Flask app "flask_app" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Now the application is running on localhost. Open your browser and navigate to http://127.0.0.1:5000/. You should see something similar to:
Python executable used by Bazel is: /private/var/tmp/_bazel_michael/0c5c16dff39796b913e37a926dff4861/execroot/my_flask_app/bazel-out/darwin-fastbuild/bin/src/flask_app.runfiles/python_interpreter/python_bin
Python version used by Bazel is: 3.8.3
Python executable on the HOST machine is: /Users/michael/.pyenv/versions/3.9.0/bin/python3
Python version on the HOST machine is: 3.9.0
As expected, Bazel is using the Python 3.8.3 that we compiled from scratch and not the Python 3.9.0 that I have on my host machine.
Reproducibility Test
Finally, are we sure that the build is reproducible?
To test, run a build two times and check the output binaries for any differences by comparing the md5 hashes:
$ md5sum $(bazel info bazel-bin)/src/flask_app
2075a7ec4e8eb7ced16f0d9b3d8c5619 /private/var/tmp/_bazel_michael/0c5c16dff39796b913e37a926dff4861/execroot/my_flask_app/bazel-out/darwin-fastbuild/bin/src/flask_app
$ bazel clean --expunge_async
# or 'bazel clean --expunge' on non-linux platforms
INFO: Starting clean.
$ bazel build //...
Starting local Bazel server and connecting to it...
INFO: Analyzed 4 targets (38 packages loaded, 8729 targets configured).
INFO: Found 4 targets...
INFO: Elapsed time: 183.923s, Critical Path: 1.65s
INFO: 7 processes: 7 internal.
INFO: Build completed successfully, 7 total actions
$ md5sum $(bazel info bazel-bin)/src/flask_app
2075a7ec4e8eb7ced16f0d9b3d8c5619 /private/var/tmp/_bazel_michael/0c5c16dff39796b913e37a926dff4861/execroot/my_flask_app/bazel-out/darwin-fastbuild/bin/src/flask_app
Here, we computed the hash of the binary that we just built, cleaned all build artifacts and dependencies, and ran the build again. The new binary is identical to the old one!
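When the hashes do not match, it helps to know which file changed. The sketch below records a digest for every file in an output directory so that two snapshots can be compared. It uses SHA-256 rather than md5 (either works for spotting differences); the helper names are hypothetical, and the temporary directories stand in for two build output trees.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def tree_manifest(root: Path) -> Dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def differing_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return paths that were added, removed, or changed between two manifests."""
    names = before.keys() | after.keys()
    return sorted(name for name in names if before.get(name) != after.get(name))


if __name__ == "__main__":
    # Hypothetical output trees from two builds of the same source.
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        for out_dir, banner in ((first, b"built at 10:00"), (second, b"built at 10:05")):
            Path(out_dir, "app.py").write_bytes(b"print('same')\n")
            Path(out_dir, "build_info.txt").write_bytes(banner)
        # Only build_info.txt is reported, because its content differs.
        print(differing_files(tree_manifest(Path(first)), tree_manifest(Path(second))))
```

In practice you would save the manifest from the first build, clean, rebuild, and then compare it with the manifest of the second build.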
--
So your build is hermetic, right?
Well, actually, it's not fully reproducible. Let's look at why.
Jump back to the WORKSPACE file. In this file, we attempted to build Python, inside Bazel, to achieve full reproducibility. However, using http_archive's patch_cmds means that Python is built using the compiler of the host machine that runs the build. The Python interpreter, which is pinned to a precise version, will depend on the machine's GCC and system libraries that are not pinned or controlled in any way. In other words, the build is not fully reproducible.
There are solutions for that, though!
Examples:
- You can run bazel build from a Docker container, with a pinned GCC version, and then check in the Docker information within your project. This is a common approach in CI systems.
- Instead of compiling Python from scratch, you can use a pre-compiled binary executable, check it into source control, and pin it on the build.
- You can use a tool like Nix, which allows importing external dependencies (like system libraries) into Bazel hermetically.
Conclusion
To summarize the biggest takeaways:
- Don't take for granted that your build is reproducible
- Hermeticity enables cherry-picking
- Inputs to the build must be versioned with source code
- Internal randomness can be sneaky but must be removed
- You have a working Python environment that is hermetic thanks to Bazel
- You have seen how to build a Flask application in a reproducible way
Now that you are familiar with the meaning of a reproducible, hermetic build, your journey to making your builds reproducible begins.
Test the md5 of the binary of the project you are currently working on and let me know the result.