This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:
- What is DataProc? And why do we need it?
- Environments, what’s shared and what’s local
- Environment variables
- Cluster configuration
- Cluster startup optimization
- Runtime variables
DataProc cluster startup optimization
Introduction
Cluster startup time is a crucial factor in DataProc job efficiency. When a DataProc cluster is provisioned, worker nodes are allocated from a Compute Instance pool. The provisioning time depends on machine type availability in the compute region. Larger memory machines typically take longer to provision due to lower availability in the data center.
Once provisioned, workers automatically apply the DataProc image, which includes Java, Scala (for Spark), and Python 3.11 with Conda. However, no additional libraries are installed, meaning all project dependencies must be configured before executing any job.
To streamline the startup process, the initialization script follows three key steps:
- OS package update (`apt-get`)
- Install OS binaries (e.g., required system dependencies)
- Install Python requirements
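As a sketch, an initialization action covering these three steps might look like the following (the bucket path and OS packages are illustrative):

```bash
#!/bin/bash
# Illustrative DataProc initialization action covering the three steps above.
set -euxo pipefail

# 1) OS package update
apt-get update -y

# 2) Install OS binaries (system dependencies needed by the Python libraries)
apt-get install -y build-essential

# 3) Install the Python requirements staged in Cloud Storage
gsutil cp gs://my-bucket/dataproc/requirements.txt /tmp/requirements.txt
pip install -r /tmp/requirements.txt
```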
Optimizing dependency installation
There are multiple ways to install Python dependencies within a DataProc runtime. The choice of method significantly impacts cluster startup time.
The key options include:
- Installing from a requirements file
- Installing from a built Python package
- Using Poetry for dependency resolution
Option 1: Installing from a requirements file
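A sketch of this option, assuming the requirements file has been staged in Cloud Storage (paths are illustrative):

```bash
# Copy the requirements file from Cloud Storage and let pip resolve and install it.
gsutil cp gs://my-bucket/dataproc/requirements.txt /tmp/requirements.txt
pip install -r /tmp/requirements.txt
```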
Option 2: Installing from a .whl Package
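A sketch of installing from a built wheel, again assuming it has been staged in Cloud Storage (bucket and wheel names are illustrative):

```bash
# Copy the built wheel and install it; pip reads the dependency list
# from the wheel metadata and fetches anything missing.
gsutil cp gs://my-bucket/dist/my_project-0.1.0-py3-none-any.whl /tmp/
pip install /tmp/my_project-0.1.0-py3-none-any.whl
```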
Final optimization strategy
Based on extensive testing, the most efficient approach was to skip the OS package update entirely and install explicitly versioned dependencies with pip.
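A sketch of an initialization action along these lines, assuming a fully pinned requirements file staged in Cloud Storage (names are illustrative):

```bash
#!/bin/bash
# Sketch of the optimized startup: no apt-get update, no dependency resolution.
set -euxo pipefail

# Every entry in this file is pinned to an exact version at release time.
gsutil cp gs://my-bucket/dataproc/requirements-pinned.txt /tmp/requirements.txt
pip install --no-cache-dir -r /tmp/requirements.txt
```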
The COIN-OR installation could be removed, as a version update to the data science libraries meant the solver was now built into the Python libraries themselves.
Key optimizations
- Dropped `apt-get` updates: Eliminating OS package updates significantly reduces startup time (~3 minutes saved)
- Explicit dependency versioning: Pre-resolving dependency versions avoids unnecessary resolution during installation.
- Using `pip` over `poetry` or `conda`: `pip` provides the fastest installation for pre-versioned dependencies.
Python package management for DataProc
When submitting PySpark jobs, dependencies must be packaged for the Spark environment. Unlike cluster-wide dependencies installed via startup scripts, these packages are defined at the job level.
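As a sketch, a job-level package can be attached when submitting the job with the gcloud CLI (cluster, region, and file names are illustrative):

```bash
# Submit a PySpark job and attach the project package at the job level.
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/main.py \
  --cluster=my-cluster \
  --region=europe-west1 \
  --py-files=gs://my-bucket/dist/my_project.zip
```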
Five methods for building a Python package
Option 1) Poetry build
With no changes to the project, call poetry to build a wheel using the pyproject.toml file.
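The command writes the wheel into dist/:

```bash
# Build a wheel into dist/ using the metadata in pyproject.toml.
poetry build --format wheel
```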
A requirements file is still required for cluster setup. To export the dependencies:
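The group name passed to --with below is illustrative, and on newer Poetry versions the export command may require the poetry-plugin-export plugin:

```bash
# Export the locked dependencies to a pip-compatible requirements file.
poetry export --with spark --without-hashes --output requirements.txt
```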
Here we use `--with` to select the dependency groups that are required.
The `--without-hashes` flag is important for building a requirements file that is OS agnostic. Otherwise, a requirements file built on a Mac will fail to resolve matching dependencies on a Linux-based CI pipeline.
Option 2) Use Python’s built-in tools
Create a basic setup.py:
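A minimal sketch, assuming a src/ layout and reusing the exported requirements.txt (the project name and version are illustrative):

```python
# setup.py - minimal sketch; name, version and layout are illustrative.
from pathlib import Path

from setuptools import find_packages, setup

# Reuse the exported requirements file so the package metadata
# matches what the cluster installs at startup.
requirements = [
    line
    for line in Path("requirements.txt").read_text().splitlines()
    if line and not line.startswith("#")
]

setup(
    name="my_project",
    version="0.1.0",
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    install_requires=requirements,
    python_requires=">=3.11",
)
```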
With the setup file we can then build.
2a) Old method (setuptools)
The traditional way of building a Python package is to invoke setup.py directly with the setuptools library. This is now largely discouraged, and will show deprecation warnings on the command line when executed.
From the command line call
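The classic invocation builds both an sdist and a wheel into dist/:

```bash
# Legacy setuptools invocation; prints deprecation warnings on modern versions.
python setup.py sdist bdist_wheel
```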
Including the requirements file in the bundle (required for installation) means adding it to a MANIFEST.in file (also at the project root).
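The MANIFEST.in needs only a single line:

```text
include requirements.txt
```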
2b) Modern python (build)
The preferred Pythonic way of building a package is the `build` package (not actually built in; it is installed with pip).
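The package is installed with pip and invoked as a module:

```bash
# build is not part of the standard library; install it first.
pip install build

# Builds both the sdist and the wheel into dist/.
python -m build
```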
This builds both:
- `<project>-0.1.0.tar.gz`
- `<project>-0.1.0-py3-none-any.whl`
Wheels do not include the dependencies, only references for installation.
Wheels are better for fast installation and when dependencies are frozen (such as a release on a production pipeline).
Option 3) Zipping the source
DataProc supports Python code simply zipped into a .zip file. However, that .zip file still requires a top-level setup.py or pyproject.toml to instruct the package manager on how to handle the install.
In this project, the pyproject.toml references the README.md file, so it also needs to be included in the bundle.
3a) Zipping with the pyproject.toml
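A sketch of the bundle, assuming a src/ layout (the archive name is illustrative):

```bash
# Bundle the source, the pyproject.toml and the README it references.
zip -r my_project.zip src/ pyproject.toml README.md
```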
The README.md needs to be included in the bundle as the pyproject.toml makes reference to it.
3b) Zipping with the setup.py
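The equivalent sketch with setup.py; requirements.txt is included because setup.py reads it at install time:

```bash
# Bundle the source with the setup.py and its requirements file.
zip -r my_project.zip src/ setup.py requirements.txt
```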
Bundling the dependencies in the package
Unfortunately, the nature of Python is that a project inevitably ends up with a lot of dependencies. By default, pip installs these sequentially: querying an archive server, downloading the package, caching it, then installing it, all over the open internet. If you are not careful to explicitly version each dependency, pip will also try to re-resolve the entire dependency tree on install, slowing startup by as much as 15 minutes per run.
In a release situation, where the dependencies are frozen and clusters are frequently spun up, it can make sense to include the dependencies within the package. The (larger) package is then copied once from Cloud Storage to the Composer file system, and everything is installed without any calls to the internet.
The only complication is that the bundled dependencies need to exactly match the target runtime, i.e. a package built on a Mac will pull Mac-compiled dependencies and will not run on a Linux server.
Pull the Docker image
Jump into a Debian-based Linux Docker image, ideally one that matches the Python version on the DataProc cluster.
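For example, a Debian slim image matching the cluster's Python 3.11 (the exact tag is a choice, not a requirement):

```bash
# Debian-based image matching the cluster's Python version.
docker pull python:3.11-slim-bookworm
```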
With the image downloaded, create and connect to an interactive session with the image. Note: connect with bash, not the Python command interface.
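A sketch of the run command, mounting the project into the container:

```bash
# Start an interactive bash shell with the project mounted at /app.
docker run -it --rm -v "$(pwd)":/app -w /app python:3.11-slim-bookworm bash
```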
Where:
- `$(pwd):/app`: Mounts your current directory (`pwd`) to `/app` in the container.
- `-w /app`: Sets the working directory inside the container to `/app`.
Install dependencies to a target directory
From inside the container, install the requirements, saving the packages to a local directory:
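A sketch using pip's --target flag; the vendor/ directory name is a convention, not a requirement:

```bash
# Install the pinned requirements into a local vendor/ directory.
# Because this runs inside the container, Linux wheels are pulled.
pip install -r requirements.txt --target vendor/
```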
This will then pull the exact requirement binary versions for Linux and store them in the vendor directory.
Again, we zip the project code, this time including the dependencies:
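The same zip command as before, now with the vendored dependencies included (archive name illustrative):

```bash
# Bundle the source together with the vendored Linux dependencies.
zip -r my_project.zip src/ setup.py requirements.txt vendor/
```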
For this project the ZIP size went from ~80KB to 468MB with the usual data science libraries.
However, this should save a lot of time on installation, as there is no resolution required from pip, and there should be no network latency.
Benchmarking cluster initialization
To determine the optimal approach, a benchmarking exercise was conducted to compare different formats, dependency resolutions, and build systems.
Benchmark results
| Format | Versioned Dependencies | Include Dependencies | Build System | Config | Build Size | DataProc Init Time |
|---|---|---|---|---|---|---|
| .tar.gz | No | No | Poetry | pyproject.toml | 80 KB | 25+ min |
| .whl | No | No | Poetry | pyproject.toml | 80 KB | 7 min |
| .whl | Yes | No | Poetry | pyproject.toml | 80 KB | 5 min |
| .zip | No | No | Poetry .whl → zip | pyproject.toml | 80 KB | 25+ min |
| .zip | No | Yes | Poetry .whl → zip | pyproject.toml | 468 MB | 22 min |
| .zip | No | No | Poetry .whl → zip | setup.py | 80 KB | 12 min |
| .whl | No | No | Python sdist | setup.py | 80 KB | 9 min |
| .whl | No | No | Python build | setup.py | 80 KB | 9 min |
| .zip | Yes | No | Zip | pyproject.toml | 80 KB | 25+ min |
| .zip | Yes | Yes | Zip | pyproject.toml | 468 MB | 17 min |
| .zip | Yes | No | Zip | setup.py | 80 KB | 12 min |
| .zip | Yes | Yes | Zip | setup.py | 468 MB | 5 min |
Key takeaways
1) .whl is the most performant file format
.whl files were the most performant.
.zip files were second best. Curiously, taking a .whl file and renaming it to .zip almost doubled the cluster initialization time.
.tar.gz was by far the worst format, taking more than 5x the install time before the cluster timed out.
2) Versioned dependencies vastly increase install speed
Explicit versioning saved approx. 30% of cluster initialization time across file formats.
Of course, the version resolution needs to be added to the deployment pipeline, but this only runs once per release and saves resolution every time a job is submitted.
3) Bundling the dependencies yields no real benefit
It was theorized that, as GCS and DataProc would be in the same data centre, moving a larger package internally would be faster than looking up and pulling ~100 libraries across the open internet. This was not the case. It might be that the DataProc service is providing a certain amount of caching of common libraries, or that pip can multithread the fetch and install of packages from the internet, but not locally.
4) pyproject.toml performs slightly better than setup.py
Not a significant difference, but it makes sense to avoid a legacy build system that is warned against and will shortly be deprecated; further, a setup.py pollutes the project's top-level directory with more files.
5) Modern build systems produce more performant builds
At first glance, the packages built by sdist, build and poetry appear identical - however, there are slight differences which affect their handling by the package manager at install. The new systems install slightly faster.
6) Larger jobs are more performant than increasing parallelization
Given that the package install has to occur every time a job is submitted to a Spark cluster worker, care should be taken to balance number of jobs with job size. From this investigation, increasing a job length by 5 minutes would be far more beneficial than creating a new job to do the same work.
For this reason, larger worker nodes utilizing a multithread approach would likely yield the best outcomes.

