DataProc - Cluster startup optimization

An explainer on how to optimize Google Cloud DataProc startup for speed and cost in Python AI and ML workflows.

This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:

  1. What is DataProc? And why do we need it?
  2. Environments, what’s shared and what’s local
  3. Environment variables
  4. Cluster configuration
  5. Cluster startup optimization
  6. Runtime variables

DataProc cluster startup optimization

Introduction

Cluster startup time is a crucial factor in DataProc job efficiency. When a DataProc cluster is provisioned, worker nodes are allocated from a Compute Instance pool. The provisioning time depends on machine type availability in the compute region. Larger memory machines typically take longer to provision due to lower availability in the data center.

Once provisioned, workers automatically apply the DataProc image, which includes Java, Scala (for Spark), and Python 3.11 with Conda. However, no additional libraries are installed, meaning all project dependencies must be configured before executing any job.

To streamline the startup process, the initialization script follows three key steps:

  1. OS package update (apt-get)
  2. Install OS binaries (e.g., required system dependencies)
  3. Install Python requirements
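
For context, such a script typically lives in Cloud Storage and is attached when the cluster is created. A minimal sketch, where the cluster name, region, bucket and script path are all placeholder assumptions:

# Attach the startup script as a DataProc initialization action
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --initialization-actions=gs://<bucket>/scripts/init.sh \
    --initialization-action-timeout=10m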

Optimizing dependency installation

There are multiple ways to install Python dependencies within a DataProc runtime. The choice of method significantly impacts cluster startup time.

The key options include:

  • Installing from a requirements file
  • Installing from a built Python package
  • Using Poetry for dependency resolution

Option 1: Installing from a requirements file

#!/bin/bash  

# Install required OS libraries  
apt-get update -y  
apt-get install --yes make coinor-cbc coinor-libcbc-dev libgomp1  

# Variables  
PACKAGE_BUCKET="gs://<bucket>/data"  
LOCAL_PACKAGE_DIR="/opt/app"  

# Copy the requirements file from GCS  
gsutil cp $PACKAGE_BUCKET/requirements.txt ./requirements.txt  

# Install dependencies  
pip install -r requirements.txt

Option 2: Installing from a .whl package

#!/bin/bash  

# Install required OS libraries  
apt-get update -y  
apt-get install --yes make coinor-cbc coinor-libcbc-dev libgomp1  

# Variables  
PACKAGE_BUCKET="gs://<bucket>/data"  
LATEST_PACKAGE=$(gsutil ls "$PACKAGE_BUCKET/<project>-*-py3-none-any.whl" | tail -1)   
LOCAL_DIR="/opt/app"  

# Copy the package from GCS  
mkdir -p $LOCAL_DIR  
gsutil cp $LATEST_PACKAGE $LOCAL_DIR    

# Install dependencies  
cd $LOCAL_DIR  
ls -1 <project>-*-py3-none-any.whl | xargs pip install

Final optimization strategy

Based on extensive testing, the most efficient approach was determined to be:

#!/bin/bash  

# Variables  
REQUIREMENTS="gs://<bucket>/data/conf/requirements.txt"  

# Copy and install dependencies  
gsutil cp $REQUIREMENTS ./requirements.txt  
pip install -r requirements.txt
Note

The COIN-OR installation could be removed because a version update to the data science libraries meant the solver binaries were now built into the Python libraries themselves.

Key optimizations

  1. Dropped apt-get updates: eliminating OS package updates significantly reduces startup time (~3 minutes saved).
  2. Explicit dependency versioning: pre-resolving dependency versions avoids unnecessary resolution during installation (see the sketch below).
  3. Using pip over poetry or conda: pip provides the fastest installation for pre-versioned dependencies.
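
To make the versioning explicit, the requirements file can be fully pinned once per release, leaving pip with no resolution work at cluster startup. A minimal sketch (versions illustrative):

# Freeze the fully resolved environment once, e.g. in the release pipeline.
# Output lines look like: numpy==1.26.4, pandas==2.1.4, ...
pip freeze > requirements.txt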

Python package management for DataProc

When submitting PySpark jobs, dependencies must be packaged for the Spark environment. Unlike cluster-wide dependencies installed via startup scripts, these packages are defined at the job level.
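
As a sketch of how a package is attached at submission time (the cluster name, paths and bundle name are placeholders, assuming a bundle built with one of the methods below):

# Submit a PySpark job, shipping the packaged code to the workers
gcloud dataproc jobs submit pyspark gs://<bucket>/jobs/main.py \
    --cluster=example-cluster \
    --region=us-central1 \
    --py-files=gs://<bucket>/dist/<project>-0.1.0-py3-none-any.zip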

Five methods for building a Python package

Option 1) Poetry build

With no changes to the project, call poetry to build a wheel using the pyproject.toml file.

poetry build -f wheel

A requirements file is still required for cluster setup. To export the dependencies:

poetry export \
    -f requirements.txt \
    --with data \
    --output requirements.txt \
    --without-hashes

Here we use --with to select the dependency groups that are required.
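
For reference, a dependency group is declared in the pyproject.toml. A minimal sketch, assuming a group named data with illustrative packages:

[tool.poetry.group.data.dependencies]
pandas = "^2.1"
numpy = "^1.26"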

Important

The --without-hashes flag is important for building a requirements file that is OS agnostic. Otherwise a requirements file built on a Mac will not be able to resolve any matching dependencies on a Linux-based CI pipeline.

Option 2) Use Python’s built-in tools

Create a basic setup.py:

import os  
from setuptools import find_packages, setup  

def read_requirements():  
    req_path = os.path.join(os.path.dirname(__file__), "requirements.txt")  
    if os.path.exists(req_path):  
        with open(req_path) as f:  
            return f.read().splitlines()  
    return []  

setup(  
    name="<project>",  
    version="0.1.0",  
    install_requires=read_requirements(),  
    packages=find_packages(),  
)

With the setup file in place, we can then build.

2a) Old method (setuptools)

The traditional way of building a Python package is with the setuptools library. Invoking setup.py directly is now largely discouraged, and will show deprecation warnings on the command line when executed.

From the command line call

python setup.py sdist

Including the requirements file in the bundle (required for installation) requires adding it to a MANIFEST.in file, also at the project root.

include requirements.txt

2b) Modern python (build)

The preferred Pythonic way of building a package is using the build package (installable via pip install build).

python -m build

This builds both

  • <project>-0.1.0.tar.gz
  • <project>-0.1.0-py3-none-any.whl

Wheels do not include the dependencies, only references for installation.
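
To see this for yourself, the wheel can be inspected directly; dependencies appear as Requires-Dist metadata entries rather than bundled code. A sketch with placeholder file names:

# A wheel is a zip archive: list its contents and show the declared dependencies
unzip -l dist/<project>-0.1.0-py3-none-any.whl
unzip -p dist/<project>-0.1.0-py3-none-any.whl '*.dist-info/METADATA' | grep Requires-Dist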

Tip

Wheels are better for fast installation and when there are frozen dependencies (such as a release on a production pipeline)

Option 3) Zipping the source

DataProc supports Python code simply zipped into a .zip file. However, that .zip file still requires a top-level setup.py or pyproject.toml to instruct the package manager on how to handle the install.

In this project, the pyproject.toml references the README.md file, so it also needs to be included in the bundle.

3a) Zipping with the pyproject.toml

zip -r \
    dist/<project>-0.1.0-py3-none-any.zip \
    <project or src>/ \
    pyproject.toml \
    README.md
Note

The README.md needs to be included in the bundle as the pyproject.toml makes reference to it.

3b) Zipping with the setup.py

zip -r \
    dist/<project>-0.1.0-py3-none-any.zip \
    <project or src>/ \
    setup.py

Bundling the dependencies in the package

Unfortunately, the nature of Python is that a project inevitably ends up with a lot of dependencies. By default, pip installs these sequentially: it queries an archive server, downloads the package, caches it, then installs it, all over the open internet. If you are not careful to explicitly version each dependency, pip will also try to resolve the entire dependency tree on install, which can slow down the startup by as much as 15 minutes per run.

In a release situation, where the dependencies are frozen and clusters are frequently spun up, it can make sense to include the dependencies within the package. The (larger) package is then copied once from Cloud Storage to the Composer file system, and everything is installed without any calls to the internet.

The only complication is that the bundled dependencies need to exactly match the target runtime; i.e., building a package on a Mac will pull Mac-compiled dependencies that will not run on a Linux server.

Pull the Docker image

Jump into a (Debian) Linux-based Docker image, ideally one that matches the Python version on the DataProc cluster.

docker pull python:3.11-slim

With the image downloaded, create and connect to an interactive session with the image. Note that we connect with bash, not the default Python entrypoint.

docker run -it --rm -v $(pwd):/app -w /app python:3.11-slim bash

Where,

  • -v $(pwd):/app: Mounts your current directory (pwd) to /app in the container.
  • -w /app: Sets the working directory inside the container to /app.

Install dependencies to a target directory

From the container, install the requirements into a local target directory

pip install -r requirements.txt --target ./vendor

This will then pull the exact requirement binary versions for Linux and store them in the vendor directory.

Again, we zip the project code, this time including the dependencies

# vendor/ adds the bundled dependencies to the archive
zip -r \
    dist/<project>-0.1.0-py3-none-any.zip \
    <project or src>/ \
    vendor/ \
    pyproject.toml \
    README.md
Warning

For this project the ZIP size went from ~80KB to 468MB with the usual data science libraries.

However, this should save a lot of time on installation, as there is no resolution required from pip, and there should be no network latency.
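
A related pattern, not benchmarked here but worth noting: instead of installing with --target, the Linux wheels themselves can be downloaded once and installed fully offline, which also avoids resolution and network calls. A sketch:

# Inside the matching Linux container: download the wheels once
pip download -r requirements.txt -d ./wheels

# At install time: no index lookups, no network, just the local wheels
pip install --no-index --find-links=./wheels -r requirements.txt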


Benchmarking cluster initialization

To determine the optimal approach, a benchmarking exercise was conducted to compare different formats, dependency resolutions, and build systems.

Benchmark results

Format  | Versioned Dependencies | Include Dependencies | Build System      | Config         | Build Size | DataProc Init Time
--------|------------------------|----------------------|-------------------|----------------|------------|-------------------
.tar.gz | No                     | No                   | Poetry            | pyproject.toml | 80 KB      | 25+ min
.whl    | No                     | No                   | Poetry            | pyproject.toml | 80 KB      | 7 min
.whl    | Yes                    | No                   | Poetry            | pyproject.toml | 80 KB      | 5 min
.zip    | No                     | No                   | Poetry .whl → zip | pyproject.toml | 80 KB      | 25+ min
.zip    | No                     | Yes                  | Poetry .whl → zip | pyproject.toml | 468 MB     | 22 min
.zip    | No                     | No                   | Poetry .whl → zip | setup.py       | 80 KB      | 12 min
.whl    | No                     | No                   | Python sdist      | setup.py       | 80 KB      | 9 min
.whl    | No                     | No                   | Python build      | setup.py       | 80 KB      | 9 min
.zip    | Yes                    | No                   | Zip               | pyproject.toml | 80 KB      | 25+ min
.zip    | Yes                    | Yes                  | Zip               | pyproject.toml | 468 MB     | 17 min
.zip    | Yes                    | No                   | Zip               | setup.py       | 80 KB      | 12 min
.zip    | Yes                    | Yes                  | Zip               | setup.py       | 468 MB     | 5 min

Key takeaways

1) .whl is the most performant file format

.whl files were the most performant.

.zip files were the second best. Curiously, taking a .whl file and renaming it to .zip almost doubled the cluster initialization time.

.tar.gz was by far the worst format, at more than 5x the install time, before the cluster timed out.

2) Versioned dependencies vastly increase install speed

Explicit versioning saved approx. 30% of cluster initialization time across file formats.

Of course, the version resolution needs to be added to the deployment pipeline, but this only runs once per release and saves resolution every time a job is submitted.

3) Bundling the dependencies yields no real benefit

It was theorized that, as GCS and DataProc would be in the same data centre, moving a larger package internally would be faster than looking up and pulling ~100 libraries across the open internet. This was not the case. It might be that the DataProc service provides a certain amount of caching of common libraries, or that pip can multithread the fetch and install of packages from the internet, but not from a local directory.

4) pyproject.toml performs slightly better than setup.py

Not a significant difference, but it makes sense to avoid a legacy build system that raises warnings and will shortly be deprecated; further, it would pollute the project's top-level directory with more files.

5) Modern build systems produce more performant builds

At first glance, the packages built by sdist, build and poetry appear identical; however, there are slight differences that affect their handling by the package manager at install time. The newer systems install slightly faster.

6) Larger jobs are more performant than increasing parallelization

Given that the package install has to occur every time a job is submitted to a Spark cluster worker, care should be taken to balance the number of jobs against job size. From this investigation, increasing a job's length by 5 minutes would be far more beneficial than creating a new job to do the same work.

For this reason, larger worker nodes utilizing a multithread approach would likely yield the best outcomes.
