DataProc - Cluster startup optimization

An explainer on how to optimize Google Cloud DataProc startup for speed and cost in Python AI and ML workflows.

This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:

  1. What is DataProc? And why do we need it?
  2. Environments, what’s shared and what’s local
  3. Environment variables
  4. Cluster configuration
  5. Cluster startup optimization
  6. Runtime variables

DataProc cluster startup optimization

Introduction

Cluster startup time is a crucial factor in DataProc job efficiency. When a DataProc cluster is provisioned, worker nodes are allocated from a Compute Instance pool. The provisioning time depends on machine type availability in the compute region. Larger memory machines typically take longer to provision due to lower availability in the data center.

Once provisioned, workers automatically apply the DataProc image, which includes Java, Scala (for Spark), and Python 3.11 with Conda. However, no additional libraries are installed, meaning all project dependencies must be configured before executing any job.

To streamline the startup process, the initialization script follows three key steps:

  1. OS package update (apt-get)
  2. Install OS binaries (e.g., required system dependencies)
  3. Install Python requirements
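
For context, such a script typically lives in Cloud Storage and is attached when the cluster is created. A minimal sketch, where the cluster name, region, bucket and script path are all placeholder assumptions:

# Attach the startup script as a DataProc initialization action
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --initialization-actions=gs://<bucket>/scripts/init.sh \
    --initialization-action-timeout=10m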

Optimizing dependency installation

There are multiple ways to install Python dependencies within a DataProc runtime. The choice of method significantly impacts cluster startup time.

The key options include:

  • Installing from a requirements file
  • Installing from a built Python package
  • Using Poetry for dependency resolution

Option 1: Installing from a requirements file

#!/bin/bash  

# Install required OS libraries  
apt-get update -y  
apt-get install --yes make coinor-cbc coinor-libcbc-dev libgomp1  

# Variables  
PACKAGE_BUCKET="gs://<bucket>/data"  
LOCAL_PACKAGE_DIR="/opt/app"  

# Copy the requirements file from GCS  
gsutil cp $PACKAGE_BUCKET/requirements.txt ./requirements.txt  

# Install dependencies  
pip install -r requirements.txt

Option 2: Installing from a .whl package

#!/bin/bash  

# Install required OS libraries  
apt-get update -y  
apt-get install --yes make coinor-cbc coinor-libcbc-dev libgomp1  

# Variables  
PACKAGE_BUCKET="gs://<bucket>/data"  
LATEST_PACKAGE=$(gsutil ls "$PACKAGE_BUCKET/<project>-*-py3-none-any.whl" | tail -1)   
LOCAL_DIR="/opt/app"  

# Copy the package from GCS  
mkdir -p $LOCAL_DIR  
gsutil cp $LATEST_PACKAGE $LOCAL_DIR    

# Install dependencies  
cd $LOCAL_DIR  
ls -1 <project>-*-py3-none-any.whl | xargs pip install

Final optimization strategy

Based on extensive testing, the most efficient approach was determined to be:

#!/bin/bash  

# Variables  
REQUIREMENTS="gs://<bucket>/data/conf/requirements.txt"  

# Copy and install dependencies  
gsutil cp $REQUIREMENTS ./requirements.txt  
pip install -r requirements.txt
Note

The COIN-OR installation could be removed because a version update to the data science libraries meant the solver binaries were now built into the Python libraries themselves.

Key optimizations

  1. Dropped apt-get updates: eliminating OS package updates significantly reduces startup time (~3 minutes saved).
  2. Explicit dependency versioning: pre-resolving dependency versions avoids unnecessary resolution during installation (see the sketch below).
  3. Using pip over poetry or conda: pip provides the fastest installation for pre-versioned dependencies.
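
To make the versioning explicit, the requirements file can be fully pinned once per release, leaving pip with no resolution work at cluster startup. A minimal sketch (versions illustrative):

# Freeze the fully resolved environment once, e.g. in the release pipeline.
# Output lines look like: numpy==1.26.4, pandas==2.1.4, ...
pip freeze > requirements.txt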

Python package management for DataProc

When submitting PySpark jobs, dependencies must be packaged for the Spark environment. Unlike cluster-wide dependencies installed via startup scripts, these packages are defined at the job level.
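
As a sketch of how a package is attached at submission time (the cluster name, paths and bundle name are placeholders, assuming a bundle built with one of the methods below):

# Submit a PySpark job, shipping the packaged code to the workers
gcloud dataproc jobs submit pyspark gs://<bucket>/jobs/main.py \
    --cluster=example-cluster \
    --region=us-central1 \
    --py-files=gs://<bucket>/dist/<project>-0.1.0-py3-none-any.zip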

Five methods for building a Python package

Option 1) Poetry build

With no changes to the project, call poetry to build a wheel using the pyproject.toml file.

poetry build -f wheel

A requirements file is still required for cluster setup. To export the dependencies:

poetry export \
    -f requirements.txt \
    --with data \
    --output requirements.txt \
    --without-hashes

Here we use --with to select the dependency groups that are required.
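
For reference, a dependency group is declared in the pyproject.toml. A minimal sketch, assuming a group named data with illustrative packages:

[tool.poetry.group.data.dependencies]
pandas = "^2.1"
numpy = "^1.26"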

Important

The --without-hashes flag is important for building a requirements file that is OS agnostic. Otherwise a requirements file built on a Mac will not be able to resolve any matching dependencies on a Linux-based CI pipeline.

Option 2) Use Python’s built-in tools

Create a basic setup.py:

import os  
from setuptools import find_packages, setup  

def read_requirements():  
    req_path = os.path.join(os.path.dirname(__file__), "requirements.txt")  
    if os.path.exists(req_path):  
        with open(req_path) as f:  
            return f.read().splitlines()  
    return []  

setup(  
    name="<project>",  
    version="0.1.0",  
    install_requires=read_requirements(),  
    packages=find_packages(),  
)

With the setup file in place, we can then build.

2a) Old method (setuptools)

The traditional way of building a Python package is with the setuptools library. Invoking setup.py directly is now largely discouraged, and will show deprecation warnings on the command line when executed.

From the command line call

python setup.py sdist

Including the requirements file in the bundle (required for installation) requires adding it to a MANIFEST.in file, also at the project root.

include requirements.txt

2b) Modern python (build)

The preferred Pythonic way of building a package is using the build package (installable via pip install build).

python -m build

This builds both

  • <project>-0.1.0.tar.gz
  • <project>-0.1.0-py3-none-any.whl

Wheels do not include the dependencies, only references for installation.
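
To see this for yourself, the wheel can be inspected directly; dependencies appear as Requires-Dist metadata entries rather than bundled code. A sketch with placeholder file names:

# A wheel is a zip archive: list its contents and show the declared dependencies
unzip -l dist/<project>-0.1.0-py3-none-any.whl
unzip -p dist/<project>-0.1.0-py3-none-any.whl '*.dist-info/METADATA' | grep Requires-Dist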

Tip

Wheels are better for fast installation and when there are frozen dependencies (such as a release on a production pipeline)

Option 3) Zipping the source

DataProc supports Python code simply zipped into a .zip file. However, that .zip file still requires a top-level setup.py or pyproject.toml to instruct the package manager on how to handle the install.

In this project, the pyproject.toml references the README.md file, so it also needs to be included in the bundle.

3a) Zipping with the pyproject.toml

zip -r \
    dist/<project>-0.1.0-py3-none-any.zip \
    <project or src>/ \
    pyproject.toml \
    README.md
Note

The README.md needs to be included in the bundle as the pyproject.toml makes reference to it.

3b) Zipping with the setup.py

zip -r \
    dist/<project>-0.1.0-py3-none-any.zip \
    <project or src>/ \
    setup.py

Bundling the dependencies in the package

Unfortunately, the nature of Python is that a project inevitably ends up with a lot of dependencies. By default, pip installs these sequentially: it queries an archive server, downloads the package, caches it, then installs it, all over the open internet. If you are not careful to explicitly version each dependency, pip will also try to resolve the entire dependency tree on install, which can slow down the startup by as much as 15 minutes per run.

In a release situation, where the dependencies are frozen and clusters are frequently spun up, it can make sense to include the dependencies within the package. The (larger) package is then copied once from Cloud Storage to the Composer file system, and everything is installed without any calls to the internet.

The only complication is that the bundled dependencies need to exactly match the target runtime; i.e., building a package on a Mac will pull Mac-compiled dependencies that will not run on a Linux server.

Pull the Docker image

Jump into a (Debian) Linux-based Docker image, ideally one that matches the Python version on the DataProc cluster.

docker pull python:3.11-slim

With the image downloaded, create and connect to an interactive session with the image. Note that we connect with bash, not the default Python entrypoint.

docker run -it --rm -v $(pwd):/app -w /app python:3.11-slim bash

Where,

  • -v $(pwd):/app: Mounts your current directory (pwd) to /app in the container.
  • -w /app: Sets the working directory inside the container to /app.

Install dependencies to a target directory

From the container, install the requirements into a local target directory

pip install -r requirements.txt --target ./vendor

This will then pull the exact requirement binary versions for Linux and store them in the vendor directory.

Again, we zip the project code, this time including the dependencies

# vendor/ adds the bundled dependencies to the archive
zip -r \
    dist/<project>-0.1.0-py3-none-any.zip \
    <project or src>/ \
    vendor/ \
    pyproject.toml \
    README.md
Warning

For this project the ZIP size went from ~80KB to 468MB with the usual data science libraries.

However, this should save a lot of time on installation, as there is no resolution required from pip, and there should be no network latency.
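
A related pattern, not benchmarked here but worth noting: instead of installing with --target, the Linux wheels themselves can be downloaded once and installed fully offline, which also avoids resolution and network calls. A sketch:

# Inside the matching Linux container: download the wheels once
pip download -r requirements.txt -d ./wheels

# At install time: no index lookups, no network, just the local wheels
pip install --no-index --find-links=./wheels -r requirements.txt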


Benchmarking cluster initialization

To determine the optimal approach, a benchmarking exercise was conducted to compare different formats, dependency resolutions, and build systems.

Benchmark results

Format  | Versioned Dependencies | Include Dependencies | Build System      | Config         | Build Size | DataProc Init Time
--------|------------------------|----------------------|-------------------|----------------|------------|-------------------
.tar.gz | No                     | No                   | Poetry            | pyproject.toml | 80 KB      | 25+ min
.whl    | No                     | No                   | Poetry            | pyproject.toml | 80 KB      | 7 min
.whl    | Yes                    | No                   | Poetry            | pyproject.toml | 80 KB      | 5 min
.zip    | No                     | No                   | Poetry .whl → zip | pyproject.toml | 80 KB      | 25+ min
.zip    | No                     | Yes                  | Poetry .whl → zip | pyproject.toml | 468 MB     | 22 min
.zip    | No                     | No                   | Poetry .whl → zip | setup.py       | 80 KB      | 12 min
.whl    | No                     | No                   | Python sdist      | setup.py       | 80 KB      | 9 min
.whl    | No                     | No                   | Python build      | setup.py       | 80 KB      | 9 min
.zip    | Yes                    | No                   | Zip               | pyproject.toml | 80 KB      | 25+ min
.zip    | Yes                    | Yes                  | Zip               | pyproject.toml | 468 MB     | 17 min
.zip    | Yes                    | No                   | Zip               | setup.py       | 80 KB      | 12 min
.zip    | Yes                    | Yes                  | Zip               | setup.py       | 468 MB     | 5 min

Key takeaways

1) .whl is the most performant file format

.whl files were the most performant.

.zip files were the second best. Curiously, taking a .whl file and renaming it to .zip almost doubled the cluster initialization time.

.tar.gz was by far the worst format, at more than 5x the install time, before the cluster timed out.

2) Versioned dependencies vastly increase install speed

Explicit versioning saved approx. 30% of cluster initialization time across file formats.

Of course, the version resolution needs to be added to the deployment pipeline, but this only runs once per release and saves resolution every time a job is submitted.

3) Bundling the dependencies yields no real benefit

It was theorized that, as GCS and DataProc would be in the same data centre, moving a larger package internally would be faster than looking up and pulling ~100 libraries across the open internet. This was not the case. It might be that the DataProc service provides a certain amount of caching of common libraries, or that pip can multithread the fetch and install of packages from the internet, but not from a local directory.

4) pyproject.toml performs slightly better than setup.py

Not a significant difference, but it makes sense to avoid a legacy build system that raises warnings and will shortly be deprecated; further, it would pollute the project's top-level directory with more files.

5) Modern build systems produce more performant builds

At first glance, the packages built by sdist, build and poetry appear identical; however, there are slight differences that affect their handling by the package manager at install time. The newer systems install slightly faster.

6) Larger jobs are more performant than increasing parallelization

Given that the package install has to occur every time a job is submitted to a Spark cluster worker, care should be taken to balance the number of jobs against job size. From this investigation, increasing a job's length by 5 minutes would be far more beneficial than creating a new job to do the same work.

For this reason, larger worker nodes utilizing a multithread approach would likely yield the best outcomes.
