This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:
- What is DataProc? And why do we need it?
- Environments, what’s shared and what’s local
- Environment variables
- Cluster configuration
- Cluster startup optimization
- Runtime variables
DataProc cluster configuration
Introduction
In many DataProc projects, jobs are executed under nearly identical conditions — they typically have low memory and CPU requirements and access the same data sources.
To avoid duplicating boilerplate configuration across multiple DAGs, a shared configuration file can be created under the /dag directory. Each environment (e.g., development, UAT, production) can have its own <env>.py file that defines a standard cluster configuration. This shared configuration can then be imported into all relevant DAGs, with project-specific overrides applied when necessary.
Example shared configuration file
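One way such a shared file might look, sketched here with Airflow's ClusterGenerator helper; the module path, project ID, bucket, and machine types are placeholders rather than values from any real project.

```python
# dag/config/dev.py (hypothetical path and names): shared Dataproc settings
# for the development environment.
from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator

PROJECT_ID = "my-project-dev"          # placeholder
REGION = "us-central1"                 # placeholder
BUCKET = "my-project-dev-dataproc"     # placeholder staging bucket

DEFAULT_DAG_ARGS = {
    "owner": "data-engineering",
    "retries": 0,
}


def default_cluster_config(**overrides):
    """Build the standard cluster config, applying per-DAG overrides on top."""
    params = {
        "project_id": PROJECT_ID,
        "storage_bucket": BUCKET,
        "num_masters": 1,
        "num_workers": 2,              # Dataproc needs at least two workers
        "master_machine_type": "n1-standard-2",
        "worker_machine_type": "n1-standard-2",
        "auto_delete_ttl": 30 * 60,    # total cluster lifetime in seconds, not idle time
    }
    params.update(overrides)
    return ClusterGenerator(**params).make()
```

A DAG with heavier requirements can then call something like default_cluster_config(num_workers=4, worker_machine_type="n1-highmem-8") and override only the fields that differ.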
Number of workers
The minimum number of workers (num_workers) must be greater than 1. DataProc requires at least two workers to form a fully functional cluster.
Clusters configured with only a single worker will pass initial DAG validation and begin provisioning, but will ultimately time out with an unclear error message.
Cluster timeout (auto_delete_ttl)
The auto_delete_ttl parameter is often misunderstood. It defines the total lifetime of the cluster, not an idle timeout.
If this value is set below 10 minutes (600 seconds), the DAG will validate successfully and cluster provisioning will start, but the cluster will silently terminate without producing any meaningful error messages.
Typical startup durations
For small Spark clusters with two standard workers in North American regions, observed provisioning times are:
- Minimum startup time: Approximately 7 minutes
- Cold start (e.g., Monday morning): Up to 15 minutes
Clusters that install common data science libraries such as NumPy, scikit-learn, and pandas will often exceed 12.5 minutes for startup, even under ideal conditions.
Poorly optimized initialization scripts, especially those with complex or unconstrained pip install commands, can easily push startup times beyond 10 minutes.
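As an illustration of what a constrained install can look like, the sketch below assumes Google's public pip-install initialization action, which reads a space-separated PIP_PACKAGES metadata value; the bucket region and library versions are arbitrary, and default_cluster_config is the hypothetical helper from the shared configuration example above.

```python
# Hypothetical: reuse the shared helper and add the public pip-install
# initialization action, which installs the packages listed in PIP_PACKAGES.
heavy_cluster_config = default_cluster_config(
    init_actions_uris=[
        "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"
    ],
    # Pinned versions avoid slow dependency resolution during startup;
    # the versions here are illustrative only.
    metadata={"PIP_PACKAGES": "numpy==1.26.4 pandas==2.1.4 scikit-learn==1.3.2"},
)
```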
Recommended configuration
Despite natural cost-saving instincts, I strongly recommend setting the auto_delete_ttl parameter significantly higher than you might expect: 20 to 30 minutes is a reasonable default.
For DAGs involving model training jobs, you may need to extend the timeout to several hours.
Because all DAGs include a cluster teardown task using trigger_rule="all_done", a generously high timeout does not pose a risk of excessive costs. The cluster will always be cleaned up automatically, even if the job fails or gets stuck.
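For reference, a teardown task of that kind might look like the sketch below; the config module and cluster name are placeholders, and the full DAG appears in the next section.

```python
from airflow.providers.google.cloud.operators.dataproc import DataprocDeleteClusterOperator
from airflow.utils.trigger_rule import TriggerRule

from config.dev import PROJECT_ID, REGION  # hypothetical shared config module

# Runs whether upstream tasks succeed, fail, or are skipped, so the generous
# auto_delete_ttl acts only as a safety net, not the primary cleanup mechanism.
delete_cluster = DataprocDeleteClusterOperator(
    task_id="delete_cluster",
    project_id=PROJECT_ID,
    region=REGION,
    cluster_name="example-feature-cluster",  # placeholder
    trigger_rule=TriggerRule.ALL_DONE,
)
```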
DAG structure
With the previous configuration steps complete, the DAG itself follows a standard structure:
- Pull key variables from the Composer environment, including the environment name, region, and bucket.
- Retrieve the default cluster configuration from a shared submodule.
- Load default DAG arguments from the shared submodule.
- Assemble the DAG: cluster spin-up → PySpark job submission → cluster teardown.
Example DAG structure
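A condensed sketch of such a DAG, assuming the shared configuration module from earlier; the DAG ID, cluster name, bucket paths, and schedule are placeholders, and the environment-variable lookups covered elsewhere in the series are omitted for brevity.

```python
# example_feature_dag.py (hypothetical): module paths, names, and URIs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

from config.dev import BUCKET, DEFAULT_DAG_ARGS, PROJECT_ID, REGION, default_cluster_config

CLUSTER_NAME = "example-feature-cluster"  # lowercase letters, numbers, and dashes only

with DAG(
    dag_id="example-feature-dag",
    default_args=DEFAULT_DAG_ARGS,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 1. Spin up the cluster from the shared configuration.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=default_cluster_config(),
    )

    # 2. Submit the PySpark job (the job dictionary is covered in detail below).
    submit_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/jobs/run_feature_job.py"},
        },
    )

    # 3. Tear down the cluster even if the job fails or hangs.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> submit_job >> delete_cluster
```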
Important considerations
As with DAG IDs, cluster names should contain only lowercase letters, numbers, and dashes. Avoid underscores and periods, as they will silently break cluster provisioning.
DAG files should avoid direct imports of business logic. Since DAGs are parsed by the Composer environment (a minimal Python environment), adding dependencies requires formal approval and complicates deployment.
Business logic should be packaged as a .whl, .zip, or .tar.gz file and referenced using the python_file_uris argument in the job configuration.
Job submission and argument handling
Since Airflow and Composer operate independently from the DataProc/Spark cluster, relative imports into DAG files are not possible. Instead, all arguments must be explicitly passed in the DataprocSubmitJobOperator.job configuration as a valid DataProc job dictionary.
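A sketch of what that job dictionary can look like for a PySpark job; every URI, table name, and argument value below is a placeholder, and PROJECT_ID, BUCKET, and CLUSTER_NAME are assumed to be the values defined in the DAG example above.

```python
# Hypothetical job dictionary for DataprocSubmitJobOperator(job=...).
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        # Thin entry-point script in GCS: parses arguments and calls the packaged logic.
        "main_python_file_uri": f"gs://{BUCKET}/jobs/run_feature_job.py",
        # Packaged business logic (.whl, .zip, or .tar.gz) distributed to the cluster.
        "python_file_uris": [f"gs://{BUCKET}/packages/feature_lib-0.1.0-py3-none-any.whl"],
        # Everything the job needs must arrive as explicit command-line arguments.
        "args": [
            "--env", "dev",
            "--input-table", "project.dataset.table",
            "--output-path", f"gs://{BUCKET}/output/features/",
        ],
    },
}
```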
Reworking existing code
Business logic must be restructured to accept command-line arguments using argparse, rather than relying on imported configurations.
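A minimal sketch of the corresponding entry point; the argument names mirror the hypothetical job dictionary above, and feature_lib is a stand-in for the packaged business logic.

```python
# run_feature_job.py (hypothetical entry point).
import argparse

from pyspark.sql import SparkSession


def parse_args():
    parser = argparse.ArgumentParser(description="Run the feature job on Dataproc")
    parser.add_argument("--env", required=True, help="Target environment (dev, uat, prod)")
    parser.add_argument("--input-table", required=True, help="Fully qualified source table")
    parser.add_argument("--output-path", required=True, help="GCS path for the job output")
    return parser.parse_args()


def main():
    args = parse_args()
    spark = SparkSession.builder.appName(f"feature-job-{args.env}").getOrCreate()

    # The heavy lifting lives in the packaged library shipped via python_file_uris.
    from feature_lib.pipeline import run_pipeline
    run_pipeline(spark, input_table=args.input_table, output_path=args.output_path)


if __name__ == "__main__":
    main()
```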