DataProc - Passing runtime variables

An explainer on how to pass runtime variables to Google Cloud DataProc for Python AI and ML applications.

This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:

  1. What is DataProc? And why do we need it?
  2. Environments, what’s shared and what’s local
  3. Environment variables
  4. Cluster configuration
  5. Cluster startup optimization
  6. Runtime variables

Passing runtime variables in DataProc and Airflow

Introduction

Using environment variables to pass secrets and configurations at runtime is a DevOps best practice that enhances security, flexibility, and maintainability. Hard-coding sensitive values in source code increases the risk of leaks, while environment variables keep secrets out of version control systems. This also enables dynamic configuration, allowing jobs to adapt across different environments (e.g., DEV, UAT, PRD) without modifying the application code.

In DataProc, variables can be set within Google Cloud Composer and injected into DataProc worker environments, allowing job scripts to retrieve these values dynamically.
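
For example, an environment variable can be attached to the Composer environment itself and read by DAG code via os.environ. A sketch, where the Composer environment name is a placeholder:

gcloud composer environments update <COMPOSER_ENV> \
    --location=${REGION} \
    --update-env-variables=ENV=DEV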


Methods for passing runtime variables

There are several ways to pass variables into the DataProc runtime environment. The following approaches are listed in order of preference.

OS-level initialization script (cluster startup)

Google Cloud’s recommended method is defining a startup script during cluster creation. This script is stored in Cloud Storage and executed as an initialization action.

Example command

gcloud dataproc clusters create <NAME> \
    --region=${REGION} \
    --initialization-actions=gs://<bucket>/startup.sh \
    --initialization-action-timeout=10m \
    ... other flags ...

Using this method, environment variables can be appended to the system's /etc/environment file. The script runs as a final initialization step on every machine that is added to the cluster, so the variables become available on every node.

Example startup script

#!/usr/bin/env bash

echo "FOO=BAR" >> /etc/environment

Compute Engine metadata (instance-level variables)

Since DataProc clusters are built on Google Compute Engine (GCE) instances, metadata can be passed to both head and worker nodes at provisioning time through the Google Cloud management layer.

Example command

gcloud dataproc clusters create <NAME> \
    --metadata foo=BAR,startup-script-url=gs://<bucket>/startup.sh
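
On any node, custom metadata can be read back from the GCE metadata server. A minimal sketch using only the Python standard library (the attribute name foo matches the command above):

import urllib.request

# Custom instance attributes are served by the metadata server;
# the Metadata-Flavor header is required on every request.
url = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo"
req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
with urllib.request.urlopen(req) as resp:
    foo = resp.read().decode("utf-8")
print(foo)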

Cluster properties can also be used to set Hadoop and Spark environment variables (the hadoop-env and spark-env prefixes write to hadoop-env.sh and spark-env.sh respectively):

gcloud dataproc clusters create <NAME> \
    --properties hadoop-env:FOO=hello,spark-env:BAR=world

Spark properties (job submission-level variables)

If the environment variables need to be job-specific, they can be injected directly into the Spark runtime environment when submitting a job.

Example command

gcloud dataproc jobs submit spark \
    --cluster=<NAME> \
    --region=${REGION} \
    --properties spark.executorEnv.FOO=world \
    ... other flags ...
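
Note that spark.executorEnv.* sets the variable on the executor processes, not the driver, so it is read from inside tasks. A minimal sketch, assuming the FOO variable from the command above:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The lambda runs on executors, where spark.executorEnv.FOO is visible.
values = (
    spark.sparkContext
    .parallelize(range(2))
    .map(lambda _: os.environ.get("FOO", "not set"))
    .collect()
)
print(values)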

From Airflow, the DataprocSubmitJobOperator supports passing job-level properties:

DataprocSubmitJobOperator(
    task_id="example_task",
    region=REGION,
    project_id=PROJECT,
    job={
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {
            "main_python_file_uri": f"gs://{BUCKET}/dags/predict.py",
            "python_file_uris": [PKG],
            "args": [
                "--store_key", store,
                "--start_date", PARAMS.start_date.strftime("%Y-%m-%d %H:%M"),
                "--end_date", PARAMS.end_date.strftime("%Y-%m-%d %H:%M"),
            ],
            # spark.executorEnv.* exposes each value as an environment
            # variable on the executors.
            "properties": {
                "spark.executorEnv.ENV": ENV,
                "spark.executorEnv.PROJECT": PROJECT,
                "spark.executorEnv.REGION": REGION,
                "spark.executorEnv.BUCKET": BUCKET,
            },
        },
    },
)
Warning

Some security policies may block custom environment variables from being passed using this method.


Command-line arguments (explicit variable passing)

If other methods fail due to security restrictions, environment variables can be passed explicitly as command-line arguments.

DataprocSubmitJobOperator(
    task_id="example_task",
    region=REGION,
    project_id=PROJECT,
    job={
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {
            "main_python_file_uri": f"gs://{BUCKET}/dags/predict.py",
            "python_file_uris": [PKG],
            "args": [
                "--site", site,
                "--start_date", PARAMS.start_date.strftime("%Y-%m-%d %H:%M"),
                "--end_date", PARAMS.end_date.strftime("%Y-%m-%d %H:%M"),
                "--env", ENV,  # NEW ARGUMENT ADDED
            ],
        },
    },
)

This requires modifying the entry point of the run script to accept the new argument:

parser.add_argument("--env", required=False, default="local", help="Runtime environment.")

Use a secrets manager

The most secure way to manage runtime variables is to store sensitive values in a secrets manager rather than passing them directly.

Why Use a Secrets Manager?

  • Security: Keeps secrets out of logs, DAGs, and environment variables.
  • Access Control: Secrets can be role-restricted to prevent unauthorized access.
  • Versioning: Allows tracking changes to secrets over time.
  • Auditing: Provides logging to track access attempts.
  • Ease of coding: The same variables can be used across deployment environments, so long as each environment has its own Secret Manager.

Google Cloud Secret Manager

Google Cloud Secret Manager provides centralized, access-controlled storage for secrets. All we need to do is add Secret Manager access to the deployment Service Account and then we can replace most variables with a simple secret lookup.
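
Granting that access is a one-off setup step, for example with a project-level IAM binding (the service account name below is a placeholder):

gcloud projects add-iam-policy-binding ${PROJECT} \
    --member="serviceAccount:<deployment-sa>@${PROJECT}.iam.gserviceaccount.com" \
    --role="roles/secretmanager.secretAccessor"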

1) Store a Secret

gcloud secrets create MY_SECRET --replication-policy="automatic"

2) Set a value

echo -n "my_secret_value" | gcloud secrets versions add MY_SECRET --data-file=-

3) Retrieve a secret from Airflow

In Airflow DAGs, secrets can be accessed using the Google Secret Manager hook:

from airflow.providers.google.cloud.hooks.secret_manager import SecretManagerHook

def get_secret(secret_name: str):
    hook = SecretManagerHook()
    return hook.get_secret(secret_name)

MY_SECRET_VALUE = get_secret("MY_SECRET")

4) Retrieve a Secret from DataProc

Alternatively, secrets can be accessed directly from within the DataProc environment using the Google Cloud Secret Manager Python client.

Ensure that the Secret Manager client library is included in the project requirements and installed during project initialization.
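
For example, assuming dependencies are tracked in a requirements.txt:

# requirements.txt
google-cloud-secret-manager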

Then, in your PySpark job, retrieve the secret:

from google.cloud import secretmanager

def get_secret(secret_name: str, project: str) -> str:
    """Retrieve a secret from Google Cloud Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    secret_path = f"projects/{project}/secrets/{secret_name}/versions/latest"
    response = client.access_secret_version(request={"name": secret_path})
    return response.payload.data.decode("UTF-8")

# Example usage
SECRET_NAME = "MY_SECRET"

secret_value = get_secret(SECRET_NAME, PROJECT)

This method has the advantage of pulling secrets only when they are needed at runtime rather than at DAG execution time, which limits their scope to a single function or class rather than the entire execution environment.


Conclusion

When configuring runtime variables in DataProc and Airflow, the best method depends on security policies and operational requirements:

Method                     Use case
OS-level startup script    Best for cluster-wide persistent variables
Compute Engine metadata    Works at the instance level; useful for per-node settings
Spark properties           Best for job-specific runtime variables
Command-line arguments     A fallback option when security policies restrict variable injection
Secrets manager            The most secure option for sensitive values

For sensitive data such as API keys, database credentials, and encryption secrets, Google Cloud Secret Manager should always be used.

This structured approach ensures secure and flexible runtime configuration across different environments.
