DataProc - Passing runtime variables

An explainer on how to pass runtime variables to Google Cloud DataProc for Python AI and ML applications.

This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:

  1. What is DataProc? And why do we need it?
  2. Environments, what’s shared and what’s local
  3. Environment variables
  4. Cluster configuration
  5. Cluster startup optimization
  6. Runtime variables

Passing runtime variables in DataProc and Airflow

Introduction

Using environment variables to pass secrets and configurations at runtime is a DevOps best practice that enhances security, flexibility, and maintainability. Hard-coding sensitive values in source code increases the risk of leaks, while environment variables keep secrets out of version control systems. This also enables dynamic configuration, allowing jobs to adapt across different environments (e.g., DEV, UAT, PRD) without modifying the application code.

In DataProc, variables can be set within Google Cloud Composer and injected into DataProc worker environments, allowing job scripts to retrieve these values dynamically.


Methods for passing runtime variables

DataProc and Apache Spark provide multiple ways to pass variables into the runtime environment. The following are the four primary approaches, listed in order of preference, followed by a fifth option: a secrets manager.

OS-level initialization script (cluster startup)

Google Cloud’s recommended method is defining a startup script during cluster creation. This script is stored in Cloud Storage and executed as an initialization action.

Example command

gcloud dataproc clusters create <NAME> \
    --region=${REGION} \
    --initialization-actions=gs://<bucket>/startup.sh \
    --initialization-action-timeout=10m \
    ... other flags ...

Using this method, environment variables can be appended to the system’s /etc/environment file; the initialization action runs as a final provisioning step on every machine added to the cluster.

Example startup script

#!/usr/bin/env bash

echo "FOO=BAR" >> /etc/environment
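Once the cluster is up, any job running on its nodes can read the variable back with a standard environment lookup. A minimal sketch (the FOO name matches the startup script above):

```python
import os

def get_cluster_env(name: str, default: str = "") -> str:
    """Read a variable that the startup script exported via /etc/environment."""
    return os.environ.get(name, default)

# On a cluster bootstrapped with the script above:
# get_cluster_env("FOO")  -> "BAR"
```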

Compute engine metadata (instance-level variables)

Since DataProc clusters are built on Google Compute Engine (GCE) instances, metadata can be passed to both head and worker nodes at provisioning time via the Google Cloud management layer.

Example command

gcloud dataproc clusters \
    create <NAME> \
    --metadata foo=BAR,startup-script-url=gs://<bucket>/startup.sh
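Inside a running node, these values are served by the GCE metadata server. A minimal Python sketch of the lookup (the foo key matches the --metadata flag above; the Metadata-Flavor header is required by the server, and the call only succeeds on a GCE/DataProc node):

```python
import urllib.request

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/{key}"

def get_instance_metadata(key: str, timeout: float = 5.0) -> str:
    """Fetch a custom metadata value from the node-local GCE metadata server."""
    request = urllib.request.Request(
        METADATA_URL.format(key=key),
        headers={"Metadata-Flavor": "Google"},  # requests without this header are rejected
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")

# On a cluster created with the command above:
# get_instance_metadata("foo")  -> "BAR"
```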

Metadata can also be used for Hadoop and Spark properties:

gcloud dataproc clusters \
    create <NAME> \
    --properties hadoop-env:FOO=hello,spark-env:BAR=world

Spark properties (job submission-level variables)

If the environment variables need to be job-specific, they can be injected directly into the Spark runtime environment when submitting a job.

Example command

gcloud dataproc jobs \
    submit spark \
    --properties spark.executorEnv.FOO=world
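Values passed via spark.executorEnv.* surface as ordinary environment variables on each executor, so job code reads them with a plain os.environ lookup. A sketch of how a job might gather its injected configuration (load_job_config is an illustrative helper, not part of any API):

```python
import os

def load_job_config(keys: list[str]) -> dict[str, str]:
    """Collect executor-side environment variables injected at job submission.

    Each spark.executorEnv.<NAME> property becomes a plain environment
    variable on the executors; missing keys default to an empty string.
    """
    return {key: os.environ.get(key, "") for key in keys}

# Inside a task running on an executor, for the submit command above:
# load_job_config(["FOO"])  -> {"FOO": "world"}
```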

From Airflow, the DataprocSubmitJobOperator supports passing job-level properties:

DataprocSubmitJobOperator(
    task_id="example_task",
    region=REGION,
    project_id=PROJECT,
    job={
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {
            "main_python_file_uri": f"gs://{BUCKET}/dags/predict.py",
            "python_file_uris": [PKG],
            "args": [
                "--store_key", store,
                "--start_date", PARAMS.start_date.strftime("%Y-%m-%d %H:%M"),
                "--end_date", PARAMS.end_date.strftime("%Y-%m-%d %H:%M"),
            ],
            "properties": {
                "spark.executorEnv.ENV": ENV,
                "spark.executorEnv.PROJECT": PROJECT,
                "spark.executorEnv.REGION": REGION,
                "spark.executorEnv.BUCKET": BUCKET,
            },
        },
    },
)

Warning

Some security policies may block custom environment variables from being passed using this method.


Command-line arguments (explicit variable passing)

If other methods fail due to security restrictions, environment variables can be passed explicitly as command-line arguments.

DataprocSubmitJobOperator(
    task_id="example_task",
    region=REGION,
    project_id=PROJECT,
    job={
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {
            "main_python_file_uri": f"gs://{BUCKET}/dags/predict.py",
            "python_file_uris": [PKG],
            "args": [
                "--site", site,
                "--start_date", PARAMS.start_date.strftime("%Y-%m-%d %H:%M"),
                "--end_date", PARAMS.end_date.strftime("%Y-%m-%d %H:%M"),
                "--env",  # NEW ARGUMENT ADDED
                ENV,
            ],
        },
    },
)

This requires modifying the entry point of the run script to accept the new argument:

parser.add_argument("--env", required=False, default="local", help="Runtime environment.")
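Putting it together, a sketch of what the entry point of predict.py might look like with arguments matching the operator above (the argument names and help text are illustrative):

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    """Parse the job's command-line arguments (names match the operator above)."""
    parser = argparse.ArgumentParser(description="Prediction job entry point.")
    parser.add_argument("--site", required=True, help="Site identifier.")
    parser.add_argument("--start_date", required=True, help="Window start (YYYY-MM-DD HH:MM).")
    parser.add_argument("--end_date", required=True, help="Window end (YYYY-MM-DD HH:MM).")
    parser.add_argument("--env", required=False, default="local", help="Runtime environment.")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"Running in {args.env}")
```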

Use a secrets manager

The most secure way to manage runtime variables is to store sensitive values in a secrets manager rather than passing them directly.

Why Use a Secrets Manager?

  • Security: Keeps secrets out of logs, DAGs, and environment variables.
  • Access Control: Secrets can be role-restricted to prevent unauthorized access.
  • Versioning: Allows tracking changes to secrets over time.
  • Auditing: Provides logging to track access attempts.
  • Ease of coding: The same variables can be used across deployment environments, so long as each environment has its own Secret Manager.

Google Cloud Secret Manager

Google Cloud Secret Manager provides centralized, access-controlled storage for secrets. All we need to do is add Secret Manager access to the deployment Service Account and then we can replace most variables with a simple secret lookup.

1) Store a Secret

gcloud secrets create MY_SECRET --replication-policy="automatic"

2) Set a value

echo -n "my_secret_value" | gcloud secrets versions add MY_SECRET --data-file=-

3) Retrieve a Secret from Airflow

In Airflow DAGs, secrets can be accessed using the Google provider’s Secret Manager hook:

from airflow.providers.google.cloud.hooks.secret_manager import SecretsManagerHook

def get_secret(secret_name: str) -> str:
    hook = SecretsManagerHook()
    return hook.get_secret(secret_id=secret_name)

MY_SECRET_VALUE = get_secret("MY_SECRET")

4) Retrieve a Secret from DataProc

Alternatively, secrets can be accessed directly from within the DataProc environment using the Google Cloud Secret Manager Python client.

Ensure that the Secret Manager library (google-cloud-secret-manager) is included in the project requirements and installed during project initialization.

Then, in your PySpark job, retrieve the secret:

from google.cloud import secretmanager

def get_secret(secret_name: str, project: str) -> str:
    """Retrieve a secret from Google Cloud Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    secret_path = f"projects/{project}/secrets/{secret_name}/versions/latest"
    response = client.access_secret_version(request={"name": secret_path})
    return response.payload.data.decode("UTF-8")

# Example usage
SECRET_NAME = "MY_SECRET"

secret_value = get_secret(SECRET_NAME, PROJECT)

This method has the advantage of pulling secrets only when they are needed at runtime, rather than when the DAG is parsed or executed - this limits their scope to a single function or class rather than the entire execution environment.


Conclusion

When configuring runtime variables in DataProc and Airflow, the best method depends on security policies and operational requirements:

  • OS-level startup script: best for cluster-wide persistent variables
  • Compute Engine metadata: works at the instance level; useful for per-node settings
  • Spark properties: best for job-specific runtime variables
  • Command-line arguments: a fallback option when security policies restrict variable injection
  • Secrets manager: the most secure option for sensitive values

For sensitive data such as API keys, database credentials, and encryption secrets, Google Cloud Secret Manager should always be used.

This structured approach ensures secure and flexible runtime configuration across different environments.
