This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:
- What is DataProc? And why do we need it?
- Environments, what’s shared and what’s local
- Environment variables
- Cluster configuration
- Cluster startup optimization
- Runtime variables
Passing runtime variables in DataProc and Airflow
Introduction
Using environment variables to pass secrets and configurations at runtime is a DevOps best practice that enhances security, flexibility, and maintainability. Hard-coding sensitive values in source code increases the risk of leaks, while environment variables keep secrets out of version control systems. This also enables dynamic configuration, allowing jobs to adapt across different environments (e.g., DEV, UAT, PRD) without modifying the application code.
In DataProc, variables can be set within Google Cloud Composer and injected into DataProc worker environments, allowing job scripts to retrieve these values dynamically.
Methods for passing runtime variables
There are several ways to pass variables into the Spark runtime environment. The four primary approaches are described below in order of preference, followed by the most secure option: retrieving values from a secrets manager.
OS-level initialization script (cluster startup)
Google Cloud’s recommended method is to define a startup script during cluster creation. The script is stored in Cloud Storage and executed as an initialization action on every node.
Example command
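A minimal sketch of such a command (the cluster name, region, and bucket path are placeholders):

```bash
# Create a cluster whose nodes each run the startup script stored in Cloud Storage.
# Cluster name, region, and bucket path are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/scripts/set-env.sh
```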
Using this method, environment variables can be appended to the system’s /etc/environment file; the script runs as the final initialization step on every machine that is added to the cluster.
Example startup script
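A sketch of a startup script that appends variables to /etc/environment (the variable names and values are placeholders):

```bash
#!/bin/bash
# Append environment variables to /etc/environment so they are picked up
# by sessions on the node. Initialization actions run as root, so the file
# is writable here. Names and values are placeholders.
cat <<EOF >> /etc/environment
DEPLOY_ENV=DEV
DATA_BUCKET=gs://my-bucket
EOF
```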
Compute Engine metadata (instance-level variables)
Since DataProc clusters are built on Google Compute Engine (GCE) instances, metadata can be passed to both the master and worker nodes at provisioning time from the Google management layer.
Example command
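A sketch (the metadata keys and values are placeholders):

```bash
# Attach custom metadata to every node in the cluster at creation time.
# Keys and values are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --metadata=DEPLOY_ENV=DEV,DATA_BUCKET=gs://my-bucket

# From inside a node, a value can be read back from the metadata server:
curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/DEPLOY_ENV"
```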
Metadata can also be used for Hadoop and Spark properties:
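A plausible form, assuming the gcloud --properties flag with its spark-env:/hadoop-env: prefixes is the mechanism intended (keys and values are placeholders):

```bash
# Write environment values into spark-env.sh and hadoop-env.sh on every node
# via cluster properties. Keys and values are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='spark-env:DEPLOY_ENV=DEV,hadoop-env:DEPLOY_ENV=DEV'
```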
Spark properties (job submission-level variables)
If the environment variables need to be job-specific, they can be injected directly into the Spark runtime environment when submitting a job.
Example command
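A sketch using Spark’s executor and application-master environment properties (names and values are placeholders):

```bash
# Inject environment variables into the Spark executors and the YARN
# application master for this job only. Names and values are placeholders.
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties=spark.executorEnv.DEPLOY_ENV=DEV,spark.yarn.appMasterEnv.DEPLOY_ENV=DEV
```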
From Airflow, the DataprocSubmitJobOperator
supports passing job-level properties:
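A sketch of the operator call (the project, region, cluster, and file URIs are placeholders):

```python
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Submit a PySpark job with job-level Spark properties.
# Project, region, cluster, and URIs are placeholders.
submit_job = DataprocSubmitJobOperator(
    task_id="submit_pyspark_job",
    project_id="my-project",
    region="us-central1",
    job={
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/jobs/my_job.py",
            "properties": {
                "spark.executorEnv.DEPLOY_ENV": "DEV",
                "spark.yarn.appMasterEnv.DEPLOY_ENV": "DEV",
            },
        },
    },
)
```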
Some security policies may block custom environment variables from being passed using this method.
Command-line arguments (explicit variable passing)
If other methods fail due to security restrictions, environment variables can be passed explicitly as command-line arguments.
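A sketch; everything after the bare -- separator is forwarded to the job as ordinary arguments (the argument name and value are placeholders):

```bash
# Pass the value explicitly as a job argument rather than an environment variable.
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- --deploy-env=DEV
```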
This requires modifying the entry point of the run script to accept the new argument:
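For example, the entry point might parse the argument with argparse (the argument name is a placeholder):

```python
import argparse

# Parse the extra command-line argument at the top of the job's entry point.
# The argument name is a placeholder.
parser = argparse.ArgumentParser()
parser.add_argument("--deploy-env", required=True)
args = parser.parse_args()

deploy_env = args.deploy_env
print(f"Running against the {deploy_env} environment")
```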
Use a secrets manager
The most secure way to manage runtime variables is to store sensitive values in a secrets manager rather than passing them directly.
Why use a secrets manager?
- Security: Keeps secrets out of logs, DAGs, and environment variables.
- Access Control: Secrets can be role-restricted to prevent unauthorized access.
- Versioning: Allows tracking changes to secrets over time.
- Auditing: Provides logging to track access attempts.
- Ease of coding: The same variables can be used across deployment environments, so long as each environment has its own Secret Manager.
Google Cloud Secret Manager
Google Cloud Secret Manager provides centralized, access-controlled storage for secrets. All we need to do is grant the deployment service account access to Secret Manager, and then we can replace most variables with a simple secret lookup.
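For example, access might be granted with an IAM binding along these lines (the project and service-account names are placeholders):

```bash
# Grant the deployment service account read access to secret payloads.
# Project and service-account names are placeholders.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:deployer@my-project.iam.gserviceaccount.com" \
    --role="roles/secretmanager.secretAccessor"
```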
1) Store a secret
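A sketch (the secret name is a placeholder):

```bash
# Create an empty secret container; the value is added as a version in the next step.
gcloud secrets create my-api-key --replication-policy=automatic
```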
2) Set a value
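A sketch; the value is piped in via stdin (the secret name and value are placeholders):

```bash
# Add a new version containing the secret value.
echo -n "super-secret-value" | gcloud secrets versions add my-api-key --data-file=-
```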
3) Retrieve a secret from Airflow
In Airflow DAGs, secrets can be accessed using the Google Secret Manager hook:
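A sketch, assuming the SecretsManagerHook from the apache-airflow-providers-google package and a default GCP connection with Secret Manager access (the secret name is a placeholder):

```python
from airflow.providers.google.cloud.hooks.secret_manager import SecretsManagerHook

def get_api_key() -> str:
    # Fetch the latest version of the secret via the provider hook.
    # The secret name is a placeholder.
    hook = SecretsManagerHook()
    return hook.get_secret(secret_id="my-api-key", secret_version="latest")
```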
4) Retrieve a secret from DataProc
Alternatively, secrets can be accessed directly from within the DataProc environment using the Google Cloud Secret Manager Python client.
Ensure that the Secret Manager client library (google-cloud-secret-manager) is included in the project requirements and installed during project initialization. Then, in your PySpark job, retrieve the secret:
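A sketch using the google-cloud-secret-manager client (the project and secret names are placeholders):

```python
from google.cloud import secretmanager

def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    # Build the full resource name and fetch the secret payload.
    # Project and secret names are placeholders.
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

api_key = get_secret("my-project", "my-api-key")
```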
This method has the advantage of pulling secrets only when they are needed at runtime rather than at DAG execution time, which limits their scope to a single function or class rather than the entire execution environment.
Conclusion
When configuring runtime variables in DataProc and Airflow, the best method depends on security policies and operational requirements:
| Method | Use case |
|---|---|
| OS-level startup script | Best for cluster-wide persistent variables |
| Compute Engine metadata | Works at the instance level; useful for per-node settings |
| Spark properties | Best for job-specific runtime variables |
| Command-line arguments | A fallback option when security policies restrict variable injection |
| Secrets manager | The most secure option for sensitive values |
For sensitive data such as API keys, database credentials, and encryption secrets, Google Cloud Secret Manager should always be used.
This structured approach ensures secure and flexible runtime configuration across different environments.