This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:
- What is DataProc? And why do we need it?
- Environments, what’s shared and what’s local
- Environment variables
- Cluster configuration
- Cluster startup optimization
- Runtime variables
On this page
Passing environment variables in DataProc and Composer
Executive summary
A common design pattern for orchestrating AI and ML pipelines leverages Google Cloud DataProc to execute Spark-based workflows, offering scalable compute resources for distributed data processing.
According to Google’s marketing materials, DataProc offers:
Serverless deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure.
As data practitioners, we may not always have access to all cloud resources, including direct access to DataProc logs, cluster monitoring tools, or SSH access to cluster instances. Security policies or organizational controls often restrict these capabilities, requiring us to rely on orchestration tools like Airflow and alternative logging or monitoring approaches.
Google Cloud clients commonly use Google Cloud Composer, a managed Apache Airflow service, to orchestrate and schedule these Spark jobs. Each Composer environment is linked to a dedicated Google Cloud Storage (GCS) bucket, which is also mounted as a local filesystem on the DataProc cluster when it starts.
Standard workflow
The typical workflow follows these steps:
- Initialize a Directed Acyclic Graph (DAG) in Airflow.
- The initial node provisions a new Spark cluster.
- Subsequent nodes submit PySpark jobs to the cluster.
- The final node tears down the cluster.
The Airflow DataProc operators make it possible to orchestrate this entire workflow using Python code that runs within the Composer environment.
Here’s an example DAG that reflects this pattern:
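What follows is a minimal sketch of such a DAG rather than a drop-in implementation: the project, region, bucket, cluster name, and job path are placeholders, and it assumes a recent Airflow 2 image with the apache-airflow-providers-google package installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

# Placeholder values -- replace with the project's own settings.
PROJECT_ID = "my-project"
REGION = "europe-west1"
CLUSTER_NAME = "etl-cluster"  # lowercase letters, numbers, and hyphens only
BUCKET = "my-composer-bucket"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/jobs/etl_job.py"},
}

with DAG(
    dag_id="dataproc_pipeline",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Provision a new Spark cluster.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    # Submit a PySpark job to the running cluster.
    submit_job = DataprocSubmitJobOperator(
        task_id="submit_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    # Tear the cluster down, even if the job fails.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> submit_job >> delete_cluster
```

Setting trigger_rule=TriggerRule.ALL_DONE on the delete step ensures the cluster is torn down even when the Spark job fails, so failed runs don’t leave idle compute running.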
DataProc cluster names can only contain lowercase letters, numbers, and hyphens. Avoid underscores (_) or uppercase characters in cluster names, as they can cause silent failures in Google Cloud operations.
The DAG .py file is uploaded to the Composer dags directory in the associated GCS bucket (gs://<BUCKET>/dags). Airflow polls this directory every minute, and new files automatically appear in the UI. When updating a DAG, it’s good practice to confirm that the new code is visible under the Code tab before triggering a run.
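For manual uploads, a single copy command into that directory is enough; a sketch, assuming gsutil is authenticated against the project and the local path dags/my_dag.py is a placeholder:

```bash
# Copy (or overwrite) the DAG file in the Composer dags directory.
gsutil cp dags/my_dag.py gs://<BUCKET>/dags/
```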
Managing Composer variables
To enhance security and maintain consistency across environments, many variables are injected into the DAG runtime environment directly from the Composer instance. This allows the DAG to dynamically retrieve these values using either Python’s built-in os.getenv() or Airflow’s airflow.models.Variable class.
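As a sketch of what that looks like inside a DAG file (the environment-variable and variable names here are hypothetical, not the project’s actual keys):

```python
import os

from airflow.models import Variable

# OS-level environment variable injected into the Composer environment.
project_id = os.getenv("GCP_PROJECT", "my-default-project")

# Airflow variable managed under Admin > Variables.
dataproc_bucket = Variable.get("dataproc_bucket")

# JSON-valued variables can be deserialized in one call.
cluster_settings = Variable.get(
    "cluster_settings", deserialize_json=True, default_var={}
)
```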
To support reproducibility and deployment automation, projects may include a variables.json file in their source repository. This file can be deployed to Composer through several methods.
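A sketch of what such a file might contain, with hypothetical keys and placeholder values:

```json
{
  "gcp_project": "my-project",
  "gcp_region": "europe-west1",
  "dataproc_bucket": "my-composer-bucket",
  "cluster_settings": {
    "num_workers": 2,
    "machine_type": "n1-standard-4"
  }
}
```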
Deploying variables
Method 1: Deploy to the GCS bucket
Airflow supports automatically importing variables from a variables.json file placed in gs://<BUCKET>/data/. Adding a command to the deployment pipeline, such as .gitlab-ci.yml or .github/workflows/release.yml, ensures the file is copied during deployment:
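A sketch of such a step for GitLab CI (the job name, stage, and bucket are placeholders; a GitHub Actions workflow would run the same copy command):

```yaml
deploy-variables:
  stage: deploy
  script:
    # Copy variables.json into the Composer environment's data directory.
    - gsutil cp variables.json gs://<BUCKET>/data/variables.json
```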
At present, automatic parsing of this file is restricted, though this may change in higher environments with appropriate approval.
Method 2: Import via command line
The variables can also be imported directly using the gcloud CLI:
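A sketch of that command, assuming the file has already been copied into the environment’s data directory; the environment name and location below are placeholders:

```bash
# Run the Airflow CLI inside the Composer environment to import the variables.
gcloud composer environments run my-composer-env \
    --location europe-west1 \
    variables import -- /home/airflow/gcs/data/variables.json
```

The gs://<BUCKET>/data/ directory is mounted at /home/airflow/gcs/data/ inside the environment, which is why the file is addressed by that local path.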
This approach would require the CI/CD runner to have permissions to authenticate, upload, and apply the variables. Currently, these permissions are not granted to the service account.
Method 3: Manual import via web interface
Variables can also be managed directly through the Airflow UI.
- Navigate to Admin > Variables.
- Manually enter key-value pairs or upload the variables.json file directly.

The uploaded file’s variables will appear in the table.

Preferred approach
Due to the infrequent nature of variable changes and the current lack of automation permissions, manually uploading variables through the Airflow UI is the preferred method. This also improves operational security, as no sensitive bucket names or regions need to be stored in the source repository.
This approach balances flexibility, security, and operational control while ensuring variables are correctly injected into the Composer environment at runtime.