This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:
- What is DataProc? And why do we need it?
- Environments, what’s shared and what’s local
- Environment variables
- Cluster configuration
- Cluster startup optimization
- Runtime variables
Understanding environments: shared vs local
Conceptual overview of Airflow and Spark environments
Google Cloud Composer, a managed Apache Airflow service, orchestrates workflows using Directed Acyclic Graphs (DAGs). A DAG defines a sequence of tasks and their dependencies, so workflows run reliably and in a predictable order.
A typical Composer DAG follows three primary steps:
- Provisioning a DataProc cluster
- Executing the workload
- Tearing down the cluster
This structure keeps compute resources (and therefore cost) tied to the lifetime of the workload: the cluster exists only while there is work to run.
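As a rough sketch, these three steps map onto three operators from the Airflow Google provider. The project ID, region, bucket, and cluster name below are placeholders, and the exact DAG arguments depend on your Airflow version, so treat this as an illustration of the pattern rather than a drop-in DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

with DAG(
    dag_id="dataproc_pyspark_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Step 1: provision an ephemeral DataProc cluster
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",          # placeholder project
        region="us-central1",             # placeholder region
        cluster_name="ephemeral-cluster", # placeholder cluster name
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    # Step 2: execute the workload as a PySpark job
    submit_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id="my-project",
        region="us-central1",
        job={
            "placement": {"cluster_name": "ephemeral-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},
        },
    )

    # Step 3: tear the cluster down
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id="my-project",
        region="us-central1",
        cluster_name="ephemeral-cluster",
        trigger_rule="all_done",  # tear down even if the job fails
    )

    create_cluster >> submit_job >> delete_cluster
```

Setting `trigger_rule="all_done"` on the delete task is a common way to make sure the cluster is torn down even when the job itself fails, so you are not left paying for an idle cluster.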

How Airflow and DataProc work together
A DataProc cluster is a Google Cloud-managed service that provisions Compute Engine instances and configures them as an Apache Spark cluster for processing large-scale data. When Airflow provisions a cluster, it effectively creates a secondary execution environment inside a dedicated Virtual Private Cloud (VPC), separate from the Composer environment that runs the DAG itself.

Configuring the DataProc environment
DataProc clusters can be provisioned using:
- Google Cloud Console
- Command-line tools
- Airflow’s DataprocCreateClusterOperator
At cluster creation, several background operations are automatically executed, including:
- Provisioning head node(s)
- Installing the Hadoop Distributed File System (HDFS)
- Configuring YARN for resource scheduling
- Establishing internal networking
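Many of these background concerns surface as fields in the cluster configuration. As a sketch (the project, subnetwork, bucket, and property values are placeholders), a `cluster_config` passed to DataprocCreateClusterOperator might look like this:

```python
# Hypothetical cluster_config for DataprocCreateClusterOperator.
cluster_config = {
    # Networking: which VPC subnetwork the cluster's VMs join
    "gce_cluster_config": {
        "subnetwork_uri": "projects/my-project/regions/us-central1/subnetworks/my-subnet",
        "internal_ip_only": True,
    },
    # Head node(s) and worker nodes
    "master_config": {
        "num_instances": 1,
        "machine_type_uri": "n1-standard-4",
        "disk_config": {"boot_disk_size_gb": 100},
    },
    "worker_config": {
        "num_instances": 2,
        "machine_type_uri": "n1-standard-4",
        "disk_config": {"boot_disk_size_gb": 100},
    },
    # Image version plus HDFS/YARN/Spark properties applied at creation time
    "software_config": {
        "image_version": "2.1-debian11",
        "properties": {
            "yarn:yarn.nodemanager.resource.memory-mb": "12288",
            "spark:spark.executor.memory": "4g",
        },
    },
    # Scripts each node runs after it boots
    "initialization_actions": [
        {"executable_file": "gs://my-bucket/scripts/install_deps.sh"},
    ],
}
```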

Provisioning worker nodes
Once the Spark cluster is initialized, worker nodes are dynamically provisioned based on the specified configuration. Each worker node:
- Pulls the base DataProc image
- Starts the Spark service
- Runs initialization scripts
This process repeats for each requested worker node.

Worker nodes undergo health checks before registering with head nodes to receive jobs.
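These health checks happen inside DataProc, but their aggregate result is visible from outside as the cluster's state. A minimal sketch using the google-cloud-dataproc client (project, region, and cluster name are placeholders):

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder region

# A regional endpoint is required for DataProc cluster operations.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = client.get_cluster(
    project_id="my-project",           # placeholder project
    region=region,
    cluster_name="ephemeral-cluster",  # placeholder cluster name
)

# RUNNING indicates the cluster, including its workers, is ready to accept jobs.
print(cluster.status.state)
```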
Executing PySpark jobs
Within the cluster, Python code runs as PySpark jobs on the worker nodes. Each worker node therefore needs a Python environment that is compatible with the submitted scripts (interpreter version and dependencies). Once all jobs have completed, Airflow moves on to the teardown step and deletes the cluster, deallocating its resources.
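The job itself is ordinary PySpark code. A minimal sketch of the kind of script referenced by `main_python_file_uri` (the bucket paths and column name are placeholders):

```python
from pyspark.sql import SparkSession

# Minimal PySpark job of the kind submitted to the cluster by Airflow.
spark = SparkSession.builder.appName("example-etl").getOrCreate()

# Hypothetical input and output paths in Cloud Storage.
events = spark.read.json("gs://my-bucket/raw/events/")
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("gs://my-bucket/curated/event_counts/")

spark.stop()
```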

Configuration scope and security considerations
When tuning an Airflow-Spark system, it’s essential to recognize that settings and parameters exist at multiple levels. Each layer—including Airflow, DataProc, Spark, and individual worker nodes—can influence execution performance. Additionally, security policies can be enforced at various levels, affecting deployment strategies and access control.
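For example, the same Spark setting can be pinned at the cluster level or overridden per job. A sketch of both levels (paths and values are placeholders):

```python
# Cluster-level default, applied to every job on the cluster
# (set inside cluster_config["software_config"] at creation time):
software_config = {
    "properties": {
        "spark:spark.executor.memory": "4g",
    }
}

# Job-level override, applied only to this submission
# (set inside the job dict passed to DataprocSubmitJobOperator):
pyspark_job = {
    "main_python_file_uri": "gs://my-bucket/jobs/etl.py",
    "properties": {
        "spark.executor.memory": "8g",  # overrides the cluster default for this job
    },
}
```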

Conclusion
Understanding shared and local environments within Airflow and DataProc is essential for designing scalable, cost-effective workflows. By correctly configuring clusters, defining resource scopes, and implementing security policies, teams can optimize performance while maintaining a robust orchestration framework.