DataProc - Understanding environments: shared vs local

An explainer on how Google Cloud Composer and DataProc environments interact with each other and share resources.

This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:

  1. What is DataProc? And why do we need it?
  2. Environments, what’s shared and what’s local
  3. Environment variables
  4. Cluster configuration
  5. Cluster startup optimization
  6. Runtime variables

Understanding environments: shared vs local

Conceptual overview of Airflow and Spark environments

Google Cloud Composer, a managed Apache Airflow service, orchestrates workflows using Directed Acyclic Graphs (DAGs). A DAG defines a sequence of tasks and their dependencies, ensuring reliable workflow execution.

A typical Composer DAG follows three primary steps:

  1. Provisioning a DataProc cluster
  2. Executing the workload
  3. Tearing down the cluster

This structure keeps resource utilization efficient and costs under control: the cluster only exists, and only incurs cost, while the workload runs. A minimal DAG sketch of the pattern is shown below.

Typical Airflow training job
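
To make the three steps concrete, here is a minimal, hypothetical sketch of such a DAG, assuming Airflow 2.x with the apache-airflow-providers-google package installed. Project, region, cluster, and bucket values are placeholders.

```python
# Minimal sketch of the create -> run -> teardown pattern. Assumes Airflow 2.x
# with apache-airflow-providers-google installed; all project, region, cluster,
# and bucket values below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-project"
REGION = "europe-west1"
CLUSTER_NAME = "training-cluster"

with DAG(
    dag_id="dataproc_training_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # 1. Provision the DataProc cluster
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    # 2. Execute the workload as a PySpark job
    run_job = DataprocSubmitJobOperator(
        task_id="run_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/train.py"},
        },
    )

    # 3. Tear the cluster down, even if the job failed
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",
    )

    create_cluster >> run_job >> delete_cluster
```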

How Airflow and DataProc work together

A DataProc cluster is a Google Cloud-managed service that turns Compute Engine instances into an Apache Spark cluster for processing large-scale data. When Airflow provisions a cluster, it effectively creates a secondary execution environment within a dedicated Virtual Private Cloud (VPC).

Relationship between Airflow and DataProc environments

Configuring the DataProc environment

DataProc clusters can be provisioned using:

  • Google Cloud Console
  • Command-line tools
  • Airflow’s Dataproc operators (e.g. DataprocCreateClusterOperator, as in the sketch above)

At cluster creation, several background operations are automatically executed, including:

  • Provisioning head node(s)
  • Installing the Hadoop Distributed File System (HDFS)
  • Configuring YARN for resource scheduling
  • Establishing internal networking

Provisioning worker nodes

Once the Spark cluster is initialized, worker nodes are dynamically provisioned based on the specified configuration. Each worker node:

  1. Pulls the base DataProc image
  2. Starts the Spark service
  3. Runs initialization scripts

This process repeats for each requested worker node; the sketch below shows how initialization scripts are attached to the cluster definition.
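
As a rough illustration, initialization scripts are declared in the initialization_actions field of the cluster_config passed to the create-cluster step (the same structure used in the DAG sketch above). The bucket path and timeout here are placeholders.

```python
# Hypothetical example: attaching an initialization script to the cluster
# definition. Each node runs the script after pulling the base DataProc image
# and starting its services; the GCS path is a placeholder.
cluster_config = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    "initialization_actions": [
        {
            "executable_file": "gs://my-bucket/scripts/install_python_deps.sh",
            "execution_timeout": {"seconds": 600},  # fail the node if setup hangs
        }
    ],
}
```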

Worker nodes undergo health checks before registering with head nodes to receive jobs.

Executing PySpark Jobs

Within the cluster, Python code runs as PySpark jobs on the worker nodes. Each worker node therefore needs a Python environment that is compatible with the submitted scripts. Once all jobs have finished, Airflow moves to the teardown phase and deallocates the cluster's resources.
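
For context, a PySpark job of the kind referenced above is just a plain Python script that builds a SparkSession. A minimal, hypothetical example (all paths and column names are placeholders, matching the train.py URI used in the DAG sketch):

```python
# Minimal, hypothetical PySpark script of the kind submitted by the DAG above.
# Input/output paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

# Read input data from GCS, aggregate, and write the result back out.
events = spark.read.parquet("gs://my-bucket/data/events/")
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("gs://my-bucket/output/daily_counts/")

spark.stop()
```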

Configuration scope and security considerations

When tuning an Airflow-Spark system, it’s essential to recognize that settings and parameters exist at multiple levels. Each layer—including Airflow, DataProc, Spark, and individual worker nodes—can influence execution performance. Additionally, security policies can be enforced at various levels, affecting deployment strategies and access control.
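
As an illustration of these layers, the same Spark setting can be pinned as a cluster-wide default or overridden for a single job, using the same API structures as the sketches above; the values are placeholders.

```python
# Hypothetical illustration of configuration scope: spark.executor.memory set
# as a cluster-wide default, then overridden for one job submission.
cluster_config = {
    "software_config": {
        # Cluster-level default; note the "spark:" prefix used by DataProc.
        "properties": {"spark:spark.executor.memory": "4g"},
    },
}

pyspark_job = {
    "placement": {"cluster_name": "training-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/train.py",
        # Job-level property; applies only to this submission.
        "properties": {"spark.executor.memory": "8g"},
    },
}
```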

Conclusion

Understanding shared and local environments within Airflow and DataProc is essential for designing scalable, cost-effective workflows. By correctly configuring clusters, defining resource scopes, and implementing security policies, teams can optimize performance while maintaining a robust orchestration framework.
