DataProc - Understanding environments: shared vs local

An explainer on how Google Cloud Composer and DataProc environments interact with each other and share resources.

This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or you can jump to a particular section:

  1. What is DataProc? And why do we need it?
  2. Environments, what’s shared and what’s local
  3. Environment variables
  4. Cluster configuration
  5. Cluster startup optimization
  6. Runtime variables

Understanding environments: shared vs local

Conceptual overview of Airflow and Spark environments

Google Cloud Composer, a managed Apache Airflow service, orchestrates workflows using Directed Acyclic Graphs (DAGs). A DAG defines a sequence of tasks and their dependencies, ensuring reliable workflow execution.

A typical Composer DAG follows three primary steps:

  1. Provisioning a DataProc cluster
  2. Executing the workload
  3. Tearing down the cluster

This structure keeps resource utilization efficient and costs under control: the cluster only exists, and only incurs cost, while the workload runs. A minimal DAG sketch of the pattern is shown below.

Typical Airflow training job
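
To make the three steps concrete, here is a minimal, hypothetical sketch of such a DAG, assuming Airflow 2.x with the apache-airflow-providers-google package installed. Project, region, cluster, and bucket values are placeholders.

```python
# Minimal sketch of the create -> run -> teardown pattern. Assumes Airflow 2.x
# with apache-airflow-providers-google installed; all project, region, cluster,
# and bucket values below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-project"
REGION = "europe-west1"
CLUSTER_NAME = "training-cluster"

with DAG(
    dag_id="dataproc_training_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # 1. Provision the DataProc cluster
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    # 2. Execute the workload as a PySpark job
    run_job = DataprocSubmitJobOperator(
        task_id="run_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/train.py"},
        },
    )

    # 3. Tear the cluster down, even if the job failed
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",
    )

    create_cluster >> run_job >> delete_cluster
```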

How Airflow and DataProc work together

A DataProc cluster is a Google Cloud-managed service that turns Compute Engine instances into an Apache Spark cluster for processing large-scale data. When Airflow provisions a cluster, it effectively creates a secondary execution environment within a dedicated Virtual Private Cloud (VPC).

Relationship between Airflow and DataProc environments

Configuring the DataProc environment

DataProc clusters can be provisioned using:

  • Google Cloud Console
  • Command-line tools
  • Airflow’s Dataproc operators (e.g. DataprocCreateClusterOperator, as in the sketch above)

At cluster creation, several background operations are automatically executed, including:

  • Provisioning head node(s)
  • Installing the Hadoop Distributed File System (HDFS)
  • Configuring YARN for resource scheduling
  • Establishing internal networking

Provisioning worker nodes

Once the Spark cluster is initialized, worker nodes are dynamically provisioned based on the specified configuration. Each worker node:

  1. Pulls the base DataProc image
  2. Starts the Spark service
  3. Runs initialization scripts

This process repeats for each requested worker node; the sketch below shows how initialization scripts are attached to the cluster definition.
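
As a rough illustration, initialization scripts are declared in the initialization_actions field of the cluster_config passed to the create-cluster step (the same structure used in the DAG sketch above). The bucket path and timeout here are placeholders.

```python
# Hypothetical example: attaching an initialization script to the cluster
# definition. Each node runs the script after pulling the base DataProc image
# and starting its services; the GCS path is a placeholder.
cluster_config = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    "initialization_actions": [
        {
            "executable_file": "gs://my-bucket/scripts/install_python_deps.sh",
            "execution_timeout": {"seconds": 600},  # fail the node if setup hangs
        }
    ],
}
```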

Worker nodes undergo health checks before registering with head nodes to receive jobs.

Executing PySpark Jobs

Within the cluster, Python code runs as PySpark jobs on the worker nodes. Each worker node therefore needs a Python environment that is compatible with the submitted scripts. Once all jobs have finished, Airflow moves to the teardown phase and deallocates the cluster's resources.
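
For context, a PySpark job of the kind referenced above is just a plain Python script that builds a SparkSession. A minimal, hypothetical example (all paths and column names are placeholders, matching the train.py URI used in the DAG sketch):

```python
# Minimal, hypothetical PySpark script of the kind submitted by the DAG above.
# Input/output paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

# Read input data from GCS, aggregate, and write the result back out.
events = spark.read.parquet("gs://my-bucket/data/events/")
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("gs://my-bucket/output/daily_counts/")

spark.stop()
```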

Configuration scope and security considerations

When tuning an Airflow-Spark system, it’s essential to recognize that settings and parameters exist at multiple levels. Each layer—including Airflow, DataProc, Spark, and individual worker nodes—can influence execution performance. Additionally, security policies can be enforced at various levels, affecting deployment strategies and access control.
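
As an illustration of these layers, the same Spark setting can be pinned as a cluster-wide default or overridden for a single job, using the same API structures as the sketches above; the values are placeholders.

```python
# Hypothetical illustration of configuration scope: spark.executor.memory set
# as a cluster-wide default, then overridden for one job submission.
cluster_config = {
    "software_config": {
        # Cluster-level default; note the "spark:" prefix used by DataProc.
        "properties": {"spark:spark.executor.memory": "4g"},
    },
}

pyspark_job = {
    "placement": {"cluster_name": "training-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/train.py",
        # Job-level property; applies only to this submission.
        "properties": {"spark.executor.memory": "8g"},
    },
}
```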

Conclusion

Understanding shared and local environments within Airflow and DataProc is essential for designing scalable, cost-effective workflows. By correctly configuring clusters, defining resource scopes, and implementing security policies, teams can optimize performance while maintaining a robust orchestration framework.
