DataProc - Everything You Need to Know

A beginner's guide to DataProc on Google Cloud and calling it from Composer

This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this one monster piece, or jump straight to a particular section:

  1. What is DataProc? And why do we need it?
  2. Environments, what’s shared and what’s local
  3. Environment variables
  4. Cluster configuration
  5. Cluster startup optimization
  6. Runtime variables

Everything you need to know about DataProc

Understanding Airflow and Spark in Depth

Google Cloud Composer is a managed version of Apache Airflow, an orchestration tool for defining, scheduling, and monitoring workflows. Workflows in Airflow are structured as Directed Acyclic Graphs (DAGs), which define a series of tasks and their dependencies.
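
To make this concrete, here is a minimal sketch of a DAG with two tasks and one dependency. The task names, commands, and schedule are illustrative placeholders, not part of any real pipeline:

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id='minimal_example',
    start_date=days_ago(1),
    schedule_interval='@daily',
) as dag:
    # Two trivial tasks standing in for real work.
    extract = BashOperator(task_id='extract', bash_command='echo extract')
    load = BashOperator(task_id='load', bash_command='echo load')

    # The >> operator declares the dependency: 'load' runs only after 'extract' succeeds.
    extract >> load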

Structure of an Airflow DAG

A typical Airflow DAG for DataProc involves three key steps:

  1. Provisioning a DataProc cluster to allocate compute resources.
  2. Executing the workload, typically a Spark job.
  3. Decommissioning the cluster to optimize resource usage and cost.

This approach ensures efficient resource utilization, as clusters exist only for the duration of processing. Managing cost and resource efficiency is critical in large-scale data workflows.

Here’s an overview of how such a workflow is structured:

[Figure: a typical Airflow training job]

Airflow provides a mechanism to define tasks and their relationships programmatically, ensuring that dependencies are correctly handled. The DataProc cluster itself is a managed Apache Spark environment on Google Cloud, similar to Amazon EMR, designed for scalable, distributed data processing.

What is Apache Spark?

Apache Spark is an open-source, distributed computing system optimized for large-scale data processing. It provides APIs in Scala, Java, Python, and R and operates efficiently using an in-memory computation model.
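
To illustrate the DataFrame API and lazy evaluation described below, here is a small PySpark sketch; the bucket paths and column names are made-up placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('example').getOrCreate()

# Transformations are lazy: nothing is read or computed at this point.
events = spark.read.json('gs://my-bucket/raw/events/')
daily_counts = (
    events
    .filter(F.col('status') == 'OK')
    .groupBy('event_date')
    .count()
)

# Only an action (write, show, count, ...) triggers the optimized execution plan.
daily_counts.write.mode('overwrite').parquet('gs://my-bucket/curated/daily_counts/')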

Key Advantages of Spark

  • In-Memory Processing: Unlike traditional disk-based processing systems like Hadoop MapReduce, Spark keeps intermediate data in memory, dramatically improving performance.
  • Lazy Evaluation: Spark constructs an execution plan before running computations, minimizing redundant operations.
  • Fault Tolerance: Spark’s Resilient Distributed Dataset (RDD) abstraction ensures data recovery in case of node failures.
  • DataFrame and SQL API: Higher-level abstractions simplify working with structured data, integrating SQL queries directly into Spark workflows.
  • Scalability: Spark can scale across thousands of nodes, making it suitable for enterprise-grade workloads.

Integrating Airflow and DataProc for scalable workflows

Consider a scenario where a machine learning model requires training on a large dataset. The workflow follows these steps:

  1. Cluster Provisioning: A DataProc cluster is created with an appropriate configuration.
  2. Job Execution: A Spark job processes the dataset.
  3. Storage and Cleanup: Processed results are stored in Google Cloud Storage (GCS), and the cluster is shut down.
  4. Monitoring: Airflow tracks job status and handles failures (for example, via retries).
  5. Automated Scheduling: Recurring workflows can be scheduled for periodic execution.

Example Airflow DAG for DataProc

Below is an example Airflow DAG that automates this workflow. The project ID, cluster configuration, and job definition are illustrative placeholders; substitute your own values:

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.dates import days_ago
from airflow.utils.trigger_rule import TriggerRule

# Placeholder values -- replace with your own project, region, cluster name and bucket.
PROJECT_ID = 'my-project'
REGION = 'us-central1'
CLUSTER_NAME = 'example-cluster'

# A minimal cluster: one master and two workers.
CLUSTER_CONFIG = {
    'master_config': {'num_instances': 1, 'machine_type_uri': 'n1-standard-4'},
    'worker_config': {'num_instances': 2, 'machine_type_uri': 'n1-standard-4'},
}

# A minimal PySpark job definition; the GCS path is a placeholder.
PYSPARK_JOB = {
    'reference': {'project_id': PROJECT_ID},
    'placement': {'cluster_name': CLUSTER_NAME},
    'pyspark_job': {'main_python_file_uri': 'gs://my-bucket/jobs/process_data.py'},
}

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'retries': 1,
}

dag = DAG(
    'dataproc_workflow',
    default_args=default_args,
    description='Example DataProc workflow',
    schedule_interval=None,
)

create_cluster = DataprocCreateClusterOperator(
    task_id='create_cluster',
    project_id=PROJECT_ID,
    cluster_name=CLUSTER_NAME,
    cluster_config=CLUSTER_CONFIG,
    region=REGION,
    dag=dag,
)

submit_job = DataprocSubmitJobOperator(
    task_id='submit_spark_job',
    job=PYSPARK_JOB,
    region=REGION,
    project_id=PROJECT_ID,
    dag=dag,
)

delete_cluster = DataprocDeleteClusterOperator(
    task_id='delete_cluster',
    project_id=PROJECT_ID,
    cluster_name=CLUSTER_NAME,
    region=REGION,
    trigger_rule=TriggerRule.ALL_DONE,  # tear the cluster down even if the job fails
    dag=dag,
)

create_cluster >> submit_job >> delete_cluster
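
One design note: the delete task uses trigger_rule=ALL_DONE so the cluster is torn down even when the Spark job fails; without it, a failed job would leave the cluster running and accruing cost. Again, the project ID, bucket path, and machine types above are placeholders to replace with your own values.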

Optimization strategies for cost and performance

Efficiently managing DataProc clusters and Airflow workflows can significantly impact cost and performance. Consider the following (a configuration sketch follows the list):

  • Auto-scaling: Dynamically allocate resources based on workload demand.
  • Pre-emptible Instances: Use cost-effective, short-lived VMs to lower computational costs.
  • Persistent vs. Ephemeral Clusters: Persistent clusters reduce spin-up time but may be more expensive if underutilized.
  • Storage Optimization: Compress and clean up unused data to minimize storage costs.
  • Spot VMs: On Google Cloud, Spot VMs are the successor to preemptible VMs and offer excess compute capacity at a steep discount.
  • Resource Monitoring: Track CPU and memory usage to identify bottlenecks and improve efficiency.
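
As promised above, here is a hedged sketch of a cluster_config that combines an autoscaling policy with preemptible secondary workers; the project, region, policy name, and machine types are placeholders:

# Sketch only: policy URI, machine types, and instance counts are assumptions.
AUTOSCALING_CLUSTER_CONFIG = {
    'master_config': {'num_instances': 1, 'machine_type_uri': 'n1-standard-4'},
    'worker_config': {'num_instances': 2, 'machine_type_uri': 'n1-standard-4'},
    # Secondary workers are preemptible by default and are the first to scale in and out.
    'secondary_worker_config': {'num_instances': 2},
    'autoscaling_config': {
        'policy_uri': 'projects/my-project/regions/us-central1/autoscalingPolicies/my-policy',
    },
}

A dictionary like this can be passed as the cluster_config argument of DataprocCreateClusterOperator in the DAG above.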

Conclusion

By leveraging Google Cloud Composer (Airflow) and DataProc (Spark), organizations can automate complex data workflows efficiently. The combination of Airflow for orchestration and Spark for distributed computing provides a scalable, cost-effective solution for big data processing.

With optimized configurations and efficient resource allocation, these tools enable businesses to process massive datasets while minimizing costs. Whether for log processing, machine learning training, or ETL pipelines, this setup offers high performance, automation, and scalability.

What distinguishes you from other developers?

I've built data pipelines across three continents at petabyte scale for over 15 years. But the data doesn't matter if we don't solve the human problems first - an AI solution that nobody uses is worthless.

Are the robots going to kill us all?

Not any time soon. At least not in the way you've imagined thanks to the Terminator movies. Sure, somebody with a DARPA grant is always going to strap a knife/gun/flamethrower on the side of a robot - but just like in Dr. Who - right now, that robot will struggle to even get out of the room, let alone up some stairs.

But AI is going to steal my job, right?

A year ago, the whole world was convinced that AI was going to steal their job. Now, the reality is that most people are thinking 'I wish this POC at work would go a bit faster to scan these PDFs'.

When am I going to get my self-driving car?

Humans are complicated. If we invented driving today - there's NO WAY IN HELL we'd let humans do it. They get distracted. They text their friends. They drink. They make mistakes. But the reality is, all of our streets, cities (and even legal systems) have been built around these limitations. It would be surprisingly easy to build self-driving cars if there were no humans on the road. But today no one wants to take liability. If a self-driving company kills someone, who's responsible? The manufacturer? The insurance company? The software developer?