Google Cloud Dataproc can deliver 18% to 60% cost savings compared to other cloud-based Hadoop and Spark alternatives. Get the ESG report.

Dataproc

A managed platform for Spark, Hadoop, and open source analytics

Run fully managed Apache Spark, Hadoop, and 30+ open source framework clusters with ease and control. Accelerate Spark on Compute Engine with Lightning Engine and integrate with Google Cloud's open lakehouse.

Apache Spark is a trademark of the Apache Software Foundation.

Features

Robust Hadoop ecosystem support

Beyond Spark, Dataproc provides fully managed services for the complete Apache Hadoop stack (MapReduce, HDFS, YARN), plus Flink, Trino, Hive, and over 30 other open source tools. To support these, Dataproc integrates with Dataproc Metastore, a fully managed Hive Metastore service, simplifying metadata management for your traditional data lake components. Modernize traditional data lake workloads or build new applications with your preferred engines.

Managed Spark with Lightning Engine

Run demanding Spark workloads with the control of a managed Dataproc cluster, now accelerated to 3.6x* query speed by Lightning Engine** (in Preview). Lightning Engine delivers significant performance gains for Spark SQL and DataFrame operations. Configure Spark environments precisely to your needs by choosing versions and libraries.

*The queries are derived from the TPC-DS standard and TPC-H standard and as such are not comparable to published TPC-DS standard and TPC-H standard results, as these runs do not comply with all requirements of the TPC-DS standard and TPC-H standard specification.

**Available for Dataproc on Compute Engine premium tier.

Flexible cluster configuration and management

Customize Dataproc clusters with a wide range of machine types (including GPUs), preemptible VMs, disk options, autoscaling policies, initialization actions, container/images, and optional components. Use features like Workflow Templates for orchestrating complex jobs and manage clusters via the console, gcloud, API, or client libraries. Gain deep visibility into cluster performance and health through integration with Cloud Monitoring, providing comprehensive metrics, dashboards, and alerting capabilities.
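As a minimal sketch of the client-library route, the following Python builds a cluster definition and, in a separate function, submits it with the `google-cloud-dataproc` package. The project ID, cluster name, node counts, and machine type are placeholder assumptions rather than values recommended by this page, and creating a real cluster requires credentials and incurs charges.

```python
def build_cluster_spec(project_id, cluster_name, num_workers=5,
                       machine_type="n1-standard-4"):
    """Return a cluster definition as a plain dict (all values are illustrative)."""
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # One main node plus a configurable number of workers.
            "master_config": {"num_instances": 1, "machine_type_uri": machine_type},
            "worker_config": {"num_instances": num_workers, "machine_type_uri": machine_type},
        },
    }


def create_cluster(project_id, region, cluster_name):
    """Create the cluster. Needs `pip install google-cloud-dataproc` and credentials."""
    # Imported lazily so build_cluster_spec works without the package installed.
    from google.cloud import dataproc_v1

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.create_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster": build_cluster_spec(project_id, cluster_name),
        }
    )
    return operation.result()  # blocks until the long-running operation finishes


if __name__ == "__main__":
    # Hypothetical names; only the spec is built here, nothing is created.
    spec = build_cluster_spec("my-project", "example-cluster")
    print(spec["config"]["worker_config"])
```

Keeping the definition in a plain function makes it easy to review or unit-test a cluster shape before anything is provisioned. Options such as autoscaling policies, initialization actions, and optional components would be added to the same `config` dictionary.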

Open lakehouse connectivity

Dataproc clusters integrate natively with BigLake Metastore, allowing you to process data stored in open formats like Apache Iceberg on Cloud Storage. For traditional Hive-based metadata needs, there’s seamless integration with the managed Dataproc Metastore service. Leverage Dataplex Universal Catalog for unified discovery, lineage, and governance across your lakehouse assets. Extend your data applications by connecting Dataproc with BigQuery, Vertex AI, Spanner, Pub/Sub, and Data Fusion, creating powerful, end-to-end solutions.

Secure your open source data processing

Benefit from Google Cloud's robust security. Configure Kerberos, manage access with IAM, enforce network policies with VPC Service Controls, and use CMEK. Integrate with Dataplex Universal Catalog for centralized policy management and enable fine-grained access control with BigLake.

Empower data engineers and data scientists

Use familiar tools such as Jupyter and VS Code, running on your laptop, to connect to Dataproc clusters. Integrate Dataproc with Vertex AI Workbench for interactive Spark development on clusters and build end-to-end AI/ML pipelines with Vertex AI.

How It Works

Simplified cluster operations for powerful analytics

Common Uses

Data lake modernization and Hadoop migration

Modernize your data lake

Migrate on-premises Hadoop and Spark workloads to the cloud with ease. Use Dataproc to run MapReduce, Hive, Pig, and Spark jobs on data in Cloud Storage, integrated with Dataproc Metastore and governed by Dataplex Universal Catalog.
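A migrated Spark job that reads from Cloud Storage might be submitted along these lines. This is a sketch under stated assumptions: the cluster name and the `gs://` paths are hypothetical, the helper names are invented for illustration, and the submit step needs the `google-cloud-dataproc` package plus credentials.

```python
def build_pyspark_job(cluster_name, main_file_uri, args=()):
    """Describe a PySpark job whose script and data live in Cloud Storage."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_file_uri,
            "args": list(args),
        },
    }


def submit_job(project_id, region, job):
    """Submit the job and wait for it to finish. Needs credentials."""
    # Imported lazily so build_pyspark_job works without the package installed.
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()


if __name__ == "__main__":
    # Hypothetical bucket and script names; nothing is submitted here.
    job = build_pyspark_job(
        "example-cluster",
        "gs://example-bucket/jobs/etl.py",
        args=["gs://example-bucket/raw/", "gs://example-bucket/curated/"],
    )
    print(job["pyspark_job"]["main_python_file_uri"])
```

Because input and output are `gs://` paths rather than HDFS paths, the data outlives the cluster, which is what lets a migrated workload run on short-lived clusters.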

Large-scale batch ETL with Spark and Hadoop

Enterprise batch processing

Process and transform massive datasets efficiently using Spark, accelerated by Lightning Engine with Dataproc on Compute Engine, or MapReduce on customizable Dataproc clusters. Optimize complex ETL pipelines for performance and cost in a controlled environment.

Configurable data science and ML environments

Custom data science at scale

Spin up purpose-built Dataproc clusters with specific versions of Spark, Jupyter, and your required ML libraries for collaborative, large-scale model training and advanced analytics. Integrate with Vertex AI for MLOps.

AI/ML recipes for Dataproc

Running diverse open source analytics engines

Flexible OSS

Deploy dedicated clusters with Trino for interactive SQL, Flink for advanced stream processing, or other specialized open source engines alongside Spark and Hadoop, all managed by Dataproc.

Dataproc meets TensorFlow on YARN

Pricing

Dataproc pricing for managed clusters

Dataproc offers pay-as-you-go pricing. Optimize costs with autoscaling and preemptible VMs. Compute Engine premium tier enables faster Spark with Lightning Engine.

Key components:

Compute Engine instances (vCPU, memory)
Dataproc service fee (per vCPU-hour)
Persistent Disks

Example:

A cluster with 6 nodes (1 main + 5 workers) of 4 vCPUs each, running for 2 hours, incurs a Dataproc charge of $0.48. Dataproc charge = # of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 = $0.48. Compute Engine instances and Persistent Disks are billed separately.
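The arithmetic in the example can be checked with a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, the $0.01 per vCPU-hour default is the price used in the example above, and the result covers the Dataproc service fee alone, not Compute Engine or disk charges.

```python
from decimal import Decimal


def dataproc_fee(nodes, vcpus_per_node, hours,
                 price_per_vcpu_hour=Decimal("0.01")):
    """Dataproc charge = number of vCPUs * hours * Dataproc price."""
    total_vcpus = nodes * vcpus_per_node
    # Decimal avoids binary floating-point rounding in currency math.
    return total_vcpus * Decimal(str(hours)) * price_per_vcpu_hour


# 6 nodes (1 main + 5 workers) with 4 vCPUs each, running for 2 hours.
fee = dataproc_fee(nodes=6, vcpus_per_node=4, hours=2)
print(f"${fee:.2f}")  # prints $0.48
```

Changing `hours` or the node count shows how the fee scales linearly with total vCPU-hours, which is why autoscaling and short-lived clusters reduce it.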

See detailed Dataproc pricing

Pricing calculator

Estimate your monthly Dataproc costs, including region-specific pricing and fees.

Custom quote

Connect with our sales team to get a custom quote for your organization.

Start today

$300 in free credit for new customers

Have a large project?

Create a Dataproc cluster by using the Google Cloud console

Use the Cloud Storage connector with Apache Spark

The Architecture Center provides content resources across a wide variety of migration subjects and scenarios to help you