Dataproc

A managed platform for Spark, Hadoop, and open source analytics

Run fully managed Apache Spark, Hadoop, and 30+ open source framework clusters with ease and control. Accelerate Spark on Compute Engine with Lightning Engine and integrate with Google Cloud's open lakehouse.

Apache Spark is a trademark of the Apache Software Foundation.

Features

Robust Hadoop ecosystem support

Beyond Spark, Dataproc provides fully managed services for the complete Apache Hadoop stack (MapReduce, HDFS, YARN), plus Flink, Trino, Hive, and over 30 other open source tools. To support these, Dataproc integrates with Dataproc Metastore, a fully managed Hive Metastore service, simplifying metadata management for your traditional data lake components. Modernize traditional data lake workloads or build new applications with your preferred engines.

Managed Spark with Lightning Engine

Run demanding Spark workloads with the control of a managed Dataproc cluster, now accelerated to 3.6x* query speed by Lightning Engine** (in Preview). Experience significant performance gains for Spark SQL and DataFrame operations. Configure Spark environments precisely to your needs by choosing versions and libraries.

*The queries are derived from the TPC-DS standard and TPC-H standard and as such are not comparable to published TPC-DS standard and TPC-H standard results, as these runs do not comply with all requirements of the TPC-DS standard and TPC-H standard specification.

**Available for Dataproc on Compute Engine premium tier.

Flexible cluster configuration and management

Customize Dataproc clusters with a wide range of machine types (including GPUs), preemptible VMs, disk options, autoscaling policies, initialization actions, container/images, and optional components. Use features like Workflow Templates for orchestrating complex jobs and manage clusters via the console, gcloud, API, or client libraries. Gain deep visibility into cluster performance and health through integration with Cloud Monitoring, providing comprehensive metrics, dashboards, and alerting capabilities.
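As a rough illustration of the configuration options listed above, the following Python sketch assembles a cluster specification as a plain dictionary. The field names (`master_config`, `worker_config`, `secondary_worker_config`, `num_instances`, `machine_type_uri`) are modeled on the Dataproc cluster API and should be checked against the current reference before use. The project ID, cluster name, machine type, and the `build_cluster_spec` helper itself are hypothetical values chosen for the example, and nothing here calls the Dataproc service.

```python
import json


def build_cluster_spec(project_id, cluster_name, worker_count=2,
                       machine_type="n1-standard-4", secondary_workers=0):
    """Build a dictionary describing a cluster (illustrative helper, not an official API)."""
    spec = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # One main node plus the requested number of primary workers.
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": machine_type,
            },
            "worker_config": {
                "num_instances": worker_count,
                "machine_type_uri": machine_type,
            },
        },
    }
    # Secondary workers are optional; the key is added only when requested.
    if secondary_workers > 0:
        spec["config"]["secondary_worker_config"] = {
            "num_instances": secondary_workers,
        }
    return spec


# Hypothetical project and cluster names, used only for demonstration.
spec = build_cluster_spec("example-project", "example-cluster",
                          worker_count=5, secondary_workers=2)
print(json.dumps(spec, indent=2))
```

A dictionary of this general shape could then be passed to a client library or serialized for an API request, assuming the field names match the version of the API you are targeting.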

Open lakehouse connectivity

Dataproc clusters integrate natively with BigLake Metastore, allowing you to process data stored in open formats like Apache Iceberg on Cloud Storage. For traditional Hive-based metadata needs, there’s seamless integration with the managed Dataproc Metastore service. Leverage Dataplex Universal Catalog for unified discovery, lineage, and governance across your lakehouse assets. Extend your data applications by connecting Dataproc with BigQuery, Vertex AI, Spanner, Pub/Sub, and Data Fusion, creating powerful, end-to-end solutions.

Secure your open source data processing

Benefit from Google Cloud's robust security. Configure Kerberos, manage access with IAM, enforce network policies with VPC Service Controls, and use CMEK. Integrate with Dataplex Universal Catalog for centralized policy management and enable fine-grained access control with BigLake.

Empower data engineers and data scientists

Use familiar tools such as Jupyter and VS Code, running on your laptop, to connect to Dataproc clusters. Integrate Dataproc with Vertex AI Workbench for interactive Spark development on clusters and build end-to-end AI/ML pipelines with Vertex AI.

How It Works

Simplified cluster operations for powerful analytics

Common Uses

Data lake modernization and Hadoop migration

Modernize your data lake

Migrate on-premises Hadoop and Spark workloads to the cloud with ease. Use Dataproc to run MapReduce, Hive, Pig, and Spark jobs on data in Cloud Storage, integrated with Dataproc Metastore and governed by Dataplex Universal Catalog.

Large-scale batch ETL with Spark and Hadoop

Enterprise batch processing

Process and transform massive datasets efficiently using Spark, accelerated by Lightning Engine with Dataproc on Compute Engine, or MapReduce on customizable Dataproc clusters. Optimize complex ETL pipelines for performance and cost in a controlled environment.

Configurable data science and ML environments

Custom data science at scale

Spin up purpose-built Dataproc clusters with specific versions of Spark, Jupyter, and your required ML libraries for collaborative, large-scale model training and advanced analytics. Integrate with Vertex AI for MLOps.


Running diverse open source analytics engines

Flexible OSS

Deploy dedicated clusters with Trino for interactive SQL, Flink for advanced stream processing, or other specialized open source engines alongside Spark and Hadoop, all managed by Dataproc.

Pricing

Dataproc pricing for managed clusters

Dataproc offers pay-as-you-go pricing. Optimize costs with autoscaling and preemptible VMs. Compute Engine premium tier enables faster Spark with Lightning Engine.

Key components:

• Compute Engine instances (vCPU, memory)
• Dataproc service fee (per vCPU-hour)
• Persistent Disks

Example:

A cluster with 6 nodes (1 main + 5 workers) of 4 vCPUs each that runs for 2 hours incurs a Dataproc charge of $0.48. Dataproc charge = # of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 = $0.48
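The arithmetic in the example above can be sketched in a few lines of Python. This is a minimal illustration that uses the $0.01 per vCPU-hour rate from the example. The function name `dataproc_service_fee` is hypothetical, and the result covers only the Dataproc service fee, not the Compute Engine instances or Persistent Disks listed under key components.

```python
from decimal import Decimal

# Rate taken from the example above; check current pricing before relying on it.
DATAPROC_PRICE_PER_VCPU_HOUR = Decimal("0.01")


def dataproc_service_fee(nodes, vcpus_per_node, hours,
                         price=DATAPROC_PRICE_PER_VCPU_HOUR):
    """Return the Dataproc service fee: number of vCPUs * hours * price."""
    total_vcpus = nodes * vcpus_per_node
    # Convert hours through str so fractional values stay exact in Decimal.
    return total_vcpus * Decimal(str(hours)) * price


# 6 nodes (1 main + 5 workers) with 4 vCPUs each, running for 2 hours.
fee = dataproc_service_fee(nodes=6, vcpus_per_node=4, hours=2)
print(f"Dataproc charge: ${fee:.2f}")  # prints "Dataproc charge: $0.48"
```

Decimal is used instead of float so that small currency amounts do not pick up binary rounding error.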

Pricing calculator

Estimate your monthly Dataproc costs, including region-specific pricing and fees.

Custom quote

Connect with our sales team to get a custom quote for your organization.

Start today

$300 in free credit for new customers

Have a large project?

Create a Dataproc cluster by using the Google Cloud console

Use the Cloud Storage connector with Apache Spark

The Architecture Center provides content resources across a wide variety of migration subjects and scenarios to help you
