According to an ESG report, Google Cloud Dataproc can deliver 18% to 60% cost savings compared to other cloud-based Hadoop and Spark alternatives.
Run fully managed Apache Spark, Hadoop, and 30+ open source framework clusters with ease and control. Accelerate Spark on Compute Engine with Lightning Engine and integrate with Google Cloud's open lakehouse.
Apache Spark is a trademark of the Apache Software Foundation.
Features
Beyond Spark, Dataproc provides fully managed services for the complete Apache Hadoop stack (MapReduce, HDFS, YARN), plus Flink, Trino, Hive, and over 30 other open source tools. To support these, Dataproc integrates with Dataproc Metastore, a fully managed Hive Metastore service, simplifying metadata management for your traditional data lake components. Modernize traditional data lake workloads or build new applications with your preferred engines.
Run demanding Spark workloads with the control of a managed Dataproc cluster, now with 3.6x* faster queries from Lightning Engine** (in Preview). Experience significant performance gains for Spark SQL and DataFrame operations. Configure Spark environments precisely to your needs by choosing versions and libraries.
*The queries are derived from the TPC-DS standard and TPC-H standard and as such are not comparable to published TPC-DS standard and TPC-H standard results, as these runs do not comply with all requirements of the TPC-DS standard and TPC-H standard specification.
**Available for Dataproc on Compute Engine premium tier.
Customize Dataproc clusters with a wide range of machine types (including GPUs), preemptible VMs, disk options, autoscaling policies, initialization actions, container/images, and optional components. Use features like Workflow Templates for orchestrating complex jobs and manage clusters via the console, gcloud, API, or client libraries. Gain deep visibility into cluster performance and health through integration with Cloud Monitoring, providing comprehensive metrics, dashboards, and alerting capabilities.
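As a rough illustration of the customization options listed above, the following sketch builds a cluster specification as a plain Python dictionary. The key names follow the snake_case fields of the Dataproc Cluster resource as the author of this sketch understands them; the helper `build_cluster_spec`, the project, cluster, and bucket names, and the chosen machine type are all hypothetical placeholders. Nothing is sent to Google Cloud here, so treat it as a shape to adapt, not a definitive configuration.

```python
import json


def build_cluster_spec(project_id, cluster_name, workers=5, secondary_workers=0,
                       machine_type="n1-standard-4", init_script=None):
    """Return a dict shaped like a Dataproc cluster definition (hypothetical helper)."""
    config = {
        # One main node plus a configurable number of primary workers.
        "master_config": {"num_instances": 1, "machine_type_uri": machine_type},
        "worker_config": {"num_instances": workers, "machine_type_uri": machine_type},
        # Optional components are enabled by name; Jupyter is used as an example.
        "software_config": {"optional_components": ["JUPYTER"]},
    }
    if secondary_workers > 0:
        # Secondary workers can run on preemptible VMs to reduce cost.
        config["secondary_worker_config"] = {
            "num_instances": secondary_workers,
            "preemptibility": "PREEMPTIBLE",
        }
    if init_script:
        # Initialization actions are scripts run on each node at startup.
        config["initialization_actions"] = [{"executable_file": init_script}]
    return {"project_id": project_id, "cluster_name": cluster_name, "config": config}


# All names below are placeholders, not real resources.
spec = build_cluster_spec(
    "example-project",
    "example-cluster",
    secondary_workers=2,
    init_script="gs://example-bucket/install-libs.sh",
)
print(json.dumps(spec, indent=2))
```

A dictionary like this could then be passed to whichever management path you prefer (console, gcloud, API, or client libraries), after checking the field names against the current Dataproc documentation.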
Dataproc clusters integrate natively with BigLake Metastore, allowing you to process data stored in open formats like Apache Iceberg on Cloud Storage. For traditional Hive-based metadata needs, there’s seamless integration with the managed Dataproc Metastore service. Leverage Dataplex Universal Catalog for unified discovery, lineage, and governance across your lakehouse assets. Extend your data applications by connecting Dataproc with BigQuery, Vertex AI, Spanner, Pub/Sub, and Data Fusion, creating powerful, end-to-end solutions.
Benefit from Google Cloud's robust security. Configure Kerberos, manage access with IAM, enforce network policies with VPC Service Controls, and use customer-managed encryption keys (CMEK). Integrate with Dataplex Universal Catalog for centralized policy management and enable fine-grained access control with BigLake.
Use familiar tools and IDEs, such as Jupyter and VS Code running on your laptop, to connect to Dataproc clusters. Integrate Dataproc with Vertex AI Workbench for interactive Spark development on clusters and build end-to-end AI/ML pipelines with Vertex AI.
Common Uses
Modernize your data lake
Migrate on-premises Hadoop and Spark workloads to the cloud with ease. Use Dataproc to run MapReduce, Hive, Pig, and Spark jobs on data in Cloud Storage, integrated with Dataproc Metastore and governed by Dataplex Universal Catalog.
Custom data science at scale
Spin up purpose-built Dataproc clusters with specific versions of Spark, Jupyter, and your required ML libraries for collaborative, large-scale model training and advanced analytics. Integrate with Vertex AI for MLOps.
Flexible OSS
Deploy dedicated clusters with Trino for interactive SQL, Flink for advanced stream processing, or other specialized open source engines alongside Spark and Hadoop, all managed by Dataproc.
Pricing
| Dataproc pricing for managed clusters | Dataproc offers pay-as-you-go pricing. Optimize costs with autoscaling and preemptible VMs. Compute Engine premium tier enables faster Spark with Lightning Engine. |
|---|---|
| Key components: | |
| Example: | A cluster with 6 nodes (1 main + 5 workers) of 4 vCPUs each that runs for 2 hours would incur a Dataproc charge of $0.48. Dataproc charge = number of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 = $0.48 |
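The pricing example can be reproduced with a few lines of Python. This is a minimal sketch of the stated formula only: the function name `dataproc_charge` is made up for illustration, and the $0.01 per vCPU-hour rate is taken from the example above and may not match current pricing. It covers the Dataproc charge alone, not any other resource costs.

```python
from decimal import Decimal

# Rate taken from the example above; an assumption, not a current price quote.
DATAPROC_PRICE_PER_VCPU_HOUR = Decimal("0.01")


def dataproc_charge(nodes, vcpus_per_node, hours,
                    price=DATAPROC_PRICE_PER_VCPU_HOUR):
    """Dataproc charge = number of vCPUs * hours * Dataproc price."""
    total_vcpus = nodes * vcpus_per_node
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    return total_vcpus * Decimal(str(hours)) * price


# 6 nodes (1 main + 5 workers) with 4 vCPUs each, running for 2 hours.
charge = dataproc_charge(nodes=6, vcpus_per_node=4, hours=2)
print(f"${charge:.2f}")  # prints "$0.48"
```

Because the charge scales linearly with both vCPU count and runtime, halving either one halves the Dataproc charge, which is why autoscaling and short-lived clusters affect the bill directly.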