Dataproc documentation
Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
Documentation resources
Related resources
Run a Spark job on Google Kubernetes Engine
Submit Spark jobs to a running Google Kubernetes Engine cluster from the Dataproc Jobs API.
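As a minimal sketch of what such a submission looks like, the snippet below builds a `jobs.submit` request body in the shape accepted by the Dataproc Jobs API. The project, region, cluster name, and jar path are illustrative placeholders, not values from this page; the commented lines show where the google-cloud-dataproc client would send the request.

```python
# Sketch: build a Dataproc jobs.submit request for a Spark job targeting a
# running cluster (here, one backed by Google Kubernetes Engine). All names
# below are placeholders.

def build_spark_job_request(project_id: str, region: str, cluster_name: str,
                            main_class: str, jar_uris: list[str]) -> dict:
    """Return a request body for the Dataproc jobs.submit API."""
    return {
        "project_id": project_id,
        "region": region,
        "job": {
            "placement": {"cluster_name": cluster_name},
            "spark_job": {
                "main_class": main_class,
                "jar_file_uris": jar_uris,
            },
        },
    }

request = build_spark_job_request(
    "my-project", "us-central1", "my-gke-backed-cluster",
    "org.apache.spark.examples.SparkPi",
    ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
)
# With the google-cloud-dataproc client installed, this could be submitted as:
#   from google.cloud import dataproc_v1
#   client = dataproc_v1.JobControllerClient(
#       client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"})
#   client.submit_job(request=request)
```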
Introduction to Cloud Dataproc: Hadoop and Spark on Google Cloud
This course combines lectures, demos, and hands-on labs in which you create a Dataproc cluster, submit a Spark job, and then shut down the cluster.
Machine Learning with Spark on Dataproc
This course combines lectures, demos, and hands-on labs in which you use Apache Spark's machine learning library (MLlib) on a Dataproc cluster to build a logistic regression model from a multivariable dataset.
Workflow scheduling solutions
Schedule workflows on Google Cloud.
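One built-in scheduling option is a Dataproc workflow template, which runs a DAG of jobs on a managed cluster. The sketch below follows the shape of the Dataproc `workflowTemplates` API; the template ID, cluster name, and Cloud Storage URIs are illustrative placeholders.

```python
# Sketch: a Dataproc workflow template that runs two PySpark steps in order.
# All names and gs:// URIs are placeholders.

def build_workflow_template(template_id: str, cluster_name: str) -> dict:
    """Return a workflow template body for the Dataproc workflowTemplates API."""
    return {
        "id": template_id,
        "placement": {
            # The template can create (and later delete) its own cluster.
            "managed_cluster": {"cluster_name": cluster_name},
        },
        "jobs": [
            {
                "step_id": "prepare",
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/prepare.py"},
            },
            {
                "step_id": "train",
                # Runs only after the "prepare" step succeeds.
                "prerequisite_step_ids": ["prepare"],
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/train.py"},
            },
        ],
    }

template = build_workflow_template("nightly-training", "ephemeral-cluster")
```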
Migrate HDFS Data from On-Premises to Google Cloud
Learn how to move data from an on-premises Hadoop Distributed File System (HDFS) to Google Cloud.
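One common migration pattern is to run Hadoop DistCp as a Dataproc Hadoop job, copying HDFS paths into a Cloud Storage bucket. The sketch below builds such a request in the shape of the Dataproc Jobs API; DistCp's main class is real, but the project, cluster, and paths are placeholders.

```python
# Sketch: submit Hadoop DistCp as a Dataproc hadoop_job to copy data from an
# on-premises HDFS path into Cloud Storage. All names are placeholders.

def build_distcp_job_request(project_id: str, region: str, cluster_name: str,
                             source_hdfs: str, dest_gcs: str) -> dict:
    """Return a jobs.submit request that runs DistCp on the cluster."""
    return {
        "project_id": project_id,
        "region": region,
        "job": {
            "placement": {"cluster_name": cluster_name},
            "hadoop_job": {
                # DistCp ships with Hadoop on Dataproc clusters.
                "main_class": "org.apache.hadoop.tools.DistCp",
                "args": [source_hdfs, dest_gcs],
            },
        },
    }

request = build_distcp_job_request(
    "my-project", "us-central1", "my-cluster",
    "hdfs://on-prem-namenode:8020/data", "gs://my-bucket/data")
```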
Manage Java and Scala dependencies for Apache Spark
Recommended approaches to including dependencies when you submit a Spark job to a Dataproc cluster.
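As a hedged illustration of two such approaches, the sketch below builds the `job` portion of a Dataproc submit request that attaches dependencies either as jar URIs staged in Cloud Storage or as Maven coordinates resolved through the `spark.jars.packages` Spark property. The coordinates, class, and bucket names are placeholders.

```python
# Sketch: attach Java/Scala dependencies to a Dataproc Spark job either as
# explicit jar URIs or as Maven coordinates. All names are placeholders.

def build_spark_job_with_deps(cluster_name: str, main_class: str,
                              jar_uris: list[str],
                              maven_packages: list[str]) -> dict:
    """Return the `job` portion of a Dataproc jobs.submit request."""
    job = {
        "placement": {"cluster_name": cluster_name},
        "spark_job": {
            "main_class": main_class,
            # Jars staged in Cloud Storage are fetched at submit time.
            "jar_file_uris": jar_uris,
        },
    }
    if maven_packages:
        # Spark resolves these coordinates (and transitive deps) at startup.
        job["spark_job"]["properties"] = {
            "spark.jars.packages": ",".join(maven_packages)
        }
    return job

job = build_spark_job_with_deps(
    "my-cluster", "com.example.MyApp",
    ["gs://my-bucket/jars/my-app.jar"],
    ["org.apache.spark:spark-avro_2.12:3.3.0"])
```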
Python API samples
Call Dataproc APIs from Python.
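The Python samples use the google-cloud-dataproc client library, and because Dataproc is a regional service the client is pointed at a region-specific endpoint. The helper below only builds that endpoint string; the commented lines sketch where a real client would use it.

```python
# Sketch: the regional-endpoint pattern used when calling Dataproc from the
# Python client library. Project and region values are placeholders.

def regional_endpoint(region: str) -> str:
    """Return the regional Dataproc API endpoint for a client_options dict."""
    return f"{region}-dataproc.googleapis.com:443"

endpoint = regional_endpoint("us-central1")
# from google.cloud import dataproc_v1
# client = dataproc_v1.ClusterControllerClient(
#     client_options={"api_endpoint": endpoint})
# for cluster in client.list_clusters(project_id="my-project",
#                                     region="us-central1"):
#     print(cluster.cluster_name)
```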
Java API samples
Call Dataproc APIs from Java.
Node.js API samples
Call Dataproc APIs from Node.js.
Go API samples
Call Dataproc APIs from Go.