
#Spark #DataEngineering

Spark Physical and Logical Plan Analysis


using Spark (Scala & Python)

@shubhamDey Swipe >>>>


#Spark #DataEngineering

What are Physical and Logical Plans?


Spark's logical and physical plans provide a step-by-step representation of a query's journey from source to result, with detailed information about which data sources are read, which transformations are applied, and how the work is physically executed on the cluster.
They are mainly used by developers for debugging and performance tuning.

Here’s how it works:

• The code you write is first parsed into an unresolved logical plan; if it is valid (column and table references resolve against the catalog), Spark converts it into a logical plan.
• The logical plan is passed through the Catalyst optimizer, which applies optimization rules (for example, predicate pushdown and column pruning).
• The optimized logical plan is then converted into a physical plan.
• The physical plan is executed as tasks by the Spark executors.

@shubhamDey Swipe >>>>


#Spark #DataEngineering

Understanding with examples


We use a financial dataset from Kaggle to analyse monthly insights for financial instruments.
URL: https://www.kaggle.com/datasets/hanseopark/sp-500-stocks-value-with-financial-statement
Case study: we have written PySpark and Scala Spark code that computes the monthly highs and lows for each financial instrument, and we will see how Spark creates the logical and physical plans while executing this snippet of code.
Note: we will use Spark's explain function to understand the physical and logical plans.

DataFrame.explain(extended=None, mode=None)

Prints the (logical and physical) plans to the console for debugging purposes. The mode parameter specifies the expected output format of the plans (a short usage example follows the list):

• explain(mode="simple") – displays only the physical plan
• explain(mode="extended") – displays both the logical and physical plans
• explain(mode="codegen") – displays the physical plan and the generated Java code, if available
• explain(mode="cost") – displays the optimized logical plan and related statistics (if they exist)
• explain(mode="formatted") – displays a split output: a concise physical plan outline plus a section with details for each node
• explain(extended=False) – prints only the physical plan (same as "simple")
• explain(extended=True) – prints all the plans, logical and physical (same as "extended")
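For example, on a DataFrame named priceDf (the name used later in this post), the calls look like this:

priceDf.explain()                  # default, same as mode="simple": physical plan only
priceDf.explain(mode="extended")   # all logical plans plus the physical plan
priceDf.explain(mode="formatted")  # plan outline plus per-node details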

@shubhamDey Swipe >>>>


#Spark #DataEngineering

PySpark example
We use groupBy with aggregation functions to solve the problem statement above.
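The original code image is not reproduced here, so the following is a minimal PySpark sketch of the aggregation described; the file name and the column names (Date, symbol, High, Low) are assumptions for illustration, not confirmed details of the Kaggle dataset.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MonthlyHighLow").getOrCreate()

# Column names and the CSV path are assumed; adjust them to the real schema.
priceDf = (
    spark.read.option("header", True).option("inferSchema", True)
    .csv("sp500_prices.csv")
    .withColumn("month", F.date_format(F.col("Date"), "yyyy-MM"))
    .groupBy("symbol", "month")
    .agg(F.max("High").alias("monthly_high"),
         F.min("Low").alias("monthly_low"))
)

priceDf.explain(mode="formatted")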

@shubhamDey Swipe >>>>


#Spark #DataEngineering

Scala Spark example


Similarly, we use the Scala Spark code below to solve the same problem statement.
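As above, the original image is not reproduced, so this is a minimal Scala sketch under the same assumed file name and column names.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("MonthlyHighLow").getOrCreate()

// Column names and the CSV path are assumed; adjust them to the real schema.
val priceDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("sp500_prices.csv")
  .withColumn("month", date_format(col("Date"), "yyyy-MM"))
  .groupBy("symbol", "month")
  .agg(max("High").as("monthly_high"), min("Low").as("monthly_low"))

priceDf.explain("formatted")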

@shubhamDey Swipe >>>>


#Spark #DataEngineering

Physical Plan explained


As you can see, explain("formatted") lays out the whole data journey in a readable format: reading the CSV, projecting the required columns, a partial aggregation, a data shuffle (exchange), the final aggregation, and then the final result.
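The exact output varies by Spark version and query, but for an aggregation like the one above, the formatted outline typically looks roughly like this (illustrative, not verbatim output):

== Physical Plan ==
* HashAggregate (5)          <- final min/max per (symbol, month)
+- Exchange (4)              <- shuffle on the grouping keys
   +- * HashAggregate (3)    <- partial aggregation within each partition
      +- * Project (2)       <- keep only the needed columns
         +- Scan csv (1)     <- read the input file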

@shubhamDey Swipe >>>>


#Spark #DataEngineering

How can I see the logical plan along with the physical plan?

For that, we need to change the explain argument to priceDf.explain('extended'). You can then see the whole journey from the logical plan to the physical plan. Similarly, you can try each parameter and learn more about Spark's internal workings.
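With 'extended', the output is split into four sections, printed in the order the plans are produced:

== Parsed Logical Plan ==      <- the unresolved plan, straight from the code
== Analyzed Logical Plan ==    <- references resolved against the catalog
== Optimized Logical Plan ==   <- after Catalyst optimization rules
== Physical Plan ==            <- the plan that actually runs on the executors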

@shubhamDey Swipe >>>>
