What is DAG in Apache Spark?

Recipe Objective - What is DAG in Apache Spark?

A DAG, or Directed Acyclic Graph, is a set of vertices and edges in which the vertices represent Resilient Distributed Datasets (RDDs) and the edges represent the operations applied to those RDDs. In a Spark DAG, every edge points from an earlier step to a later one in the sequence; when an action is called, the DAG built so far is submitted to the DAG Scheduler, which splits the graph into stages of tasks. The Spark DAG is a strict generalization of the MapReduce model, and because the entire graph is visible before execution, Spark can perform better global optimization than systems like MapReduce. The Spark UI lets a user dive into any stage and expand its details: the stage view of the DAG shows all the RDDs belonging to that stage. The scheduler splits the RDD graph into stages based on the transformations applied.
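As a concrete illustration of this lineage, the DAG Spark records for an RDD can be printed with `toDebugString`. This is a minimal sketch assuming a spark-shell session, where a SparkContext is already available as `sc`:

```scala
// Sketch for the spark-shell, where an implicit SparkContext `sc` exists.
// Each transformation adds an edge to the lineage; nothing executes yet.
val base    = sc.parallelize(1 to 1000)  // vertex: the source RDD
val doubled = base.map(_ * 2)            // edge: a map operation
val evens   = doubled.filter(_ % 4 == 0) // edge: a filter operation

// Print the recorded DAG (lineage) without triggering a job
println(evens.toDebugString)
```

Because `toDebugString` only inspects the dependency graph, it runs without launching any tasks, which makes it a convenient way to see the DAG before calling an action.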

This recipe explains what a DAG is in Spark and why it is important in Apache Spark.

Explanation of DAG in Spark

Apache Spark begins its DAG work by interpreting the code with some modifications; the first layer is the interpreter, for which Spark uses a Scala interpreter. As the code is entered in the Spark console, Spark builds up the operator graph.

// Importing the package
import org.apache.spark.sql.SparkSession


The SparkSession class from the org.apache.spark.sql package is imported into the environment so the DAG examples can run.

// Defining Transformations
val easyNumbers = spark.range(1, 1000000)
val diff_time = easyNumbers.selectExpr("id * 4 as id")


Transformations are lazy: defining them only makes Spark build up a dependency graph of the DataFrames, which executes when an action is called.

// Defining an action for DAGs
diff_time.show()


Spark performs the computation only after diff_time.show() is called and executed; that is, an action triggers a Spark job.

// Reading the DAGs
val toughNumbers = spark.range(1, 10000000, 2)
val splitting6 = toughNumbers.repartition(7)
splitting6.take(2)


To read the resulting DAG, a range is defined using the range() function and then repartitioned using the repartition() function; take(2) is the action that triggers the job.

// Staging in DAGs
val dstage1 = spark.range(1, 10000000)
val dstage2 = spark.range(1, 10000000, 2)
val dstage3 = dstage1.repartition(7)
val dstage4 = dstage2.repartition(9)
val dstage5 = dstage3.selectExpr("id * 4 as id")
val joined = dstage5.join(dstage4, "id")
val sum = joined.selectExpr("sum(id)")
sum.show()


Vertical sequences in DAGs are known as "stages". Here the stages are built in the DAG using the range(), repartition(), selectExpr(), and join() functions, and the output is produced using the show() function.

At a high level, calling an action on a Spark RDD submits the operator graph to the DAG Scheduler. The DAG Scheduler divides the operators into stages of tasks; a stage contains tasks based on the partitions of the input data. The DAG Scheduler also pipelines operators together where possible.
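One way to see where these stage boundaries fall (a sketch assuming the same spark-shell session, where `spark` is the SparkSession) is to print the physical plan with explain(): each Exchange operator marks a shuffle, and a shuffle is where the DAG Scheduler cuts a new stage.

```scala
// Sketch: wide transformations (repartition, join) introduce Exchange
// nodes in the plan, which become stage boundaries in the DAG.
val left  = spark.range(1, 1000).repartition(4)
val right = spark.range(1, 1000, 2).repartition(3)
val plan  = left.join(right, "id")

// explain() prints the physical plan; each Exchange marks a shuffle
// boundary, and therefore the start of a new stage.
plan.explain()
```

Counting the Exchange nodes in the printed plan gives a good estimate of how many stages the job in the previous snippet will run.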

The stages are then passed on to the Task Scheduler, which launches the tasks through the cluster manager. The Task Scheduler does not know about the dependencies between stages. Finally, the workers execute the tasks on the worker nodes.
