Tuesday, 28 December 2021

Spark

What is Spark: Spark is an open-source engine for large-scale data processing. It distributes data across multiple machines (nodes) in a cluster and processes it in parallel. Spark breaks each job into stages, and each stage into tasks. Tasks within a stage can run in parallel, while stages that depend on one another's output run in sequence. Spark is often faster than Hadoop MapReduce because it can cache data and execute operations in memory.
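To make the benefit of caching concrete, here is a minimal sketch in plain Python. It is not Spark code: the class CachedDataset and the function expensive_load are hypothetical names invented for this illustration. The idea is only that data loaded once and kept in memory can be reused by later operations without being loaded again.

```python
class CachedDataset:
    """Toy stand-in for a dataset that is kept in memory after the first load."""

    def __init__(self, loader):
        self._loader = loader   # function that produces the data
        self._cached = None     # in-memory copy, filled on first use
        self.load_count = 0     # how many times the loader actually ran

    def rows(self):
        # Load only if nothing is cached yet; otherwise reuse the copy in memory.
        if self._cached is None:
            self.load_count += 1
            self._cached = list(self._loader())
        return self._cached


def expensive_load():
    """Hypothetical slow data source (imagine a read from disk)."""
    return range(1, 6)


ds = CachedDataset(expensive_load)
total = sum(ds.rows())     # first use: triggers the load
largest = max(ds.rows())   # second use: served from memory
print(total, largest, ds.load_count)
```

Both operations see the same data, but the loader runs only once. In PySpark, the comparable calls on an RDD are cache() and persist().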


Spark Architecture:

Driver program: The driver program creates the SparkContext. The SparkContext represents the connection to the Spark cluster and is then used to create RDDs (Resilient Distributed Datasets), accumulators, and broadcast variables on that cluster.

Cluster manager: The driver program and SparkContext work with the cluster manager to run a job once an RDD has been created. The job is split into tasks, which are distributed to the nodes and executed in parallel. The cluster manager can be Hadoop YARN, Apache Mesos, or Spark's own standalone cluster manager.

Worker node: A worker node is a machine in the Spark cluster. Each worker node hosts an executor, which runs the tasks assigned to it.

Spark can create a parallel collection from existing data by parallelizing it into an RDD. An RDD is split into blocks called partitions, and these partitions can be processed in parallel on the worker nodes.
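The idea of splitting data into blocks and processing them in parallel can be sketched in plain Python. This is only an illustration and not Spark's implementation: the helper names split_into_partitions, square_partition and run_in_parallel are made up for this example, and a thread pool stands in for the executors on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous blocks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def square_partition(partition):
    """The task applied to one partition (stands in for work done by an executor)."""
    return [x * x for x in partition]


def run_in_parallel(data, num_partitions=4):
    """Process every partition in its own thread and merge the results in order."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(square_partition, partitions))  # map keeps input order
    return [x for part in results for x in part]


numbers = list(range(1, 11))
partitions = split_into_partitions(numbers, 4)
squares = run_in_parallel(numbers, 4)
print(partitions)
print(squares)
```

In PySpark, the comparable step is SparkContext.parallelize(data, numSlices), where numSlices sets the number of partitions.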

Spark works out the execution path for a job using a DAG (directed acyclic graph). In the DAG, each node represents an RDD (made up of partitions), and each edge represents an operation applied to it.
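A small plain-Python sketch can show how a graph of datasets and operations gives an execution path. It is a toy model under stated assumptions, not Spark's scheduler: the Dataset class and its methods are hypothetical names for this example, and the graph here is a simple chain. Each object is a node, and the operation that produced it from its parent is the edge. Nothing is computed until collect() is called.

```python
class Dataset:
    """A node in the graph; the operation linking it to its parent is the edge."""

    def __init__(self, source=None, parent=None, op_name=None, op=None):
        self.source = source    # raw data, only set on the root node
        self.parent = parent    # the node this one is derived from
        self.op_name = op_name  # label of the edge from parent to this node
        self.op = op            # function that turns the parent's rows into ours

    def map(self, fn):
        # Only records the operation; no data is processed yet.
        return Dataset(parent=self, op_name="map",
                       op=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, op_name="filter",
                       op=lambda rows: [r for r in rows if pred(r)])

    def lineage(self):
        """Return the edge labels from the root node down to this node."""
        node, ops = self, []
        while node.parent is not None:
            ops.append(node.op_name)
            node = node.parent
        return list(reversed(ops))

    def collect(self):
        """Walk back to the root, then apply each recorded operation in order."""
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.collect())


base = Dataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
plan = evens_squared.lineage()
result = evens_squared.collect()
print(plan, result)
```

Because every node remembers how it was derived, the full path from the source data to the final result is known before any work starts, which is what lets an engine plan the execution.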




  
