Tuesday 28 December 2021

Spark - RDD

RDD stands for Resilient (it can be recovered in case of failure) Distributed Dataset. An RDD is an immutable distributed collection. Spark divides an RDD's data into multiple partitions (blocks), and each partition can be processed by one worker node in the cluster. When reading from HDFS, the default block size is 128 MB.


An RDD can be created by parallelizing an existing collection using the parallelize method of the SparkContext.

For example:

 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD

 val sc = new SparkContext("local[*]", "JustExample")
 val input = List((10, 1), (11, 2), (13, 2), (10, 5))
 val rdd: RDD[(Int, Int)] = sc.parallelize(input)
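
The number of partitions can also be passed explicitly as the second argument to parallelize; a small sketch (the count of 4 here is just an illustration):

 val rdd4: RDD[(Int, Int)] = sc.parallelize(input, 4) // split the data into 4 partitions
 println(rdd4.getNumPartitions)                       // prints 4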

RDDs support two types of operations:

Transformation: Returns a new RDD by applying the given function to the elements of the existing RDD. Transformations include map, flatMap, filter, join, distinct, mapValues, reduceByKey, etc.
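
For example, a sketch of a few transformations applied to the rdd from the snippet above (the outputs in the comments assume that input list):

 val doubled: RDD[(Int, Int)] = rdd.mapValues(_ * 2)  // (10,2), (11,4), (13,4), (10,10)
 val bigKeys: RDD[(Int, Int)] = rdd.filter(_._1 > 10) // (11,2), (13,2)
 val summed: RDD[(Int, Int)] = rdd.reduceByKey(_ + _) // (10,6), (11,2), (13,2)
 // Each call above returns a new RDD; nothing is computed until an action runs.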


Action: RDDs are lazy and get executed only once an action is performed. An action produces the actual result, which is returned to the driver program. Actions include count, collect, reduce, take, countByValue, foreach, etc.

For example:

val sc = new SparkContext("local[*]", "JustExample")
val input = List((10, 1), (11, 2), (13, 2), (10, 5))
val rdd: RDD[(Int, Int)] = sc.parallelize(input)
val res: Long = rdd.count() // count is an action, so the job is executed here
println(res)                // prints 4
sc.stop()

Note that the type of res is Long, not RDD[Long], because an action produces the actual result rather than another RDD.

An RDD can be cached or persisted using the cache() and persist() methods respectively. By default, caching is done in memory. We can also persist to disk, or to disk + memory, in serialized or deserialized form.
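
A minimal sketch of persisting with an explicit storage level, assuming the rdd from the snippets above:

 import org.apache.spark.storage.StorageLevel

 rdd.persist(StorageLevel.MEMORY_AND_DISK) // deserialized in memory, spills to disk when full
 // Other levels include MEMORY_ONLY (what cache() uses), DISK_ONLY,
 // and MEMORY_AND_DISK_SER (serialized: smaller footprint, more CPU).
 rdd.count()     // the first action materializes and caches the partitions
 rdd.unpersist() // release the cached data when no longer needed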


