Thursday 13 January 2022

Broadcast and Accumulator Variables

Broadcast variable: Broadcast is a shared variable across the executors. This is a read only variable. Executor cache the broadcast variable. 

How to create broadcast variable?

    val spark = SparkSession
      .builder
      .appName("broadcastVariableExample")
      .master("local[*]")
      .getOrCreate()
    val list = List("India", "Germany", "Japan")
    val brodcastList = spark.sparkContext.broadcast(list)
//extract value from broadcast variable
val extractedList = brodcastList.value


When to use?

When we have common data which is required for multiple executors, instead of shipping common data whenever required for the executors, we can use broadcast variable.

Accumulator:   Accumulator variables can only be used with the associative and commutative operation so they work well in parallel operation without any side effects. Accumulator used to perform sum, counter, average, etc operation. Only the driver program can read the value of the accumulator. An executor can only execute operations like add, avg on accumulator variable.

How to create?

    val spark = SparkSession
      .builder
      .appName("AccExample")
      .master("local[*]")
      .getOrCreate()

    val firstAccumulatorVariable = spark.sparkContext.doubleAccumulator("firstAccumulator")
    spark.sparkContext.parallelize(Array(1,2)).foreach(x => firstAccumulatorVariable.add(x))
    
    //extract value
    println(firstAccumulatorVariable.value) //3.0


No comments:

Post a Comment