
Scala Data Collections Performance: Sequence Types

Scala offers a variety of data collections, and choosing the proper one is important when optimizing big data pipelines.

This post studies the characteristics of Scala collections, such as:

  • Memory usage,

  • Operation time.

from the point of view of practical benchmarks. We study the performance of data collections in two posts: part one covers Sequence collections; the second covers Maps and Sets. The diagram below shows all of the collections in the package scala.collection. These are all high-level abstract classes or traits with both mutable and immutable implementations.

Mutable and Immutable Data Collection High-level Abstract Classes or Traits

Figure 1: Mutable and Immutable Data Collection High-level Abstract Classes or Traits

Sequence Type

The Seq trait represents sequences. A sequence is a kind of iterable that has a length and whose elements have fixed index positions, beginning at 0. Seq collections are divided into immutable and mutable types. The following figure shows all Seq collections in the package scala.collection.immutable.
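As a small illustration of these properties, a minimal sketch using the generic Seq constructor (the element values here are arbitrary):

```scala
// A Seq has a length and zero-based, fixed index positions.
val xs: Seq[Int] = Seq(10, 20, 30)
assert(xs.length == 3)
assert(xs(0) == 10)            // apply: element at index 0
assert(xs.head == 10)          // first element
assert(xs.tail == Seq(20, 30)) // all elements but the first
```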

Immutable Seq Data Collections

Figure 2: Immutable Seq Data Collections

And the following figure shows Seq collections in package scala.collection.mutable.

Mutable Seq Data Collections Part 1

Figure 3: Mutable Seq Data Collections Part 1

Mutable Seq Data Collections Part 2

Figure 4: Mutable Seq Data Collections Part 2

Before seeing the collection benchmark tables, it is useful to review the collection definition and its properties.

Benchmark Codes

We created a Scala project with sbt to benchmark the data collections.

// build.sbt
scalaVersion := "2.12.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
enablePlugins(PackPlugin)

To calculate the size of an object, the org.apache.spark.util.SizeEstimator.estimate function is useful. It estimates the size of a Java object (the number of bytes of memory it occupies).

import org.apache.spark.sql._
import org.apache.spark.sql.functions.col
import org.apache.spark.util.SizeEstimator.estimate
import scala.collection.AbstractSeq
import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

To create a result DataFrame and write out the results, we use Spark (though this is not strictly necessary).

val spark = SparkSession
  .builder()
  .appName("Collection_Benchmark")
  .master("local[2]")
  .getOrCreate()

import spark.implicits._

We need a timing function to calculate run time, so we use `System.nanoTime` to measure time at nanosecond resolution.

def timeElapsing(benchmarkFunction: => Unit, message:Boolean = false)(times:Int = 1): Double = {
  if(message) println("Benchmark: IS Starting ...")
  val startTime = System.nanoTime()
  for (_ <- 0 until times)
    benchmarkFunction
  val endTime = System.nanoTime()
  val timeElapsed = (endTime - startTime).toDouble / times.toDouble
  if(message) println(s"Operation Took $timeElapsed ns on average")
  timeElapsed
}
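As a quick usage sketch (the helper is repeated here so the snippet runs on its own; the list size and repetition count are arbitrary):

```scala
// Repeated from above so this sketch is self-contained.
def timeElapsing(benchmarkFunction: => Unit, message: Boolean = false)(times: Int = 1): Double = {
  val startTime = System.nanoTime()
  for (_ <- 0 until times) benchmarkFunction
  val endTime = System.nanoTime()
  (endTime - startTime).toDouble / times.toDouble
}

// Average time (in nanoseconds) of summing a 10,000-element list over 1,000 runs.
val xs = (0 until 10000).toList
val avgNanos = timeElapsing(xs.sum)(1000)
assert(avgNanos > 0)
```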

Among all the data collections, only some have an insert method. We define the insertTime function only for these collections, as shown below.

def insertTime(x:AbstractSeq[Int], n:Int, m:Int):Double = x match {
  case x:ArrayBuffer[Int] => timeElapsing(x.updated(m,0))(n)
  case x:ListBuffer[Int] => timeElapsing(x.updated(m,0))(n)
  case x:mutable.MutableList[Int] => timeElapsing(x.updated(m,0))(n)
  case x:mutable.Queue[Int] => timeElapsing(x.updated(m,0))(n)
  case x:mutable.ArrayStack[Int] => timeElapsing(x.updated(m,0))(n)
  case _ => -1
}

The main parts of the benchmark are the benchmark*** functions, which measure the elapsed time of the main methods.

def benchmarkSeq(x:AbstractSeq[Int], n:Int, m:Int): Map[String, Double] = {
  Map(
    "volume" -> estimate(x),
    "head" -> timeElapsing(x.head)(n),
    "tail" -> timeElapsing(x.tail)(n),
    "apply" -> timeElapsing(x.apply(m))(n),
    "update" -> timeElapsing(x.updated(m,0))(n),
    "prepend" -> timeElapsing(0+:x)(n),
    "append" -> timeElapsing(x:+0)(n),
    "insert" -> insertTime(x, n, m)
  )
}

Similar to benchmarkSeq, we define the benchmarkString, benchmarkStringBuilder and benchmarkArray functions.
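These variants are not shown in the post; below is a plausible sketch of benchmarkString, assuming the same shape as benchmarkSeq and reusing the estimate and timeElapsing helpers defined above (String has no insert method, so that key is fixed at -1):

```scala
// Hypothetical reconstruction; mirrors benchmarkSeq for String.
def benchmarkString(x: String, n: Int, m: Int): Map[String, Double] = Map(
  "volume" -> estimate(x),
  "head" -> timeElapsing(x.head)(n),
  "tail" -> timeElapsing(x.tail)(n),
  "apply" -> timeElapsing(x.apply(m))(n),
  "update" -> timeElapsing(x.updated(m, '0'))(n),
  "prepend" -> timeElapsing('0' +: x)(n),
  "append" -> timeElapsing(x :+ '0')(n),
  "insert" -> -1.0
)
```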

To measure timings correctly for arrays of boxed (reference) elements, we also define a benchmark over Array[Object].

def obj = new Object()
def benchmarkArrayBoxed(x: Array[Object], n: Int, m: Int): Map[String, Double] = Map(
  "volume" -> estimate(x),
  "head" -> timeElapsing(x.head)(n),
  "tail" -> timeElapsing(x.tail)(n),
  "apply" -> timeElapsing(x.apply(m))(n),
  "update" -> timeElapsing(x.updated(m, 0))(n),
  "prepend" -> timeElapsing(obj +: x)(n),
  "append" -> timeElapsing(x :+ obj)(n),
  "insert" -> timeElapsing(x.updated(m, 0))(n)
)

We benchmark along two dimensions: object size and method performance. Objects with lengths of \(16^0, 16^1, \ldots, 16^5\) are generated to measure their sizes. To check the performance of the methods, objects with sizes of \(100000, 200000, \ldots, 1000000\) are generated.

val sizes = (0 to 5).map(x => math.pow(16, x).toInt) ++ (1 to 10).map(_ * 100000)
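For reference, the concrete values this expression produces can be checked directly (note that \(16^5 = 1048576\)):

```scala
val sizes = (0 to 5).map(x => math.pow(16, x).toInt) ++ (1 to 10).map(_ * 100000)
// Powers of 16 for the memory measurements ...
assert(sizes.take(6) == Seq(1, 16, 256, 4096, 65536, 1048576))
// ... and multiples of 100,000 for the method timings.
assert(sizes.drop(6) == (1 to 10).map(_ * 100000))
```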

As you can see below, each method is run 100 times on each object, and the results are collected.

val stats = for(s <- sizes) yield {
  val integers = 0 until s
  List(
    ("Immutable_List", integers.toList),
    ("Immutable_Stream", integers.toStream),
    ("Immutable_Vector", integers.toVector),
    ("Immutable_Queue", scala.collection.immutable.Queue(integers: _*)),
    ("Immutable_Range", integers),
    ("Immutable_String", "1" * s),
    ("Mutable_ArrayBuffer", scala.collection.mutable.ArrayBuffer(integers: _*)),
    ("Mutable_ListBuffer", scala.collection.mutable.ListBuffer(integers: _*)),
    ("Mutable_StringBuilder", new scala.collection.mutable.StringBuilder("1" * s)),
    ("Mutable_MutableList", scala.collection.mutable.MutableList(integers: _*)),
    ("Mutable_Queue", scala.collection.mutable.Queue(integers: _*)),
    ("Mutable_ArraySeq", scala.collection.mutable.ArraySeq(integers: _*)),
    ("Mutable_ArrayStack", scala.collection.mutable.ArrayStack(integers: _*)),
    ("Mutable_Array", integers.toArray),
    ("Mutable_Boxed_Array", {
      val boxedArray = new Array[Object](s)
      var i = 0
      while (i < s) {
        boxedArray(i) = obj; i += 1
      }
      boxedArray
    })
  ).map {
    case (c, cl: AbstractSeq[Int]) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkSeq(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: Array[Object]) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkArrayBoxed(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: Array[Int]) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkArray(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: String) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkString(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: StringBuilder) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkStringBuilder(cl, 100, s - 1).map(x => (x._1, x._2.toString))
  }
}

The last step is writing the results to a CSV file with spark.write.

val colNames = stats(0).head.toList.sortBy(_._1).map(_._1)
  .zipWithIndex.map(x => col("value")(x._2).as(x._1))

stats.flatten.map(x => x.toList.sortBy(_._1).map(_._2))
  .toDF.select(colNames:_*)
  .coalesce(1).write.option("header","true").mode("overwrite")
  .csv("./collection_seq_size_benchmark.csv")

Object Size in Memory

The benchmark data is now available! The table below displays the estimated size, in bytes, of various collections at different lengths.

The average memory size of each object is calculated and shown below.

Methods Performance

Before looking at the benchmark results, it is useful to have an overview of the methods applied to the objects. The table below gives more details.

We can find the performance characteristics of Scala collections in the Scala documentation. The Scala performance table is reproduced below for comparison with the empirical results.

The performance results for each method and collection are plotted as scatter plots. We add a regression line to each plot to show the growth rate. The plot below shows the performance of the immutable collections.

Immutable Collection Methods Performance

Figure 5: Immutable Collection Methods Performance

A similar plot is drawn for the mutable collections.

Mutable Collection Methods Performance

Figure 6: Mutable Collection Methods Performance
