
Scala Data Collections Performance: Sequence Types

Scala offers a variety of data collections, and choosing the proper one is important when optimizing big data pipelines.

This post studies the characteristics of Scala collections, such as:

  • Memory usage,

  • Operation time.

from the point of view of practical benchmarks. We study the performance of data collections in two posts: part one covers Sequence collections; the second covers Maps and Sets. The diagram below shows all of the collections in the package scala.collection. These are all high-level abstract classes or traits with both mutable and immutable implementations.

Mutable and Immutable Data Collection High-level Abstract Classes or Traits

Figure 1: Mutable and Immutable Data Collection High-level Abstract Classes or Traits

Sequence Type

The Seq trait represents sequences. A sequence is a kind of iterable that has a length and whose elements have fixed index positions, beginning at 0. Seq collections are divided into immutable and mutable types. The following figure shows all Seq collections in the package scala.collection.immutable.
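As a small illustration of these properties, a minimal sketch using the generic Seq constructor (the element values here are arbitrary):

```scala
// A Seq has a length and zero-based, fixed index positions.
val xs: Seq[Int] = Seq(10, 20, 30)
assert(xs.length == 3)
assert(xs(0) == 10)            // apply: element at index 0
assert(xs.head == 10)          // first element
assert(xs.tail == Seq(20, 30)) // all elements but the first
```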

Immutable Seq Data Collections

Figure 2: Immutable Seq Data Collections

And the following figure shows Seq collections in package scala.collection.mutable.

Mutable Seq Data Collections Part 1

Figure 3: Mutable Seq Data Collections Part 1

Mutable Seq Data Collections Part 2

Figure 4: Mutable Seq Data Collections Part 2

Before seeing the collection benchmark tables, it is useful to review the collection definition and its properties.

Benchmark Codes

We created a Scala project with sbt to benchmark the data collections.

// build.sbt
scalaVersion := "2.12.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
enablePlugins(PackPlugin)

To calculate the size of an object, the org.apache.spark.util.SizeEstimator.estimate function is useful. It estimates the size of a Java object (the number of bytes of memory it occupies).

import org.apache.spark.sql._
import org.apache.spark.sql.functions.col
import org.apache.spark.util.SizeEstimator.estimate
import scala.collection.AbstractSeq
import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

To create a result DataFrame and write out the results, we use Spark (though this is not strictly necessary).

val spark = SparkSession
  .builder()
  .appName("Collection_Benchmark")
  .master("local[2]")
  .getOrCreate()

import spark.implicits._

We need a timing function to calculate run time, so we use `System.nanoTime` to measure time at nanosecond resolution.

def timeElapsing(benchmarkFunction: => Unit, message:Boolean = false)(times:Int = 1): Double = {
  if(message) println("Benchmark: IS Starting ...")
  val startTime = System.nanoTime()
  for (_ <- 0 until times)
    benchmarkFunction
  val endTime = System.nanoTime()
  val timeElapsed = (endTime - startTime).toDouble / times.toDouble
  if(message) println(s"Operation Took $timeElapsed ns on average")
  timeElapsed
}
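As a quick usage sketch (the helper is repeated here so the snippet runs on its own; the list size and repetition count are arbitrary):

```scala
// Repeated from above so this sketch is self-contained.
def timeElapsing(benchmarkFunction: => Unit, message: Boolean = false)(times: Int = 1): Double = {
  val startTime = System.nanoTime()
  for (_ <- 0 until times) benchmarkFunction
  val endTime = System.nanoTime()
  (endTime - startTime).toDouble / times.toDouble
}

// Average time (in nanoseconds) of summing a 10,000-element list over 1,000 runs.
val xs = (0 until 10000).toList
val avgNanos = timeElapsing(xs.sum)(1000)
assert(avgNanos > 0)
```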

Among all the data collections, only some have an insert method. We define the insertTime function only for these collections, as shown below.

def insertTime(x:AbstractSeq[Int], n:Int, m:Int):Double = x match {
  case x:ArrayBuffer[Int] => timeElapsing(x.updated(m,0))(n)
  case x:ListBuffer[Int] => timeElapsing(x.updated(m,0))(n)
  case x:mutable.MutableList[Int] => timeElapsing(x.updated(m,0))(n)
  case x:mutable.Queue[Int] => timeElapsing(x.updated(m,0))(n)
  case x:mutable.ArrayStack[Int] => timeElapsing(x.updated(m,0))(n)
  case _ => -1
}

The main parts of the benchmark are the benchmark*** functions, which measure the elapsed time of the main methods.

def benchmarkSeq(x:AbstractSeq[Int], n:Int, m:Int): Map[String, Double] = {
  Map(
    "volume" -> estimate(x),
    "head" -> timeElapsing(x.head)(n),
    "tail" -> timeElapsing(x.tail)(n),
    "apply" -> timeElapsing(x.apply(m))(n),
    "update" -> timeElapsing(x.updated(m,0))(n),
    "prepend" -> timeElapsing(0+:x)(n),
    "append" -> timeElapsing(x:+0)(n),
    "insert" -> insertTime(x, n, m)
  )
}

Similar to benchmarkSeq, we define the benchmarkString, benchmarkStringBuilder and benchmarkArray functions.
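These variants are not shown in the post; below is a plausible sketch of benchmarkString, assuming the same shape as benchmarkSeq and reusing the estimate and timeElapsing helpers defined above (String has no insert method, so that key is fixed at -1):

```scala
// Hypothetical reconstruction; mirrors benchmarkSeq for String.
def benchmarkString(x: String, n: Int, m: Int): Map[String, Double] = Map(
  "volume" -> estimate(x),
  "head" -> timeElapsing(x.head)(n),
  "tail" -> timeElapsing(x.tail)(n),
  "apply" -> timeElapsing(x.apply(m))(n),
  "update" -> timeElapsing(x.updated(m, '0'))(n),
  "prepend" -> timeElapsing('0' +: x)(n),
  "append" -> timeElapsing(x :+ '0')(n),
  "insert" -> -1.0
)
```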

To measure timings correctly for arrays of boxed (reference) elements, we also define a benchmark over Array[Object].

def obj = new Object()
def benchmarkArrayBoxed(x: Array[Object], n: Int, m: Int): Map[String, Double] = Map(
  "volume" -> estimate(x),
  "head" -> timeElapsing(x.head)(n),
  "tail" -> timeElapsing(x.tail)(n),
  "apply" -> timeElapsing(x.apply(m))(n),
  "update" -> timeElapsing(x.updated(m, 0))(n),
  "prepend" -> timeElapsing(obj +: x)(n),
  "append" -> timeElapsing(x :+ obj)(n),
  "insert" -> timeElapsing(x.updated(m, 0))(n)
)

We benchmark along two dimensions: object size and method performance. Objects with lengths of \(16^0, 16^1, \ldots, 16^5\) are generated to measure their sizes. To check the performance of the methods, objects with sizes of \(100000, 200000, \ldots, 1000000\) are generated.

val sizes = (0 to 5).map(x => math.pow(16, x).toInt) ++ (1 to 10).map(_ * 100000)
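For reference, the concrete values this expression produces can be checked directly (note that \(16^5 = 1048576\)):

```scala
val sizes = (0 to 5).map(x => math.pow(16, x).toInt) ++ (1 to 10).map(_ * 100000)
// Powers of 16 for the memory measurements ...
assert(sizes.take(6) == Seq(1, 16, 256, 4096, 65536, 1048576))
// ... and multiples of 100,000 for the method timings.
assert(sizes.drop(6) == (1 to 10).map(_ * 100000))
```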

As you can see below, each method is run 100 times on each object, and the results are collected.

val stats = for(s <- sizes) yield {
  val integers = 0 until s
  List(
    ("Immutable_List", integers.toList),
    ("Immutable_Stream", integers.toStream),
    ("Immutable_Vector", integers.toVector),
    ("Immutable_Queue", scala.collection.immutable.Queue(integers: _*)),
    ("Immutable_Range", integers),
    ("Immutable_String", "1" * s),
    ("Mutable_ArrayBuffer", scala.collection.mutable.ArrayBuffer(integers: _*)),
    ("Mutable_ListBuffer", scala.collection.mutable.ListBuffer(integers: _*)),
    ("Mutable_StringBuilder", new scala.collection.mutable.StringBuilder("1" * s)),
    ("Mutable_MutableList", scala.collection.mutable.MutableList(integers: _*)),
    ("Mutable_Queue", scala.collection.mutable.Queue(integers: _*)),
    ("Mutable_ArraySeq", scala.collection.mutable.ArraySeq(integers: _*)),
    ("Mutable_ArrayStack", scala.collection.mutable.ArrayStack(integers: _*)),
    ("Mutable_Array", integers.toArray),
    ("Mutable_Boxed_Array", {
      val boxedArray = new Array[Object](s)
      var i = 0
      while (i < s) {
        boxedArray(i) = obj; i += 1
      }
      boxedArray
    })
  ).map {
    case (c, cl: AbstractSeq[Int]) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkSeq(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: Array[Object]) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkArrayBoxed(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: Array[Int]) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkArray(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: String) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkString(cl, 100, s - 1).map(x => (x._1, x._2.toString))
    case (c, cl: StringBuilder) => Map("size" -> s.toString, "collection" -> c) ++ 
    benchmarkStringBuilder(cl, 100, s - 1).map(x => (x._1, x._2.toString))
  }
}

The last step is writing the results to a CSV file with spark.write.

val colNames = stats(0).head.toList.sortBy(_._1).map(_._1)
  .zipWithIndex.map(x => col("value")(x._2).as(x._1))

stats.flatten.map(x => x.toList.sortBy(_._1).map(_._2))
  .toDF.select(colNames:_*)
  .coalesce(1).write.option("header","true").mode("overwrite")
  .csv("./collection_seq_size_benchmark.csv")

Object Size in Memory

The benchmark data is now available! The table below displays the estimated size, in bytes, of various collections at different lengths.

The average memory size of each object is calculated and shown below.

Methods Performance

Before looking at the benchmark results, it is useful to have an overview of the methods applied to the objects. The table below gives more details.

We can find the performance characteristics of Scala collections in the Scala documentation. The Scala performance table is reproduced below for comparison with the empirical results.

The performance results for each method and collection are plotted as scatter plots. We add a regression line to each plot to show the growth rate. The plot below shows the performance of the immutable collections.

Immutable Collection Methods Performance

Figure 5: Immutable Collection Methods Performance

A similar plot is drawn for the mutable collections.

Mutable Collection Methods Performance

Figure 6: Mutable Collection Methods Performance
