Scala has different data collections and using the proper objects is important in optimizing the big data pipelines.
This post tries to study the characteristics of the Scala collection, such as:
Memory Usage,
Operations Time.
from the point of view of practical benchmarks.
We will try to study the performance of data collection in two posts; part one related to Sequence
collection, the second contains Maps
and Sets
.
The diagram below demonstrates all of the collections in the package scala.collection
. These are all high-level abstract classes or traits with both mutable and immutable implementations.
Sequence Type
The Seq
trait represents sequences. A sequence is a type of iterable with a length and elements with fixed index positions beginning at 0.
Seq
Collection divided to type of immutable and mutable.
The following figure shows all Seq
collections in package scala.collection.immutable
.
And the following figure shows Seq
collections in package scala.collection.mutable
.
Before seeing the collection benchmark tables, it is useful to review the collection definition and its properties.
Immutability | Collection | description |
Immutable | List | A List is a collection that contains immutable data. The Scala List class holds a sequenced, linear list of items. |
Stream | The Stream is a lazy list where elements are evaluated only when they are needed. Streams have the same performance characteristics as lists. | |
Vector | Vectors in Scala are immutable data structures providing random access for elements and is similar to the list. But, the list has incompetence of random access of elements. | |
Queue | A Queue is a first-in, first-out (FIFO) data structure. Scala offers both an immutable queue and a mutable queue. A mutable queue can be updated or extended in place. It means one can change, add, or remove elements of a queue as a side effect. Queue is implemented as a pair of lists. One is used to insert the elements and the second to contain deleted elements. Elements are added to the first list and removed from the second list. The two most basic operations of Queue are Enqueue and Dequeue. | |
Stack | A Stack is a data structure that follows the last-in, first-out(LIFO) principle. We can add or remove element only from one end called top. Scala has both mutable and immutable versions of a stack. | |
Range | The Range can be defined as an organized series of uniformly separated Integers. It is helpful in supplying more strength with fewer methods, so operations performed here are very quick. | |
String | A string is a sequence of characters. In Scala, objects of String are immutable which means they are constant and cannot be changed once created. | |
Mutable | ArrayBuffer | To create a mutable, indexed sequence whose size can change, the ArrayBuffer class is used. Internally, an ArrayBuffer is an Array of elements, as well as the store’s current size of the array. When an element is added to an ArrayBuffer, its size is checked. If the underlying array isn’t full, then the element is directly added to the array. If the underlying array is full, then a larger array is constructed and all the elements are copied to the new array. The key is that the new array is constructed larger than what is required for the current addition. |
ListBuffer | The ListBuffer object is convenient when we want to build a list from front to back. It supports efficient prepend and append operations. The time taken to convert the ListBuffer into a List is constant. | |
StringBuilder | A String object is immutable. When you need to perform repeated modifications to a string, we need a StringBuilder class. A StringBuilder is utilized to append input data to the internal buffer. Numerous operations like appending data, inserting data, and removing data are supported in StringBuilder. | |
MutableList | A MutableList consists of a single linked list together with a pointer that refers to the terminal empty node of that list. This makes list append a constant time operation because it avoids having to traverse the list in search for its terminal node. | |
ArraySeq | Array sequences are mutable sequences of a fixed size that store their elements internally in an Array[Object]. You would typically use an ArraySeq if you want an array for its performance characteristics, but you also want to create generic instances of the sequence where you do not know the type of the elements and you do not have a ClassTag to provide them at run-time. | |
ArrayStack | An ArrayStack is a MutableStack that contains a FastList of data. ArrayStack iterates from top to bottom (LIFO order). The backing data structure grows and shrinks by 50% at a time, and size is constant. ArrayStack does not extend Vector, as does the Java Stack, which was one of the reasons for creating this data structure. | |
Array | Array is a special mutable kind of collection in Scala. it is a fixed size data structure that stores elements of the same data type. |
Benchmark Codes
We created a Scala project with a sbt for assessment data collection.
// build.sbt
scalaVersion := "2.12.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
enablePlugins(PackPlugin)
To calculate the size of an object, I find org.apache.spark.util.SizeEstimator.estimate
function is useful.
This function estimates the sizes of Java objects (number of bytes of memory they occupy).
import org.apache.spark.sql._
import org.apache.spark.sql.functions.col
import org.apache.spark.util.SizeEstimator.estimate
import scala.collection.AbstractSeq
import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
To create a result dataframe and write the result, we use Spark (it is not necessary).
val spark = SparkSession
.builder()
.appName("Collection_Benchmark")
.master("local[2]")
.getOrCreate()
import spark.implicits._
We need a time-elapsing function to calculate run time, so use ’System.nanoTime` to measure time in nano resolution.
def timeElapsing(benchmarkFunction: => Unit, message:Boolean = false)(times:Int = 1): Double = {
if(message) println("Benchmark: IS Starting ...")
val startTime = System.nanoTime()
for (_ <- 0 until times)
benchmarkFunction
val endTime = System.nanoTime()
val timeElapsed = (endTime - startTime).toDouble / times.toDouble
if(message) println(s"Operation Took $timeElapsed ms on average")
timeElapsed
}
Among all the data collections, only some of them have an insert
method. We define insertTime
function only for these collections, as you see below.
def insertTime(x:AbstractSeq[Int], n:Int, m:Int):Double = x match {
case x:ArrayBuffer[Int] => timeElapsing(x.updated(m,0))(n)
case x:ListBuffer[Int] => timeElapsing(x.updated(m,0))(n)
case x:mutable.MutableList[Int] => timeElapsing(x.updated(m,0))(n)
case x:mutable.Queue[Int] => timeElapsing(x.updated(m,0))(n)
case x:mutable.ArrayStack[Int] => timeElapsing(x.updated(m,0))(n)
case _ => -1
}
The main parts of the benchmark are benchmark***
functions, which contain the time-elapsed of the main methods.
def benchmarkSeq(x:AbstractSeq[Int], n:Int, m:Int): Map[String, Double] = {
Map(
"volume" -> estimate(x),
"head" -> timeElapsing(x.head)(n),
"tail" -> timeElapsing(x.tail)(n),
"apply" -> timeElapsing(x.apply(m))(n),
"update" -> timeElapsing(x.updated(m,0))(n),
"prepend" -> timeElapsing(0+:x)(n),
"append" -> timeElapsing(x:+0)(n),
"insert" -> insertTime(x, n, m)
)
}
Similar to benchmarkSeq
we define benchmarkString
, benchmarkStringBuilder
and
benchmarkArray
functions.
To calculate correct time elapsing related to Array we define Array[Object]
def obj = new Object()
def benchmarkArrayBoxed(x:Array[Object], n:Int, m:Int): Map[String, Double] = { Map(
"volume" -> estimate(x),
"head" -> timeElapsing(x.head)(n),
"tail" -> timeElapsing(x.tail)(n),
"apply" -> timeElapsing(x.apply(m))(n),
"update" -> timeElapsing(x.updated(m,0))(n),
"prepend" -> timeElapsing(obj+:x)(n),
"append" -> timeElapsing(x:+obj)(n),
"insert" -> timeElapsing(x.updated(m,0))(n))
}
When determining the size of objects, we consider two measurements: size and method. Objects with a length of \(16^0,16^1,..., 16^5\) are generated to find their size. For checking the performance of methods, objects with a size of \(10000, 200000,..., 1000000\) are generated.
val sizes = ( 0 to 5).map(x => math.pow(16,x).toInt) ++ (1 to 10).map(_*100000)
As you can see below, each method is run 100 times on objects, and the results are collected.
val stats = for(s <- sizes) yield {
val integers = 0 until s
List(
("Immutable_List", integers.toList),
("Immutable_Stream", integers.toStream),
("Immutable_Vector", integers.toVector),
("Immutable_Queue", scala.collection.immutable.Queue(integers: _*)),
("Immutable_Range", integers),
("Immutable_String", "1" * s),
("Mutable_ArrayBuffer", scala.collection.mutable.ArrayBuffer(integers: _*)),
("Mutable_ListBuffer", scala.collection.mutable.ListBuffer(integers: _*)),
("Mutable_StringBuilder", new scala.collection.mutable.StringBuilder("1" * s)),
("Mutable_MutableList", scala.collection.mutable.MutableList(integers: _*)),
("Mutable_Queue", scala.collection.mutable.Queue(integers: _*)),
("Mutable_ArraySeq", scala.collection.mutable.ArraySeq(integers: _*)),
("Mutable_ArrayStack", scala.collection.mutable.ArrayStack(integers: _*)),
("Mutable_Array", integers.toArray),
("Mutable_Boxed_Array", {
val boxedArray = new Array[Object](s)
var i = 0
while (i < s) {
boxedArray(i) = obj; i += 1
}
boxedArray
})
).map {
case (c, cl: AbstractSeq[Int]) => Map("size" -> s.toString, "collection" -> c) ++
benchmarkSeq(cl, 100, s - 1).map(x => (x._1, x._2.toString))
case (c, cl: Array[Object]) => Map("size" -> s.toString, "collection" -> c) ++
benchmarkArrayBoxed(cl, 100, s - 1).map(x => (x._1, x._2.toString))
case (c, cl: Array[Int]) => Map("size" -> s.toString, "collection" -> c) ++
benchmarkArray(cl, 100, s - 1).map(x => (x._1, x._2.toString))
case (c, cl: String) => Map("size" -> s.toString, "collection" -> c) ++
benchmarkString(cl, 100, s - 1).map(x => (x._1, x._2.toString))
case (c, cl: StringBuilder) => Map("size" -> s.toString, "collection" -> c) ++
benchmarkStringBuilder(cl, 100, s - 1).map(x => (x._1, x._2.toString))
}
}
The last step is writing the results as a csv file with spark.write
.
val colNames = stats(0).head.toList.sortBy(_._1).map(_._1)
.zipWithIndex.map(x => col("value")(x._2).as(x._1))
stats.flatten.map(x => x.toList.sortBy(_._1).map(_._2))
.toDF.select(colNames:_*)
.coalesce(1).write.option("header","true").mode("overwrite")
.csv("./collection_seq_size_benchmark.csv")
Object Size in Memory
The benchmark data is now available! The table below displays the expected size of various collections of different sizes in bytes.
Immutability | Collection | 1 | 16 | 256 | 4,096 | 65,536 | 1,048,576 |
Immutable | List | 56 | 656 | 10,256 | 163,856 | 2,621,456 | 41,943,056 |
Queue | 80 | 680 | 10,280 | 163,880 | 2,621,480 | 41,943,080 | |
Range | 40 | 40 | 40 | 40 | 40 | 40 | |
Stream | 120 | 120 | 120 | 120 | 120 | 120 | |
String | 48 | 72 | 552 | 8,232 | 131,112 | 2,097,192 | |
Vector | 216 | 456 | 5,448 | 84,744 | 1,353,192 | 21,648,072 | |
Mutable | Array | 24 | 80 | 1,040 | 16,400 | 262,160 | 4,194,320 |
Array[Object] | 40 | 336 | 5,136 | 80,400 | 1,310,160 | 20,970,320 | |
ArrayBuffer | 120 | 360 | 5,160 | 80,424 | 1,310,184 | 20,970,344 | |
ArraySeq | 64 | 360 | 5,160 | 80,424 | 1,310,184 | 20,970,344 | |
ArrayStack | 64 | 360 | 5,160 | 80,424 | 1,310,184 | 20,970,344 | |
ListBuffer | 88 | 688 | 10,288 | 163,888 | 2,621,488 | 41,943,088 | |
MutableList | 88 | 688 | 10,288 | 163,888 | 2,621,488 | 41,943,088 | |
Queue | 88 | 688 | 10,288 | 163,888 | 2,621,488 | 41,943,088 | |
StringBuilder | 96 | 120 | 600 | 8,280 | 131,160 | 2,097,240 |
The average memory size of each object is calculated and shown below.
Methods Performance
Before seeing the benchmark result, it is better to have an overview of the methods that are applied to objects. The below table has more details.
operations | description |
head | Selecting the first element of the sequence. |
tail | Producing a new sequence that consists of all elements except the first one. |
apply | Indexing. |
update | Functional update (with updated) for immutable sequences, side-effecting update (with update for mutable sequences). |
prepend | Adding an element to the front of the sequence. For immutable sequences, this produces a new sequence. For mutable sequences it modifies the existing sequence. |
append | Adding an element and the end of the sequence. For immutable sequences, this produces a new sequence. For mutable sequences it modifies the existing sequence. |
insert | Inserting an element at an arbitrary position in the sequence. This is only supported directly for mutable sequences. |
We can find performance characteristics of Scala collections in the Scala documents. The Scala performance table is provided below for comparison with the empirical results.
Immutability | Collection | head | tail | apply | update | prepend | append | insert |
Immutable | List | C | C | L | L | C | L | |
Stream | C | C | L | L | C | L | ||
Vector | eC | eC | eC | eC | eC | eC | ||
Stack | C | C | L | L | C | C | L | |
Queue | aC | aC | L | L | L | C | ||
Range | C | C | C | |||||
String | C | L | C | L | L | L | ||
Mutable | ArrayBuffer | C | L | C | C | L | aC | L |
ListBuffer | C | L | L | L | C | C | L | |
StringBuilder | C | L | C | C | L | aC | L | |
MutableList | C | L | L | L | C | C | L | |
Queue | C | L | L | L | C | C | L | |
ArraySeq | C | L | C | C | - | - | - | |
Stack | C | L | L | L | C | L | L | |
ArrayStack | C | L | C | C | aC | L | L | |
Array | C | L | C | C | - | - | - |
performance | description |
C | The operation takes (fast) constant time. |
eC | The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys. |
aC | The operation takes amortized constant time. Some invocations of the operation might take longer, but if many operations are performed on average only constant time per operation is taken. |
Log | The operation takes time proportional to the logarithm of the collection size. |
L | The operation is linear, that is it takes time proportional to the collection size. |
- | The operation is not supported. |
The performance results of each method and collection are plotted as a scatter plot. We add a regression line to the plot to see the growth rate. The plot below shows the performance of the immutable collection.
A similar plot is plotted for mutable collection.