NESCALA 2013

Premature Optimization

Erik Osheim (@d6)

YMMV

This talk is informed by my experiences benchmarking and optimizing Scala while working on:

Performance tuning is fiddly. It's hard to speak in generalities. I'm still learning.

What are we talking about?

This talk mainly covers low-level optimization.

The plan is to increase run-time performance by producing better bytecode for the JVM.

(Specifically, better bytecode for HotSpot.)

How will we do this?

By focusing on how Scala code is written, and via profiling and benchmarking.

I like using Caliper and JProfiler. Other tools work too.
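For example, a minimal Caliper benchmark (a sketch against the 0.5-era API, where Caliper invokes each timeXxx(reps) method with a rep count it chooses):

import com.google.caliper.SimpleBenchmark

class SquareSumBenchmark extends SimpleBenchmark {
  val ns: Array[Int] = (1 to 1000).toArray

  // return the result so HotSpot can't dead-code-eliminate the loop
  def timeSquareSum(reps: Int): Long = {
    var total = 0L
    var i = 0
    while (i < reps) {
      var j = 0
      var sum = 0L
      while (j < ns.length) { sum += ns(j).toLong * ns(j); j += 1 }
      total += sum
      i += 1
    }
    total
  }
}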

How else could you do this?

Of course, you can always do other things:

These are all good strategies. But they are not the focus of this talk.

Why is it so important to worry about HotSpot?

HotSpot does a ton of useful stuff! Check out HotSpot Internals.

What can foil HotSpot?

How important is HotSpot? VERY IMPORTANT

(Proper Inlining can give a 10x speed up by itself.)

How do common Scala patterns interact with HotSpot?

The Good

Scala encourages lots of small methods, which inline well.

The JVM deals very well with composition.

Assumptions of immutability mean fewer defensive checks.

How do common Scala patterns interact with HotSpot?

The Bad

The biggest problem is garbage.

Scala is a litter-bug, and idiomatic Scala produces tons of garbage.

(But since Scala is so expressive, you don't see it unless you know where to look.)

How do common Scala patterns interact with HotSpot?

The Ugly

Scala produces lots of classfiles (potentially squeezing permgen/code cache).

Due to being more functional/expressive, the "inlining problem" hurts us more.

Why is Scala a litter-bug?

Boxing.

How many java.lang.Integer instances do you see?

val ns = Vector(1, 2, 3, 4, 5)
val squares = ns.map(n => n * n)
val sum = squares.foldLeft(0)(_ + _)
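
For contrast, here is the same computation with no Integer boxes at all (a sketch using a raw array and a while loop, anticipating the techniques later in this talk):

val ns = Array(1, 2, 3, 4, 5)
var sum = 0
var i = 0
while (i < ns.length) {
  val n = ns(i)
  sum += n * n  // primitive Int math: no java.lang.Integer allocated
  i += 1
}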
    

Why is Scala a litter-bug?

Collections API.

How many :: instances do you see?

def everySecondElem(ns: List[Int]) = ns.
  zipWithIndex.
  filter(_._2 % 2 == 1).
  map(_._1)
    

(Also, how many iterations do you see? How many boxes?)
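A single-pass rewrite avoids the intermediate lists and tuples entirely (a sketch; everySecondElem2 is a hypothetical name, and the result List still boxes its Ints):

def everySecondElem2(ns: List[Int]): List[Int] = {
  val buf = List.newBuilder[Int]
  var i = 0
  var rest = ns
  while (rest.nonEmpty) {
    if (i % 2 == 1) buf += rest.head  // keep elements at odd indices
    i += 1
    rest = rest.tail
  }
  buf.result()
}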

Why is Scala a litter-bug?

Short-lived immutable objects.

How many Tuple2 instances do you see?

import scala.math.{min, max}

def minAndMax(ns: Array[Int]): (Int, Int) = {
  if (ns.isEmpty) sys.error("oh no!")

  ns.foldLeft((ns(0), ns(0))) {
    case ((amin, amax), a) =>
      (min(amin, a), max(amax, a))
  }
}
    

Similar things happen with Option.
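For comparison, a tuple-free rewrite (a sketch of where Case Study #1 below is headed):

def minAndMax2(ns: Array[Int]): (Int, Int) = {
  if (ns.isEmpty) sys.error("oh no!")
  var amin = ns(0)
  var amax = ns(0)
  var i = 1
  while (i < ns.length) {
    val a = ns(i)
    if (a < amin) amin = a
    if (a > amax) amax = a
    i += 1
  }
  (amin, amax)  // the only Tuple2 allocated
}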

Why is Scala a litter-bug?

By-name parameters and Functions.

def f(x: Int, y: => Int): Int =
  if (x == 0) x else x * y       // y arrives as a Function0 thunk

def g(ns: Array[Int]): Array[Int] =
  ns.map(n => f(n, 9))           // each call to f allocates a thunk for 9

def h(as: Array[Int], bs: Array[Int]) =
  as.map(a => bs.map(_ + a))     // _ + a captures a: one new closure per element of as
    

(I'm looking at scalaz.Semigroup[S]#append.)

Why is Scala a litter-bug?

Implicit Enrichment.

implicit class addTwice(n: Double) {
  def twice = n * 2
}

def foo(ns: Array[Double]) =
  ns.map(_.twice)

def bar(ns: Array[Double]) =
  ns.map(n => if (n.isNaN) 0.0 else n)
    

(Look out for conversions to RichInt, RichDouble, etc.)
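As of Scala 2.10, making the enrichment a value class can usually avoid the wrapper allocation (a sketch; note that value classes still box in some generic contexts):

implicit class addTwice(val n: Double) extends AnyVal {
  def twice: Double = n * 2  // no addTwice instance allocated (usually)
}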

So I have to write Java?

No!

Then what?

First, let's note that the JVM actually does fine with Scala most of the time.

Modern garbage collectors are optimized to deal with short-lived objects.

Doing a relatively inefficient thing a few times (usually) doesn't matter.

Then what?

Problems occur in situations where:

  1. We have O(n) (or worse) allocations.
  2. n gets really big.

So, we want the ratio of actual work to allocations to be high.

When is this?

Often developers will have a good intuition about which code is in a "critical path".

(But sometimes we make mistakes.)

That's why we have profilers (JProfiler, YourKit, etc.).

What about Knuth?

Donald Knuth cautions us about premature optimization:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

What about Knuth?

However, Knuth also says:

In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering.

What about Knuth?

Most of the techniques here provide (at least) a 2x speed up.

But are they "easily obtained"?

Erik's "premature optimization" tips

Why do you hate functional programming?

I don't hate functional programming.

In fact, I think functional programming informs this style.

Why do you hate functional programming?

This style is useful precisely when you've been doing functional programming and find certain hot spots that need to be faster.

Case Study #1: foldLeft over Array[Int]

Let's reconsider some earlier example code:

val a = as(0)
as.foldLeft((a, a)) {
  (tpl, a) =>
    (min(tpl._1, a), max(tpl._2, a))
}
    

How can we do better?

Case Study #1: foldLeft over Array[Int]

Let's reconsider some earlier example code:

  import scala.{specialized => spec}

  def folder2[@spec(Int) A, @spec(Int) B, @spec(Int) C]
    (arr: Array[A])
    (b: B, c: C)
    (f1: (B, A) => B)
    (f2: (C, A) => C):
      (B, C) = { ... }
    

Case Study #1: foldLeft over Array[Int]

Let's reconsider some earlier example code:

  def folder2[...](...): (B, C) = {
    var state1 = b
    var state2 = c
    var i = 0
    val len = arr.length
    while (i < len) {
      val a = arr(i)
      state1 = f1(state1, a)
      state2 = f2(state2, a)
      i += 1
    }
    (state1, state2)
  }
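
A sketch of how folder2 might be called for the earlier min/max problem (minAndMaxFolder is a hypothetical name):

import scala.math.{min, max}

// min and max in one pass; with @spec, no boxing and only one final tuple
def minAndMaxFolder(ns: Array[Int]): (Int, Int) =
  folder2(ns)(ns(0), ns(0))(min)(max)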
    

Case Study #1: foldLeft over Array[Int]

That looks moderately horrible. I hope it's worth it!

Case Study #1: foldLeft over Array[Int]

It is 100-330x faster.

Case Study #1: foldLeft over Array[Int]

For n=256, 100x means 25ns versus 2.4us.

For n=131072, 330x means 4.7us versus 1.5ms!!

@specialized?

Allows you to write generic code while (usually) avoiding boxing.

Specialization is an incredibly subtle, difficult, and necessary tool.

If I were a character in a Norse saga, specialization would be my tragic flaw
(the thing that shows up to kill me at Ragnarok).

@specialized?

Code you might write:

import scala.{specialized => spec}
import spire.algebra.Ring  // assuming Spire's Ring, per Ring.IntIsRing below

def mySum[@spec A: Ring](ns: List[A], total: A): A =
  ns match {
    case n :: ns => mySum(ns, Ring[A].plus(total, n))
    case Nil     => total
  }

mySum[Int](ns, 0)
    

@specialized?

After the specialization phase (give-or-take):

def mySum$mIc$sp(ns: List[Int], total:Int)
 (implicit ev:Ring$mcI$sp): Int =
  ns match {
   case n :: ns =>
     mySum(
       ns,
       new RingOps$mcI$sp(total)(ev).plus$mcI$sp(n)
     )
   case Nil => total
  }

mySum$mIc$sp(ns, 0)(Ring.IntIsRing)
    

@specialized: Too Many Dragons

There are a lot of gotchas:

@specialized: Too Many Dragons

The best way to use specialized is with very simple classes/traits.

Basis only uses specialized traits and does "manual specialization" with codegen. This is not a bad idea.

The code will often end up being ugly. This is OK.

Case Study #2: building List[(Int, Int, Int)]

Sometimes List is really what you want (i.e. a LIFO).

How well can we do in these cases?

Case Study #2: building List[(Int, Int, Int)]

Given as, bs, and cs that are Array[Int]:

var i = 0
var d = List.empty[(Int, Int, Int)]
val n = as.length
while (i < n) {
  d = (as(i), bs(i), cs(i)) :: d
  i += 1
}
    

Case Study #2: building List[(Int, Int, Int)]

Let's not use Tuple3 (less boxing):

case class III(a: Int, b: Int, c: Int)

var i = 0
var d = List.empty[III]
val n = as.length
while (i < n) {
  d = III(as(i), bs(i), cs(i)) :: d
  i += 1
}
    

Case Study #2: building List[(Int, Int, Int)]

Can we get away with even fewer allocations? Sure!

sealed trait IIIList {
  def a: Int
  def b: Int
  def c: Int
  def tail: IIIList
}

case object IIINil extends IIIList {
  def a = sys.error("die")
  def b = sys.error("die")
  def c = sys.error("die")
  def tail = sys.error("die")
}

case class IIICons(a: Int, b: Int, c: Int, tail: IIIList) extends IIIList
    

Case Study #2: building List[(Int, Int, Int)]

Using the IIIList types still looks pretty similar to our List example.

var i = 0
var d: IIIList = IIINil
val n = as.length
while (i < n) {
  d = IIICons(as(i), bs(i), cs(i), d)
  i += 1
}
    

Case Study #2: building List[(Int, Int, Int)]

Can we do better still? Maybe (it depends).

sealed trait I6List {
  def a1: Int
  def a2: Int
  def b1: Int
  def b2: Int
  def c1: Int
  def c2: Int
  def tail: I6List
  def prepend(a: Int, b: Int, c: Int): I6List = I6Cons(a, b, c, this)
}

case object I6Nil extends I6List {
  def a1: Int = sys.error("argh")
  def a2: Int = sys.error("argh")
  def b1: Int = sys.error("argh")
  def b2: Int = sys.error("argh")
  def c1: Int = sys.error("argh")
  def c2: Int = sys.error("argh")
  def tail: I6List = sys.error("argh")
}

case class I6Cons(a1: Int, b1: Int, c1: Int, tail: I6List) extends I6List {
  def a2: Int = sys.error("argh")
  def b2: Int = sys.error("argh")
  def c2: Int = sys.error("argh")
  // pack the new entry together with this one: one cell holds two entries
  override def prepend(a: Int, b: Int, c: Int): I6List = I6ConsCons(a, b, c, a1, b1, c1, tail)
}

case class I6ConsCons(a1: Int, b1: Int, c1: Int, a2: Int, b2: Int, c2: Int, tail: I6List) extends I6List
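
Building the packed list looks the same as before, except we go through prepend, so pairs of entries share a single I6ConsCons cell (roughly halving the allocations):

var i = 0
var d: I6List = I6Nil
val n = as.length
while (i < n) {
  d = d.prepend(as(i), bs(i), cs(i))
  i += 1
}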
    

Case Study #2: building List[(Int, Int, Int)]

Things like I6List or Vector have:

Case Study #2: building List[(Int, Int, Int)]

So... what does all this look like?

Case Study #2: building List[(Int, Int, Int)]

For n=131072, we have:

Of course, using an array would be even faster if we don't need a cons-like structure.

Cheating is Good

At a low-level, we'd like to be able to cut any corner we can.

Classical information-hiding (often considered best practice) makes this harder.

Typeclasses are a nice mechanism to allow type-specific cheating.
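For example, a typeclass can hide a primitive-array fast path behind a generic interface (a hedged sketch; Summable is a hypothetical name):

trait Summable[A] {
  def sum(as: Array[A]): A
}

object Summable {
  // the "cheat": a direct while loop over the primitive array
  implicit val intSummable: Summable[Int] = new Summable[Int] {
    def sum(as: Array[Int]): Int = {
      var total = 0
      var i = 0
      while (i < as.length) { total += as(i); i += 1 }
      total
    }
  }
}

def sumAll[A](as: Array[A])(implicit ev: Summable[A]): A = ev.sum(as)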

Case Study #3: processing 2D points

Eliminating boxing and allocations isn't always the best way to improve performance.

Cache behavior is also very important.

Case Study #3: processing 2D points

case class Point(x: Double, y: Double)

// naive
type PointArray = Array[Point]

// avoid allocating Point instances
case class Points(
  xs: Array[Double],
  ys: Array[Double]
)

// interleave X and Y coordinates in one array:
// x at index 2*i, y at index 2*i + 1
case class InterleavedPoints(
  xys: Array[Double]
)
    

Case Study #3: processing 2D points

Let's perform a simple scaling operation:

// implemented for Array[Point]
import scala.math.sqrt

val len = ps.length
val ps2 = new Array[Point](len)
var i = 0
while (i < len) {
  val p = ps(i)
  val x = p.x
  val y = p.y
  val s = sqrt(x * x + y * y)
  ps2(i) = Point(x / s, y / s)
  i += 1
}
    

Case Study #3: processing 2D points

I'm going to skip the other two implementations, since they are similar.

Remember: InterleavedPoints will have a single array that's twice as long.
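For reference, a sketch of the interleaved version (xys is the single Array[Double] from InterleavedPoints):

import scala.math.sqrt

val len = xys.length
val xys2 = new Array[Double](len)
var i = 0
while (i < len) {
  val x = xys(i)      // x at 2*i
  val y = xys(i + 1)  // y at 2*i + 1
  val s = sqrt(x * x + y * y)
  xys2(i) = x / s
  xys2(i + 1) = y / s
  i += 2
}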

Case Study #3: processing 2D points

Results:

InterleavedPoints is 3-6x faster than Points.

At n=1M the Array[Point] is 14x slower than Points.

Case Study #3: processing 2D points

Lesson: good cache usage can be more significant than minimizing allocations.

The JVM is surprisingly good at dealing with 131,072 Point allocations.

Less so at dealing with 1,048,576 allocations.

Aren't arrays bad?

Arrays are mutable. It's a fact.

I like to consider arrays mutable only during a "construction phase". Since cloning arrays is actually quite fast, don't be afraid of making copies when it's reasonable.

(If you find yourself manually appending to an array with copying, stop!)

If you're hardline, consider a wrapper that only exposes Array#apply.
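A minimal sketch of such a wrapper (ConstArray is a hypothetical name):

// exposes only reads; keep the only mutable reference during construction
final class ConstArray[@specialized(Int, Double) A](as: Array[A]) {
  def apply(i: Int): A = as(i)
  def length: Int = as.length
}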

Aren't arrays bad?

For a good example of how to handle arrays well, look at the GeoTrellis project.

Arrays are assembled mutably, but considered immutable once they are shared/global.

Lots of high-level laziness, parallelism, etc with a very fast core.

(I helped build it while at Azavea. Hi guys!)

What about Vector?

For the kinds of problems being discussed, I don't think Vector is that great.

Works great for shared state, mid-to-high level code. Less useful at the low-level.

"Effectively constant" time access is still much slower than raw array access.

Also boxing.

What about other Scala collections?

There are very few built-in Scala types that are specialized. When I checked they were:

It's safe to assume that any other generic types/methods you use will box.

Even though we have value classes, lots of implicit adaptors still allocate.

What about other Scala collections?

If you have a relatively precise idea of what you want, you can probably build data structures that are faster than the built-in ones.

Debox is my own exercise in this direction.

Aren't you advocating NIH?

NIH ("Not Invented Here") is worth avoiding, but sometimes (re-)inventing things is how you make them faster.

You (or your team) are in the best position to evaluate the trade-offs.

I tend to consider a 2x speed up "worth it" for the critical path.

Understanding the Ethos

You can find all the benchmarking code on GitHub. The examples use Caliper.

When doing this kind of experimenting and testing it's very important to:

Closing Remarks

See Also

There is a huge wealth of JVM optimization lore out there. Here are some useful links:

The End