# Scala LiftOff London 2011 - Day 1

Compared to last year’s LiftOff, which was an un-conference pretty much all the way, this year’s is split into morning, structured sessions, and afternoon open spaces resembling 2010 format. As usual, I tried to jot down the more interesting points from the day. The below is by no means a complete record and definitely suffers from insufficient editing and research (no time for that, sorry!). With this disclaimer out of the way:

## David Pollak’s Keynote

David started off by – actually, having arrived late I’m not exactly sure what he started with, but the key points of his talk can be summarised as:

• “crossing the chasm” into mass popularity takes time; e.g. for NeXT it took years until the ideas promoted by the company found their way to mass market, via Apple

• London, NY, Bay Area – most active for Scala; Scala programmers in these centres tend to hear mostly the opinions of others in a similar position. Voices of people from outside are often ignored by the community

• Scala needs to cater for “daily developer” who uses Eclipse, not Emacs

After the keynote the conference split into three tracks, of which I picked the one with what looked like the most challenging content.

## Miles Sabin: Scala Coroutines and Generators

Miles has a history of presenting interesting research results based on his in-depth understanding of the type system and language features. This time he showed an implementation of two constructs – coroutines and generators – using Scala’s delimited continuations.

### Classification of continuations

The continuations can be classified along two axes:

• full/delimited: full captures the program until the very end, delimited only captures portion
• direct implementation/CPS transform: direct uses stack manipulation (copying, “spaghetti stack”), CPS transformation modifies the functions to accept a continuation parameter

Full continuation support is common (call/cc in Lisp, setjmp/longjmp in C, also Smalltalk, ML); until recently not possible on JVM because we have not had access to the runtime. Continuations are primarily used for defining higher-level control constructs like concurrency constructs and exception handling. They offer a lot more in typed setting, but Miles didn’t go into the details this time round.

### Continuations in Scala

shift/reset – delimited continuation via localised CPS transformation; the transformation allowed it to run efficiently on a standard JVM (no longer that relevant, with JVM coroutines effort, on which later)

• reset { ... } delimits the continuation
• shift { (cont: ...) => ... } captures the continuation up to the enclosing reset

shift must be enclosed within a corresponding reset – this is enforced by the compiler using effect system encoded in types. The entire functionality is provided by a compiler plugin which has to be explicitly enabled: scala -P:continuations:enable.

### Coroutines

Coroutines are an imperative control structure: subroutines with multiple entry and exit points. They interleave computations, can replace threads in some situations, but without parallelism – iterleaving is performed sequentially. They are an useful abstraction for event-drive, asynchronous network applications.

Miles created a library which provides coroutines in Scala. Example:

object TestCoroutines {
import Coroutines._

def main(args: Array[String]) {
coroutine {
while (true) {
prntln("Hi!")
yld
}
}

coroutine {
while (true) {
println("Ho!")
yld
}
}

for (_ <- 1 until 20) Scheduler.yld
}
}


This, as can be expected, will print “Hi!” followed by “Ho!”, both repeated twenty times.

How does it work?

sealed trait Trampoline

def coroutine(body: => Unit @cps[Trampoline]): Unit =
Scheduler.addThunk((Unit) => reset { body ; Done })

def yld: Unit @cps[Trampoline] = ... // invokes shift

// there is also a Scheduler object which invokes thunks


the type of yld guarantees it can only be called inside coroutine – Unit with @cps annotation is not a subtype of Unit without the annotation.

### Generators

Generators encapsulate production of a sequence of values lazily. They are the primary means used by Python for lazy iterators, there is even a framework, Twisted, which based its concurrency model on generators.

Again, Miles wrote a Scala library providing generators:

object TestGenerators {
import Generators._

def main {
val fibs = generator {
def fib(s0: Int, s1: Int): Unit @Gen[Int] = {
yld(s0)
fib(s1, s0 + s1)
}

fib(0, 1)
}

for (i <- fibs take 10) println(i)
}
}


Some interesting points:

• fib(0, 1) looks like a non-terminating call – except for the yld call
• @Gen is a CPS annotation – again, protects us from using yld outside of a generator

The implementation is very similar to coroutines. Generators completely hide the continuations; a library can expose a generator and its user doesn’t need to know anything about the CPS transformation.

### Limitations

Scala delimited continuations have a number of serious limitations:

First of all, CPS effect types are a mixed blessing: everything along the path from shift to reset has to be CPS-aware. We can’t shift from within any Java code, or in general any library which is not CPS-aware. Consider this call stack:

scala: reset
java lib
scala: shift


The shift produces a CPS-annotated result, but it can’t pass through the Java library layer.

Second, related problem is that the effect types don’t play nicely with some language constructs:

  coroutine {
for (_ <- 1 until 10) {
println("Hi!")
yld // <-- causes error
}
}


Scala for is desugared into method calls, but these are not CPS-aware!

Finally, the behaviour of transformed code can be hard to reason about: e.g. our fib looks like we should be able to annotate it with @tailrec, but in fact the compiler will complain, because it doesn’t look to it like it is tail-recursive after the CPS transformation is performed! This is puzzling, given the fact that by definition CPS code is tail-recursive, but shows that even the compiler can get confused by the transformation.

### JVM Coroutines

The limitations of JDK with respect to interesting control flows are addressed by a patch on OpenJDK created by Lukas Stadler as a part of Da Vinci Machine Project. It offers direct support for coroutines and generators through library additions and JIT-level modifications, but with no bytecode changes. This is implemented via stack manipulation – a hybrid of stack copying and spaghetti stack.

The structure of a coroutine is similar to what Miles coded up in his library:

class MyCoroutine extends Coroutine {
// some code
yield()
}


The advantage of JVM solution is that it lets us use coroutines which interact with Java libraries. An example provided by Miles was a SAX parser handler, where a coroutine allowed to deal with the inversion of control imposed by a push parser:

parser.parse(
new File("contentxml"),
DefaultHandler() {
override def startElemt(...) = {
ret(some(qName))
}
}
)


As mentioned before, this can’t be done with Scala continuations because startElement is not CPS-aware.

## Phil Bagwell: Lock-Free Resizable Concurrent Tries

In an enthusiastic, crazily paced talk, Phil, who works with Alex Prokopec and Tiark Rompf at EPFL, has shown us some serious data structure/algorithm/bit twiddling results. His motivation was to create a data structure that is suitable for highly parallel:

• caches
• dictionaries
• sparse arrays
• convergence algorithms

### Hash Tries

The presentation relied heavily on diagrams, so sadly these textual notes do not capture his points effectively, but I’ll try anyway:

We use 4-way branching in each tree node for illustration (in practice it will be 32). The place for a value in the tree is found by looking at the hash code. The first four bits determine the branch we take in the root node, following two specify position on the second level, etc:

000000: [x| | | ]
010000: [ |x| | ]
000100: [p| | | ]; p -> [x| | | ]


This is fast, but we end up with trees that have too many gaps and waste memory. Phil then explained a possible strategy for compression of the tree, where I got lost (did I mention the pace of the talk?). The proposed solution suffered from linear cost for finding the item at the lowest level of the trie. Solution: provide a bitmap index showing which cells are real and then compress all levels:

[48| |57| ]   ->   1010:[48| |57| ]   ->   10:[48|57]


Some bit-twiddling was required to interpret the bitmap.

TODO: look up the details of this structure

In practice, 32 fan out in every node turned out to be a good match for CPU cache line sizes; the trade off for the distance to go between items versus making a copy of a part of the trie gave optimal results at this node size.

### Making it Concurrent

How do we make our trie concurrent? For scalability in the presence of massive concurrency we want to avoid locks. Compare-and-swap instruction is a well-known alternative: CAS(address, expected_value, new_value): bool is na “optimistic lock”, and is available on the JVM and natively in Intel processors. Sample usage:

def counter() =
do {
b = a + 1


This can also be implemented using a spinlock – but then, it’s blocking, which we don’t want.

It turned out that making tries concurrent presented a number of challenges, a solution of each leading to the next problem.

### Inserting into a trie concurrently

We can allocate the new/updated node, then do a CAS on the cell in the parent node where the reference to our updated node needs to be inserted. This can immediately lead to a data race, because the node we are CAS-ing on might have been modified already, hence removed from the trie. According to Phil, after three weeks of pondering on this issue, and some beers the night before, he came up with a solution: introduce intermeidate I-nodes between the nodes containing data and perform CAS on them.

It turned out that the speed was not impaired; as a metter of fact, good properties of the new structure wrt. caching outweighted the overhead of additional node hops.

### Deleting from a trie

With that out of the way, the problem now is that deletion can’t compress levels when only a single item is left in a leaf; if it did that, we could lose an insertion to that leaf which happened concurrently. The solution involved disallowing insertion during compression. How can this be achieved without a lock? It turns out that if a special cell (“T-node”) is attached to the leaf to be compressed, and the threads are instructed to do their best to get rid of that node, it results in the expected behaviour; many threads might be compressing the same node in parallel and race for whose result will make it to the trie, but in the end the results of their work are the same, so it doesn’t matter which one wins. While it was not immediately obvious that this algorithm was in fact lock-free, Alex Prokopec proved it.

In the end, ensuring lock freedom took most of the time spent working on this data structure.

### Scalability

Phil presented a number of plots comparing CHT with alterantive hash map implementations. While it appeared to scale very nicely on 8-way system, it really shone on a 64-core systems. In particular, it had a huge advantage over alternatives for insertions on 4x 8-core i7, this was due to peculiarities of this particular architecture.

### Snapshots

Additional work allowed the tries to be snapshot, in constant time:

val sn = cht.snap


This allows the program to carry on modifying cht while sn continues to represent the state of the trie when the snapshot was taken. Its uses involve massively parallel computations, e.g. fluid flow analysers, monte carlo simulators and resource optimisers.

### Fast Hash Functions

The second part of Phil’s talk circled around “the best function he has ever written”, which turned out to be a hashing function.

With parallel collections it is sometimes needed to re-assemble a hash map from the results of parallelised computations. While doing so, it has been measured that the existing Java hash function was one of the major performance blockers. Phil tried to devise a hashing function that would scale well while having all the pre-requisite properties spelled out by an authority in the field, Bob Jenkins:

• 1 bit change causes half bits to change (on average)
• no funnels
• Poisson distribution

also useful:

• Chi²
• be reversible

He started off by multiplying the input value by a magic number: 101101 * 010011, but soon noticed a problem: the bits on the right influenced the left, but not the other way round. If the right-hand side has little entropy, clusters of hash codes are formed. A solution was to reverse the result and multiply again:

val magic = 0x9e3775cd

def bhash(i: Int) = reverseBytes(i * magic) * magic

// slow version; Integer.reverseBytes does this in one x86 instruction!
def reverseBytes = {
val i = (x << 16) + (x >>> 16)
((i & K8) << 8) + ((i >>> 8) & K8)
}


How do we choose magic number? It:

• should have about the same number of 1s and 0s
• should have max run of 3 of same
• must be odd (otherwise is not reversible, loses the last bit)

Performance results and statistical properties are encouraging:

AlgorithmTimeStatistical properties
Murmur hash (baseline)16 nspoor
Bhash3216 nsgood
Bhash6414 nsgood

The hypothesis on why 64 runs faster is that it is a result of the layout of registers in x86 processor.

### Kojo

With a couple of minutes left, Phil did a quick demo of Kojo – a learning environment, which turned out to be a very effective presentation tool for Phil’s job as the marketing manager for TypeSafe. The environment indeed looks like an excellent learning aid.

## Park Bench/Fish Bowl

After the lunch break there was a “park bench” session, where delegates took turns to sit at the centre and discuss Scala-related topics. The beer, handed out before the session, was meant to encourage more vigorous discussion, but on me it only had the effect of inducing drowsiness, which meant I have not remembered much of what has been discussed.