Thursday, February 6, 2014

Degrees of Lazy Evaluation

The topic of lazy evaluation often comes up when thinking about transforming collections of data. Lazy evaluation is particularly useful for values that are expensive to compute, but it is also useful as a way of chaining functions so that they are applied in a single pass through the underlying data. Laziness can be coded manually, or baked into an API such as a collection or transformation. Implementing some kinds of laziness is more expensive than others. Sometimes laziness is inappropriate, such as when a database connection is held open, waiting to be read from. A quick web search didn't turn up any existing definitions of degrees of laziness, so I'm defining some here.
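
Here's a rough Java 8 illustration of chaining functions in a single pass (the class and variable names are just for the example):

import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class SinglePassDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("alpha", "beta", "gamma", "delta");

        // filter() and map() are chained lazily; the list is traversed once,
        // and only as far as needed to satisfy the short-circuiting findFirst().
        Optional<Integer> firstLongLength = words.stream()
                .filter(w -> w.length() > 4)
                .map(String::length)
                .findFirst();

        System.out.println(firstLongLength); // Optional[5]
    }
}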

Levels of Evaluation Laziness

0: Eager
The entire data source is evaluated immediately. Most statements in most languages work this way most of the time, because that's the way the underlying bytecodes and hardware are designed to work. Eager evaluation of a collection can be done concurrently from inside the collection. The reduce function is implemented this way in Clojure.
1: Delayed
The entire data source is evaluated on the first usage of the result. Examples: builder patterns, isTrueForAll(), forEach(), Java 8 collection transformations, and default Hibernate joined-table loading. Delayed evaluation of a collection can be done concurrently from inside the collection.
2: Chunked
The source data is evaluated in the minimum-size chunks needed to satisfy each call: filter(), contains(). This level is bounded by levels 1 and 3: at worst it degenerates into level 1, and at best it approaches level 3.
3: Incremental
A maximum of one item is evaluated per call: map(). Incremental evaluation cannot be done concurrently from inside the collection because that would require evaluating more than one item at a time.
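
Here's a minimal Java sketch of level 3, using a hypothetical lazyMap helper (not from any library) that runs the mapping function on at most one element per call to next():

import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

public class IncrementalMapSketch {
    // Hypothetical helper: wraps an iterator so the mapping function is
    // applied to at most one element per call to next().
    static <A, B> Iterator<B> lazyMap(Iterator<A> source, Function<A, B> f) {
        return new Iterator<B>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public B next() { return f.apply(source.next()); }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> lengths =
                lazyMap(Arrays.asList("a", "bb", "ccc").iterator(), String::length);
        System.out.println(lengths.next()); // prints 1; only "a" has been evaluated so far
    }
}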

Level 2 is weird. A filter that matches nothing, or matches only the last element, is effectively level 1. A filter that matches every element is effectively level 3. I call it level 2 in both cases because of its range. I think it's important to acknowledge level 2 as a distinct level because it's chunked: a balanced hashtable of a certain size and load factor could evaluate one bucket at a time for contains() and be very reliably level 2 without ever being level 1 or level 3. Concurrent processing of level 2 may be possible inside a collection in some circumstances, but its practicality is limited when the chunk sizes are not known in advance.
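
Java 8 streams can make the chunking visible. Here's a sketch using peek() to show which source elements each call pulls; the exact chunk boundaries depend on the data and on how Stream.iterator() is implemented, so treat the printed output as illustrative:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ChunkedFilterDemo {
    public static void main(String[] args) {
        List<Integer> nums = Arrays.asList(1, 2, 3, 4, 5, 6);

        // peek() prints each element as it is pulled from the source list.
        Iterator<Integer> evens = nums.stream()
                .peek(n -> System.out.println("evaluated " + n))
                .filter(n -> n % 2 == 0)
                .iterator();

        evens.next(); // pulls elements until one passes the filter: "evaluated 1", "evaluated 2"
        evens.next(); // pulls the next chunk: "evaluated 3", "evaluated 4"
    }
}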

For a single-element value, Lazy levels 1, 2, and 3 are indistinguishable - it's either Eager or it's Lazy - period. Two elements can be either Lazy 1 or Lazy 3. With 3 or more elements, Lazy 2 becomes a meaningful description of the laziness.

The results of lazy evaluation can be cached, particularly when that evaluation is expensive and the values will be reused. Clojure Sequences are based on a linked-list data structure, which lends itself to incremental (Lazy level 3) evaluation. They are also cached/memoized by default, making them very safe, but somewhat resource-intensive.

Using Clojure Sequences requires care with some data sources, either to process only a finite subset of the data that fits in memory, or to iterate without holding onto the head of the sequence (thus making the head eligible for garbage collection before the end of the sequence is processed)1. Clojure evaluates (some?) sequences in chunks for efficiency, but incremental evaluation can be forced with a little cleverness.1

Implementation Concerns

Lazy levels 0-3 can be represented as a lightweight, one-pass, disposable view (see "What's a View?" below), or in a heavier, memoized, persistent data structure like a linked list. Since views and linked lists traverse the underlying data incrementally, neither is suitable for internal concurrency. Either of these can be processed concurrently from the outside if necessary, but doing so seems like a design failure at some level - why eagerly process something you went to the trouble of making lazy?

What's a View?

The Iterator interface in Java and some other languages is not suitable for concurrent use. It has two methods, hasNext() and next(), and lets you enumerate an underlying data source once. Some data sources are ordered and some are not - the iterator respects that.

The problem is that even if you synchronize both of these methods, two threads can both call hasNext() when there is only one element left. Also, a broken client can call next() at any time without calling hasNext() first. In either case, a thread that calls hasNext() and gets true can still call next() and get a NoSuchElementException. Yuck!
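
Here's a rough sketch of that check-then-act gap. It won't fail reliably (the unlucky interleaving only happens sometimes), so it's an illustration rather than a test:

import java.util.Collections;
import java.util.Iterator;

public class IteratorRaceSketch {
    public static void main(String[] args) throws InterruptedException {
        // One element, shared by two threads. Even if hasNext() and next() were
        // each synchronized, the check and the fetch are still two separate calls.
        Iterator<String> shared = Collections.singletonList("only").iterator();

        Runnable taker = () -> {
            if (shared.hasNext()) {                // both threads can see true here...
                System.out.println(shared.next()); // ...but only one next() can succeed; the
            }                                      // loser gets a NoSuchElementException
        };

        Thread a = new Thread(taker);
        Thread b = new Thread(taker);
        a.start(); b.start();
        a.join();  b.join();
    }
}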

A better interface is what I'm calling a View (possibly because I misunderstood Paul Phillips using that word). It can be lazily evaluated, or it can iterate over a previously realized collection, but it has a single method, next(), which returns either the next element (whether or not it's null) or a sentinel value: USED_UP. Because the exhaustion check and the fetch happen in a single call, this is 100% thread safe. An arbitrary number of threads can safely and quickly call:

T item = next();
while (item != USED_UP) {
    // do something with item
    item = next();
}
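
The loop above leaves the interface itself implied, so here is one minimal sketch of what a View could look like. The of() adapter and forEach() helper are hypothetical names, and the unchecked cast for USED_UP is just one common way to handle a generic sentinel, not necessarily how I'd actually ship it:

import java.util.Iterator;
import java.util.function.Consumer;

public interface View<T> {
    // Sentinel object returned once the underlying data is used up.
    Object USED_UP = new Object();

    // The single method: returns the next element (which may legitimately be null)
    // or USED_UP when there is nothing left.
    T next();

    // One possible adapter from an Iterator. Making next() synchronized means the
    // exhaustion check and the fetch happen as a single atomic step.
    static <T> View<T> of(Iterator<T> iter) {
        return new View<T>() {
            @SuppressWarnings("unchecked")
            @Override public synchronized T next() {
                return iter.hasNext() ? iter.next() : (T) USED_UP;
            }
        };
    }

    // The loop from above as a generic helper. The sentinel trick works cleanly
    // inside generic code like this; a caller that assigned the result to a
    // concrete type could hit a ClassCastException when USED_UP comes back.
    static <T> void forEach(View<T> view, Consumer<T> action) {
        T item = view.next();
        while (item != USED_UP) {
            action.accept(item);
            item = view.next();
        }
    }
}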

Footnotes

1 Many thanks to my friend Jeff Dik for pointing this out and correcting my earlier comments.
