Scala Stream Hygiene III: Scalaz EphemeralStream Fills Quite A Canyon

The source code for this post is available on GitHub.

The following is the third part in a four-part series. Part I listed the coding rules that help you avoid memory leaks when using the standard Scala Stream class. Part II demonstrated that an optimizing JIT compiler and precise GC render most of those rules superfluous, and argued that circumventing Stream memoization in production code is dangerous. This part will discuss a readily available third-party alternative that enables representing potentially infinite data structures without the risk of leaking memory:

scalaz.EphemeralStream

Scalaz is an open-source Scala library that implements type classes and pure functional data structures. In particular, it provides a class (a trait, to be more precise) EphemeralStream, described as follows:

Like scala.collection.immutable.Stream, but doesn’t save computed values. As such, it can be used to represent similar things, but without the space leak problem frequently encountered using that type.

First of all, don’t let the scarcity of the method list in EphemeralStream documentation mislead you: an EphemeralStream can be implicitly converted to an Iterable, so most methods of the latter are in fact available (but not all of them, more on that below).

The "does not save computed values" part is somewhat misleading too. An EphemeralStream cell actually caches both the value and the next cell reference (once they get computed) using Java weak references.

Objects that only have weak references to them get garbage collected on first try. Therefore you can safely store an EphemeralStream in a val and pattern match on it as you please:

def tailAvg(xs: EphemeralStream[Int]): Option[Int] = {
  xs match {
    case y ##:: ys => Some(ys.sum / ys.length)
    case _ => None
  }
}

Unfortunately, EphemeralStream suffers from numerous issues hindering its practical use:

It implements memoization using Java WeakReferences wrapped in Scala Options wrapped in closures, which all add memory overheads, so the GC gets invoked more frequently compared to the technique described in Part I. For instance, an EphemeralStream[Int] needs exactly twice as much memory as a Stream[Int] on the 32-bit Java HotSpot Server VM.
Because of all that wrapping, it is also way slower than a standard Stream even if each element is only accessed once. Depending on the number of elements, the EphemeralStream.length method, which does not access elements at all, was from three to six times slower than Stream.length in my tests. Computing sum of an EphemeralStream[Int] took approximately 3x more time compared to a Stream[Int].
It is poorly documented and comes with no usage examples. In fact, the only comment in its source is the (misleading) scaladoc comment quoted above in its entirety. Looks like an auxiliary class to me.
The scalaz-core jar is 9MB in size, which is a bit too big an overhead if you only need a single class. And it is not easy to extract EphemeralStream for standalone use, because it depends on several other Scalaz classes and traits.

There are also quite a few minor incompatibilities that prevent using scalaz.EphemeralStream as a drop-in replacement for the standard Scala Stream:

There is no empty EphemeralStream[Nothing] object, so you cannot match against a pattern similar to x #:: Stream.Empty. Workaround:
```
case x ##:: xs if xs.isEmpty => ...
```
The fold methods have different signatures, with curried functions:
```
def foldLeft[B](z: => B)(f: (=> B) => (=> A) => B): B
def foldRight[B](z: => B)(f: (=> B) => (=> A) => B): B
```
which means they won’t accept a shorthand such as _ + _ as the second parameter. Instead, you have to write:
```
xs.foldLeft(0)(x => y => x + y)
```
EphemeralStream does not implement convenience methods from and continually.
As of Scala 2.10, the scalac compiler seems to be unable to infer that an implicit conversion of an EphemeralStream[Int] to an Iterable[Int] yields an Iterable[Numeric], and that Ordering is defined, so methods such as sum and min are not available either.

If scalaz.EphemeralStream is not the solution, the only option left is to roll out our own non-leaky stream class. In Part IV we’ll try to do just that. Stay tuned!

Tags: garbage collection, scala, Scalaz, streams

This entry was posted on Monday, August 11th, 2014 at 10:13 am and is filed under Scala. You can follow any responses to this entry through the RSS 2.0 feed. You can skip to the end and leave a response. Pinging is currently not allowed.

« Scala Stream Hygiene II: HotSpot Kicks In | What Microsoft Got Right That JetBrains Didn’t »

Talkback

michal
01-Nov-2014
7:25 pm

1

Hey Dmitry, I found your stream series really informative and interesting. I was trying to use ephemeral streams but CPU overhead is simply unacceptable for cases where consuming stream is a main task. I really looking forward for the last part.
Marc
06-Nov-2014
3:38 am

2

Dmitry, thank you for these posts. They are excellent. Looking forward to part IV.
Dmitry Leskov
06-Nov-2014
10:01 pm

3

Michal, Marc, thank you very much. I am taking the Programming Languages course on Coursera at the moment, which, combined with family duties and full-time job, does not leave much time for coding and writing. But your comments are so encouraging that I’ll try to accelerate!

Scala Stream Hygiene III: Scalaz EphemeralStream Fills Quite A Canyon

scalaz.EphemeralStream

Talkback

About Me

Pages

Categories