Scala Stream Hygiene III: Scalaz EphemeralStream Fills Quite A Canyon
The source code for this post is available on GitHub.
The following is the third part in a four-part series. Part I listed the coding rules that help you avoid memory leaks when using the standard Scala Stream
class. Part II demonstrated that an optimizing JIT compiler and precise GC render most of those rules superfluous, and argued that circumventing Stream
memoization in production code is dangerous. This part will discuss a readily available third-party alternative that enables representing potentially infinite data structures without the risk of leaking memory:
scalaz.EphemeralStream
Scalaz is an open-source Scala library that implements type classes and pure functional data structures. In particular, it provides a class (a trait, to be more precise) EphemeralStream
, described as follows:
Like
scala.collection.immutable.Stream
, but doesn’t save computed values. As such, it can be used to represent similar things, but without the space leak problem frequently encountered using that type.
First of all, don’t let the scarcity of the method list in EphemeralStream
documentation mislead you: an EphemeralStream
can be implicitly converted to an Iterable
, so most methods of the latter are in fact available (but not all of them, more on that below).
The "does not save computed values" part is somewhat misleading too. An EphemeralStream
cell actually caches both the value and the next cell reference (once they get computed) using Java weak references.
Objects that only have weak references to them get garbage collected on first try. Therefore you can safely store an EphemeralStream
in a val
and pattern match on it as you please:
def tailAvg(xs: EphemeralStream[Int]): Option[Int] = {
xs match {
case y ##:: ys => Some(ys.sum / ys.length)
case _ => None
}
}
Unfortunately, EphemeralStream
suffers from numerous issues hindering its practical use:
-
It implements memoization using Java
WeakReference
s wrapped in ScalaOption
s wrapped in closures, which all add memory overheads, so the GC gets invoked more frequently compared to the technique described in Part I. For instance, anEphemeralStream[Int]
needs exactly twice as much memory as aStream[Int]
on the 32-bit Java HotSpot Server VM. -
Because of all that wrapping, it is also way slower than a standard
Stream
even if each element is only accessed once. Depending on the number of elements, theEphemeralStream.length
method, which does not access elements at all, was from three to six times slower thanStream.length
in my tests. Computing sum of anEphemeralStream[Int]
took approximately 3x more time compared to aStream[Int]
. -
It is poorly documented and comes with no usage examples. In fact, the only comment in its source is the (misleading) scaladoc comment quoted above in its entirety. Looks like an auxiliary class to me.
-
The
scalaz-core
jar is 9MB in size, which is a bit too big an overhead if you only need a single class. And it is not easy to extractEphemeralStream
for standalone use, because it depends on several other Scalaz classes and traits.
There are also quite a few minor incompatibilities that prevent using scalaz.EphemeralStream
as a drop-in replacement for the standard Scala Stream
:
-
There is no empty
EphemeralStream[Nothing]
object, so you cannot match against a pattern similar tox #:: Stream.Empty
. Workaround:case x ##:: xs if xs.isEmpty => ...
-
The fold methods have different signatures, with curried functions:
def foldLeft[B](z: => B)(f: (=> B) => (=> A) => B): B def foldRight[B](z: => B)(f: (=> B) => (=> A) => B): B
which means they won’t accept a shorthand such as
_ + _
as the second parameter. Instead, you have to write:xs.foldLeft(0)(x => y => x + y)
-
EphemeralStream
does not implement convenience methodsfrom
andcontinually
. -
As of Scala 2.10, the
scalac
compiler seems to be unable to infer that an implicit conversion of anEphemeralStream[Int]
to anIterable[Int]
yields anIterable[Numeric]
, and thatOrdering
is defined, so methods such assum
andmin
are not available either.
If scalaz.EphemeralStream
is not the solution, the only option left is to roll out our own non-leaky stream class. In Part IV we’ll try to do just that. Stay tuned!
Tags: garbage collection, scala, Scalaz, streams
01-Nov-2014
7:25 pm
Hey Dmitry, I found your stream series really informative and interesting. I was trying to use ephemeral streams but CPU overhead is simply unacceptable for cases where consuming stream is a main task. I really looking forward for the last part.
06-Nov-2014
3:38 am
Dmitry, thank you for these posts. They are excellent. Looking forward to part IV.
06-Nov-2014
10:01 pm
Michal, Marc, thank you very much. I am taking the Programming Languages course on Coursera at the moment, which, combined with family duties and full-time job, does not leave much time for coding and writing. But your comments are so encouraging that I’ll try to accelerate!