↑ Best viewed this side up ↑

Scala Stream Hygiene I: Avoiding Memory Leaks

July 19th, 2014

The source code for this post is available on GitHub.

Lazy evaluation, also known as call-by-need, is commonly found in functional languages. Some of them go as far as to make it the default evaluation strategy; perhaps the most prominent example is Haskell. Language authors however seem to prefer eager (strict) evaluation, whether because it results in better performance in the majority of practical use cases, or because it plays better with the imperative features of their languages, such as I/O and exceptions, or because the authors find it easier to implement. So they add a number of features to the language and the standard library that enable the developers to use lazy evaluation if they really want to.

In Scala, the language features are the lazy modifiers for vals and by-name function parameters. And in the standard library, amongs others, there is the class Stream (scala.collection.immutable.Stream). It was the subject of my recent studies, the results of which I share in this series.

Memoization

Lazy evaluation often goes hand-in-hand with memoization. Without memoization, many programs implemented using lazy evaluation would exhibit terrible performance characteristics. Stream implements memoization and hence can be reasoned about as a List of elements computed on-demand.

However, there are also scenarios in which memoization is highly undesirable. One benefit of lazy evaluation is the ability to define potentially infinite data structures. In case of streams, these can be a stream of natural numbers or a stream of data packets coming in from the network. Problem is, all computers commercially available today only support a finite amount of memory, so memoization of such data structures is very capable of making your program throw an OutOfMemoryError.

When you either have no need to use stream elements more then once, as in the network packet filtering scenario, or stream elements can be re-computed very cheaply (natural numbers), you better ensure that they do not get memoized.

Below are the rules that will help you avoid memoization of Scala streams. I’ve collected them from various sources and confirmed by compiliing and decompiling test programs. If you know of any other techniques or edge cases, please post in the comments.

Rules for Avoiding Stream Memoization

  1. Define streams using def and never store them in vals.

    This should be obvious, because val ensures that memoizaion occurs – see Stream scaladoc – but obvious things are often worth stating explicitly.

  2. Consume streams in tail-recursive functions.

    Again, this is rather obvious – if the consuming function is recursive, but not tail-recursive, a reference to the original stream will remain on the call stack until the recursion completes, effectively holding the entire stream in memory. (Not to mention that such a function would likely throw a StackOverflowError when there is still plenty of memory available on the heap.)

    How the tail-recursive functions manage to avoid the OOM? Let’s decompile an example:

    @tailrec
    def sum(xs: Stream[Int], z: Int = 0): Int = 
      if (xs.isEmpty) z else sum(xs.tail, z + xs.head)

    Here is what the decompiler produces:

    public int sum(Stream<Object> xs, int z) {
      for (;;) {
        if (xs.isEmpty()) return z;
        z += BoxesRunTime.unboxToInt(xs.head());
        xs = (Stream)xs.tail();
      }
    }

    Notice that the xs parameter is reused. It gets overwritten on each loop iteration, so it always holds a reference to the not-yet processed remainder of the original stream.

  3. Pass streams around via by-name parameters. (Make sure to read the corollary below.)

    Sometimes you need to pass a stream trough intermediate functions before its consumption, but that would leave references to the stream on the call stack.

    Typical example:

    def sum(xs: Stream[Int]): Int = {
      @tailrec
      def loop(acc: Int, xs: Stream[Int]): Int =
        if (xs.isEmpty) acc else loop(acc + xs.head, xs.tail)
      loop(0, xs)
    }

    Although the inner function loop is tail-recursive, the sum function that calls it will hold a reference to the head of the stream in its parameter.

    The advice commonly found on the Net is to pass the stream around in a "container", such as a single-element array or an AtomicReference, and nullify its contents in the consuming function. But this results in awkward-looking, impure code. I am not sure why the built-in language feature that achieves the same effect gets overlooked.

    In the above example, if you make xs a by-name parameter to sum, what gets actually passed is a function, computed right before the call to loop, so its result does not hold the entire stream:

    def sum(xs: => Stream[Int]): Int = {
       .  .  .

    As you may have noticed when reading about Rule #2, you could also get rid of the outer function altogether using a default parameter value:

    @tailrec
    def sum(xs: Stream[Int], z: Int = 0): Int = 
      if (xs.isEmpty) z else sum(xs.tail, z + xs.head)

    but that is not always possible.

    Corollary: When defining stream-consuming functions in traits, wrap them in functions accepting streams as by-name parameters.

    This one is subtle, and I would say the most unfortunate, because you have no control over the root cause of this restriction. The root cause is that trait methods are not called directly, but via a forwarder method generated by the compiler, even if the caller is a member of the same trait. The forwarder method will hold a reference to the entire stream, that is, unless the stream is passed as a by-name parameter.

    Example:

    trait StreamConsumers {
      @tailrec
      final def sum(xs: Stream[Int], z: Int = 0): Int = {
        if (xs.isEmpty) z else sum(xs.tail, z + xs.head)
      }
      def sumByName(xs: => Stream[Int]): Int = {
        @tailrec def loop(acc: Int, xs: Stream[Int]): Int =
          if (xs.isEmpty) acc else loop(acc+xs.head, xs.tail)
        loop(0, xs)
      }
    
       .  .  .
    
    object Main extends StreamConsumers {
       .  .  .

    And here are the forwarders in the decompiled code of Main:

    public final class Main$
      implements StreamConsumers
    {
      public static final  MODULE$;
    
      public final int sum(Stream<Object> xs, int z) {
        return StreamConsumers.class.sum(this, xs, z);
      }
      public int sumByName(Function0<Stream<Object>> xs) {
        return StreamConsumers.class.sumByName(this, xs);
      }
       .  .  .

    You may also notice that enclosing a tail-recursive function in a wrapper method relieves you from the need to declare that method as final.

  4. Do not pattern match against streams outside the consuming functions.

    It is perfectly okay to use pattern matching inside a tail-recursive function that consumes the stream:

    def sumPatMatInner(xs: => Stream[Int]): Int = {
      @tailrec
      def loop(acc: Int, xs: Stream[Int]): Int =
        xs match {
          case Stream.Empty => acc
          case y #:: ys => loop(acc + y, ys)
        }
      loop(0, xs)
    }

    Hence a pattern matching addict might write something like this:

    def sumPatMat(xs: => Stream[Int]): Int = {
      @tailrec
      def loop(acc: Int, xs: Stream[Int]): Int =
        xs match {
          case Stream.Empty => acc
          case y #:: ys => loop(acc + y, ys)
        }
      xs match {
        case Stream.Empty => 0
        case x #:: Stream.Empty => x
        case y #:: ys => loop(y, ys)
      }
    }

    Why can this lead to an OOM? Let’s consider a simpler example:

    createStream match {
      case x #:: xs => consumeStream(x, xs)
      case _ => println("No data to process")
    }

    As of Scala 2.10, this code is an exact equivalent of the following:

    val foo: Option[(A, Stream[A])] = Stream.#::.unapply(createStream)
    if (foo.isEmpty) println("No data to process")
    else {val x = foo.get._1; val xs = foo.get._2; consumeStream(x, xs)}

    where A is the type of stream elements and foo is a unique name not used anywhere else.

    As you may see, if the first pattern matches, val xs will hold a reference to the tail of the stream returned by createStream. In fact, the temporary val foo will contain a reference to the entire stream.

    StackOverflow user Daniel Martin described a solution for safely matching on stream tail, which I think is a nice demostration of Scala implicits, but otherwise an overkill, so I won’t reproduce it here.

  5. Only call the eagerly evaluated Stream methods that are marked as "optimized for GC". The methods foreach, foldLeft, and reduceLeft have been specialized for the class Stream. length is also "GC-safe". (Of course, they all loop forever if the receiver is an infinite stream.)

    However, the method /: was left with the default implementation from scala.collections.TraversableOnce, which simply calls foldLeft, effectively holding the reference to the receiver on its stack frame:

    def /:[B](z: B)(op: (B, A) => B): B = foldLeft(z)(op)

    This also applies to methods forall, exists, find, max, min, sum, product, and possibly others.

Now I have some news for you. Under certain circumstances, your program can get away with breaking Rules 2 to 5. Part II will dig into that. Stay tuned!

Updated AddFLACs to Extract Track Metadata From Pathnames

May 17th, 2014

I have updated AddFLACs, my iTunes for Windows automation script that I wrote back in 2011. It helps you import FLAC files into iTunes in Apple Lossless format.

About a week ago, a reader complained in the comments to the original post:

does it still skip over FLAC with no meteadata. (None of my FLAC have metadata, i always go by filename).

Fact is, I already had half-baked code for extracting metadata from pathnames using regular expressions, so today I integrated that code and published AddFLACs 1.0.

Now, for instance, if your untagged FLACs collection is hierarchically organized by artist and then by album, you can import it into iTunes as follows:

AddFLACs -r ".*\\(.*)\\(.*)\\(.*)\.flac$" ^
  -artist "$1" -album "$2" -name "$3"

If file names begin with track numbers followed by a dash, you can extract track numbers too:

AddFLACs -r ".*\\(.*)\\(.*)\\(\d+)-(.*)\.flac$" ^
  -artist "$1" -album "$2" -tracknumber "$3" -name "$4"

Even if you are not a master of regular expressions, with some tooling you can go way further:

AddFLACs ^
  -r ".*\\(.*)\\(.*)\\(.*?)( \((\d{4})\))?\\(\d+)-(.*)\.flac$" ^
  -genre $1 -artist "$2" -album "$3" -year "$5" ^
  -tracknumber "$6" -name "$7"

I would recommend you to run it with --dry-run (-n) option first, to verify that all files match and metadata gets extracted as intended.

Refer to AddFLACs documentation for details.

Credits

I have used Regular Expressions 101, a terrific online regular expression tester and debugger, to debug the last example.

Retrieving a Web Site for Offline View Using Wget

February 23rd, 2013

I’ve volunteered to do a few minor CSS tweaks on a third-party Web site, implemented in Java and hosted on Heroku. Given the tiny scope of my task, figuring out how to build the entire thing and run it on a staging instance or locally would have been an overkill. So I’ve sought a way to create a static local mirror of the site. That turned out to be less straightforward than running wget --mirror Home-Page-URL.

First, the Web site in question has its stylesheets and other static files served from a CDN (content distribution network.) It also relies on third-party services for Web fonts, video streaming, and chat.

Second, it has some really big downloads. Fortunately, they are served from a subdomain.

Third, it has a blog section, also in a subdomain, which uses a completely different stylesheet that I did not need to touch.

To cut the long story short, here is the wget command line that worked for me:

wget --mirror \
  --page-requisites \
  --convert-links \
  --span-hosts \
  --domains domain-list \
  --reject pattern-list \
  Home-Page-URL

and here is the explanation:

--mirror
Enable infinite recursion and time-stamping.
--page-requisites
Also download files required to view the web page: images, stylesheets and so on.
--convert-links
Edit links in the downloaded documents so as to enable offline viewing. This includes links to page requisites. As a result, links to the also-downloaded files point to local copies, all other links get replaced with complete URLs.
--span-hosts
Permit recursion and retrieval of page requisites to span across hosts. (Use with caution or you’d download the entire Internet.)
--domains domain-list
Restrict the list of domains to download files from. In my case, those were the “www” subdomain of the Web site being mirrored and the domain of the CDN serving its static files.

Example: --domains www.example.com,somecdn.net

--reject pattern-list
Do not mirror certain files. list is a comma-separated list of file name suffixes or patterns.

Example: --reject mp3,ogg

KB2756872 Windows 8 Update Fails to Install? Remove Realtek Audio Drivers

December 22nd, 2012

My notebook, just upgraded to Windows 8, has spent quite some time trying to install the KB2756872 update and rolling back after failures.

For the record, I have run delmigprov.exe accompanying the standalone download of that update, but that alone did not help.

The solution in my case was the removal of Realtek audio drivers. (It is interesting that audio has continued to function, so now I am not sure whether I need those drivers at all.)

It’s Been a Long Time Since I Had To Patch a Binary Executable…

December 11th, 2012

Needed to push a file created on a notebook the other night to a Git repo on a desktop, both running Windows and connected to my home router. Thought the git protocol would work. It did not due to a bug in msysgit. (TL/DR: the bug has been open since March 2010, but nobody so far has volunteered to find the root cause and remedy, or sponsor such an effort. The only sensible workaround is to recompile msysgit from source with the side-band-64k protocol capability disabled, as the older side-band does not exhibit the problem, but the newer, faster alternative always takes precedence if both client and server support it.)

Followed the advice from Eli Billauer’s blog and patched git.exe on my desktop, which plays the role of a “central” Git server.

Here is a patch script that worked for me. It requires Gsar for Windows from the GnuWin32 collection; CygWin likely includes gsar too.

@echo off
copy /-y git.exe git.exe~
if errorlevel 1 goto copyfailed
gsar -o -sside-band-64k -rKW6YzEZbBv584 git.exe
if errorlevel 1 goto patchfailed
echo git.exe successfully patched
goto quit
:copyfailed
echo Could not create a backup copy of git.exe
goto quit
:copyfailed
echo Could not patch git.exe with GSAR
goto quit
:quit

The value of the -r option is just a random 13-character string generated by DuckDuckGo using the query password 13. You may wish to use different values on each Windows machine you may be pushing to using the git protocol.

Outlook Macro to Nicely Format Skype Chat Excerpts

December 1st, 2012

My day job involves a lot of communication, mostly via email and Skype IM. From time to time, I need to file an important excerpt from a Skype chat for later retrieval, or email it to a customer, partner, or colleague.

For years, I would have select that excerpt, copied it to the clipboard and pasted into a new Exchange Mail or Post item.

However, what got pasted was unformatted plain text, way harder to read than the original chat displayed in Skype:

Raw paste

I used to format the lengthier excerpts manually, out of respect to the recipients and/or future readers. Tedious work.

Earlier this year, I had proposed to celebrate our company’s 13th anniversary with a hackathon. Excelsior Hack Day I was a success, and I used it as a chance to take one bit off the routine part of my work.

My solution

Skype IM Pretty Printer is a VBA Macro for Microsoft Outlook that takes a Skype chat from the clipboard, formats it nicely and pastes into a new HTML message:

Paste using Skype IM Pretty Printer

If you want to give Skype IM Pretty Printer a shot, I have open sourced it under the MIT/X11 license. You can fork it on GitHub or visit the official page for download and installation instructions.

Running Online Python Tutor in a Local Linux VM

October 29th, 2012

Online Python Tutor (OPT) enables first-year CS students to watch the nicely visualized execution of their Python programs step-by-step.

A fresh edX student very much liked OPT but had two problems with its online nature: sometimes the OPT Web site was not responding, sometimes she had no Internet connection. Fortunately, OPT is open sourced on GitHub, so I was able to set it up on her Windows notebook as follows:

OPT runs on Google App Engine, but there is a local development server in the GAE SDK. I’ve set it up up on top of a small VirtualBox VM running Linux, so as to minimize interference with other software and simplify migration.

  1. Set up or clone a baseline Linux VM. I had a baseline Ubuntu 12.04 LTS disk image already, so just followed my own VirtualBox VM cloning recipe.
  2. In the meantime, download the Linux version of the Google App Engine SDK for Python:

    wget -c http://googleappengine.googlecode.com/files/google_appengine_1.7.3.zip

    (look up the current URL on the SDK download page)

  3. Fetch the latest version of OPT from GitHub:

    wget -c -O online-python-tutor.zip https://github.com/pgbovine/OnlinePythonTutor/zipball/master
  4. It turned out that Ubuntu 12.04 Server has Python installed even in the minimal configuration. I had to install unzip though:

    sudo apt-get install unzip
  5. Unpack both packages. I have chosen to put them under /opt (pun not intended – that is where FHS says you should put optional packages):

    cd /opt
    sudo unzip ~/google_appengine_1.7.3.zip
    sudo unzip ~/online-python-tutor.zip
    
  6. (optional) Rename the OPT directory:

    sudo mv pgbovine-OnlinePythonTutor-c4880ea online-python-tutor
  7. Try running OPT:

    sudo /opt/google_appengine/dev_appserver.py \
      -a 0.0.0.0 \
      -p 80 \
      --skip_sdk_update_check \
      /opt/online-python-tutor/v3
    

    There will be warnings about the unavailability of some APIs and such, but OPT apparently does not use those, so you may ingore the warnings.

  8. Try connecting to the VM from your browser. You should see the main OPT screen and be able to use it:

    Check that it works, then get back to the VM console/terminal and press Ctrl-C to shutdown the development server.

  9. Finally, make OPT start automatically on boot. On Ubuntu and other Upstart-enabled systems, add a .conf file to /etc/init:

    sudoedit /etc/init/pythontutor.conf

    with the following content (change installation directories if necessary):

    start on runlevel [2345]
    stop on runlevel [!2345]
    
    expect fork
    exec /opt/google_appengine/dev_appserver.py \
      --skip_sdk_update_check  \
      -a 0.0.0.0 -p 80 \
      /opt/online-python-tutor/v3 &
    
  10. Start the pythontutor job:

    sudo start pythontutor

    If this time you cannot connect to OPT from your browser, look for clues in /var/log/upstart/pythontutor.

Now that everything is working, you may wish to reduce the amount of RAM allocated to the VM. 128MB is more than enough to run a copy of the OPT just for the user connecting from the host, but watch memory use if you install e.g. a shared copy for your class or something.

Running programs on Linux boot up

September 29th, 2012

The other day I needed to configure a Linux VM to run a few programs at system startup. It turned out that there is no single way to accomplish that that would work across all major Linux distros and Unix flavors.

Read the rest of this entry »

Pushing Files from Windows to Linux/Unix Hosts with cwRsync

July 16th, 2012

“Use the best tool for the job” is a great principle. I however reserve the right to define which tool is the best when the person doing the job is going to be me. That is why I develop my Web properties, such as this blog, on a Windows PC, as I am more comfortable with Windows as a desktop platform, but for a very similar reason I run them on a Linux VPS.

This in particular means I need to deploy from Windows to Linux. Back then, I manually copied the new and changed files using the WinSCP plugin for FAR, and that was okay while there were just a few files. I also have a staging environment — a VirtualBox VM that more or less replicates my VPS setup, so I could have set up some shared folders on that VM, and then use rsync to push changes from staging to production. But instead I have set up rsync to push files right from Windows to either staging or production. Here is how you can do that too:

Read the rest of this entry »

Protecting Downloads From Hotlinking – The Soft Way

December 20th, 2011

The Story

Once upon a time, on a Web site 23 hops away from my home PC, there was a free software download that required registration. An email with the download page URL was sent to the visitor after registration. The download page contained usage instructions and a direct, static URL of the download in the form http://host/download/file.

Someone had registered and published the latter URL in a public directory, so people started to download the file directly, without seeing even the download page, let alone the registration form.

The URL of the download was changed, but the story repeated itself in a couple of weeks.
Read the rest of this entry »