One Thread to Rule Them All

Among other assertions, the Node.js home page declares, “Because nothing blocks, less-than-expert programmers are able to develop fast systems.” It’s tempting to jump to the conclusion that Node.js is a panacea for the difficulties of developing complex systems. Counterpoint: I love async, but I can’t code like this. Perhaps there is a degree of nuance to the asynchronous programming model.

The techniques described in the remainder of this post were not invented by me. There are numerous libraries dedicated to unwinding asynchronous spaghetti callback hell, one of which (async.js) does things very similar to what I’m going to describe. But that’s not the point. The objective here is to identify some common async coding patterns, exercise the javascripty side of my brain to drum up solutions, and hopefully illuminate a point or two in the process.

The code presented in this post can be examined in its full glory over at GitHub. The script can be invoked with node on the command line.

$ node asynchronicity.js

One final caveat. This post is quite long, and I decided not to make it even longer by dealing with errors. All of my examples, in reality, need to handle errors. I’ll leave that as an exercise for the reader.
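For what it’s worth, here is a minimal sketch of the shape error handling would take, using Node’s conventional error-first callbacks. The getter and its failure mode are hypothetical; none of the examples below follow this convention.

// Hypothetical getter following Node's error-first callback convention.
var get_part_with_errors = function (callback) {
  setTimeout(function () {
    var failed = Math.random() < 0.1; // pretend the network occasionally fails
    if (failed) {
      callback(new Error('network timeout'));
    } else {
      callback(null, 'hello');
    }
  }, 1000);
};

get_part_with_errors(function (err, part) {
  if (err) {
    console.error('could not fetch part: ' + err.message);
    return;
  }
  console.log('fetched: ' + part);
});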

Our First Contrived Example

Imagine we want to retrieve some data over a very slow network, and we only have one thread on which to do it. Let’s look at the uber-simple synchronous model:

// Slow network getters
var get_first_part = function() {
  // Pretend this blocks for three seconds.
  return 'hello';
};
var get_second_part = function() {
  // Pretend this blocks for one second.
  return 'world';
};
// Let's get some data!
console.log(get_first_part() + ' ' + get_second_part());

Simple. Easy. Wonderful. Other than the fact that our lone thread is locked up for four seconds while we sit around waiting to print ‘hello world’.

Asynchronize It

Instead of waiting around doing nothing while we wait for the slow network to respond, we can pass callback functions that will be invoked when the data is available. Now we have getters that look something like this:

var slow_network = {

  get_first_part: function (callback) {
    setTimeout(function() { callback('hello'); }, 3000);
  },

  get_second_part: function (callback) {
    setTimeout(function() { callback('async'); }, 1000);
  },

  get_third_part: function (callback) {
    setTimeout(function() { callback('world'); }, 500);
  },

};

At this point, I’ll also introduce a quick helper that will be used to report results and timing information for async operations.

var exhibit = function (title) {
  var start_time = Date.now();

  return {
    log: function (msg) {
      console.log(title + ': ' + msg);
    },
    report: function (result) {
      console.log('report for ' + title +
                  ':\n\tresult: ' + result +
                  '\n\telapsed: ' +
                  (Date.now() - start_time) + ' ms');
    }
  };
};

It is common to use nested callbacks for asynchronous functions. Let’s retrieve our data over the slow async network:

var exhibit_a = exhibit('basic nested callbacks');

slow_network.get_first_part(function (first_part) {
  slow_network.get_second_part(function (second_part) {
    slow_network.get_third_part(function (third_part) {
      exhibit_a.report([first_part, second_part, third_part].join(' '));
    });
  });
});

This doesn’t look all that scary, but it’s trending considerably toward spaghetti when compared to our original synchronous version. When each callback requires additional logic before carrying on to the next invocation, it can be very difficult to retain a clear mental context for what is happening.
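To make that concrete, here is a hypothetical variation in which each step does a bit of its own work before making the next request. None of this extra logic exists in the real example; it’s only there to show how quickly the nesting starts to obscure the flow.

// Hypothetical: each callback now carries its own logic before the next call.
slow_network.get_first_part(function (first_part) {
  var greeting = first_part.toUpperCase();
  slow_network.get_second_part(function (second_part) {
    if (second_part.length === 0) {
      return; // bail out on an empty response
    }
    slow_network.get_third_part(function (third_part) {
      console.log([greeting, second_part, third_part].join(' '));
    });
  });
});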

More importantly, there’s absolutely no reason to invoke these independent functions sequentially. It took four and a half seconds to get to ‘hello async world’ despite the fact that no individual call took longer than three seconds. We should be able to fire off all three methods at once and carry on as soon as all three have responded. First, the ugly way:

var exhibit_b = exhibit('parallel polling');
var first, second, third;

slow_network.get_first_part(function (first_part) {
  first = first_part;
});

slow_network.get_second_part(function (second_part) {
  second = second_part;
});

slow_network.get_third_part(function (third_part) {
  third = third_part;
});

var interval_id = setInterval(function () {
  if (first && second && third) {
    clearInterval(interval_id);
    exhibit_b.report([first, second, third].join(' '));
  }
}, 100);

Yes, That Polling Timer is Gross

Speaking more generally, we have a number of asynchronous functions executing in parallel, and we have code that cannot be executed until all of the parallel functions have called back. This is a common pattern in async programming, so we certainly need a better solution than polling. The following helper method allows us to specify our async functions as a series of arguments. The final argument is the callback that will be invoked once all of the parallel functions have called back. This callback receives an array of arbitrary result objects, one for each of the parallel functions.

var async_helper = {

  parallel: function () {

    var args = Array.prototype.slice.call(arguments),
        callback = args.pop(),
        results = [],
        in_progress = args.length;

    args.forEach(function (async_call, index) {
      async_call(function (result) {
        results[index] = result;
        if (--in_progress == 0) {
          callback(results);
        }
      });
    });
  }

};

Aided by the parallel helper, our slow network example now looks like this:

var exhibit_c = exhibit('async_helper.parallel');

async_helper.parallel(
  slow_network.get_first_part,
  slow_network.get_second_part,
  slow_network.get_third_part,
  function (results) {
    exhibit_c.report(results.join(' '));
  }
);

This form is more efficient than simple nested callbacks, and in my opinion also more readable.
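For reference, given the delays baked into the slow network (three seconds, one second, half a second), the nested version reports an elapsed time of around four and a half seconds, while the parallel version comes in around three. The output looks something like this; the exact millisecond counts are invented, only the rough magnitudes matter.

report for basic nested callbacks:
    result: hello async world
    elapsed: 4507 ms
report for async_helper.parallel:
    result: hello async world
    elapsed: 3004 ms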

Avoiding Sequential Callback Hell

We don’t always have the luxury of calling functions in parallel. If each function in a sequence needs to pass its results to the next function, we can’t improve in efficiency over the nested async callbacks. We can, however, still provide a helper to keep sequential async calls clean and readable:

var async_helper = {

  parallel: function () {

  // See implementation above ...

  },

  sequence: function () {

    var args = Array.prototype.slice.call(arguments),
        callback = args.pop(),
        results = [];

    function next() {
      var func = args.shift();

      func(function (result) {
        results.push(result);
        args.length > 0 ? next() : callback(results);
      }, results);
    }

    next();
  }

};

Note that each function in the sequence takes two arguments: a callback and a results array. This allows results to be passed down the sequence. If we pretend that ‘hello async world’ can’t be retrieved in parallel, the sequential helper could be used like this:

var exhibit_d = exhibit('async_helper.sequence');

async_helper.sequence(
  slow_network.get_first_part,
  slow_network.get_second_part,
  slow_network.get_third_part,
  function (results) {
    exhibit_d.report(results.join(' '));
  }
);
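The exhibit above never touches that second argument, so here is a hypothetical sequence in which the middle step actually reads what the previous step produced. The shout_it step is invented purely for illustration.

// Hypothetical: the middle step consumes results[0] from the first step.
async_helper.sequence(
  slow_network.get_first_part,
  function shout_it(callback, results) {
    // results[0] is whatever the previous step called back with ('hello')
    callback(results[0].toUpperCase() + '!');
  },
  function (results) {
    console.log(results.join(' ')); // "hello HELLO!"
  }
);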

Mix It Up

The sequence and parallel helpers can, of course, be combined to create more complex sets of async function calls. For my next contrived example, assume that we have a disk that is extremely slow to read. Fortunately, our contrived slow disk allows us to do some reads in parallel. (Yes, I’m completely making something up to suit the problem I want to solve).

Our contrived disk has two kinds of reads. Direct reads are invoked with a getter and call back with a numeric value. Indirect reads are also invoked with a getter, but indirect reads call back with the name of a direct getter that then must be called to retrieve a value.

Pretend that our awesome slow disk has the following getters:

'get_a' -> 'get_e' (indirect)
'get_b' -> 'get_c' (indirect)
'get_c' ->    2    (direct)
'get_d' ->    5    (direct)
'get_e' ->    1    (direct)

First, I’ll build an object that implements my slow disk’s behavior.

var slow_disk = (function () {

  var getter_factory = function(getters) {

    var widget = {},
        params = null,
        name = null;

    function build_getter(val, delay) {
      return function (callback, results) {
        setTimeout(function() { callback(val); }, delay);
      };
    }

    for (name in getters) {
      params = getters[name];
      widget[name] = build_getter(params.val, params.delay);
    }

    return widget;
  };

  return getter_factory({
    get_a: { val: 'get_e', delay: 2000 },
    get_b: { val: 'get_c', delay: 3000 },
    get_c: { val: 2, delay: 1000 },
    get_d: { val: 5, delay: 1500 },
    get_e: { val: 1, delay: 5000 },
  });

}());
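Before tackling a full problem, here is what a single indirect read looks like by hand: get_a calls back with the name ‘get_e’, and that name is used to look up and invoke the direct getter.

// One indirect read by hand: roughly two seconds to learn the name,
// then another five to fetch the value it points at.
slow_disk.get_a(function (pointer) {
  slow_disk[pointer](function (val) {
    console.log('a ultimately resolves to ' + val); // 1
  });
});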

Now that we have a contrived slow disk, let’s invent a problem to solve with async helpers. I want to know the answer to:

a + b + c + d + e

We need to get five values in parallel, and for the indirect values we need to make two sequential calls. We’ll hold off on cleverness for the moment and implement that directly:

var exhibit_e = exhibit('parallel/sequence mix');

async_helper.parallel(

  // Get a, then get the value of what a points to
  function (callback) {
    async_helper.sequence(
      slow_disk.get_a,
      function (callback, results) {
        slow_disk[results[0]](function (val) { callback(val); });
      },
      function (results) {
        // We want the result of the second function
        callback(results[1]);
      }
    );
  },

  // Get b, then get the value of what b points to
  function (callback) {
    async_helper.sequence(
      slow_disk.get_b,
      function (callback, results) {
        slow_disk[results[0]](function (val) { callback(val); });
      },
      function (results) {
        // Again, we want the result of the second function
        callback(results[1]);
      }
    );
  },

  slow_disk.get_c,

  slow_disk.get_d,

  slow_disk.get_e,

  function (results) {
    var sum = results.reduce(function (sum, val) { return sum + val; });

    exhibit_e.report('sum = ' + sum);
  }
);

That does the job, but I have two issues with the solution.

First, the sequential indirect getters are messy, and they clearly repeat code. That issue can be solved with another helper function:

function get_indirect(indirect_method) {
  // smarter_slow_disk is the cached disk wrapper created a bit further down.
  var self = smarter_slow_disk;

  return function (callback) {
    async_helper.sequence(
      self[indirect_method],
      function (callback, results) {
        self[results[0]](function (val) { callback(val); });
      },
      function (results) {
        callback(results[1]);
      }
    );
  };
};

The second issue with the parallel/sequence mix solution arises from the fact that we’re wasting time. If we can safely assume (and of course we can; I defined the problem) that the values on my slow disk are static enough to be cached, we don’t need to make the same slow calls multiple times. But simple result caching only gets us halfway.

My problem was carefully designed to include a second call to get_e that is invoked while the first call to get_e is in the middle of its roundtrip. At the time the second function is called, we don’t have a cached response. We need a way to call the second callback when we receive the first result.

var async_cache = function(obj) {

  var responses = {},
      waiting = {},
      // Inherit from the wrapped object so that any non-function
      // properties still resolve through the prototype chain.
      caching_obj = Object.create(obj),
      func = null;

  function wrap_method(method, func) {
    return function(callback) {
      if (responses[method]) {
        // We already have this result, invoke callback immediately.
        callback(responses[method]);
      } else if (waiting[method]) {
        // Someone else is waiting for this result. Just add
        // ourself to the waiting list.
        waiting[method].push(callback);
      } else {
        // No one has asked for this yet. Store the caller's
        // callback on the waiting list and handle the response
        // callback by looping through the waiting list and
        // invoking all waiting callbacks.
        waiting[method] = [callback];
        func(function (val) {
          responses[method] = val;
          waiting[method].forEach(function(callback) {
            callback(val);
          });
          delete waiting[method];
        });
      }
    }
  }

  for (var method in obj) {
    func = obj[method];
    // Don't try cache magic for non-functions
    if (typeof func === 'function') {
      caching_obj[method] = wrap_method(method, func);
    }
  }

  return caching_obj;
};

The function above wraps the object passed to it in an async caching layer. It ensures that each slow asynchronous call is only dispatched once. Repeated calls to a method either use a stored response or add their callback to a list of callbacks that will be invoked when the initial call comes back.
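A quick, throwaway sanity check of that coalescing behavior might look like the following. The cached_disk name is invented here; the wrapper used by the real solution is created just below.

// Hypothetical check: both callers receive 1 after roughly five seconds,
// but the underlying slow_disk.get_e is dispatched only once.
var cached_disk = async_cache(slow_disk);

cached_disk.get_e(function (val) { console.log('first caller: ' + val); });
cached_disk.get_e(function (val) { console.log('second caller: ' + val); });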

Using both our indirect helper and the caching extension, the solution to the slow disk problem is now:

var smarter_slow_disk = async_cache(slow_disk);

var exhibit_f = exhibit('sequence/parallel mix with caching');
async_helper.parallel(
  get_indirect('get_a'),
  get_indirect('get_b'),
  smarter_slow_disk.get_c,
  smarter_slow_disk.get_d,
  smarter_slow_disk.get_e,
  function (results) {
    var sum = results.reduce(function (sum, val) { return sum + val; });

    exhibit_f.report('sum = ' + sum);
  }
);
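Given the delays defined on the contrived disk, the uncached mix is gated by the a → e chain (two seconds plus five), while the cached version lets get_a’s lookup of get_e piggyback on the direct get_e request already in flight. The reports come out something like this; again, the exact millisecond counts are invented.

report for parallel/sequence mix:
    result: sum = 11
    elapsed: 7006 ms
report for sequence/parallel mix with caching:
    result: sum = 11
    elapsed: 5003 ms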

Node Doesn’t Solve Your Problems

At no point in this post did I write any complex, multi-threaded code. Such is the promise of Node. I will allow the reader to decide whether or not this exercise involved any complex, single-threaded code.

Node doesn’t necessarily make it easier to solve difficult problems. The onus is still on the developer to identify and abstract away common async patterns. There are no freebies. Clean, maintainable, readable code still requires thoughtfulness and diligence. Or at least it does for less-than-experts like myself who want to try their hands at fast systems.