OJ’s rants What would OJ do?

25Jun/097

Data Crunching in Haskell

A few days ago I was having a chat to a friend of mine about a little data parsing problem. He had the need to parse a multi-dimensional array to pull out some values. That array was guaranteed to be square, but not necessarily in contiguous memory. He needed to parse each "column" of the array, calculate a total, and then determine the biggest and smallest of those totals.

A sample of the data might look something like this:

data = ({150,200,45,57,95,2,45,32,15,10,5,2,2,4},
         12,20,45,37,10,5,2,2,10,95,2,45,32,7},
         32,15,10,5,2,23,24,15,20,45,57,95,0,45})

So the first step would be to add 150, 12 and 32 and store the value. Then 200, 20 and 15, and store the value. Do this for all of the columns, then get a maximum and a minimum.

This little algorithm was going to be part of his project, and hence needed to be implemented in AS3. So I picked his brains about the AS3 syntax, because I have absolutely no clue given that I've never worked with any version of ActionScript in the past.

Together, we came up with the following solution:

var columnTotal:Number;
var biggest:Number;
var smallest:Number;
biggest = smallest = sum(vData, 0);

for(var i = 1; i < _scope.period_mcs.length; ++i)
{
  columnTotal = sum(vData, i);
  biggest = Math.max(biggest, columnTotal);
  smallest = Math.min(smallest, columnTotal);
}

// helper function
function sum(var data:Array, var index:Integer):Number
{
  var total:Number = 0;
  for(int i = 0; i < data.length; ++i)
  {
    total += data[i][index];
  }
  return total;
}

Does it work? Yes, it sure does. Is it optimal? Yes, and no :) I was lazy and used the Math.max and Math.min functions instead of doing the obvious...

if(biggest < columnTotal) biggest = columnTotal;

So if we did that, to reduce the need for function calls and unnecessary assignments, we end up with this:

var columnTotal:Number;
var biggest:Number;
var smallest:Number;
biggest = smallest = sum(vData, 0);

for(var i = 1; i < _scope.period_mcs.length; ++i)
{
  columnTotal = sum(vData, i);
  if(biggest < columnTotal)
  {
    biggest = columnTotal;
  }
  else if(smallest > columnTotal)
  {
    smallest = columnTotal;
  }
}

// helper function
function sum(var data:Array, var index:Integer):Number
{
  var total:Number = 0;
  for(int i = 0; i < data.length; ++i)
  {
    total += data[i][index];
  }
  return total;
}

I can't see many ways to improve on this without going overboard with optimisation. Any AS3 guru's are more than welcome to prove me wrong!

So after thinking about this in an imperative language, I couldn't help but have a look at what the functional version might look like. Of course, my current chosen Functional toy is Haskell and so I fired up VIM and GHCI and had a bit of a play.

First, I put the data into Haskell's list format:

vals = [[150,200,45,57,95,2,45,32,15,10,5,2,2,4],
        [12,20,45,37,10,5,2,2,10,95,2,45,32,7],
        [32,15,10,5,2,23,24,15,20,45,57,95,0,45]]

That was easy enough. The next step was to break the problem down so that I could use some of the built in functions of Haskell's Prelude libraries. My thought processes were:

  1. Summing the column of the array could be done using sum, but I'd need to change the list so that rows become columns, and vice-versa.
  2. To switch rows and columns, I could use the transpose function.
  3. Then all I'd need to do is use maximum and minimum to get the right values out.

Being a bit of a primitive Haskeller, my first pass was something like this:

extremes n = (b, s)
  where
    l = map sum $ transpose n
    b = maximum l
    s = minimum l

The function extremes is the function which takes the data (list of lists) and spits out a tuple of (max, min).

Before the end of the conversation with my designer colleague in arms, I pinged him my version of the solution in Haskell and needless to say he was a little surprised as how concise it was. I was sure to point out that there is no doubt a better way of representing this solution with regards to speed and conciseness.

The first thing that I thought could be improved would be using a custom fold to get the max and min while parsing the transposed list. This would allow us to calculate the values in a single pass and hence be a little better with regards to performance. That would obviously sacrifice a little bit of the conciseness we're looking for.

When the conversation ended, I jumped onto IRC and spoke to some more seasoned Haskellers. The first suggested improvement that popped out of that chat was to use arrows to remove the need for the where clause. That solution looks like this:

extremes = (maximum &&& minimum) . map sum . transpose

Nifty :) Of all the other options, this proved to be the most readable and concise, though not the best performing.

The next most notable solution included the fold which calculated the min and max in a single parse, not a double parse:

extremes = foldl1 (\(a, b) -> max a *** min b) . join zip . map sum . transpose

Folds really are fantastic aren't they. Again we're using arrows here to do a bit of heavy lifting and that keeps things looking a little nicer.

So after this little session, my designer friend was aware of how easy it can be to crunch certain types of data using a functional language, like Haskell. It made me think again about how it'd be nice to just be able to plug in whichever language we wanted whenever we felt it would do the job better than whatever the current tool is.

So how would you improve it? :)

  • OJ

    @James: Wow, thanks! You've saved me the pain of having to do it myself. :) The timings don't surprise me to be honest. The compiler is pretty damned smart.

    I do prefer my solution so far though. Transposing is a very obvious solution to the problem and if anyone needed to read the code afterwards I think the solution with transpose would be digested the fastest. Yup that's the maintenance coder coming out in me.

    Anyway, thanks for taking the time to do the profiling, and the commenting :)

    @Axman6: Thanks for commenting! I agree, it does look like it'd be nice and quick. If and when I get the chance to profile it I'll put some stats up. Unless you've already profiled it...??? :)

    @Dan: That's a great question :) One that I've spent a few days thinking about because I wasn't sure exactly what the answer was. To be honest, I'm not sure I do now! But I'll give it a crack. Pardon me if I don't articulate it very well.

    The first part of the problem is that most people who are writing code that lacks performance don't know what they don't know. It's becoming more and more common for non-developers, or inexperienced developers, to build production level systems in code without really having a full grasp of what's required to write high quality and efficient software. That wasn't intended to be a stab :) That's a gross generalisation!

    So people will go through their formal education in their own areas and then attempt to fill whatever gaps they have in the real world with their untrained efforts.

    The perfect parallel to this is programmer art! Everyone knows what happens when a developer attempts to create a user interface. It's abysmal. The colours are bad, there's no "flow", it's abrasive to use and the UI looks awful. It's a typical case of a programmer thinking they can design. Most of the time, they can't. They just think they can! Thankfully, I am in the boat where I know I can't do design :) I can write you an interface that works, but don't expect it to be nice! That's where I aim to utilise other people's talents.

    So I guess what I'm saying is that without a good background in development or some formal training in algorithms, you're going to find it hard because the points where there are inefficiencies aren't going to be immediately obvious to you.

    Having said all that, there are definitely things you can do to aid you in making your systems function better. These are the kinds of things that anyone can do. Those things are the content of another blog post :) I'm inspired to write them up in a more formal way and actually try to do a better job of justifying what I've said here.

    So without sounding cheezy, watch this space! I'll have a solid answer for you very shortly.

  • Just my 2 cents, this looks fairly ugly, but i have a feeling it might be fairly efficient:
    extremes = (\(y:ys) -> foldl' (\(l,h) x -> (min l x, max h x)) (y,y) ys) . foldl (zipWith (+)) (repeat 0)

    Seems I had a lot of the same ideas as Dougal. I think that at least the left foldl' should be optimised pretty nicely.
    -- Axman6
  • I ran these three:
    e1 = (maximum &&& minimum) . map sum . transpose
    e2 = foldl1 (curry (uncurry max . fst &&& uncurry min . snd)) . join zip . map sum . transpose
    e3 = foldl1 (curry (uncurry max . fst &&& uncurry min . snd)) . join zip . foldr (zipWith (+)) (repeat 0)

    ... and they all came up around the 3s mark, jostling for top spot depending on how many times I can them.

    I think this might mean that GHC is clever enough to prune the intermediate matrix? Or this could be just an effect of lazy evaluation (effectively doing the pruning automatically). Hard to tell really.

    @Dan - The haskell profiling stuff is pretty good, easiest way to find out if something is going wrong (wrong order polynomial) is to run it a bunch of times and graph it!
  • Dan
    @jberryman @oj - this raises another question (at least in my mind) - is there any giveaways that suggest code needs optimisation? (other than the knowledge that im not a coding guru)? IE, OJ, you talked about O(n) with relation to the indexOf method in AS3 - but how did you know that that method was the weakest link and the thing that needed improvement?
  • OJ

    @Dougal: Thanks for your comment mate. You're right, the list folding method with zipWith is indeed a better option. I saw your comment on reddit (didn't know it was submitted there until you posted ;)) and I have to agree, the removal of the transpose looks to be a good one.


    @jberryman: You raise a good point :) I am yet to profile any of the solutions mentioned above. That is something I'll aim to do at some stage today or tomorrow and I'll post the results as an edit to the blog post. Thanks for the feedback!

  • Out of curiosity, did you profile your various approaches? It would probably take a larger data set, but I often find that what looks like an obviously-inefficient piece of code turns out to be no better than my more explicit and longer version.
    Would be interested to see if that is the case here.
  • I said this on Reddit more succinctly, but:
    There shouldn't be any need to transpose the lists. For each pair of lists [a1,b1,c1,...] and [a2,b2,c2,...] you want to join them element-wise: zipWith (+) [a1,b1,c1,...] [a2,b2,c2,...]
    .
    If you're guaranteed to have valid lists you can use foldr1 to repeatedly apply this until all the lists are joined together element-wise. If not foldr f (repeat 0) will work.
    I don't actually know if this is any cheaper, but it seems foolish to transpose a list-of-lists if you don't actually want that resulting format.
blog comments powered by Disqus