Further beating C with 20 lines of Haskell: wc

February 2, 2020 // haskell, performance

tl;dr: today we’ll look at implementing a toy wc command that is about 4-5 times faster than the corresponding GNU Coreutils implementation.

So I’ve recently come across a post by Chris Penner describing a Haskell implementation of the Unix wc command. Chris did a great job optimizing the Haskell version as well as showing how some high-level primitives (monoids and streaming, for instance) turn out to be useful here, although the result was still a bit slower than C. There’s also a parallel version that relies heavily on the monoidal structure of the problem, and that one actually beats C.

But that post left me wondering: is it possible to do better without resorting to parallel processing?

Turns out the answer is yes. With some quite minor tweaks, the Haskell version manages to beat the hell out of the C version that presumably has decades of man-hours put into it.

Experimental setup

As usual, let’s go through the benchmarking procedure first.

Input data

I’ve downloaded this file and concatenated it with itself so that the total size is about 1.8 gigabytes:

% for i in `seq 1 10`; cat part.txt >> test.txt
% du -sh test.txt
1.8G    test.txt

By the way, the file resides on a tmpfs partition, so there is no disk IO involved.

Hardware and software

I’m running Gentoo Linux on Core i7 4770 with 32 gigabytes of RAM with most of it free during the benchmarks.

All the Haskell code is built using ghc 8.8.2.

I’m competing against coreutils 8.31 built using gcc 9.2 with -O2 -march=native:

% wc --version
wc (GNU coreutils) 8.31
Packaged by Gentoo (8.31-r1 (p0))

By the way, that -march=native part is a little bit of cheating in favour of C, since the resulting Haskell binary can run on any x86-64 machine, while wc compiled with that flag can only run on my CPU and newer ones (that is, assuming the compiler actually used some of the newer SIMD extensions). But, well, let’s be kind and give the C version a bit of advantage.

Measuring

Measurements are done using time. time shows both user and system time, but I only consider user time for comparison, since:

  1. System time proves to be quite consistently equal to about 0.35 s.
  2. I’m not benchmarking the kernel anyway, so I’m mostly interested in how much time is spent in my code.

As with any fully deterministic benchmark, I run each executable several times (5 in this case) and report the minimal (user) time.
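That procedure can be sketched as a small helper (a hypothetical bash sketch, not the exact script I used; best_user_time is a made-up name):

```shell
#!/usr/bin/env bash
# Run a command 5 times and report the minimal user time, in seconds.
# `time -p` prints "real/user/sys" lines to stderr; we keep the user one.
best_user_time() {
  for _ in 1 2 3 4 5; do
    { time -p "$@" >/dev/null 2>&1; } 2>&1 | awk '/^user/ { print $2 }'
  done | sort -n | head -n 1
}

best_user_time wc test.txt
```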

Baseline

So how does the C code perform? Let’s run wc to find out!

Benchmarking as described above gives us 7.20 seconds of user time. So, this sets the baseline for our implementation.

It’s also worth noting that I have ru_RU.UTF-8 locale set, but wc run with the C locale actually performs worse, taking about 12 seconds instead. We’ll take the best result though.

Haskell

Let’s see where we start. I’ll take this version from Chris’ post, changing the function name and renaming cs to bs since we’re counting bytes anyway:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

wc :: BS.ByteString -> (Int, Int, Int)
wc s =
    let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
     in (bs, ws, ls)
  where
    go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
    go (!bs, !ws, !ls, !wasSpace) c =
        let addLine | c == '\n' = 1
                    | otherwise = 0
            addWord | wasSpace = 0
                    | isSpace c = 1
                    | otherwise = 0
         in (bs + 1, ws + addWord, ls + addLine, isSpace c)

Sure, there are better-performing single-threaded versions mentioned in Chris’ post, but it’s just that this one is a bit more friendly to the kind of changes I will be making later on. Also, I don’t really need the monoidal structure if I’m not going to use associativity, and I’m not going to use associativity as I keep this single-threaded.

Anyway, according to the numbers in Chris’ post, this version is about 9 times slower than wc. I was unable to reproduce this: on my machine it’s 433% of wc, taking 31.23 seconds to run.

Also, this version has a tiny bug: it doesn’t count the last word if it isn’t followed by a space or a newline. I’m not fixing it just yet, since the fix is trivial (it amounts to taking into account the last Bool field of the tuple, which is currently just dropped) and shouldn’t affect performance much. We’ll fix this in the final version.
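To make the bug concrete, here’s a self-contained copy of the same fold applied to two sample inputs:

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

-- Same fold as above: (bytes, words, lines, was-the-previous-char-a-space).
wc :: BS.ByteString -> (Int, Int, Int)
wc s =
    let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
     in (bs, ws, ls)
  where
    go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
    go (!bs, !ws, !ls, !wasSpace) c =
        let addLine | c == '\n' = 1
                    | otherwise = 0
            addWord | wasSpace = 0
                    | isSpace c = 1
                    | otherwise = 0
         in (bs + 1, ws + addWord, ls + addLine, isSpace c)

main :: IO ()
main = do
  print $ wc $ BS.pack "hello world\n"  -- (12,2,1), matching wc
  print $ wc $ BS.pack "hello"          -- (5,0,0): the trailing word is lost
```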

Anyway, how can this be improved?

Records to the rescue

Firstly, I don’t like big tuples, especially tuples having elements of the same type — it’s just too error-prone for my small attention span. So let’s replace the tuple with a record:

{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0

We don’t use bang patterns anymore since we effectively made the fields of the record strict via the {-# LANGUAGE Strict #-} pragma. This wouldn’t change the performance much, would it?

Turns out it does, and quite a lot — by a factor of four! This version takes 7.56 seconds or 105% of wc, so we’re almost as fast as the baseline now! How come?

Well, the answer is simple, and it’s also hinted at in the original post: once we’ve defined a record type with strict fields, the compiler has an easier time figuring out that it can unpack the contents of those fields. So we effectively have

data State = State
  { bs :: {-# UNPACK #-} !Int
  , ws :: {-# UNPACK #-} !Int
  , ls :: {-# UNPACK #-} !Int
  , wasSpace :: !Bool
  }

and this saves us an indirection and a memory allocation per Int field (even if the allocated values never leave the nursery thus being quite cheap).

CSE

Next, note we’re computing isSpace c at least once for every character, but most of the time we do it twice. Let’s make sure we indeed only call isSpace once!

So we change our recursive go function like this:

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp = isSpace c
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSp = 1
                | otherwise = 0

The result? The run time more than halved, taking 2.93 seconds or 41% of wc. So, we’re more than twice as fast as wc now, still having pure and idiomatic code!

I’m not sure why ghc doesn’t do this optimization for us, especially given that it never increases the amount of work done in this particular case, and doing the optimization manually (as in the code above) gives some good results, improving the run time quite considerably.

My guess is that isSpace gets inlined and optimized further before the compiler has any opportunity to do common subexpression elimination.

Compiler options and pragmas

Another thing worth trying is to play around with compiler options. The two biggest knobs are the optimization level and the codegen backend, and, in addition to that, we could try force-inlining wc. There is no good systematic approach here, so we’ll just try these things in all possible combinations:

  1. LLVM codegen (via -fllvm) — it sometimes gets better on hardcore number-crunching code, but it might help here too.
  2. Optimization level (via -O2) — most of the time the default -O is good enough, and there is no observable difference between that and -O2, but why not try?
  3. Inlining the function via the {-# INLINE wc #-} pragma. I don’t really think it would help since this function is called only once, so the corresponding overhead is negligible, and I don’t expect any further optimization opportunities to kick in after inlining. But, again, let’s try and see.

I’m not even showing the resulting code since the function body isn’t affected at all.

So here are the results in tabular form:

LLVM   -O2   Inlining   Time, s   % of wc time
                        2.93      41
 ✓                      3.96      55
        ✓               2.61      36
              ✓         2.59      36
 ✓      ✓               2.23      31
        ✓     ✓         2.02      28
 ✓            ✓         2.01      28
 ✓      ✓     ✓         2.01      28

Not sure why inlining helps since this function is called only once (so the function call overhead doesn’t matter) and I don’t really see any opportunity for other optimizations to kick in after inlining, but I won’t argue with the numbers.

On the other hand, -O2 really helps, while LLVM doesn’t give any improvement, and, in fact, it makes things worse when used alone.

Anyway, we could just as well stop here and push this version to prod. This is still quite idiomatic Haskell, and C already looks pretty beaten up (28%!). If I were to present these results somewhere while also showing off the elegance of functional programming, this is the version I’d present.

But let’s keep going (and let’s keep -fllvm in our flags, cause why not).

More unpacking

The speedup we got after replacing the tuple with a custom record suggests another possible optimization.

Note that the Bool that we have in the record doesn’t get unpacked since it has more than one constructor. What if we replace it with an Int, where 1 denotes True and 0 denotes False? This is a bit ugly, but, you know, we’re competing with C, so let’s do it!

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | isSpace c = 1
             | otherwise = 0
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace == 1 = 0
                | isSp == 1 = 1
                | otherwise = 0

This by itself doesn’t change much (I’m still getting the same results), but it opens some further opportunities. Namely, notice that the expression for addWord can be translated from guards (which are essentially nested ifs) to a single arithmetic expression:

        addWord = (1 - wasSpace) * isSp

Proving the equivalence is left as an exercise to the reader.
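Or, instead of proving it, we can simply enumerate all four cases (a throwaway check, not part of the wc code):

```haskell
-- addWord written with guards vs. as an arithmetic expression;
-- both arguments only ever take the values 0 or 1.
guarded :: Int -> Int -> Int
guarded wasSpace isSp
  | wasSpace == 1 = 0
  | isSp == 1 = 1
  | otherwise = 0

arith :: Int -> Int -> Int
arith wasSpace isSp = (1 - wasSpace) * isSp

main :: IO ()
main = print $ and [ guarded w s == arith w s | w <- [0, 1], s <- [0, 1] ]  -- True
```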

Anyway, this change further reduces the run time to 1.91 s, or 26% of wc.

Less Unicode

What if we wanted to push just a little further?

A reasonable compromise seems to be to ignore the presence of Unicode spaces outside of ASCII, avoiding the isSpace function (which didn’t do its Unicode job anyway since we are looking at the input a byte at a time). It’s crucial to note that it does not prevent us from correctly counting characters in later developments of this little wc substitute, but I’ll elaborate on this in subsequent posts.

While we’re at it, we’ll also import Data.ByteString.Lazy instead of Data.ByteString.Lazy.Char8 to avoid unpacking and repacking bytes into Chars.

{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

module Data.WordCount where

import qualified Data.ByteString.Lazy as BS

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        -- 32 is the ASCII space; bytes 9 through 13 are \t \n \v \f \r.
        -- Word8 arithmetic wraps around, so c - 9 <= 4 can't fire for c < 9.
        isSp | c == 32 || c - 9 <= 4 = 1
             | otherwise = 0
        addLine | c == 10 = 1
                | otherwise = 0
        addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}

This change results in quite a dramatic improvement, reducing the run time to 1.45 s, or 20% of wc time, even with the bit of an advantage given to wc in the form of -march=native.
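As a quick sanity check (a standalone snippet, not part of the module), the byte-level predicate agrees with Data.Char.isSpace on every ASCII byte:

```haskell
import Data.Char (chr, isSpace)
import Data.Word (Word8)

-- The byte-level space test from the code above:
-- 32 is ' ', and bytes 9 through 13 are \t \n \v \f \r.
-- Word8 arithmetic wraps, so c - 9 is large (not <= 4) for c < 9.
byteIsSpace :: Word8 -> Bool
byteIsSpace c = c == 32 || c - 9 <= 4

main :: IO ()
main = print $ all (\c -> byteIsSpace (fromIntegral c) == isSpace (chr c)) [0 .. 127]  -- True
```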

And, I think, that’s about as good as it can get with reasonable effort, so let’s stop here.

Fixing the off-by-one error

The problem is that we don’t count the last word as a word if it isn’t followed by a newline or a space character. Luckily, the fix is easy: just take into account the wasSpace field and set its initial value to 1, resulting in code like

wc s = (bs, ws + 1 - wasSpace, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 1) s
    ...

Obviously, this does not affect performance at all.
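For completeness, here’s the whole thing with the fix applied, plus a tiny driver with a few sample inputs (the samples are mine; the counting code is exactly the one derived above):

```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString.Lazy.Char8 as C8

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws + 1 - wasSpace, ls)
  where
    -- wasSpace starts at 1 so that leading whitespace doesn't add a word.
    State { .. } = BS.foldl' go (State 0 0 0 1) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | c == 32 || c - 9 <= 4 = 1
             | otherwise = 0
        addLine | c == 10 = 1
                | otherwise = 0
        addWord = (1 - wasSpace) * isSp

main :: IO ()
main = do
  print $ wc $ C8.pack "hello world\n"  -- (12,2,1)
  print $ wc $ C8.pack "hello"          -- (5,1,0): the last word is now counted
```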

Bonus: reducing system time

Remember those 0.35 seconds spent in the kernel that we agreed not to consider? How about trying to cut those down as well?

I haven’t yet shown how I actually use the wc function, so here’s the main:

import qualified Data.ByteString.Lazy as BS
import System.Environment (getArgs)

import Data.WordCount (wc)

main :: IO ()
main = do
  [path] <- getArgs
  contents <- BS.readFile path
  print $ wc contents

readFile wastes some cycles unnecessarily copying the data from the kernel to userspace. We could eliminate that overhead by using something like mmap. Let’s take, for instance, the bytestring-mmap package, which provides ByteStrings over mmaped files with zero copying. We’ll change the main to go along the lines of

import System.Environment (getArgs)
import System.IO.Posix.MMap (unsafeMMapFile)

import Data.WordCount (wc)

main :: IO ()
main = do
  [path] <- getArgs
  contents <- unsafeMMapFile path
  print $ wc contents

and since unsafeMMapFile returns a strict ByteString, we’ll also change our word counting module to import Data.ByteString instead of Data.ByteString.Lazy.

The result? We’ve significantly reduced the time spent in the kernel: it’s now 0.04-0.06 seconds instead of the 0.35 s we had previously. This doesn’t have any statistically significant effect on our user time, though.

One might say the downside is that we’ve lost referential transparency: if the file changes, our ByteString will change as well. But I’d say that doesn’t matter in our case, since we only observe the ByteString once.

Conclusion

This section will be quite short.

So we’ve managed to further improve on the previous results and overcome the performance of a C program that has been looked at by thousands of eyes of quite hardcore low-level Unix hackers over a few decades. We did this with a handful of lines of pure, mutation-free, idiomatic Haskell, achieving about 4 to 5 times the throughput of the C version and spending less than an hour on all the optimizations.

The resulting code is available at github.

The thing to keep in mind is that our implementation does way less than what wc does, so this comparison doesn’t mean that a production-ready Haskell version of wc would be just as fast (or expressed in just as many lines of code). As usual with all these benchmarks, take this with a grain of salt.

Stay tuned for the second part, where we will investigate shaping all this into an actual wc substitute in which different statistics can be turned on or off, all while not computing what the user hasn’t requested.
