# Further beating C with 20 lines of Haskell: wc
tl;dr: today we’ll look at implementing a toy `wc` command that is about 4-5 times faster than the corresponding GNU Coreutils implementation.
So I’ve recently come across a post by Chris Penner describing a Haskell implementation of the Unix `wc` command.
Chris did a great job optimizing the Haskell version as well as showing how some high-level primitives (monoids and streaming, for instance) turn out to be useful here, although the result was still a bit slower than C. There’s also a parallel version that relies heavily on the monoidal structure of the problem, and that one actually beats C.
But that post left me wondering: is it possible to do better without resorting to parallel processing?
Turns out the answer is yes. With some quite minor tweaks, the Haskell version manages to beat the hell out of the C version that presumably has decades of man-hours put into it.
## Experimental setup
As usual, let’s go through the benchmarking procedure first.
### Input data
I’ve downloaded this file and concatenated it with itself so that the total size is about 1.8 gigabytes:

```
% for i in `seq 1 10`; cat part.txt >> test.txt
% du -sh test.txt
1.8G    test.txt
```
By the way, the file resides on a `tmpfs` partition, so there is no disk IO involved.
### Hardware and software
I’m running Gentoo Linux on a Core i7 4770 with 32 gigabytes of RAM, most of it free during the benchmarks. All the Haskell code is built using ghc 8.8.2. I’m competing against coreutils 8.31 built with gcc 9.2 and `-O2 -march=native`:

```
% wc --version
wc (GNU coreutils) 8.31
Packaged by Gentoo (8.31-r1 (p0))
```
By the way, that `-march=native` part is a little bit of cheating in favour of C, since the resulting Haskell binary can run on any x86-64 machine, while `wc` compiled with that flag can only run on my CPU and newer ones (that is, assuming the compiler actually used some of the newer SIMD extensions). But, well, let’s be kind and give the C version a bit of an advantage.
### Measuring
Measurements are done using `time`. `time` shows both user and system time, but I only consider user time for comparison, since:

- System time proves to be quite consistently equal to about 0.35 s.
- I’m not benchmarking the kernel anyway, so I’m curious how much time is spent in my code.

As with any fully deterministic benchmark, I run each executable several times (5 in this case) and report the minimal user time.
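To make the procedure concrete, here is a small sketch of such a measurement loop (not from the original post; it assumes bash, whose built-in `time` honours `TIMEFORMAT`, and the binary name in the demo call is a stand-in for whatever is being benchmarked):

```shell
#!/usr/bin/env bash
# Print the minimal user time (in seconds) over five runs of a command.
min_user_time() {
  TIMEFORMAT='%U'                # make bash's `time` print user time only
  for _ in 1 2 3 4 5; do
    # the command's own output is discarded; only `time`'s report survives
    { time "$@" >/dev/null 2>&1; } 2>&1
  done | sort -n | head -n 1     # numeric sort, take the smallest
}

min_user_time sleep 0            # demo with a trivial command; substitute the binary under test
```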
## Baseline
So how does the C code perform? Let’s run `wc` to find out!

Benchmarking as described above gives us 7.20 seconds of user time, which sets the baseline for our implementation.

It’s also worth noting that I have the `ru_RU.UTF-8` locale set, but `wc` run with the `C` locale actually performs worse, taking about 12 seconds instead. We’ll take the best result, though.
## Haskell
Let’s see where we start. I’ll take this version from Chris’ post, changing the function name and renaming `cs` to `bs` since we’re counting bytes anyway:
```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

wc :: BS.ByteString -> (Int, Int, Int)
wc s =
    let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
    in (bs, ws, ls)
  where
    go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
    go (!bs, !ws, !ls, !wasSpace) c =
      let addLine | c == '\n' = 1
                  | otherwise = 0
          addWord | wasSpace = 0
                  | isSpace c = 1
                  | otherwise = 0
      in (bs + 1, ws + addWord, ls + addLine, isSpace c)
```
Sure, there are better-performing single-threaded versions mentioned in Chris’ post, but this one is a bit more friendly to the kind of changes I’ll be making later on. Also, I don’t really need the monoidal structure if I’m not going to use associativity, and I’m not going to use associativity as long as I keep this single-threaded.
Anyway, according to the numbers in Chris’ post, this version is about 9 times slower than `wc`. I was unable to reproduce this: on my machine it runs at 433% of `wc`, taking 31.23 seconds.
Also, this version has a tiny bug (it doesn’t count the last word if it isn’t followed by a space or a newline), but I’m not fixing it just yet, especially since the fix is trivial: it amounts to considering the last `Bool` field of the tuple, which is currently being dropped, so it shouldn’t affect the performance that much. We’ll fix this in the final version.
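For the record, the bug is easy to demonstrate with the version above as a standalone program (the sample inputs are mine): the word count comes out one short whenever the input doesn’t end in whitespace.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char (isSpace)

-- The initial version, verbatim: a word is only counted
-- when a whitespace character follows it.
wc :: BS.ByteString -> (Int, Int, Int)
wc s =
    let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
    in (bs, ws, ls)
  where
    go (!bs, !ws, !ls, !wasSpace) c =
      let addLine | c == '\n' = 1
                  | otherwise = 0
          addWord | wasSpace = 0
                  | isSpace c = 1
                  | otherwise = 0
      in (bs + 1, ws + addWord, ls + addLine, isSpace c)

main :: IO ()
main = do
  print $ wc (BS.pack "hello world")    -- (11,1,0): the trailing word is missed
  print $ wc (BS.pack "hello world\n")  -- (12,2,1): correct when a newline follows
```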
Anyway, how can this be improved?
## Records to the rescue
Firstly, I don’t like big tuples, especially tuples having elements of the same type — it’s just too error-prone for my small attention span. So let’s replace the tuple with a record:
```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0
```
We don’t use bang patterns anymore, since we effectively made the fields of the record strict via the `{-# LANGUAGE Strict #-}` pragma.
This wouldn’t change the performance much, would it?
Turns out it does, and quite a lot: by a factor of four!
This version takes 7.56 seconds, or 105% of `wc`’s time, so we’re almost as fast as the baseline now! How come?
Well, the answer is simple, and it’s also hinted at in the original post: once we’ve defined a record type with strict fields, the compiler has an easier time figuring out that it can unpack the contents of those fields. So we effectively have
```haskell
data State = State
  { bs :: {-# UNPACK #-} !Int
  , ws :: {-# UNPACK #-} !Int
  , ls :: {-# UNPACK #-} !Int
  , wasSpace :: !Bool
  }
```
and this saves us an indirection and a memory allocation per `Int` field (even if the allocated values never leave the nursery and are thus quite cheap).
## CSE
Next, note that we compute `isSpace c` at least once for every character, and most of the time we do it twice. Let’s make sure we indeed only call `isSpace` once!

So we change our `go` function like this:
```haskell
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
  where
    isSp = isSpace c
    addLine | c == '\n' = 1
            | otherwise = 0
    addWord | wasSpace = 0
            | isSp = 1
            | otherwise = 0
```
The result? The run time more than halved, to 2.93 seconds, or 41% of `wc`. So we’re more than twice as fast as `wc` now, while still having pure and idiomatic code!
I’m not sure why ghc doesn’t do this optimization for us, especially given that it never increases the amount of work done in this particular case, while doing it manually (as in the code above) improves the run time quite considerably. My guess is that `isSpace` gets inlined and optimized further before the compiler has any opportunity to do common subexpression elimination.
## Compiler options and pragmas
Another thing worth trying is to play around with compiler options. The two biggest knobs are the optimization level and the codegen backend, and, in addition to those, we could try force-inlining `wc`. There is no good systematic approach here, so we’ll just try these things in all possible combinations:
- LLVM codegen (via `-fllvm`): it sometimes does a better job on hardcore number-crunching code, so it might help here too.
- Optimization level (via `-O2`): most of the time the default `-O` is good enough, and there is no observable difference between that and `-O2`, but why not try?
- Inlining the function via the `{-# INLINE wc #-}` pragma. I don’t really think it would help, since this function is called only once, so the corresponding overhead is negligible, and I don’t expect any further optimization opportunities to kick in after inlining. But, again, let’s try and see.
I’m not even showing the resulting code since the function body isn’t affected at all.
So here are the results in tabular form:

| LLVM | -O2 | Inlining | Time, s | % of `wc` time |
|:----:|:---:|:--------:|--------:|---------------:|
|      |     |          |    2.93 |             41 |
| ✓    |     |          |    3.96 |             55 |
| ✓    | ✓   |          |    2.61 |             36 |
|      | ✓   |          |    2.59 |             36 |
|      |     | ✓        |    2.23 |             31 |
| ✓    | ✓   | ✓        |    2.02 |             28 |
|      | ✓   | ✓        |    2.01 |             28 |
| ✓    |     | ✓        |    2.01 |             28 |
Not sure why inlining helps since this function is called only once (so the function call overhead doesn’t matter) and I don’t really see any opportunity for other optimizations to kick in after inlining, but I won’t argue with the numbers.
On the other hand, `-O2` really helps, while LLVM doesn’t give any improvement; in fact, it makes things worse when used alone.
Anyway, we could just as well stop here and push this version to prod. This is still quite idiomatic Haskell, and C already looks pretty beaten up (28%!). If I were to present these results somewhere while showing off the elegance of functional programming, this is the version I’d present.
But let’s keep going (and let’s keep `-fllvm` in our flags, cause why not).
## More unpacking
The speedup we got after replacing the tuple with a custom record suggests another possible optimization. Note that the `Bool` we have in the record doesn’t get unpacked, since its type has more than one constructor. What if we replace it with an `Int`, where `1` denotes `True` and `0` denotes `False`?
This is a bit ugly, but, you know, we’re competing with C, so let’s do it!
```haskell
data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | isSpace c = 1
             | otherwise = 0
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace == 1 = 0
                | isSp == 1 = 1
                | otherwise = 0
```
This by itself doesn’t change much (I’m still getting the same results), but it opens up some further opportunities. Namely, notice that the expression for `addWord` can be translated from guards (which are essentially nested `if`s) into a single arithmetic expression:

```haskell
addWord = (1 - wasSpace) * isSp
```
Proving the equivalence is left as an exercise to the reader.
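For the less trusting reader, a standalone sketch (the helper names are mine) can check the equivalence exhaustively, since both inputs only ever take the values 0 and 1:

```haskell
-- Compare the guard-based definition of addWord with the arithmetic one
-- over all four combinations of wasSpace and isSp.
guarded :: Int -> Int -> Int
guarded wasSpace isSp
  | wasSpace == 1 = 0
  | isSp == 1 = 1
  | otherwise = 0

arith :: Int -> Int -> Int
arith wasSpace isSp = (1 - wasSpace) * isSp

main :: IO ()
main = print $ and [guarded w s == arith w s | w <- [0, 1], s <- [0, 1]]
-- prints True
```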
Anyway, this change further reduces the run time to 1.91 s, or 26% of `wc`.
## Less Unicode
What if we wanted to push just a little further? A reasonable compromise seems to be ignoring the presence of Unicode spaces outside of ASCII, avoiding the `isSpace` function (which didn’t do its Unicode job anyway, since we are looking at the input a byte at a time). It’s crucial to note that this does not prevent us from correctly counting characters in later developments of this little `wc` substitute, but I’ll elaborate on this in subsequent posts.
While we’re at it, we’ll also import `Data.ByteString.Lazy` instead of `Data.ByteString.Lazy.Char8` to avoid unpacking and repacking bytes into `Char`s.
```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

module Data.WordCount where

import qualified Data.ByteString.Lazy as BS

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | c == 32 || c - 9 <= 4 = 1
             | otherwise = 0
        addLine | c == 10 = 1
                | otherwise = 0
        addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}
```
This change results in quite a dramatic improvement, reducing the run time to 1.45 s, or 20% of `wc`’s time, even with the bit of an advantage given to `wc` in the form of `-march=native`.
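As a sanity check (a standalone sketch of mine, not part of the post): since `c` is now a `Word8`, the subtraction in `c - 9 <= 4` wraps around for bytes below 9, so the comparison correctly fails for them, and the whole predicate agrees with `Data.Char.isSpace` on the entire ASCII range:

```haskell
import Data.Char (chr, isSpace)
import Data.Word (Word8)

-- The branchless byte test used in wc above:
-- space (32) or one of \t \n \v \f \r (bytes 9..13)
isSp :: Word8 -> Bool
isSp c = c == 32 || c - 9 <= 4

main :: IO ()
main = print $ all (\c -> isSp c == isSpace (chr (fromIntegral c))) [0 .. 127]
-- prints True
```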
And, I think, that’s about as good as it can get with reasonable effort, so let’s stop here.
## Fixing the off-by-one error
The problem is that we don’t count the last word if it isn’t followed by a newline or a space character. Luckily, the fix is easy: just take the `wasSpace` field into account and set its initial value to `1`, resulting in code like
```haskell
wc s = (bs, ws + 1 - wasSpace, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 1) s
    ...
```
Obviously, this does not affect performance at all.
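Putting the fix together with the byte-level version, a standalone sketch (the sample inputs are made up) confirms that the trailing word is now counted:

```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString.Lazy.Char8 as C8

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

-- Fixed version: wasSpace starts at 1 and is folded into the word count at the end,
-- so a word at the very end of the input is no longer missed.
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws + 1 - wasSpace, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 1) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | c == 32 || c - 9 <= 4 = 1
             | otherwise = 0
        addLine | c == 10 = 1
                | otherwise = 0
        addWord = (1 - wasSpace) * isSp

main :: IO ()
main = do
  print $ wc (C8.pack "hello world")    -- (11,2,0): trailing word counted
  print $ wc (C8.pack "hello world\n")  -- (12,2,1)
  print $ wc C8.empty                   -- (0,0,0): empty input stays at zero
```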
## Bonus: reducing system time
Remember those 0.35 seconds spent in the kernel that we agreed not to consider? How about trying to cut that down as well?
I haven’t shown yet how I actually use the `wc` function, so here’s the `main`:
```haskell
main :: IO ()
main = do
  [path] <- getArgs
  contents <- BS.readFile path
  print $ wc contents
```
`readFile` wastes some cycles unnecessarily copying the data from the kernel to userspace. We could eliminate that overhead if we used something like `mmap`.
Let’s take, for instance, the bytestring-mmap package, which provides `ByteString`s over `mmap`ed files with zero copying. We’ll change the `main` to go along the lines of
```haskell
main :: IO ()
main = do
  [path] <- getArgs
  contents <- unsafeMMapFile path
  print $ wc contents
```
and, since `unsafeMMapFile` returns a strict `ByteString`, we’ll also change our word counting module to import `Data.ByteString` instead of `Data.ByteString.Lazy`.
The result? We’ve significantly reduced the time spent in the kernel: it’s now 0.04-0.06 seconds instead of the 0.35 s we had previously. This does not have any statistically significant effect on user time, though.
One might say the downside is that we’ve lost referential transparency: if the file changes, our `ByteString` will change as well. But I’d say that doesn’t matter in our case, since we only observe our `ByteString` once.
## Conclusion
This section will be quite short.
So we’ve managed to further improve on the previous results and overcome the performance of a C program that has been looked at over a few decades by thousands of eyes of quite hardcore, low-level Unix hackers. We did this with a handful of lines of pure, mutation-free, idiomatic Haskell, achieving about 4-5 times the throughput of the C version and spending less than an hour on all the optimizations.
The resulting code is available on GitHub.
The thing to keep in mind is that our implementation does way less than what `wc` does, so this comparison doesn’t mean that a production-ready Haskell version of `wc` would be just as fast (or expressed in just as few lines of code).
As usual with all these benchmarks, take this with a grain of salt.
Stay tuned for the second part, where we will investigate shaping all this into an actual `wc` substitute, with different statistics that can be turned on or off, all while not computing what the user hasn’t requested.
## Edit history
- Feb 04: reworded the conclusion to emphasize the shortcomings and mentioned that `wc` runs slower with the C locale for me.