The joys and perils of beating C with Haskell: productionizing wc

March 10, 2020

Last time we looked at implementing a toy wc-like program, and we compared its performance against the full-blown Unix wc. The results were quite interesting: our implementation managed to beat wc by a factor of 5. Of course, that’s quite an unfair comparison: our implementation is hardcoded to count just the bytes, lines and words. wc, on the other hand, has command-line options to select specific statistics, it supports some additional ones like maximum line length, it treats Unicode spaces properly (in a Unicode-aware locale, of course), and so on. In other words, it’s better to consider what we did last time as a proof of concept showing that it’s possible to achieve (and even exceed) C-like performance on this task, albeit with all those concessions.

Today we’ll look at ways of productionizing the toy program from the previous post. Our primary goal is to allow the user to select various statistics, computing just what the user has selected. We’ll try to do this in a modular and composable way, striving to isolate each statistic into its own unit of sorts.

Indeed, if we look at the C version — well, personally I wouldn’t call it a prime example of readable and maintainable code, as different statistics are computed in a single big 370-line-long function. This is something we’ll try to avoid here.

Moreover, we’ll try to express that certain statistics like byte count or line count can be computed more efficiently if we don’t have to look at each byte, while other statistics like word count or max line length need to look at each byte one by one (unless one does some clever and non-trivial broadword programming or SIMD-enabled things, which is beyond the scope of this post). For instance, the byte count can be computed in O(1) if we know we’re reading from a file — we can just take the file size and call it a day!
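To make that concrete, here’s a minimal sketch (not part of the code we’ll develop below; the byteCount name is made up) of that O(1) byte count, asking the OS for the file size via the unix package:

import System.Posix.Files (fileSize, getFileStatus)

-- Take the file size reported by the OS instead of reading the file at all.
byteCount :: FilePath -> IO Integer
byteCount path = fromIntegral . fileSize <$> getFileStatus path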

In addition to that, we will, among other things, write some tests for the individual statistics, handle command-line options (as well as multiple input files and stdin), and process files in parallel.

So, here’s the code from the previous post that we begin with:

{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

module Data.WordCount where

import qualified Data.ByteString.Lazy as BS

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws + 1 - wasSpace, ls)
  where
	State { .. } = BS.foldl' go (State 0 0 0 1) s

	go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
	  where
		isSp | c == 32 || c - 9 <= 4 = 1
			 | otherwise = 0
		addLine | c == 10 = 1
				| otherwise = 0
		addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}

We want to break it down into separate functions computing bytes, words and lines count respectively. How can we do this?

Composable left folds

First of all, note that what we’re doing is a fold — hey, even the name of the function BS.foldl' tells us so!

Some time ago I came across foldl — a library for "composable, streaming, and efficient Left folds". This sounds like exactly what we need! Moreover, luckily for us, it even has a module for folding over ByteStrings. In particular, it already has functions for two of the three statistics we’ve implemented: the byte count is just the length of the input string, which is length in said module, while count allows counting the lines (via count 10). Looks like we just need to write a fold to count the words, and we’re golden!

So here is the said fold:

{-# LANGUAGE Strict #-}

import qualified Control.Foldl as L
import qualified Data.ByteString as BS

data WordState = WordState { ws :: Int, wasSpace :: Int }

wordsCount :: L.Fold BS.ByteString Int
wordsCount = L.Fold (BS.foldl' go) (WordState 0 1) (\WordState { .. } -> ws + 1 - wasSpace)
  where
	go WordState { .. } c = WordState (ws + addWord) isSp
	  where
		isSp | c == 32 || c - 9 <= 4 = 1
			 | otherwise = 0
		addWord = (1 - wasSpace) * isSp

Now we can express counting bytes, lines and words simultaneously as

import qualified Control.Foldl.ByteString as BL

main :: IO ()
main = do
  [path] <- getArgs
  contents <- unsafeMMapFile path
  let res = BL.fold ((,,) <$> BL.length <*> BL.count 10 <*> wordsCount) (BSL.fromStrict contents) :: (Int, Int, Int)
  print res

Composable indeed! But what about the run time? If I benchmark it as described in the previous post (running on a 1.8-gigabyte file residing in tmpfs), I get about 2.5 seconds on my machine. Note it’s a different machine from the one I did the benchmarks on for the previous post — it’s my development machine, with a different CPU and more sources of noise (IDE, browser, etc), but it’ll do to get a rough idea of the performance.

Ok, so 2.5 seconds. That’s almost twice as bad as it was before.

What if we count just the length and words?

  let res = BL.fold ((,) <$> BL.length <*> wordsCount) contents :: (Int, Int)
  print res

1.55 seconds. Hmm. What about just the lines count?

  let res = BL.fold (BL.count 10) contents :: Int
  print res

1.05 seconds.

Dang. It’s almost additive. But it shouldn’t be additive. The lines count (which boils down to counting '\n' bytes) should be dwarfed by the way more complicated logic of word counting, but it isn’t.

Something strange’s going on here. Let’s get our hands dirty and look at what the foldl library does.

In general, foldl feeds a chunk of input to each of the folds composed applicatively. foldl folds over lazy ByteStrings, and a chunk here is a strict ByteString that the lazy one consists of. In this particular case, that’s about 256 KiB worth of data, which definitely does not fit the L1 cache, so we end up paying quite a few CPU cycles waiting for L2 → L1 and even L1 → register transfers (whereas with smaller chunks the data each fold touches would already be readily available).

Sure, we could have just decreased the size of the chunk to 16-32 KiB so that it fits L1, but that’s not as interesting.

And, what’s worse, the compiler seemingly isn’t able to elide unnecessary computations: the run time of BL.fold ((,) <$> wordsCount <*> wordsCount) contents is twice as much as the run time of BL.fold wordsCount contents.

Anyway, that approach is not gonna fly.

Moreover, it’s not entirely obvious how to compose the folds “on demand”, say, based on the program command-line options.

So let’s write…

Our very own folds

Before we actually get to coding, let’s think a little about the design. Remember that ultimately we need to compose the folds according to what the user has selected. What might that even look like?

Let’s say we have base, “atomic” folds (that is, statistics) f1, f2, f3 at our disposal, and the user requests the composition of f1 and f3. I wouldn’t really expect the compiler to be smart enough to see that code like

  -- options is a list of booleans per individual statistic
  options <- parseCliOptions

  -- theFold is the resulting composition
  let f acc (True, stat) = acc `compose` stat
      f acc (False, _) = acc
      theFold = foldl' f emptyFold (zip options [f1, f2, f3])

implies that the compiler needs to specialize all the compositions of folds that might arise.

The only way to be sure the compiler has all the relevant information is to lift this to the type level — Haskell ultimately erases the types at compile time, so any information conveyed via them must be usable by the compiler.

Types to the rescue

Given that, how do we represent a fold? One type-friendly way of doing this is via a type class.

And, before we get to that, let’s just enumerate the type system extensions we’ll need to get them out of the way:

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses, UndecidableInstances, FlexibleInstances #-}
{-# LANGUAGE TypeFamilyDependencies, FunctionalDependencies, PolyKinds, DataKinds, GADTs, TypeOperators #-}
{-# LANGUAGE ScopedTypeVariables #-}

So what constitutes a fold in our case, which is some statistic over the input? A fold-computed statistic is defined by its state type, the initial value of that state, a step function that updates the state given the next input byte, and a way to extract the final result from the final state.

Let’s write this down:

class Statistic s res st | res -> s, st -> s
						 , s -> res, s -> st
						 where
  initState :: st
  extractState :: st -> res
  step :: st -> Word8 -> st

Here s denotes the statistic’s identifier (it will be a DataKinds thing later on), and we use functional dependencies to denote that the statistic uniquely determines the result type and the state type, and vice versa. In particular, this allows us to avoid spelling out all of the class type variables in the method signatures; it suffices to mention just the ones that we actually need. Otherwise we’d need something like

  initState :: proxy1 s -> proxy2 res -> st
  extractState :: proxy s -> st -> res
  step :: proxy1 s -> proxy2 res -> st -> Word8 -> st

Ugh. Ugly. Good thing we don’t need to spell all that out.

Chunked statistics

Anyway, this one looks like a good start, but something’s missing. Let’s think how we might express the bytes count statistic, for instance. If the user has requested just this statistic, then there’s no point in folding over the input and incrementing a counter on each step. Instead, we can just take the length of the whole input chunk (as in BS.length) and call it a day. That’s O(1) on the input length.

On the other hand, the word counting or, even worse, the max line length statistic does not have a similarly “chunked” computation mode that would be easy to express. Such statistics necessarily need to look at each byte of the input (modulo some SIMD implementations that we don’t consider here — but those are still O(n) in the input).

Moreover, we’ll later want to combine different statistics so that if both stats support the chunked computation mode, the combination stays chunked, and otherwise it degrades to byte-level.

So how do we express that some statistics support both “chunked” computations as well as byte-level ones, while others only support the latter? GADTs to the rescue! We introduce an enumeration to denote the sort, or computation mode of the statistic, and we also introduce a GADT wrapping the functions that make sense for that sort. Or, in code:

data StatCompTyOf = Chunked | ByteOnly

data StatComputation st compTy where
  ChunkedComputation :: (st -> Word8 -> st)
					 -> (st -> BS.ByteString -> st)
					 -> StatComputation st 'Chunked
  ByteOnlyComputation :: (st -> Word8 -> st)
					  -> StatComputation st 'ByteOnly

Here (and in the subsequent text) BS is the strict ByteString module.

We also modify our Statistic class to account for that (note the new comp type parameter):

class Statistic s res st comp | res -> s, st -> s
							  , s -> res, s -> st, s -> comp
							  where
  initState :: st
  extractState :: st -> res
  computation :: StatComputation st comp

Here functional dependencies help again. And, in general, it’s sufficient to know any of s, res or st to infer the rest of the variables.

One other thing worth noting is that we could generalize this approach and have, for instance, three types of computations: our current byte-only and chunked ones plus SIMD-accelerated ones that look at 16-32 bytes at a time. I’m keeping just two types for simplicity, though.
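For the curious, here’s a hedged sketch of what such a three-way split might look like (the SimdBlocked name and the exact shape of the SIMD case are made up; the rest of the post sticks to the two-way version above):

data StatCompTyOf = Chunked | SimdBlocked | ByteOnly

data StatComputation st compTy where
  ChunkedComputation  :: (st -> Word8 -> st)
                      -> (st -> BS.ByteString -> st)
                      -> StatComputation st 'Chunked
  SimdComputation     :: (st -> Word8 -> st)
                      -> (st -> BS.ByteString -> st)  -- would be fed fixed-size 16-32 byte blocks
                      -> StatComputation st 'SimdBlocked
  ByteOnlyComputation :: (st -> Word8 -> st)
                      -> StatComputation st 'ByteOnly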

Stats implementation

What statistics do we have? I choose to implement the byte count, the UTF-8 character count, the word count, the maximum line length and the line count.

Or in code:

data Statistics = Bytes | Chars | Words | MaxLL | Lines deriving (Eq, Ord)

This should be more than enough to illustrate the approach.

The base statistics are just instances of the Statistic class. And, since most statistics will have just an integer as their state, let’s introduce a convenience typed wrapper for the state and result types:

newtype Tagged a = Tagged Word64 deriving (Eq, Show, Num)

Here a is just a dummy parameter to distinguish, say, Tagged 'Bytes from Tagged 'Chars.

Given these definitions, let’s start with the most straightforward one: the bytes count. In code:

instance Statistic 'Bytes (Tagged 'Bytes) (Tagged 'Bytes) 'Chunked where
  initState = 0
  extractState = id
  computation = ChunkedComputation (\st _ -> st + 1) (\st str -> st + fromIntegral (BS.length str))

This is probably quite self-documenting:

  1. We say that Bytes denotes a statistic having Tagged 'Bytes both as state type and result type and supporting chunked computations.
  2. The initial state (that is, the byte count) is 0.
  3. The extraction function doesn’t do anything fancy and just returns the accumulated count.
  4. The computation is necessarily a ChunkedComputation, since we said 'Chunked on the first line. The step function ignores the input and just increments the counter, while the chunk function adds the length of the input to the counter.

Easy-peasy so far.

The rest of the statistics are implemented similarly. The implementations are somewhat boring, so let’s just go through them quickly.

Counting lines is similarly straightforward, and this statistic also supports both byte-only and chunked computations:

instance Statistic 'Lines (Tagged 'Lines) (Tagged 'Lines) 'Chunked where
  initState = 0
  extractState = id
  computation = ChunkedComputation (\st c -> st + if c == 10 then 1 else 0) (\st str -> st + fromIntegral (BS.count 10 str))

What about counting words? This one supports just the byte-only computation, and we largely borrow the implementation from the last post:

data WordsState = WordsState { ws :: Word64, wasSpace :: Word64 }

instance Statistic 'Words (Tagged 'Words) WordsState 'ByteOnly where
  initState = WordsState 0 1
  extractState WordsState { .. } = Tagged (ws + 1 - wasSpace)
  computation = ByteOnlyComputation step
	where
	  step WordsState { .. } c = WordsState (ws + (1 - wasSpace) * isSp) isSp
		where
		  isSp | c == 32 || c - 9 <= 4 = 1
			   | otherwise = 0

This is a good example of a statistic with a non-trivial extraction function.

Cool, we’ve managed to port the stats that we already had. What about the new ones: the UTF-8 chars count and the max line length?

The first one is a bit more involved, but all the complexity is about the right bit mangling:

instance Statistic 'Chars (Tagged 'Chars) (Tagged 'Chars) 'ByteOnly where
  initState = 0
  extractState = id
  computation = ByteOnlyComputation $ \cnt c -> cnt + 1 - fromIntegral (   ((c .&. 0b10000000) `shiftR` 7)
																	   .&. (1 - ((c .&. 0b01000000) `shiftR` 6))
																	   )

Here we rely on the following property of UTF-8: each character is encoded with exactly one byte that doesn’t match the 10xxxxxx bit pattern (the continuation byte pattern). In other words, we don’t have to properly decode UTF-8 just for the sake of counting characters.
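To make that concrete, here’s a quick GHCi illustration (byte values shown for a 5-character, 6-byte string):

ghci> import qualified Data.Text as T
ghci> import qualified Data.Text.Encoding as T
ghci> import qualified Data.ByteString as BS
ghci> BS.unpack (T.encodeUtf8 (T.pack "héllo"))
[104,195,169,108,108,111]

Only 169 (0b10101001) matches the 10xxxxxx continuation pattern; each of the other five bytes starts a character, so the expression above counts exactly 5.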

What about the max line length? Here all the complexity is about properly accounting for non-printable characters and mimicking the tabulation behaviour:

data MaxLLState = MaxLLState { maxLen :: Word64, curLen :: Word64 }

instance Statistic 'MaxLL (Tagged 'MaxLL) MaxLLState 'ByteOnly where
  initState = MaxLLState 0 0
  extractState MaxLLState { .. } = Tagged $ max maxLen curLen
  computation = ByteOnlyComputation step
    where
      step MaxLLState { .. } 9 = MaxLLState maxLen $ curLen + 8 - (curLen `rem` 8)
      step MaxLLState { .. } 8 = MaxLLState maxLen $ max 0 (curLen - 1)
      step MaxLLState { .. } c | c == 10 || c == 12 || c == 13 = MaxLLState (max maxLen curLen) 0
                               | c < 32 = MaxLLState maxLen curLen
      step MaxLLState { .. } _ = MaxLLState maxLen (curLen + 1)

Note that this function even handles backspaces properly — something that the original wc doesn’t take into account. Looks a bit messy, but what are you gonna do?

Having implemented all the base statistics, we now turn to the fun part: how do we combine them together?

Combining stats

If a is a statistic and b is a statistic, then the pair (a, b) is also a statistic. So let’s start with a type for a pair of statistics:

infixr 5 :::
data a ::: b = a ::: b deriving (Show)

We could have just as well used the standard Prelude pair (as in (,)), but this way we don’t rely on the compiler eliding laziness (since tuples are lazy), and a custom type for statistics also makes the type-promoted code a bit more readable IMO.

Let’s now express how the statistics are combined.

First, what about the combination of chunked vs byte-only statistics? If both statistics support chunked computations, then the result can also do chunked stats, otherwise it should degrade to byte-only computation. The type-level expression of this idea would be

type family CombineCompTy a b where
  CombineCompTy 'Chunked 'Chunked = 'Chunked
  CombineCompTy _ _ = 'ByteOnly

What would an instance of Statistic look like for a pair of statistics? We might write something like this:

instance (Statistic sa resa sta compa, Statistic sb resb stb compb)
	   => Statistic (sa '::: sb) (resa ::: resb) (sta ::: stb) (CombineCompTy compa compb) where
  initState = initState ::: initState
  extractState (a ::: b) = extractState a ::: extractState b
  computation =
	case (computation :: StatComputation sta compa, computation :: StatComputation stb compb) of
		 (ByteOnlyComputation a, ChunkedComputation b _) -> ByteOnlyComputation $ combine a b
		 (ChunkedComputation a _, ByteOnlyComputation b) -> ByteOnlyComputation $ combine a b
		 (ByteOnlyComputation a, ByteOnlyComputation b)  -> ByteOnlyComputation $ combine a b
		 (ChunkedComputation stepA chunkA, ChunkedComputation stepB chunkB)
														 -> ChunkedComputation (combine stepA stepB) (combine chunkA chunkB)
	where
	  combine fa fb = \(a ::: b) w -> fa a w ::: fb b w

So if sa is a statistic with the result type resa, state type sta and computation mode compa, and similarly for sb / resb / stb / compb, then the pair sa ::: sb is a statistic with the result type being a pair resa ::: resb, state type being sta ::: stb and the computation mode — you guessed it — CombineCompTy compa compb.

Note the (ticked, promoted) :::-as-a-constructor vs (unticked, unpromoted) :::-as-a-type in the instance head. sa and sb are terms promoted to the type level, so we use the (promoted) term constructor near them, and the rest are the usual types (living in Type), so we use the (unpromoted) type constructor there.

All is good, except… We can’t write this instance. Having a type family in the instance head in current Haskell is illegal. The easy way to fix this is to introduce a fresh type variable comp and add the constraint that it’s equal to the result of the type family application:

instance (Statistic sa resa sta compa,
		  Statistic sb resb stb compb,
		  comp ~ CombineCompTy compa compb)
	   => Statistic (sa '::: sb) (resa ::: resb) (sta ::: stb) comp where

The rest of the instance definition remains the same.

Looking at the actual term-level stuff, the first two methods are easy and straightforward:

  1. Initial state of a pair of statistics is a pair of the corresponding initial states.
  2. Extracting state of a pair of statistics is a pair of the extracted states of the individual statistics.

Note we don’t need a single type annotation here!

The third method for building the statistics computation for the pair is somewhat more boilerplate-ish, but it largely follows the definition of the CombineCompTy type (and the type checker ensures that it indeed does follow that definition). Namely, if both computations support chunked stats, the resulting statistic also supports those (that’s the last clause). Otherwise we fall back to the byte-only computation.

The major part of the boilerplate comes from the need to explicitly pattern-match on at least one of the computers. Why? The type checker needs to ensure that the right-hand side of each case branch has the right type.

What’s the expected right-hand side type? According to the class definition, it’s StatComputation st comp, where, according to the instance definition, comp ~ CombineCompTy compa compb. So it depends on the particular compa and compb, and in order to reduce the CombineCompTy type family, the type checker needs to know whether both compa and compb are Chunked or not.

How does the type checker know the value of compa or compb? In general it doesn’t — unless we pattern-match on the values of the corresponding computations. Then the GADT machinery would tell us — by the definition of StatComputation, if the value is constructed with the ChunkedComputation constructor, then the corresponding type variable is necessarily Chunked, otherwise, if it’s the ByteOnlyComputation constructor, then the value is necessarily ByteOnly.

Note that if we wrote CombineCompTy without wildcard patterns, enumerating all four possibilities, then the type checker would have had to know the values of both compa and compb.

One other thing worth noting is that this representation of combined stats allows having duplicates: 'Words '::: 'Words represents a combined statistic of two word counts. This would probably be frowned upon by ⊤rue type safety zealots, but it is good enough for our purposes, and, moreover, this particular feature will come in handy quite soon.

Using stats

Great, we wrote a bunch of code. How do we actually use it?

Suppose we’re given a type satisfying the Statistic class and a ByteString to compute statistics over. We first unwrap the computation of said instance. If it’s a ChunkedComputation, we run it on the whole input string. Otherwise it’s a ByteOnlyComputation, and we BS.foldl' over the input byte by byte.

Or, in code:

wc :: forall s res st comp. Statistic s res st comp => BS.ByteString -> res
wc s = extractState $! runCompute computation
  where
	runCompute :: StatComputation st comp -> st
	runCompute (ByteOnlyComputation step) = BS.foldl' step initState s
	runCompute (ChunkedComputation _ chunker) = chunker initState s

Fundeps save the day again here, since the type checker is able to figure out all of the class arguments just by looking at res. On the other hand, unfortunately, the type checker seemingly isn’t able to infer the type of runCompute, so it has to be specified explicitly. Thus the variables st and comp in its signature have to be the same ones as in the top-level wc type, hence the forall keyword along with the ScopedTypeVariables extension.

So, having this wc function, we select the statistics we want to compute either by an explicit (return) type annotation:

let result = wc someBS :: Tagged 'Words ::: Tagged 'Lines

or using the TypeApplications extension and explicitly giving the value of the s type variable in its signature:

let result = wc @('Words '::: 'Lines) someBS

Both invocations are equivalent, but the latter is a bit more appealing to me, as, IMO, it communicates the intent a bit better.

Performance, preliminarily

So, do all our efforts really pay off, or did we just do all this type classes mumbo-jumbo in vain?

Let’s measure how long wc @'Words takes, using the same benchmarking strategy as last time. The best running time is 1.51 seconds — a bit more than counting bytes, words and lines in the unmodularized version.

How good is the compiler at eliding unnecessary computations with a representation like this? Let’s measure wc @('Words '::: 'Words) to find out!

But here the weirdness begins. I expected to see something that’s maybe a bit slower, ideally something that runs in pretty much the same time, but… It actually runs faster, finishing in 1.34 seconds. And wc @('Words '::: 'Words '::: 'Words) takes about 1.30 seconds. Adding more Words doesn’t help further, though.

There’s something even weirder. I’ve discussed this with some folks on the #haskell freenode channel and, in particular, I’ve asked if they were able to reproduce this behaviour. They weren’t. Their results were pretty consistent and independent of the number of duplicates of the same computation.

I don’t have a good explanation for this. I spent some time looking at GHC Core — to no avail. If it was just the performance improvement from “more work to be done”, then I’d probably be able to propose a few hypotheses about the behaviour of the inliner, or the specializer, or something along those lines. But, given that it doesn’t reproduce on at least some of the others’ machines… I don’t know. And I can’t say I like this.

Anyway, enough for the ramblings, let’s proceed with our quick and dirty performance measurements. What about all the three statistics we had previously? Let’s measure wc @('Bytes '::: 'Words '::: 'Lines)!

The run time of that one is 1.53 seconds. It’s a bit worse than the 1.45 seconds we had previously with all three statistics hardcoded into a single tight loop, but, in my opinion, this is still a good result. So we’ve empirically proven that our approach stands a chance. Let’s see what benefits we can reap from this modularization.

Testing

Oh, testing this is a pleasure! Local reasoning enables us to test each individual statistic in isolation and still be confident all of them work correctly in any combination.

For instance, we can check that, for any string, the result of the Words statistic on that string is equal to the length of the list returned by the standard words function applied to the same string (with non-ASCII spaces replaced by ordinary characters, since our statistic only treats ASCII whitespace as word separators). In fact, we can postulate this is the defining property of the Words statistic, but that’s more of a question of philosophy and attitude.

In code, we’ll just have some QuickCheck-style properties written for both ASCII and UTF-8 inputs:

import Data.Char (chr, isSpace)
import Data.List (genericLength)

import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
-- and some testing framework imports

import Data.WordCount

wrapUnicode :: UnicodeString -> (BS.ByteString, T.Text)
wrapUnicode ustr = (T.encodeUtf8 txt, txt)
  where
	txt = T.pack $ getUnicodeString ustr

replaceNonAsciiSpaces :: Char -> Char
replaceNonAsciiSpaces ch | ch >= chr 127 && isSpace ch = '_'
						 | otherwise = ch

main :: IO ()
main = hspec $ parallel $ modifyMaxSuccess (const 10000) $ modifyMaxSize (const 1000) $ do
  describe "ASCII support" $ do
	it "Counts bytes correctly" $ property $
	  \(getASCIIString -> str) -> wc @'Bytes (BS.pack str) `shouldBe` genericLength str
	it "Counts chars correctly" $ property $
	  \(getASCIIString -> str) -> wc @'Chars (BS.pack str) `shouldBe` genericLength str
	it "Counts words correctly" $ property $
	  \(getASCIIString -> str) -> wc @'Words (BS.pack str) `shouldBe` genericLength (words str)
	it "Counts lines correctly" $ property $
	  \(getASCIIString -> str) -> wc @'Lines (BS.pack str) `shouldBe` genericLength (filter (== '\n') str)
  describe "UTF8 support" $ do
	it "Counts bytes correctly" $ property $
	  \(wrapUnicode -> (bs, _))   -> wc @'Bytes bs `shouldBe` fromIntegral (BS.length bs)
	it "Counts chars correctly" $ property $
	  \(wrapUnicode -> (bs, txt)) -> wc @'Chars bs `shouldBe` fromIntegral (T.length txt)
	it "Counts words correctly" $ property $
	  \(wrapUnicode -> (bs, txt)) -> wc @'Words bs `shouldBe` genericLength (T.words $ T.map replaceNonAsciiSpaces txt)
	it "Counts lines correctly" $ property $
	  \(wrapUnicode -> (bs, txt)) -> wc @'Lines bs `shouldBe` fromIntegral (T.count "\n" txt)

And that’s about it for the tests!

A couple of things worth mentioning: the UTF-8 word counting property replaces non-ASCII spaces before comparing against T.words, since our Words statistic (like wc in the C locale) only recognizes ASCII whitespace as word separators; and the standard words, T.words and T.count functions effectively serve as the reference implementations defining what “correct” means for our statistics.

Anyway, the reader is now invited to try implementing something similar for the GNU Coreutils C version.

Handling user options

We’ll use the optparse-applicative library to do command-line options parsing. We introduce a type for the command-line options along with a parser for it:

import Options.Applicative

data Options = Options
  { countBytes :: Bool
  , countChars :: Bool
  , countLines :: Bool
  , countMaxLineLength :: Bool
  , countWords :: Bool
  , files :: [FilePath]
  }

options :: Parser Options
options = Options
  <$> switch (long "bytes" <> short 'c' <> help "print the byte counts")
  <*> switch (long "chars" <> short 'm' <> help "print the character counts")
  <*> switch (long "lines" <> short 'l' <> help "print the newline counts")
  <*> switch (long "max-line-length" <> short 'L' <> help "print the maximum display width")
  <*> switch (long "words" <> short 'w' <> help "print the word counts")
  <*> some (argument str (metavar "FILES..."))

We start our main with actually running that parser and mapping the options onto the values of the Statistics type we had previously, counting bytes, words and lines if the user hasn’t selected anything explicitly:

main :: IO ()
main = do
  Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
  let selectedStats = map snd $ filter fst [ (countBytes, Bytes), (countChars, Chars)
										   , (countWords, Words), (countMaxLineLength, MaxLL)
										   , (countLines, Lines)
										   ]
  let stats | null selectedStats = [Bytes, Words, Lines]
			| otherwise = selectedStats

All of this can likely be done way more elegantly, but this is the most straightforward approach that gets the boring parts out of the way, giving us the stats list to work with.

Given that list, what do we do? We need to convert it to a type that we could feed to wc. So, in other words, we have a term, and we need to treat it as a type. Sounds exactly like dependent types!

Almost dependent types

This approach is doomed to fail performance-wise, but it’s still worth giving it a shot to see, on a real example, how we can simulate something dependently-typed-ish in modern Haskell (and just how bad it will be). Let’s go ahead and build the solution step by step, preferably avoiding things like singletons to get a better feel of the underlying type system mechanics.

We don’t know what statistics we’ll compute at run time, so we’re wrapping it in an existential type:

data SomeStats where
  MkSomeStats :: Statistic s res st comp => proxy s -> SomeStats

Here proxy s is basically a witness of a specific Statistic instance. Its sole purpose is to ensure we have a type around to get a hold on.

Given a value of this existential type, how do we actually use it? We might try something like

wc' :: SomeStats -> BS.ByteString -> ?
wc' (MkSomeStats (_ :: proxy s)) input = wc @s input

but… What type do we write instead of the ?? What’s the return type of this function? It’s the res type from the corresponding Statistic instance, but we don’t know what exactly it is!

What if we modify our existential to keep track of it as well, like this?

data SomeStats where
  MkSomeStats :: Statistic s res st comp => proxy1 s -> proxy2 res -> SomeStats

and pattern-match like

wc' :: SomeStats -> BS.ByteString -> res
wc' (MkSomeStats (_ :: proxy1 s) (_ :: proxy2 res)) input = wc @s input

This obviously won’t work even syntactically: res is not in scope in the type signature. And even if it somehow was magically syntactically correct, then res would escape the existential — something a reasonable type system should prevent.

What do we do?

Let’s take a step back. We indeed don’t know the exact return type of wc, but the insight is that we don’t really need to! We only care about, say, printing it to the user. So, in other words, we only care if it’s able to, say, show this result. Let’s just write this down!

data SomeStats where
  MkSomeStats :: (Statistic s res st comp, Show res) => proxy s -> SomeStats

Now we can write our wc' as

wc' :: SomeStats -> BS.ByteString -> String
wc' (MkSomeStats (_ :: proxy s)) input = show $ wc @s input

and this type checks.

Now, how do we actually convert our input of stats to SomeStats?

First, we need to be able to promote individual statistics (members of the stats list):

promoteStat :: Statistics -> SomeStats
promoteStat Bytes = MkSomeStats (Proxy :: Proxy 'Bytes)
promoteStat Chars = MkSomeStats (Proxy :: Proxy 'Chars)
promoteStat Words = MkSomeStats (Proxy :: Proxy 'Words)
promoteStat MaxLL = MkSomeStats (Proxy :: Proxy 'MaxLL)
promoteStat Lines = MkSomeStats (Proxy :: Proxy 'Lines)

It’s a bit ugly that we have to explicitly enumerate all the values of Statistics, but this is a limitation of the Haskell type system. So much for the ersatz dependent types: even though, say, the term Bytes and the (promoted) type 'Bytes look the same, they are very different entities to the type checker, with no a priori relationship between them, and we need to establish that relationship manually by means of a function like promoteStat.

Anyway, given this promoteStat we can write a function to promote a whole list of stats:

promoteStats :: [Statistics] -> SomeStats
promoteStats [s] = promoteStat s
promoteStats (s:ss) =
  case (promoteStat s, promoteStats ss) of
	   (MkSomeStats (_ :: proxy1 st), MkSomeStats (_ :: proxy2 sst))
								   -> MkSomeStats (Proxy :: Proxy (st '::: sst))

Promoting a single element list boils down to using promoteStat.

Promoting a list with two or more elements is more interesting and might seem a bit magic. We use promoteStat and promoteStats to promote the head and the tail of the list respectively. Then we pattern-match on the results of these promotions, and we bind st to the (unknown) type which is the result of promoting the head of the list, and we bind sst to the (unknown) type which is the result of promoting the tail.

We don’t really know anything about those types except they implement Statistic (because that’s the constraint in the existential type). But if they do implement Statistic, then st ::: sst also implements Statistic — that’s precisely the gist of the combining instance we’ve written above! Moreover, we also know that rest and resst (that is, some imaginary unnamed type variables corresponding to the statistics computations result types) implement Show. So we can derive that rest ::: resst also implements Show, and this is precisely the result type of st ::: sst. So, all in all, MkSomeStats (Proxy :: Proxy (st '::: sst)) is perfectly well-typed.

Indeed, it might seem a bit marvelous that the type checker is able to figure all this out!

Also, this is a partial function (we don’t handle []), but let’s keep the exposition simple. It’d be a bit more involved with NonEmpty, blurring the focus somewhat.
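For reference, a total version over NonEmpty might look roughly like this (a sketch, not from the original code; promoteStats' is a made-up name):

import Data.List.NonEmpty (NonEmpty(..))

promoteStats' :: NonEmpty Statistics -> SomeStats
promoteStats' (s :| []) = promoteStat s
promoteStats' (s :| (s' : ss)) =
  case (promoteStat s, promoteStats' (s' :| ss)) of
       (MkSomeStats (_ :: proxy1 st), MkSomeStats (_ :: proxy2 sst))
                                   -> MkSomeStats (Proxy :: Proxy (st '::: sst))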

Anyway, given this function, the usage is pretty straightforward:

main :: IO ()
main = do
  -- obtaining `stats` as before
  forM_ files $ \path -> do
	contents <- unsafeMMapFile path
	putStrLn $ wc' (promoteStats stats) contents

Performance, wonders

How well does this perform? Let’s run again to find out!

Running the lines calculation looks promising, taking 1.05 seconds to complete. That’s about as fast as doing BS.count 10 contents.

But that’s a chunked stat that takes the whole input at once in this case. What about something that should go over the input byte-by-byte, like the word counter? That’s gonna be… 14 seconds, as opposed to circa 1.5 we had previously.

Gosh, 14 seconds.

And it allocates like crazy: if we run the program with +RTS -sstderr, we’ll see

  74,873,139,008 bytes allocated in the heap

I haven’t systematically measured the heap allocations for the previous versions, but it always was under 1 megabyte. Well, at least this version is still O(1) on memory — most of those allocations just immediately die in the nursery (60,512 bytes maximum residency).

What if we count words and lines? 27 seconds, 120 gigs allocated in the heap.

What about words, lines and bytes? Can you guess?

If you expected something around 42 seconds, please pat yourself on the back — it runs in 41 seconds and allocates 194 gigs. Maximum residency as per RTS is still about 60 kilobytes, though — the allocated bytes count is the total memory ever allocated, it is not the peak memory consumed.

Why is it so bad?

Well, when there is a function like

wc' (MkSomeStats (_ :: proxy s)) input = show $ wc @s input

then the compiler can’t possibly know what computation is associated with the type variable s — it’s only really known at run time, once the user makes their choice. Thus the compiler cannot inline the specific computation for that s — in fact, it has no choice but to compile this down to explicit Statistic class methods dictionary passing without any inlining at all.

So, what happens here?

The wc' function receives a pointer to the computation function wrapped inside the SomeStats existential (along with pointers to the rest of the Statistic methods) and passes it on to wc, which calls that function through the pointer for every input byte. No inlining, no related optimizations, no tight loops, but a function call on each iteration. That’s about 1.8 billion calls for our input data, which is surely gonna be slow!
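To get some intuition for what “dictionary passing” means here, this is roughly the shape of the erased program (only roughly: the names below are made up, and actual GHC Core looks different):

import Data.Word (Word8)
import qualified Data.ByteString as BS

-- The class becomes an ordinary record of functions that is passed around at run time.
data StatisticDict res st = StatisticDict
  { dInitState    :: st
  , dExtractState :: st -> res
  , dStep         :: st -> Word8 -> st
  }

-- The byte-only path of wc' then amounts to calling through these fields,
-- with dStep invoked once per input byte.
wcDyn :: StatisticDict res st -> BS.ByteString -> res
wcDyn dict input = dExtractState dict (BS.foldl' (dStep dict) (dInitState dict) input)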

And the caller is responsible for packing the right function pointers into the existential.

For instance, when the stats list consists of a single element, so the

promoteStats [s] = promoteStat s

equation kicks in, and s turns out to be, for instance, Words, promoteStat gets evaluated according to the

promoteStat Words = MkSomeStats (Proxy :: Proxy 'Words)

equation. So it wraps the pointers to the methods of the Statistic implementation for Words in the SomeStats existential.

The other promoteStats equation,

promoteStats (s:ss) =
  case (promoteStat s, promoteStats ss) of
		 (MkSomeStats (_ :: proxy1 st), MkSomeStats (_ :: proxy2 sst)) -> MkSomeStats (Proxy :: Proxy (st '::: sst))

is a bit more involved: for example, the pointer for computation would be built roughly as follows. The case on the right-hand side of the equation extracts the pointers to the computation implementations from the existentials returned by promoteStat and the recursive call to promoteStats and, simplifying a bit, passes them on to the computation of the Statistic instance for the “inductive” sa ::: sb case. That last function just invokes those earlier pointers one after the other.

So, if the options list consists of two stats, we’re paying the overhead of about 13-14 seconds twice, and the total run time should be about 28 seconds — just as we’ve observed. Shall we desire all five statistics, the run time would be around 65-70 seconds.

Understanding how existentials (and type erasure in general) work leads us to expect this approach to have terrible performance.

The ugly way

But all this helps us to narrow the possible solutions space down. We need to make sure the compiler sees the exact (monomorphized) type of wc at each invocation point. Ideally, we’d have something like

main = do
  -- ..
  case stats of
	   [Words] -> print $ wc @'Words contents
	   [Bytes] -> print $ wc @'Bytes contents
	   [Lines] -> print $ wc @'Lines contents
	   [Words, Bytes] -> print $ wc @('Words '::: 'Bytes) contents
	   [Lines, Bytes] -> print $ wc @('Lines '::: 'Bytes) contents
	   -- ...

and so on. But if we have just five statistics we might compute, then basic combinatorics shows that we’ll have 2⁵ - 1 = 31 case branches. Obviously, that’s not gonna work.

That is, unless the compiler does the job for us. For this, we’ll need to resort to some Template Haskell. We’ll write a (meta-)function dispatch that will be used as follows:

  contents <- unsafeMMapFile path
  putStrLn $ $(dispatch 'wc 'contents) stats

Here the $(dispatch 'wc 'contents) splice creates a function doing a case analysis of the stats argument using the generated list of alternatives and invoking the wc function as needed.

This is a pretty technical exercise in Template Haskell, so I’m not describing the thought process behind it. Here’s the result of that process:

dispatch :: Name -> Name -> Q Exp
dispatch fun bs = reify ''Statistics >>= \case
  TyConI (DataD _ _ _ _ cons _) -> do
	let consNames = [ name | NormalC name _ <- cons ]
	let powerset = filterM (const [True, False]) consNames
	let matches = buildMatch fun bs <$> filter (not . null) powerset
	fallbackMatch <- (\body -> Match WildP (NormalB body) []) <$> [e| error "Unexpected input" |]
	pure $ LamCaseE $ matches <> [fallbackMatch]
  _ -> fail "unsupported type"

buildMatch :: Name -> Name -> [Name] -> Match
buildMatch fun bs consNames = Match (ListP $ (`ConP` []) <$> consNames) (NormalB $ VarE 'show `AppE` (wcCall `AppE` VarE bs)) []
  where
	wcCall = VarE fun `AppTypeE` foldr1 f (PromotedT <$> consNames)
	f accTy promotedTy = PromotedT '(:::) `AppT` accTy `AppT` promotedTy

Long story short, we figure out all the constructors of the Statistics type, and for each non-empty subset of the constructor set — that is, for each non-empty element in powerset — we build a separate case branch using the helper buildMatch function. Then we wrap all those alternatives into a single lambda case using the {-# LANGUAGE LambdaCase #-} extension.
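For illustration, for the stats list [Bytes, Words, Lines] the generated lambda case contains a branch roughly like this (abbreviated by hand; the real expression is, of course, machine-generated):

\case
  -- ...
  [Bytes, Words, Lines] -> show (wc @('Bytes '::: 'Words '::: 'Lines) contents)
  -- ...30 more branches for the other non-empty subsets...
  _ -> error "Unexpected input"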

The only other thing that makes me cringe a little (besides the need for TH) is that the generated function expects the input stats list to be sorted. We can discharge this expectation either by carefully choosing the order of the elements in this list:

  let selectedStats = map snd $ filter fst [ (countBytes, Bytes), (countChars, Chars)
										   , (countWords, Words), (countMaxLineLength, MaxLL)
										   , (countLines, Lines)
										   ]

where they are already given in the order induced by the compiler-generated Ord instance, or by explicitly sorting the stats list. Both approaches work, and the performance penalty of sorting the stats would obviously be negligible.
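The second option is a one-liner thanks to the Ord instance that Statistics already derives (a sketch; normalizeStats is a made-up name):

import Data.List (sort)

normalizeStats :: [Statistics] -> [Statistics]
normalizeStats = sort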

So, how would this function be used in context? Let’s put all of the pieces together!

main :: IO ()
main = do
  Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
  let selectedStats = map snd $ filter fst [(countBytes, Bytes), (countChars, Chars), (countWords, Words), (countMaxLineLength, MaxLL), (countLines, Lines)]
  let stats | null selectedStats = [Bytes, Words, Lines]
			| otherwise = selectedStats
  forM_ files $ \path -> do
	contents <- unsafeMMapFile path
	putStrLn $ $(dispatch 'wc 'contents) stats

Performance, redux

How well does this approach perform? I’m not going to benchmark all 31 combinations, but just some random subset of those. Measuring is done as before: each test is run 5 times on a sample file of 1.8 gigabytes, and the minimum userspace time is taken. GNU Coreutils’ wc is run with environment variables LC_ALL=C LANG=C unless otherwise noted.

Here are the results for various subsets of the Words, Bytes, Lines, Characters and Max line length statistics (run times in seconds):

  Haskell wc     Coreutils wc
  0.00 ¹         0.00 ¹
  1.54           12.5
  1.20           12.5
  1.06 / 0.24 ²  0.26
  1.52           12.5
  1.42           8.45 ³
  2.21           12.5
  2.92           12.5

A few observations and footnotes:

Nits

Our little program can handle user options, it supports all of the statistics that wc supports, it can handle multiple files, but what is missing?

Parallelism

This is easy: we can just replace forM_ with forConcurrently_ from the async library in our main loop, resulting in:

main :: IO ()
main = do
  Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
  let selectedStats = map snd $ filter fst [(countBytes, Bytes), (countChars, Chars), (countWords, Words), (countMaxLineLength, MaxLL), (countLines, Lines)]
  let stats | null selectedStats = [Bytes, Words, Lines]
			| otherwise = selectedStats
  forConcurrently_ files $ \path -> do
	contents <- unsafeMMapFile path
	putStrLn $ $(dispatch 'wc 'contents) stats

This will handle as many files simultaneously as there are cores — or, rather, the value of the -N RTS option, which is equal to the number of cores by default for our compiled program. Alternatively, we could add another option, say, -j for jobs count.
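A possible sketch of that hypothetical -j option’s implementation, bounding the number of files processed at once with a plain semaphore (forConcurrentlyBounded_ is a made-up helper):

import Control.Concurrent.QSem
import Control.Concurrent.Async (forConcurrently_)
import Control.Exception (bracket_)

forConcurrentlyBounded_ :: Int -> [a] -> (a -> IO ()) -> IO ()
forConcurrentlyBounded_ jobs xs act = do
  sem <- newQSem jobs
  forConcurrently_ xs $ \x ->
    bracket_ (waitQSem sem) (signalQSem sem) (act x)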

What’s the overhead of this parallelism?

If we count, say, just words and bytes of a single file, it’ll run in about 1.22 seconds — almost the same as the sequential version, although the variance of the run time is higher in this case.

If we now feed our program with 6 copies of our test file (my CPU has 6 physical cores), it runs in 1.47 seconds — but with the variance even higher in this case. My guess is that it is already effectively bottlenecked on memory bandwidth at this point.

Pretty-printing

Now if we run our program, the result is presented as

Tagged 123 ::: (Tagged 456 ::: Tagged 789)

That’s not really friendly. Let’s add some pretty-printing! The easiest way is to add a method to the Statistic class:

class Statistic s res st comp | res -> s, st -> s
							  , s -> res, s -> st, s -> comp where
  -- ...
  prettyPrint :: res -> String

The implementation for base statistics is simple, for instance:

instance Statistic 'Bytes (Tagged 'Bytes) (Tagged 'Bytes) 'Chunked where
  -- ...
  prettyPrint (Tagged n) = show n <> " bytes"

The one for the combined statistics is just a bit more typing:

  prettyPrint (a ::: b) = prettyPrint a <> "\n" <> prettyPrint b

We also change our buildMatch helper to use prettyPrint:

buildMatch fun bs consNames = Match (ListP $ (`ConP` []) <$> consNames) (NormalB $ VarE 'prettyPrint `AppE` (wcCall `AppE` VarE bs)) []

And that’s it!

More input types

Our implementation unconditionally interprets input paths as files and mmaps them. This is not always going to work: for instance, it’s impossible to mmap a pipe, such as the /proc-backed file descriptor we get in a case like hwc <(cat foo | grep bar).

One way to remedy this is to check if the input is a regular file or a symbolic link — those can be mmap’ed, and we already know how to handle them.

We choose to be safe and just unconditionally avoid mmap for everything else. Any other input gets read into a lazy ByteString so that the space complexity is still O(1). A lazy ByteString is basically a list of strict chunks, so we can just fold over those in case of chunked computations:

import qualified Data.ByteString.Lazy as BSL

wcLazy :: forall s res st comp. Statistic s res st comp => BSL.ByteString -> res
wcLazy s = extractState $! runCompute computation
  where
	runCompute :: StatComputation st comp -> st
	runCompute (ByteOnlyComputation step) = BSL.foldl' step initState s
	runCompute (ChunkedComputation _ chunker) = BSL.foldlChunks chunker initState s

Alternatively, we could have introduced a class of input types and overloaded wc on that, but the structure of the problem implies we’re only ever going to have either strict or lazy ByteStrings, so this (more straightforward) approach works too, and it results in more or less the same amount of boilerplate.
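For completeness, a sketch of that alternative (the WcInput name and its methods are made up; we don’t use this in the final code):

import Data.Word (Word8)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL

-- Abstract over the two input types, exposing exactly the two ways wc consumes input.
class WcInput input where
  foldBytes  :: (st -> Word8 -> st) -> st -> input -> st
  foldChunks :: (st -> BS.ByteString -> st) -> st -> input -> st

instance WcInput BS.ByteString where
  foldBytes = BS.foldl'
  foldChunks f st str = f st str   -- a strict ByteString is a single chunk

instance WcInput BSL.ByteString where
  foldBytes = BSL.foldl'
  foldChunks = BSL.foldlChunks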

Having this function, we modify our main as follows:

  forConcurrently_ files $ \path -> do
	stat <- getFileStatus path
	if isRegularFile stat || isSymbolicLink stat
	  then countStrict stats $ unsafeMMapFile path
	  else countLazy stats $ BSL.readFile path

where we’ve added a couple of tiny helper functions:

  where
	countStrict stats act = do
	  contents <- act
	  putStrLn $ $(dispatch 'wc 'contents) stats
	countLazy stats act = do
	  contents <- act
	  putStrLn $ $(dispatch 'wcLazy 'contents) stats

And that’s it!

Handling stdin

One other thing that Coreutils wc does and our implementation doesn’t is computing statistics over stdin.

The fix is quite simple, just adding a check to the end of our main:

main = do
  -- ... as before ...

  when (null files) $ countLazy stats BSL.getContents

If the files list is empty, the forM_/forConcurrently_ loop simply doesn’t do anything, so we don’t even need to guard it.

Now we can do something like this:

cat testfile.txt | /usr/bin/time hwc-exe -cw

What’s the performance hit? The above command runs in 1.40 seconds — that’s quite a jump from the 1.22 seconds we had previously, but that seems to be the cost of avoiding mmap and using lazy ByteStrings.

Compile times and code bloat

I’m really nitpicking here, but let’s mention this too for the sake of some objectivity.

What’s the overhead of all this metaprogramming nonsense? We can measure either compilation times or the code size. So let’s do both!

If we go the “dependently-typed” route, which we can consider a baseline in this case, stack build takes about 7.9 seconds for the whole project, and the resulting binary size is 2.24 megabytes (after stripping).

The Template Haskell approach takes almost exactly three times as long to compile — 23 seconds. The code size change isn’t that big, though: it’s increased by just about 4.3%, resulting in a binary of 2.34 megabytes.

Although… A full stack build isn’t really representative of the normal development workflow anyway: stack build --fast is what one actually runs, and it might be much faster.

Again, the “dependently-typed” implementation takes about 5.6 seconds to compile with --fast, and the TH one takes… 7.8 seconds. Not bad. I can buy that.

And, since this is a very important factor for some, let’s compare this to C. As expected, Haskell loses a lot here. I won’t even benchmark the compilation time of the C version — I bet it’s less than 0.1 seconds in release mode. Regarding code bloat, the resulting binary size is 51 kilobytes on my machine or 47 kilobytes in Ubuntu packages.

Anyway, I think that’s about it for all the minor things we could have done.

And this already turns out to be quite a lengthy post, way longer than I originally had in mind, so let’s jump to

Conclusions

So, what have we achieved? In a couple of words, we’ve managed to productionize our initial wc prototype to support user options (and more statistics!) in such a way that it doesn’t compute what the user hasn’t requested, all while ensuring the performance overhead of this modularization is negligible. In a sense, this implementation is closer to the favourite principle of the C and C++ programmers — “don’t pay for what you don’t use” — than the C version.

Or, in more detail, we’ve:

  1. split our initial wc prototype into a set of small, testable and reusable pieces — one per each individual statistic;
  2. learned how to easily compose those pieces together;
  3. written some tests for the individual statistics while being certain any combination of statistics is computed correctly by construction;
  4. tried computing just the statistics chosen by the user at runtime via some techniques inspired by dependent types, and watched them fail hilariously (much to our expectation);
  5. managed to overcome the performance issues via some Template Haskell;
  6. verified the result is very close performance-wise to our initial implementation, so all this modularity does not have a significant overhead;
  7. closed a few other gaps our implementation had, ranging from parallel input processing to handling standard input.

I think that’s enough for today.

Oh, and by the way, the source code of the resulting program (modulo a couple of nits) is also available.

And there are also a couple of things worth mentioning:

  1. The dependently typed approach isn’t necessarily doomed performance-wise. We chose to feed each byte to each handler, but we could instead feed a L1-cache-sized chunk of data to each handler. Then, for instance, since on my machine L1d cache size is 32 KiB and the input file is 1.8 gigs, the overhead would have been about 0.4 ms, or 0.04% of the characteristic run time of 1 second, which is totally reasonable.
  2. Our implementation still isn’t completely equivalent to wc, since, for instance, the char count only supports ASCII and UTF-8 encodings. Although I haven’t seen a machine with a different encoding in maybe 18 years, ever since I played around with a KOI8-R-based distro, support for other locales’ encodings can certainly be useful. Or, as another example, this implementation counts words slightly differently than wc, which could also be an important difference in some applications.
    The nice part about this approach, though, is that it’s super easy and super cheap to add more statistics, implementing different algorithms, without affecting the existing ones. Modularity FTW!

Edit history

Apr 05: expanded the conclusion and fixed a few typos in the text where Statistics occurred instead of Statistic.