The joys and perils of beating C with Haskell: productionizing wc
Last time
we’ve looked at implementing a toy wc
-like program
and we’ve also compared its performance against the full-blown Unix wc
.
The results were quite interesting:
our implementation managed to beat wc
by a factor of 5.
Of course, that’s quite an unfair comparison:
our implementation is hardcoded to count just the bytes, lines and words.
wc
, on the other hand, has command-line options to select specific statistics,
it supports some additional ones like maximum line length,
it treats Unicode spaces properly (in an Unicode-aware locale, of course),
and so on.
In other words, it’s better to consider what we’ve done last time
as a proof-of-concept showing that it’s possible to achieve (and overcome)
C-like performance on this task, even if with all those concessions.
Today we’ll look at ways of productionizing the toy program from the previous post. Our primary goal here is allowing the user to select various statistics, computing just what the user has selected to compute. We’ll try to do this in a modular and composable way, striving to isolate each statistic into its own unit of some sorts.
Indeed, if we look at the C version — well, personally I wouldn’t call that as a prime example of readable and maintainable code, as different statistics are computed in a single big 370-lines-long function. This is something we’ll try to avoid here.
Moreover, we’ll try to express that certain statistics like byte count or lines count
can be computed more efficiently if we don’t have to look at each byte,
while other statistics like word count or max line length just need to look at each byte one by one
(unless one does some clever and non-trivial broadword programming or SIMD-enabled things,
which is beyond the scope of this post).
For instance, byte count can be computed in O(1)
if we know we’re reading from a file —
we can just take the file size and call it a day!
In addition to that, we will, among other things:
- implement more statistics with ease, enjoying local reasoning;
- throw up some tests, enjoying local reasoning once more;
- try out some kinda-dependently-typed techniques, successfully obtaining working code but failing spectacularly on the performance side of things;
- play around with Template Haskell;
- marvel at the (un)predictability and (un)reproducibility of the resulting code performance.
So, here’s the code from the previous post that we begin with:
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}
module Data.WordCount where
import qualified Data.ByteString.Lazy as BS
data State = State
{ bs :: Int
, ws :: Int
, ls :: Int
, wasSpace :: Int
}
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws + 1 - wasSpace, ls)
where
State { .. } = BS.foldl' go (State 0 0 0 1) s
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
where
isSp | c == 32 || c - 9 <= 4 = 1
| otherwise = 0
addLine | c == 10 = 1
| otherwise = 0
addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}
We want to break it down into separate functions computing bytes, words and lines count respectively. How can we do this?
Composable left folds
First of all, note that what we’re doing is a fold —
hey, even the name of the function BS.foldl'
tells us so!
Some time ago I’ve come across foldl —
a library for "composable, streaming, and efficient Left folds".
This sounds like exactly what we need!
Moreover, luckily for us, it even has a
module
for folding over ByteString
s.
In particular, it already has functions for 2 of the 3 statistics we’ve implemented:
bytes count is equal to the input string length which is length
in said module,
while count
allows counting the lines (via count 10
).
Looks like we just need to write a fold to count the words, and we’re golden!
So here is the said fold:
{-# LANGUAGE Strict #-}
import qualified Control.Foldl as L
import qualified Data.ByteString as BS
data WordState = WordState { ws :: Int, wasSpace :: Int }
wordsCount :: L.Fold BS.ByteString Int
wordsCount = L.Fold (BS.foldl' go) (WordState 0 1) (\WordState { .. } -> ws + 1 - wasSpace)
where
go WordState { .. } c = WordState (ws + addWord) isSp
where
isSp | c == 32 || c - 9 <= 4 = 1
| otherwise = 0
addWord = (1 - wasSpace) * isSp
Now we can express counting bytes, lines and words simultaneously as
import qualified Control.Foldl.ByteString as BL
main :: IO ()
main = do
[path] <- getArgs
contents <- unsafeMMapFile path
let res = BL.fold ((,,) <$> BL.length <*> BL.count 10 <*> wordsCount) (BSL.fromStrict contents) :: (Int, Int, Int)
print res
Composable indeed! But what is the run time?
If I benchmark it as described in the previous post (running on a 1.8 gigabytes file residing in tmpfs
),
I get about 2.5 seconds on my machine.
Note it’s a different machine from the one I did the benchmarks for the previous post —
it’s my development machine with a different CPU and more sources of noise (IDE, browser, etc),
but it’ll do to get a rough idea of the performance.
Ok, so 2.5 seconds. That’s almost twice as bad as it was before.
What if we count just the length and words?
let res = BL.fold ((,) <$> BL.length <*> wordsCount) contents :: (Int, Int)
print res
1.55 seconds. Hmm. What about just the lines count?
let res = BL.fold (BL.count 10) contents :: Int
print res
1.05 seconds.
Dang. It’s almost additive. But it shouldn’t be additive.
The lines count (boiling down to counting '\n'
) should be dwarved
by the way more complicated logic of words counting, but it isn’t.
Something strange’s going on here.
Let’s get our hands dirty and look at what the foldl
library does.
In general,
foldl
feeds a chunk of input to each of the folds composed Applicative
ly.
foldl
folds over lazy ByteString
s,
and a chunk here is the strict ByteString
that the lazy one consists of.
In this particular case, that’s about 256 KiB worth of data,
which definitely does not fit L1 cache, so we end up paying quite some CPU cycles to wait
for L2 → L1 cache and even L1 cache → register transfers
(since otherwise the data we’re working with is readily available).
Sure, we could have just decreased the size of the chunk to 16-32 KiB so that it fits L1, but that’s not as interesting.
And, what’s worse, the compiler seemingly isn’t able to elide unnecessary computations:
the run time of BL.fold ((,) <$> wordsCount <*> wordsCount) contents
is twice as much as
the run time of BL.fold wordsCount contents
.
Anyway, that approach is not gonna fly.
Moreover, it’s not entirely obvious how to compose the folds “on demand”, say, based on the program command-line options.
So let’s write…
Our very own folds
Before we actually get to coding, let’s think a little about the design. Remember that ultimately we need to compose the folds according to what the user has selected. How it might even look like?
Let’s say we have base, “atomic” folds (that is, statistics) f1
, f2
, f3
at our disposal,
and the user requests the composition of f1
and f3
.
I wouldn’t really expect the compiler to be smart enough to see that code like
-- options is a list of booleans per individual statistic
options <- parseCliOptions
-- theFold is the resulting composition
let theFold = foldl' f (zip options [f1, f2, f3]) emptyFold
where
f acc (True, stat) = acc `compose` stat
f acc (False, _) = acc
implies that the compiler needs to specialize all the compositions of folds that might arise.
The only way to be sure the compiler has all the relevant information is to lift this to the type level — Haskell ultimately gets rid of the types during compile-time, so any information conveyed via them must be usable by the compiler.
Types to the rescue
Given that, how do we represent a fold? One type-friendly way of doing this is via a type class.
And, before we get to that, let’s just enumerate the type system extensions we’ll need to get them out of the way:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses, UndecidableInstances, FlexibleInstances #-}
{-# LANGUAGE TypeFamilyDependencies, FunctionalDependencies, PolyKinds, DataKinds, GADTs, TypeOperators #-}
{-# LANGUAGE ScopedTypeVariables #-}
So what constitutes a fold in our case, which is some statistic over the input? A fold-computed statistic is defined by:
- the initial state,
- the step function updating the state according to an input character,
- the function to extract the final result from the state,
- the state and the result types.
Let’s write this down:
class Statistic s res st | res -> s, st -> s
, s -> res, s -> st
where
initState :: st
extractState :: st -> res
step :: st -> Word8 -> st
here s
denotes the statistics identifier
(it will be a DataKinds
-thing later on),
and we use functional dependencies to denote that
the statistic uniquely determines the result type and the state type,
and vice versa.
In particular, this allows us to avoid
spelling out all of the class type variables in the methods signatures,
and it suffices to mention just the ones that we actually need.
Otherwise we’d need something like
initState :: proxy1 s -> proxy2 res -> st
extractState :: proxy s -> st -> res
step :: proxy1 s -> proxy2 res -> st -> Word8 -> st
Ugh. Ugly. Good we don’t need to spell out as much.
Chunked statistics
Anyway, this one looks like a good start, but something’s missing.
Let’s think how we might express the bytes count statistic, for instance.
If the user has requested just this statistic,
then there’s no point in folding over the input and incrementing a counter on each step.
Instead, we can just take the length of the whole input chunk (as in BS.length
)
and call it a day.
That’s O(1) on the input length.
On the other hand, the word counting or, even worse, max line length counting statistic does not have an easily expressible similarly “chunked” computation mode. They necessarily need to look at each byte of the input (modulo some SIMD implementations that we don’t consider here — but they still are O(n) on the input).
Moreover, we’ll later want to combine different statistics so that if both stats support chunked mode of computation, then we still do chunked, but otherwise we degrade to byte-level.
So how do we express that some statistics support both “chunked” computations as well as byte-level ones, while others only support the latter? GADTs to the rescue! We introduce an enumeration to denote the sort, or computation mode of the statistic, and we also introduce a GADT wrapping the functions that make sense for that sort. Or, in code:
data StatCompTyOf = Chunked | ByteOnly
data StatComputation st compTy where
ChunkedComputation :: (st -> Word8 -> st)
-> (st -> BS.ByteString -> st)
-> StatComputation st 'Chunked
ByteOnlyComputation :: (st -> Word8 -> st)
-> StatComputation st 'ByteOnly
Here (and in the subsequent text) BS
is the strict ByteString
module.
We also modify our Statistic
class to account for that
(note the new comp
type parameter):
class Statistic s res st comp | res -> s, st -> s
, s -> res, s -> st, s -> comp
where
initState :: st
extractState :: st -> res
computation :: StatComputation st comp
Here functional dependencies help again.
And, in general, it’s sufficient to know any of s
, res
or st
to infer the rest of the variables.
One other thing worth noting is that we could generalize this approach and have, for instance, three types of computations: our current byte-only and chunked ones plus SIMD-accelerated that are looking at 16-32 bytes at the same time. I’m keeping just two types for simplicity, though.
Stats implementation
What statistics do we have? I choose to implement
- byte count,
- (UTF-8) characters count,
- word count,
- max line length,
- lines count.
Or in code:
data Statistics = Bytes | Chars | Words | MaxLL | Lines deriving (Eq, Ord)
This should be more than enough to illustrate the approach.
The base statistics are just instances of the Statistic
class.
And, since most statistics will have just an integer as their state,
let’s introduce a convenience typed wrapper for the state and result types:
newtype Tagged a = Tagged Word64 deriving (Eq, Show, Num)
here the a
is just a dummy parameter to distinguish, say,
Tagged 'Bytes
from Tagged 'Chars
.
Given these definitions, let’s start with the most straightforward one: the bytes count. In code:
instance Statistic 'Bytes (Tagged 'Bytes) (Tagged 'Bytes) 'Chunked where
initState = 0
extractState = id
computation = ChunkedComputation (\st _ -> st + 1) (\st str -> st + fromIntegral (BS.length str))
This is probably quite self-documenting:
- We say that
Bytes
denotes a statistic havingTagged 'Bytes
both as state type and result type and supporting chunked computations. - The initial state (that is, the byte count) is
0
. - The extraction function doesn’t do anything fancy and just returns the accumulated count.
- The computation is necessarily a
ChunkedComputation
, since we said'Chunked
on the first line. The step function ignores the input and just increments the counter, while the chunk function adds the length of the input to the counter.
Easy-peasy so far.
The rest of the statistics are implemented similarly. The implementations are boring, so I’m hiding them under a spoiler, but feel free tocheck them out
Counting lines is similarly straightforward, and this statistic also supports both byte-only and chunked computations:
instance Statistic 'Lines (Tagged 'Lines) (Tagged 'Lines) 'Chunked where
initState = 0
extractState = id
computation = ChunkedComputation (\st c -> st + if c == 10 then 1 else 0) (\st str -> st + fromIntegral (BS.count 10 str))
What about counting words? This one supports just the byte-only computation, and we largely borrow the implementation from the last post:
data WordsState = WordsState { ws :: Word64, wasSpace :: Word64 }
instance Statistic 'Words (Tagged 'Words) WordsState 'ByteOnly where
initState = WordsState 0 1
extractState WordsState { .. } = Tagged (ws + 1 - wasSpace)
computation = ByteOnlyComputation step
where
step WordsState { .. } c = WordsState (ws + (1 - wasSpace) * isSp) isSp
where
isSp | c == 32 || c - 9 <= 4 = 1
| otherwise = 0
This is a good example of a statistic with a non-trivial extraction function.
Cool, we’ve managed to port the stats that we already had. What about the new ones: the UTF-8 chars count and the max line length?
The first one is a bit more involved, but all the complexity is about the right bit mangling:
instance Statistic 'Chars (Tagged 'Chars) (Tagged 'Chars) 'ByteOnly where
initState = 0
extractState = id
computation = ByteOnlyComputation $ \cnt c -> cnt + 1 - fromIntegral ( ((c .&. 0b10000000) `shiftR` 7)
.&. (1 - ((c .&. 0b01000000) `shiftR` 6))
)
Here we rely on the following property of the UTF-8 encoding:
each character has one and only one byte that doesn’t follow the 10xxxxxx
bit pattern.
In other words, we don’t have to properly decode UTF-8 just for the sake of characters count.
What about the max line length? Here all the complexity is about properly accounting the non-printable characters and mimicking tabulation behaviour:
instance Statistic 'MaxLL (Tagged 'MaxLL) MaxLLState 'ByteOnly where
initState = MaxLLState 0 0
extractState MaxLLState { .. } = Tagged $ max maxLen curLen
computation = ByteOnlyComputation step
where
step MaxLLState { .. } 9 = MaxLLState maxLen $ curLen + 8 - (curLen `rem` 8)
step MaxLLState { .. } 8 = MaxLLState maxLen $ max 0 (curLen - 1)
step MaxLLState { .. } c | c == 10
|| c == 12
|| c == 13 = MaxLLState (max maxLen curLen) 0
| c < 32 = MaxLLState maxLen curLen
step MaxLLState { .. } _ = MaxLLState maxLen (curLen + 1)
wc
doesn’t take into account.
Looks a bit messy, but what you’re gonna do?
Having implemented all the base statistics, we now turn to the fun part: how do we combine them together?
Combining stats
If a
is a statistic and b
is a statistic,
then the pair of (a, b)
is also a statistic.
So let’s start with the type for a pair of statistics:
infixr 5 :::
data a ::: b = a ::: b deriving (Show)
We could have just as well used the standard Prelude pair (as in (,)
),
but this way we don’t rely on the compiler eliding laziness (since tuples are lazy),
and a custom type for statistics also makes the type-promoted code a bit more readable IMO.
Let’s now express how the statistics are combined.
First, what about the combination of chunked vs byte-only statistics? If both statistics support chunked computations, then the result can also do chunked stats, otherwise it should degrade to byte-only computation. The type-level expression of this idea would be
type family CombineCompTy a b where
CombineCompTy 'Chunked 'Chunked = 'Chunked
CombineCompTy _ _ = 'ByteOnly
How would an instance of Statistic
look like for a pair of statistics?
We might write something like this:
instance (Statistic sa resa sta compa, Statistic sb resb stb compb)
=> Statistic (sa '::: sb) (resa ::: resb) (sta ::: stb) (CombineCompTy compa compb) where
initState = initState ::: initState
extractState (a ::: b) = extractState a ::: extractState b
computation =
case (computation :: StatComputation sta compa, computation :: StatComputation stb compb) of
(ByteOnlyComputation a, ChunkedComputation b _) -> ByteOnlyComputation $ combine a b
(ChunkedComputation a _, ByteOnlyComputation b) -> ByteOnlyComputation $ combine a b
(ByteOnlyComputation a, ByteOnlyComputation b) -> ByteOnlyComputation $ combine a b
(ChunkedComputation stepA chunkA, ChunkedComputation stepB chunkB)
-> ChunkedComputation (combine stepA stepB) (combine chunkA chunkB)
where
combine fa fb = \(a ::: b) w -> fa a w ::: fb b w
So if sa
is a statistic with the result type resa
,
state type sta
and computation mode compa
,
and similarly for sb
/ resb
/ stb
/ compb
,
then the pair sa ::: sb
is a statistic with the result type being a pair resa ::: resb
,
state type being sta ::: stb
and the computation mode — you guessed it — CombineCompTy compa compb
.
Note the (ticked, promoted) :::
-as-a-constructor vs (unticked, unpromoted) :::
-as-a-type
in the instance head.
sa
and sb
are terms promoted to the type level,
so we use the (promoted) term constructor near them,
and the rest are the usual types (living in Type
),
so we use the (unpromoted) type constructor there.
All is good, except… We can’t write this instance.
Having a type family in the instance head in current Haskell is illegal.
The easy way to fix this is to introduce a fresh type variable comp
and
add the constraint that it’s equal to the result of the type family application:
instance (Statistic sa resa sta compa,
Statistic sb resb stb compb,
comp ~ CombineCompTy compa compb)
=> Statistic (sa '::: sb) (resa ::: resb) (sta ::: stb) comp where
The rest of the instance definition remains the same.
Looking at the actual term-level stuff, the first two methods are easy and straightforward:
- Initial state of a pair of statistics is a pair of the corresponding initial states.
- Extracting state of a pair of statistics is a pair of the extracted states of the individual statistics.
Note we don’t need a single type annotation here!
The third method for building the statistics computation for the pair
is somewhat more boilerplate-ish,
but it largely follows the definition of the CombineCompTy
type
(and the type checker ensures that it indeed does follow that definition).
Namely, if both computations support chunked stats,
the resulting statistic also supports those (that’s the last clause).
Otherwise we fall back to the byte-only computation.
The major part of the boilerplate comes from the need to
explicitly pattern-match on at least one of the computers.
Why?
The type checker needs to ensure that the right-hand side
of each case
branch has the right type.
What’s the expected right-hand side type?
According to the class definition, it’s StatComputation st comp
,
where, according to the instance definition, comp ~ CombineCompTy compa compb
.
So it depends on the particular compa
and compb
,
and in order to reduce the CombineCompTy
type family,
the type checker needs to know whether both compa
and compb
are
Chunked
or not.
How does the type checker know the value of compa
or compb
?
In general it doesn’t — unless we pattern-match on the values of the corresponding computation
s.
Then the GADT machinery would tell us — by the definition of StatComputation
,
if the value is constructed with the ChunkedComputation
constructor,
then the corresponding type variable is necessarily Chunked
,
otherwise, if it’s the ByteOnlyComputation
constructor,
then the value is necessarily ByteOnly
.
Note that if we wrote CombineCompTy
without wildcard patterns,
enumerating all four possibilities, then the type checker would have had to know
the values of both compa
and compb
.
One other thing worth noting is that this representation of combined stats
allows having duplicates:
'Words '::: 'Words
represents a combined statistic of two word counts.
This would probably be frowned upon by ⊤rue type safety zealots,
but it is good enough for our purposes, and, moreover,
this particular feature will come in handy quite soon.
Using stats
Great, we wrote a bunch of code. How do we actually use it?
Suppose we’re given an type satisfying the Statistic
class
and a ByteString
to compute statistics over,
We first unwrap the computation
of said instance.
If it’s a ChunkedComputation
, we run it on the whole input string.
Otherwise it’s a ByteOnlyComputation
,
and we BS.foldl'
over the input byte-by-byte.
Or, in code:
wc :: forall s res st comp. Statistic s res st comp => BS.ByteString -> res
wc s = extractState $! runCompute computation
where
runCompute :: StatComputation st comp -> st
runCompute (ByteOnlyComputation step) = BS.foldl' step initState s
runCompute (ChunkedComputation _ chunker) = chunker initState s
Fundeps save the day again here,
since the type checker is able to figure out all of the class arguments
just by looking at res
.
On the other hand, unfortunately, the type checker seemingly isn’t able
to infer the type of runCompute
, so it has to be specified explicitly.
Thus the variables st
and comp
in its signature
have to be the same ones as in the top-level wc
type,
hence the forall
keyword along with the ScopedTypeVariables
extension.
So, having this wc
function, we select the statistics we want to compute
either by an explicit (return) type annotation:
let result = wc someBS :: Tagged 'Words ::: Tagged 'Lines
or using the TypeApplications
extension
and explicitly giving the value of the s
type variable in its signature:
let result = wc @('Words '::: 'Lines) someBS
Both invocations are equivalent, but the latter is a bit more appealing to me, as, IMO, it communicates the intent a bit better.
Performance, preliminarily
So, do all our efforts really pay off, or did we just do all this type classes mumbo-jumbo in vain?
Let’s measure how long does wc @'Words
take, using the same benchmarking strategy as the last time.
The best running time is 1.51 seconds — a bit more than counting bytes, words and lines in the unmodularized version.
How good is the compiler at eliding unnecessary computations with a representation like this?
Let’s measure wc @('Words '::: 'Words)
to find out!
But here the weirdness begins.
I expected to see something that’s maybe a bit slower,
ideally something that runs in pretty much the same time, but…
It actually runs faster, finishing in 1.34 seconds.
And wc @('Words '::: 'Words '::: 'Words)
takes about 1.30 seconds.
Adding more Words
doesn’t help further, though.
There’s something even weirder.
I’ve discussed this with some folks on the #haskell
freenode channel
and, in particular, I’ve asked if they were able to reproduce this behaviour.
They weren’t.
The results were pretty consistent and independent on the number of duplicates of the same computation.
I don’t have a good explanation for this. I spent some time looking at GHC Core — to no avail. If it was just the performance improvement from “more work to be done”, then I’d probably be able to propose a few hypotheses about the behaviour of the inliner, or the specializer, or something along those lines. But, given that it doesn’t reproduce on at least some of the others’ machines… I don’t know. And I can’t say I like this.
Anyway, enough for the ramblings, let’s proceed with our quick and dirty performance measurements.
What about all the three statistics we had previously?
Let’s measure wc @('Bytes '::: 'Words '::: 'Lines)
!
The run time of that one is 1.53 seconds. It’s a bit worse than 1.45 seconds we had previously with all three statistics hardcoded into a single tight loop, but, in my opinion, this is still a good result. So we’ve empirically proven that our approach stands a chance. Let’s see what benefits we can reap due to this modularizaton.
Testing
Oh, testing this is a pleasure! Local reasoning enables us to test each individual statistic in isolation and still be confident all of them work correctly in any combination.
For instance, we can check that, for any string,
the result of the Words
statistic on that string
is equal to the length of the list returned by the standard words
function
applied to the same string, ignoring all spaces at codepoints above 127
.
In fact, we can postulate this is the defining property of the Words
statistic,
but that’s more of a question of philosophy and attitude.
In code, we’ll just have some QuickCheck-style properties written for both ASCII and UTF-8 inputs:
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
-- and some testing framework imports
import Data.WordCount
wrapUnicode :: UnicodeString -> (BS.ByteString, T.Text)
wrapUnicode ustr = (T.encodeUtf8 txt, txt)
where
txt = T.pack $ getUnicodeString ustr
replaceNonAsciiSpaces :: Char -> Char
replaceNonAsciiSpaces ch | ch >= chr 127 && isSpace ch = '_'
| otherwise = ch
main :: IO ()
main = hspec $ parallel $ modifyMaxSuccess (const 10000) $ modifyMaxSize (const 1000) $ do
describe "ASCII support" $ do
it "Counts bytes correctly" $ property $
\(getASCIIString -> str) -> wc @'Bytes (BS.pack str) `shouldBe` genericLength str
it "Counts chars correctly" $ property $
\(getASCIIString -> str) -> wc @'Chars (BS.pack str) `shouldBe` genericLength str
it "Counts words correctly" $ property $
\(getASCIIString -> str) -> wc @'Words (BS.pack str) `shouldBe` genericLength (words str)
it "Counts lines correctly" $ property $
\(getASCIIString -> str) -> wc @'Lines (BS.pack str) `shouldBe` genericLength (filter (== '\n') str)
describe "UTF8 support" $ do
it "Counts bytes correctly" $ property $
\(wrapUnicode -> (bs, _)) -> wc @'Bytes bs `shouldBe` fromIntegral (BS.length bs)
it "Counts chars correctly" $ property $
\(wrapUnicode -> (bs, txt)) -> wc @'Chars bs `shouldBe` fromIntegral (T.length txt)
it "Counts words correctly" $ property $
\(wrapUnicode -> (bs, txt)) -> wc @'Words bs `shouldBe` genericLength (T.words $ T.map replaceNonAsciiSpaces txt)
it "Counts lines correctly" $ property $
\(wrapUnicode -> (bs, txt)) -> wc @'Lines bs `shouldBe` fromIntegral (T.count "\n" txt)
And that’s about it for the tests!
A couple of things worth mentioning:
- We don’t check any properties for the max line length statistic since it’s… non-trivial.
- We do some
Data.Text
-related string handling, but that’s just probably the easiest way to get aByteString
containing UTF-8 bytes. - The tests also execute reasonably fast: it takes about 3-5 seconds to get through 10000 samples having max string length of 1000 characters on my machine for each property.
Anyway, the reader is now invited to try implementing something similar for the GNU Coreutils C version.
Handling user options
We’ll use the optparse-applicative library to do command-line options parsing. We introduce a type for the command-line options along with a parser for it:
data Options = Options
{ countBytes :: Bool
, countChars :: Bool
, countLines :: Bool
, countMaxLineLength :: Bool
, countWords :: Bool
, files :: [FilePath]
}
options :: Parser Options
options = Options
<$> switch (long "bytes" <> short 'c' <> help "print the byte counts")
<*> switch (long "chars" <> short 'm' <> help "print the character counts")
<*> switch (long "lines" <> short 'l' <> help "print the newline counts")
<*> switch (long "max-line-length" <> short 'L' <> help "print the maximum display width")
<*> switch (long "words" <> short 'w' <> help "print the word counts")
<*> some (argument str (metavar "FILES..."))
We start our main
with actually running that parser
and mapping the options onto the values of the Statistics
type we had previously,
counting bytes, words and lines if the user hasn’t selected anything explicitly:
main :: IO ()
main = do
Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
let selectedStats = map snd $ filter fst [ (countBytes, Bytes), (countChars, Chars)
, (countWords, Words), (countMaxLineLength, MaxLL)
, (countLines, Lines)
]
let stats | null selectedStats = [Bytes, Words, Lines]
| otherwise = selectedStats
All of this can likely be done way more elegantly, but this is the most straightforward approach
that gets the boring parts out of the way, giving us the stats
list to work with.
Given that list, what do we do?
We need to convert it to a type that we could feed to wc
.
So, in other words, we have a term, and we need to treat it as a type.
Sounds exactly like dependent types!
Almost dependent types
This approach is doomed to fail performance-wise, but it’s still worth giving it a shot to see on a real example how we can simulate something dependently type-ish in modern Haskell (and just how bad will it be). Let’s go ahead and build the solution step by step, preferably avoiding things like singletons to get a better feel of the underlying type system mechanics.
We don’t know what statistics we’ll compute at run time, so we’re wrapping it in an existential type:
data SomeStats where
MkSomeStats :: Statistic s res st comp => proxy s -> SomeStats
Here proxy s
is basically a witness of a specific Statistic
instance.
Its sole purpose is to ensure we have a type around to get a hold on.
Given an value of this existential type, how do we actually use it? We might try something like
wc' :: SomeStats -> BS.ByteString -> ?
wc' (MkSomeStats (_ :: proxy s)) input = wc @s input
but… What type do we write instead of the ?
?
What’s the return type of this function?
It’s the res
type from the corresponding Statistic
instance,
but we don’t know what exactly it is!
What if we modify our existential to keep track of it as well, like this?
data SomeStats where
MkSomeStats :: Statistic s res st comp => proxy1 s -> proxy2 res -> SomeStats
and pattern-match like
wc' :: SomeStats -> BS.ByteString -> res
wc' (MkSomeStats (_ :: proxy1 s) (_ :: proxy2 res)) input = wc @s input
This obviously won’t work even syntactically: res
is not in scope in the type signature.
And even if it somehow was magically syntactically correct,
then res
would escape the existential — something a reasonable type system should prevent.
What do we do?
Let’s take a step back.
We indeed don’t know the exact return type of wc
,
but the insight is that we don’t really need to!
We only care about, say, printing it to the user.
So, in other words, we only care if it’s able to, say, show
this result.
Let’s just write this down!
data SomeStats where
MkSomeStats :: (Statistic s res st comp, Show res) => proxy s -> SomeStats
Now we can write our wc'
as
wc' :: SomeStats -> BS.ByteString -> String
wc' (MkSomeStats (_ :: proxy s)) input = show $ wc @s input
and this type checks.
Now, how do we actually convert our input of stats
to SomeStats
?
First, we need to be able to promote individual statistics (members of the stats
list):
promoteStat :: Statistics -> SomeStats
promoteStat Bytes = MkSomeStats (Proxy :: Proxy 'Bytes)
promoteStat Chars = MkSomeStats (Proxy :: Proxy 'Chars)
promoteStat Words = MkSomeStats (Proxy :: Proxy 'Words)
promoteStat MaxLL = MkSomeStats (Proxy :: Proxy 'MaxLL)
promoteStat Lines = MkSomeStats (Proxy :: Proxy 'Lines)
It’s a bit ugly that we have to explicitly enumerate all the values of Statistics
,
but this is the limitation of the Haskell type system.
So much for the ersatz-dependent types:
even though, say, the term Bytes
and the (promoted) type 'Bytes
look the same, they are very different entities to the type checker
with no a-priori relationship between them,
and we need to establish that relationship manually by the means of a function like promoteStat
.
Anyway, given this promoteStat
we can write a function to promote a whole list of stats:
promoteStats :: [Statistics] -> SomeStats
promoteStats [s] = promoteStat s
promoteStats (s:ss) =
case (promoteStat s, promoteStats ss) of
(MkSomeStats (_ :: proxy1 st), MkSomeStats (_ :: proxy2 sst))
-> MkSomeStats (Proxy :: Proxy (st '::: sst))
Promoting a single element list boils down to using promoteStat
.
Promoting a list with two or more elements is more interesting and might seem a bit magic.
We use promoteStat
and promoteStats
to promote the head and the tail of the list respectively.
Then we pattern-match on the results of these promotions,
and we bind st
to the (unknown) type which is the result of promoting the head of the list,
and we bind sst
to the (unknown) type which is the result of promoting the tail.
We don’t really know anything about those types except they implement Statistic
(because that’s the constraint in the existential type).
But if they do implement Statistic
,
then st ::: sst
also implements Statistic
—
that’s precisely the gist of the combining instance we’ve written above!
Moreover, we also know that rest
and resst
(that is, some imaginary unnamed type variables corresponding to the statistics computations result types)
implement Show
.
So we can derive that rest ::: resst
also implements Show
,
and this is precisely the result type of st ::: sst
.
So, all in all, MkSomeStats (Proxy :: Proxy (st '::: sst))
is perfectly well-typed.
Indeed, it might seem a bit marvelous that the type checker is able to figure all this out!
Also, this is a partial function (we don’t handle []
), but let’s keep the exposition simple.
It’d be a bit more involved with NonEmpty
, blurring the focus somewhat.
Anyway, given this function, the usage is pretty straightforward:
main :: IO ()
main = do
-- obtaining `stats` as before
forM_ files $ \path -> do
contents <- unsafeMMapFile path
putStrLn $ wc' (promoteStats stats) contents
Performance, wonders
How well does this perform? Let’s run again to find out!
Running the lines calculation looks promising, taking 1.05 seconds to complete.
That’s about as fast as doing BS.count 10 contents
.
But that’s a chunked stat that takes the whole input at once in this case. What about something that should go over the input byte-by-byte, like the word counter? That’s gonna be… 14 seconds, as opposed to circa 1.5 we had previously.
Gosh, 14 seconds.
And it allocates like crazy: if we run the program with +RTS -sstderr
, we’ll see
74,873,139,008 bytes allocated in the heap
I haven’t systematically measured the heap allocations for the previous versions,
but it always was under 1 megabyte.
Well, at least this version is still O(1) on memory —
most of those allocations just immediately die in the nursery
(60,512 bytes maximum residency
).
What if we count words and lines? 27 seconds, 120 gigs allocated in the heap.
What about words, lines and bytes? Can you guess?
If you expected something around 42 seconds, please pat yourself on the back — it runs in 41 seconds and allocates 194 gigs. Maximum residency as per RTS is still about 60 kilobytes, though — the allocated bytes count is the total memory ever allocated, it is not the peak memory consumed.
Why is it so bad?
Well, when there is a function like
wc' (MkSomeStats (_ :: proxy s)) input = show $ wc @s input
then the compiler can’t possibly know what computation
is associated with the type variable s
—
it’s only really known at run time, once the user makes their choice.
Thus the compiler cannot inline the specific computation
for that s
—
in fact, it has no choice but to compile this down to explicit
Statistic
class methods dictionary passing without any inlining at all.
So, what happens here?
Thewc'
function receives a pointer to the computation
function
wrapped inside the SomeStats
existential
(along with the pointers to the rest of Statistic
methods),
and passes it on to wc
, which calls that function through the pointer for every input byte.
No inlining, no related optimizations, no tight loops, but a function call on each iteration.
So that’s about 1.8 billion of calls for our input data, which is surely gonna be slow!
And the caller is responsible for packing the right function pointers into the existential.
For instance, when the stats
list consists of a single element,
so the
promoteStats [s] = promoteStat s
equation kicks in, and s
turns out to be, for instance, Words
,
promoteStat
gets evaluated according to the
promoteStat Words = MkSomeStats (Proxy :: Proxy 'Words)
equation.
So it wraps the pointers to the methods
of the Statistic
implementation for Words
in the SomeStats
existential.
The other promoteStats
equation,
promoteStats (s:ss) =
case (promoteStat s, promoteStats ss) of
(MkSomeStats (_ :: proxy1 st), MkSomeStats (_ :: proxy2 sst)) -> MkSomeStats (Proxy :: Proxy (st '::: sst))
is a bit more involved:
for example, the pointer for computation
would be built roughly as follows.
The case
on the right-hand side of the equation
extracts the pointers to the computation
implementations
from the existentials returned by promoteStat
and the recursive call to promoteStats
and, to simplify it,
passes them on to the computation
of the implementation
of the Statistic
class for the “inductive” sa ::: sb
case.
That last function just invokes those earlier pointers one after the other.
So, if the options list consists of two stats, we’re paying the overhead of about 13-14 seconds twice, and the total run time should be about 28 seconds — just as we’ve observed. Shall we desire all five statistics, the run time would be around 65-70 seconds.
Understanding how existentials (and type erasure in general) work leads us to expect this approach to have terrible performance.
The ugly way
But all this helps us to narrow the possible solutions space down.
We need to make sure the compiler sees the exact (monomorphized) type of wc
at each invocation point.
Ideally, we’d have something like
main = do
-- ..
case stats of
[Words] -> print $ wc @'Words contents
[Bytes] -> print $ wc @'Bytes contents
[Lines] -> print $ wc @'Lines contents
[Words, Bytes] -> print $ wc @('Words '::: 'Bytes) contents
[Lines, Bytes] -> print $ wc @('Lines '::: 'Bytes) contents
-- ...
and so on.
But if we have just five statistics we might compute, then basic combinatorics shows that
we’ll have 25 - 1 = 31 case
branches.
Obviously, that’s not gonna work.
That is, unless the compiler does the job for us.
For this, we’ll need to resort to some Template Haskell.
We’ll write a (meta-)function dispatch
that will be used as follows:
contents <- unsafeMMapFile path
putStrLn $ $(dispatch 'wc 'contents) stats
Here the $(dispatch 'wc 'contents)
splice creates a function
doing a case
analysis of the stats
argument
using the generated list of alternatives
and invoking the wc
function as needed.
This is a pretty technical exercise in Template Haskell, so I’m not describing the thought process behind it. Here’s the result of that process:
dispatch :: Name -> Name -> Q Exp
dispatch fun bs = reify ''Statistics >>= \case
TyConI (DataD _ _ _ _ cons _) -> do
let consNames = [ name | NormalC name _ <- cons ]
let powerset = filterM (const [True, False]) consNames
let matches = buildMatch fun bs <$> filter (not . null) powerset
fallbackMatch <- (\body -> Match WildP (NormalB body) []) <$> [e| error "Unexpected input" |]
pure $ LamCaseE $ matches <> [fallbackMatch]
_ -> fail "unsupported type"
buildMatch :: Name -> Name -> [Name] -> Match
buildMatch fun bs consNames = Match (ListP $ (`ConP` []) <$> consNames) (NormalB $ VarE 'show `AppE` (wcCall `AppE` VarE bs)) []
where
wcCall = VarE fun `AppTypeE` foldr1 f (PromotedT <$> consNames)
f accTy promotedTy = PromotedT '(:::) `AppT` accTy `AppT` promotedTy
Long story short, we figure out all the constructors of the Statistics
type,
and for each non-empty subset of the constructor set — that is, for each non-empty element in powerset
—
we build a separate case
branch using the helper buildMatch
function.
Then we wrap all those alternatives into a single lambda case
using the {-# LANGUAGE LambdaCase #-}
extension.
The only other thing that makes me cringe a little (besides the need for TH)
is that the generated function expects the input stats
list to be sorted.
We can discharge this expectation either by carefully choosing the order of the elements in this list:
let selectedStats = map snd $ filter fst [ (countBytes, Bytes), (countChars, Chars)
, (countWords, Words), (countMaxLineLength, MaxLL)
, (countLines, Lines)
]
where they are already given in the order induced by the compiler-generated Ord
instance,
or by explicitly sorting the stats
list.
Both approaches work, and the performance penalty of sorting the stats
would obviously be negligible.
So, how would this function be used in context? Let’s put all of the pieces together!
main :: IO ()
main = do
Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
let selectedStats = map snd $ filter fst [(countBytes, Bytes), (countChars, Chars), (countWords, Words), (countMaxLineLength, MaxLL), (countLines, Lines)]
let stats | null selectedStats = [Bytes, Words, Lines]
| otherwise = selectedStats
forM_ files $ \path -> do
contents <- unsafeMMapFile path
putStrLn $ $(dispatch 'wc 'contents) stats
Performance, redux
How well does this approach perform?
I’m not going to benchmark all 31 combinations, but just some random subset of those.
Measuring is done as before: each test is run 5 times on a sample file of 1.8 gigabytes,
and the minimum userspace time is taken.
GNU Coreutils’ wc
is run with environment variables LC_ALL=C LANG=C
unless otherwise noted.
Here are the results:
Words | Bytes | Lines | Characters | Max line length | Haskell wc , s |
Coreutils wc , s |
---|---|---|---|---|---|---|
✓ | 0.00 ¹ | 0.00 ¹ | ||||
✓ | 1.54 | 12.5 | ||||
✓ | ✓ | 1.20 | 12.5 | |||
✓ | 1.06 / 0.24 ² | 0.26 | ||||
✓ | ✓ | ✓ | 1.52 | 12.5 | ||
✓ | 1.42 | 8.45 ³ | ||||
✓ | 2.21 | 12.5 | ||||
✓ | ✓ | 2.92 | 12.5 |
A few observations and footnotes:
- It still boggles my mind that counting bytes and words is faster than just counting words.
- Looks like the C version always computes some basic set of statistics (words, bytes, lines, max line length) if either words or max line length option is enabled. I’m not surprised.
- ¹ Yes,
time
literally shows0.00user
for both programs counting bytes. - ² The first Haskell number (1.06 seconds) for the bytes and lines test is with the upstream
bytestring
. The second number is with a patchedbytestring
, where thecount
method is optimized. Ironically,count
is implemented in C, and, ironically again, the C compiler is unable to optimize it reasonably well, and even a pure Haskell implementation would be faster, but all this irony would make a good subject for another post. The patch boils down to replacing that method with a bunch ofAssemblyintrinsics. With that patch, we’re basically comparing the speed of two implementations in a hand-written Assembly (or intrinsics, or whatever). No wonder it’s very close. - ³ Here
wc
is run with an UTF-8 locale, as otherwise it figures out that the characters count is equal to the bytes count and runs in constant time, while the goal is to compare UTF-8 character counting.
Nits
Our little program can handle user options,
it supports all of the statistics that wc
supports,
it can handle multiple files, but what is missing?
Parallelism
This is easy: we can just replace forM_
with forConcurrently_
from the
async
library in our main loop, resulting in:
main :: IO ()
main = do
Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
let selectedStats = map snd $ filter fst [(countBytes, Bytes), (countChars, Chars), (countWords, Words), (countMaxLineLength, MaxLL), (countLines, Lines)]
let stats | null selectedStats = [Bytes, Words, Lines]
| otherwise = selectedStats
forConcurrently_ files $ \path -> do
contents <- unsafeMMapFile path
putStrLn $ $(dispatch 'wc 'contents) stats
This will handle as many files simultaneously as there are cores —
or, rather, the value of the -N
RTS option, which
is equal to the number of cores by default for our compiled program.
Alternatively, we could add another option, say, -j
for jobs count.
What’s the overhead of this parallelism?
If we count, say, just words and bytes of a single file, it’ll run in about 1.22 seconds — almost as the sequential version, although the variance of the run time is higher in this case.
If we now feed our program with 6 copies of our test file (my CPU has 6 physical cores), it runs in 1.47 seconds — but with the variance even higher in this case. My guess is that it is already effectively bottlenecked on memory bandwidth at this point.
Pretty-printing
Now if we run our program, the result is presented as
Tagged 123 ::: (Tagged 456 ::: Tagged 789)
That’s not really friendly. Let’s add some pretty-printing!
The easiest way is to add a method to the Statistic
class:
class Statistic s res st comp | res -> s, st -> s
, s -> res, s -> st, s -> comp where
-- ...
prettyPrint :: res -> String
The implementation for base statistics is simple, for instance:
instance Statistic 'Bytes (Tagged 'Bytes) (Tagged 'Bytes) 'Chunked where
-- ...
prettyPrint (Tagged n) = show n <> " bytes"
The one for the combined statistics is just a bit more typing:
prettyPrint (a ::: b) = prettyPrint a <> "\n" <> prettyPrint b
We also change our buildMatch
helper to use prettyPrint
:
buildMatch fun bs consNames = Match (ListP $ (`ConP` []) <$> consNames) (NormalB $ VarE 'prettyPrint `AppE` (wcCall `AppE` VarE bs)) []
And that’s it!
More input types
Our implementation unconditionally interprets input paths as files and mmap
s them.
This is not always going to work:
for instance, it’s impossible to mmap
a /proc/fd
file descriptor
in a case like hwc <(cat foo | grep bar)
.
One way to remedy this is to check if the input is a regular file or a symbolic link —
those can be mmap
’ed, and we already know how to handle them.
We choose to be safe and just unconditionally avoid mmap
for everything else.
Any other input gets read into a lazy ByteString
so that the space complexity is still O(1).
A lazy ByteString
is basically a list of strict chunks,
so we can just fold over those in case of chunked computations:
import qualified Data.ByteString.Lazy as BSL
wcLazy :: forall s res st comp. Statistic s res st comp => BSL.ByteString -> res
wcLazy s = extractState $! runCompute computation
where
runCompute :: StatComputation st comp -> st
runCompute (ByteOnlyComputation step) = BSL.foldl' step initState s
runCompute (ChunkedComputation _ chunker) = BSL.foldlChunks chunker initState s
Alternatively,
we could have introduced a class of the input types and overload wc
on that one,
but the structure of the problem implies we’ll only going to have either strict or lazy
ByteString
s, so this (more straightforward) approach works too,
and it results in more or less equal amount of boilerplate.
Having this function, we modify our main
as follows:
forConcurrently_ files $ \path -> do
stat <- getFileStatus path
if isRegularFile stat || isSymbolicLink stat
then countStrict stats $ unsafeMMapFile path
else countLazy stats $ BSL.readFile path
where we’ve added a couple of tiny helper functions:
where
countStrict stats act = do
contents <- act
putStrLn $ $(dispatch 'wc 'contents) stats
countLazy stats act = do
contents <- act
putStrLn $ $(dispatch 'wcLazy 'contents) stats
And that’s it!
Handling stdin
One other thing that Coreutils wc
does and our implementation doesn’t
is computing statistics over stdin
.
The fix is quite simple, just adding a check to the end of our main
:
main = do
-- ... as before ...
when (null files) $ countLazy stats BSL.getContents
If the files
list is empty,
then the forM_
/forConcurrently_
fold doesn’t do anything,
so we shouldn’t even guard it.
Now we can do something like this:
cat testfile.txt | /usr/bin/time hwc-exe -cw
What’s the performance hit?
The above command runs in 1.40 seconds — that’s quite a jump from 1.22 seconds we had previously,
but looks like it’s the overhead of avoiding mmap
and using lazy ByteString
s.
Compile times and code bloat
I’m really nitpicking here, but let’s mention this too for the sake of some objectivity.
What’s the overhead of all this metaprogramming nonsense? We can measure either compilation times or the code size. So let’s do both!
If we go the “dependently-typed” route, which we can consider a baseline in this case,
stack build
takes about 7.9 seconds for the whole project,
and the resulting binary size is 2.24 megabytes (after stripping).
The Template Haskell approach takes exactly three times as much to compile — 23 seconds. The code size change isn’t that big, though: it’s increased by just about 4.3%, resulting in a binary of 2.34 megabytes.
Although… It isn’t worth it doing stack build
during normal development workflow:
stack build --fast
might be much faster.
Again, the “dependently-typed” implementation takes about 5.6 seconds to compile with --fast
,
and the TH one takes… 7.8 seconds.
Not bad. I can buy that.
And, since this is a very important factor for some, let’s compare this to C. As expected, Haskell loses a lot here. I won’t even benchmark the compilation time of the C version — I bet it’s less than 0.1 seconds in release mode. Regarding code bloat, the resulting binary size is 51 kilobytes on my machine or 47 kilobytes in Ubuntu packages.
Anyway, I think that’s about it for all the minor things we could have done.
And this is already turns out to be quite a lengthy post, way more than I originally had in mind, so let’s jump to
Conclusions
So, what have we achieved?
In a couple of words, we’ve managed to productionize our initial wc
prototype
to support user options (and more statistics!)
in such a way that it doesn’t compute what the user hasn’t requested,
all while ensuring the performance overhead of this modularization is negligible.
In a sense, this implementation is closer to the favourite principle of the C and C++ programmers —
“don’t pay for what you don’t use” — than the C version.
Or, in more detail, we’ve:
- split our initial
wc
prototype into a set of small, testable and reusable pieces — one per each individual statistic; - learned how to easily compose those pieces together;
- written some tests for the individual statistics while being certain any combination of statistics is computed correctly by construction;
- tried computing just the statistics chosen by the user at runtime via some techniques inspired by dependent types, and watched them fail hilariously (much to our expectation);
- managed to overcome the performance issues via some Template Haskell;
- verified the result is very close performance-wise to our initial implementation, so all this modularity does not have a significant overhead;
- closed a few other gaps our implementation had, ranging from parallel input processing to handling standard input.
I think that’s enough for today.
Oh, and by the way, the source code of the resulting program (modulo a couple of nits) is also available.
And there are also a couple of things worth mentioning:
- The dependently typed approach isn’t necessarily doomed performance-wise. We chose to feed each byte to each handler, but we could instead feed a L1-cache-sized chunk of data to each handler. Then, for instance, since on my machine L1d cache size is 32 KiB and the input file is 1.8 gigs, the overhead would have been about 0.4 ms, or 0.04% of the characteristic run time of 1 second, which is totally reasonable.
- Our implementation still isn’t completely equivalent to
wc
, since, for instance, the char count only supports ASCII and UTF-8 encodings. Although I haven’t seen a machine with a different encoding in maybe 18 years after I played around with a KOI8-R-based distro, support for other locales encodings can certainly be useful. Or, as an another example, this implementation counts words slightly differently thanwc
, which could also be an important difference in some applications.
The nice part about this approach, though, is that it’s super easy and super cheap to add more statistics, implementing different algorithms, without affecting the existing ones. Modularity FTW!
Edit history
Apr 05: expanded the conclusion and fixed a few typos in the text whereStatistics
occurred instead of Statistic
.