What's up with cross-module optimizations?
Last time, we implemented a simple regexp engine and looked at how it could be optimized. Truth is, I cheated a little: all the code, from the regexp definition to calling the regexp engine, was in the same module. However, it’s unlikely you’d write your production code like this. You’d probably separate different functionality into different modules: low-level memory representation-related things go into one, NFA matching goes into another, and so on. So, when productionizing our code, you’d do a similar refactoring. But even such a simple change turns out to have quite an effect on the performance.
Today, we’ll look at the magnitude of this effect, identify its sources, and try to learn to reason about it.
Balancing brackets
To keep our focus, we consider an even simpler problem today: the [in]famous interview question of checking if a bracket string is balanced. We'll have several different kinds of brackets, so we need to keep a stack. To make matters more interesting, we keep a mutable ST-based stack that we write ourselves:
{-# LANGUAGE RecordWildCards #-}  -- for the Stack{..} patterns below

import Control.Monad.ST
import Data.Vector.Unboxed.Mutable qualified as VM

data Stack s a = Stack
  { theVec :: VM.MVector s a
  , size :: Int
  }

mkStack :: VM.Unbox a => Int -> ST s (Stack s a)
mkStack initSize = (`Stack` 0) <$> VM.unsafeNew initSize

isEmpty :: Stack s a -> Bool
isEmpty Stack{..} = size == 0

push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a Stack{..} = do
  vec' <- if size /= VM.length theVec
            then pure theVec
            else theVec `VM.unsafeGrow` size
  VM.unsafeWrite vec' size a
  pure $ Stack vec' (size + 1)

pop :: VM.Unbox a => Stack s a -> ST s (a, Stack s a)
pop Stack{..} = do
  a <- VM.unsafeRead theVec (size - 1)
  pure (a, Stack theVec (size - 1))
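As a quick sanity check of the API, here's a throwaway example of mine (not part of the original listings; it additionally assumes Data.Word is imported for Word8):

demo :: Bool
demo = runST $ do
  stack <- mkStack 4                  -- room for four elements
  stack' <- push (40 :: Word8) stack  -- 40 is the byte for '('
  (x, _) <- pop stack'
  pure (x == 40)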
We'll be using Word8 instead of Char as we don't need Unicode, and Word8s are packed more efficiently, using just one byte instead of multiple bytes for Char. Modulo that, the code is straightforward:
isOpen :: Word8 -> Bool
isOpen [c|(|] = True
isOpen [c|[|] = True
isOpen _ = False
matches :: Word8 -> Word8 -> Bool
matches [c|(|] [c|)|] = True
matches [c|[|] [c|]|] = True
matches _ _ = False
import Data.ByteString qualified as BS
import Data.ByteString.Unsafe qualified as BS  -- for unsafeIndex

checkBrackets :: BS.ByteString -> Bool
checkBrackets bs = runST $ mkStack len >>= go 0
  where
    len = BS.length bs
    go i stack
      | i == len = pure $ isEmpty stack
      | otherwise = case bs `BS.unsafeIndex` i of
          ch
            | isOpen ch -> push ch stack >>= go (i + 1)
            | isEmpty stack -> pure False
            | otherwise -> do
                (inStack, stack') <- pop stack
                if inStack `matches` ch
                  then go (i + 1) stack'
                  else pure False
main :: IO ()
main = print $ checkBrackets $ BS.replicate cnt 40 <> BS.replicate cnt 41
  where
    cnt = 100_000_000
For the record, here's the c quasiquoter.
import Data.Char (ord)
import Language.Haskell.TH
import Language.Haskell.TH.Quote

c :: QuasiQuoter
c = QuasiQuoter { quotePat = q, quoteExp = unsupported, quoteType = unsupported, quoteDec = unsupported }
  where
    q [char] = pure $ LitP $ IntegerL $ fromIntegral $ ord char
    q _ = fail "single char expected"
    unsupported = const $ error "unsupported context"
This runs in about 174 ms (GHC 9.8.2, NCG with -O2, best of ten runs).
Here’s the C++ code.
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

bool isOpen(char ch)
{
    return ch == '(' || ch == '[';
}

bool matches(char l, char r)
{
    if (l == '(' && r == ')')
        return true;
    if (l == '[' && r == ']')
        return true;
    return false;
}

bool checkBrackets(std::string_view str)
{
    std::vector<char> stack;
    stack.reserve(str.size());
    for (auto ch : str)
    {
        if (isOpen(ch))
            stack.push_back(ch);
        else if (stack.empty())
            return false;
        else
        {
            auto inStack = stack.back();
            stack.pop_back();
            if (!matches(inStack, ch))
                return false;
        }
    }
    return stack.empty();
}

int main()
{
    constexpr size_t cnt = 100'000'000;
    auto input = std::string(cnt, '(') + std::string(cnt, ')');
    std::cout << checkBrackets(input) << '\n';
}
What happens if we move the stack-related definitions (that is, the whole first code listing) to its own module, say, Data.Stack, and import it in our Main module?
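Concretely, the split amounts to something like this (the explicit export list is my own choice; exporting everything works just as well):

-- Data/Stack.hs
module Data.Stack (Stack, mkStack, isEmpty, push, pop) where
-- the whole first listing, imports included, moves here verbatim

-- Main.hs
module Main where

import Data.Stack
-- isOpen, matches, checkBrackets, and main stay here unchanged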
The answer: 1.27 seconds, or about 7 times slower — merely after some very basic refactoring! What's going on?
The only difference from the compiler's point of view is that all the definitions in the module being compiled are readily available, while the ones from imported modules might not be.
To confirm this, we need to go deeper into the compilation process and check what the compiler generates. GHC has several code representations below the usual Haskell: Core, Cmm, and the target architecture's assembly. Core is the most high-level and readable one, closest to the Haskell source, so it makes sense to start with it.
Into the Core
Let's add {-# OPTIONS_GHC -ddump-simpl -ddump-to-file #-} to the module defining checkBrackets and using the stack functions. The -ddump-simpl flag tells GHC to output the optimized Core representation of the module, and -ddump-to-file writes it to a file instead of the console.
Compiling our project again, we'll get a Main.dump-simpl file (the exact location depends on your environment). There, we find something resembling our main function but no sign of the checkBrackets that we're actually focusing on: it got inlined into main.
Since that's the function we care about the most, we'd like to analyze it in isolation, so we add {-# NOINLINE checkBrackets #-} and compile again. Generally, preventing inlining is a pessimization, but that's OK in our case: in short, we're only calling this function once (so we can ignore any call overhead), and it's monomorphic (so we don't need to think about it being specialized for the particular types that we care about). Moreover, keeping checkBrackets as a standalone function can help GHC optimize it more effectively in this context.
And this is indeed the case: this annotation improves the performance of the all-in-one-module version to circa 160 ms — that's C++ speed again!
Anyway, we now get a definition of a function named Main.$wcheckBrackets… spanning about 600 lines. Welp, so much for readability! Trying to reason about it in its entirety is unrealistic, but we can compare the two versions instead!
Let's add the same pragmas to the first, fast version with all the definitions in the same module and compare the two dumps. Unfortunately, diff won't cut it: the names of the labels, dummy variables, and so on are different, and so are minor implementation details, like letrec vs joinrec for our go recursive helper. So, we'll have to do the comparison ourselves, say, by pulling the two dumps up side by side.
Both versions start the same way: they check the current position against the string length and read the current byte via GHC.Prim.readWord8OffAddr# (and GHC.Prim.touch#ed the string finalizer to avoid running it before we're done with the string). The bad, slow version then calls Data.Stack.$wpush (which looks like a mangled version of our push), passing it a ton of things. Here's the annotated Core:
Main.$wcheckBrackets
= \ (ww_s3QY :: ghc-prim:GHC.Prim.Addr#)
(ww1_s3QZ :: GHC.ForeignPtr.ForeignPtrContents)
(ww2_s3R0 :: ghc-prim:GHC.Prim.Int#) ->
[...skip 20 lines...]
letrec {
[...skip 13 lines...]
$s$wgo_s3RE
= \ (@s1_s3QH)
(sc_s3Rj :: ghc-prim:GHC.Prim.Int#)
(sc1_s3Rk :: ghc-prim:GHC.Prim.Int#)
(sc2_s3Rm
:: ghc-prim:GHC.Prim.MutableByteArray# (VM.PrimState (ST s1_s3QH)))
(sg_s3Rn
:: Data.Vector.Primitive.Mutable.MVector
(VM.PrimState (ST s1_s3QH)) Word8
~R# VM.MVector s1_s3QH Word8)
(sc3_s3Ri :: ghc-prim:GHC.Prim.Int#)
(ww3_s3QP :: ghc-prim:GHC.Prim.Int#)
(eta_s3QR [OS=OneShot] :: ghc-prim:GHC.Prim.State# s1_s3QH) ->
-- string length check:
case ghc-prim:GHC.Prim.==# sc3_s3Ri ww2_s3R0 of {
__DEFAULT ->
-- reading the byte:
case ghc-prim:GHC.Prim.readWord8OffAddr#
@RealWorld
(ghc-prim:GHC.Prim.plusAddr# ww_s3QY sc3_s3Ri)
0#
ghc-prim:GHC.Prim.realWorld#
of
{ (# ipv2_a2UA, ipv3_a2UB #) ->
-- GC/finalizers shenanigans:
case ghc-prim:GHC.Prim.touch#
@ghc-prim:GHC.Types.Lifted
@GHC.ForeignPtr.ForeignPtrContents
@RealWorld
ww1_s3QZ
ipv2_a2UA
of
{ __DEFAULT ->
join {
-- handler for the isOpen branch:
$j_s3ON [Dmd=M!P(L,L)]
:: (# ghc-prim:GHC.Prim.State# s1_s3QH, Bool #)
[LclId[JoinId(0)(Nothing)]]
$j_s3ON
= case Data.Stack.$wpush
@Word8
@s1_s3QH
Data.Vector.Unboxed.Base.$fUnboxWord8
(GHC.Word.W8# ipv3_a2UB)
((Data.Vector.Primitive.Mutable.MVector
@s1_s3QH
@Word8
sc_s3Rj
sc1_s3Rk
(sc2_s3Rm
`cast` ((ghc-prim:GHC.Prim.MutableByteArray#
(SelCo:Tc(0)
(sg_s3Rn
; Data.Vector.Unboxed.Base.D:R:MVectorsWord80[0]
<s1_s3QH>_N
; Data.Vector.Unboxed.Base.N:R:MVectorsWord8[0]
<s1_s3QH>_N)))_R
:: ghc-prim:GHC.Prim.MutableByteArray#
(VM.PrimState (ST s1_s3QH))
~R# ghc-prim:GHC.Prim.MutableByteArray#
s1_s3QH)))
`cast` (Sym (Data.Vector.Unboxed.Base.N:R:MVectorsWord8[0]
<s1_s3QH>_N)
; Sym (Data.Vector.Unboxed.Base.D:R:MVectorsWord80[0]
<s1_s3QH>_N)
:: Data.Vector.Primitive.Mutable.MVector
s1_s3QH Word8
~R# VM.MVector s1_s3QH Word8))
ww3_s3QP
eta_s3QR
of [...]
The good, fast version, on the other hand, gets push fully inlined instead of called; here's the Core.
Main.$wcheckBrackets
= \ (ww_s45M :: ghc-prim:GHC.Prim.Addr#)
(ww1_s45N :: GHC.ForeignPtr.ForeignPtrContents)
(ww2_s45O :: ghc-prim:GHC.Prim.Int#) ->
[...skip 20 lines...]
joinrec {
[...skip 14 lines...]
$wgo_s45I (ww3_s45u :: ghc-prim:GHC.Prim.Int#)
(ww4_s45A :: ghc-prim:GHC.Prim.Int#)
(ww5_s45B :: ghc-prim:GHC.Prim.Int#)
(ww6_s45C :: ghc-prim:GHC.Prim.MutableByteArray# RealWorld)
(ww7_s45E :: ghc-prim:GHC.Prim.Int#)
(eta_s45G [OS=OneShot] :: ghc-prim:GHC.Prim.State# RealWorld)
-- string length check:
= case ghc-prim:GHC.Prim.==# ww3_s45u ww2_s45O of {
__DEFAULT ->
-- reading the byte:
case ghc-prim:GHC.Prim.readWord8OffAddr#
@RealWorld
(ghc-prim:GHC.Prim.plusAddr# ww_s45M ww3_s45u)
0#
ghc-prim:GHC.Prim.realWorld#
of
{ (# ipv2_a3gB, ipv3_a3gC #) ->
-- GC/finalizers shenanigans:
case ghc-prim:GHC.Prim.touch#
@ghc-prim:GHC.Types.Lifted
@GHC.ForeignPtr.ForeignPtrContents
@RealWorld
ww1_s45N
ipv2_a3gB
of
{ __DEFAULT ->
join {
-- handler for the isOpen branch:
$j_s437 [Dmd=ML] :: Bool
[LclId[JoinId(0)(Nothing)]]
$j_s437
-- stack size check (inlined from `push`):
= case ghc-prim:GHC.Prim./=# ww7_s45E ww5_s45B of {
-- "need to grow" branch:
__DEFAULT ->
let {
eta3_s434 :: ghc-prim:GHC.Prim.Int#
[LclId]
eta3_s434 = ghc-prim:GHC.Prim.+# ww5_s45B ww7_s45E } in
-- new stack size integer overflow check:
case ghc-prim:GHC.Prim.<# eta3_s434 0# of {
-- "all good" branch:
__DEFAULT ->
-- allocate the new array:
case ghc-prim:GHC.Prim.newByteArray#
[...skip 15 lines preparing the parameters...]
of
{ (# ipv4_a3jb, ipv5_a3jc #) ->
-- copy old array into the new one:
case ghc-prim:GHC.Prim.copyMutableByteArray#
[...skip 33 lines preparing the parameters...]
of s'#_a3jj
{ __DEFAULT ->
-- write onto the newly allocated stack:
case ghc-prim:GHC.Prim.writeWord8Array#
@(VM.PrimState (ST (VM.PrimState (ST RealWorld))))
ipv5_a3jc
ww7_s45E
ipv3_a3gC
s'#_a3jj
of s'#1_a3ko
{ __DEFAULT ->
-- TCO'd recursive call:
jump $wgo_s45I
[...skip 38 lines preparing the parameters...]
}
}
};
-- new stack size integer overflown:
1# -> case $wlvl_r472 eta3_s434 of wild_00 { }
};
-- "no need to grow" branch of stack size check
1# ->
-- write onto the existing stack:
case ghc-prim:GHC.Prim.writeWord8Array#
[...skip 36 lines preparing the parameters...]
of s'#_a3ko
{ __DEFAULT ->
-- TCO'd recursive call:
jump $wgo_s45I
[...skip 14 lines preparing the parameters...]
}
Specialization
The worst part — the absolute performance killer, in fact — is that the non-inlined push doesn't get specialized. For instance, note the $fUnboxWord8 argument (qualified by the Data.Vector.Unboxed.Base module): it's the explicitly passed dictionary of the Unbox type class, telling the Vector how to box and unbox the Word8s. Dictionaries add significant overhead: each class method call requires a dictionary lookup, which is an extra indirection by itself, and that indirection also prevents the further optimizations that inlining those methods would otherwise enable.
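To picture that cost, here is a hand-rolled illustration of what dictionary passing amounts to, using a toy class-as-record of my own rather than GHC's actual encoding of Unbox:

-- A record standing in for a type class dictionary:
data NumDict a = NumDict
  { addD :: a -> a -> a
  , mulD :: a -> a -> a
  }

-- An unspecialized overloaded function receives the dictionary at run time
-- and reaches every method through it, one extra indirection per operation:
sumSquares :: NumDict a -> a -> a -> a
sumSquares d x y = addD d (mulD d x x) (mulD d y y)

-- A specialization to Int bakes the methods in, so GHC can inline (+) and (*)
-- and keep optimizing from there:
sumSquaresInt :: Int -> Int -> Int
sumSquaresInt x y = x * x + y * y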
So, why does push get neither inlined nor specialized? It's easier to give a definite answer for the latter: because GHC only does auto-specialization for overloaded functions defined in the same module. The GHC documentation says explicitly:
When compiling a module M, GHC's optimiser (when given the -O flag) automatically considers each top-level overloaded function declared in M, and specialises it for the different types at which it is called in M. The optimiser also considers each imported INLINABLE overloaded function, and specialises it for the different types at which it is called in M.
Moreover, Well-Typed recently had a couple of excellent and deep blog posts on the specializer's behavior [1] [2]. There, they say something similar:
Even if GHC decides to expose an overloaded binding’s unfolding, and a specializable call to that binding occurs in another module, GHC will still never automatically specialize that call unless it has been given explicit permission to create the specialization. Such explicit permission can only be given in one of the following ways:
- Mark the overloaded binding with either an INLINABLE or INLINE pragma.
- Enable the -fspecialize-aggressively flag while compiling the calling module.
So, we now have two options:
- mark push (and other functions for consistency) with the {-# INLINEABLE #-} pragma, as sketched below,
- or use the ultimate sledgehammer approach of the -fspecialize-aggressively -fexpose-all-unfoldings flags.
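The first option is a one-line-per-function change in Data.Stack, something like this (push is the one that matters for this benchmark; the others are marked for consistency, as mentioned above):

{-# INLINEABLE mkStack #-}
{-# INLINEABLE isEmpty #-}
{-# INLINEABLE push #-}
{-# INLINEABLE pop #-}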
Both options fix the performance bug, and we get back our 160 ms-ish run time. We're done, right?

"Hold on!" — you might say. The other functions like pop or mkStack aren't marked as INLINEABLE but also don't give us trouble — how come? To understand that, we need to remember about…
Inlining
As the Well-Typed posts above say, GHC is more eager to inline than to specialize, so if it inlines pop or mkStack, then there's no question of specialization not firing. But what can be inlined, then?
If we dump Data.Stack's Core, we'll see that pop gets split into pop and Data.Stack.pop1 (the former merely calling the latter with some runtime-irrelevant type coercions), and both are annotated with, more or less,

Unf=Unf{Src=<vanilla>, TopLvl=True,
        Value=True, ConLike=True, WorkFree=True, Expandable=True,...
        Guidance=ALWAYS_IF(arity=3,unsat_ok=True,boring_ok=False)
        ...}
Unf here stands for "unfolding", and it contains a bunch of hints for the optimizer. The particular hint we're most interested in right now is Guidance. ALWAYS_IF here means the definition should always be inlined if the corresponding criteria are met. These criteria are:
- arity — how many arguments the function should be applied to for inlining to make sense,
- unsat_ok — whether it's OK to inline even if there are fewer than arity arguments (in other words, if I understand this correctly, GHC doesn't care about arity when unsat_ok = True),
- boring_ok — whether it's OK to inline if GHC doesn't think it opens further optimization opportunities.
In our case, all these are met, so pop gets inlined. Let's now look at mkStack.
The "top-level" one, which merely passes the calls to Data.Stack.mkStack1 after some type coercions, is also annotated with

Unf=Unf{Src=<vanilla>, TopLvl=True,
        Value=True, ConLike=True, WorkFree=True, Expandable=True,
        Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=True)}
The latter is defined like this:

Data.Stack.mkStack1
  = \ (@a_a3gB)
      (@s_a3gC)
      ($dUnbox_a3gD :: VM.Unbox a_a3gB)
      (initSize_a2GV [OS=OneShot] :: Int)
      (eta_B0 [OS=OneShot] :: ghc-prim:GHC.Prim.State# s_a3gC) -> ...
and its annotation has IF_ARGS that we haven't seen before:

Unf=Unf{Src=<vanilla>, TopLvl=True,
        Value=True, ConLike=True, WorkFree=True, Expandable=True,
        Guidance=IF_ARGS [30 20 0] 100 10}
This stuff is documented very sparsely, so I’m very likely mistaken, but again, if I understand this correctly (mostly from reading GHC sources), the “arguments” mean the following:
- The second one, 100, is the initial size of the function.
- The first one, [30 20 0], describes the discounts applied to the function size for each function argument if it's "good" (substituting it doesn't lead to code bloat). Remember what our arguments are:
  - the Unbox dictionary, and inlining it is very beneficial indeed (also note that 30 is coincidentally the default value of -funfolding-dict-discount),
  - the initial size of the stack, inlining it is so-so,
  - the state token that doesn't have any operational content, so inlining it has no benefit at all.
- The last one, 10, is the additional discount if the result of the inlining exposes further optimization opportunities.
There are also additional hardcoded discounts, like a "boost" of 10 for the mere fact of inlining, and another 10 per supplied function argument just for it existing.
So, in our case, if all arguments are good, the resulting size is computed as

  size            | 100
- arg1            |  30
- arg2            |  20
- arg3            |   0
- inlining bonus  |  10
- hardcoded bonus |  10
- args bonus      |  3 * 10
= total           |   0
which is less than the apparent default of 80 for the -funfolding-use-threshold tunable (which is the inlining threshold). Hence, mkStack gets inlined, just as observed.
What about push? It gets split into three functions now:
- push, with the ALWAYS_IF guidance, which just calls the…
- Data.Stack.push1, also with the ALWAYS_IF guidance, which seems to do the worker/wrapper transformation and then calls the…
- Data.Stack.$wpush, having Guidance=IF_ARGS [120 20 20 0 0] 393 20.
So even in the best case, we'll get the all-discounted cost of $wpush as 393 - (120 + 20 + 20 + 0 + 0) - 20 - 10 - 5 * 10 = 153 (including all the hardcoded discounts mentioned above), which is way above the inlining threshold, so no inlining happens (also just as observed). It is also worth noting that the 120-discounted argument here is the Unbox dictionary, which is used precisely four times in the function body, so maybe -funfolding-dict-discount is applied per use, not per argument.
More dumps
Can we ask GHC to verify our hypotheses? First, we might notice a seemingly relevant flag, -ddump-inlinings, which instructs GHC to output which functions it decides to inline. What if we add it to our Main module, undo all our INLINEABLE or -fspecialize-aggressively changes, recompile, and grep for Data.Stack?
Inlining done: Data.Stack.mkStack
Inlining done: Data.Stack.mkStack1
Inlining done: Data.Stack.isEmpty
Inlining done: Data.Stack.pop
Inlining done: Data.Stack.push
Inlining done: Data.Stack.isEmpty
Inlining done: Data.Stack.pop1
Inlining done: Data.Stack.push1
This might be a bit confusing, especially if we forget what we've seen in the Core (and it was confusing for me, since I tried this flag before looking at the Core). Indeed, we see that push (and even something named push1) gets inlined, but the performance is as if it's not! But now we know it's because push gets split into three functions, and that the last one, the $wpush worker, doesn't get inlined (and it isn't present in this list).
Luckily, there's another flag: -ddump-verbose-inlinings. This outputs a ton of information (about 17100 lines just for the Main module in our case), but it is very informative! For instance, we can see that the push1 wrapper gets inlined somewhere:
Considering inlining: push1
arg infos [ValueArg, ValueArg, ValueArg, TrivArg]
interesting continuation CaseCtxt
some_benefit True
is exp: True
is work-free: True
guidance ALWAYS_IF(arity=4,unsat_ok=True,boring_ok=False)
ANSWER = YES
but our $wpush worker does not:
Considering inlining: $wpush
arg infos [ValueArg, ValueArg, ValueArg, TrivArg, TrivArg]
interesting continuation CaseCtxt
some_benefit True
is exp: True
is work-free: True
guidance IF_ARGS [120 20 20 0 0] 393 20
case depth = 2
depth based penalty = 0
discounted size = 153
ANSWER = NO
Note that this 153 is exactly the 153 we estimated from reading the GHC sources.
Exercise for the reader: compare Main's -ddump-verbose-inlinings and Data.Stack's Core for push annotated with:
- no pragmas,
- {-# INLINEABLE #-},
- {-# INLINE #-}.
Other optimizations
While we’re at it, we could try a couple of other things.
Explicit specialization
What happens if we pretend that the specializer has fired (as it did in the original one-module case) by adding an explicit specialization pragma? We can add the following to specialize the push function for Word8:

{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}
We get a rewrite rule that forwards Word8-instantiated calls of push to

Data.Stack.push1 [InlPrag=[2]]
  :: forall {s}.
     Word8 -> Stack s Word8 -> GHC.ST.STRep s (Stack s Word8)

which doesn't have any dictionary-related overhead, and it also has the Guidance=ALWAYS_IF(arity=3,unsat_ok=True,boring_ok=False) inlining hint.
This now runs in about 154 ms, which is even faster than our original, all-in-one-module version and comparable to the C++ version compiled with clang. It's worth noting that we only specialized push with respect to a ~ Word8 and not the s tag, yet we still got a good performance improvement. This is because s represents the state thread in the ST monad and doesn't have any runtime semantics. GHC recognizes this and can effectively optimize the function without us having to specialize over s.
The bottom line is that we don't always have to fully specialize a polymorphic function to enjoy the performance benefits.
Hot and cold
There’s another optimization technique, and it’s quite universal across languages — in fact, I’ve seen it used the most in the context of high-frequency trading and other systems with very tight latency and throughput requirements.
Let's split push into two functions: a "hot" one that runs most often, and a "cold" one that only gets called asymptotically never (that is, O(1) times):
push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a s@Stack{..}
  | size /= VM.length theVec = do
      VM.unsafeWrite theVec size a
      pure $ Stack theVec (size + 1)
  | otherwise = pushCold a s

pushCold :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
pushCold a Stack{..} = do
  vec' <- theVec `VM.unsafeGrow` size
  push a $ Stack vec' size
{-# NOINLINE pushCold #-}
By marking pushCold as {-# NOINLINE #-}, we prevent GHC from bloating the hot push with this almost-never-needed code, keeping push small enough that just the hot version gets inlined.
Indeed, compiling Main with -ddump-inlinings shows that all three functions that push is split into by GHC are inlined — without any pragmas on push itself. Moreover, this version has roughly the same 160 ms-ish run time, which shows that splitting hot and cold paths can be an effective optimization strategy (although in this case, it doesn't offer a significant improvement over explicit specialization).
But, although this might look like a cunning trick, I'd advise against relying merely on {-# NOINLINE #-}ing cold paths in your code, since it's easily broken by minor code changes or GHC upgrades. It's still a good idea to annotate hot paths with {-# INLINEABLE #-}, as sketched below.
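Applied to our split, that advice would look something like this (the same hot path as above, just with an explicit pragma instead of relying on GHC's size heuristics alone):

push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a s@Stack{..}
  | size /= VM.length theVec = do
      VM.unsafeWrite theVec size a
      pure $ Stack theVec (size + 1)
  | otherwise = pushCold a s
{-# INLINEABLE push #-}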
In our case, this split is probably not worth it, so we revert it and say we’re done.
More brackets!
The interviewer hears we're done and bumps the problem's complexity considerably, now asking us to match four different kinds of brackets. Fearlessly, we think a little and then make the required changes, modifying our isOpen and matches functions to recognize the four kinds of brackets:
isOpen :: Word8 -> Bool
isOpen [c|(|] = True
isOpen [c|[|] = True
isOpen [c|<|] = True
isOpen [c|{|] = True
isOpen _ = False
matches :: Word8 -> Word8 -> Bool
matches [c|(|] [c|)|] = True
matches [c|[|] [c|]|] = True
matches [c|<|] [c|>|] = True
matches [c|{|] [c|}|] = True
matches _ _ = False
This now runs in about 265 ms with the stack split into a separate module, even with all the {-# INLINEABLE #-} pragmas — about 100 ms, or 60%, more than the previous version. That's quite a jump from the 160 ms we had! Having a bad feeling, we try the all-in-one-module version again, and it consistently runs in about 230 ms, which is again noticeably faster.
If we replace {-# INLINEABLE push #-} with {-# INLINE push #-}, we'll get 215 ms. Using {-# SPECIALIZE #-} also gives us the same 215 ms if we do it in Data.Stack, or 260 ms if we do it in Main (yes, you can {-# SPECIALIZE #-} inlineable functions in different modules). What is going on again?
Aside from the obvious causes like more branches and higher instruction cache utilization, the added complexity affects GHC's optimization choices. Remember that GHC can optimize, inline, and specialize, and do arbitrary subsequences of these actions, and not all actions commute.
Moreover, remember that GHC exports the pristine, unoptimized body of an {-# INLINE #-} or {-# INLINEABLE #-} function, as the docs say:
It is useful for GHC to optimise the definition of an INLINE function f just like any other non-INLINE function, in case the non-inlined version of f is ultimately called. But we don't want to inline the optimised version of f; a major reason for INLINE pragmas is to expose functions in f's RHS that have rewrite rules, and it's no good if those functions have been optimised away.

So GHC guarantees to inline precisely the code that you wrote, no more and no less. It does this by capturing a copy of the definition of the function to use for inlining (we call this the "inline-RHS"), which it leaves untouched, while optimising the ordinary RHS as usual. For externally-visible functions the inline-RHS (not the optimised RHS) is recorded in the interface file.
We now have several scenarios; these are among the most interesting ones:
1. optimize-inline-~~specialize~~ (all-in-one module):
   - push's body gets optimized as-is.
   - The optimized body gets inlined into checkBrackets.
   - Inlining also substitutes lookups of the Unbox class methods in the dictionary argument with the methods specific to Word8, so there's no explicit specialization step, hence I've struck it out.
2. specialize-optimize-inline-1 (with the {-# INLINEABLE #-} pragma):
   - push's body is exported as is, unoptimized.
   - It's too big to get inlined, so the specializer kicks in as a fallback.
   - The specialized body is now small and nice enough to get inlined into checkBrackets.
3. inline-~~specialize~~-optimize (with the {-# INLINE #-} pragma):
   - push's body is exported as is, unoptimized.
   - It's too big to get inlined, but nobody cares, and GHC shoves it into the caller.
   - The inlined body is then optimized in the context of checkBrackets, together with it.
4. specialize-optimize-inline-2 (with {-# SPECIALIZE #-} in Data.Stack):
   - push's body gets specialized to Word8 while compiling Data.Stack.
   - The specialized body gets optimized while still compiling Data.Stack.
   - The specialized and optimized body gets exported.
   - It's small and nice, so it gets inlined into checkBrackets.
5. specialize-optimize-inline-3 (with {-# INLINEABLE #-} in Data.Stack and {-# SPECIALIZE #-} in Main):
   - push's body is exported as is, unoptimized.
   - push's body gets specialized to Word8 while compiling Main.
   - The specialized body gets optimized while compiling Main.
   - It's small and nice, so it gets inlined into checkBrackets.
So, each scenario represents a different sequence of optimization steps, leading to varying performance.
Exercise for the reader: what happens in scenarios (2) and (3) if the caller function (checkBrackets in our case) is not used in the same module it's defined in, but is exported instead? A hint: things may go ugly because the unoptimized body of the caller is exported.

More modules!
What's curious, and what I don't have a good explanation for, is that scenarios (4) and (5) have vastly different performance characteristics, and the Core of checkBrackets is noticeably different, even though the Core of the specialized push versions in both scenarios is the same. Although, one has to {-# SPECIALIZE NOINLINE #-} when specializing in Main to see the specialized push body: otherwise, it gets inlined and removed as a separate binding, and the NOINLINE itself might cause a different Core than the version without NOINLINE.
What's even more interesting is that if we add yet another module that merely imports Data.Stack and specializes push, which is then exported, we'll get the same numbers as when it was specialized in Data.Stack itself.
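Sketched out, such a module could look like this (the module name is mine; the pragma is accepted here because push is marked INLINEABLE in Data.Stack):

-- Data/Stack/Word8.hs: exists solely to produce and re-export
-- the Word8 specialization of push.
module Data.Stack.Word8 (module Data.Stack) where

import Control.Monad.ST (ST)
import Data.Word (Word8)

import Data.Stack

{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}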
Apparently, GHC prefers to rely on imported, specialized, and optimized functions, possibly due to how it manages optimization workloads and inlining decisions across module boundaries.
And, since it's only observed when the surrounding function becomes more complex (by adding more cases, for instance), I'd guess GHC somehow relies on the imported function even if it is specialized in the current module, either explicitly (via {-# SPECIALIZE #-}) or implicitly (by merely being {-# INLINEABLE #-}). At least, this hypothesis is consistent with the observations, where both scenarios result in similar performance and similar Core.
Where do we arrive?
I like to say the hardest part of Haskell programming is reasoning about performance, and today I stand firm in my beliefs: we’ve seen how seemingly simple refactoring and code changes can have profound impacts on performance due to the intricacies of GHC’s optimization process. In short, reasoning about optimization is a mess.
But, some principles and observations may help:
- By default, GHC only specializes the functions defined in the current module and those marked with {-# INLINEABLE #-} or {-# INLINE #-}. Ensure your functions are marked accordingly.
- Remember that with {-# INLINEABLE #-} and {-# INLINE #-}, the unoptimized bodies of functions are exported, and the subsequent inlining happens according to those unoptimized bodies. An {-# INLINEABLE #-} (monomorphic or non-specialized) function might be inlined in some context in its defining module but not inlined in the same context in a different module.
- The trinity of specialization, inlining, and further optimizations does not commute. You can (and will) get different results based on the exact sequence of these steps, which depends somewhat non-trivially on the exact module structure and the exact pragmas each function is annotated with.
- Not all polymorphism is bad for performance. For example, phantom type parameters for the ST trick don't need to be specialized explicitly: the optimizer is really good at removing them.
- The caller's complexity is as important for the optimizer as its callees'.
Now, is it generally beneficial to mark functions as:
- {-# INLINEABLE #-}? Perhaps, if you don't care about compilation times (I, for one, don't).
- {-# INLINE #-}? Definitely not always, since it might result in code bloat and more (i-)cache misses.
- {-# SPECIALIZE #-}? Perhaps, if you don't care about the extra maintenance overhead (I, for one, do).
Is there any universal, silver bullet-like advice? Unfortunately, I don't think so. Nevertheless, a good rule of thumb might be to:
- Annotate your hot functions with {-# INLINEABLE #-} by default, and with {-# INLINE #-} if benchmarks show it's beneficial.
- If a hot function is polymorphic and depends on some type class, and you know it will be used with some specific instances of that class, consider explicitly {-# SPECIALIZE #-}ing the function (as in the sketch below), even though it involves the extra maintenance overhead to make sure these specializations stay in sync with the actual usage sites.
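Applied to our running example, that rule of thumb boils down to a sketch like this:

push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a Stack{..} = do
  vec' <- if size /= VM.length theVec
            then pure theVec
            else theVec `VM.unsafeGrow` size
  VM.unsafeWrite vec' size a
  pure $ Stack vec' (size + 1)
-- expose the unfolding so importers can inline or specialize it:
{-# INLINEABLE push #-}
-- and pin down the one instantiation we know Main uses:
{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}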
Last but not least, experiment on your own and play around with some toy examples! It helps to develop an intuition for what works best in which context, and there's no better way to learn than by doing.