What's up with cross-module optimizations?
Last time, we implemented a simple regexp engine and looked at how it could be optimized. Truth is, I cheated a little: all the code, from the regexp definition to calling the regexp engine, was in the same module. However, it’s unlikely you’d write your production code like this. You’d probably separate different functionality into different modules: low-level memory representation-related things go into one, NFA matching goes into another, and so on. So, when productionizing our code, you’d do a similar refactoring. But even such a simple change turns out to have quite an effect on the performance.
Today, we’ll look at the magnitude of this effect, identify its sources, and try to learn to reason about it.
Balancing brackets
To keep our focus, we consider an even simpler problem today:
the [in]famous interview question of checking if a bracket string is balanced.
We’ll have several different kinds of brackets, so we need to keep a stack.
To make matters more interesting, we keep a mutable ST-based stack
that we write ourselves:
import Control.Monad.ST (ST)
import Data.Vector.Unboxed.Mutable qualified as VM
data Stack s a = Stack
{ theVec :: VM.MVector s a
, size :: Int
}
mkStack :: VM.Unbox a => Int -> ST s (Stack s a)
mkStack initSize = (`Stack` 0) <$> VM.unsafeNew initSize
isEmpty :: Stack s a -> Bool
isEmpty Stack{..} = size == 0
push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a Stack{..} = do
vec' <- if size /= VM.length theVec
then pure theVec
else theVec `VM.unsafeGrow` size
VM.unsafeWrite vec' size a
pure $ Stack vec' (size + 1)
pop :: VM.Unbox a => Stack s a -> ST s (a, Stack s a)
pop Stack{..} = do
a <- VM.unsafeRead theVec (size - 1)
pure (a, Stack theVec (size - 1))
We’ll be using Word8 instead of Char as we don’t need Unicode,
and Word8s are packed more effectively,
using just one byte instead of multiple bytes for Char.
Modulo that, the code is straightforward:
isOpen :: Word8 -> Bool
isOpen [c|(|] = True
isOpen [c|[|] = True
isOpen _ = False
matches :: Word8 -> Word8 -> Bool
matches [c|(|] [c|)|] = True
matches [c|[|] [c|]|] = True
matches _ _ = False
checkBrackets :: BS.ByteString -> Bool
checkBrackets bs = runST $ mkStack len >>= go 0
where
len = BS.length bs
go i stack
| i == len = pure $ isEmpty stack
| otherwise = case bs `BS.unsafeIndex` i of
ch
| isOpen ch -> push ch stack >>= go (i + 1)
| isEmpty stack -> pure False
| otherwise -> do
(inStack, stack') <- pop stack
if inStack `matches` ch
then go (i + 1) stack'
else pure False
main :: IO ()
main = print $ checkBrackets $ BS.replicate cnt 40 <> BS.replicate cnt 41
where
cnt = 100_000_000
For the record, here’s the c quasiquoter.
import Data.Char (ord)
import Language.Haskell.TH (Lit (IntegerL), Pat (LitP))
import Language.Haskell.TH.Quote (QuasiQuoter (..))

c :: QuasiQuoter
c = QuasiQuoter { quotePat = q, quoteExp = unsupported, quoteType = unsupported, quoteDec = unsupported }
where
q [char] = pure $ LitP $ IntegerL $ fromIntegral $ ord char
q _ = fail "single char expected"
unsupported = const $ error "unsupported context"
This runs in about 174 ms (GHC 9.8.2, NCG with -O2, best of ten runs).
Here’s the C++ code.
#include <iostream>
#include <vector>
#include <string>
bool isOpen(char ch)
{
return ch == '(' || ch == '[';
}
bool matches(char l, char r)
{
if (l == '(' && r == ')')
return true;
if (l == '[' && r == ']')
return true;
return false;
}
bool checkBrackets(std::string_view str)
{
std::vector<char> stack;
stack.reserve(str.size());
for (auto ch : str)
{
if (isOpen(ch))
stack.push_back(ch);
else if (stack.empty())
return false;
else
{
auto inStack = stack.back();
stack.pop_back();
if (!matches(inStack, ch))
return false;
}
}
return stack.empty();
}
int main()
{
constexpr size_t cnt = 100'000'000;
auto input = std::string(cnt, '(') + std::string(cnt, ')');
std::cout << checkBrackets(input) << '\n';
}
What happens if we
move
the stack-related definitions (that is, the whole first code listing)
to its own module, say, Data.Stack,
and import it in our Main module?
1.27 seconds, or about 7 times slower — merely after some very basic refactoring! What’s going on?
The only difference, from the compiler’s point of view,
is that all the definitions in the module being compiled are readily available,
while the ones from imported modules might not be.
To confirm this, we need to go deeper into the compilation process and check what the compiler generates. GHC has several code representations below the usual Haskell surface: Core, Cmm, and the assembly for the target architecture. Core is the most high-level and readable one, and the closest to the Haskell source, so it makes sense to start with it.
Into the Core
Let’s add {-# OPTIONS_GHC -ddump-simpl -ddump-to-file #-}
to the module defining the checkBrackets and using the stack functions.
The -ddump-simpl flag tells GHC to output the optimized Core representation of the module,
and -ddump-to-file writes it to a file instead of the console.
Compiling our project again,
we’ll get a Main.dump-simpl file (the exact location depends on your environment).
There, we find something resembling our main function
but no sign of checkBrackets that we’re actually focusing on:
it got inlined into main.
Since that’s the function we care about the most,
we’d like to analyze it in isolation,
so we add {-# NOINLINE checkBrackets #-} and compile again.
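For reference, the relevant bits of Main now look like this (a sketch showing just the additions discussed so far):
{-# OPTIONS_GHC -ddump-simpl -ddump-to-file #-}
module Main where

-- keep checkBrackets as a standalone binding in the dump:
{-# NOINLINE checkBrackets #-}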
Generally, preventing inlining is a pessimization,
but that’s OK in our case:
in short, we’re only calling this function once (so we can ignore any call overhead),
and it’s monomorphic
(so we don’t need to think about it being specialized for the particular types that we care about).
Moreover, keeping checkBrackets as a standalone function
can help GHC optimize it more effectively in this context.
And this is indeed the case: this annotation
improves the performance of the all-in-one-module version
to circa 160 ms — that’s C++ speed again!
Anyway, we now get a definition of a function named Main.$wcheckBrackets… spanning about 600 lines.
Welp, so much for readability!
Trying to reason about it in its entirety is unrealistic, but we can compare it instead!
Let’s add the same pragmas to the first, fast version with all the definitions in the same module
and compare the two dumps.
Unfortunately, diff won’t cut it:
the names for the labels, dummy variables, and so on, are different,
and so are minor implementation details,
like letrec vs joinrec for our go recursive helper.
So, we’ll have to do the comparison ourselves, say, by pulling the two
dumps side by side.
Both versions start out the same: they read the current byte via GHC.Prim.readWord8OffAddr#
(and GHC.Prim.touch#ed the string finalizer to avoid running it before we’re done with the string).
The bad, slow version then calls Data.Stack.$wpush
(which looks like a mangled version of our push),
passing it a ton of things.
Here’s the annotated Core:
Main.$wcheckBrackets
= \ (ww_s3QY :: ghc-prim:GHC.Prim.Addr#)
(ww1_s3QZ :: GHC.ForeignPtr.ForeignPtrContents)
(ww2_s3R0 :: ghc-prim:GHC.Prim.Int#) ->
[...skip 20 lines...]
letrec {
[...skip 13 lines...]
$s$wgo_s3RE
= \ (@s1_s3QH)
(sc_s3Rj :: ghc-prim:GHC.Prim.Int#)
(sc1_s3Rk :: ghc-prim:GHC.Prim.Int#)
(sc2_s3Rm
:: ghc-prim:GHC.Prim.MutableByteArray# (VM.PrimState (ST s1_s3QH)))
(sg_s3Rn
:: Data.Vector.Primitive.Mutable.MVector
(VM.PrimState (ST s1_s3QH)) Word8
~R# VM.MVector s1_s3QH Word8)
(sc3_s3Ri :: ghc-prim:GHC.Prim.Int#)
(ww3_s3QP :: ghc-prim:GHC.Prim.Int#)
(eta_s3QR [OS=OneShot] :: ghc-prim:GHC.Prim.State# s1_s3QH) ->
-- string length check:
case ghc-prim:GHC.Prim.==# sc3_s3Ri ww2_s3R0 of {
__DEFAULT ->
-- reading the byte:
case ghc-prim:GHC.Prim.readWord8OffAddr#
@RealWorld
(ghc-prim:GHC.Prim.plusAddr# ww_s3QY sc3_s3Ri)
0#
ghc-prim:GHC.Prim.realWorld#
of
{ (# ipv2_a2UA, ipv3_a2UB #) ->
-- GC/finalizers shenanigans:
case ghc-prim:GHC.Prim.touch#
@ghc-prim:GHC.Types.Lifted
@GHC.ForeignPtr.ForeignPtrContents
@RealWorld
ww1_s3QZ
ipv2_a2UA
of
{ __DEFAULT ->
join {
-- handler for the isOpen branch:
$j_s3ON [Dmd=M!P(L,L)]
:: (# ghc-prim:GHC.Prim.State# s1_s3QH, Bool #)
[LclId[JoinId(0)(Nothing)]]
$j_s3ON
= case Data.Stack.$wpush
@Word8
@s1_s3QH
Data.Vector.Unboxed.Base.$fUnboxWord8
(GHC.Word.W8# ipv3_a2UB)
((Data.Vector.Primitive.Mutable.MVector
@s1_s3QH
@Word8
sc_s3Rj
sc1_s3Rk
(sc2_s3Rm
`cast` ((ghc-prim:GHC.Prim.MutableByteArray#
(SelCo:Tc(0)
(sg_s3Rn
; Data.Vector.Unboxed.Base.D:R:MVectorsWord80[0]
<s1_s3QH>_N
; Data.Vector.Unboxed.Base.N:R:MVectorsWord8[0]
<s1_s3QH>_N)))_R
:: ghc-prim:GHC.Prim.MutableByteArray#
(VM.PrimState (ST s1_s3QH))
~R# ghc-prim:GHC.Prim.MutableByteArray#
s1_s3QH)))
`cast` (Sym (Data.Vector.Unboxed.Base.N:R:MVectorsWord8[0]
<s1_s3QH>_N)
; Sym (Data.Vector.Unboxed.Base.D:R:MVectorsWord80[0]
<s1_s3QH>_N)
:: Data.Vector.Primitive.Mutable.MVector
s1_s3QH Word8
~R# VM.MVector s1_s3QH Word8))
ww3_s3QP
eta_s3QR
of [...]
The good, fast version, on the other hand, inlines push;
here’s the Core:
Main.$wcheckBrackets
= \ (ww_s45M :: ghc-prim:GHC.Prim.Addr#)
(ww1_s45N :: GHC.ForeignPtr.ForeignPtrContents)
(ww2_s45O :: ghc-prim:GHC.Prim.Int#) ->
[...skip 20 lines...]
joinrec {
[...skip 14 lines...]
$wgo_s45I (ww3_s45u :: ghc-prim:GHC.Prim.Int#)
(ww4_s45A :: ghc-prim:GHC.Prim.Int#)
(ww5_s45B :: ghc-prim:GHC.Prim.Int#)
(ww6_s45C :: ghc-prim:GHC.Prim.MutableByteArray# RealWorld)
(ww7_s45E :: ghc-prim:GHC.Prim.Int#)
(eta_s45G [OS=OneShot] :: ghc-prim:GHC.Prim.State# RealWorld)
-- string length check:
= case ghc-prim:GHC.Prim.==# ww3_s45u ww2_s45O of {
__DEFAULT ->
-- reading the byte:
case ghc-prim:GHC.Prim.readWord8OffAddr#
@RealWorld
(ghc-prim:GHC.Prim.plusAddr# ww_s45M ww3_s45u)
0#
ghc-prim:GHC.Prim.realWorld#
of
{ (# ipv2_a3gB, ipv3_a3gC #) ->
-- GC/finalizers shenanigans:
case ghc-prim:GHC.Prim.touch#
@ghc-prim:GHC.Types.Lifted
@GHC.ForeignPtr.ForeignPtrContents
@RealWorld
ww1_s45N
ipv2_a3gB
of
{ __DEFAULT ->
join {
-- handler for the isOpen branch:
$j_s437 [Dmd=ML] :: Bool
[LclId[JoinId(0)(Nothing)]]
$j_s437
-- stack size check (inlined from `push`):
= case ghc-prim:GHC.Prim./=# ww7_s45E ww5_s45B of {
-- "need to grow" branch:
__DEFAULT ->
let {
eta3_s434 :: ghc-prim:GHC.Prim.Int#
[LclId]
eta3_s434 = ghc-prim:GHC.Prim.+# ww5_s45B ww7_s45E } in
-- new stack size integer overflow check:
case ghc-prim:GHC.Prim.<# eta3_s434 0# of {
-- "all good" branch:
__DEFAULT ->
-- allocate the new array:
case ghc-prim:GHC.Prim.newByteArray#
[...skip 15 lines preparing the parameters...]
of
{ (# ipv4_a3jb, ipv5_a3jc #) ->
-- copy old array into the new one:
case ghc-prim:GHC.Prim.copyMutableByteArray#
[...skip 33 lines preparing the parameters...]
of s'#_a3jj
{ __DEFAULT ->
-- write onto the newly allocated stack:
case ghc-prim:GHC.Prim.writeWord8Array#
@(VM.PrimState (ST (VM.PrimState (ST RealWorld))))
ipv5_a3jc
ww7_s45E
ipv3_a3gC
s'#_a3jj
of s'#1_a3ko
{ __DEFAULT ->
-- TCO'd recursive call:
jump $wgo_s45I
[...skip 38 lines preparing the parameters...]
}
}
};
-- new stack size integer overflown:
1# -> case $wlvl_r472 eta3_s434 of wild_00 { }
};
-- "no need to grow" branch of stack size check
1# ->
-- write onto the existing stack:
case ghc-prim:GHC.Prim.writeWord8Array#
[...skip 36 lines preparing the parameters...]
of s'#_a3ko
{ __DEFAULT ->
-- TCO'd recursive call:
jump $wgo_s45I
[...skip 14 lines preparing the parameters...]
}
Specialization
The worst part — the absolute performance killer, in fact —
is that the non-inlined push doesn’t get specialized.
For instance, note the $fUnboxWord8 argument
(qualified by the Data.Vector.Unboxed.Base module):
it’s the explicitly passed dictionary of the Unbox type class
telling the Vector how to box and unbox the Word8s.
Dictionaries add significant overhead: each class method call
requires a dictionary lookup, which is an extra indirection by itself,
and it also blocks the further optimizations that inlining those operations would have enabled.
So, why does push get neither inlined nor specialized?
It’s easier to give a definite answer for the latter: GHC only does auto-specialization for overloaded functions defined in the same module (or imported ones marked INLINABLE). The GHC documentation says explicitly:
When compiling a module M, GHC’s optimiser (when given the -O flag) automatically considers each top-level overloaded function declared in M, and specialises it for the different types at which it is called in M. The optimiser also considers each imported INLINABLE overloaded function, and specialises it for the different types at which it is called in M.
Moreover, Well-Typed recently had a couple of excellent and deep blog posts on specializer’s behavior [1] [2]. There, they say something similar:
Even if GHC decides to expose an overloaded binding’s unfolding, and a specializable call to that binding occurs in another module, GHC will still never automatically specialize that call unless it has been given explicit permission to create the specialization. Such explicit permission can only be given in one of the following ways:
- Mark the overloaded binding with either an INLINABLE or INLINE pragma.
- Enable the -fspecialize-aggressively flag while compiling the calling module.
So, we now have two options:
- mark push (and other functions, for consistency) with the {-# INLINEABLE #-} pragma,
- or use the ultimate sledgehammer approach of the -fspecialize-aggressively -fexpose-all-unfoldings flags.
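Concretely, the first option is a one-line addition per function in Data.Stack (a sketch; the bodies stay exactly as in the first listing):
{-# INLINEABLE mkStack #-}
{-# INLINEABLE isEmpty #-}
{-# INLINEABLE push #-}
{-# INLINEABLE pop #-}
The second option needs no pragmas at all: -fexpose-all-unfoldings goes on the defining module (Data.Stack), and -fspecialize-aggressively on the calling one (Main) — for instance, via {-# OPTIONS_GHC ... #-} pragmas or ghc-options in the cabal file.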
Both options fix the performance bug, and we get back our 160 ms-ish run time. We’re done, right?
“Hold on!” — you might say.
The other functions like pop or mkStack
aren’t marked as INLINEABLE but also don’t give us trouble — how come?
To understand that, we need to remember about…
Inlining
As the Well-Typed posts above say, GHC is more eager to inline than to specialize,
so if it inlines pop or mkStack, then there’s no question of specialization not firing.
But what can be inlined, then?
If we dump Data.Stack’s Core, we’ll see that pop
gets split into pop and Data.Stack.pop1
(the former merely calling the latter with some runtime-irrelevant type coercions),
and both are annotated by, more or less,
Unf=Unf{Src=<vanilla>, TopLvl=True,
Value=True, ConLike=True, WorkFree=True, Expandable=True,...
Guidance=ALWAYS_IF(arity=3,unsat_ok=True,boring_ok=False)
...}
Unf here stands for “unfolding”, and it contains a bunch of hints to the optimizer.
The particular hint we’re most interested in right now is Guidance.
ALWAYS_IF here means the definition should always be inlined
if the corresponding criteria are met.
These criteria are:
- arity — how many arguments the function should be applied to for inlining to make sense,
- unsat_ok — whether it’s OK to inline even if there are less than arity arguments (in other words, if I understand this correctly, GHC doesn’t care about arity when unsat_ok = True),
- boring_ok — whether it’s OK to inline if GHC doesn’t think it opens further optimization opportunities.
In our case, all these are met, so pop gets inlined.
Let’s now look at mkStack.
The “top-level” one, which merely passes the calls to Data.Stack.mkStack1 after some type coercions, is also annotated by
Unf=Unf{Src=<vanilla>, TopLvl=True,
Value=True, ConLike=True, WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=True)}
The latter is defined like this:
Data.Stack.mkStack1
= \ (@a_a3gB)
(@s_a3gC)
($dUnbox_a3gD :: VM.Unbox a_a3gB)
(initSize_a2GV [OS=OneShot] :: Int)
(eta_B0 [OS=OneShot] :: ghc-prim:GHC.Prim.State# s_a3gC) -> ...
and its annotation has IF_ARGS that we haven’t seen before:
Unf=Unf{Src=<vanilla>, TopLvl=True,
Value=True, ConLike=True, WorkFree=True, Expandable=True,
Guidance=IF_ARGS [30 20 0] 100 10}
This stuff is documented very sparsely, so I’m very likely mistaken, but again, if I understand this correctly (mostly from reading GHC sources), the “arguments” mean the following:
- The second one, 100, is the initial size of the function.
- The first one, [30 20 0], describes the discounts applied to the function size for each function argument if it’s “good” (substituting it doesn’t lead to code bloat). Remember what our arguments are:
  - the Unbox dictionary, and inlining it is very beneficial indeed (also note that 30 is coincidentally the default value of -funfolding-dict-discount),
  - the initial size of the stack, inlining it is so-so,
  - the state token that doesn’t have any operational content, so inlining it has no benefit at all.
- The last one, 10, is the additional discount if the result of the inlining exposes further optimization opportunities.
There are also additional hardcoded discounts:
a “boost” of 10 for the mere fact of inlining,
and another 10 per supplied function argument, just for it being there.
So, in our case, if all arguments are good, the resulting size is computed as
| size | 100 |
| - arg1 | 30 |
| - arg2 | 20 |
| - arg3 | 0 |
| - inlining bonus | 10 |
| - hardcoded bonus | 10 |
| - args bonus | 3 * 10 |
| = total | 0 |
which is less than the apparent default 80
of the -funfolding-use-threshold tunable (which is the inlining threshold).
Hence, mkStack gets inlined, just as observed.
What about push? It gets split into three functions now:
- push, with the ALWAYS_IF guidance, which just calls the…
- Data.Stack.push1, also with the ALWAYS_IF guidance, which seems to do the worker/wrapper transformation and then call the…
- Data.Stack.$wpush, having Guidance=IF_ARGS [120 20 20 0 0] 393 20.
So even in the best case we’ll get the all-discounted cost of $wpush as 153
(including all the hardcoded discounts mentioned above),
which is way above the inlining threshold,
so no inlining happens (also just as observed).
It is also worth noting that the 120-discounted argument here is the Unbox dictionary,
which is used precisely four times in the function body,
so maybe the -funfolding-dict-discount is applied per-use, not per-argument (4 × 30 = 120).
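As an aside, these thresholds are tunable, so one way to sanity-check this arithmetic is to raise the per-module inlining threshold above $wpush’s discounted size (a hypothetical experiment, not something we rely on below):
-- in Main: lift the threshold above $wpush's discounted size of 153
{-# OPTIONS_GHC -funfolding-use-threshold=200 #-}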
More dumps
Can we ask GHC to verify our hypotheses?
First, we might notice a seemingly relevant flag -ddump-inlinings,
which instructs GHC to output which functions it decides to inline.
What if we add it to our Main module, undo all our INLINEABLE or -fspecialize-aggressively changes,
recompile and grep for Data.Stack?
Inlining done: Data.Stack.mkStack
Inlining done: Data.Stack.mkStack1
Inlining done: Data.Stack.isEmpty
Inlining done: Data.Stack.pop
Inlining done: Data.Stack.push
Inlining done: Data.Stack.isEmpty
Inlining done: Data.Stack.pop1
Inlining done: Data.Stack.push1
This might be a bit confusing, especially if we forget what we’ve seen in Core
(and it was confusing for me, since I tried this flag before looking at the Core).
Indeed, we see that push (and even something named push1) gets inlined,
but the performance is as if it’s not!
But now we know it’s because push gets split into three functions,
and that the last one, the $wpush worker, doesn’t get inlined
(and it isn’t present in this list).
Luckily, there’s another flag: -ddump-verbose-inlinings.
This outputs a ton of information (about 17100 lines just for the Main module in our case),
but it is very informative!
For instance, we can see that the push1 wrapper gets inlined somewhere:
Considering inlining: push1
arg infos [ValueArg, ValueArg, ValueArg, TrivArg]
interesting continuation CaseCtxt
some_benefit True
is exp: True
is work-free: True
guidance ALWAYS_IF(arity=4,unsat_ok=True,boring_ok=False)
ANSWER = YES
but our $wpush worker does not:
Considering inlining: $wpush
arg infos [ValueArg, ValueArg, ValueArg, TrivArg, TrivArg]
interesting continuation CaseCtxt
some_benefit True
is exp: True
is work-free: True
guidance IF_ARGS [120 20 20 0 0] 393 20
case depth = 2
depth based penalty = 0
discounted size = 153
ANSWER = NO
Note that this 153 is exactly equal to the 153 we estimated from reading GHC sources.
Exercise to the reader:
compare Main’s -ddump-verbose-inlinings and Data.Stack’s Core
for push annotated with:
- no pragmas,
- {-# INLINEABLE #-},
- {-# INLINE #-}.
Other optimizations
While we’re at it, we could try a couple of other things.
Explicit specialization
What happens
if we pretend that the specializer has fired (as it did in the original one-module case)
by adding the explicit specialization pragma?
We can add the following to specialize the push function for Word8:
{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}
We get a rewrite rule that forwards Word8-instantiated calls of push to
Data.Stack.push1 [InlPrag=[2]]
:: forall {s}.
Word8 -> Stack s Word8 -> GHC.ST.STRep s (Stack s Word8)
which doesn’t have any dictionary-related overhead,
and it also has Guidance=ALWAYS_IF(arity=3,unsat_ok=True,boring_ok=False) inlining hints.
This now runs in about 154 ms,
which is even faster than our original, all-in-one-module version
and comparable to the C++ version compiled with clang.
It’s worth noting that we only specialized push
with respect to a ~ Word8 and not the s tag,
yet still got a good performance improvement.
This is because s represents the state thread in the ST monad
and doesn’t have any runtime semantics.
GHC recognizes this and can effectively optimize the function
without us having to specialize over s.
The bottom line is that we don’t always have to
fully specialize a polymorphic function to enjoy the performance benefits.
Hot and cold
There’s another optimization technique, and it’s quite universal across languages — in fact, I’ve seen it used the most in the context of high-frequency trading and other systems with very tight latency and throughput requirements.
Let’s split push into two functions,
a “hot” one that runs most often,
and a “cold” one that only gets called O(1) times (asymptotically never):
push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a s@Stack{..}
| size /= VM.length theVec = do
VM.unsafeWrite theVec size a
pure $ Stack theVec (size + 1)
| otherwise = pushCold a s
pushCold :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
pushCold a Stack{..} = do
vec' <- theVec `VM.unsafeGrow` size
push a $ Stack vec' size
{-# NOINLINE pushCold #-}
By marking pushCold as {-# NOINLINE #-},
we prevent GHC from bloating the hot push with this almost-never-needed code,
keeping push small enough to be inlined on its own.
Indeed, compiling Main with -ddump-inlinings
shows that all three functions GHC splits push into are inlined —
without any pragmas on push itself.
Moreover, this version has roughly the same 160 ms-ish run time,
which shows that splitting hot and cold paths can be an effective optimization strategy
(although in this case, it doesn’t offer a significant improvement over explicit specialization).
But, although this might look like a cunning trick,
I’d advise against relying merely on {-# NOINLINE #-}ing cold paths in your code,
since it’s easily broken by minor code changes or GHC upgrades.
It’s still a good idea to annotate hot paths with {-# INLINEABLE #-}.
In our case, this split is probably not worth it, so we revert it and say we’re done.
More brackets!
The interviewer hears we’re done and bumps the problem’s complexity considerably,
now asking to match four different kinds of brackets.
Fearlessly, we think a little and then make the required changes,
modifying our isOpen and matches functions
to recognize the four kinds of brackets:
isOpen :: Word8 -> Bool
isOpen [c|(|] = True
isOpen [c|[|] = True
isOpen [c|<|] = True
isOpen [c|{|] = True
isOpen _ = False
matches :: Word8 -> Word8 -> Bool
matches [c|(|] [c|)|] = True
matches [c|[|] [c|]|] = True
matches [c|<|] [c|>|] = True
matches [c|{|] [c|}|] = True
matches _ _ = False
This now runs in about 265 ms with the stack split into a separate module,
even with all the {-# INLINEABLE #-} pragmas —
about 100 ms or 60% more than the previous version.
That’s quite a jump from the 160 ms we had!
Having a bad feeling, we try the all-in-one-module version again,
and it consistently runs in about 230 ms, which is again noticeably faster.
If we replace {-# INLINEABLE push #-} with {-# INLINE push #-},
we’ll get 215 ms.
Using {-# SPECIALIZE #-} also gives us the same 215 ms
if we do it in Data.Stack,
or 260 ms if we do it in Main
(yes, you can {-# SPECIALIZE #-} inlineable functions in different modules).
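For the record, the cross-module variant is the same pragma, just placed in Main and applied to the imported function; a minimal sketch (legal only because push is {-# INLINEABLE #-} in Data.Stack, so its unfolding is available here):
module Main where

import Control.Monad.ST (ST)
import Data.Word (Word8)

import Data.Stack

-- Specialize the imported push for the types Main actually uses;
-- the rest of Main stays as before.
{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}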
What is going on again?
Aside from the obvious causes like more branches and higher instruction cache utilization,
the added complexity affects GHC’s optimization choices.
Remember that GHC can optimize, inline, and specialize,
and do arbitrary subsequences of these actions, and not all actions commute.
Moreover, remember that GHC exports the pristine unoptimized body of
an {-# INLINE #-} or {-# INLINEABLE #-} function,
as the docs say:
It is useful for GHC to optimise the definition of an INLINE function f just like any other non-INLINE function, in case the non-inlined version of f is ultimately called. But we don’t want to inline the optimised version of f; a major reason for INLINE pragmas is to expose functions in f’s RHS that have rewrite rules, and it’s no good if those functions have been optimised away.
So GHC guarantees to inline precisely the code that you wrote, no more and no less. It does this by capturing a copy of the definition of the function to use for inlining (we call this the “inline-RHS”), which it leaves untouched, while optimising the ordinary RHS as usual. For externally-visible functions the inline-RHS (not the optimised RHS) is recorded in the interface file.
We now have several scenarios, and among the most interesting ones:
1. optimize-inline-~~specialize~~ (all-in-one module):
  - push’s body gets optimized as-is.
  - The optimized body gets inlined into checkBrackets.
  - Inlining also substitutes lookups of the Unbox class methods in the dictionary argument with the methods specific to Word8, so there’s no explicit specialization step, hence I’ve struck it out.
2. specialize-optimize-inline-1 (with the {-# INLINEABLE #-} pragma):
  - push’s body is exported as is, unoptimized.
  - It’s too big to get inlined, so the specializer kicks in as a fallback.
  - The specialized body is now small and nice enough to get inlined into checkBrackets.
3. inline-specialize-optimize (with the {-# INLINE #-} pragma):
  - push’s body is exported as is, unoptimized.
  - It’s too big to get inlined, but nobody cares, and GHC shoves it into the caller.
  - The inlined body is then optimized in the context of checkBrackets, together with it.
4. specialize-optimize-inline-2 (with {-# SPECIALIZE #-} in Data.Stack):
  - push’s body gets specialized to Word8 while compiling Data.Stack.
  - The specialized body gets optimized while still compiling Data.Stack.
  - The specialized and optimized body gets exported.
  - It’s small and nice, so it gets inlined into checkBrackets.
5. specialize-optimize-inline-3 (with {-# INLINEABLE #-} in Data.Stack and {-# SPECIALIZE #-} in Main):
  - push’s body is exported as is, unoptimized.
  - push’s body gets specialized to Word8 while compiling Main.
  - The specialized body gets optimized while compiling Main.
  - It’s small and nice, so it gets inlined into checkBrackets.
So, each scenario represents a different sequence of optimization steps, leading to varying performance.
Exercise for the reader: what happens in the scenarios (2) and (3)
if the caller function (checkBrackets in our case)
is not used in the same module it’s defined,
but is exported instead?
A hint: things may get ugly because the unoptimized body of the caller is exported.
More modules!
What’s curious, and what I don’t have a good explanation for,
is that scenarios (4) and (5) have vastly different performance characteristics,
and the Core of checkBrackets is noticeably different,
even though the Core of the specialized push versions in both scenarios is the same.
One caveat: to actually see the specialized push body when specializing in Main,
one has to {-# SPECIALIZE NOINLINE #-} it;
otherwise, it gets inlined and removed as a separate binding.
The NOINLINE itself might also cause different Core than
the version without it.
What’s even more interesting is that if we
add
yet another module
that merely imports Data.Stack, specializes push, and re-exports it,
we’ll get the same numbers as when it was specialized in Data.Stack itself.
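Such a module is tiny; here’s a sketch (the Data.Stack.Word8 name is made up):
module Data.Stack.Word8 (module Data.Stack) where

import Control.Monad.ST (ST)
import Data.Word (Word8)

import Data.Stack

-- The rewrite rule generated by the pragma below is recorded in this
-- module's interface, so importers see the specialized push.
{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}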
Apparently, GHC prefers to rely on imported, specialized, and optimized functions,
possibly due to how it manages optimization workloads and inlining decisions across module boundaries.
And, since it’s only observed when the surrounding function becomes more complex
(by adding more cases, for instance), I’d guess GHC somehow relies on
the imported function even if it is specialized in the current module,
either explicitly (via {-# SPECIALIZE #-}) or implicitly (by merely being {-# INLINEABLE #-}).
At least, this hypothesis is consistent with observations,
where both scenarios result in similar performance and similar Core.
Where does this leave us?
I like to say the hardest part of Haskell programming is reasoning about performance, and today I stand firm in my beliefs: we’ve seen how seemingly simple refactoring and code changes can have profound impacts on performance due to the intricacies of GHC’s optimization process. In short, reasoning about optimization is a mess.
But, some principles and observations may help:
- By default, GHC only specializes the functions defined in the current module and those marked with {-# INLINEABLE #-} or {-# INLINE #-}. Ensure your functions are marked accordingly.
- Remember that with {-# INLINEABLE #-} and {-# INLINE #-}, the unoptimized bodies of functions are exported, and the subsequent inlining happens according to those unoptimized bodies. An {-# INLINEABLE #-} (monomorphic or non-specialized) function might be inlined in some context in its defining module but not inlined in the same context in a different module.
- The trinity of specialization, inlining, and further optimizations does not commute. You can (and will) get different results based on the exact sequence of these steps, which depends somewhat non-trivially on the exact module structure and the exact pragmas each function is annotated with.
- Not all polymorphism is bad for performance. For example, phantom type parameters for the ST trick don’t need to be specialized explicitly: the optimizer is really good at removing them.
- The caller’s complexity is as important to the optimizer as that of its callees.
Now, is it generally beneficial to mark functions as:
- {-# INLINEABLE #-}? Perhaps, if you don’t care about compilation times (I, for one, don’t).
- {-# INLINE #-}? Definitely not always, since it might result in code bloat and more (i-)cache misses.
- {-# SPECIALIZE #-}? Perhaps, if you don’t care about the extra maintenance overhead (I, for one, do).
Is there any universal, silver bullet-like advice? Unfortunately, I don’t think so. Nevertheless, a good rule of thumb might be to:
- Annotate your hot functions with {-# INLINEABLE #-} by default, and with {-# INLINE #-} if benchmarks show it’s beneficial.
- If a hot function is polymorphic and depends on some type class, and you know it will be used with some specific instances of that class, consider explicitly {-# SPECIALIZE #-}ing the function, even though it involves the extra maintenance overhead of keeping these specializations in sync with the actual usage sites (see the sketch right after this list).
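Applied to our running example, that rule of thumb amounts to something like this (a sketch; push’s body is unchanged from the first listing):
-- in Data.Stack:
push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a Stack{..} = do
  vec' <- if size /= VM.length theVec
    then pure theVec
    else theVec `VM.unsafeGrow` size
  VM.unsafeWrite vec' size a
  pure $ Stack vec' (size + 1)
-- expose the unoptimized body to importing modules by default...
{-# INLINEABLE push #-}
-- ...and, since we know Main only ever pushes Word8s, pin that down too:
{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}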
Last but not least, experiment on your own and play around with some toy examples! This helps develop an intuition for what works best in which context, and there’s no better way to learn than by doing.