What's up with cross-module optimizations?

October 31, 2024

Last time, we implemented a simple regexp engine and looked at how it could be optimized. Truth is, I cheated a little: all the code, from the regexp definition to calling the regexp engine, was in the same module. However, it’s unlikely you’d write your production code like this. You’d probably separate different functionality into different modules: low-level memory representation-related things go into one, NFA matching goes into another, and so on. So, when productionizing our code, you’d do a similar refactoring. But even such a simple change turns out to have quite an effect on performance.

Today, we’ll look at the magnitude of this effect, identify its sources, and try to learn to reason about it.

Balancing brackets

To keep our focus, we consider an even simpler problem today: the [in]famous interview question of checking if a bracket string is balanced. We’ll have several different kinds of brackets, so we need to keep a stack. To make matters more interesting, we keep a mutable ST-based stack that we write ourselves:

{-# LANGUAGE RecordWildCards #-}

import Control.Monad.ST (ST)
import Data.Vector.Unboxed.Mutable qualified as VM

data Stack s a = Stack
  { theVec :: VM.MVector s a
  , size :: Int
  }

mkStack :: VM.Unbox a => Int -> ST s (Stack s a)
mkStack initSize = (`Stack` 0) <$> VM.unsafeNew initSize

isEmpty :: Stack s a -> Bool
isEmpty Stack{..} = size == 0

push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a Stack{..} = do
  vec' <- if size /= VM.length theVec
             then pure theVec
             else theVec `VM.unsafeGrow` size -- grow by the current size, i.e., double the capacity
  VM.unsafeWrite vec' size a
  pure $ Stack vec' (size + 1)

pop :: VM.Unbox a => Stack s a -> ST s (a, Stack s a)
pop Stack{..} = do
  a <- VM.unsafeRead theVec (size - 1)
  pure (a, Stack theVec (size - 1))

We’ll be using Word8 instead of Char as we don’t need Unicode, and Word8s are packed more efficiently, using just one byte instead of multiple bytes per Char. Modulo that, the code is straightforward:

import Control.Monad.ST (runST)
import Data.ByteString qualified as BS
import Data.ByteString.Unsafe qualified as BS (unsafeIndex)
import Data.Word (Word8)
-- plus the c quasiquoter shown below, which has to live in (and be
-- imported from) a separate module due to Template Haskell staging

isOpen :: Word8 -> Bool
isOpen [c|(|] = True
isOpen [c|[|] = True
isOpen _ = False

matches :: Word8 -> Word8 -> Bool
matches [c|(|] [c|)|] = True
matches [c|[|] [c|]|] = True
matches _ _ = False

checkBrackets :: BS.ByteString -> Bool
checkBrackets bs = runST $ mkStack len >>= go 0
  where
  len = BS.length bs
  go i stack
    | i == len = pure $ isEmpty stack
    | otherwise = case bs `BS.unsafeIndex` i of
                    ch
                      | isOpen ch -> push ch stack >>= go (i + 1)
                      | isEmpty stack -> pure False
                      | otherwise -> do
                        (inStack, stack') <- pop stack
                        if inStack `matches` ch
                         then go (i + 1) stack'
                         else pure False

main :: IO ()
main = print $ checkBrackets $ BS.replicate cnt 40 <> BS.replicate cnt 41
  where
  cnt = 100_000_000

For the record, here’s the c quasiquoter:

import Data.Char (ord)
import Language.Haskell.TH (Lit (IntegerL), Pat (LitP))
import Language.Haskell.TH.Quote (QuasiQuoter (..))

c :: QuasiQuoter
c = QuasiQuoter { quotePat = q, quoteExp = unsupported, quoteType = unsupported, quoteDec = unsupported }
  where
  q [char] = pure $ LitP $ IntegerL $ fromIntegral $ ord char
  q _ = fail "single char expected"
  unsupported = const $ error "unsupported context"

This runs in about 174 ms (GHC 9.8.2, NCG with -O2, best of ten runs).

For comparison, a similar C++ version runs in about 160 ms with gcc 14 and 152 ms with clang 18. That’s about 8-12% faster — unfortunate for Haskell in this case, but I can live with it.

Here’s the C++ code:

#include <iostream>
#include <string>
#include <string_view>
#include <vector>

bool isOpen(char ch)
{
	return ch == '(' || ch == '[';
}

bool matches(char l, char r)
{
	if (l == '(' && r == ')')
		return true;
	if (l == '[' && r == ']')
		return true;
	return false;
}

bool checkBrackets(std::string_view str)
{
	std::vector<char> stack;
	stack.reserve(str.size());

	for (auto ch : str)
	{
		if (isOpen(ch))
			stack.push_back(ch);
		else if (stack.empty())
			return false;
		else
		{
			auto inStack = stack.back();
			stack.pop_back();
			if (!matches(inStack, ch))
				return false;
		}
	}
	return stack.empty();
}

int main()
{
	constexpr size_t cnt = 100'000'000;
	auto input = std::string(cnt, '(') + std::string(cnt, ')');
	std::cout << checkBrackets(input) << '\n';
}

What happens if we move the stack-related definitions (that is, the whole first code listing) to its own module, say, Data.Stack, and import it in our Main module?

1.27 seconds, or about 7 times slower — merely after some very basic refactoring! What’s going on?

The only difference, from the compiler’s point of view, is that all the definitions in the module being compiled are readily available, while those from imported modules might not be.

To confirm this, we need to go deeper into the compilation process and check what the compiler generates. GHC has several different code representations below the usual Haskell: Core, Cmm, and the assembly for the target architecture. Core is the most high-level and readable one, and the closest to the Haskell source, so it makes sense to start with it.

Into the Core

Let’s add {-# OPTIONS_GHC -ddump-simpl -ddump-to-file #-} to the module defining checkBrackets and using the stack functions. The -ddump-simpl flag tells GHC to output the optimized Core representation of the module, and -ddump-to-file writes it to a file instead of the console.

Compiling our project again, we’ll get a Main.dump-simpl file (the exact location depends on your environment). There, we find something resembling our main function but no sign of checkBrackets that we’re actually focusing on: it got inlined into main. Since that’s the function we care about the most, we’d like to analyze it in isolation, so we add {-# NOINLINE checkBrackets #-} and compile again.
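Concretely, here’s a minimal sketch of where the pragmas go (the rest of Main is as before):

-- at the very top of Main.hs:
{-# OPTIONS_GHC -ddump-simpl -ddump-to-file #-}

-- next to the definition we want to keep as a standalone binding:
{-# NOINLINE checkBrackets #-}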

Generally, preventing inlining is a pessimization, but that’s OK in our case: in short, we’re only calling this function once (so we can ignore any call overhead), and it’s monomorphic (so we don’t need to think about it being specialized for the particular types that we care about). Moreover, keeping checkBrackets as a standalone function can help GHC optimize it more effectively in this context. And this is indeed the case: this annotation improves the performance of the all-in-one-module version to circa 160 ms — that’s C++ speed again!

Anyway, we now get a definition of a function named Main.$wcheckBrackets… spanning about 600 lines. Welp, so much for readability!

Trying to reason about it in its entirety is unrealistic, but we can compare it instead! Let’s add the same pragmas to the first, fast version with all the definitions in the same module and compare the two dumps. Unfortunately, diff won’t cut it: the names for the labels, dummy variables, and so on, are different, and so are minor implementation details, like letrec vs joinrec for our go recursive helper. So, we’ll have to do the comparison ourselves, say, by pulling the two dumps side by side.

The first non-trivial difference comes after we’ve read a byte off the input string via GHC.Prim.readWord8OffAddr# (and GHC.Prim.touch#ed the string finalizer to avoid running it before we’re done with the string). The bad, slow version then calls Data.Stack.$wpush (which looks like a mangled version of our push), passing it a ton of things:
here’s the annotated Core,
Main.$wcheckBrackets
  = \ (ww_s3QY :: ghc-prim:GHC.Prim.Addr#)
      (ww1_s3QZ :: GHC.ForeignPtr.ForeignPtrContents)
      (ww2_s3R0 :: ghc-prim:GHC.Prim.Int#) ->
[...skip 20 lines...]
               letrec {
[...skip 13 lines...]
                 $s$wgo_s3RE
                   = \ (@s1_s3QH)
                       (sc_s3Rj :: ghc-prim:GHC.Prim.Int#)
                       (sc1_s3Rk :: ghc-prim:GHC.Prim.Int#)
                       (sc2_s3Rm
                          :: ghc-prim:GHC.Prim.MutableByteArray# (VM.PrimState (ST s1_s3QH)))
                       (sg_s3Rn
                          :: Data.Vector.Primitive.Mutable.MVector
                               (VM.PrimState (ST s1_s3QH)) Word8
                             ~R# VM.MVector s1_s3QH Word8)
                       (sc3_s3Ri :: ghc-prim:GHC.Prim.Int#)
                       (ww3_s3QP :: ghc-prim:GHC.Prim.Int#)
                       (eta_s3QR [OS=OneShot] :: ghc-prim:GHC.Prim.State# s1_s3QH) ->
-- string length check:
                       case ghc-prim:GHC.Prim.==# sc3_s3Ri ww2_s3R0 of {
                         __DEFAULT ->
-- reading the byte:
                           case ghc-prim:GHC.Prim.readWord8OffAddr#
                                  @RealWorld
                                  (ghc-prim:GHC.Prim.plusAddr# ww_s3QY sc3_s3Ri)
                                  0#
                                  ghc-prim:GHC.Prim.realWorld#
                           of
                           { (# ipv2_a2UA, ipv3_a2UB #) ->
-- GC/finalizers shenanigans:
                           case ghc-prim:GHC.Prim.touch#
                                  @ghc-prim:GHC.Types.Lifted
                                  @GHC.ForeignPtr.ForeignPtrContents
                                  @RealWorld
                                  ww1_s3QZ
                                  ipv2_a2UA
                           of
                           { __DEFAULT ->
                           join {
-- handler for the isOpen branch:
                             $j_s3ON [Dmd=M!P(L,L)]
                               :: (# ghc-prim:GHC.Prim.State# s1_s3QH, Bool #)
                             [LclId[JoinId(0)(Nothing)]]
                             $j_s3ON
                               = case Data.Stack.$wpush
                                        @Word8
                                        @s1_s3QH
                                        Data.Vector.Unboxed.Base.$fUnboxWord8
                                        (GHC.Word.W8# ipv3_a2UB)
                                        ((Data.Vector.Primitive.Mutable.MVector
                                            @s1_s3QH
                                            @Word8
                                            sc_s3Rj
                                            sc1_s3Rk
                                            (sc2_s3Rm
                                             `cast` ((ghc-prim:GHC.Prim.MutableByteArray#
                                                        (SelCo:Tc(0)
                                                             (sg_s3Rn
                                                              ; Data.Vector.Unboxed.Base.D:R:MVectorsWord80[0]
                                                                    <s1_s3QH>_N
                                                              ; Data.Vector.Unboxed.Base.N:R:MVectorsWord8[0]
                                                                    <s1_s3QH>_N)))_R
                                                     :: ghc-prim:GHC.Prim.MutableByteArray#
                                                          (VM.PrimState (ST s1_s3QH))
                                                        ~R# ghc-prim:GHC.Prim.MutableByteArray#
                                                              s1_s3QH)))
                                         `cast` (Sym (Data.Vector.Unboxed.Base.N:R:MVectorsWord8[0]
                                                          <s1_s3QH>_N)
                                                 ; Sym (Data.Vector.Unboxed.Base.D:R:MVectorsWord80[0]
                                                            <s1_s3QH>_N)
                                                 :: Data.Vector.Primitive.Mutable.MVector
                                                      s1_s3QH Word8
                                                    ~R# VM.MVector s1_s3QH Word8))
                                        ww3_s3QP
                                        eta_s3QR
                                 of [...]
while the good, fast version has inlined push:
here’s the Core.
Main.$wcheckBrackets
  = \ (ww_s45M :: ghc-prim:GHC.Prim.Addr#)
      (ww1_s45N :: GHC.ForeignPtr.ForeignPtrContents)
      (ww2_s45O :: ghc-prim:GHC.Prim.Int#) ->
[...skip 20 lines...]
               joinrec {
[...skip 14 lines...]
                 $wgo_s45I (ww3_s45u :: ghc-prim:GHC.Prim.Int#)
                           (ww4_s45A :: ghc-prim:GHC.Prim.Int#)
                           (ww5_s45B :: ghc-prim:GHC.Prim.Int#)
                           (ww6_s45C :: ghc-prim:GHC.Prim.MutableByteArray# RealWorld)
                           (ww7_s45E :: ghc-prim:GHC.Prim.Int#)
                           (eta_s45G [OS=OneShot] :: ghc-prim:GHC.Prim.State# RealWorld)
-- string length check:
                   = case ghc-prim:GHC.Prim.==# ww3_s45u ww2_s45O of {
                       __DEFAULT ->
-- reading the byte:
                         case ghc-prim:GHC.Prim.readWord8OffAddr#
                                @RealWorld
                                (ghc-prim:GHC.Prim.plusAddr# ww_s45M ww3_s45u)
                                0#
                                ghc-prim:GHC.Prim.realWorld#
                         of
                         { (# ipv2_a3gB, ipv3_a3gC #) ->
-- GC/finalizers shenanigans:
                         case ghc-prim:GHC.Prim.touch#
                                @ghc-prim:GHC.Types.Lifted
                                @GHC.ForeignPtr.ForeignPtrContents
                                @RealWorld
                                ww1_s45N
                                ipv2_a3gB
                         of
                         { __DEFAULT ->
                         join {
-- handler for the isOpen branch:
                           $j_s437 [Dmd=ML] :: Bool
                           [LclId[JoinId(0)(Nothing)]]
                           $j_s437
-- stack size check (inlined from `push`):
                             = case ghc-prim:GHC.Prim./=# ww7_s45E ww5_s45B of {
-- "need to grow" branch:
                                 __DEFAULT ->
                                   let {
                                     eta3_s434 :: ghc-prim:GHC.Prim.Int#
                                     [LclId]
                                     eta3_s434 = ghc-prim:GHC.Prim.+# ww5_s45B ww7_s45E } in
-- new stack size integer overflow check:
                                   case ghc-prim:GHC.Prim.<# eta3_s434 0# of {
-- "all good" branch:
                                     __DEFAULT ->
-- allocate the new array:
                                       case ghc-prim:GHC.Prim.newByteArray#
[...skip 15 lines preparing the parameters...]
                                       of
                                       { (# ipv4_a3jb, ipv5_a3jc #) ->
-- copy old array into the new one:
                                       case ghc-prim:GHC.Prim.copyMutableByteArray#
[...skip 33 lines preparing the parameters...]
                                       of s'#_a3jj
                                       { __DEFAULT ->
-- write onto the newly allocated stack:
                                       case ghc-prim:GHC.Prim.writeWord8Array#
                                              @(VM.PrimState (ST (VM.PrimState (ST RealWorld))))
                                              ipv5_a3jc
                                              ww7_s45E
                                              ipv3_a3gC
                                              s'#_a3jj
                                       of s'#1_a3ko
                                       { __DEFAULT ->
-- TCO'd recursive call:
                                       jump $wgo_s45I
[...skip 38 lines preparing the parameters...]
                                       }
                                       }
                                       };
-- new stack size integer overflown:
                                     1# -> case $wlvl_r472 eta3_s434 of wild_00 { }
                                   };
-- "no need to grow" branch of stack size check
                                 1# ->
-- write onto the existing stack:
                                   case ghc-prim:GHC.Prim.writeWord8Array#
[...skip 36 lines preparing the parameters...]
                                   of s'#_a3ko
                                   { __DEFAULT ->
-- TCO'd recursive call:
                                   jump $wgo_s45I
[...skip 14 lines preparing the parameters...]
                                   }

Specialization

The worst part — the absolute performance killer, in fact — is that the non-inlined push doesn’t get specialized. For instance, note the $fUnboxWord8 argument (qualified by the Data.Vector.Unboxed.Base module): it’s the explicitly passed dictionary of the Unbox type class telling the Vector how to box and unbox the Word8s. Dictionaries add significant overhead: each class method call requires a dictionary lookup, which is an extra indirection by itself, and it also blocks the further optimizations that would become possible after inlining those operations.
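To make that overhead tangible, here’s a hand-rolled analogue of what dictionary passing amounts to; an illustration only, not GHC’s actual encoding, with all the names made up:

-- a class like  class MyNum a where add, mul :: a -> a -> a
-- morally becomes a record of its methods:
data MyNumDict a = MyNumDict
  { add :: a -> a -> a
  , mul :: a -> a -> a
  }

-- an overloaded function receives the dictionary as an extra argument,
-- and each method call is a field lookup plus an indirect call:
sumSquares :: MyNumDict a -> a -> a -> a
sumSquares d x y = add d (mul d x x) (mul d y y)

-- a specialized copy has no lookups, and GHC is free to unbox
-- and otherwise optimize it further:
sumSquaresInt :: Int -> Int -> Int
sumSquaresInt x y = x * x + y * y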

So, why does push get neither inlined nor specialized?

It’s easier to give a definite answer for the latter: because GHC only auto-specializes overloaded functions defined in the same module. The GHC documentation says so explicitly:

When compiling a module M, GHC’s optimiser (when given the -O flag) automatically considers each top-level overloaded function declared in M, and specialises it for the different types at which it is called in M. The optimiser also considers each imported INLINABLE overloaded function, and specialises it for the different types at which it is called in M.

Moreover, Well-Typed recently had a couple of excellent and deep blog posts on the specializer’s behavior [1] [2]. There, they say something similar:

Even if GHC decides to expose an overloaded binding’s unfolding, and a specializable call to that binding occurs in another module, GHC will still never automatically specialize that call unless it has been given explicit permission to create the specialization. Such explicit permission can only be given in one of the following ways:

So, we now have two options:

  1. mark push (and other functions for consistency) with the {-# INLINEABLE #-} pragma,
  2. or use the ultimate sledgehammer approach of the -fspecialize-aggressively -fexpose-all-unfoldings flags.
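For the record, here’s what option 1 looks like in Data.Stack (GHC accepts both the INLINEABLE and INLINABLE spellings):

{-# INLINEABLE mkStack #-}
{-# INLINEABLE isEmpty #-}
{-# INLINEABLE push #-}
{-# INLINEABLE pop #-}

Option 2 is just a matter of compiler flags, for example, in the cabal file:

ghc-options: -fexpose-all-unfoldings -fspecialize-aggressively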

Both options fix the performance bug, and we get back our 160 ms-ish run time. We’re done, right?

“Hold on!” — you might say. The other functions like pop or mkStack aren’t marked as INLINEABLE but also don’t give us trouble — how come? To understand that, we need to remember about…

Inlining

As Well-Typed posts above say, GHC is more eager to inline than to specialize, so if it inlines pop or mkStack, then there’s no question of specialization not firing. But what can be inlined, then?

If we dump Data.Stack’s Core, we’ll see that pop gets split into pop and Data.Stack.pop1 (the former merely calling the latter with some runtime-irrelevant type coercions), and both are annotated by, more or less,

Unf=Unf{Src=<vanilla>, TopLvl=True,
        Value=True, ConLike=True, WorkFree=True, Expandable=True,...
        Guidance=ALWAYS_IF(arity=3,unsat_ok=True,boring_ok=False)
        ...}

Unf here stands for “unfolding”, and it contains a bunch of hints to the optimizer. The particular hint we’re most interested in right now is Guidance. ALWAYS_IF here means the definition should always be inlined if the corresponding criteria are met. Going by GHC’s sources, these criteria are, roughly:

  1. arity=3 requires the call site to apply the function to at least three value arguments,
  2. unsat_ok=True relaxes that, allowing unsaturated calls too,
  3. boring_ok=False requires the call site to be “interesting” (for example, the result is scrutinized by a case).

In our case, all these are met, so pop gets inlined.

Let’s now look at mkStack. The “top-level” one, which merely passes the calls to Data.Stack.mkStack1 after some type coercions, is also annotated by

Unf=Unf{Src=<vanilla>, TopLvl=True,
        Value=True, ConLike=True, WorkFree=True, Expandable=True,
        Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=True)}

The latter is defined like this:

Data.Stack.mkStack1
  = \ (@a_a3gB)
      (@s_a3gC)
      ($dUnbox_a3gD :: VM.Unbox a_a3gB)
      (initSize_a2GV [OS=OneShot] :: Int)
      (eta_B0 [OS=OneShot] :: ghc-prim:GHC.Prim.State# s_a3gC) -> ...

and its annotation has IF_ARGS that we haven’t seen before:

Unf=Unf{Src=<vanilla>, TopLvl=True,
        Value=True, ConLike=True, WorkFree=True, Expandable=True,
        Guidance=IF_ARGS [30 20 0] 100 10}

This stuff is documented very sparsely, so I’m very likely mistaken, but again, if I understand this correctly (mostly from reading GHC sources), the “arguments” of IF_ARGS [30 20 0] 100 10 mean the following:

  1. the bracketed list gives a per-argument discount, subtracted from the size when the corresponding argument at the call site is “interesting” (a known value or constructor application),
  2. 100 is the estimated size of the function body,
  3. 10 is a discount applied when the result itself is used in an interesting way (scrutinized by a case, for instance).

There are also additional hardcoded discounts, like a “boost” of 10 for the mere fact of inlining, plus another 10 per supplied function argument.

So, in our case, if all arguments are good, the resulting size is computed as

size 100
- arg1 30
- arg2 20
- arg3 0
- inlining bonus 10
- hardcoded bonus 10
- args bonus 3 * 10
= total 0

which is less than the apparent default 80 of the -funfolding-use-threshold tunable (which is the inlining threshold). Hence, mkStack gets inlined, just as observed.

What about push? It gets split into three functions now:

  1. push, with the ALWAYS_IF guidance, which just calls the…
  2. Data.Stack.push1, also with the ALWAYS_IF guidance, which seems to do the worker/wrapper transformation and then call the…
  3. Data.Stack.$wpush, having Guidance=IF_ARGS [120 20 20 0 0] 393 20.

So even in the best case we’ll get the all-discounted cost of $wpush as 153 (including all the hardcoded discounts mentioned above), which is way above the inlining threshold of 80, so no inlining happens (also just as observed).
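Here’s the corresponding tally, assuming the trailing 20 is the result discount playing the same role the trailing 10 played for mkStack1:

size 393
- arg1 120
- arg2 20
- arg3 20
- arg4 0
- arg5 0
- inlining bonus 10
- result discount 20
- args bonus 5 * 10
= total 153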

It is also worth noting that the 120-discounted argument here is the Unbox dictionary, which is used exactly four times in the function body; since the -funfolding-dict-discount tunable defaults to 30, it looks like the dictionary discount is applied per use, not per argument.

More dumps

Can we ask GHC to verify our hypotheses?

First, we might notice a seemingly relevant flag -ddump-inlinings, which instructs GHC to output which functions it decides to inline. What if we add it to our Main module, undo all our INLINEABLE or -fspecialize-aggressively changes, recompile and grep for Data.Stack?

Inlining done: Data.Stack.mkStack
Inlining done: Data.Stack.mkStack1
Inlining done: Data.Stack.isEmpty
Inlining done: Data.Stack.pop
Inlining done: Data.Stack.push
Inlining done: Data.Stack.isEmpty
Inlining done: Data.Stack.pop1
Inlining done: Data.Stack.push1

This might be a bit confusing, especially if we forget what we’ve seen in Core (and it was confusing for me, since I tried this flag before looking at the Core). Indeed, we see that push (and even something named push1) gets inlined, but the performance is as if it’s not! But now we know it’s because push gets split into three functions, and that the last one, the $wpush worker, doesn’t get inlined (and it isn’t present in this list).

Luckily, there’s another flag: -ddump-verbose-inlinings. This outputs a ton of information (about 17100 lines just for the Main module in our case), but it is very informative! For instance, we can see that the push1 wrapper gets inlined somewhere:

Considering inlining: push1
  arg infos [ValueArg, ValueArg, ValueArg, TrivArg]
  interesting continuation CaseCtxt
  some_benefit True
  is exp: True
  is work-free: True
  guidance ALWAYS_IF(arity=4,unsat_ok=True,boring_ok=False)
  ANSWER = YES

but our $wpush worker does not:

Considering inlining: $wpush
  arg infos [ValueArg, ValueArg, ValueArg, TrivArg, TrivArg]
  interesting continuation CaseCtxt
  some_benefit True
  is exp: True
  is work-free: True
  guidance IF_ARGS [120 20 20 0 0] 393 20
  case depth = 2
  depth based penalty = 0
  discounted size = 153
  ANSWER = NO

Note that this 153 is exactly equal to the 153 we estimated from reading GHC sources.

Exercise for the reader: compare Main’s -ddump-verbose-inlinings output and Data.Stack’s Core for push annotated with:

  1. no pragmas,
  2. {-# INLINEABLE #-},
  3. {-# INLINE #-}.

Other optimizations

While we’re at it, we could try a couple of other things.

Explicit specialization

What happens if we pretend that the specializer has fired (as it did in the original one-module case) by adding the explicit specialization pragma? We can add the following to specialize the push function for Word8:

{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}

We get a rewrite rule that forwards Word8-instantiated calls of push to

Data.Stack.push1 [InlPrag=[2]]
  :: forall {s}.
     Word8 -> Stack s Word8 -> GHC.ST.STRep s (Stack s Word8)

which doesn’t have any dictionary-related overhead, and it also has Guidance=ALWAYS_IF(arity=3,unsat_ok=True,boring_ok=False) inlining hints. This now runs in about 154 ms, which is even faster than our original, all-in-one-module version and comparable to the C++ version compiled with clang.

It’s worth noting that we only specialized push with respect to a ~ Word8 and not the s tag, yet still got a good performance improvement. This is because s represents the state thread of the ST monad and doesn’t have any runtime semantics. GHC recognizes this and can effectively optimize the function without us having to specialize over s. The bottom line is that we don’t always have to fully specialize a polymorphic function to enjoy the performance benefits.
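Indeed, recall runST’s type: the s is universally quantified, exists only to keep mutable state from escaping, and has no runtime representation at all.

runST :: (forall s. ST s a) -> a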

Hot and cold

There’s another optimization technique, and it’s quite universal across languages — in fact, I’ve seen it used the most in the context of high-frequency trading and other systems with very tight latency and throughput requirements.

Let’s split push into two functions: a “hot” one that runs most often, and a “cold” one that only gets called O(1) times (asymptotically never):

push :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
push a s@Stack{..}
  | size /= VM.length theVec = do
      VM.unsafeWrite theVec size a
      pure $ Stack theVec (size + 1)
  | otherwise = pushCold a s

pushCold :: VM.Unbox a => a -> Stack s a -> ST s (Stack s a)
pushCold a Stack{..} = do
  vec' <- theVec `VM.unsafeGrow` size
  push a $ Stack vec' size
{-# NOINLINE pushCold #-}

By marking pushCold as {-# NOINLINE #-}, we prevent GHC from bloating the hot push with this almost-never-needed code, keeping the hot version small enough to inline. Indeed, compiling Main with -ddump-inlinings shows that all three functions GHC splits push into are inlined — without any pragmas on push itself. Moreover, this version has roughly the same 160 ms-ish run time, which shows that splitting hot and cold paths can be an effective optimization strategy (although in this case, it doesn’t offer a significant improvement over explicit specialization).

But, although this might look like a cunning trick, I’d advise against relying merely on {-# NOINLINE #-}ing cold paths in your code, since it’s easily broken by minor code changes or GHC upgrades. It’s still a good idea to annotate hot paths with {-# INLINEABLE #-}.

In our case, this split is probably not worth it, so we revert it and say we’re done.

More brackets!

The interviewer hears we’re done and bumps the problem’s complexity considerably, now asking to match four different kinds of brackets. Fearlessly, we think a little and then make the required changes, modifying our isOpen and matches functions to recognize the four kinds of brackets:

isOpen :: Word8 -> Bool
isOpen [c|(|] = True
isOpen [c|[|] = True
isOpen [c|<|] = True
isOpen [c|{|] = True
isOpen _ = False

matches :: Word8 -> Word8 -> Bool
matches [c|(|] [c|)|] = True
matches [c|[|] [c|]|] = True
matches [c|<|] [c|>|] = True
matches [c|{|] [c|}|] = True
matches _ _ = False

This now runs in about 265 ms with the stack split into a separate module, even with all the {-# INLINEABLE #-} pragmas — about 100 ms, or 65%, more than the 160 ms we had before. That’s quite a jump!

Having a bad feeling, we try the all-in-one-module version again, and it consistently runs in about 230 ms, which is again noticeably faster. If we replace {-# INLINEABLE push #-} with {-# INLINE push #-}, we’ll get 215 ms. Using {-# SPECIALIZE #-} also gives us the same 215 ms if we do it in Data.Stack, or 260 ms if we do it in Main (yes, you can {-# SPECIALIZE #-} inlineable functions in different modules).
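For reference, the cross-module variant is the same pragma placed next to the import (a sketch, assuming push is marked {-# INLINEABLE #-} in Data.Stack):

-- in Main:
import Data.Stack

{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}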

What is going on again?

Aside from the obvious causes, like more branches and higher instruction cache pressure, the added complexity affects GHC’s optimization choices. Remember that GHC can optimize, inline, and specialize, performing these actions in various orders, and not all of them commute. Moreover, remember that GHC exports the pristine, unoptimized body of an {-# INLINE #-} or {-# INLINEABLE #-} function, as the docs say:

It is useful for GHC to optimise the definition of an INLINE function f just like any other non-INLINE function, in case the non-inlined version of f is ultimately called. But we don’t want to inline the optimised version of f; a major reason for INLINE pragmas is to expose functions in f’s RHS that have rewrite rules, and it’s no good if those functions have been optimised away.

So GHC guarantees to inline precisely the code that you wrote, no more and no less. It does this by capturing a copy of the definition of the function to use for inlining (we call this the “inline-RHS”), which it leaves untouched, while optimising the ordinary RHS as usual. For externally-visible functions the inline-RHS (not the optimised RHS) is recorded in the interface file.

We now have several scenarios, and among the most interesting ones:

  1. optimize-inline-(specialize) (all-in-one module):
    1. push’s body gets optimized as-is.
    2. The optimized body gets inlined into checkBrackets.
    3. Inlining also substitutes lookups of the Unbox class methods in the dictionary argument with the methods specific to Word8, so there’s no explicit specialization step, hence the parentheses around it in the scenario name.
  2. specialize-optimize-inline-1 (with the {-# INLINEABLE #-} pragma):
    1. push’s body is exported as is, unoptimized.
    2. It’s too big to get inlined, so the specializer kicks in as a fallback.
    3. The specialized body is now small and nice enough to get inlined into checkBrackets.
  3. inline-specialize-optimize (with the {-# INLINE #-} pragma):
    1. push’s body is exported as is, unoptimized.
    2. It’s too big to get inlined, but nobody cares and GHC shoves it into the caller.
    3. The inlined body is then optimized in the context of checkBrackets, together with it.
  4. specialize-optimize-inline-2 (with {-# SPECIALIZE #-} in Data.Stack):
    1. push’s body gets specialized to Word8 while compiling Data.Stack.
    2. The specialized body gets optimized while still compiling Data.Stack.
    3. The specialized and optimized body gets exported.
    4. It’s small and nice, so it gets inlined into checkBrackets.
  5. specialize-optimize-inline-3 (with {-# INLINEABLE #-} in Data.Stack and {-# SPECIALIZE #-} in Main):
    1. push’s body is exported as is, unoptimized.
    2. push’s body gets specialized to Word8 while compiling Main.
    3. The specialized body gets optimized while compiling Main.
    4. It’s small and nice, so it gets inlined into checkBrackets.

So, each scenario represents a different sequence of optimization steps, leading to varying performance.

Exercise for the reader: what happens in scenarios (2) and (3) if the caller function (checkBrackets in our case) is not used in the same module it’s defined in, but is exported instead?

A hint: things may get ugly because the unoptimized body of the caller is exported.

More modules!

What’s curious, and what I don’t have a good explanation for, is that scenarios (4) and (5) have vastly different performance characteristics, and the Core of checkBrackets is noticeably different, even though the Core of the specialized push is the same in both. Although, to see the specialized push body at all when specializing in Main, one has to write {-# SPECIALIZE NOINLINE #-}: otherwise, it gets inlined and removed as a separate binding, and the NOINLINE itself might cause a different Core than the version without it.

What’s even more interesting is that if we add yet another module that merely imports Data.Stack, specializes push, and re-exports it, we’ll get the same numbers as when it was specialized in Data.Stack itself. Apparently, GHC prefers to rely on imported, already specialized and optimized functions, possibly due to how it manages optimization workloads and inlining decisions across module boundaries.
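Such a module can be as small as this (a sketch; the module name is made up, and push is assumed to be {-# INLINEABLE #-} in Data.Stack, as above):

module Data.Stack.SpecWord8 (module Data.Stack) where

import Control.Monad.ST (ST)
import Data.Stack
import Data.Word (Word8)

{-# SPECIALIZE push :: Word8 -> Stack s Word8 -> ST s (Stack s Word8) #-}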

And, since it’s only observed when the surrounding function becomes more complex (by adding more cases, for instance), I’d guess GHC somehow relies on the imported function even if it is specialized in the current module, either explicitly (via {-# SPECIALIZE #-}) or implicitly (by merely being {-# INLINEABLE #-}). At least, this hypothesis is consistent with observations, where both scenarios result in similar performance and similar Core.

Where does this leave us?

I like to say the hardest part of Haskell programming is reasoning about performance, and today I stand firm in my beliefs: we’ve seen how seemingly simple refactoring and code changes can have profound impacts on performance due to the intricacies of GHC’s optimization process. In short, reasoning about optimization is a mess.

But some principles and observations may help:

  1. GHC only auto-specializes overloaded functions in the module they are defined in; for cross-module specialization, it needs {-# INLINEABLE #-}, an explicit {-# SPECIALIZE #-}, or the -fexpose-all-unfoldings/-fspecialize-aggressively sledgehammer.
  2. GHC is more eager to inline than to specialize, and its inlining decisions are driven by the size heuristics we can inspect with -ddump-verbose-inlinings.
  3. {-# INLINE #-} and {-# INLINEABLE #-} export the pristine, unoptimized body of a function, so the order in which optimization, inlining, and specialization happen (and hence the resulting performance) depends on where each of them fires.

Now, is it generally beneficial to mark functions as {-# INLINE #-} or {-# INLINEABLE #-}? Is there any universal, silver bullet-like advice? Unfortunately, I don’t think so. Nevertheless, a good rule of thumb might be to annotate small, overloaded functions on hot paths with {-# INLINEABLE #-}, and to recheck the dumps and rerun the benchmarks whenever a refactoring crosses module boundaries.

Last but not least, experiment on your own and play around with some toy examples! This helps develop an intuition for what works best in which context, and there’s no better way to learn than by doing.
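For example, all the dumps from this post can be regenerated with something along these lines (the flags are the ones discussed above; -fforce-recomp merely makes GHC recompile even if nothing has changed):

ghc -O2 -fforce-recomp -ddump-simpl -ddump-to-file -ddump-verbose-inlinings Main.hs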