Let's run some NFAs

September 12, 2024

Lately, I’ve been playing around with memoized NFAs for optimized regular expression matching, with features like lookahead and atomic groups, based on this paper. The original authors have their code in Scala, and I thought it’d be fun to code something in Haskell to see how it stacks up against their new implementation and the prior art.

But before diving into memoization and the more complex features, let’s start with the basics. In this post, we’ll focus on a simple, naive backtracking NFA implementation. We’ll start with the simplest, regexp 101 code and then make it significantly faster, step by step. We’ll also inevitably face some dead ends — that’s part of learning and experimentation, too!

To ground our work in reality, I’ll also implement some of the algorithms in C++, praised for its performance advantages over pretty much everything else. Is the praise deserved here? Let’s find out.

Base baseline

NFAs

Our NFA states come from a type q that fulfills some constraints:

type StateId q = (Integral q, Enum q)

Typically, we’ll have q ~ Word32, but it’s nice to have it abstracted away: it’s easier to change the type this way, and we also won’t mistake q for anything else (like string length or position) even if it’s instantiated to the same type. Yay type safety!

An NFA consists of transitions between those states, a particular initial state, and a set of final states. A transition can be an ε-transition, a branch, or a single-byte character match:

data Trans q
  = TEps q
  | TBranch q q
  | TCh Word8 q
  deriving (Eq, Show)

The NFAs I’m working with have only one final state by construction, so the type describing an NFA itself looks like this:

import Data.EnumMap.Strict qualified as EM

type TransMap q = EM.EnumMap q (Trans q)

data NFA q = NFA
  { transitions :: TransMap q
  , initState :: q
  , finState :: q
  }

Here, EnumMap is a wrapper around IntMap that supports arbitrary Enum types and comes from the enummapset package. My previous experience shows that for types with cheap toEnum/fromEnum (like our Word32) it’s much more efficient than the usual HashMap from unordered-containers, so we’ll go with it straight away.

As usual, I’ll also have {-# LANGUAGE Strict #-} from the get-go — it’s almost always beneficial for problems like this.
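For concreteness, here's a module header sketch that the basic snippets below roughly assume (my reconstruction, not the post's actual code). Strict is the important part; later sections need a few more extensions, like LinearTypes and DerivingVia:

{-# LANGUAGE Strict #-}          -- strict by default, as discussed above
{-# LANGUAGE RecordWildCards #-} -- for the NFA{..} pattern
{-# LANGUAGE LambdaCase #-}      -- for \case in later snippets
{-# LANGUAGE ConstraintKinds #-} -- for the StateId constraint synonym

import Data.ByteString qualified as BS
import Data.Word (Word32, Word8)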

Matching

In my experiments, the result of matching an NFA needs to keep the index where it’s matched, so we’ll define a Maybe-like type for this:

data MatchResult a = SuccessAt a | Failure
  deriving (Eq, Ord, Show, Functor)

instance Applicative MatchResult where
  SuccessAt f <*> SuccessAt v = SuccessAt $ f v
  _           <*> _           = Failure
  pure = SuccessAt

instance Alternative MatchResult where
  SuccessAt a <|> ~_ = SuccessAt a
  _           <|> ~r = r
  empty = Failure

Note the explicit laziness annotation on <|>’s second argument. Since strictness is enabled by default, omitting it would result in always computing both branches, which is definitely not what we want.

Now, it’s time for the first NFA matching implementation!

getTrans :: StateId q => q -> TransMap q -> Trans q
getTrans q m = case q `EM.lookup` m of
                 Just t -> t
                 Nothing -> error "invariant failure"
                 -- ^ alas, we could've proven statically this never happens in a stronger language

match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0
  where
  go q i | q == finState = SuccessAt i
  go q i = case q `getTrans` transitions of
             TEps q' -> go q' i
             TBranch q1 q2 -> go q1 i <|> go q2 i
             TCh ch q'
               | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1)
               | otherwise -> Failure

This is most likely what you’d expect from a basic backtracking implementation:

  1. If we’re in the final state, we’re done and return the match.
  2. Otherwise, look up the transition from our current state:
    1. If it’s an ε-transition, just try it out.
    2. If it’s branching into states q1 and q2, try the first one first. If it fails, try the second one.
    3. If it’s a byte matcher, check that our current string position is within the string bounds and that the byte at this string is what we expect (indexMaybe does both, safely). If it’s all good, go to the next state and increment the index. Otherwise, fail.

Benchmarking

We won’t be doing any NFA parsing here, and for the purposes of this post we’ll use a fairly simple regexp (aa|ab)*z.

Here’s the definition.
nfa :: NFA Word32
nfa = NFA{..}
  where
  initState = 0
  finState = 13
  transitions = EM.fromList
    [ (0, TBranch 2 1)
    , (1, TEps 12)
    , (2, TBranch 4 8)
    , (3, TEps 0)
    , (4, TCh a 5)
    , (5, TEps 6)
    , (6, TCh a 7)
    , (7, TEps 3)
    , (8, TCh a 9)
    , (9, TEps 10)
    , (10, TCh b 11)
    , (11, TEps 3)
    , (12, TCh z 13)
    ]
  a = 97
  b = 98
  z = 122

We have a few test strings, which we keep in separate files on a tmpfs partition: aa...aa, aa...aaz, ab...ab, and ab...abz. In each of these, the aa/ab pair is repeated 10 million times, so the overall length is 20 (decimal) megabytes (plus one byte if the string ends with z).
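If you want to reproduce this, here's one way such inputs could be generated (a sketch; the file names are made up):

import Data.ByteString.Char8 qualified as BS8

-- Hypothetical input generator: 10 million repetitions of a two-character
-- block, optionally followed by a single trailing 'z'.
writeInput :: FilePath -> String -> Bool -> IO ()
writeInput path block endsWithZ =
  BS8.writeFile path $ BS8.pack $
    concat (replicate 10000000 block) ++ (if endsWithZ then "z" else "")

main :: IO ()
main = do
  writeInput "aa.in"  "aa" False
  writeInput "aaz.in" "aa" True
  writeInput "ab.in"  "ab" False
  writeInput "abz.in" "ab" True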

We mmap these files to avoid any unnecessary costs of copying the data around, so our main looks like this:

main :: IO ()
main = do
  path <- getArgs <&> \case [path] -> path
                            _ -> error "wrong usage"
  str <- mmapFileByteString path Nothing
  print $ match nfa str

Let’s see what happens if we compile this with -O2 and run it on each input, recording what +RTS -s shows:

String      MUT time, s   GC time, s   Total time, s
aa...aa     0.63          0.31         0.95
aa...aaz    0.46          0.32         0.78
ab...ab     0.54          0.16         0.70
ab...abz    0.48          0.16         0.63

This is interesting! Our expected best case is indeed the best, and the worst case is indeed the worst, but only if we look at the MUT time (the time actually spent in our program’s own computation, which is what we care about). For the first two strings, the GC time is precisely twice as much as for the other two, and having any GC at all might be unexpected. Indeed, what is there to collect if we’re merely traversing the input string and don’t allocate anything?

Except we do: stack space and thunks. All those recursive go invocations aren’t free, and we might hypothesize that GHC can’t tail-call-optimize them away due to the <|> expression in the TBranch case, which is not tail-recursive and requires allocating a thunk for each delayed go invocation. Indeed, how would you TCO that?

One way to verify is to look at the GHC Core, but sometimes that’s just too much work! There’s another quick and dirty way to check: just remove the alternative and see what happens on the input that succeeds and almost never backtracks — our aa...aaz string.
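Concretely, that quick-and-dirty variant might look like this (only the TBranch case changes, and it’s obviously no longer a correct matcher for inputs that need backtracking):

match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0
  where
  go q i | q == finState = SuccessAt i
  go q i = case q `getTrans` transitions of
             TEps q' -> go q' i
             -- always take the first branch: no <|>, so every call to go is a tail call
             TBranch q1 _ -> go q1 i
             TCh ch q'
               | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1)
               | otherwise -> Failure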

If we do that, our benchmark runs in 0.28s MUT time and zero GC (as opposed to 0.46s MUT and 0.32s GC). Now, GHC is apparently able to TCO it away (and no thunks are created), so there is nothing to GC (as expected), and the MUT time is faster since the recursive go call is cheaper (as perhaps not so expected).

But what about C++?

Ok, it’s time for some baseline C++ code.

I’m doing a fairly straightforward port:

using Q = uint32_t;

struct TEps { Q q; };
struct TBranch { Q q1; Q q2; };
struct TCh { char ch; Q q; };

using Trans = std::variant<TEps, TBranch, TCh>;

struct NFA
{
	std::unordered_map<Q, Trans> transitions;
	Q initState;
	Q finState;
};

template<typename... Ts>
struct overloaded : Ts... { using Ts::operator()...; };

struct Match
{
	NFA nfa;
	std::string_view string;
	const size_t size = string.size();

	std::optional<int> GoRec()
	{
		return GoRec(nfa.initState, 0);
	}

	std::optional<int> GoRec(Q state, size_t idx)
	{
		if (state == nfa.finState)
			return idx;

		return std::visit(overloaded {
				[=, this](TEps eps) { return GoRec(eps.q, idx); },
				[=, this](TBranch b)
				{
					if (const auto res = GoRec(b.q1, idx))
						return res;
					return GoRec(b.q2, idx);
				},
				[=, this](TCh ch)
				{
					if (idx < size && string[idx] == ch.ch)
						return GoRec(ch.q, idx + 1);
					else
						return std::optional<int> {};
				},
			}, nfa.transitions[state]);
	}
};
The rest of the boilerplate.
std::string_view GetFile(const char *path)
{
	int fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror("Error opening file");
        return {};
    }

    struct stat sb;
    if (fstat(fd, &sb) == -1) {
        perror("Error getting the file size");
        close(fd);
        return {};
    }

    size_t filesize = sb.st_size;

    auto mapped = mmap(nullptr, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) {
        perror("Error mapping the file");
        close(fd);
        return {};
    }

    close(fd);

    return { static_cast<const char*>(mapped), filesize };
}

int main(int, char **argv)
{
	NFA nfa
	{
		{
			{  0, TBranch { 2, 1 } },
			{  1, TEps { 12 } },
			{  2, TBranch { 4, 8 } },
			{  3, TEps { 0 } },
			{  4, TCh { 'a', 5 } },
			{  5, TEps { 6 } },
			{  6, TCh { 'a', 7 } },
			{  7, TEps { 3 } },
			{  8, TCh { 'a', 9 } },
			{  9, TEps { 10 } },
			{ 10, TCh { 'b', 11 } },
			{ 11, TEps { 3 } },
			{ 12, TCh { 'z', 13 } },
		},
		0,
		13,
	};

	auto str = GetFile(argv[1]);
	if (const auto res = Match { nfa, str }.GoRec())
		std::cout << *res << '\n';
	else
		std::cout << "failure\n";
}

Let’s compile and run it… and get a segfault. Of course, this is stack-heavy, and neither gcc nor clang TCO that away, so we get a stack overflow. Luckily, both compilers have the -fsplit-stack option, effectively transforming the stack into a somewhat std::deque-like structure, growing as needed. Sure, it has some performance hit, but since all our operations on the stack are more or less local, the CPU cache should hide most of that latency away. Moreover, I haven’t even started low-level optimizations on the Haskell code, so I’m not doing that for C++!

Anyway, here are the results (again, best of ten runs):

String      g++ -O2, s   g++ -O3, s   clang -O2, s   clang -O3, s
aa...aa     6.8          9.3          4.7            2.7
aa...aaz    6.5          8.9          4.7            2.6
ab...ab     7.0          9.6          4.8            2.7
ab...abz    6.8          9.7          4.5            2.8

This is ridiculously slow! gcc is about an order of magnitude slower, and clang is 3-4 times slower than Haskell. I’m not feeling that bad now about Haskell and GC and no TCO and whatnot!

The battle of the stack

Nevertheless, let’s see how we can help the compiler and the run time system.

RTS options

The most straightforward attack direction is playing around with the RTS options controlling the GC behavior. In particular, it is worth fiddling with the allocation area size (-A), the initial thread stack size (-ki), the stack chunk size (-kc), and the stack chunk buffer size (-kb).

Since the stack operations here are inherently spatially and temporally close, the locality of the stack chunks shouldn’t matter much, and we can expect bigger numbers to give better performance.

Indeed, on my machine, the best options are something like -A128m -w -ki64m -kc8m, with -kb either playing no significant role or making matters worse (yay reasonable defaults!). But both -ki and -kc do have quite a positive impact, especially on GC times, proving once again that it’s all about the stack and that we’re on the right track:

String      MUT time, s   GC time, s   Total time, s   % of untuned baseline total
aa...aa     0.51          0.16         0.67            71%
aa...aaz    0.36          0.16         0.53            68%
ab...ab     0.48          0.04         0.52            74%
ab...abz    0.40          0.04         0.44            70%

Since we’ve got this improvement so cheaply, I’ll call this the tuned baseline from now on, and I’ll also be running the subsequent benchmarks with these RTS options.

Of course, this is not a good solution — libraries can’t (and aren’t supposed to) force these options on the consuming executable, and even if they could, it feels wrong to set these fairly memory-hungry settings just for the sake of some corner cases. So, let’s keep digging.

Explicit continuations

The next most straightforward approach is to explicitly pass the continuation:

match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 Failure
  where
  go q i ~cont
    | q == finState = SuccessAt i
    | otherwise = case q `getTrans` transitions of
             TEps q' -> go q' i cont
             TBranch q1 q2 -> go q1 i (go q2 i cont)
             TCh ch q'
               | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) cont
               | otherwise -> cont

Let’s run this:

String      MUT time, s   GC time, s   Total time, s   % of tuned baseline total
aa...aa     0.70          0.62         1.32            197%
aa...aaz    0.27          0.60         0.87            164%
ab...ab     0.63          0.31         0.94            181%
ab...abz    0.46          0.31         0.77            175%

Nope, nope, nope, nope. This is terrible.

The problem is that we’re creating a lazy closure on every recursive call, while we only really need to do it when we branch. We could make the continuation parameter strict and explicitly wrap it in something lazy where needed:

data Lazy a = Lazy { unLazy :: ~a }

match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 (Lazy Failure)
  where
  go q i cont
    | q == finState = SuccessAt i
    | otherwise = case q `getTrans` transitions of
             TEps q' -> go q' i cont
             TBranch q1 q2 -> go q1 i (Lazy $ go q2 i cont)
             TCh ch q'
               | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) cont
               | otherwise -> unLazy cont

but it only shaves off a couple of dozen ms, so it’s not worth it, and it’s probably not even worth putting down a table here.

All in all, our attempt to outsmart the compiler: failed.

Explicit stack

But all is not lost! We can just have an explicit stack with the states to explore instead. How do we represent that?

Vectors

We might use a Data.Vector for the stack, like this:

match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 mempty
  where
  go q i stack
    | q == finState = SuccessAt i
    | otherwise = case q `getTrans` transitions of
             TEps q' -> go q' i stack
             TBranch q1 q2 -> go q1 i (stack `V.snoc` (q2, i))
             TCh ch q'
               | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) stack
               | Just (stack'', (q'', i'')) <- V.unsnoc stack -> go q'' i'' stack''
               | otherwise -> Failure

This requires a kind of inversion of control: we’re popping a value off the stack and trying it out when we would otherwise fail, and that happens in the byte matcher. Of course, we could refactor it into a separate function, but I’d like to keep these matchers similar, even if at the cost of the readability of each individual matcher.

Anyway, this is a bad idea since all these snocs and unsnocs require the stack to be copied because of immutability. Moreover, we can’t hope GHC fuses these operations away as the control flow depends crucially on the inputs which are only known at run time. And, in fact, if you try this out, you’d have to wait a long time before the benchmark finishes. I, for one, just ^C’ed it — no point in getting the exact numbers if it’s more than a few seconds.

Lists

Or, we might have our good old friend from when we were just learning Haskell — the list. The only changes compared to the vector are for the branching and for the failure handling, replacing V.snoc with : and V.unsnoc with pattern matching on the list.
That is, the code looks like this.
match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 mempty
  where
  go q i stack
    | q == finState = SuccessAt i
    | otherwise = case q `getTrans` transitions of
             TEps q' -> go q' i stack
             TBranch q1 q2 -> go q1 i ((q2, i) : stack)
             TCh ch q'
               | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) stack
               | ((q'', i'') : stack'') <- stack -> go q'' i'' stack''
               | otherwise -> Failure

String      MUT time, s   GC time, s   Total time, s   % of tuned baseline total
aa...aa     0.54          1.06         1.61            240%
aa...aaz    0.29          1.05         1.34            253%
ab...ab     0.50          0.48         0.97            187%
ab...abz    0.38          0.48         0.86            195%

This is still a no-go, but at least I can wait until the benchmark finishes.

Linear vectors

Looks like immutability really hurts here. Typically, our next attempt would be to have local mutability in ST, but in recent GHC, we can try something else: linear types and, in particular, linear vectors from the linear-base package. No need for ST monads, and the code is a more or less straightforward port of the previous version, except that, to make the linearity checker happy, we need to wrap the result in Ur (it is unrestricted, unlike the stack itself), explicitly consume the stack with lseq on the paths where it is no longer needed, and use the linear versions of the usual combinators (hence all the L. prefixes).

The code now looks like this:

import Data.Unrestricted.Linear qualified as L
import Data.Vector.Mutable.Linear qualified as VL
import Prelude.Linear qualified as L

match :: forall q. StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = L.unur L.$ VL.empty L.$ go initState 0
  where
  go :: q -> Int -> VL.Vector (q, Int) %1-> L.Ur (MatchResult Int)
  go q i stack
    | q == finState = stack `L.lseq` L.Ur (SuccessAt i)
    | otherwise = case q `getTrans` transitions of
                    TEps q' -> go q' i stack
                    TBranch q1 q2 -> go q1 i L.$ (q2, i) `VL.push` stack
                    TCh ch q'
                      | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) stack
                      | otherwise -> case VL.pop stack of
                                      (L.Ur top, stack'')
                                        | (Just (q'', i'')) <- top -> go q'' i'' stack''
                                        | otherwise -> stack'' `L.lseq` L.Ur Failure

Is the result worth making the linearity checker happy?

String      MUT time, s   GC time, s   Total time, s   % of tuned baseline total
aa...aa     0.58          0.77         1.35            201%
aa...aaz    0.36          0.78         1.14            215%
ab...ab     0.58          0.46         1.04            200%
ab...abz    0.48          0.45         0.93            211%

This is still worse than the GHC/RTS-managed stack, which is sad. One might expect improved performance due to cache friendliness, since vectors are by definition a contiguous data structure (even though we don’t do any random accesses). Except not really: these are boxed vectors, so it’s effectively an array of pointers, which cancels out most of the benefit. One might expect better performance with unboxed vectors, but alas, linear-base does not have them (yet).

Good ol’ ST

Alright, time for our good old performance savior, the mutable vectors living in the ST monad.

Here, I’ll be doing a terrible thing: preallocating the whole vector with a hardcoded capacity. I know it should be enough for these tests, and it allows comparing the performance vs. Data.Array (which doesn’t grow as easily) much more directly. Here’s the ugly ST thing:

import Data.Vector.Unboxed.Mutable qualified as VM

match :: (VM.Unbox q, StateId q) => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = runST $ do
  stack <- VM.unsafeNew 24_000_000
  let go s q i
        | q == finState = pure $ SuccessAt i
        | otherwise = case q `getTrans` transitions of
              TEps q' -> go s q' i
              TBranch q1 q2 -> do VM.unsafeWrite stack s (q2, i)
                                  go (s + 1) q1 i
              TCh ch q'
                | bs `BS.indexMaybe` i == Just ch -> go s q' (i + 1)
                | s == 0 -> pure Failure
                | otherwise -> do (q'', i'') <- VM.unsafeRead stack (s - 1)
                                  go (s - 1) q'' i''
  go 0 initState 0

And now we’re talking!

String      MUT time, s   GC time, s   Total time, s   % of tuned baseline total
aa...aa     0.44          0.000        0.44            66%
aa...aaz    0.27          0.000        0.27            51%
ab...ab     0.46          0.000        0.46            88%
ab...abz    0.38          0.000        0.38            86%

ST is definitely worth it! GC time is now precisely zero, and MUT times also got slightly faster. This also means that GHC managed to tail-call optimize go even though it’s technically in a monad, where go is syntactically not the last operation: >>= is.

Out of curiosity and to see if linear unboxed vectors could be worth it, how would ST-based mutable boxed vectors perform? I’m not showing all numbers here (they’re in the final graph, though), but if we replace the import with Data.Vector.Mutable and remove the VM.Unbox q constraint, we’ll get the numbers quite similar to the linear vectors case, suggesting that boxing is indeed to blame.
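For reference, that boxed variant is the same function with just the import and the constraint changed:

import Data.Vector.Mutable qualified as VM  -- boxed this time

match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = runST $ do
  stack <- VM.unsafeNew 24_000_000
  let go s q i
        | q == finState = pure $ SuccessAt i
        | otherwise = case q `getTrans` transitions of
              TEps q' -> go s q' i
              TBranch q1 q2 -> do VM.unsafeWrite stack s (q2, i)
                                  go (s + 1) q1 i
              TCh ch q'
                | bs `BS.indexMaybe` i == Just ch -> go s q' (i + 1)
                | s == 0 -> pure Failure
                | otherwise -> do (q'', i'') <- VM.unsafeRead stack (s - 1)
                                  go (s - 1) q'' i''
  go 0 initState 0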

We could also replace the unboxed Vector with an STUArray, but that gives me the same results while having uglier code.
match :: (StateId q, (forall s. A.MArray (A.STUArray s) q (ST s))) => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = runST $ do
  qstack <- mkArr 24_000_000
  istack <- mkArr 24_000_000
  let go s q i
        | q == finState = pure $ SuccessAt i
        | otherwise = case q `getTrans` transitions of
              TEps q' -> go s q' i
              TBranch q1 q2 -> do A.unsafeWrite qstack s q2
                                  A.unsafeWrite istack s i
                                  go (s + 1) q1 i
              TCh ch q'
                | bs `BS.indexMaybe` i == Just ch -> go s q' (i + 1)
                | s == 0 -> pure Failure
                | otherwise -> do q'' <- A.unsafeRead qstack (s - 1)
                                  i'' <- A.unsafeRead istack (s - 1)
                                  go (s - 1) q'' i''
  go 0 initState 0
  where
  mkArr :: A.MArray (A.STUArray s) e (ST s) => Int -> ST s (A.STUArray s Int e)
  mkArr len = A.unsafeNewArray_ (0, len)

Data.Array doesn’t do the Array of Structs → Struct of Arrays transformation on tuples of unboxable types, so we have to do that ourselves: hence, we have two separate arrays. Since the numbers are the same within measurement errors (so I’m not showing them), it looks like Data.Vector compiles to quite optimal code in this particular case.

Also, note the extra syntactic noise from the quantified constraint and the separate mkArr definition just to guide type inference. Programming with Array directly is meh.

Looks like we’re out of easily reachable options, so let’s return to our favourite language.

But what about C++?

If we write more or less similar code in C++,
struct Match
{
    ...

	std::vector<std::pair<Q, size_t>> stack;

	std::optional<int> GoStack()
	{
		stack.reserve(24'000'000);
		return GoStack(nfa.initState, 0);
	}

	std::optional<int> GoStack(Q state, size_t idx)
	{
		if (state == nfa.finState)
			return idx;

		return std::visit(overloaded {
				[=, this](TEps eps) { return GoStack(eps.q, idx); },
				[=, this](TBranch b)
				{
					stack.push_back({ b.q2, idx });
					return GoStack(b.q1, idx);
				},
				[=, this](TCh ch)
				{
					if (idx < size && string[idx] == ch.ch)
						return GoStack(ch.q, idx + 1);
					else if (stack.empty())
						return std::optional<int> {};
					else
					{
						auto [q, i] = stack.back();
						stack.pop_back();
						return GoStack(q, i);
					}
				},
			}, nfa.transitions[state]);
	}
};

it also segfaults with both gcc and clang if -fsplit-stack is omitted, suggesting neither compiler does TCO in this case, which is quite unfortunate.

With -fsplit-stack, though, the results are even worse than before:

String      g++ -O2, s   g++ -O3, s   clang -O2, s   clang -O3, s
aa...aa     21           19           9.3            5.3
aa...aaz    12           11           5.6            3.2
ab...ab     20           19           9.2            5.2
ab...abz    16           16           7.9            4.4

Okay, so C++ compilers are bad at recursion, as we might have expected. Let’s see if we can help the compiler by manually converting that to a loop.
	std::optional<int> GoLoop()
	{
		stack.reserve(24'000'000);

		Q q = nfa.initState;
		size_t i = 0;

		bool keepGoing = true;

		while (keepGoing)
		{
			if (q == nfa.finState)
				return i;

			std::visit(overloaded {
					[&](TEps eps) { q = eps.q; },
					[&](TBranch b)
					{
						stack.push_back({ b.q2, i });
						q = b.q1;
					},
					[&](TCh ch)
					{
						if (i < size && string[i] == ch.ch)
						{
							++i;
							q = ch.q;
						}
						else if (stack.empty())
							keepGoing = false;
						else
						{
							std::tie(q, i) = stack.back();
							stack.pop_back();
						}
					},
				}, nfa.transitions[q]);
		}

		return {};
	}

Indeed, this doesn’t segfault without -fsplit-stack, and it performs much better, like 10-50 times better:

String      g++ -O2, s   g++ -O3, s   clang -O2, s   clang -O3, s
aa...aa     0.46         0.46         0.49           0.48
aa...aaz    0.35         0.34         0.39           0.34
ab...ab     0.42         0.43         0.50           0.43
ab...abz    0.36         0.37         0.42           0.36

This is on par with Haskell, with the first two cases being marginally slower and the other two slightly faster. Not bad!

An attentive reader might object that the Haskell version doesn’t do anything analogous to push_back()/pop_back(). It just allocates an uninitialized chunk of memory and proceeds with it, while push_back must check if there’s enough memory and potentially grow the vector (even if it never happens on our test data). Sure, since it never happens during execution, the CPU’s branch predictor learns that quickly, but it’s still some extra code that might also prevent further optimizations. Moreover, getting the last element happens via a pair of back()/pop_back() calls. Looks like the compilers optimize that quite nicely, but that’s nevertheless a valid objection.

Let’s also try a version where we keep the "pointer" to the top of the stack, just like in Haskell. That is, we resize() the vector upfront and track the index of the top element explicitly instead of calling push_back()/pop_back().

The results are curious:

String      g++ -O2, s   g++ -O3, s   clang -O2, s   clang -O3, s
aa...aa     0.52         0.48         0.56           0.54
aa...aaz    0.37         0.36         0.23           0.23
ab...ab     0.45         0.40         0.49           0.51
ab...abz    0.39         0.38         0.25           0.26

Some cases got even better, others got worse, which is somewhat unexpected. My intuition is that resize() causes eager initialization of all its elements, which isn’t necessary if the whole capacity of our stack isn’t used, but we’ll check that later.

Overall, it’s fair to say we won the battle for the stack.

Transition tables

Now let’s examine the other part of the problem: our transition tables.

The EnumMap default

I mentioned that EnumMap instead of HashMap should be the default choice for types like Word32. But is that actually the case? If we replace our EnumMap with a HashMap, we’ll get the following results (NB here we’re comparing against EnumMap):

String      MUT time, s   GC time, s   Total time, s   % of EnumMap
aa...aa     0.78          0.000        0.78            177%
aa...aaz    0.50          0.000        0.50            185%
ab...ab     0.77          0.000        0.77            167%
ab...abz    0.66          0.000        0.66            174%

Indeed, the run time with the HashMap is generally 1.7-1.8 times worse than with the EnumMap. Overall, EnumMap is a good choice for a map, so our intuition was correct.

Why maps?

But do we really need a map in the first place?

An observant reader might notice that the states in our test NFA are indexed by a dense increasing sequence of numbers, and the final state is the state with the largest index. While my regexp parsing/NFA conversion functions don’t guarantee that, we can always reindex the states appropriately. Then, we can avoid EnumMap/std::unordered_map and just use a vector instead. In this case, we don’t even have to bother with mutability or linearity since we never modify that vector once the NFA is constructed.

So let’s do that! We merely change TransMap and getTrans to become

import Data.Vector qualified as V

type TransMap q = V.Vector (Trans q)

getTrans :: StateId q => q -> TransMap q -> Trans q
getTrans q m = m V.! fromIntegral q
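
Correspondingly, the transition table of the test NFA can now be built with V.fromList, with the position in the list serving as the state id (a sketch mirroring the earlier definition):

nfa :: NFA Word32
nfa = NFA{..}
  where
  initState = 0
  finState = 13
  transitions = V.fromList
    [ TBranch 2 1 -- 0
    , TEps 12     -- 1
    , TBranch 4 8 -- 2
    , TEps 0      -- 3
    , TCh a 5     -- 4
    , TEps 6      -- 5
    , TCh a 7     -- 6
    , TEps 3      -- 7
    , TCh a 9     -- 8
    , TEps 10     -- 9
    , TCh b 11    -- 10
    , TEps 3      -- 11
    , TCh z 13    -- 12
    ]
  a = 97
  b = 98
  z = 122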

Having done that, we get:

String      MUT time, s   GC time, s   Total time, s   % of EnumMap   % of tuned baseline total
aa...aa     0.28          0.000        0.28            64%            42%
aa...aaz    0.20          0.000        0.20            74%            38%
ab...ab     0.25          0.000        0.25            54%            48%
ab...abz    0.21          0.000        0.21            55%            48%

A nice improvement for a simple change!

Interestingly, the performance of both V.! and V.unsafeIndex is roughly the same in this case. Moreover, perhaps less surprisingly, the -A128m -w -ki64m -kc8m incantation no longer has any measurable effect — we do no GC, so the GC RTS options don’t matter. From now on, we can omit these options if we wish.

But, as we’ve learned from our previous experiments, boxed vs unboxed vectors do make a difference. It’s a shame that we cannot use unboxed ones here since Trans q is a non-trivial sum type. Or can we?

Unboxed vectors again

We can map the Trans q type to, say, Word64, reserving two bits for the type tag and packing the corresponding constructor arguments into the rest of the bits. If we don’t have too many states (say, less than 2^n), we can pack each q into n bits of the Word64. All this is better explained in code:

tagShift :: Int
tagShift = 60

toWord64 :: Integral q => Trans q -> Word64
toWord64 = \case
  TEps q        -> 0b01 .<<. tagShift .|. fromIntegral q
  TBranch q1 q2 -> 0b00 .<<. tagShift .|. fromIntegral q1 .|. fromIntegral q2 .<<. 32
  TCh w q       -> 0b11 .<<. tagShift .|. fromIntegral w  .|. fromIntegral q .<<. 8

fromWord64 :: Integral q => Word64 -> Trans q
fromWord64 w = case w .>>. tagShift of
  0b01 -> TEps $ fromIntegral w
  0b00 -> TBranch (fromIntegral w) (fromIntegral $ w .>>. 32)
  _    -> TCh (fromIntegral w) (fromIntegral $ w .>>. 8)

With these two functions, we can declare support for unboxed vectors of Trans q:

import Data.Vector.Generic qualified as VG
import Data.Vector.Generic.Mutable qualified as VG
import Data.Vector.Unboxed qualified as VU

instance Integral q => VU.IsoUnbox (Trans q) Word64 where
  toURepr = toWord64
  fromURepr = fromWord64

newtype instance VU.MVector s (Trans q) = MV_Trans (VU.MVector s Word64)
newtype instance VU.Vector    (Trans q) = V_Trans  (VU.Vector    Word64)

deriving via (Trans q `VU.As` Word64) instance Integral q => VG.MVector VU.MVector (Trans q)
deriving via (Trans q `VU.As` Word64) instance Integral q => VG.Vector  VU.Vector  (Trans q)

and then just replace V-qualified names with VU as needed. I’m also replacing V.! with VU.unsafeIndex, since it starts to show some effect on the performance now.

String      MUT time, s   GC time, s   Total time, s   % of boxed Vector   % of tuned baseline total
aa...aa     0.134         0.000        0.136           49%                 20%
aa...aaz    0.104         0.000        0.105           52%                 20%
ab...ab     0.113         0.000        0.115           46%                 22%
ab...abz    0.102         0.000        0.104           50%                 24%

Wow! That’s about 2x improvement in performance! This ugly bit-mingling code is definitely worth it.
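Since bit-level encodings are easy to get wrong, a QuickCheck round-trip property makes for a cheap sanity check. Here's a sketch; the generator and its 2^28 bound on state ids are my assumptions, picked to stay within the bits the packing scheme leaves for each state:

import Test.QuickCheck

-- Only generate states small enough to survive the packing above.
genState :: Gen Word32
genState = choose (0, 2 ^ (28 :: Int) - 1)

genTrans :: Gen (Trans Word32)
genTrans = oneof
  [ TEps <$> genState
  , TBranch <$> genState <*> genState
  , TCh <$> arbitrary <*> genState
  ]

-- Decoding an encoded transition should give back the original.
prop_roundTrip :: Property
prop_roundTrip = forAll genTrans $ \t -> fromWord64 (toWord64 t) === t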

Not all encodings are created equal.

My initial implementation of toWord64/fromWord64 was more like

toWord64 :: Integral q => Trans q -> Word64
toWord64 = \case
  TEps q        -> 0b00 .|. (fromIntegral q .<<. 2)
  TBranch q1 q2 -> 0b01 .|. (fromIntegral q1 .<<. 18) .|. (fromIntegral q2 .<<. 2)
  TCh w q       -> 0b10 .|. (fromIntegral w .<<. 2) .|. (fromIntegral q .<<. 18)

fromWord64 :: Integral q => Word64 -> Trans q
fromWord64 w = case w .&. 0b11 of
  0b00 -> TEps $ fromIntegral $ w .>>. 2
  0b01 -> TBranch (fromIntegral $ w .>>. 18) (fromIntegral $ (w .>>. 2) .&. 0xffff)
  _    -> TCh (fromIntegral $ w .>>. 2) (fromIntegral $ w .>>. 18)

This fromWord64 does considerably more operations than the final version above, and it has a measurable effect: this version results in 8-10 ms added to each benchmark case.

Unboxed arrays again

But, again, vectors sometimes do have some overhead. How about an unboxed array? We have to use fromWord64 and toWord64 explicitly when looking up transitions:

type TransMap q = A.UArray Int Word64

getTrans :: StateId q => q -> TransMap q -> Trans q
getTrans q m = fromWord64 $ m `A.unsafeAt` fromIntegral q

and when creating the transitions table:

nfa :: NFA Word32
nfa = NFA{..}
  where
  ...
  transitions = A.listArray (0, 12) $ toWord64 @Word32 <$> [...]

Otherwise, the code is mostly the same.

But it’s worth it, giving a further 10-15% improvement over unboxed Vectors:

String      MUT time, s   GC time, s   Total time, s   % of unboxed Vector   % of tuned baseline total
aa...aa     0.119         0.000        0.120           88%                   18%
aa...aaz    0.091         0.000        0.093           89%                   18%
ab...ab     0.095         0.000        0.096           83%                   18%
ab...abz    0.088         0.000        0.090           87%                   20%

I’d say this is extremely good!

But what about C++?

Here’s how the C++ version performs if we similarly replace the std::unordered_map with the std::vector.

If we try the push_back()/pop_back()/back() stack version:

String      g++ -O2, s   g++ -O3, s   clang -O2, s   clang -O3, s
aa...aa     0.141        0.132        0.338          0.221
aa...aaz    0.104        0.105        0.207          0.138
ab...ab     0.111        0.108        0.317          0.224
ab...abz    0.096        0.097        0.261          0.196

If we keep the explicit stack position, clang does significantly better at the cost of gcc performing slightly worse:

String      g++ -O2, s   g++ -O3, s   clang -O2, s   clang -O3, s
aa...aa     0.152        0.152        0.154          0.150
aa...aaz    0.123        0.120        0.128          0.128
ab...ab     0.142        0.141        0.147          0.140
ab...abz    0.125        0.129        0.133          0.136

As a side quest, let’s also see if my intuition about the initialization overhead is correct. We replace the stack-as-an-std::vector with a malloc’ed array:

		using StackElem = std::pair<Q, size_t>;
		auto stack = static_cast<StackElem*>(malloc(24'000'000 * sizeof(StackElem)));
        // the rest as before

Here, we use malloc because anything relying on new will still likely initialize the elements.

For instance, gcc compiles (even with -O3 -march=native)

auto foo()
{
    auto stack = std::make_unique<std::pair<uint32_t, size_t>[]>(24'000'000);
    stack[0] = { 3, 4 };
    return stack[0];
}

to

foo():
        push    rbx
        mov     edi, 384000000
        call    operator new[](unsigned long)
        mov     rdi, rax
        mov     rdx, rax
        lea     rcx, [rax+384000000]
.L2:
        mov     DWORD PTR [rdx], 0
        add     rdx, 16
        mov     QWORD PTR [rdx-8], 0
        cmp     rcx, rdx
        jne     .L2
        mov     DWORD PTR [rdi], 3
        mov     QWORD PTR [rdi+8], 4
        mov     rbx, QWORD PTR [rdi]
        call    operator delete[](void*)
        mov     edx, 4
        mov     eax, ebx
        pop     rbx
        ret

Note the fairly silly unvectorized loop setting the whole of stack to 0, only to destroy it right away.

For comparison, clang does a good job doing constant propagation and elision, and compiles this to

foo():
        mov     edx, 4
        mov     eax, 3
        ret

If we run this version, we’ll indeed get better numbers:

String      g++ -O2, s   g++ -O3, s   clang -O2, s   clang -O3, s
aa...aa     0.134        0.133        0.144          0.145
aa...aaz    0.107        0.107        0.115          0.114
ab...ab     0.109        0.108        0.126          0.125
ab...abz    0.098        0.097        0.113          0.113

So, this version gives more or less the best numbers for both gcc and clang at the cost of an explicit malloc. This malloc is definitely UB in C++ ≤17 since we’re reading and writing objects whose lifetime hasn’t started, and it is potentially UB in C++ ≥20, but, frankly, I’m not exactly in the mood for C++ standard reading right now. std::start_lifetime_as_array could’ve helped here, but it’s a C++23 feature, and neither my gcc nor my clang have that yet.

Perhaps unsurprisingly, making a similar change of packing Trans into a uint64_t (the C++ counterpart of our Word64) has a negative effect in the case of C++. Indeed, the std::variant is already stored effectively unboxed, so we just waste time fiddling with bits.

Anyway, back to our main cast: the unboxed Array-based version is faster than any of the C++ versions, while the unboxed Vector-based version is slower than gcc with push_back() but faster than the rest.

Summing it up

First, let’s look at the graph with all of the code versions together (except the HashMap intuition-verifying one):

Then, despite its length, this is not a real optimization post: I haven’t used the profiler (and for any real performance optimization, you probably should do some profiling), my benchmarking input set is barely representative, and so on.

Yet, I think we have some takeaways:

Finally, it’s once again possible to write code that outperforms C++ without too much effort at the cost of just slightly less idiomatic Haskell. And I’m not even mentioning the productivity improvements (I’d hate to write the NFA parser in C++, for example), easier testing (think QuickCheck), and so on.

And if you have any comments or feedback, I’d be happy to hear them!

Miscellanea

An inquisitive reader might also ask a different question: why do we care at all about the performance of this naive implementation, which has exponential backtracking problems?

One reason is, of course, to see how different decisions and approaches affect performance. It’s always good to experiment with a semi-toy example that is close enough to what you might encounter in the real world, but still manageable in code size and interpretability. This helps develop intuition and choose the right approaches straight away for other, more complex problems.

A more practical reason is that a naive algorithm can still be included in a production-grade library. Indeed, on simple regexps and inputs, the memoization bookkeeping will always make the matcher slower. So, if a user of the library controls their inputs, they might just use this simpler one. Alternatively, we could expose a function that races the naive and the memoizing ones. The naive one finishes first on simple regexps or non-DoS-ing inputs, so we get its smaller latency. However, the memoizing one finishes first on a bad input, and we still avoid going exponential. The cost is 2x CPU usage, but I can see how it’s a reasonable trade-off in some use cases.
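A minimal sketch of what such racing could look like, using the async package (the two matcher arguments stand for the naive and the memoizing implementations; the latter isn't part of this post):

import Control.Concurrent.Async (race)
import Control.Exception (evaluate)

-- Run two matchers concurrently on the same input and return whichever
-- result arrives first; race cancels the slower one.
raceMatch
  :: (NFA q -> BS.ByteString -> MatchResult Int)  -- naive matcher
  -> (NFA q -> BS.ByteString -> MatchResult Int)  -- memoizing matcher
  -> NFA q -> BS.ByteString -> IO (MatchResult Int)
raceMatch naive memo nfa bs =
  either id id <$> race (evaluate (naive nfa bs))
                        (evaluate (memo nfa bs))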

And I also didn’t elaborate on my benchmarking setup, and it’s about time. So, my machine is a stock Ryzen 7950X3D. For Haskell, I’ve been using GHC 9.8.2 with the NCG backend, while -fllvm was generally the same or worse. For C++, I’ve been using gcc 13.1 and clang 18.1. I’ve also tried gcc 14, but it seems like it consistently produces slightly worse results for the optimized versions of the C++ code towards the end of the post. -march=native has either no effect or worsens the numbers for C++, so I have not enabled it.

And that’s all, folks!
