Let's run some NFAs
Lately, I’ve been playing around with memoized NFAs for optimized regular expression matching, with features like lookahead and atomic groups, based on this paper. The original authors have their code in Scala, and I thought it’d be fun to code something in Haskell to see how it stacks up against their new implementation and the prior art.
But before diving into memoization and the more complex features, let’s start with the basics. In this post, we’ll focus on a simple, naive backtracking NFA implementation. We’ll start with the simplest, regexp 101 code and then make it significantly faster, step by step. We’ll also inevitably face some dead ends — that’s part of learning and experimentation, too!
To ground our work in reality, I’ll also implement some of the algorithms in C++, praised for its performance advantages over pretty much everything else. Is the praise deserved here? Let’s find out.
Base baseline
NFAs
Our NFA states come from a type q
that fulfills some constraints:
type StateId q = (Integral q, Enum q)
Typically, we’ll have q ~ Word32
, but it’s nice to have it abstracted away:
it’s easier to change the type this way, and we also won’t mistake q
for anything else (like string length or position) even if it’s instantiated to the same type.
Yay type safety!
An NFA consists of transitions between those states,
a particular initial state, and a set of final states.
A transition can be an ε-transition, a branch, or a character byte match:
data Trans q
= TEps q
| TBranch q q
| TCh Word8 q
deriving (Eq, Show)
The NFAs I’m working with have only one final state by construction, so the type describing an NFA itself looks like this:
import Data.EnumMap.Strict qualified as EM
type TransMap q = EM.EnumMap q (Trans q)
data NFA q = NFA
{ transitions :: TransMap q
, initState :: q
, finState :: q
}
Here, EnumMap
is a wrapper around IntMap
that supports arbitrary Enum
types
and comes from the enummapset
package.
My previous experience shows that for types with cheap toEnum
/fromEnum
(like our Word32
)
it’s much more efficient than the usual HashMap
from unordered-containers
,
so we’ll go with it straight away.
As usual, I’ll also have {-# LANGUAGE Strict #-}
from the get-go —
it’s almost always beneficial for problems like this.
Matching
In my experiments, the result of matching an NFA needs to keep the index where it’s matched,
so we’ll define a Maybe
-like type for this:
data MatchResult a = SuccessAt a | Failure
deriving (Eq, Ord, Show, Functor)
instance Applicative MatchResult where
SuccessAt f <*> SuccessAt v = SuccessAt $ f v
_ <*> _ = Failure
pure = SuccessAt
instance Alternative MatchResult where
SuccessAt a <|> ~_ = SuccessAt a
_ <|> ~r = r
empty = Failure
Note the explicit laziness annotation in <|>
’s second argument.
Since I’ve enabled strictness by default,
omitting it results in always computing both branches,
which is definitely not what we want.
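As a toy illustration (mine, not from the benchmark code): thanks to the lazy pattern, the right-hand side of <|> is never touched once the left one succeeds.

import Control.Applicative (Alternative (..))

-- With the ~_ pattern, the right branch is never evaluated:
shortCircuits :: MatchResult Int
shortCircuits = SuccessAt 42 <|> error "right branch forced"
-- Under {-# LANGUAGE Strict #-}, a bare _ pattern in (<|>) would force the
-- second argument, and evaluating shortCircuits would hit the error instead.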
Now, it’s time for the first NFA matching implementation!
getTrans :: StateId q => q -> TransMap q -> Trans q
getTrans q m = case q `EM.lookup` m of
Just t -> t
Nothing -> error "invariant failure"
-- ^ alas, we could've proven statically this never happens in a stronger language
match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0
where
go q i | q == finState = SuccessAt i
go q i = case q `getTrans` transitions of
TEps q' -> go q' i
TBranch q1 q2 -> go q1 i <|> go q2 i
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go q' (i + 1)
| otherwise -> Failure
This is most likely what you’d expect from a basic backtracking implementation:
- If we’re in the final state, we’re done and return the match.
- Otherwise, look up the transition from our current state:
  - If it’s an ε-transition, just try it out.
  - If it’s branching into states q1 and q2, try the first one first. If it fails, try the second one.
  - If it’s a byte matcher, check that our current string position is within the string bounds and that the byte at this position is what we expect (indexMaybe does both, safely). If it’s all good, go to the next state and increment the index. Otherwise, fail.
Benchmarking
We won’t be doing any NFA parsing here,
and for the purposes of this post we’ll use a fairly simple regexp (aa|ab)*z
.
Here’s the definition.
nfa :: NFA Word32
nfa = NFA{..}
where
initState = 0
finState = 13
transitions = EM.fromList
[ (0, TBranch 2 1)
, (1, TEps 12)
, (2, TBranch 4 8)
, (3, TEps 0)
, (4, TCh a 5)
, (5, TEps 6)
, (6, TCh a 7)
, (7, TEps 3)
, (8, TCh a 9)
, (9, TEps 10)
, (10, TCh b 11)
, (11, TEps 3)
, (12, TCh z 13)
]
a = 97
b = 98
z = 122
We have a few test strings,
which we keep in separate files on a tmpfs
partition.
In each of these, the aa
/ab
pair is repeated 10 million times,
so the overall length is 20 (decimal) megabytes (plus one byte if it ends with z
):
- aa...aa: Intuitively, this is the worst-case scenario. We’re taking all the first branches just to find there’s no final z. So, we backtrack on the final aa and try ab instead, but it also fails. We backtrack on the previous aa and try ab again, but it fails too. Rinse and repeat 10 million times.
- aa...aaz: Now, this is the best-case scenario: always taking the first branch leads us to accept the string, and we never need to backtrack.
- ab...ab: This is a somewhat mixed case. We try the first branch aa on the first two characters ab, but it immediately fails, and we backtrack to ab, which succeeds. Again, rinse and repeat 10 million times. All in all, we do the same number of backtrackings as in the first case of aa...aa, but we might expect it to be faster due to locality effects.
- ab...abz: This is a similar case, except we eventually find our success.
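For reference, the inputs can be generated with something like the following sketch (the file names here are mine, not necessarily the ones I used):

import Data.ByteString.Char8 qualified as BS8

-- Writes the four test inputs: 10 million repetitions of aa or ab,
-- optionally followed by a single z.
writeInputs :: IO ()
writeInputs = do
  let n = 10_000_000
      rep = BS8.concat . replicate n . BS8.pack
  BS8.writeFile "aa.txt"  (rep "aa")
  BS8.writeFile "aaz.txt" (rep "aa" <> BS8.pack "z")
  BS8.writeFile "ab.txt"  (rep "ab")
  BS8.writeFile "abz.txt" (rep "ab" <> BS8.pack "z")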
We mmap
these files
to avoid any unnecessary costs of copying the data around,
so our main
looks like this:
main :: IO ()
main = do
path <- getArgs <&> \case [path] -> path
_ -> error "wrong usage"
str <- mmapFileByteString path Nothing
print $ match nfa str
We compile this with -O2 and run it on each input, recording what +RTS -s shows:
String | MUT time, s | GC time, s | Total time, s |
---|---|---|---|
aa...aa | 0.63 | 0.31 | 0.95 |
aa...aaz | 0.46 | 0.32 | 0.78 |
ab...ab | 0.54 | 0.16 | 0.70 |
ab...abz | 0.48 | 0.16 | 0.63 |
This is interesting! Our expected best case is indeed the best, and the worst case is indeed the worst, but only if we look at the MUT time (the time actually spent in our program’s own computation, which is what we care about). For the first two strings, the GC time is precisely twice as much as for the other two, and having any GC at all might be unexpected. Indeed, what is there to collect if we’re merely traversing the input string and don’t allocate anything?
Except we do: stack space and thunks.
All those recursive go
invocations aren’t free,
and we might hypothesize that GHC can’t tail-call-optimize them away
due to the <|>
expression in the TBranch
case,
which is not tail-recursive and requires allocating a thunk for each delayed go
invocation.
Indeed, how would you TCO that?
One way to test this hypothesis is to drop the second alternative in the TBranch case, which keeps the matcher correct for inputs that never need backtracking, and run it on the aa...aaz string.
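Here’s what that throwaway experiment looks like (a sketch; matchFirstBranch is my name for it, and it’s obviously not a correct matcher in general):

-- Hypothesis check only: the second alternative is ignored, so every call to
-- go is a tail call. Wrong for backtracking inputs, fine for aa...aaz.
matchFirstBranch :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
matchFirstBranch NFA{..} bs = go initState 0
  where
    go q i | q == finState = SuccessAt i
    go q i = case q `getTrans` transitions of
      TEps q' -> go q' i
      TBranch q1 _ -> go q1 i
      TCh ch q'
        | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1)
        | otherwise -> Failure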
If we do that, our benchmark runs in 0.28s MUT time and zero GC (as opposed to 0.46s MUT and 0.32s GC).
Now, GHC is apparently able to TCO it away (and no thunks are created),
so there is nothing to GC (as expected),
and the MUT time is faster since the recursive go
call is cheaper (as perhaps not so expected).
But what about C++?
Ok, it’s time for some baseline C++ code.
I’m doing a fairly straightforward port:
- each transition kind is represented by a separate struct,
- the transition type itself is an std::variant of those transition kinds,
- the transition map is an std::unordered_map,
- and our pattern-matching happens with the usual overloaded trick.
using Q = uint32_t;
struct TEps { Q q; };
struct TBranch { Q q1; Q q2; };
struct TCh { char ch; Q q; };
using Trans = std::variant<TEps, TBranch, TCh>;
struct NFA
{
std::unordered_map<Q, Trans> transitions;
Q initState;
Q finState;
};
template<typename... Ts>
struct overloaded : Ts... { using Ts::operator()...; };
struct Match
{
NFA nfa;
std::string_view string;
const size_t size = string.size();
std::optional<int> GoRec()
{
return GoRec(nfa.initState, 0);
}
std::optional<int> GoRec(Q state, size_t idx)
{
if (state == nfa.finState)
return idx;
return std::visit(overloaded {
[=, this](TEps eps) { return GoRec(eps.q, idx); },
[=, this](TBranch b)
{
if (const auto res = GoRec(b.q1, idx))
return res;
return GoRec(b.q2, idx);
},
[=, this](TCh ch)
{
if (idx < size && string[idx] == ch.ch)
return GoRec(ch.q, idx + 1);
else
return std::optional<int> {};
},
}, nfa.transitions[state]);
}
};
The rest of the boilerplate.
std::string_view GetFile(const char *path)
{
int fd = open(path, O_RDONLY);
if (fd == -1) {
perror("Error opening file");
return {};
}
struct stat sb;
if (fstat(fd, &sb) == -1) {
perror("Error getting the file size");
close(fd);
return {};
}
size_t filesize = sb.st_size;
auto mapped = mmap(nullptr, filesize, PROT_READ, MAP_PRIVATE, fd, 0);
if (mapped == MAP_FAILED) {
perror("Error mapping the file");
close(fd);
return {};
}
close(fd);
return { static_cast<const char*>(mapped), filesize };
}
int main(int, char **argv)
{
NFA nfa
{
{
{ 0, TBranch { 2, 1 } },
{ 1, TEps { 12 } },
{ 2, TBranch { 4, 8 } },
{ 3, TEps { 0 } },
{ 4, TCh { 'a', 5 } },
{ 5, TEps { 6 } },
{ 6, TCh { 'a', 7 } },
{ 7, TEps { 3 } },
{ 8, TCh { 'a', 9 } },
{ 9, TEps { 10 } },
{ 10, TCh { 'b', 11 } },
{ 11, TEps { 3 } },
{ 12, TCh { 'z', 13 } },
},
0,
13,
};
auto str = GetFile(argv[1]);
if (const auto res = Match { nfa, str }.GoRec())
std::cout << *res << '\n';
else
std::cout << "failure\n";
}
Let’s compile and run it… and get a segfault.
Of course, this is stack-heavy, and neither gcc nor clang TCO that away,
so we get a stack overflow.
Luckily, both compilers have the -fsplit-stack
option,
effectively transforming the stack into a somewhat std::deque
-like structure,
growing as needed.
Sure, it has some performance hit,
but since all our operations on the stack are more or less local,
the CPU cache should hide most of that latency away.
Moreover, I haven’t even started low-level optimizations on the Haskell code,
so I’m not doing that for C++!
Anyway, here are the results (again, best of ten runs):
String | g++ -O2, s | g++ -O3, s | clang -O2, s | clang -O3, s |
---|---|---|---|---|
aa...aa | 6.8 | 9.3 | 4.7 | 2.7 |
aa...aaz | 6.5 | 8.9 | 4.7 | 2.6 |
ab...ab | 7.0 | 9.6 | 4.8 | 2.7 |
ab...abz | 6.8 | 9.7 | 4.5 | 2.8 |
This is ridiculously slow! gcc is about an order of magnitude slower, and clang is 3-4 times slower than Haskell. I’m not feeling that bad now about Haskell and GC and no TCO and whatnot!
The battle of the stack
Nevertheless, let’s see how we can help the compiler and the run time system.
RTS options
The most straightforward attack direction is playing around with the RTS options controlling the GC behavior. In particular, it is worth fiddling with:
- -A — the allocation area size, with a bigger one meaning that fewer GCs happen at the cost of worse cache locality.
- -ki — the initial thread stack size, with a bigger size requiring more (virtual) memory but also delaying the first stack overflow and subsequent stack growth.
- -kc — the stack chunk size, adjusting how much the stack grows on stack overflow, with the cons similar to the previous two.
- -kb — the stack chunk buffer size, which sets how much of the data from the previous stack chunk gets copied to the new stack chunk to avoid crossing the stack chunk boundary too often (which is detrimental for cache locality and, worse, causes the GC of one of the stack chunks).
- -w — a new-ish GC algorithm for the oldest generation showing better performance in quite a few workloads but whose description is outside the scope of this post.
Since the stack operations here are inherently spatially and temporally close, the stack chunk locality shouldn’t matter much, and we can expect bigger numbers to give better performance.
Indeed, on my machine, the best options are something like -A128m -w -ki64m -kc8m
,
with -kb
either playing no significant role or making matters worse (yay reasonable defaults!).
But both -ki
and -kc
do have quite a positive impact, especially on GC times,
proving once again that it’s all about the stack and that we’re on the right track:
String | MUT time, s | GC time, s | Total time, s | % of untuned baseline total |
---|---|---|---|---|
aa...aa | 0.51 | 0.16 | 0.67 | 71% |
aa...aaz | 0.36 | 0.16 | 0.53 | 68% |
ab...ab | 0.48 | 0.04 | 0.52 | 74% |
ab...abz | 0.40 | 0.04 | 0.44 | 70% |
Since we’ve got this improvement so cheaply, I’ll call this the tuned baseline from now on, and I’ll also be running the subsequent benchmarks with these RTS options.
Of course, this is not a good solution — libraries can’t (and aren’t supposed to) force these options on the consuming executable, and even if they could, it feels wrong to set these fairly memory-hungry settings just for the sake of some corner cases. So, let’s keep digging.
Explicit continuations
The next most straightforward approach is to explicitly pass the continuation:
match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 Failure
where
go q i ~cont
| q == finState = SuccessAt i
| otherwise = case q `getTrans` transitions of
TEps q' -> go q' i cont
TBranch q1 q2 -> go q1 i (go q2 i cont)
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) cont
| otherwise -> cont
Let’s run this:
String | MUT time, s | GC time, s | Total time, s | % of tuned baseline total |
---|---|---|---|---|
aa...aa | 0.70 | 0.62 | 1.32 | 197% |
aa...aaz | 0.27 | 0.60 | 0.87 | 164% |
ab...ab | 0.63 | 0.31 | 0.94 | 181% |
ab...abz | 0.46 | 0.31 | 0.77 | 175% |
Nope, nope, nope, nope. This is terrible.
The problem is that we’re creating a lazy closure on every recursive call, while we only really need to do it when we branch. We could make the continuation parameter strict and explicitly wrap it in something lazy where needed:
data Lazy a = Lazy { unLazy :: ~a }
match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 (Lazy Failure)
where
go q i cont
| q == finState = SuccessAt i
| otherwise = case q `getTrans` transitions of
TEps q' -> go q' i cont
TBranch q1 q2 -> go q1 i (Lazy $ go q2 i cont)
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) cont
| otherwise -> unLazy cont
but it only shaves off a couple of dozen ms, so it’s not worth it, and it’s probably not even worth putting down a table here.
All in all, our attempt to outsmart the compiler: failed.
Explicit stack
But all is not lost! We can just have an explicit stack with the states to explore instead. How do we represent that?
Vectors
We might have a Data.Vector
like this:
match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 mempty
where
go q i stack
| q == finState = SuccessAt i
| otherwise = case q `getTrans` transitions of
TEps q' -> go q' i stack
TBranch q1 q2 -> go q1 i (stack `V.snoc` (q2, i))
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) stack
| Just (stack'', (q'', i'')) <- V.unsnoc stack -> go q'' i'' stack''
| otherwise -> Failure
This requires a kind of inversion of control: we’re popping a value off the stack and trying it out when we would otherwise fail, and that happens in the byte matcher. Of course, we could refactor it into a separate function, but I’d like to keep these matchers similar, even if at the cost of the readability of each individual matcher.
Anyway, this is a bad idea since all these snoc
s and unsnoc
s
require the stack to be copied because of immutability.
Moreover, we can’t hope GHC fuses these operations away
as the control flow depends crucially on the inputs which are only known at run time.
And, in fact, if you try this out, you’d have to wait a long time before the benchmark finishes.
I, for one, just ^C
’ed it — no point in getting the exact numbers if it’s more than a few seconds.
Lists
Or, we might have our good old friend from when we were just learning Haskell — the list. The only changes compared to the vector are for the branching and for the failure handling, replacing V.snoc with : and V.unsnoc with pattern matching on the list. That is, the code looks like this:
match :: StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = go initState 0 mempty
where
go q i stack
| q == finState = SuccessAt i
| otherwise = case q `getTrans` transitions of
TEps q' -> go q' i stack
TBranch q1 q2 -> go q1 i ((q2, i) : stack)
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) stack
| ((q'', i'') : stack'') <- stack -> go q'' i'' stack''
| otherwise -> Failure
String | MUT time, s | GC time, s | Total time, s | % of tuned baseline total |
---|---|---|---|---|
aa...aa | 0.54 | 1.06 | 1.61 | 240% |
aa...aaz | 0.29 | 1.05 | 1.34 | 253% |
ab...ab | 0.50 | 0.48 | 0.97 | 187% |
ab...abz | 0.38 | 0.48 | 0.86 | 195% |
This is still a no-go, but at least I can wait until the benchmark finishes.
Linear vectors
Looks like immutability really hurts here.
Typically, our next attempt would be to have local mutability in ST
,
but in recent GHC, we can try something else: linear types and, in particular, linear vectors from the
linear-base
package.
No need for ST
monads, and the code is a more or less straightforward
port of the previous version, except to make the linearity checker happy, we need to:
- consume the vector when it’s no longer needed,
- avoid pattern matching on stack in pattern guards, using case instead,
- add an explicit type annotation for go since GHC doesn’t infer linearity.
The code now looks like this:
import Data.Unrestricted.Linear qualified as L
import Data.Vector.Mutable.Linear qualified as VL
import Prelude.Linear qualified as L
match :: forall q. StateId q => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = L.unur L.$ VL.empty L.$ go initState 0
where
go :: q -> Int -> VL.Vector (q, Int) %1-> L.Ur (MatchResult Int)
go q i stack
| q == finState = stack `L.lseq` L.Ur (SuccessAt i)
| otherwise = case q `getTrans` transitions of
TEps q' -> go q' i stack
TBranch q1 q2 -> go q1 i L.$ (q2, i) `VL.push` stack
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) stack
| otherwise -> case VL.pop stack of
(L.Ur top, stack'')
| (Just (q'', i'')) <- top -> go q'' i'' stack''
| otherwise -> stack'' `L.lseq` L.Ur Failure
Is the result worth making the linearity checker happy?
String | MUT time, s | GC time, s | Total time, s | % of tuned baseline total |
---|---|---|---|---|
aa...aa | 0.58 | 0.77 | 1.35 | 201% |
aa...aaz | 0.36 | 0.78 | 1.14 | 215% |
ab...ab | 0.58 | 0.46 | 1.04 | 200% |
ab...abz | 0.48 | 0.45 | 0.93 | 211% |
This is still worse than the GHC/RTS-managed stack, which is sad.
One might expect improved performance due to cache friendliness, even if we don’t do any random accesses, since vectors are, by definition, a contiguous data structure. Except not really: these are boxed vectors, so it’s effectively an array of pointers, canceling out most of the gains.
One might expect better performance with unboxed vectors,
but alas, linear-base
does not have them (yet).
Good ol’ ST
Alright, time for our good old performance savior, the mutable vectors living in the ST
monad.
Here, I’ll be doing a terrible thing:
preallocating the whole vector with a hardcoded capacity.
I know it should be enough for these tests,
and it allows comparing the performance vs. Data.Array
(which doesn’t grow as easily) much more directly.
Here’s the ugly ST
thing:
import Data.Vector.Unboxed.Mutable qualified as VM
match :: (VM.Unbox q, StateId q) => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = runST $ do
stack <- VM.unsafeNew 24_000_000
let go s q i
| q == finState = pure $ SuccessAt i
| otherwise = case q `getTrans` transitions of
TEps q' -> go s q' i
TBranch q1 q2 -> do VM.unsafeWrite stack s (q2, i)
go (s + 1) q1 i
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go s q' (i + 1)
| s == 0 -> pure Failure
| otherwise -> do (q'', i'') <- VM.unsafeRead stack (s - 1)
go (s - 1) q'' i''
go 0 initState 0
And now we’re talking!
String | MUT time, s | GC time, s | Total time, s | % of tuned baseline total |
---|---|---|---|---|
aa...aa | 0.44 | 0.000 | 0.44 | 66% |
aa...aaz | 0.27 | 0.000 | 0.27 | 51% |
ab...ab | 0.46 | 0.000 | 0.46 | 88% |
ab...abz | 0.38 | 0.000 | 0.38 | 86% |
ST
is definitely worth it!
GC time is now precisely zero, and MUT times also got slightly faster.
This also means that GHC managed to tail-call optimize go even though it’s technically in a monad, where the recursive call is (at least syntactically) not the last operation: >>= is.
Out of curiosity and to see if linear unboxed vectors could be worth it,
how would ST
-based mutable boxed vectors perform?
I’m not showing all numbers here (they’re in the final graph, though),
but if we replace the import with Data.Vector.Mutable
and remove the VM.Unbox q
constraint, we’ll get numbers quite similar to the linear vectors case,
suggesting that boxing is indeed to blame.
I’ve also tried replacing the unboxed Vector with an STUArray, but that gives me the same results while having uglier code:
match :: (StateId q, (forall s. A.MArray (A.STUArray s) q (ST s))) => NFA q -> BS.ByteString -> MatchResult Int
match NFA{..} bs = runST $ do
qstack <- mkArr 24_000_000
istack <- mkArr 24_000_000
let go s q i
| q == finState = pure $ SuccessAt i
| otherwise = case q `getTrans` transitions of
TEps q' -> go s q' i
TBranch q1 q2 -> do A.unsafeWrite qstack s q2
A.unsafeWrite istack s i
go (s + 1) q1 i
TCh ch q'
| bs `BS.indexMaybe` i == Just ch -> go s q' (i + 1)
| s == 0 -> pure Failure
| otherwise -> do q'' <- A.unsafeRead qstack (s - 1)
i'' <- A.unsafeRead istack (s - 1)
go (s - 1) q'' i''
go 0 initState 0
where
mkArr :: A.MArray (A.STUArray s) e (ST s) => Int -> ST s (A.STUArray s Int e)
mkArr len = A.unsafeNewArray_ (0, len)
Data.Array
doesn’t do the Array of Structs → Struct of Arrays
transformation on tuples of unboxable types,
so we have to do that ourselves: hence, we have two separate arrays.
Since the numbers are the same within measurement errors (so I’m not showing them),
it looks like Data.Vector
compiles to quite optimal code in this particular case.
Note the mkArr helper, whose only purpose is to guide type inference. Programming with Array directly is meh.
Looks like we’re out of easily reachable options, so let’s return to our favourite language.
But what about C++?
If we write more or less similar code in C++,
struct Match
{
...
std::vector<std::pair<Q, size_t>> stack;
std::optional<int> GoStack()
{
stack.reserve(24'000'000);
return GoStack(nfa.initState, 0);
}
std::optional<int> GoStack(Q state, size_t idx)
{
if (state == nfa.finState)
return idx;
return std::visit(overloaded {
[=, this](TEps eps) { return GoStack(eps.q, idx); },
[=, this](TBranch b)
{
stack.push_back({ b.q2, idx });
return GoStack(b.q1, idx);
},
[=, this](TCh ch)
{
if (idx < size && string[idx] == ch.ch)
return GoStack(ch.q, idx + 1);
else if (stack.empty())
return std::optional<int> {};
else
{
auto [q, i] = stack.back();
stack.pop_back();
return GoStack(q, i);
}
},
}, nfa.transitions[state]);
}
};
it also segfaults with both gcc and clang if -fsplit-stack
is omitted,
suggesting neither compiler does TCO in this case, which is quite unfortunate.
With -fsplit-stack
, though, the results are even worse than before:
String | g++ -O2, s | g++ -O3, s | clang -O2, s | clang -O3, s |
---|---|---|---|---|
aa...aa | 21 | 19 | 9.3 | 5.3 |
aa...aaz | 12 | 11 | 5.6 | 3.2 |
ab...ab | 20 | 19 | 9.2 | 5.2 |
ab...abz | 16 | 16 | 7.9 | 4.4 |
So, let’s help the compilers by manually converting that to a loop:
std::optional<int> GoLoop()
{
stack.reserve(24'000'000);
Q q = nfa.initState;
size_t i = 0;
bool keepGoing = true;
while (keepGoing)
{
if (q == nfa.finState)
return i;
std::visit(overloaded {
[&](TEps eps) { q = eps.q; },
[&](TBranch b)
{
stack.push_back({ b.q2, i });
q = b.q1;
},
[&](TCh ch)
{
if (i < size && string[i] == ch.ch)
{
++i;
q = ch.q;
}
else if (stack.empty())
keepGoing = false;
else
{
std::tie(q, i) = stack.back();
stack.pop_back();
}
},
}, nfa.transitions[q]);
}
return {};
}
Indeed, this doesn’t segfault without -fsplit-stack
,
and it performs much better, like 10-50 times better:
String | g++ -O2, s | g++ -O3, s | clang -O2, s | clang -O3, s |
---|---|---|---|---|
aa...aa | 0.46 | 0.46 | 0.49 | 0.48 |
aa...aaz | 0.35 | 0.34 | 0.39 | 0.34 |
ab...ab | 0.42 | 0.43 | 0.50 | 0.43 |
ab...abz | 0.36 | 0.37 | 0.42 | 0.36 |
This is on par with Haskell, with the first two cases being marginally slower and the other two slightly faster. Not bad!
An attentive reader might object that
the Haskell version doesn’t do anything analogous to push_back()
/pop_back()
.
It just allocates an uninitialized chunk of memory and proceeds with it,
while push_back
must check if there’s enough memory and potentially grow the vector
(even if it never happens on our test data).
Sure, since it never happens during execution,
the CPU’s branch predictor learns that quickly,
but it’s still some extra code that might also prevent further optimizations.
Moreover, getting the last element happens via a pair of back()
/pop_back()
calls.
Looks like the compilers optimize that quite nicely,
but that’s nevertheless a valid objection.
Let’s also try a version where we keep the “pointer” to the top of the stack, just like in Haskell. That is:
- we add a variable s (which has the same intent as our Haskell’s s),
- we replace stack.reserve() with stack.resize(),
- pushing to the stack via stack.push_back({ b.q2, i }) is replaced with stack[s++] = { b.q2, i },
- checking if the stack is empty is now done via if (!s) instead of if (stack.empty()),
- getting the last element of the stack is now done via std::tie(q, i) = stack[--s]; instead of the stack.back()/stack.pop_back() pair.
The results are curious:
String | g++ -O2, s | g++ -O3, s | clang -O2, s | clang -O3, s |
---|---|---|---|---|
aa...aa | 0.52 | 0.48 | 0.56 | 0.54 |
aa...aaz | 0.37 | 0.36 | 0.23 | 0.23 |
ab...ab | 0.45 | 0.40 | 0.49 | 0.51 |
ab...abz | 0.39 | 0.38 | 0.25 | 0.26 |
Some cases got even better, others got worse, which is somewhat unexpected.
My intuition is that resize()
causes eager initialization of all its elements,
which isn’t necessary if the whole capacity of our stack
isn’t used,
but we’ll check that later.
Overall, it’s fair to say we won the battle for the stack.
Transition tables
Now let’s examine the other part of the problem: our transition tables.
The EnumMap
default
I mentioned that EnumMap
instead of HashMap
should be the default choice
for types like Word32
.
But is that actually the case?
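Swapping the map only touches TransMap, getTrans, and the fromList call building the NFA; a sketch of the HashMap variant (assuming a Hashable instance for q):

import Data.HashMap.Strict qualified as HM
import Data.Hashable (Hashable)

type TransMap q = HM.HashMap q (Trans q)

getTrans :: (Hashable q, StateId q) => q -> TransMap q -> Trans q
getTrans q m = HM.lookupDefault (error "invariant failure") q m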
If we replace our EnumMap
with a HashMap
, we’ll get the following results
(NB here we’re comparing against EnumMap
):
String | MUT time, s | GC time, s | Total time, s | % of EnumMap |
---|---|---|---|---|
aa...aa | 0.78 | 0.000 | 0.78 | 177% |
aa...aaz | 0.50 | 0.000 | 0.50 | 185% |
ab...ab | 0.77 | 0.000 | 0.77 | 167% |
ab...abz | 0.66 | 0.000 | 0.66 | 174% |
Indeed, the run time with the HashMap
is
generally 1.7-1.8 times worse than with the EnumMap
.
Overall, EnumMap
is a good choice for a map, so our intuition was correct.
Why maps?
But do we really need a map in the first place?
An observant reader might notice that
the states in our test NFA are indexed by a dense increasing sequence of numbers,
and the final state is the state with the largest index.
While my regexp parsing/NFA conversion functions don’t guarantee that,
we can always reindex the states appropriately.
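A sketch of such a reindexing pass (my code, not the post’s; it assumes that, as here, the final state has no outgoing transitions and every other state has at least one, so it appears as a key in the map):

import Data.EnumMap.Strict qualified as EM

reindex :: StateId q => NFA q -> NFA q
reindex NFA{..} = NFA
  { transitions = EM.fromList [ (newId q, onTrans t) | (q, t) <- EM.toList transitions ]
  , initState = newId initState
  , finState = newId finState
  }
  where
    -- old id -> new dense id; the final state deliberately gets the largest one
    newIds = EM.fromList $ zip (EM.keys transitions ++ [finState]) [0 ..]
    newId q = case q `EM.lookup` newIds of
      Just q' -> q'
      Nothing -> error "invariant failure"
    onTrans = \case
      TEps q -> TEps (newId q)
      TBranch q1 q2 -> TBranch (newId q1) (newId q2)
      TCh w q -> TCh w (newId q)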
Then, we can avoid EnumMap
/std::unordered_map
and just use a vector instead.
In this case, we don’t even have to bother with mutability or linearity
since we never modify that vector once the NFA is constructed.
So let’s do that!
We merely change TransMap
and getTrans
to become
import Data.Vector qualified as V
type TransMap q = V.Vector (Trans q)
getTrans :: StateId q => q -> TransMap q -> Trans q
getTrans q m = m V.! fromIntegral q
Having done that, we get:
String | MUT time, s | GC time, s | Total time, s | % of EnumMap | % of tuned baseline total |
---|---|---|---|---|---|
aa...aa | 0.28 | 0.000 | 0.28 | 64% | 42% |
aa...aaz | 0.20 | 0.000 | 0.20 | 74% | 38% |
ab...ab | 0.25 | 0.000 | 0.25 | 54% | 48% |
ab...abz | 0.21 | 0.000 | 0.21 | 55% | 48% |
A nice improvement for a simple change!
Interestingly, the performance of both V.!
and V.unsafeIndex
is roughly the same in this case.
Moreover, perhaps less surprisingly, the -A128m -w -ki64m -kc8m
incantation
no longer has any measurable effect — we do no GC, so the GC RTS options don’t matter.
From now on, we can omit these options if we wish.
But, as we’ve learned from our previous experiments, boxed vs unboxed vectors do make a difference.
It’s a shame that we cannot use unboxed ones here since Trans q
is a non-trivial sum type.
Or can we?
Unboxed vectors again
We can map the Trans q
type to, say, Word64
,
reserving two bits for the type tag
and packing the corresponding constructor arguments into the rest of the bits.
If we don’t have too many states (say, less than 2^n),
we can pack each q
into n
bits of the Word64
.
All this is better explained in code:
tagShift :: Int
tagShift = 60
toWord64 :: Integral q => Trans q -> Word64
toWord64 = \case
TEps q -> 0b01 .<<. tagShift .|. fromIntegral q
TBranch q1 q2 -> 0b00 .<<. tagShift .|. fromIntegral q1 .|. fromIntegral q2 .<<. 32
TCh w q -> 0b11 .<<. tagShift .|. fromIntegral w .|. fromIntegral q .<<. 8
fromWord64 :: Integral q => Word64 -> Trans q
fromWord64 w = case w .>>. tagShift of
0b01 -> TEps $ fromIntegral w
0b00 -> TBranch (fromIntegral w) (fromIntegral $ w .>>. 32)
_ -> TCh (fromIntegral w) (fromIntegral $ w .>>. 8)
With these two functions, we can declare support for unboxed vectors of Trans q
:
import Data.Vector.Generic qualified as VG
import Data.Vector.Generic.Mutable qualified as VG
import Data.Vector.Unboxed qualified as VU
instance Integral q => VU.IsoUnbox (Trans q) Word64 where
toURepr = toWord64
fromURepr = fromWord64
newtype instance VU.MVector s (Trans q) = MV_Trans (VU.MVector s Word64)
newtype instance VU.Vector (Trans q) = V_Trans (VU.Vector Word64)
deriving via (Trans q `VU.As` Word64) instance Integral q => VG.MVector VU.MVector (Trans q)
deriving via (Trans q `VU.As` Word64) instance Integral q => VG.Vector VU.Vector (Trans q)
instance Integral q => VU.Unbox (Trans q)  -- completes the standard IsoUnbox recipe
and then just replace V
-qualified names with VU
as needed.
I’m also replacing V.!
with VU.unsafeIndex
,
since it starts to show some effect on the performance now.
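Concretely, the lookup side becomes something like this (a sketch; the Unbox machinery for Trans q comes from the instances above):

import Data.Vector.Unboxed qualified as VU

type TransMap q = VU.Vector (Trans q)

getTrans :: (VU.Unbox (Trans q), StateId q) => q -> TransMap q -> Trans q
getTrans q m = m `VU.unsafeIndex` fromIntegral q

With that in place, the numbers look much better: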
String | MUT time, s | GC time, s | Total time, s | % of boxed Vector | % of tuned baseline total |
---|---|---|---|---|---|
aa...aa | 0.134 | 0.000 | 0.136 | 49% | 20% |
aa...aaz | 0.104 | 0.000 | 0.105 | 52% | 20% |
ab...ab | 0.113 | 0.000 | 0.115 | 46% | 22% |
ab...abz | 0.102 | 0.000 | 0.104 | 50% | 24% |
Wow! That’s about 2x improvement in performance! This ugly bit-mingling code is definitely worth it.
Not all encodings are created equal.
My initial implementation of toWord64
/fromWord64
was more like
toWord64 :: Integral q => Trans q -> Word64
toWord64 = \case
TEps q -> 0b00 .|. (fromIntegral q .<<. 2)
TBranch q1 q2 -> 0b01 .|. (fromIntegral q1 .<<. 18) .|. (fromIntegral q2 .<<. 2)
TCh w q -> 0b10 .|. (fromIntegral w .<<. 2) .|. (fromIntegral q .<<. 18)
fromWord64 :: Integral q => Word64 -> Trans q
fromWord64 w = case w .&. 0b11 of
0b00 -> TEps $ fromIntegral $ w .>>. 2
0b01 -> TBranch (fromIntegral $ w .>>. 18) (fromIntegral $ (w .>>. 2) .&. 0xffff)
_ -> TCh (fromIntegral $ w .>>. 2) (fromIntegral $ w .>>. 18)
fromWord64
does considerably more operations
than the final version above,
and it has a measurable effect:
this version results in 8-10 ms added to each benchmark case.
Unboxed arrays again
But, again, vectors sometimes do have some overhead.
How about an unboxed array?
We have to use fromWord64
and toWord64
explicitly when looking up transitions:
type TransMap q = A.UArray Int Word64
getTrans :: StateId q => q -> TransMap q -> Trans q
getTrans q m = fromWord64 $ m `A.unsafeAt` fromIntegral q
and when creating the transitions table:
nfa :: NFA Word32
nfa = NFA{..}
where
...
transitions = A.listArray (0, 12) $ toWord64 @Word32 <$> [...]
Otherwise, the code is mostly the same.
But it’s worth it, giving a further 10-15% improvement over unboxed Vector
s:
String | MUT time, s | GC time, s | Total time, s | % of unboxed Vector | % of tuned baseline total |
---|---|---|---|---|---|
aa...aa | 0.119 | 0.000 | 0.120 | 88% | 18% |
aa...aaz | 0.091 | 0.000 | 0.093 | 89% | 18% |
ab...ab | 0.095 | 0.000 | 0.096 | 83% | 18% |
ab...abz | 0.088 | 0.000 | 0.090 | 87% | 20% |
I’d say this is extremely good!
But what about C++?
Here’s how the C++ version performs if we similarly
replace the std::unordered_map
with the std::vector
.
If we try the push_back()
/pop_back()
/back()
stack version:
String | g++ -O2, s | g++ -O3, s | clang -O2, s | clang -O3, s |
---|---|---|---|---|
aa...aa | 0.141 | 0.132 | 0.338 | 0.221 |
aa...aaz | 0.104 | 0.105 | 0.207 | 0.138 |
ab...ab | 0.111 | 0.108 | 0.317 | 0.224 |
ab...abz | 0.096 | 0.097 | 0.261 | 0.196 |
If we keep the explicit stack position, clang does significantly better at the cost of gcc performing slightly worse:
String | g++ -O2, s | g++ -O3, s | clang -O2, s | clang -O3, s |
---|---|---|---|---|
aa...aa | 0.152 | 0.152 | 0.154 | 0.150 |
aa...aaz | 0.123 | 0.120 | 0.128 | 0.128 |
ab...ab | 0.142 | 0.141 | 0.147 | 0.140 |
ab...abz | 0.125 | 0.129 | 0.133 | 0.136 |
As a side quest,
let’s also see if my intuition about the initialization overhead is correct.
We replace the stack
-as-an-std::vector
with
a malloc
’ed array:
using StackElem = std::pair<Q, size_t>;
auto stack = static_cast<StackElem*>(malloc(24'000'000 * sizeof(StackElem)));
// the rest as before
We use malloc because anything relying on new will still likely initialize the elements.
For instance, gcc compiles (even with -O3 -march=native
)
auto foo()
{
auto stack = std::make_unique<std::pair<uint32_t, size_t>[]>(24'000'000);
stack[0] = { 3, 4 };
return stack[0];
}
to
foo():
push rbx
mov edi, 384000000
call operator new[](unsigned long)
mov rdi, rax
mov rdx, rax
lea rcx, [rax+384000000]
.L2:
mov DWORD PTR [rdx], 0
add rdx, 16
mov QWORD PTR [rdx-8], 0
cmp rcx, rdx
jne .L2
mov DWORD PTR [rdi], 3
mov QWORD PTR [rdi+8], 4
mov rbx, QWORD PTR [rdi]
call operator delete[](void*)
mov edx, 4
mov eax, ebx
pop rbx
ret
Note the fairly silly unvectorized loop setting the whole of stack to 0, only for it all to be destroyed right away.
For comparison, clang does a good job doing constant propagation and elision, and compiles this to
foo():
mov edx, 4
mov eax, 3
ret
If we run this version, we’ll indeed get better numbers:
String | g++ -O2, s | g++ -O3, s | clang -O2, s | clang -O3, s |
---|---|---|---|---|
aa...aa | 0.134 | 0.133 | 0.144 | 0.145 |
aa...aaz | 0.107 | 0.107 | 0.115 | 0.114 |
ab...ab | 0.109 | 0.108 | 0.126 | 0.125 |
ab...abz | 0.098 | 0.097 | 0.113 | 0.113 |
So, this version gives more or less the best numbers for both gcc and clang, at the cost of an explicit malloc.
This malloc is definitely UB in C++ ≤17, since we’re reading and writing objects whose lifetime hasn’t started, and it’s potentially UB in C++ ≥20 as well, but, frankly, I’m not exactly in the mood for reading the C++ standard right now.
std::start_lifetime_as_array
could’ve helped here, but it’s a C++23 feature,
and neither my gcc nor my clang have that yet.
Perhaps unsurprisingly, making a similar change of packing Trans into a uint64_t (the counterpart of our Word64) has a negative effect in the case of C++.
Indeed, the std::variant
is already stored effectively unboxed,
so we just waste time fiddling with bits.
Anyway, back to our main cast:
the unboxed Array
-based version is faster than any of the C++ versions,
while the unboxed Vector
-based version is slower than gcc with push_back()
but faster than the rest.
Summing it up
First, let’s look at the graph with all of the code versions together
(except the HashMap
intuition-verifying one):
Then, despite the length of this post, it’s not a real optimization write-up: I haven’t used the profiler (and for any real performance optimization, you probably should do some profiling), my benchmarking input set is barely representative, and so on.
Yet, I think we have some takeaways:
- Boxing absolutely kills performance in vectors and arrays. If you can, try using unboxed ones, even if it requires ugly and/or non-total conversions between your type and something unboxable.
- GHC RTS is quite good at managing the stack in deeply recursive functions that need to backtrack. The only way to outperform it is via explicit mutability, either in the form of our good old ST monad or via linear types, and managing explicitly what goes onto the stack and what goes as direct, TCO-optimized function arguments.
- Speaking of the latter, linear types do indeed look promising for problems like this. I’d really love to write code that looks almost like your normal immutable Haskell but doesn’t copy things around like crazy. Unfortunately, as of this writing, linear-base only has boxed vectors, and… well, see above regarding boxing.
- Despite the optimism, the results (especially in the graph form) suggest that linear unboxed vectors would still be slower than ST-mutable unboxed vectors. Indeed, boxed linear ones are slightly slower than boxed ST, and it seems reasonable to extrapolate that.
Finally, it’s once again possible to write code that outperforms C++ without too much effort at the cost of just slightly less idiomatic Haskell. And I’m not even mentioning the productivity improvements (I’d hate to write the NFA parser in C++, for example), easier testing (think QuickCheck), and so on.
And if you have any comments or feedback, I’d be happy to hear from you!
Miscellanea
An inquisitive reader might also ask a different question: why do we care at all about the performance of this naive implementation, which has exponential backtracking problems?
One reason is, of course, to see how different decisions and approaches affect performance. It’s always good to experiment with a semi-toy example that is close enough to what you might encounter in the real world, but still manageable in code size and interpretability. This helps develop intuition and choose the right approaches straight away for other, more complex problems.
A more practical reason is that a naive algorithm
can still be included in a production-grade library.
Indeed, on simple regexps and inputs,
the memoization bookkeeping will always make it slower.
So, if a user of the library controls their inputs, they might just use this simpler one.
Alternatively, we could expose a function that races the naive and the memoizing ones.
The naive one finishes first on simple regexps or non-DoS-ing inputs,
so we get its smaller latency.
However, the memoizing one finishes first on a bad input,
and we still avoid going exponential.
The cost is 2x CPU usage, but I can see how it’s a reasonable trade-off in some use cases.
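A minimal sketch of that idea, assuming a memoizing matcher with the same pure signature (matchMemo below is hypothetical and doesn’t exist in this post):

import Control.Concurrent.Async (race)
import Control.Exception (evaluate)

-- Run both matchers concurrently and return whichever finishes first;
-- evaluate forces the pure results so the losing computation gets cancelled.
matchRaced :: NFA Word32 -> BS.ByteString -> IO (MatchResult Int)
matchRaced theNfa str = either id id <$>
  race (evaluate (match theNfa str))      -- the naive matcher from this post
       (evaluate (matchMemo theNfa str))  -- hypothetical memoizing matcher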
And I also didn’t elaborate on my benchmarking setup, and it’s about time.
So, my machine is a Ryzen 7950X3D stock.
For Haskell, I’ve been using GHC 9.8.2 with NCG,
while -fllvm
was generally the same or worse.
For C++, I’ve been using gcc 13.1 and clang 18.1.
I’ve also tried gcc 14, but it seems like
it consistently produces slightly worse results for the
optimized versions of the C++ code towards the end of the post.
-march=native
has either no effect or worsens the numbers for C++,
so I have not enabled it.
And that’s all, folks!
Edit history
- 18 Sep 2024: fixed
enummapset
link, added link to the reddit post, and clarified wording re stack/chunks as per reddit discussion.