# Haskell is quite OK for images: encoding QOI

Last time we’ve looked at writing a decoder for the QOI format. Today, we’ll look at the inverse: encoding QOI images and all that it entails.

Like the last time, this post describes both the final result and the road there. So, there will be lots of code and lots of diffs, beware!

Again, we’ll start with three-channel RGB pixels and generalize to RGBA later.

## Encoding RGB pixels

In the QOI format, encoding each pixel requires going through the possible encoding methods until we find one that can handle our pixel in the context of the previous ones.

For instance, we can produce the `DIFF8`

chunk if the pixel’s alpha channel
hasn’t changed compared to the previous pixel,
and each other channel’s difference is within the `[-2; 2)`

interval.
Otherwise, if alpha again hasn’t changed, the red channel’s delta is within `[-16; 16)`

,
and blue and green deltas are both within `[-8; 8)`

, then we can use `DIFF16`

.
Otherwise, if we’ve recently seen this pixel, we can encode it using the `INDEX`

chunk.
Otherwise… well, you get the idea.

Note that for RGB images the alpha channel doesn’t change for obvious reasons, so we won’t consider it for now.

How can we express this nicely?
If we squint hard enough, we’ll see a `Maybe`

with its `Alternative`

instance!

Indeed, we can represent each way of encoding a pixel as a computation that either succeeds or fails. So, encoding a pixel is just trying each one until one succeeds.

For instance, this is how we can express the `DIFF8`

part
(note again we’re working with `Pixel3`

s for now, so we don’t care about alpha yet):

```
encodeDiff8 :: Word8 -> Word8 -> Word8 -> A.STUArray s Int Word8 -> Int -> Maybe (ST s Int)
encodeDiff8 dr dg db out outPos
| (`isBounded` 2) `all` [dr, dg, db] = let byte = 0b10000000 .|. ((dr + 2) .<<. 4)
.|. ((dg + 2) .<<. 2)
.|. (db + 2)
in Just $ A.unsafeWrite out outPos byte $> 1
| otherwise = Nothing
```

where the helper function `isBounded delta bound`

checks that `-bound <= delta < bound`

:

```
isBounded :: Word8 -> Word8 -> Bool
isBounded d h = d + h < 2 * h
```

So, if all of `dr`

, `dg`

, `db`

are small enough,
then `encodeDiff8 dr dg db out outPos`

returns a computation (wrapped in `Just`

)
that writes the corresponding byte into `out`

at `outPos`

and returns (via `$> 1`

)
how many bytes were written.
Otherwise, if at least one delta is too big, this function just returns `Nothing`

.

In fact, the last two arguments and the return types are common for all of the encoding functions, so let’s abstract that away:

`type VarEncoder s = A.STUArray s Int Word8 -> Int -> Maybe (ST s Int)`

so that our function type looks like

`encodeDiff8 :: Word8 -> Word8 -> Word8 -> VarEncoder s`

`encodeDiff16`

, `encodeDiff24`

, `encodeIndex`

, and `encodeColor`

for the other chunk types.
## That’s how they look!

```
encodeDiff16 :: Word8 -> Word8 -> Word8 -> VarEncoder s
encodeDiff16 dr dg db out outPos
| uncurry isBounded `all` [(dr, 16), (dg, 8), (db, 8)] = let b1 = 0b11000000 .|. (dr + 16)
b2 = (dg + 8) .<<. 4
.|. (db + 8)
in Just $ do A.unsafeWrite out outPos b1
A.unsafeWrite out (outPos + 1) b2
pure 2
| otherwise = Nothing
encodeDiff24 :: Word8 -> Word8 -> Word8 -> Word8 -> VarEncoder s
encodeDiff24 dr dg db da out outPos
| (`isBounded` 16) `all` [dr, dg, db, da] = let bytes :: Word32
bytes = (0b11100000 .<<. 16)
.|. fromIntegral (dr + 16) .<<. 15
.|. fromIntegral (dg + 16) .<<. 10
.|. fromIntegral (db + 16) .<<. 5
.|. fromIntegral (da + 16)
in Just $ do A.unsafeWrite out outPos (fromIntegral $ bytes .>>. 16)
A.unsafeWrite out (outPos + 1) (fromIntegral $ bytes .>>. 8)
A.unsafeWrite out (outPos + 2) (fromIntegral bytes)
pure 3
| otherwise = Nothing
encodeIndex :: Pixel3 -> Word8 -> Pixel3 -> VarEncoder s
encodeIndex px hash runningPx out outPos
| px == runningPx = Just $ A.unsafeWrite out outPos hash $> 1
| otherwise = Nothing
encodeColor :: Pixel3 -> Pixel3 -> A.STUArray s Int Word8 -> Int -> ST s Int
encodeColor (Pixel3 r1 g1 b1) (Pixel3 r0 g0 b0) out outPos = do
A.unsafeWrite out outPos bh
when (hr == 1) $ A.unsafeWrite out (outPos + 1) r1
when (hg == 1) $ A.unsafeWrite out (outPos + 1 + hr) g1
when (hb == 1) $ A.unsafeWrite out (outPos + 1 + hr + hg) b1
pure (1 + hr + hg + hb)
where
hr = fromEnum $ r1 /= r0
hg = fromEnum $ g1 /= g0
hb = fromEnum $ b1 /= b0
bh = 0b11110000 .|. (fromIntegral hr .<<. 3)
.|. (fromIntegral hg .<<. 2)
.|. (fromIntegral hb .<<. 1)
```

`encodeColor`

is not a `VarEncoder`

,
since it always succeeds and doesn’t need to be wrapped in a `Maybe`

.
There’s one more chunk missing: the one denoting a run of a given pixel.
Unfortunately, it does not fit this `Maybe (ST s Int)`

idea as nicely since it can potentially *consume* more than one pixel,
so let’s just express it separately:

```
maxRunLen :: Int
maxRunLen = 8224
encodeRun :: Int -> A.STUArray s Int Word8 -> Int -> ST s Int
encodeRun runLen out outPos
| runLen == 0 = pure outPos
| runLen <= 32 = A.unsafeWrite out outPos (0b01000000 .|. fromIntegral (runLen - 1)) $> outPos + 1
| runLen <= maxRunLen = do let runLen' = runLen - 33
A.unsafeWrite out outPos (0b01100000 .|. fromIntegral (runLen' .>>. 8))
A.unsafeWrite out (outPos + 1) (fromIntegral runLen')
pure $ outPos + 2
| otherwise = pure outPos
```

The handling for runs longer than `maxRunLen`

is, to put it mildly, suboptimal,
but we’ll fix this later.
Let’s get at least something working now!

Given all these functions, the primary encoding loop is straightforward:

- For the current pixel, try each encoding method until one succeeds.
If none does, we fall back to
`encodeColor`

. - Loop over subsequent pixels if they are equal to the current one, and if they are, encode a run.
- Update the “recently seen” pixels array.
- Rinse and repeat.

Or, in code, that’s how it looks. The worker function takes the current input and output positions, the current pixel, and the next pixel:

` let step inPos outPos px@(Pixel3 r1 g1 b1) prevPx@(Pixel3 r0 g0 b0) = do`

It then computes the difference between the pixels and finds the “recently seen” pixel with our pixel’s hash:

```
let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)
let hash = pixelHash px
runningPx <- A.unsafeRead running hash
```

It then tries each encoder, leveraging the `Applicative`

instance for `Maybe`

and using `fromMaybe`

with `encodeColor`

as the fallback:

```
pxDiff <- fromMaybe (encodeColor px prevPx result outPos)
$ encodeDiff8 dr dg db result outPos
<|> encodeIndex px (fromIntegral hash) runningPx result outPos
<|> encodeDiff16 dr dg db result outPos
<|> encodeDiff24 dr dg db 0 result outPos
```

Note that the encoders are arranged in a carefully selected order
so that the ones producing smaller output are tried first.
Moreover, this is the place where the laziness comes in really handy.
Since `fromMaybe`

and `Maybe`

’s `<|>`

are lazy,
none of the encoders will actually be executed unless the previous ones fail.

Having done that, `pxDiff`

says how much we should advance in our output.
But, before that, we update our recently seen pixels
and check if the current pixel is a start of a run:

```
A.unsafeWrite running hash px
let tryRun pos runLen
| pos + 3 <= inLen = do
let thisPixel = readPixel3 inBytes pos
if thisPixel /= px || runLen >= maxRunLen
then (pos, runLen, thisPixel)
else tryRun (pos + 3) (runLen + 1)
| otherwise = (pos + 3, runLen, px)
let (afterRun, runLen, nextPix) = tryRun (inPos + 3) 0
outPos' <- encodeRun runLen result (outPos + pxDiff)
```

Here, `readPixel inBytes pos`

reads the RGB pixel at the given position in the input,
which is literally just about reading three `Word8`

s from the input `ByteString`

.

And finally, if we have more pixels to encode, we loop again. Otherwise we stop:

```
if afterRun <= inLen
then step afterRun outPos' nextPix px
else pure outPos'
```

The only question is how big should the output be? Unfortunately, we don’t know until we’ve processed the whole image.

The solution is to compute the upper bound!
Indeed, the worst case is that each pixel is different enough from the previous one
to require the entire `COLOR`

chunk, encoding all the pixel channels.
This gives us one byte per channel plus one byte to denote the `COLOR`

chunk per pixel
for a total of `(channels + 1) × width × height`

bytes.

Or, in Haskell terms, we’ll just write a function that, given a QOI header describing the image we’re encoding, computes the maximum output length (and also encodes the header while we’re at it):

```
maxResultSize :: Header -> (Int, BS.ByteString)
maxResultSize h@Header { .. } = (maxLen, headerBS)
where
headerBS = BSL.toStrict $ encode h
maxLen = BS.length headerBS
+ fromIntegral (hChannels + 1) * fromIntegral (hWidth * hHeight)
+ 4 -- end padding
```

## encoder function.

```
encodeRaw :: Header -> BS.ByteString -> A.UArray Int Word8
encodeRaw header inBytes = A.runSTUArray $ do
(result :: A.STUArray s Int Word8) <- A.unsafeNewArray_ (0, maxLen - 1)
forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
running <- A.newArray @(A.STUArray s) (0, 63 :: Int) initPixel
let step inPos outPos px@(Pixel3 r1 g1 b1) prevPx@(Pixel3 r0 g0 b0) = do
let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)
let hash = pixelHash px
runningPx <- A.unsafeRead running hash
pxDiff <- fromMaybe (encodeColor px prevPx result outPos)
$ encodeDiff8 dr dg db result outPos
<|> encodeIndex px (fromIntegral hash) runningPx result outPos
<|> encodeDiff16 dr dg db result outPos
<|> encodeDiff24 dr dg db 0 result outPos
A.unsafeWrite running hash px
let tryRun pos runLen
| pos + 3 <= inLen = do
let thisPixel = readPixel3 inBytes pos
if thisPixel /= px || runLen >= maxRunLen
then (pos, runLen, thisPixel)
else tryRun (pos + 3) (runLen + 1)
| otherwise = (pos + 3, runLen, px)
let (afterRun, runLen, nextPix) = tryRun (inPos + 3) 0
outPos' <- encodeRun runLen result (outPos + pxDiff)
if afterRun <= inLen
then step afterRun outPos' nextPix px
else pure outPos'
final <- step 0 headerLen (readPixel3 inBytes 0) initPixel
forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
pure $ unsafeShrink result (final + 4)
where
inLen = BS.length inBytes
(maxLen, headerBS) = maxResultSize header
headerLen = BS.length headerBS
```

How well does this stuff perform? If we run it on the test image, it’ll take **259** ms.
As a comparison, recall that the baseline C implementation takes around the same ≈260 ms compiled with gcc and ≈340 ms compiled with clang.
Not too bad, right?

There’s one minor optimization we can already do, though.
Notice that `encodeDiff8`

is tried first, but it doesn’t use the `runningPx`

.
Indeed, let’s delay reading it until `encodeDiff8`

fails!
Unfortunately, the code gets a little bit uglier:

```
pxDiff <- case encodeDiff8 dr dg db result outPos of
Just act -> act
_ -> do runningPx <- A.unsafeRead running hash
fromMaybe (encodeColor px prevPx result outPos)
$ encodeIndex px (fromIntegral hash) runningPx result outPos
<|> encodeDiff16 dr dg db result outPos
<|> encodeDiff24 dr dg db 0 result outPos
```

but it now runs in 238 ms. That’s already faster than our C baseline, noice!

But, before seeing how we could optimize this, let’s first switch to something else.

## Tests

Now that we have both an encoder and a decoder, we can write very straightforward tests, verifying what we did makes sense and forming the basis for our future changes, optimizations, and refactorings.

Indeed, we expect the following to hold: `decode ∘ encode = id`

.
Unfortunately, we’re doing this in Haskell and not Idris, so we cannot *prove* this,
but we can go with the next best approximation: just QuickCheck it!

### Generating random images

For QuickCheck, we need to come up with a random images generator.
The most obvious way is to generate a random sequence of `Word8`

s
of the right size, but it doesn’t cover the domain that well.

Indeed, for an RGB image,
the probability that the next pixel is similar enough to the previous one
to exercise the `DIFF8`

path is (3 * 3 * 3) / (255 * 255 * 255) ≈ 1.6 × 10^{-6}.
This means that we can expect a megapixel image to contain roughly one or two such pairs of pixels
that `DIFF8`

is exercised.

And that’s just for RGB.
For RGBA, which we’ll need to eventually test as well, it’s even worse — in fact, 255 times worse:
a *gigapixel* image can be expected to contain just about 6 such pairs!

That won’t fly. We need a more intelligent way of generating test cases.

So, instead, we first choose whether we want a run of the current pixel,
or the next pixel to be “really close” to the current one (to go the `DIFF8`

route),
or a tad more different (to exercise `DIFF16`

), and so on.
Then we generate the specific parameters: the run length,
or the difference within the `DIFF8`

-expressible bounds, or…
Well, you get the idea.

Before expressing that, we’ll need a couple of helper definitions.
First, a wrapper around `ByteString`

s that displays their contents as bytes,
which will come in handy for pretty-printing the failing cases, if any:

```
newtype ShowAsBytes = ShowAsBytes { bytes :: BS.ByteString } deriving (Eq)
instance Show ShowAsBytes where
show = show . toArray3 . bytes
toArray3 :: BS.ByteString -> A.UArray Int Pixel3
toArray3 bs = A.array (0, pxCnt - 1) [ (i, Pixel3 r g b)
| i <- [0..pxCnt - 1]
, let r = bs `BS.index` (i * 3)
g = bs `BS.index` (i * 3 + 1)
b = bs `BS.index` (i * 3 + 2)
]
where
pxCnt = BS.length bs `div` 3
```

Then, we define a type for randomly generated raw images:

```
data Image3 = Image3
{ iWidth :: Int
, iHeight :: Int
, iBytes :: ShowAsBytes
} deriving (Eq, Show)
```

We’ll also need a couple of helper generators for pixels:
one generating a completely arbitrary pixel
and another one for a pixel that’s different from the given one
by not more than `delta`

:

```
genPixel3 :: Gen Pixel3
genPixel3 = Pixel3 <$> chooseAny <*> chooseAny <*> chooseAny
genDiff3Bounded :: Word8 -> Pixel3 -> Gen Pixel3
genDiff3Bounded delta (Pixel3 r g b) = do
diffs <- replicateM 3 $ choose (negate delta, delta - 1)
case diffs of
[dr, dg, db] -> pure $ Pixel3 (r + dr) (g + dg) (b + db)
_ -> error "muh dependent types"
```

Now, we can finally make `Image3`

s `Arbitrary`

.
To generate an image,
we first choose its width and height within some bounds
(controlled by QuickCheck’s `size`

parameter)
and an initial pixel.
Then we repeatedly generate the rest of the pixels,
each time choosing the specific generator:

```
instance Arbitrary Image3 where
arbitrary = do
maxDim <- getSize
width <- chooseInt (1, maxDim)
height <- chooseInt (1, maxDim)
px0 <- genPixel3
pixels <- V.iterateNM (width * height) step (px0, 0)
pure $ Image3 width height $ ShowAsBytes $ BS.pack $ V.toList $ V.concatMap (\(Pixel3 r g b, _) -> [r, g, b]) pixels
where
step (prevPixel, 0) = do
runToss <- chooseInt (1, 100)
if runToss >= 10
then do nextPixel <- oneof [ genDiff3Bounded 2 prevPixel
, genDiff3Bounded 8 prevPixel
, genDiff3Bounded 16 prevPixel
, genPixel3
]
pure (nextPixel, 0)
else (prevPixel,) <$> chooseInt (1, runToss * 1000)
step (prevPixel, n) = pure (prevPixel, n - 1)
```

With this instance, writing a property test is more or less straightforward: for an arbitrary image, we encode it and then try to decode it. The decoding should succeed, and the decoded bytes should be equal to the original ones:

```
main :: IO ()
main = hspec $ modifyMaxSuccess (const 100000) $
describe "QOI encoder" $
it "decode . encode = id" $ property $ \Image3 { .. } -> do
let header = Header { hMagic = matchBytes
, hWidth = fromIntegral iWidth
, hHeight = fromIntegral iHeight
, hChannels = 3
, hColorspace = 0
}
let encoded = encodeRaw header (bytes iBytes)
decoded = decodeQoi $ BS.pack $ A.elems encoded
decoded `shouldSatisfy` isRight
case decoded of
Left _ -> pure ()
Right (header', Pixels3 pixels') -> do
header' `shouldBe` header
pixels' `shouldBe` toArray3 (bytes iBytes)
Right _ -> fail "Expected Pixels3"
```

Let’s run this!

```
QOI encoder
decode . encode = id FAILED [1]
Failures:
test/Spec.hs:83:7:
1) QOI encoder decode . encode = id
Falsifiable (after 1 test):
Image3 {iWidth = 0, iHeight = 0, iBytes = array (0,-1) []}
predicate failed on: Left UnpaddedFile
```

Ugh, right. It fails on an empty input — something we haven’t thought of!

Luckily, the fix is just about it — we just need to check if the input is not null:

```
--- a/src/Data/Image/Qoi/Encoder.hs
+++ b/src/Data/Image/Qoi/Encoder.hs
@@ -154,7 +154,9 @@ encodeRaw header inBytes = A.runSTUArray $ do
- final <- step 0 headerLen (readPixel3 inBytes startPos) initPixel
+ final <- if not $ BS.null inBytes
+ then step startPos headerLen (readPixel3 inBytes startPos) initPixel
+ else pure headerLen
```

After changing this and rerunning the tests, we get a very satisfying result:

```
QOI encoder
decode . encode = id
+++ OK, passed 100000 tests.
Finished in 92.7709 seconds
1 example, 0 failures
```

So, we verified the property holds on 100k random images in just a minute and a half! Pretty cool, isn’t it?

### Parallel execution

One minor downside is that
all these random images are generated and passed through our QOI implementation sequentially
in a single thread.
My machine has 12 cores, so that’s unfortunate, and while `hspec`

supports parallel execution of tests,
QuickCheck doesn’t seem to parallelize that.

What we could do instead is spawn several test groups and execute *them* in parallel:

```
imgProperty :: Image3 -> IO ()
imgProperty Image3 { .. } = do
let header = Header { hMagic = matchBytes
, hWidth = fromIntegral iWidth
, hHeight = fromIntegral iHeight
, hChannels = 3
, hColorspace = 0
}
let encoded = encodeRaw header (bytes iBytes) 0
decoded = decodeQoi $ BS.pack $ A.elems encoded
decoded `shouldSatisfy` isRight
case decoded of
Left _ -> pure ()
Right (header', Pixels3 pixels') -> do
header' `shouldBe` header
pixels' `shouldBe` toArray3 (bytes iBytes)
Right _ -> fail "Expected Pixels3"
main :: IO ()
main = hspec $ modifyMaxSuccess (const 100000) $
describe "QOI encoder" $
parallel $ forM_ ([1..10] :: [Int]) $ \wrk ->
it ("decode . encode = id (worker " <> show wrk <> ")") $ property imgProperty
```

Running this allows processing 1 million random images in about 3 minutes:

```
QOI encoder
decode . encode = id (worker 1)
+++ OK, passed 100000 tests.
decode . encode = id (worker 2)
+++ OK, passed 100000 tests.
decode . encode = id (worker 3)
+++ OK, passed 100000 tests.
decode . encode = id (worker 4)
+++ OK, passed 100000 tests.
decode . encode = id (worker 5)
+++ OK, passed 100000 tests.
decode . encode = id (worker 6)
+++ OK, passed 100000 tests.
decode . encode = id (worker 7)
+++ OK, passed 100000 tests.
decode . encode = id (worker 8)
+++ OK, passed 100000 tests.
decode . encode = id (worker 9)
+++ OK, passed 100000 tests.
decode . encode = id (worker 10)
+++ OK, passed 100000 tests.
Finished in 176.7814 seconds
10 examples, 0 failures
```

Of course, this is not ideal. Firstly, due to how QuickCheck generates examples, there will be some overlap between different workers that could otherwise be avoided. and also the overall wall time has also increased, but it’s still better than running 1 million images in 15 minutes.

### Shrinking

Now, *arguendo*,
suppose QuickCheck has found a much bigger case manifesting a bug in our implementation.

To be more concrete, suppose we’ve accidentally made a typo,
and we’re subtracting `32`

instead of `33`

in the `encodeRun`

function:

```
@@ -105,7 +105,7 @@ encodeRun :: Int -> A.STUArray s Int Word8 -> Int -> ST s Int
encodeRun runLen out outPos
| runLen == 0 = pure outPos
| runLen <= 32 = A.unsafeWrite out outPos (0b01000000 .|. fromIntegral (runLen - 1)) $> outPos + 1
- | otherwise = do let runLen' = runLen - 33
+ | otherwise = do let runLen' = runLen - 32
A.unsafeWrite out outPos (0b01100000 .|. fromIntegral (runLen' .>>. 8))
A.unsafeWrite out (outPos + 1) (fromIntegral runLen')
pure $ outPos + 2
```

On my system, QuickCheck does find the issue after just 41 tests:

```
1) QOI encoder decode . encode = id
Falsifiable (after 41 tests):
Image3 {iWidth = 39, iHeight = 40, iBytes = array (0,1559)
...
Randomized with seed 1709694074
```

The problem here is that the counterexample is really huge — 1560 pixels, and so are the expected and actual outputs. Good luck finding the problem or even grasping the example!

The solution is to enable *shrinking* in our `Arbitrary Image3`

instance.
Shrinking allows QuickCheck to reduce a particular counterexample.
That is, a function `shrink`

(which is our responsibility to write)
is given a failing counterexample by QuickCheck,
and it should return some smaller examples that might exhibit the same bug.

How do we shrink images?
One way is as follows.
For a *w×h* image, if `h > 1`

, we can reduce it to *h* non-empty sub-images,
where the `i`

th sub-image is our original one with the `i`

th row crossed out.
We can also reduce the width by crossing out each column,
but since the images are laid out in the row-major order,
removing columns is computationally expensive, so let’s only do this if, say,
there are less than five rows.
Or, in code:

```
shrink Image3 { .. } = [ Image3
{ iWidth = iWidth
, iHeight = iHeight - 1
, iBytes = ShowAsBytes $ dropIthRow i $ bytes iBytes
}
| i <- [ 0 .. iHeight - 1 ]
]
<>
[ Image3
{ iWidth = iWidth - 1
, iHeight = iHeight
, iBytes = ShowAsBytes $ dropJthCol j $ bytes iBytes
}
| iHeight < 5
, j <- [0 .. iWidth - 1]
]
where
dropIthRow i str = BS.take (iWidth * 3 * i) str
<> BS.drop (iWidth * 3 * (i + 1)) str
dropJthCol j str
| BS.null str = str
| otherwise = let (row, rest) = BS.splitAt (iWidth * 3) str
(left, right) = BS.splitAt (j * 3) row
in left <> BS.drop 3 right <> dropJthCol j rest
```

```
1) QOI encoder decode . encode = id
Falsifiable (after 41 tests and 43 shrinks):
Image3 {iWidth = 34, iHeight = 2, iBytes = array (0,67) [...]}
expected: array (0,67) [...,(32,Pixel3 238 173 40),(33,Pixel3 238 173 40),(34,Pixel3 197 136 117),(35,Pixel3 197 136 117),...]
but got: array (0,67) [...,(32,Pixel3 238 173 40),(33,Pixel3 238 173 40),(34,Pixel3 238 173 40),(35,Pixel3 197 136 117),...]
```

Seems like there’s one `(238, 173, 40)`

pixel too many in the output!

Unfortunately, the test doesn’t tell us what *exactly* is broken:
it could be an off-by-one error either in the encoder (encoding one extra pixel in a run)
or in the decoder (similarly, decoding one extra pixel in a run),
and it’s up to us to verify which one is it, really.
But the test nevertheless gives a pretty decent pointer at where we should look!

Having written the tests, we change either the decoder or the encoder as we see fit while having at least some assurance that our changes don’t break things.

## Looping wisely

Right now, detecting a pixel run happens separately within `step`

,
in the `tryRun`

recursive function.
This most likely compiles to a tight nested loop,
but is this really the most CPU-friendly approach?

This is where there is no good intuition or a priori knowledge,
and we shall just try a different approach.
Namely, we’ll make `step`

itself check if the current pixel is equal to the previous one.
While we’re at it, we’ll also ensure that runs longer than `maxRunLen`

are handled correctly.

All in all, the resulting `step`

function now looks like this:

```
encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
encodeRaw header inBytes startPos = A.runSTUArray $ do
(result :: A.STUArray s Int Word8) <- A.unsafeNewArray_ (0, maxLen - 1)
forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
running <- A.newArray @(A.STUArray s) (0, 63 :: Int) initPixel
let step inPos runLen prevPx@(Pixel3 r0 g0 b0) outPos
| inPos + 3 <= inLen
, readPixel3 inBytes inPos == prevPx =
if runLen /= maxRunLen - 1
then step (inPos + 3) (runLen + 1) prevPx outPos
else encodeRun maxRunLen result outPos >>= step (inPos + 3) 0 prevPx
| inPos + 3 <= inLen = do
let px@(Pixel3 r1 g1 b1) = readPixel3 inBytes inPos
let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)
outPos' <- encodeRun runLen result outPos
let hash = pixelHash px
pxDiff <- case encodeDiff8 dr dg db result outPos' of
Just act -> act
_ -> do runningPx <- A.unsafeRead running hash
fromMaybe (encodeColor px prevPx result outPos')
$ encodeIndex px (fromIntegral hash) runningPx result outPos'
<|> encodeDiff16 dr dg db result outPos'
<|> encodeDiff24 dr dg db 0 result outPos'
A.unsafeWrite running hash px
step (inPos + 3) 0 px (outPos' + pxDiff)
| runLen /= 0 = encodeRun runLen result outPos
| otherwise = pure outPos
final <- step 0 0 initPixel headerLen
forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
pure $ unsafeShrink result (final + 4)
where
inLen = BS.length inBytes
(maxLen, headerBS) = maxResultSize header
headerLen = BS.length headerBS
```

Notice that we don’t need any additional checks for non-empty input in this version. We’re also correctly handling runs of more than 8224 pixels by dumping intermediate chunks. So, for instance, for a run of length 20000, we’ll output:

- a run of 8224 pixels,
- a
`DIFF8`

chunk with all-zero diffs, - another run of 8224 pixels,
- another zero
`DIFF8`

, - the remaining run.

To be fair, these extra `DIFF8`

s result in somewhat suboptimal compression
since we could have written out one run right after the other.
But even in the worst case of an image being a single colossal run,
that’s just 1/8224th, or 0.012% worse compression —
something I, for one, can surely live with!

The effect?
**231 ms**, or about 2.5% improvement over the version with the nested loop.
Not too much, but let’s keep it this way, especially given that the tests still pass
and the code is IMO slightly more straightforward.

## like this.

```
encodeIntoArray :: forall s. Int -> BS.ByteString -> Int -> A.STUArray s Int Word8 -> ST s Int
encodeIntoArray headerLen inBytes startPos result = do
running <- A.newArray @(A.STUArray s) (0, 63 :: Int) initPixel
let step inPos runLen prevPx@(Pixel3 r0 g0 b0) outPos
| inPos + 3 <= inLen
, readPixel3 inBytes inPos == prevPx =
if runLen /= maxRunLen - 1
then step (inPos + 3) (runLen + 1) prevPx outPos
else encodeRun maxRunLen result outPos >>= step (inPos + 3) 0 prevPx
| inPos + 3 <= inLen = do
let px@(Pixel3 r1 g1 b1) = readPixel3 inBytes inPos
let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)
outPos' <- encodeRun runLen result outPos
let hash = pixelHash px
pxDiff <- case encodeDiff8 dr dg db result outPos' of
Just act -> act
_ -> do runningPx <- A.unsafeRead running hash
fromMaybe (encodeColor px prevPx result outPos')
$ encodeIndex px (fromIntegral hash) runningPx result outPos'
<|> encodeDiff16 dr dg db result outPos'
<|> encodeDiff24 dr dg db 0 result outPos'
A.unsafeWrite running hash px
step (inPos + 3) 0 px (outPos' + pxDiff)
| runLen /= 0 = encodeRun runLen result outPos
| otherwise = pure outPos
step startPos 0 initPixel headerLen
where
inLen = BS.length inBytes
{-# INLINE encodeIntoArray #-}
encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
encodeRaw header inBytes startPos = A.runSTUArray $ do
result <- A.unsafeNewArray_ (0, maxLen - 1)
forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
final <- encodeIntoArray headerLen inBytes startPos result
forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
pure $ unsafeShrink result (final + 4)
where
(maxLen, headerBS) = maxResultSize header
headerLen = BS.length headerBS
```

I think it doesn’t change much, but it’s just a tad nicer code.

## Generalizing to RGBA pixels

Ah, that sweet part! We’re finally in a good position to do some finishing touches on our encoder.

Again, we’ll rely heavily on the `Pixel`

class we’ve written earlier.
There’s one thing missing, though: this class doesn’t support reading `Pixel`

from a `ByteString`

.

Reading a pixel might be expressed as

```
readPixel :: Pixel pixel
=> BS.ByteString
-> Int {- current offset -}
-> (Pixel, Int {- new offset -})
```

But this type is too powerful for our purposes:
it implies the new offset must be determined dynamically from the current offset.
Luckily, this is not the case: we know the new offset equals the current one
plus `3`

or `4`

for RGB and RGBA pixels, respectively.
Thus, we end up adding two different methods:

```
class Pixel pixel where
...
readPixel :: BS.ByteString -> Int -> a
channelCount :: proxy a -> Int
instance Pixel Pixel3 where
...
readPixel str pos = Pixel3 (str ! pos) (str ! pos + 1) (str ! pos + 2)
{-# INLINE readPixel #-}
channelCount _ = 3
{-# INLINE channelCount #-}
instance Pixel Pixel4 where
...
readPixel str pos = Pixel4 (str ! pos) (str ! pos + 1) (str ! pos + 2) (str ! pos + 3)
{-# INLINE readPixel #-}
channelCount _ = 4
{-# INLINE channelCount #-}
```

Next, we need to modify our `encodeDiff8`

and `encodeDiff16`

functions
since we must now check that the alpha channel difference is indeed `0`

.
This is just a matter of a single extra argument that we pattern match on:

```
-encodeDiff8 :: Word8 -> Word8 -> Word8 -> VarEncoder s
-encodeDiff8 dr dg db out outPos
+encodeDiff8 :: Word8 -> Word8 -> Word8 -> Word8 -> VarEncoder s
+encodeDiff8 dr dg db 0 out outPos
| (`isBounded` 2) `all` [dr, dg, db] = let byte = 0b10000000 .|. ((dr + 2) .<<. 4)
.|. ((dg + 2) .<<. 2)
.|. (db + 2)
in Just $ A.unsafeWrite out outPos byte $> 1
- | otherwise = Nothing
+encodeDiff8 _ _ _ _ _ _ = Nothing
{-# INLINE encodeDiff8 #-}
-encodeDiff16 :: Word8 -> Word8 -> Word8 -> VarEncoder s
-encodeDiff16 dr dg db out outPos
+encodeDiff16 :: Word8 -> Word8 -> Word8 -> Word8 -> VarEncoder s
+encodeDiff16 dr dg db 0 out outPos
| uncurry isBounded `all` [(dr, 16), (dg, 8), (db, 8)] = let b1 = 0b11000000 .|. (dr + 16)
b2 = (dg + 8) .<<. 4
.|. (db + 8)
in Just $ do A.unsafeWrite out outPos b1
A.unsafeWrite out (outPos + 1) b2
pure 2
- | otherwise = Nothing
+encodeDiff16 _ _ _ _ _ _ = Nothing
{-# INLINE encodeDiff16 #-}
```

`encodeIndex`

’s definition is not updated; it’s only the type that changes:

```
-encodeIndex :: Pixel3 -> Word8 -> Pixel3 -> VarEncoder s
+encodeIndex :: Eq pixel => pixel -> Word8 -> pixel -> VarEncoder s
```

In `encodeColor`

, we now need to account for the alpha channel too:

```
-encodeColor :: Pixel3 -> Pixel3 -> A.STUArray s Int Word8 -> Int -> ST s Int
-encodeColor (Pixel3 r1 g1 b1) (Pixel3 r0 g0 b0) out outPos = Just $ do
+encodeColor :: Pixel pixel => pixel -> pixel -> A.STUArray s Int Word8 -> Int -> ST s Int
+encodeColor px1 px0 out outPos = do
A.unsafeWrite out outPos bh
when (hr == 1) $ A.unsafeWrite out (outPos + 1) r1
when (hg == 1) $ A.unsafeWrite out (outPos + 1 + hr) g1
when (hb == 1) $ A.unsafeWrite out (outPos + 1 + hr + hg) b1
- pure (1 + hr + hg + hb)
+ when (ha == 1) $ A.unsafeWrite out (outPos + 1 + hr + hg + hb) a1
+ pure (1 + hr + hg + hb + ha)
where
+ (r1, g1, b1, a1) = toRGBA px1
+ (r0, g0, b0, a0) = toRGBA px0
hr = fromEnum $ r1 /= r0
hg = fromEnum $ g1 /= g0
hb = fromEnum $ b1 /= b0
+ ha = fromEnum $ a1 /= a0
bh = 0b11110000 .|. (fromIntegral hr .<<. 3)
.|. (fromIntegral hg .<<. 2)
.|. (fromIntegral hb .<<. 1)
+ .|. fromIntegral ha
{-# INLINE encodeColor #-}
```

The rest of the changes primarily amount to using the `Pixel`

class methods.
We also change `encodeIntoArray`

to accept
a dummy `proxy pixel`

argument denoting what pixels are expected,
and we also check the channels count in `encodeRaw`

,
calling `encodeIntoArray @Pixel3`

or `encodeIntoArray @Pixel4`

, respectively.
Again, we rely on the compiler substituting the corresponding instance methods into each of these calls,
specializing them and inlining any indirections away.

##
The final form of `encodeRaw`

and `encodeIntoArray`

, for the curious.

```
encodeIntoArray :: forall pixel s. Pixel pixel
=> Proxy pixel
-> Int
-> BS.ByteString
-> Int
-> A.STUArray s Int Word8
-> ST s Int
encodeIntoArray proxy headerLen inBytes startPos result = do
running <- A.newArray @(A.STUArray s) (0, 63 :: Int) (fromRGBA @pixel 0 0 0 255)
let step inPos runLen prevPx outPos
| inPos + diff <= inLen
, readPixel inBytes inPos == prevPx =
if runLen /= maxRunLen - 1
then step (inPos + diff) (runLen + 1) prevPx outPos
else encodeRun maxRunLen result outPos >>= step (inPos + diff) 0 prevPx
| inPos + diff <= inLen = do
let (r0, g0, b0, a0) = toRGBA prevPx
let px = readPixel inBytes inPos
let (r1, g1, b1, a1) = toRGBA px
let (dr, dg, db, da) = (r1 - r0, g1 - g0, b1 - b0, a1 - a0)
outPos' <- encodeRun runLen result outPos
let hash = pixelHash px
pxDiff <- case encodeDiff8 dr dg db da result outPos' of
Just act -> act
_ -> do runningPx <- A.unsafeRead running hash
fromMaybe (encodeColor px prevPx result outPos')
$ encodeIndex px (fromIntegral hash) runningPx result outPos'
<|> encodeDiff16 dr dg db da result outPos'
<|> encodeDiff24 dr dg db da result outPos'
A.unsafeWrite running hash px
step (inPos + diff) 0 px (outPos' + pxDiff)
| runLen /= 0 = encodeRun runLen result outPos
| otherwise = pure outPos
step startPos 0 (fromRGBA 0 0 0 255) headerLen
where
inLen = BS.length inBytes
diff = channelCount proxy
{-# INLINE encodeIntoArray #-}
encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
encodeRaw header inBytes startPos = A.runSTUArray $ do
result <- A.unsafeNewArray_ (0, maxLen - 1)
forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
final <- if hChannels header == 3
then encodeIntoArray @Pixel3 Proxy headerLen inBytes startPos result
else encodeIntoArray @Pixel4 Proxy headerLen inBytes startPos result
forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
pure $ unsafeShrink result (final + 4)
where
(maxLen, headerBS) = maxResultSize header
headerLen = BS.length headerBS
```

Similarly, we can generalize the tests to generate and verify the right property for RGBA images, which is left as an exercise for the reader.

Anyway, does this generalization affect performance in any way?
Surprisingly, it even improves the numbers by a couple percent:
the code now runs in **233 ms**.

But, speaking of surprises, there’s another oddity.
Right now, we don’t handle `BS.ByteString`

s that are slices of some bigger strings
(see the discussion in the first part).
One way to handle them is to just apply the `unoffsetBS`

function from the first part to the input string:

```
@@ -164,7 +164,7 @@ encodeIntoArray proxy headerLen inBytes startPos result = do
{-# INLINE encodeIntoArray #-}
encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
-encodeRaw header inBytes startPos = A.runSTUArray $ do
+encodeRaw header inBytes' startPos = A.runSTUArray $ do
result <- A.unsafeNewArray_ (0, maxLen - 1)
forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
final <- if hChannels header == 3
@@ -173,5 +173,6 @@ encodeRaw header inBytes startPos = A.runSTUArray $ do
forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
pure $ unsafeShrink result (final + 4)
where
+ inBytes = unoffsetBS inBytes'
(maxLen, headerBS) = maxResultSize header
headerLen = BS.length headerBS
```

Curiously, this change makes the code run yet *faster* by about 6%,
reducing the run time further down to **219 ms**.
I don’t have a good intuition why this happens,
except perhaps the compiler is now able to see that the `offset`

string field is constantly `0`

and apply some extra optimizations based on that.

Anyway, **219 ms** is a pretty decent result.

## Optimizing RGBA pixels

All the above results, both for encoding and decoding (from the previous part), are on a 3-channel, RGB image. But how does our implementation perform with RGBA?

Converting the same test image to RGBA by setting the alpha channel to constant 255
and running the *encoder* yields **272 ms**.
*Decoding* the corresponding QOI image, on the other hand, takes about **204 ms**.
The encoder is somewhat on par with the C implementation (whose best times are 260 ms when compiled with gcc).
The decoder is also on par with the C version compiled with gcc
but loses about 15% if the C decoder is compiled with clang.
So it’s time to ask our favourite question: can we do better?

Turns out we can.

One place to look at is the representation of the RGBA pixels.
Right now, an RGBA pixel is represented as an unpacked 4-tuple of `Word8`

s:

`data Pixel4 = Pixel4 Word8 Word8 Word8 Word8 deriving (Show, Eq)`

Reading (and writing) these pixels amounts to four `Word8`

loads (and stores),
which might be expensive.
On the bright side, the implementation of `toRGBA`

/`fromRGBA`

is really cheap
and boils down to some register shuffling.

But what if we instead represent an RGBA pixel as a `Word32`

?
Reading or writing would be a single load or store,
at the expense of some bit shifts in `toRGBA`

/`fromRGBA`

.
Would it be worth it?
Let’s find out!

So, we change the definition to

`newtype Pixel4 = Pixel4 Word32 deriving (Show, Eq)`

and the corresponding array instance methods become straightforward:

```
@@ -73,26 +75,18 @@ instance A.MArray (A.STUArray s) Pixel4 (ST s) where
getNumElements (A.STUArray _ _ n _) = pure n
{-# INLINE getNumElements #-}
- newArray_ arrBounds = A.newArray arrBounds (Pixel4 0 0 0 0)
+ newArray_ arrBounds = A.newArray arrBounds (Pixel4 0)
{-# INLINE newArray_ #-}
unsafeNewArray_ (l, u) = A.unsafeNewArraySTUArray_ (l, u) (*# 4#)
{-# INLINE unsafeNewArray_ #-}
unsafeRead (A.STUArray _ _ _ marr#) (I# n#) = ST $ \s1# ->
- let n'# = n# *# 4#
- !(# s2#, r# #) = readWord8Array# marr# n'# s1#
- !(# s3#, g# #) = readWord8Array# marr# (n'# +# 1#) s2#
- !(# s4#, b# #) = readWord8Array# marr# (n'# +# 2#) s3#
- !(# s5#, a# #) = readWord8Array# marr# (n'# +# 3#) s4#
- in (# s5#, Pixel4 (W8# r#) (W8# g#) (W8# b#) (W8# a#) #)
+ let !(# s2#, rgba# #) = readWord32Array# marr# n# s1#
+ in (# s2#, Pixel4 (W32# rgba#) #)
{-# INLINE unsafeRead #-}
- unsafeWrite (A.STUArray _ _ _ marr#) (I# n#) (Pixel4 (W8# r#) (W8# g#) (W8# b#) (W8# a#)) = ST $ \s1# ->
- let n'# = n# *# 4#
- s2# = writeWord8Array# marr# n'# r# s1#
- s3# = writeWord8Array# marr# (n'# +# 1#) g# s2#
- s4# = writeWord8Array# marr# (n'# +# 2#) b# s3#
- s5# = writeWord8Array# marr# (n'# +# 3#) a# s4#
- in (# s5#, () #)
+ unsafeWrite (A.STUArray _ _ _ marr#) (I# n#) (Pixel4 (W32# rgba#)) = ST $ \s1# ->
+ let s2# = writeWord32Array# marr# n# rgba# s1#
+ in (# s2#, () #)
{-# INLINE unsafeWrite #-}
instance A.IArray A.UArray Pixel4 where
@@ -100,14 +94,9 @@ instance A.IArray A.UArray Pixel4 where
{-# INLINE bounds #-}
numElements (A.UArray _ _ n _) = n
{-# INLINE numElements #-}
- unsafeArray lu ies = runST (A.unsafeArrayUArray lu ies $ Pixel4 0 0 0 0)
+ unsafeArray lu ies = runST (A.unsafeArrayUArray lu ies $ Pixel4 0)
{-# INLINE unsafeArray #-}
- unsafeAt (A.UArray _ _ _ arr#) (I# n#) = Pixel4 (W8# (indexWord8Array# arr# n'#))
- (W8# (indexWord8Array# arr# (n'# +# 1#)))
- (W8# (indexWord8Array# arr# (n'# +# 2#)))
- (W8# (indexWord8Array# arr# (n'# +# 3#)))
- where
- n'# = n# *# 4#
+ unsafeAt (A.UArray _ _ _ arr#) (I# n#) = Pixel4 (W32# (indexWord32Array# arr# n#))
{-# INLINE unsafeAt #-}
class (Eq a, forall s. A.MArray (A.STUArray s) a (ST s)) => Pixel a where
```

The `Pixel`

class methods become a tad less pleasant, though:

```
instance Pixel Pixel4 where
toRGBA (Pixel4 rgba) = ( fromIntegral $ rgba .>>. 24
, fromIntegral $ rgba .>>. 16
, fromIntegral $ rgba .>>. 8
, fromIntegral rgba
)
{-# INLINE toRGBA #-}
fromRGBA r g b a = Pixel4 $ fromIntegral r .<<. 24
.|. fromIntegral g .<<. 16
.|. fromIntegral b .<<. 8
.|. fromIntegral a
{-# INLINE fromRGBA #-}
readPixel (BSI.PS x _ _) pos = Pixel4 $ BSI.accursedUnutterablePerformIO $ withForeignPtr x $ \p -> peek (p `plusPtr` pos)
{-# INLINE readPixel #-}
channelCount _ = 4
{-# INLINE channelCount #-}
```

here, we rely on the endianness in `readPixel`

,
and the production library will do the suitable byte swaps,
but it’s acceptable for our purposes right now.

Is it worth it?

Yes!
Encoding now takes **209 ms** (23% improvement!),
while decoding is about **187 ms** (8% improvement) —
and those are our final results for RGBA.

Shall we do the same for `Pixel3`

?
I won’t describe the implementation in all detail,
but, as it turns out, doing a similar trick for `Pixel3`

only degrades performance.

## Conclusion

Wow, what a weekend!

Surprisingly, it takes much more effort and energy to write *about* writing and improving the code
than to actually write the code.
Indeed, I did most of the above in a couple of sittings in a couple of days, and it took me more than a month to sum it up as two blog posts.

To my disliking, these posts contain too much code — I feel like nobody will go through all of that. Thus, a note to self: investigate more efficient ways of describing these experiences.

Anyway, that’s enough for the feels, so let’s go back to the technical side. To sum it up, we implemented the (already outdated) QOI image format purely in Haskell, performing at least as fast (and most of the time noticeably faster) than the baseline C version. Our implementation also overcomes some others in other languages, a couple of which were explicitly written with speed in mind.

Sure, our version has quite some `unsafe`

ty under the hood.
Ideally, those `unsafe`

s, especially `unsafeRead`

s and `unsafeWrite`

,
become unnecessary once Haskell adopts dependent types, or at least liquid types gain more traction.
For now, though, that is the only reliable and straightforward way to achieve decent performance.
By the way, investigating how much of this unsafety can be avoided via Liquid Haskell
is another exciting direction of further work.