Haskell is quite OK for images: encoding QOI

January 29, 2022 // haskell, performance

Last time we’ve looked at writing a decoder for the QOI format. Today, we’ll look at the inverse: encoding QOI images and all that it entails.

Like the last time, this post describes both the final result and the road there. So, there will be lots of code and lots of diffs, beware!

Again, we’ll start with three-channel RGB pixels and generalize to RGBA later.

Encoding RGB pixels

In the QOI format, encoding each pixel requires going through the possible encoding methods until we find one that can handle our pixel in the context of the previous ones.

For instance, we can produce the DIFF8 chunk if the pixel’s alpha channel hasn’t changed compared to the previous pixel, and each other channel’s difference is within the [-2; 2) interval. Otherwise, if alpha again hasn’t changed, the red channel’s delta is within [-16; 16), and blue and green deltas are both within [-8; 8), then we can use DIFF16. Otherwise, if we’ve recently seen this pixel, we can encode it using the INDEX chunk. Otherwise… well, you get the idea.

Note that for RGB images the alpha channel doesn’t change for obvious reasons, so we won’t consider it for now.

How can we express this nicely? If we squint hard enough, we’ll see a Maybe with its Alternative instance!

Indeed, we can represent each way of encoding a pixel as a computation that either succeeds or fails. So, encoding a pixel is just trying each one until one succeeds.

For instance, this is how we can express the DIFF8 part (note again we’re working with Pixel3s for now, so we don’t care about alpha yet):

encodeDiff8 :: Word8 -> Word8 -> Word8 -> A.STUArray s Int Word8 -> Int -> Maybe (ST s Int)
encodeDiff8 dr dg db out outPos
  | (`isBounded` 2) `all` [dr, dg, db] = let byte = 0b10000000 .|. ((dr + 2) .<<. 4)
                                                               .|. ((dg + 2) .<<. 2)
                                                               .|.  (db + 2)
                                          in Just $ A.unsafeWrite out outPos byte $> 1
  | otherwise = Nothing

where the helper function isBounded delta bound checks that -bound <= delta < bound:

isBounded :: Word8 -> Word8 -> Bool
isBounded d h = d + h < 2 * h

So, if all of dr, dg, db are small enough, then encodeDiff8 dr dg db out outPos returns a computation (wrapped in Just) that writes the corresponding byte into out at outPos and returns (via $> 1) how many bytes were written. Otherwise, if at least one delta is too big, this function just returns Nothing.

In fact, the last two arguments and the return types are common for all of the encoding functions, so let’s abstract that away:

type VarEncoder s = A.STUArray s Int Word8 -> Int -> Maybe (ST s Int)

so that our function type looks like

encodeDiff8 :: Word8 -> Word8 -> Word8 -> VarEncoder s

Now, we can write similar functions encodeDiff16, encodeDiff24, encodeIndex, and encodeColor for the other chunk types.

That’s how they look!

encodeDiff16 :: Word8 -> Word8 -> Word8 -> VarEncoder s
encodeDiff16 dr dg db out outPos
  | uncurry isBounded `all` [(dr, 16), (dg, 8), (db, 8)] = let b1 = 0b11000000 .|. (dr + 16)
                                                               b2 = (dg + 8) .<<. 4
                                                                .|. (db + 8)
                                                            in Just $ do A.unsafeWrite out outPos       b1
                                                                         A.unsafeWrite out (outPos + 1) b2
                                                                         pure 2
  | otherwise = Nothing

encodeDiff24 :: Word8 -> Word8 -> Word8 -> Word8 -> VarEncoder s
encodeDiff24 dr dg db da out outPos
  | (`isBounded` 16) `all` [dr, dg, db, da] = let bytes :: Word32
                                                  bytes = (0b11100000 .<<. 16)
                                                      .|. fromIntegral (dr + 16) .<<. 15
                                                      .|. fromIntegral (dg + 16) .<<. 10
                                                      .|. fromIntegral (db + 16) .<<. 5
                                                      .|. fromIntegral (da + 16)
                                               in Just $ do A.unsafeWrite out outPos       (fromIntegral $ bytes .>>. 16)
                                                            A.unsafeWrite out (outPos + 1) (fromIntegral $ bytes .>>. 8)
                                                            A.unsafeWrite out (outPos + 2) (fromIntegral   bytes)
                                                            pure 3
  | otherwise = Nothing

encodeIndex :: Pixel3 -> Word8 -> Pixel3 -> VarEncoder s
encodeIndex px hash runningPx out outPos
  | px == runningPx = Just $ A.unsafeWrite out outPos hash $> 1
  | otherwise = Nothing

encodeColor :: Pixel3 -> Pixel3 -> A.STUArray s Int Word8 -> Int -> ST s Int
encodeColor (Pixel3 r1 g1 b1) (Pixel3 r0 g0 b0) out outPos = do
  A.unsafeWrite out outPos bh
  when (hr == 1) $ A.unsafeWrite out (outPos + 1) r1
  when (hg == 1) $ A.unsafeWrite out (outPos + 1 + hr) g1
  when (hb == 1) $ A.unsafeWrite out (outPos + 1 + hr + hg) b1
  pure (1 + hr + hg + hb)
  where
    hr = fromEnum $ r1 /= r0
    hg = fromEnum $ g1 /= g0
    hb = fromEnum $ b1 /= b0
    bh = 0b11110000 .|. (fromIntegral hr .<<. 3)
                    .|. (fromIntegral hg .<<. 2)
                    .|. (fromIntegral hb .<<. 1)

Here, the encodeColor is not a VarEncoder, since it always succeeds and doesn’t need to be wrapped in a Maybe.

There’s one more chunk missing: the one denoting a run of a given pixel. Unfortunately, it does not fit this Maybe (ST s Int) idea as nicely since it can potentially consume more than one pixel, so let’s just express it separately:

maxRunLen :: Int
maxRunLen = 8224

encodeRun :: Int -> A.STUArray s Int Word8 -> Int -> ST s Int
encodeRun runLen out outPos
  | runLen == 0 = pure outPos
  | runLen <= 32 = A.unsafeWrite out outPos (0b01000000 .|. fromIntegral (runLen - 1)) $> outPos + 1
  | runLen <= maxRunLen = do let runLen' = runLen - 33
                             A.unsafeWrite out outPos       (0b01100000 .|. fromIntegral (runLen' .>>. 8))
                             A.unsafeWrite out (outPos + 1) (fromIntegral runLen')
                             pure $ outPos + 2
  | otherwise = pure outPos

The handling for runs longer than maxRunLen is, to put it mildly, suboptimal, but we’ll fix this later. Let’s get at least something working now!

Given all these functions, the primary encoding loop is straightforward:

For the current pixel, try each encoding method until one succeeds. If none does, we fall back to encodeColor.
Loop over subsequent pixels if they are equal to the current one, and if they are, encode a run.
Update the “recently seen” pixels array.
Rinse and repeat.

Or, in code, that’s how it looks. The worker function takes the current input and output positions, the current pixel, and the next pixel:

  let step inPos outPos px@(Pixel3 r1 g1 b1) prevPx@(Pixel3 r0 g0 b0) = do

It then computes the difference between the pixels and finds the “recently seen” pixel with our pixel’s hash:

        let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)

        let hash = pixelHash px
        runningPx <- A.unsafeRead running hash

It then tries each encoder, leveraging the Applicative instance for Maybe and using fromMaybe with encodeColor as the fallback:

        pxDiff <- fromMaybe (encodeColor px prevPx result outPos)
                           $ encodeDiff8 dr dg db result outPos
                         <|> encodeIndex px (fromIntegral hash) runningPx result outPos
                         <|> encodeDiff16 dr dg db result outPos
                         <|> encodeDiff24 dr dg db 0 result outPos

Note that the encoders are arranged in a carefully selected order so that the ones producing smaller output are tried first. Moreover, this is the place where the laziness comes in really handy. Since fromMaybe and Maybe’s <|> are lazy, none of the encoders will actually be executed unless the previous ones fail.

Having done that, pxDiff says how much we should advance in our output. But, before that, we update our recently seen pixels and check if the current pixel is a start of a run:

        A.unsafeWrite running hash px

        let tryRun pos runLen
              | pos + 3 <= inLen = do
                let thisPixel = readPixel3 inBytes pos
                if thisPixel /= px || runLen >= maxRunLen
                   then (pos, runLen, thisPixel)
                   else tryRun (pos + 3) (runLen + 1)
              | otherwise = (pos + 3, runLen, px)

        let (afterRun, runLen, nextPix) = tryRun (inPos + 3) 0
        outPos' <- encodeRun runLen result (outPos + pxDiff)

Here, readPixel inBytes pos reads the RGB pixel at the given position in the input, which is literally just about reading three Word8s from the input ByteString.

And finally, if we have more pixels to encode, we loop again. Otherwise we stop:

        if afterRun <= inLen
           then step afterRun outPos' nextPix px
           else pure outPos'

The only question is how big should the output be? Unfortunately, we don’t know until we’ve processed the whole image.

The solution is to compute the upper bound! Indeed, the worst case is that each pixel is different enough from the previous one to require the entire COLOR chunk, encoding all the pixel channels. This gives us one byte per channel plus one byte to denote the COLOR chunk per pixel for a total of (channels + 1) × width × height bytes.

Or, in Haskell terms, we’ll just write a function that, given a QOI header describing the image we’re encoding, computes the maximum output length (and also encodes the header while we’re at it):

maxResultSize :: Header -> (Int, BS.ByteString)
maxResultSize h@Header { .. } = (maxLen, headerBS)
  where
    headerBS = BSL.toStrict $ encode h
    maxLen = BS.length headerBS
           + fromIntegral (hChannels + 1) * fromIntegral (hWidth * hHeight)
           + 4 -- end padding

Adding all the required boilerplate for allocating the resulting array, encoding the header, allocating the “recently seen” array, writing the final padding, and what not on gives us this

encoder function.

encodeRaw :: Header -> BS.ByteString -> A.UArray Int Word8
encodeRaw header inBytes = A.runSTUArray $ do
  (result :: A.STUArray s Int Word8) <- A.unsafeNewArray_ (0, maxLen - 1)

  forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)

  running <- A.newArray @(A.STUArray s) (0, 63 :: Int) initPixel

  let step inPos outPos px@(Pixel3 r1 g1 b1) prevPx@(Pixel3 r0 g0 b0) = do
        let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)

        let hash = pixelHash px
        runningPx <- A.unsafeRead running hash
        pxDiff <- fromMaybe (encodeColor px prevPx result outPos)
                           $ encodeDiff8 dr dg db result outPos
                         <|> encodeIndex px (fromIntegral hash) runningPx result outPos
                         <|> encodeDiff16 dr dg db result outPos
                         <|> encodeDiff24 dr dg db 0 result outPos

        A.unsafeWrite running hash px
        let tryRun pos runLen
              | pos + 3 <= inLen = do
                let thisPixel = readPixel3 inBytes pos
                if thisPixel /= px || runLen >= maxRunLen
                   then (pos, runLen, thisPixel)
                   else tryRun (pos + 3) (runLen + 1)
              | otherwise = (pos + 3, runLen, px)

        let (afterRun, runLen, nextPix) = tryRun (inPos + 3) 0
        outPos' <- encodeRun runLen result (outPos + pxDiff)

        if afterRun <= inLen
           then step afterRun outPos' nextPix px
           else pure outPos'
  final <- step 0 headerLen (readPixel3 inBytes 0) initPixel
  forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0

  pure $ unsafeShrink result (final + 4)
  where
    inLen = BS.length inBytes
    (maxLen, headerBS) = maxResultSize header
    headerLen = BS.length headerBS

How well does this stuff perform? If we run it on the test image, it’ll take 259 ms. As a comparison, recall that the baseline C implementation takes around the same ≈260 ms compiled with gcc and ≈340 ms compiled with clang. Not too bad, right?

There’s one minor optimization we can already do, though. Notice that encodeDiff8 is tried first, but it doesn’t use the runningPx. Indeed, let’s delay reading it until encodeDiff8 fails! Unfortunately, the code gets a little bit uglier:

        pxDiff <- case encodeDiff8 dr dg db result outPos of
                       Just act -> act
                       _ -> do runningPx <- A.unsafeRead running hash
                               fromMaybe (encodeColor px prevPx result outPos)
                                        $ encodeIndex px (fromIntegral hash) runningPx result outPos
                                      <|> encodeDiff16 dr dg db result outPos
                                      <|> encodeDiff24 dr dg db 0 result outPos

but it now runs in 238 ms. That’s already faster than our C baseline, noice!

But, before seeing how we could optimize this, let’s first switch to something else.

Tests

Now that we have both an encoder and a decoder, we can write very straightforward tests, verifying what we did makes sense and forming the basis for our future changes, optimizations, and refactorings.

Note that the inverse (encode ∘ decode = id) doesn’t hold in general since each image (except degenerate ones) can be encoded in more than one way.

Indeed, we expect the following to hold: decode ∘ encode = id. Unfortunately, we’re doing this in Haskell and not Idris, so we cannot prove this, but we can go with the next best approximation: just QuickCheck it!

Generating random images

For QuickCheck, we need to come up with a random images generator. The most obvious way is to generate a random sequence of Word8s of the right size, but it doesn’t cover the domain that well.

Indeed, for an RGB image, the probability that the next pixel is similar enough to the previous one to exercise the DIFF8 path is (3 * 3 * 3) / (255 * 255 * 255) ≈ 1.6 × 10^-6. This means that we can expect a megapixel image to contain roughly one or two such pairs of pixels that DIFF8 is exercised.

And that’s just for RGB. For RGBA, which we’ll need to eventually test as well, it’s even worse — in fact, 255 times worse: a gigapixel image can be expected to contain just about 6 such pairs!

That won’t fly. We need a more intelligent way of generating test cases.

So, instead, we first choose whether we want a run of the current pixel, or the next pixel to be “really close” to the current one (to go the DIFF8 route), or a tad more different (to exercise DIFF16), and so on. Then we generate the specific parameters: the run length, or the difference within the DIFF8-expressible bounds, or… Well, you get the idea.

Before expressing that, we’ll need a couple of helper definitions. First, a wrapper around ByteStrings that displays their contents as bytes, which will come in handy for pretty-printing the failing cases, if any:

newtype ShowAsBytes = ShowAsBytes { bytes :: BS.ByteString } deriving (Eq)

instance Show ShowAsBytes where
  show = show . toArray3 . bytes

toArray3 :: BS.ByteString -> A.UArray Int Pixel3
toArray3 bs = A.array (0, pxCnt - 1) [ (i, Pixel3 r g b)
                                     | i <- [0..pxCnt - 1]
                                     , let r = bs `BS.index` (i * 3)
                                           g = bs `BS.index` (i * 3 + 1)
                                           b = bs `BS.index` (i * 3 + 2)
                                     ]
  where
    pxCnt = BS.length bs `div` 3

Then, we define a type for randomly generated raw images:

data Image3 = Image3
  { iWidth :: Int
  , iHeight :: Int
  , iBytes :: ShowAsBytes
  } deriving (Eq, Show)

We’ll also need a couple of helper generators for pixels: one generating a completely arbitrary pixel and another one for a pixel that’s different from the given one by not more than delta:

genPixel3 :: Gen Pixel3
genPixel3 = Pixel3 <$> chooseAny <*> chooseAny <*> chooseAny

genDiff3Bounded :: Word8 -> Pixel3 -> Gen Pixel3
genDiff3Bounded delta (Pixel3 r g b) = do
  diffs <- replicateM 3 $ choose (negate delta, delta - 1)
  case diffs of
       [dr, dg, db] -> pure $ Pixel3 (r + dr) (g + dg) (b + db)
       _ -> error "muh dependent types"

Now, we can finally make Image3s Arbitrary. To generate an image, we first choose its width and height within some bounds (controlled by QuickCheck’s size parameter) and an initial pixel. Then we repeatedly generate the rest of the pixels, each time choosing the specific generator:

instance Arbitrary Image3 where
  arbitrary = do
    maxDim <- getSize
    width <- chooseInt (1, maxDim)
    height <- chooseInt (1, maxDim)

    px0 <- genPixel3

    pixels <- V.iterateNM (width * height) step (px0, 0)
    pure $ Image3 width height $ ShowAsBytes $ BS.pack $ V.toList $ V.concatMap (\(Pixel3 r g b, _) -> [r, g, b]) pixels
    where
      step (prevPixel, 0) = do
        runToss <- chooseInt (1, 100)
        if runToss >= 10
           then do nextPixel <- oneof [ genDiff3Bounded 2 prevPixel
                                      , genDiff3Bounded 8 prevPixel
                                      , genDiff3Bounded 16 prevPixel
                                      , genPixel3
                                      ]
                   pure (nextPixel, 0)
           else (prevPixel,) <$> chooseInt (1, runToss * 1000)
      step (prevPixel, n) = pure (prevPixel, n - 1)

With this instance, writing a property test is more or less straightforward: for an arbitrary image, we encode it and then try to decode it. The decoding should succeed, and the decoded bytes should be equal to the original ones:

main :: IO ()
main = hspec $ modifyMaxSuccess (const 100000) $
  describe "QOI encoder" $
    it "decode . encode = id" $ property $ \Image3 { .. } -> do
      let header = Header { hMagic = matchBytes
                          , hWidth = fromIntegral iWidth
                          , hHeight = fromIntegral iHeight
                          , hChannels = 3
                          , hColorspace = 0
                          }
      let encoded = encodeRaw header (bytes iBytes)
          decoded = decodeQoi $ BS.pack $ A.elems encoded
      decoded `shouldSatisfy` isRight
      case decoded of
           Left _ -> pure ()
           Right (header', Pixels3 pixels') -> do
             header' `shouldBe` header
             pixels' `shouldBe` toArray3 (bytes iBytes)
           Right _ -> fail "Expected Pixels3"

Let’s run this!

QOI encoder
  decode . encode = id FAILED [1]

Failures:

  test/Spec.hs:83:7:
  1) QOI encoder decode . encode = id
       Falsifiable (after 1 test):
         Image3 {iWidth = 0, iHeight = 0, iBytes = array (0,-1) []}
       predicate failed on: Left UnpaddedFile

Ugh, right. It fails on an empty input — something we haven’t thought of!

Luckily, the fix is just about it — we just need to check if the input is not null:

--- a/src/Data/Image/Qoi/Encoder.hs
+++ b/src/Data/Image/Qoi/Encoder.hs
@@ -154,7 +154,9 @@ encodeRaw header inBytes = A.runSTUArray $ do
-  final <- step 0 headerLen (readPixel3 inBytes startPos) initPixel
+  final <- if not $ BS.null inBytes
+              then step startPos headerLen (readPixel3 inBytes startPos) initPixel
+              else pure headerLen

After changing this and rerunning the tests, we get a very satisfying result:

QOI encoder
  decode . encode = id
    +++ OK, passed 100000 tests.

Finished in 92.7709 seconds
1 example, 0 failures

So, we verified the property holds on 100k random images in just a minute and a half! Pretty cool, isn’t it?

Parallel execution

One minor downside is that all these random images are generated and passed through our QOI implementation sequentially in a single thread. My machine has 12 cores, so that’s unfortunate, and while hspec supports parallel execution of tests, QuickCheck doesn’t seem to parallelize that.

What we could do instead is spawn several test groups and execute them in parallel:

imgProperty :: Image3 -> IO ()
imgProperty Image3 { .. } = do
  let header = Header { hMagic = matchBytes
                      , hWidth = fromIntegral iWidth
                      , hHeight = fromIntegral iHeight
                      , hChannels = 3
                      , hColorspace = 0
                      }
  let encoded = encodeRaw header (bytes iBytes) 0
      decoded = decodeQoi $ BS.pack $ A.elems encoded
  decoded `shouldSatisfy` isRight
  case decoded of
       Left _ -> pure ()
       Right (header', Pixels3 pixels') -> do
         header' `shouldBe` header
         pixels' `shouldBe` toArray3 (bytes iBytes)
       Right _ -> fail "Expected Pixels3"

main :: IO ()
main = hspec $ modifyMaxSuccess (const 100000) $
  describe "QOI encoder" $
    parallel $ forM_ ([1..10] :: [Int]) $ \wrk ->
      it ("decode . encode = id (worker " <> show wrk <> ")") $ property imgProperty

Running this allows processing 1 million random images in about 3 minutes:

QOI encoder
  decode . encode = id (worker 1)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 2)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 3)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 4)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 5)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 6)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 7)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 8)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 9)
    +++ OK, passed 100000 tests.
  decode . encode = id (worker 10)
    +++ OK, passed 100000 tests.

Finished in 176.7814 seconds
10 examples, 0 failures

Of course, this is not ideal. Firstly, due to how QuickCheck generates examples, there will be some overlap between different workers that could otherwise be avoided. and also the overall wall time has also increased, but it’s still better than running 1 million images in 15 minutes.

Shrinking

Now, arguendo, suppose QuickCheck has found a much bigger case manifesting a bug in our implementation.

To be more concrete, suppose we’ve accidentally made a typo, and we’re subtracting 32 instead of 33 in the encodeRun function:

@@ -105,7 +105,7 @@ encodeRun :: Int -> A.STUArray s Int Word8 -> Int -> ST s Int
 encodeRun runLen out outPos
   | runLen == 0 = pure outPos
   | runLen <= 32 = A.unsafeWrite out outPos (0b01000000 .|. fromIntegral (runLen - 1)) $> outPos + 1
-  | otherwise = do let runLen' = runLen - 33
+  | otherwise = do let runLen' = runLen - 32
                    A.unsafeWrite out outPos       (0b01100000 .|. fromIntegral (runLen' .>>. 8))
                    A.unsafeWrite out (outPos + 1) (fromIntegral runLen')
                    pure $ outPos + 2

On my system, QuickCheck does find the issue after just 41 tests:

  1) QOI encoder decode . encode = id
       Falsifiable (after 41 tests):
         Image3 {iWidth = 39, iHeight = 40, iBytes = array (0,1559)

         ...

Randomized with seed 1709694074

The problem here is that the counterexample is really huge — 1560 pixels, and so are the expected and actual outputs. Good luck finding the problem or even grasping the example!

The solution is to enable shrinking in our Arbitrary Image3 instance. Shrinking allows QuickCheck to reduce a particular counterexample. That is, a function shrink (which is our responsibility to write) is given a failing counterexample by QuickCheck, and it should return some smaller examples that might exhibit the same bug.

How do we shrink images? One way is as follows. For a w×h image, if h > 1, we can reduce it to h non-empty sub-images, where the ith sub-image is our original one with the ith row crossed out. We can also reduce the width by crossing out each column, but since the images are laid out in the row-major order, removing columns is computationally expensive, so let’s only do this if, say, there are less than five rows. Or, in code:

  shrink Image3 { .. } = [ Image3
                           { iWidth = iWidth
                           , iHeight = iHeight - 1
                           , iBytes = ShowAsBytes $ dropIthRow i $ bytes iBytes
                           }
                         | i <- [ 0 .. iHeight - 1 ]
                         ]
                         <>
                         [ Image3
                           { iWidth = iWidth - 1
                           , iHeight = iHeight
                           , iBytes = ShowAsBytes $ dropJthCol j $ bytes iBytes
                           }
                         | iHeight < 5
                         , j <- [0 .. iWidth - 1]
                         ]
    where
      dropIthRow i str = BS.take (iWidth * 3 * i) str
                      <> BS.drop (iWidth * 3 * (i + 1)) str
      dropJthCol j str
        | BS.null str = str
        | otherwise = let (row, rest) = BS.splitAt (iWidth * 3) str
                          (left, right) = BS.splitAt (j * 3) row
                       in left <> BS.drop 3 right <> dropJthCol j rest

If we now rerun QuickCheck with the same seed, we’ll get a much more interpretable example of just 68 pixels. QuickCheck even helpfully highlights the differences for us:


  1) QOI encoder decode . encode = id
       Falsifiable (after 41 tests and 43 shrinks):
         Image3 {iWidth = 34, iHeight = 2, iBytes = array (0,67) [...]}
       expected: array (0,67) [...,(32,Pixel3 238 173 40),(33,Pixel3 238 173 40),(34,Pixel3 197 136 117),(35,Pixel3 197 136 117),...]
        but got: array (0,67) [...,(32,Pixel3 238 173 40),(33,Pixel3 238 173 40),(34,Pixel3 238 173 40),(35,Pixel3 197 136 117),...]

Seems like there’s one (238, 173, 40) pixel too many in the output!

Unfortunately, the test doesn’t tell us what exactly is broken: it could be an off-by-one error either in the encoder (encoding one extra pixel in a run) or in the decoder (similarly, decoding one extra pixel in a run), and it’s up to us to verify which one is it, really. But the test nevertheless gives a pretty decent pointer at where we should look!

Having written the tests, we change either the decoder or the encoder as we see fit while having at least some assurance that our changes don’t break things.

Looping wisely

Right now, detecting a pixel run happens separately within step, in the tryRun recursive function. This most likely compiles to a tight nested loop, but is this really the most CPU-friendly approach?

This is where there is no good intuition or a priori knowledge, and we shall just try a different approach. Namely, we’ll make step itself check if the current pixel is equal to the previous one. While we’re at it, we’ll also ensure that runs longer than maxRunLen are handled correctly.

All in all, the resulting step function now looks like this:


encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
encodeRaw header inBytes startPos = A.runSTUArray $ do
  (result :: A.STUArray s Int Word8) <- A.unsafeNewArray_ (0, maxLen - 1)

  forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)

  running <- A.newArray @(A.STUArray s) (0, 63 :: Int) initPixel

  let step inPos runLen prevPx@(Pixel3 r0 g0 b0) outPos
        | inPos + 3 <= inLen
        , readPixel3 inBytes inPos == prevPx =
          if runLen /= maxRunLen - 1
             then step (inPos + 3) (runLen + 1) prevPx outPos
             else encodeRun maxRunLen result outPos >>= step (inPos + 3) 0 prevPx
        | inPos + 3 <= inLen = do
          let px@(Pixel3 r1 g1 b1) = readPixel3 inBytes inPos
          let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)
          outPos' <- encodeRun runLen result outPos

          let hash = pixelHash px
          pxDiff <- case encodeDiff8 dr dg db result outPos' of
                         Just act -> act
                         _ -> do runningPx <- A.unsafeRead running hash
                                 fromMaybe (encodeColor px prevPx result outPos')
                                          $ encodeIndex px (fromIntegral hash) runningPx result outPos'
                                        <|> encodeDiff16 dr dg db result outPos'
                                        <|> encodeDiff24 dr dg db 0 result outPos'

          A.unsafeWrite running hash px

          step (inPos + 3) 0 px (outPos' + pxDiff)
        | runLen /= 0 = encodeRun runLen result outPos
        | otherwise = pure outPos

  final <- step 0 0 initPixel headerLen
  forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0

  pure $ unsafeShrink result (final + 4)
  where
    inLen = BS.length inBytes
    (maxLen, headerBS) = maxResultSize header
    headerLen = BS.length headerBS

Notice that we don’t need any additional checks for non-empty input in this version. We’re also correctly handling runs of more than 8224 pixels by dumping intermediate chunks. So, for instance, for a run of length 20000, we’ll output:

a run of 8224 pixels,
a DIFF8 chunk with all-zero diffs,
another run of 8224 pixels,
another zero DIFF8,
the remaining run.

To be fair, these extra DIFF8s result in somewhat suboptimal compression since we could have written out one run right after the other. But even in the worst case of an image being a single colossal run, that’s just 1/8224th, or 0.012% worse compression — something I, for one, can surely live with!

The effect? 231 ms, or about 2.5% improvement over the version with the nested loop. Not too much, but let’s keep it this way, especially given that the tests still pass and the code is IMO slightly more straightforward.

While we’re at it, let’s also move the actual encoding loop out to a separate function

like this.

encodeIntoArray :: forall s. Int -> BS.ByteString -> Int -> A.STUArray s Int Word8 -> ST s Int
encodeIntoArray headerLen inBytes startPos result = do
  running <- A.newArray @(A.STUArray s) (0, 63 :: Int) initPixel

  let step inPos runLen prevPx@(Pixel3 r0 g0 b0) outPos
        | inPos + 3 <= inLen
        , readPixel3 inBytes inPos == prevPx =
          if runLen /= maxRunLen - 1
             then step (inPos + 3) (runLen + 1) prevPx outPos
             else encodeRun maxRunLen result outPos >>= step (inPos + 3) 0 prevPx
        | inPos + 3 <= inLen = do
          let px@(Pixel3 r1 g1 b1) = readPixel3 inBytes inPos
          let (dr, dg, db) = (r1 - r0, g1 - g0, b1 - b0)
          outPos' <- encodeRun runLen result outPos

          let hash = pixelHash px
          pxDiff <- case encodeDiff8 dr dg db result outPos' of
                         Just act -> act
                         _ -> do runningPx <- A.unsafeRead running hash
                                 fromMaybe (encodeColor px prevPx result outPos')
                                          $ encodeIndex px (fromIntegral hash) runningPx result outPos'
                                        <|> encodeDiff16 dr dg db result outPos'
                                        <|> encodeDiff24 dr dg db 0 result outPos'

          A.unsafeWrite running hash px

          step (inPos + 3) 0 px (outPos' + pxDiff)
        | runLen /= 0 = encodeRun runLen result outPos
        | otherwise = pure outPos

  step startPos 0 initPixel headerLen
  where
    inLen = BS.length inBytes
{-# INLINE encodeIntoArray #-}

encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
encodeRaw header inBytes startPos = A.runSTUArray $ do
  result <- A.unsafeNewArray_ (0, maxLen - 1)
  forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
  final <- encodeIntoArray headerLen inBytes startPos result
  forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
  pure $ unsafeShrink result (final + 4)
  where
    (maxLen, headerBS) = maxResultSize header
    headerLen = BS.length headerBS

I think it doesn’t change much, but it’s just a tad nicer code.

Generalizing to RGBA pixels

Ah, that sweet part! We’re finally in a good position to do some finishing touches on our encoder.

Again, we’ll rely heavily on the Pixel class we’ve written earlier. There’s one thing missing, though: this class doesn’t support reading Pixel from a ByteString.

Reading a pixel might be expressed as

readPixel :: Pixel pixel
          => BS.ByteString
          -> Int {- current offset -}
          -> (Pixel, Int {- new offset -})

But this type is too powerful for our purposes: it implies the new offset must be determined dynamically from the current offset. Luckily, this is not the case: we know the new offset equals the current one plus 3 or 4 for RGB and RGBA pixels, respectively. Thus, we end up adding two different methods:

class Pixel pixel where
  ...

  readPixel :: BS.ByteString -> Int -> a
  channelCount :: proxy a -> Int

instance Pixel Pixel3 where
  ...

  readPixel str pos = Pixel3 (str ! pos) (str ! pos + 1) (str ! pos + 2)
  {-# INLINE readPixel #-}

  channelCount _ = 3
  {-# INLINE channelCount #-}

instance Pixel Pixel4 where
  ...

  readPixel str pos = Pixel4 (str ! pos) (str ! pos + 1) (str ! pos + 2) (str ! pos + 3)
  {-# INLINE readPixel #-}
  channelCount _ = 4
  {-# INLINE channelCount #-}

Next, we need to modify our encodeDiff8 and encodeDiff16 functions since we must now check that the alpha channel difference is indeed 0. This is just a matter of a single extra argument that we pattern match on:

-encodeDiff8 :: Word8 -> Word8 -> Word8 -> VarEncoder s
-encodeDiff8 dr dg db out outPos
+encodeDiff8 :: Word8 -> Word8 -> Word8 -> Word8 -> VarEncoder s
+encodeDiff8 dr dg db 0 out outPos
   | (`isBounded` 2) `all` [dr, dg, db] = let byte = 0b10000000 .|. ((dr + 2) .<<. 4)
                                                                .|. ((dg + 2) .<<. 2)
                                                                .|.  (db + 2)
                                           in Just $ A.unsafeWrite out outPos byte $> 1
-  | otherwise = Nothing
+encodeDiff8 _ _ _ _ _ _  = Nothing
 {-# INLINE encodeDiff8 #-}
 
-encodeDiff16 :: Word8 -> Word8 -> Word8 -> VarEncoder s
-encodeDiff16 dr dg db out outPos
+encodeDiff16 :: Word8 -> Word8 -> Word8 -> Word8 -> VarEncoder s
+encodeDiff16 dr dg db 0 out outPos
   | uncurry isBounded `all` [(dr, 16), (dg, 8), (db, 8)] = let b1 = 0b11000000 .|. (dr + 16)
                                                                b2 = (dg + 8) .<<. 4
                                                                 .|. (db + 8)
                                                             in Just $ do A.unsafeWrite out outPos       b1
                                                                          A.unsafeWrite out (outPos + 1) b2
                                                                          pure 2
-  | otherwise = Nothing
+encodeDiff16 _ _ _ _ _ _  = Nothing
 {-# INLINE encodeDiff16 #-}

encodeIndex’s definition is not updated; it’s only the type that changes:

-encodeIndex :: Pixel3 -> Word8 -> Pixel3 -> VarEncoder s
+encodeIndex :: Eq pixel => pixel -> Word8 -> pixel -> VarEncoder s

In encodeColor, we now need to account for the alpha channel too:

-encodeColor :: Pixel3 -> Pixel3 -> A.STUArray s Int Word8 -> Int -> ST s Int
-encodeColor (Pixel3 r1 g1 b1) (Pixel3 r0 g0 b0) out outPos = Just $ do
+encodeColor :: Pixel pixel => pixel -> pixel -> A.STUArray s Int Word8 -> Int -> ST s Int
+encodeColor px1 px0 out outPos = do
   A.unsafeWrite out outPos bh
   when (hr == 1) $ A.unsafeWrite out (outPos + 1) r1
   when (hg == 1) $ A.unsafeWrite out (outPos + 1 + hr) g1
   when (hb == 1) $ A.unsafeWrite out (outPos + 1 + hr + hg) b1
-  pure (1 + hr + hg + hb)
+  when (ha == 1) $ A.unsafeWrite out (outPos + 1 + hr + hg + hb) a1
+  pure (1 + hr + hg + hb + ha)
   where
+    (r1, g1, b1, a1) = toRGBA px1
+    (r0, g0, b0, a0) = toRGBA px0
     hr = fromEnum $ r1 /= r0
     hg = fromEnum $ g1 /= g0
     hb = fromEnum $ b1 /= b0
+    ha = fromEnum $ a1 /= a0
     bh = 0b11110000 .|. (fromIntegral hr .<<. 3)
                     .|. (fromIntegral hg .<<. 2)
                     .|. (fromIntegral hb .<<. 1)
+                    .|.  fromIntegral ha
 {-# INLINE encodeColor #-}

The rest of the changes primarily amount to using the Pixel class methods. We also change encodeIntoArray to accept a dummy proxy pixel argument denoting what pixels are expected, and we also check the channels count in encodeRaw, calling encodeIntoArray @Pixel3 or encodeIntoArray @Pixel4, respectively. Again, we rely on the compiler substituting the corresponding instance methods into each of these calls, specializing them and inlining any indirections away.

The final form of encodeRaw and encodeIntoArray, for the curious.

encodeIntoArray :: forall pixel s. Pixel pixel
                => Proxy pixel
                -> Int
                -> BS.ByteString
                -> Int
                -> A.STUArray s Int Word8
                -> ST s Int
encodeIntoArray proxy headerLen inBytes startPos result = do
  running <- A.newArray @(A.STUArray s) (0, 63 :: Int) (fromRGBA @pixel 0 0 0 255)

  let step inPos runLen prevPx outPos
        | inPos + diff <= inLen
        , readPixel inBytes inPos == prevPx =
          if runLen /= maxRunLen - 1
             then step (inPos + diff) (runLen + 1) prevPx outPos
             else encodeRun maxRunLen result outPos >>= step (inPos + diff) 0 prevPx
        | inPos + diff <= inLen = do
          let (r0, g0, b0, a0) = toRGBA prevPx
          let px = readPixel inBytes inPos
          let (r1, g1, b1, a1) = toRGBA px
          let (dr, dg, db, da) = (r1 - r0, g1 - g0, b1 - b0, a1 - a0)
          outPos' <- encodeRun runLen result outPos

          let hash = pixelHash px
          pxDiff <- case encodeDiff8 dr dg db da result outPos' of
                         Just act -> act
                         _ -> do runningPx <- A.unsafeRead running hash
                                 fromMaybe (encodeColor px prevPx result outPos')
                                          $ encodeIndex px (fromIntegral hash) runningPx result outPos'
                                        <|> encodeDiff16 dr dg db da result outPos'
                                        <|> encodeDiff24 dr dg db da result outPos'

          A.unsafeWrite running hash px
          step (inPos + diff) 0 px (outPos' + pxDiff)
        | runLen /= 0 = encodeRun runLen result outPos
        | otherwise = pure outPos
  step startPos 0 (fromRGBA 0 0 0 255) headerLen
  where
    inLen = BS.length inBytes
    diff = channelCount proxy
{-# INLINE encodeIntoArray #-}

encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
encodeRaw header inBytes startPos = A.runSTUArray $ do
  result <- A.unsafeNewArray_ (0, maxLen - 1)
  forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
  final <- if hChannels header == 3
              then encodeIntoArray @Pixel3 Proxy headerLen inBytes startPos result
              else encodeIntoArray @Pixel4 Proxy headerLen inBytes startPos result
  forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
  pure $ unsafeShrink result (final + 4)
  where
    (maxLen, headerBS) = maxResultSize header
    headerLen = BS.length headerBS

It isn’t too different from what we had previously, is it?

Similarly, we can generalize the tests to generate and verify the right property for RGBA images, which is left as an exercise for the reader.

Anyway, does this generalization affect performance in any way? Surprisingly, it even improves the numbers by a couple percent: the code now runs in 233 ms.

But, speaking of surprises, there’s another oddity. Right now, we don’t handle BS.ByteStrings that are slices of some bigger strings (see the discussion in the first part). One way to handle them is to just apply the unoffsetBS function from the first part to the input string:

@@ -164,7 +164,7 @@ encodeIntoArray proxy headerLen inBytes startPos result = do
 {-# INLINE encodeIntoArray #-}
 
 encodeRaw :: Header -> BS.ByteString -> Int -> A.UArray Int Word8
-encodeRaw header inBytes startPos = A.runSTUArray $ do
+encodeRaw header inBytes' startPos = A.runSTUArray $ do
   result <- A.unsafeNewArray_ (0, maxLen - 1)
   forM_ [0 .. headerLen - 1] $ \i -> A.unsafeWrite result i (headerBS ! i)
   final <- if hChannels header == 3
@@ -173,5 +173,6 @@ encodeRaw header inBytes startPos = A.runSTUArray $ do
   forM_ [0..3] $ \i -> A.unsafeWrite result (final + i) 0
   pure $ unsafeShrink result (final + 4)
   where
+    inBytes = unoffsetBS inBytes'
     (maxLen, headerBS) = maxResultSize header
     headerLen = BS.length headerBS

Curiously, this change makes the code run yet faster by about 6%, reducing the run time further down to 219 ms. I don’t have a good intuition why this happens, except perhaps the compiler is now able to see that the offset string field is constantly 0 and apply some extra optimizations based on that.

Anyway, 219 ms is a pretty decent result.

Optimizing RGBA pixels

All the above results, both for encoding and decoding (from the previous part), are on a 3-channel, RGB image. But how does our implementation perform with RGBA?

Converting the same test image to RGBA by setting the alpha channel to constant 255 and running the encoder yields 272 ms. Decoding the corresponding QOI image, on the other hand, takes about 204 ms. The encoder is somewhat on par with the C implementation (whose best times are 260 ms when compiled with gcc). The decoder is also on par with the C version compiled with gcc but loses about 15% if the C decoder is compiled with clang. So it’s time to ask our favourite question: can we do better?

Turns out we can.

One place to look at is the representation of the RGBA pixels. Right now, an RGBA pixel is represented as an unpacked 4-tuple of Word8s:

data Pixel4 = Pixel4 Word8 Word8 Word8 Word8 deriving (Show, Eq)

Reading (and writing) these pixels amounts to four Word8 loads (and stores), which might be expensive. On the bright side, the implementation of toRGBA/fromRGBA is really cheap and boils down to some register shuffling.

But what if we instead represent an RGBA pixel as a Word32? Reading or writing would be a single load or store, at the expense of some bit shifts in toRGBA/fromRGBA. Would it be worth it? Let’s find out!

So, we change the definition to

newtype Pixel4 = Pixel4 Word32 deriving (Show, Eq)

and the corresponding array instance methods become straightforward:

@@ -73,26 +75,18 @@ instance A.MArray (A.STUArray s) Pixel4 (ST s) where
   getNumElements (A.STUArray _ _ n _) = pure n
   {-# INLINE getNumElements #-}
 
-  newArray_ arrBounds = A.newArray arrBounds (Pixel4 0 0 0 0)
+  newArray_ arrBounds = A.newArray arrBounds (Pixel4 0)
   {-# INLINE newArray_ #-}
   unsafeNewArray_ (l, u) = A.unsafeNewArraySTUArray_ (l, u) (*# 4#)
   {-# INLINE unsafeNewArray_ #-}
 
   unsafeRead (A.STUArray _ _ _ marr#) (I# n#) = ST $ \s1# ->
-    let n'# = n# *# 4#
-        !(# s2#, r# #) = readWord8Array# marr# n'#         s1#
-        !(# s3#, g# #) = readWord8Array# marr# (n'# +# 1#) s2#
-        !(# s4#, b# #) = readWord8Array# marr# (n'# +# 2#) s3#
-        !(# s5#, a# #) = readWord8Array# marr# (n'# +# 3#) s4#
-     in (# s5#, Pixel4 (W8# r#) (W8# g#) (W8# b#) (W8# a#) #)
+    let !(# s2#, rgba# #) = readWord32Array# marr# n# s1#
+     in (# s2#, Pixel4 (W32# rgba#) #)
   {-# INLINE unsafeRead #-}
-  unsafeWrite (A.STUArray _ _ _ marr#) (I# n#) (Pixel4 (W8# r#) (W8# g#) (W8# b#) (W8# a#)) = ST $ \s1# ->
-    let n'# = n# *# 4#
-        s2# = writeWord8Array# marr# n'#         r# s1#
-        s3# = writeWord8Array# marr# (n'# +# 1#) g# s2#
-        s4# = writeWord8Array# marr# (n'# +# 2#) b# s3#
-        s5# = writeWord8Array# marr# (n'# +# 3#) a# s4#
-     in (# s5#, () #)
+  unsafeWrite (A.STUArray _ _ _ marr#) (I# n#) (Pixel4 (W32# rgba#)) = ST $ \s1# ->
+    let s2# = writeWord32Array# marr# n# rgba# s1#
+     in (# s2#, () #)
   {-# INLINE unsafeWrite #-}
 
 instance A.IArray A.UArray Pixel4 where
@@ -100,14 +94,9 @@ instance A.IArray A.UArray Pixel4 where
   {-# INLINE bounds #-}
   numElements (A.UArray  _ _ n _) = n
   {-# INLINE numElements #-}
-  unsafeArray lu ies = runST (A.unsafeArrayUArray lu ies $ Pixel4 0 0 0 0)
+  unsafeArray lu ies = runST (A.unsafeArrayUArray lu ies $ Pixel4 0)
   {-# INLINE unsafeArray #-}
-  unsafeAt (A.UArray _ _ _ arr#) (I# n#) = Pixel4 (W8# (indexWord8Array# arr# n'#))
-                                                  (W8# (indexWord8Array# arr# (n'# +# 1#)))
-                                                  (W8# (indexWord8Array# arr# (n'# +# 2#)))
-                                                  (W8# (indexWord8Array# arr# (n'# +# 3#)))
-    where
-      n'# = n# *# 4#
+  unsafeAt (A.UArray _ _ _ arr#) (I# n#) = Pixel4 (W32# (indexWord32Array# arr# n#))
   {-# INLINE unsafeAt #-}
 
 class (Eq a, forall s. A.MArray (A.STUArray s) a (ST s)) => Pixel a where

The Pixel class methods become a tad less pleasant, though:

instance Pixel Pixel4 where
  toRGBA (Pixel4 rgba) = ( fromIntegral $ rgba .>>. 24
                         , fromIntegral $ rgba .>>. 16
                         , fromIntegral $ rgba .>>. 8
                         , fromIntegral   rgba
                         )
  {-# INLINE toRGBA #-}
  fromRGBA r g b a = Pixel4 $ fromIntegral r .<<. 24
                          .|. fromIntegral g .<<. 16
                          .|. fromIntegral b .<<. 8
                          .|. fromIntegral a
  {-# INLINE fromRGBA #-}

  readPixel (BSI.PS x _ _) pos = Pixel4 $ BSI.accursedUnutterablePerformIO $ withForeignPtr x $ \p -> peek (p `plusPtr` pos)
  {-# INLINE readPixel #-}
  channelCount _ = 4
  {-# INLINE channelCount #-}

here, we rely on the endianness in readPixel, and the production library will do the suitable byte swaps, but it’s acceptable for our purposes right now.

Is it worth it?

Yes! Encoding now takes 209 ms (23% improvement!), while decoding is about 187 ms (8% improvement) — and those are our final results for RGBA.

Shall we do the same for Pixel3? I won’t describe the implementation in all detail, but, as it turns out, doing a similar trick for Pixel3 only degrades performance.

Conclusion

Wow, what a weekend!

Surprisingly, it takes much more effort and energy to write about writing and improving the code than to actually write the code. Indeed, I did most of the above in a couple of sittings in a couple of days, and it took me more than a month to sum it up as two blog posts.

To my disliking, these posts contain too much code — I feel like nobody will go through all of that. Thus, a note to self: investigate more efficient ways of describing these experiences.

Anyway, that’s enough for the feels, so let’s go back to the technical side. To sum it up, we implemented the (already outdated) QOI image format purely in Haskell, performing at least as fast (and most of the time noticeably faster) than the baseline C version. Our implementation also overcomes some others in other languages, a couple of which were explicitly written with speed in mind.

Sure, our version has quite some unsafety under the hood. Ideally, those unsafes, especially unsafeReads and unsafeWrite, become unnecessary once Haskell adopts dependent types, or at least liquid types gain more traction. For now, though, that is the only reliable and straightforward way to achieve decent performance. By the way, investigating how much of this unsafety can be avoided via Liquid Haskell is another exciting direction of further work.