Haskell Database Implementation - Part 1, Growing a Tree

Having started my career as a software developer in 2013 with an undergraduate degree in Mathematics, I never got the chance to take some of the higher-level courses on subjects like compilers, programming languages, or databases. In 2020 I set out to correct some of the gaps in my knowledge, and one of the ways I did so was by writing my own database, in Haskell, from scratch. This series of posts chronicles the decisions I made along the way.

If you’d like to skip ahead and read the code, it’s here.

What I was interested in

In starting this project, I was most interested in (a) how the data was stored, (b) transactionality, and (c) how indexes work.

I found out pretty quickly that data is stored in a B-tree (see below) in most circumstances. Since there are many how-tos out there already I decided I was more interested in index creation and management. Nevertheless, in order to index my data I’d still need a tree representation, and I hadn’t written one in a while, so I made a halfhearted attempt.

Efficient searches use trees

In any most implementations you’d probably want to use a B Tree as the underlying data structure. The TL;DR is that B-trees are good at managing access to files while keeping only small chunks of those files in memory at any time.

I ended up settling on using a Red Black Tree. It has many of the same properties as the B Tree (self balancing, efficient), and is simpler since it doesn’t need access to a file system. Since this was mostly a project for my own learning, I figured I could just write the tree to disk whenever I wished.

Tree definition

I won’t go into the implementation specifics, since the algorithm is copied from the Wikipedia article, but I’ll quickly go over the data declaration so that later discussion makes sense.

data Color = Red | Black
  deriving (Eq, Ord, Generic)

data TreeMap a b = Node !a !b Color !(TreeMap a b) !(TreeMap a b) | Nil
  deriving (Eq, Ord, Generic)

type Tree a = TreeMap a ()

A TreeMap a b node either contains a key a, a value b, a color, and left/right branches, or is a leaf node (Nil).

For testing purposes, define the Tree type to be unitary values, since we really care about finding and inserting keys efficiently, and we don’t really care about the data (other than that it’s there).

Property testing

Definitions

Any recursive code always needs a good set of property tests. Luckily for us, Red Black Trees already have a list of laws (source: Wikipedia):

In addition to the requirements imposed on a binary search tree the following must be satisfied by a red–black tree:

Each node is either red or black.

The root is black. This rule is sometimes omitted. Since the root can always be changed from red to black, but not necessarily vice versa, this rule has little effect on analysis.

All leaves (NIL) are black.

If a node is red, then both its children are black.

Every path from a given node to any of its descendant NIL nodes goes through the same number of black nodes.

Laws

Right out the gate, the first three properties listed above are either correct by construction or trivial to test. We defined a Color type, with only two constructors, so nodes of the tree can only inhabit one of those two colors. Testing whether the root node is black isn’t very interesting, since it’s easy to set it to black after any operation. Finally, by omitting a color in the definition of a leaf node we can ensure that its color is black.

Binary Search Law

Our first property test is to ensure that any tree we generate adheres to the binary search property.

checkBinary :: (HasCallStack, Ord a) => Tree a -> Bool
checkBinary = \ case
  Tree.Nil -> True
  Tree.Node x _ _ tl tr -> all (\ l -> l < x) (Tree.setToList tl) && all (\ r -> r > x) (Tree.setToList tr)

Here, we check that every element on the left side of the root is less than the root, and vice-versa for the right side. Since we will be running this test for hundreds of test cases, we don’t care that we’re only checking the root node, as we’ll be generating many different depths.

Children of Red Nodes

Our second property test is to ensure that children of a red node are black.

checkColor :: HasCallStack => Tree a -> Bool
checkColor = \ case
  Tree.Nil -> True
  Tree.Node _ _ Tree.Red (Tree.Node _ _ Tree.Red _ _) _ -> False
  Tree.Node _ _ Tree.Red _ (Tree.Node _ _ Tree.Red _ _) -> False
  Tree.Node _ _ _ tl tr -> checkColor tl && checkColor tr

Pretty straighforward; if we unpack a red node, and it has a red child, fail the test case.

Path Counting

Finally, we want to test that the distance from the root to all leaves passes through the same number of black nodes.

checkPathCount :: HasCallStack => Tree a -> Bool
checkPathCount t =
  let go = \ case
        Tree.Nil -> (0 :: Int, True)
        Tree.Node _ _ c tl tr ->
          let (il, bl) = go tl
              (ir, br) = go tr
              j = if c == Tree.Black then 1 + il else il
          in (j, bl && br && il == ir)
  in snd $ go t

Pass a counter through a traversal of the tree. For each subtree, verify that its subtrees satisfy the predicate, and increment the counter if the root node of that subtree is black. Warning, you may have to squint.

Writing tests

We want to ensure that any mutation we apply to a tree results in a tree that satisfies the laws we set out above. For simplicity, I have outlined only tree creation and element insertion here, but we would want to property test each operation.

genTree :: Ord a => Gen a -> Gen (Tree a)
genTree gen = Tree.setFromList <$> listOf gen

genIntTree :: Gen (Tree Int)
genIntTree = genTree arbitrary

runTests :: (HasCallStack, Ord a, Show b) => String -> Gen b -> (b -> Tree a) -> Spec
runTests d gen f = do
  prop (d <> ": binary") $ forAll gen $ checkBinary . f
  prop (d <> ": color") $ forAll gen $ checkColor . f
  prop (d <> ": path count") $ forAll gen $ checkPathCount . f

spec :: Spec
spec = describe "Tree" $ do
  runTests "generation" genIntTree id
  runTests "insertion" ((,) <$> arbitrary <*> genIntTree) $ uncurry Tree.insertSet

And that’s it!

To be continued

Thanks for reading! Part 2, on the DSL and transactions, is here.

As always, I’d love to hear anything I’ve mixed up, especially in this case from database experts. Find me on fpchat (@dfithian) or reddit (/u/dfith).