Deferred Binary Serialisation

by Neil Mitchell

This library provides a framework for doing binary serialisation, with support for deferred loading. If you don't intend to use the deferred loading side, I recommend the new binary module from Hac 07, as it is pure and likely to be faster for you. If you do need deferred loading, then this is the only module for you!

Deferred loading is for when a large data structure exists, but typically only a small fraction of this data structure will be required. By using deferred loading, some of the data structure can be read quickly, and the rest can be read on demand.

This document proceeds as follows:

  1. Deferred loading explained, the basics
  2. A real-life example (from Hoogle)
  3. Writing new custom instances

The source code to this library is at:

darcs get --partial http://community.haskell.org/~ndm/darcs/binarydefer

Acknowledgements

The Hac 07 organisers and attendees - for both helpful ideas and implementation tweaks, Tom Shackell for discussion on the interface and mechanisms.

Deferred Loading

A normal serialisation library will dump the entire structure to disk, and then read it back. Normally, this is exactly what is wanted. However, in some cases a large part of the data structure may go unused. Lets take a very basic example of a search engine, which indexes web pages. The data structure for this is:

data DataBase = DataBase [Page]

data Page = Page {url :: String, title :: String, text :: String}

For the purposes of this example, lets assume that the search only works on the title. The search engine searches through all the titles, and for all the ones that match it writes out a page listing all the urls. If a page is clicked, then it is displayed from the cache. Some really basic code for this would be:

search :: DataBase -> String -> [Page]
search (DataBase pages) text = filter (isInfixOf text . title) pages

One thing to notice is that the search function only looks at the title. If we use a normal binary library the program could easily spend the majority of the time loading the actual text from disk, when typically it goes unused. Deferred loading instead puts a pointer in the file for the deferred fields (of which page would be one), and writes the actual contents out afterwards. This gives good locality of the non-deferred data, and a very easy interface. When using deferred loading, the code does not need to change based on which fields are deferred etc, so the abstraction remains.

Task 1 is to write an instance for BinaryDefer for Page and DataBase, to start with we'll make all fields strictly loaded:

instance BinaryDefer Page where
	bothDefer = defer [\ ~(Page a b c) -> unit Page << a << b << c]

instance BinaryDefer DataBase where
	bothDefer = defer [\ ~(DataBase a) -> unit DataBase << a]

The way to write these instances should be fairly mechanical. If you have access to GHCi this can be automated. First add deriving (Data, Typeable) to each data type, then import Data.Binary.Defer.Derive, then run defer (undefined :: YourType). This will print out an instance.

To use deferred loading we need to manually decide which fields should be strict, and which should be deferred. In the above application text should definately be deferred - it's likely to be very big, and it's unlikely to be used. The URL is probably also a good candidate for deferring - the size is reasonable and assuming there are few matches it will usually go wasted. In contrast, every single Page requires it's title straight away, so keeping this non-deferred seems a sensible choice.

instance BinaryDefer Page where
	bothDefer = defer [\ ~(Page a b c) -> unit Page <<~ a << b <<~ c]

Note the addition of a ~ character on the << to the left of each field we wish to make lazy. That is all there is too it.

Now we've got our data structure the only thing remaining is to actually load and save it. This can be done very easily with Haskell handles:

save :: DataBase -> IO ()
save value = do
	hndl <- openBinaryFile "temp.txt" WriteMode
	put hndl val
	hClose hndl

load :: IO DataBase
load = do
	hndl <- openBinaryFile "temp.txt" ReadMode
	get hndl

There are only two things of note in this example. The first is that files must be openned with openBinaryFile, as otherwise on Windows certain characters get corrupted due to the default of text mode. The other thing is that in the load method, hClose is not called. Because the deferred data is read lazily on demand, if the file handle was closed this would fail.

One important exception

If one of your constructors is of zero-arity, the pattern must change slightly:

instance BinaryDefer Bool where
    bothDefer = defer [\ ~(x@False) -> unit False <<! x
                      ,\ ~(x@True ) -> unit True  <<! x

The reason for this is an implementation detail leaking through, alas. If you have a zero-arity constructor which incorrectly misses the extra step you will get a warning about PendingSome versus PendingNone - this mistake is guarded by a phantom type.

A Real-life Example

This example is much of the motivation for this library. Hoogle 4 has a big trie data structure, along with a big list of matches. Each node in the trie gives a start and end of the list, which says which items match. This data structure is fast, has good space usage and scales well - all important properties for Hoogle. The basic data structure is:

data NameSearch = NameSearch (ListDefer ItemId) Trie

data Trie = Trie {mapping :: [(Char,Defer Trie)], start :: Int, len :: Int}

Note that this data structure uses two deferred data types supplied by this library - ListDefer and Defer. Defer is a data type that simply wraps the underlying one, but even using a strict variant of the definition, namely (>>), the data is still written out lazy. This is particularly useful in this case as a tuple already has a strict definition of BinaryDefer, so a new tuple type would have had to be defined. This method is clearer.

The ListDefer type is more interesting, and is defined in Data.Binary.Defer.List. This allows a big array to be created, and then only a substring of this array to be loaded at any one time. The interface is minimal

With this definition, everything else simply falls out easily. All the combinators used are the strict versions, in this particular case, because the types capture the laziness. This does mean that the user needs to manually call fromDefer on the second element of the pair, but this is a small price to pay.

Defining a Custom Instance

While the combinators, in addition to the special few data structures defined should cover nearly all data types, it may not cover quite all. In this case a basic familiarity with the underlying mechanism is desirable. The best method is to define the following two methods:

class BinaryDefer a where
    putDefer :: Handle -> a -> IO [(Int, IO ())]
    get :: Handle -> IO a

The get method is most simple - it loads a value. If this value is deferred, it is likely that this load will read an integer (the real location), then use an unsafePerformIO to seek to that location and read the value. This ensures that the reading is not done until the value is required.

The putDefer method saves the value to the handle, and returns a list of locations to backpatch, along with an IO action that will write the value to wherever the handle is at that point. The easiest way to see this is on a real example. The Defer type is defined and implemented as follows:

newtype Defer x = Defer {fromDefer :: x}

instance BinaryDefer a => BinaryDefer (Defer a) where
    putDefer hndl (Defer x) = do
        i <- hGetPos hndl
        hPutInt hndl 0
        return [(i, put hndl x)]

    get hndl = do
            i <- hGetInt hndl
            return $ Defer $ unsafePerformIO $ f i
        where
            f i = hSetPos hndl i >> get hndl

The putDefer follows a standard pattern of getting the current position, writing out a 0 (space for the backpatch) and then returning the location of the backpatch, along with an action to save the item. The get method reads the position, but defers actually reading the value until it is demanded, by using an unsafePerformIO.

The methods hPutInt, getGetPos etc are taken from Data.Binary.Defer.Internal. Whether external packages should be using these or not is a good question. Hopefully noone will need to hand-write BinaryDefer instances anyway.

Limitations

All Int's are treated as 32 bits, regardless of the real Int size on your platform. All files must be below 2Gb. No other known limitations.