Skip to main content

Toolkit - Writing My Own Database Engine

This post if part of a series about the File System Toolkit - a custom content delivery API for SDL Tridion.

In the previous post Criteria for Dynamic Queries, I presented index classes that allow us to execute performant lookups for keys in indexes on the file system. This post presents the logic behind these indexes.

The requirement for the Toolkit API was not to use a database or a third party product such as search engines or indexers. In order to do searches for content on a file system where we had JSON files filled with properties we want to search on, we need to have indexes of those properties. It would be impossible to do performant searches without such indexes. This meant creating my own indexes and indexing logic (put, get) backed by a fast searching algorithm. In other words, it meant writing my own simple database engine.

The rationale looks like this:
  • Each query must be as fast as possible. The fastest way to query would be to lookup a key in a dictionary and to retrieve the associated value. For large collections, this can't be done in memory; therefore we must have our dictionary on disk as a file. That's something we don't have out of the box.
  • In order to retrieve values fast, we need to do a binary search. This requires the list of keys to be ordered, let's say in ascending order. Next to the key we need, there is a list of values. Each value represents the TcmUri of the item we need. As such we need an index file containing key-value tuples, ordered ascending by key.
  • The index can change at any moment due to publish/unpublish activities. New items must be inserted at the right location in the index in order to maintain its ascending property. Removing items will either remove a value from the list of values associated to a key or, if the last value of a key is removed, it will remove the key entry as well.
For example consider we have a CustomMeta Key-StringValue index. It must contain as index key the concatenation of CustomMeta key with CustomMeta string-value. The index values associated with this index key are the list of TcmUris of items that contain such CustomMeta.

Component tcm:1-2 with CustomMeta key=Type, value=Article would be represented in the index as:

Component tcm:1-3 with the same CustomMeta key=Typevalue=Article would be represented in the index as:

When indexing both Components, the index would have the following entry:

Other Components in the index would be represented according to their CustomMeta. For example:

The index is ordered ascending according to the index key (i.e. TypeArticle, TypeProduct).

The same Component might appear several times in the index, but next to other CustomMeta key-value entries:

Component tcm:1-2 has both CustomMeta Type=Article and Author=John. Hence it appears in the index twice.

Index lookups are done using binary search for a key-value tuple. For example, a query for CustomMeta key=Type and value=Article, results in an index lookup for TypeArticle, which yields tcm:1-2 and tcm:1-3.

IndexAbstract Class

The base class of the index implementation offers the following operations. The write operations put, remove are performed by the Deployer extension only. The get operations are performed mainly by the dynamic query criteria:

    public abstract Set<String> get(String key);
    public abstract Set<String> get(String key, Filter filter);
    public abstract Set<String> get(String startKey, String endKey, Filter filter);
    public abstract boolean put(String key, String value);
    public abstract boolean remove(String key);
    public abstract boolean remove(String key, String value);

IndexImpl Class

The class offers an implementation of the get, put, remove methods in IndexAbstract. The simplest get(String) method retrieves all values associated to an index key.

public Set<String> get(String key, Filter filter) {
    RandomAccessFile raf = new RandomAccessFile(file, "rw");
    FileChannel fileChannel = raf.getChannel();

    IndexEntry indexEntry = binarySearch(fileChannel, key, filter);
    return indexEntry.getValues();

The binarySearch method performs a binary search on a given random access file. The file is the index file. The returned IndexEntry contains the key and values objects read from the index, as well as the start and end positions in the index file where the index entry is located.

In order to make the index lookup perform fast, we need to read parts of the index in memory for faster access. Java offers a neat feature for this called MappedByteBuffer where one can map part of a file to memory. All file operations would then be performed in memory instead, yielding a great performance boost. More information about memory buffer is available in follow-up post Tricks with Memory Byte Buffer.

private IndexEntry binarySearch(FileChannel fileChannel, String key, Filter filter) {
    MemoryBuffer buffer = MemoryBufferFactory.INSTANCE.getBuffer(fileChannel,
    return binarySearch(buffer, key, filter, 0, buffer.capacity());

The binarySearch method below is the one that performs the actual binary search. Given the nature of the random access file, we don't know where in the file a certain index entry begins. Each index entry has an arbitrary length due to its number of TcmUris in the value, so we can't know where it ends. Therefore, the binary search algorithm must be adapted a bit in order to work.

private IndexEntry binarySearch(MemoryBuffer buffer, String key, Filter filter, long low, long high) {
    if (low > high) {
        return new IndexEntry(key, low);

    long middle = (low + high) >>> 1;
    IndexEntry indexEntry = readIndexEntry(buffer, key, middle, filter);
    if (indexEntry.isEmpty()) {
        return indexEntry;

    int compare = compareKey(key, indexEntry.getKey());
    if (compare == 0) {
        return indexEntry;
    } else if (compare > 0) {
        return binarySearch(buffer, key, filter, indexEntry.getEndPosition() + 1, high);
    } else {
        return binarySearch(buffer, key, filter, low, indexEntry.getStartPosition() - 1);

When we perform the middle split in binary search, we need to know which entry is at that position. We need to find where the beginning of that index entry is, because that is where we read the index key from. The method readIndexEntry reads backwards in the index from the 'current' position until it finds the 'entry-separator' -- i.e. a special character that delimits two index entries.

The index file format of an index entry is the following:
key KeySeparator value ValueSeparator value EntrySeparator
where there can be a variable number of values and each separator is a standard ASCII character.

The compareKey method above performs a comparison between the searched key and the found key. If they are the same, the binary search found the index entry we want. If they are different, the binary search continues on the left/right partitions.


Popular posts from this blog

Running sp_updatestats on AWS RDS database

Part of the maintenance tasks that I perform on a MSSQL Content Manager database is to run stored procedure sp_updatestats . exec sp_updatestats However, that is not supported on an AWS RDS instance. The error message below indicates that only the sa  account can perform this: Msg 15247 , Level 16 , State 1 , Procedure sp_updatestats, Line 15 [Batch Start Line 0 ] User does not have permission to perform this action. Instead there are several posts that suggest using UPDATE STATISTICS instead: I stumbled upon the following post from 2008 (!!!), , which describes a way to wrap the call to sp_updatestats and execute it under a different user: create procedure dbo.sp_updstats with execute as 'dbo' as

Content Delivery Monitoring in AWS with CloudWatch

This post describes a way of monitoring a Tridion 9 combined Deployer by sending the health checks into a custom metric in CloudWatch in AWS. The same approach can also be used for other Content Delivery services. Once the metric is available in CloudWatch, we can create alarms in case the service errors out or becomes unresponsive. The overall architecture is as follows: Content Delivery service sends heartbeat (or exposes HTTP endpoint) for monitoring Monitoring Agent checks heartbeat (or HTTP health check) regularly and stores health state AWS lambda function: runs regularly reads the health state from Monitoring Agent pushes custom metrics into CloudWatch I am running the Deployer ( installation docs ) and Monitoring Agent ( installation docs ) on a t2.medium EC2 instance running CentOS on which I also installed the Systems Manager Agent (SSM Agent) ( installation docs ). In my case I have a combined Deployer that I want to monitor. This consists of an Endpoint and a

Debugging a Tridion 2011 Event System

OK, so you wrote your Tridion Event System. Now it's time to debug it. I know this is a hypothetical situtation -- your code never needs any kind of debugging ;) but indulge me... Recently, Alvin Reyes ( @nivlong ) blogged about being difficult to know how exactly to debug a Tridion Event System. More exactly, the question was " What process do I attach to for debugging even system code? ". Unfortunately, there is no simple or generic answer for it. Different events are fired by different Tridion CM modules. These modules run as different programs (or services) or run inside other programs (e.g. IIS). This means that you will need to monitor (or debug) different processes, based on which events your code handles. So the usual suspects are: dllhost.exe (or dllhost3g.exe ) - running as the MTSUser is the SDL Tridion Content Manager COM+ application and it fires events on generic TOM objects (e.g. events based on Tridion.ContentManager.Extensibility.Events.CrudEven