Skip to main content

Toolkit - Writing My Own Database Engine

This post if part of a series about the File System Toolkit - a custom content delivery API for SDL Tridion.

In the previous post Criteria for Dynamic Queries, I presented index classes that allow us to execute performant lookups for keys in indexes on the file system. This post presents the logic behind these indexes.

The requirement for the Toolkit API was not to use a database or a third party product such as search engines or indexers. In order to do searches for content on a file system where we had JSON files filled with properties we want to search on, we need to have indexes of those properties. It would be impossible to do performant searches without such indexes. This meant creating my own indexes and indexing logic (put, get) backed by a fast searching algorithm. In other words, it meant writing my own simple database engine.

The rationale looks like this:
  • Each query must be as fast as possible. The fastest way to query would be to lookup a key in a dictionary and to retrieve the associated value. For large collections, this can't be done in memory; therefore we must have our dictionary on disk as a file. That's something we don't have out of the box.
  • In order to retrieve values fast, we need to do a binary search. This requires the list of keys to be ordered, let's say in ascending order. Next to the key we need, there is a list of values. Each value represents the TcmUri of the item we need. As such we need an index file containing key-value tuples, ordered ascending by key.
  • The index can change at any moment due to publish/unpublish activities. New items must be inserted at the right location in the index in order to maintain its ascending property. Removing items will either remove a value from the list of values associated to a key or, if the last value of a key is removed, it will remove the key entry as well.
For example consider we have a CustomMeta Key-StringValue index. It must contain as index key the concatenation of CustomMeta key with CustomMeta string-value. The index values associated with this index key are the list of TcmUris of items that contain such CustomMeta.

Component tcm:1-2 with CustomMeta key=Type, value=Article would be represented in the index as:
TypeArticle=tcm:1-2

Component tcm:1-3 with the same CustomMeta key=Typevalue=Article would be represented in the index as:
TypeArticle=tcm:1-3

When indexing both Components, the index would have the following entry:
TypeArticle=tcm:1-2,tcm:1-3

Other Components in the index would be represented according to their CustomMeta. For example:
TypeArticle=tcm:1-2,tcm:1-3
TypeProduct=tcm:1-4

The index is ordered ascending according to the index key (i.e. TypeArticle, TypeProduct).

The same Component might appear several times in the index, but next to other CustomMeta key-value entries:
AuthorJohn=tcm:1-2
TypeArticle=tcm:1-2,tcm:1-3
TypeProduct=tcm:1-4

Component tcm:1-2 has both CustomMeta Type=Article and Author=John. Hence it appears in the index twice.

Index lookups are done using binary search for a key-value tuple. For example, a query for CustomMeta key=Type and value=Article, results in an index lookup for TypeArticle, which yields tcm:1-2 and tcm:1-3.

IndexAbstract Class

The base class of the index implementation offers the following operations. The write operations put, remove are performed by the Deployer extension only. The get operations are performed mainly by the dynamic query criteria:

    public abstract Set<String> get(String key);
    public abstract Set<String> get(String key, Filter filter);
    public abstract Set<String> get(String startKey, String endKey, Filter filter);
    public abstract boolean put(String key, String value);
    public abstract boolean remove(String key);
    public abstract boolean remove(String key, String value);

IndexImpl Class

The class offers an implementation of the get, put, remove methods in IndexAbstract. The simplest get(String) method retrieves all values associated to an index key.

public Set<String> get(String key, Filter filter) {
    RandomAccessFile raf = new RandomAccessFile(file, "rw");
    FileChannel fileChannel = raf.getChannel();

    IndexEntry indexEntry = binarySearch(fileChannel, key, filter);
    return indexEntry.getValues();
}

The binarySearch method performs a binary search on a given random access file. The file is the index file. The returned IndexEntry contains the key and values objects read from the index, as well as the start and end positions in the index file where the index entry is located.

In order to make the index lookup perform fast, we need to read parts of the index in memory for faster access. Java offers a neat feature for this called MappedByteBuffer where one can map part of a file to memory. All file operations would then be performed in memory instead, yielding a great performance boost. More information about memory buffer is available in follow-up post Tricks with Memory Byte Buffer.

private IndexEntry binarySearch(FileChannel fileChannel, String key, Filter filter) {
    MemoryBuffer buffer = MemoryBufferFactory.INSTANCE.getBuffer(fileChannel,
        FileChannel.MapMode.READ_ONLY);
    return binarySearch(buffer, key, filter, 0, buffer.capacity());
}

The binarySearch method below is the one that performs the actual binary search. Given the nature of the random access file, we don't know where in the file a certain index entry begins. Each index entry has an arbitrary length due to its number of TcmUris in the value, so we can't know where it ends. Therefore, the binary search algorithm must be adapted a bit in order to work.

private IndexEntry binarySearch(MemoryBuffer buffer, String key, Filter filter, long low, long high) {
    if (low > high) {
        return new IndexEntry(key, low);
    }

    long middle = (low + high) >>> 1;
    IndexEntry indexEntry = readIndexEntry(buffer, key, middle, filter);
    if (indexEntry.isEmpty()) {
        return indexEntry;
    }

    int compare = compareKey(key, indexEntry.getKey());
    if (compare == 0) {
        return indexEntry;
    } else if (compare > 0) {
        return binarySearch(buffer, key, filter, indexEntry.getEndPosition() + 1, high);
    } else {
        return binarySearch(buffer, key, filter, low, indexEntry.getStartPosition() - 1);
    }
}

When we perform the middle split in binary search, we need to know which entry is at that position. We need to find where the beginning of that index entry is, because that is where we read the index key from. The method readIndexEntry reads backwards in the index from the 'current' position until it finds the 'entry-separator' -- i.e. a special character that delimits two index entries.

The index file format of an index entry is the following:
key KeySeparator value ValueSeparator value EntrySeparator
where there can be a variable number of values and each separator is a standard ASCII character.

The compareKey method above performs a comparison between the searched key and the found key. If they are the same, the binary search found the index entry we want. If they are different, the binary search continues on the left/right partitions.



Comments

Popular posts from this blog

Toolkit - Dynamic Content Queries

This post if part of a series about the  File System Toolkit  - a custom content delivery API for SDL Tridion. This post presents the Dynamic Content Query capability. The requirements for the Toolkit API are that it should be able to provide CustomMeta queries, pagination, and sorting -- all on the file system, without the use third party tools (database, search engines, indexers, etc). Therefore I had to implement a simple database engine and indexer -- which is described in more detail in post Writing My Own Database Engine . The querying logic does not make use of cache. This means the query logic is executed every time. When models are requested, the models are however retrieved using the ModelFactory and those are cached. Query Class This is the main class for dynamic content queries. It is the entry point into the execution logic of a query. The class takes as parameter a Criterion (presented below) which triggers the execution of query in all sub-criteria of a Criterio

A DD4T.net Implementation - Custom Binary Publisher

The default way to publish binaries in DD4T is implemented in class DD4T.Templates.Base.Utils.BinaryPublisher and uses method RenderedItem.AddBinary(Component) . This produces binaries that have their TCM URI as suffix in their filename. In my recent project, we had a requirement that binary file names should be clean (without the TCM URI suffix). Therefore, it was time to modify the way DD4T was publishing binaries. The method in charge with publishing binaries is called PublishItem and is defined in class BinaryPublisher . I therefore extended the BinaryPublisher and overrode method PublishItem. public class CustomBinaryPublisher : BinaryPublisher { private Template currentTemplate; private TcmUri structureGroupUri; In its simplest form, method PublishItem just takes the item and passes it to the AddBinary. In order to accomplish the requirement, we must specify a filename while publishing. This is the file name part of the binary path of Component.BinaryConten

Scaling Policies

This post is part of a bigger topic Autoscaling Publishers in AWS . In a previous post we talked about the Auto Scaling Groups , but we didn't go into details on the Scaling Policies. This is the purpose of this blog post. As defined earlier, the Scaling Policies define the rules according to which the group size is increased or decreased. These rules are based on instance metrics (e.g. CPU), CloudWatch custom metrics, or even CloudWatch alarms and their states and values. We defined a Scaling Policy with Steps, called 'increase_group_size', which is triggered first by the CloudWatch Alarm 'Publish_Alarm' defined earlier. Also depending on the size of the monitored CloudWatch custom metric 'Waiting for Publish', the Scaling Policy with Steps can add a difference number of instances to the group. The scaling policy sets the number of instances in group to 1 if there are between 1000 and 2000 items Waiting for Publish in the queue. It also sets the