Skip to main content

Toolkit - Writing My Own Database Engine

This post if part of a series about the File System Toolkit - a custom content delivery API for SDL Tridion.

In the previous post Criteria for Dynamic Queries, I presented index classes that allow us to execute performant lookups for keys in indexes on the file system. This post presents the logic behind these indexes.

The requirement for the Toolkit API was not to use a database or a third party product such as search engines or indexers. In order to do searches for content on a file system where we had JSON files filled with properties we want to search on, we need to have indexes of those properties. It would be impossible to do performant searches without such indexes. This meant creating my own indexes and indexing logic (put, get) backed by a fast searching algorithm. In other words, it meant writing my own simple database engine.

The rationale looks like this:
  • Each query must be as fast as possible. The fastest way to query would be to lookup a key in a dictionary and to retrieve the associated value. For large collections, this can't be done in memory; therefore we must have our dictionary on disk as a file. That's something we don't have out of the box.
  • In order to retrieve values fast, we need to do a binary search. This requires the list of keys to be ordered, let's say in ascending order. Next to the key we need, there is a list of values. Each value represents the TcmUri of the item we need. As such we need an index file containing key-value tuples, ordered ascending by key.
  • The index can change at any moment due to publish/unpublish activities. New items must be inserted at the right location in the index in order to maintain its ascending property. Removing items will either remove a value from the list of values associated to a key or, if the last value of a key is removed, it will remove the key entry as well.
For example consider we have a CustomMeta Key-StringValue index. It must contain as index key the concatenation of CustomMeta key with CustomMeta string-value. The index values associated with this index key are the list of TcmUris of items that contain such CustomMeta.

Component tcm:1-2 with CustomMeta key=Type, value=Article would be represented in the index as:
TypeArticle=tcm:1-2

Component tcm:1-3 with the same CustomMeta key=Typevalue=Article would be represented in the index as:
TypeArticle=tcm:1-3

When indexing both Components, the index would have the following entry:
TypeArticle=tcm:1-2,tcm:1-3

Other Components in the index would be represented according to their CustomMeta. For example:
TypeArticle=tcm:1-2,tcm:1-3
TypeProduct=tcm:1-4

The index is ordered ascending according to the index key (i.e. TypeArticle, TypeProduct).

The same Component might appear several times in the index, but next to other CustomMeta key-value entries:
AuthorJohn=tcm:1-2
TypeArticle=tcm:1-2,tcm:1-3
TypeProduct=tcm:1-4

Component tcm:1-2 has both CustomMeta Type=Article and Author=John. Hence it appears in the index twice.

Index lookups are done using binary search for a key-value tuple. For example, a query for CustomMeta key=Type and value=Article, results in an index lookup for TypeArticle, which yields tcm:1-2 and tcm:1-3.

IndexAbstract Class

The base class of the index implementation offers the following operations. The write operations put, remove are performed by the Deployer extension only. The get operations are performed mainly by the dynamic query criteria:

    public abstract Set<String> get(String key);
    public abstract Set<String> get(String key, Filter filter);
    public abstract Set<String> get(String startKey, String endKey, Filter filter);
    public abstract boolean put(String key, String value);
    public abstract boolean remove(String key);
    public abstract boolean remove(String key, String value);

IndexImpl Class

The class offers an implementation of the get, put, remove methods in IndexAbstract. The simplest get(String) method retrieves all values associated to an index key.

public Set<String> get(String key, Filter filter) {
    RandomAccessFile raf = new RandomAccessFile(file, "rw");
    FileChannel fileChannel = raf.getChannel();

    IndexEntry indexEntry = binarySearch(fileChannel, key, filter);
    return indexEntry.getValues();
}

The binarySearch method performs a binary search on a given random access file. The file is the index file. The returned IndexEntry contains the key and values objects read from the index, as well as the start and end positions in the index file where the index entry is located.

In order to make the index lookup perform fast, we need to read parts of the index in memory for faster access. Java offers a neat feature for this called MappedByteBuffer where one can map part of a file to memory. All file operations would then be performed in memory instead, yielding a great performance boost. More information about memory buffer is available in follow-up post Tricks with Memory Byte Buffer.

private IndexEntry binarySearch(FileChannel fileChannel, String key, Filter filter) {
    MemoryBuffer buffer = MemoryBufferFactory.INSTANCE.getBuffer(fileChannel,
        FileChannel.MapMode.READ_ONLY);
    return binarySearch(buffer, key, filter, 0, buffer.capacity());
}

The binarySearch method below is the one that performs the actual binary search. Given the nature of the random access file, we don't know where in the file a certain index entry begins. Each index entry has an arbitrary length due to its number of TcmUris in the value, so we can't know where it ends. Therefore, the binary search algorithm must be adapted a bit in order to work.

private IndexEntry binarySearch(MemoryBuffer buffer, String key, Filter filter, long low, long high) {
    if (low > high) {
        return new IndexEntry(key, low);
    }

    long middle = (low + high) >>> 1;
    IndexEntry indexEntry = readIndexEntry(buffer, key, middle, filter);
    if (indexEntry.isEmpty()) {
        return indexEntry;
    }

    int compare = compareKey(key, indexEntry.getKey());
    if (compare == 0) {
        return indexEntry;
    } else if (compare > 0) {
        return binarySearch(buffer, key, filter, indexEntry.getEndPosition() + 1, high);
    } else {
        return binarySearch(buffer, key, filter, low, indexEntry.getStartPosition() - 1);
    }
}

When we perform the middle split in binary search, we need to know which entry is at that position. We need to find where the beginning of that index entry is, because that is where we read the index key from. The method readIndexEntry reads backwards in the index from the 'current' position until it finds the 'entry-separator' -- i.e. a special character that delimits two index entries.

The index file format of an index entry is the following:
key KeySeparator value ValueSeparator value EntrySeparator
where there can be a variable number of values and each separator is a standard ASCII character.

The compareKey method above performs a comparison between the searched key and the found key. If they are the same, the binary search found the index entry we want. If they are different, the binary search continues on the left/right partitions.



Comments

Popular posts from this blog

Running sp_updatestats on AWS RDS database

Part of the maintenance tasks that I perform on a MSSQL Content Manager database is to run stored procedure sp_updatestats . exec sp_updatestats However, that is not supported on an AWS RDS instance. The error message below indicates that only the sa  account can perform this: Msg 15247 , Level 16 , State 1 , Procedure sp_updatestats, Line 15 [Batch Start Line 0 ] User does not have permission to perform this action. Instead there are several posts that suggest using UPDATE STATISTICS instead: https://dba.stackexchange.com/questions/145982/sp-updatestats-vs-update-statistics I stumbled upon the following post from 2008 (!!!), https://social.msdn.microsoft.com/Forums/sqlserver/en-US/186e3db0-fe37-4c31-b017-8e7c24d19697/spupdatestats-fails-to-run-with-permission-error-under-dbopriveleged-user , which describes a way to wrap the call to sp_updatestats and execute it under a different user: create procedure dbo.sp_updstats with execute as 'dbo' as...

I Have Gone Dark

Maybe it's the Holidays, but my mood has gone pretty dark. That is, regarding the look and feel of my computer and Tridion CME, of course. What I did was to dim the lights on the operating system, so I installed Placebo themes for Windows 7 . I went for the Ashtray look -- great name :) My VM looks now like this: But, once you change the theme on Windows, you should 'match' the theme of your applications. Some skin easily, some not. The Office suite has an in-built scheme, which can be set to Black , but it doesn't actually dim the ribbon tool bars -- it looks quite weird. Yahoo Messenger is skinnable, but you can't change the big white panels where you actually 'chat'. Skype is not skinnable at all. For Chrome, there are plenty of grey themes. Now i'm using Pro Grey . But then I got into changing the theme of websites. While very few offer skinnable interfaces (as GMail does), I had to find a way to darken the websites... Enter Stylish -- a pl...

REL Standard Tag Library

The RSTL is a library of REL tags providing standard functionality such as iterating collections, conditionals, imports, assignments, XML XSLT transformations, formatting dates, etc. RSTL distributable is available on my Google Code page under  REL Standard Tag Library . Always use the latest JAR . This post describes each RSTL tag in the library explaining its functionality, attributes and providing examples. For understanding the way expressions are evaluated, please read my post about the  Expression Language used by REL Standard Tag Library . <c:choose> / <c:when> / <c:otherwise> Syntax:     <c:choose>         <c:when test="expr1">             Do something         </c:when>         <c:when test="expr2">             Do something else         </c:when...