This post if part of a series about the File System Toolkit - a custom content delivery API for SDL Tridion.
In the previous post Criteria for Dynamic Queries, I presented index classes that allow us to execute performant lookups for keys in indexes on the file system. This post presents the logic behind these indexes.
The requirement for the Toolkit API was not to use a database or a third party product such as search engines or indexers. In order to do searches for content on a file system where we had JSON files filled with properties we want to search on, we need to have indexes of those properties. It would be impossible to do performant searches without such indexes. This meant creating my own indexes and indexing logic (put, get) backed by a fast searching algorithm. In other words, it meant writing my own simple database engine.
The rationale looks like this:
The binarySearch method performs a binary search on a given random access file. The file is the index file. The returned IndexEntry contains the key and values objects read from the index, as well as the start and end positions in the index file where the index entry is located.
In order to make the index lookup perform fast, we need to read parts of the index in memory for faster access. Java offers a neat feature for this called MappedByteBuffer where one can map part of a file to memory. All file operations would then be performed in memory instead, yielding a great performance boost. More information about memory buffer is available in follow-up post Tricks with Memory Byte Buffer.
The binarySearch method below is the one that performs the actual binary search. Given the nature of the random access file, we don't know where in the file a certain index entry begins. Each index entry has an arbitrary length due to its number of TcmUris in the value, so we can't know where it ends. Therefore, the binary search algorithm must be adapted a bit in order to work.
When we perform the middle split in binary search, we need to know which entry is at that position. We need to find where the beginning of that index entry is, because that is where we read the index key from. The method readIndexEntry reads backwards in the index from the 'current' position until it finds the 'entry-separator' -- i.e. a special character that delimits two index entries.
The index file format of an index entry is the following:
key KeySeparator value ValueSeparator value EntrySeparator
where there can be a variable number of values and each separator is a standard ASCII character.
The compareKey method above performs a comparison between the searched key and the found key. If they are the same, the binary search found the index entry we want. If they are different, the binary search continues on the left/right partitions.
In the previous post Criteria for Dynamic Queries, I presented index classes that allow us to execute performant lookups for keys in indexes on the file system. This post presents the logic behind these indexes.
The requirement for the Toolkit API was not to use a database or a third party product such as search engines or indexers. In order to do searches for content on a file system where we had JSON files filled with properties we want to search on, we need to have indexes of those properties. It would be impossible to do performant searches without such indexes. This meant creating my own indexes and indexing logic (put, get) backed by a fast searching algorithm. In other words, it meant writing my own simple database engine.
The rationale looks like this:
- Each query must be as fast as possible. The fastest way to query would be to lookup a key in a dictionary and to retrieve the associated value. For large collections, this can't be done in memory; therefore we must have our dictionary on disk as a file. That's something we don't have out of the box.
- In order to retrieve values fast, we need to do a binary search. This requires the list of keys to be ordered, let's say in ascending order. Next to the key we need, there is a list of values. Each value represents the TcmUri of the item we need. As such we need an index file containing key-value tuples, ordered ascending by key.
- The index can change at any moment due to publish/unpublish activities. New items must be inserted at the right location in the index in order to maintain its ascending property. Removing items will either remove a value from the list of values associated to a key or, if the last value of a key is removed, it will remove the key entry as well.
For example consider we have a CustomMeta Key-StringValue index. It must contain as index key the concatenation of CustomMeta key with CustomMeta string-value. The index values associated with this index key are the list of TcmUris of items that contain such CustomMeta.
Component tcm:1-2 with CustomMeta key=Type, value=Article would be represented in the index as:
TypeArticle=tcm:1-2
Component tcm:1-3 with the same CustomMeta key=Type, value=Article would be represented in the index as:
TypeArticle=tcm:1-3
When indexing both Components, the index would have the following entry:
TypeArticle=tcm:1-2,tcm:1-3
Other Components in the index would be represented according to their CustomMeta. For example:
TypeArticle=tcm:1-2,tcm:1-3
TypeProduct=tcm:1-4
The index is ordered ascending according to the index key (i.e. TypeArticle, TypeProduct).
The same Component might appear several times in the index, but next to other CustomMeta key-value entries:
AuthorJohn=tcm:1-2
TypeArticle=tcm:1-2,tcm:1-3
TypeProduct=tcm:1-4
Component tcm:1-2 has both CustomMeta Type=Article and Author=John. Hence it appears in the index twice.
Index lookups are done using binary search for a key-value tuple. For example, a query for CustomMeta key=Type and value=Article, results in an index lookup for TypeArticle, which yields tcm:1-2 and tcm:1-3.
IndexAbstract Class
The base class of the index implementation offers the following operations. The write operations put, remove are performed by the Deployer extension only. The get operations are performed mainly by the dynamic query criteria:
public abstract Set<String> get(String key); public abstract Set<String> get(String key, Filter filter); public abstract Set<String> get(String startKey, String endKey, Filter filter); public abstract boolean put(String key, String value); public abstract boolean remove(String key); public abstract boolean remove(String key, String value);
IndexImpl Class
The class offers an implementation of the get, put, remove methods in IndexAbstract. The simplest get(String) method retrieves all values associated to an index key.public Set<String> get(String key, Filter filter) { RandomAccessFile raf = new RandomAccessFile(file, "rw"); FileChannel fileChannel = raf.getChannel(); IndexEntry indexEntry = binarySearch(fileChannel, key, filter); return indexEntry.getValues(); }
The binarySearch method performs a binary search on a given random access file. The file is the index file. The returned IndexEntry contains the key and values objects read from the index, as well as the start and end positions in the index file where the index entry is located.
In order to make the index lookup perform fast, we need to read parts of the index in memory for faster access. Java offers a neat feature for this called MappedByteBuffer where one can map part of a file to memory. All file operations would then be performed in memory instead, yielding a great performance boost. More information about memory buffer is available in follow-up post Tricks with Memory Byte Buffer.
private IndexEntry binarySearch(FileChannel fileChannel, String key, Filter filter) { MemoryBuffer buffer = MemoryBufferFactory.INSTANCE.getBuffer(fileChannel, FileChannel.MapMode.READ_ONLY); return binarySearch(buffer, key, filter, 0, buffer.capacity()); }
The binarySearch method below is the one that performs the actual binary search. Given the nature of the random access file, we don't know where in the file a certain index entry begins. Each index entry has an arbitrary length due to its number of TcmUris in the value, so we can't know where it ends. Therefore, the binary search algorithm must be adapted a bit in order to work.
private IndexEntry binarySearch(MemoryBuffer buffer, String key, Filter filter, long low, long high) { if (low > high) { return new IndexEntry(key, low); } long middle = (low + high) >>> 1; IndexEntry indexEntry = readIndexEntry(buffer, key, middle, filter); if (indexEntry.isEmpty()) { return indexEntry; } int compare = compareKey(key, indexEntry.getKey()); if (compare == 0) { return indexEntry; } else if (compare > 0) { return binarySearch(buffer, key, filter, indexEntry.getEndPosition() + 1, high); } else { return binarySearch(buffer, key, filter, low, indexEntry.getStartPosition() - 1); } }
When we perform the middle split in binary search, we need to know which entry is at that position. We need to find where the beginning of that index entry is, because that is where we read the index key from. The method readIndexEntry reads backwards in the index from the 'current' position until it finds the 'entry-separator' -- i.e. a special character that delimits two index entries.
The index file format of an index entry is the following:
key KeySeparator value ValueSeparator value EntrySeparator
where there can be a variable number of values and each separator is a standard ASCII character.
The compareKey method above performs a comparison between the searched key and the found key. If they are the same, the binary search found the index entry we want. If they are different, the binary search continues on the left/right partitions.
Comments