Algorithms In Context #9: Auto-Completion

Guide to designing an algorithm for search suggestions

Can Bayar
8 min read · Mar 7, 2022

A normal person would probably start this discussion by saying that it is amazing how fast Google suggestions work, or by simply stating the obvious: we are all using it. I won’t. We are not trying to fulfill the 500-word requirement for our essay homework, you know the deal…

Before we start talking about the algorithm, let’s first talk about the data. The challenge in this problem comes not from finding matches among a set of entries, but from the sheer amount of data we need to go through. Just think about how much data Google or your favourite search engine stores. The data they are dealing with is massive, yet the search results appear almost instantaneously.

A search engine must return results with very low latency; otherwise, no one would use it. Consequently, our algorithm must go through a massive amount of data very quickly and fetch the results. Where the data is stored matters here. As you might expect, we will use SQL databases to store the data. The need for multiple instances will be omitted for simplicity.

Let’s start with a simpler example first and gradually build up the problem, as we usually do in this series.

A Phonebook Analogy

Designing a phonebook application is similar to designing a search engine in a lot of ways. We have a limited number of entries and a search box above them. With each character we enter, it fetches the matching contacts. If you enter the letter ‘j’, you will get James Hetfield and Jon Bon Jovi in the table below. If you enter ‘ja’, the search results will be down to James Hetfield alone.

If you only have a limited number of entries, fetching the results is not really a big deal. Maybe you can read all the entries and do substring operations. Or maybe you can let the database do the work and run a LIKE query:

SELECT * FROM phonebook WHERE name LIKE 'JA%'

Both of these approaches would return very quickly if you have, say, 10,000 entries. But what if you had 1 billion entries? What about 1 trillion? Clearly, you can’t go through all the entries one by one in that situation.

Notice that we are dealing with string elements here, so we can exploit their nature. If the data is sorted, we can run a binary search using lexicographic comparisons. This approach is effective to some extent, but it will fall short as we introduce extra requirements to the problem. Another disadvantage of binary search is that lexicographic comparison gets more expensive as the search phrase grows longer.
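To make this concrete, here is a minimal sketch (the function name is my own) of prefix matching over a sorted list with two binary searches: one for the prefix itself, and one for the smallest string that no longer matches it.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return every entry in the sorted list that starts with `prefix`.
// Assumes a non-empty prefix whose last character is not CHAR_MAX.
std::vector<std::string> prefixRange(const std::vector<std::string>& sorted,
                                     const std::string& prefix) {
    // First entry that is lexicographically >= the prefix.
    auto first = std::lower_bound(sorted.begin(), sorted.end(), prefix);
    // Exclusive upper bound: bump the last character ("ja" -> "jb").
    std::string upper = prefix;
    upper.back()++;
    auto last = std::lower_bound(sorted.begin(), sorted.end(), upper);
    return {first, last};
}
```

For the phonebook above, `prefixRange(names, "ja")` would narrow the list down to James Hetfield with two O(log n) searches, but each comparison still scans up to the full length of the prefix.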

Next, we introduce a new data structure.

Trie Data Structure

Trie (pronounced like try) is a specialized search tree that is generally used to search strings. Its name apparently comes from the word retrieval, which is not a very creative name choice. Simply put, each node corresponds to a character, and each path from the root corresponds to a string.

You can see the trie I have drawn for the examples above. Each node marked in blue corresponds to a word. If you follow the characters from the root to a marked node, you can see how the word is constructed. Each node simply stores its child characters and a flag that marks whether or not it is the end of a word.

struct TrieNode {
    unordered_map<char, TrieNode*> children;
    bool endOfWord = false;
};

Here is how you can implement the insertion function:

void insert(TrieNode* root, string key) {
    TrieNode* current = root;
    for (char ch : key) {
        if (current->children.count(ch) == 0) {
            current->children[ch] = new TrieNode();
        }
        current = current->children[ch];
    }
    current->endOfWord = true;
}

The search function is actually very similar:

TrieNode* search(TrieNode* root, string key) {
    TrieNode* current = root;
    for (char ch : key) {
        if (current->children.count(ch) == 0) {
            return nullptr;
        }
        current = current->children[ch];
    }
    return current;
}

Notice that the performance of both functions is bounded by the length of the input string. Since the possibility of searching for anything longer than 50 characters is very low, we can simply assume that the algorithm works in O(1).

There are two problems in the trie section on HackerRank. I highly recommend solving them when you finish reading; they are very easy.

Let’s return to our phonebook example. We can easily update the node structure to store the phone numbers in the blue nodes.

struct TrieNode {
    unordered_map<char, TrieNode*> children;
    string phoneNumber;
    bool isContact;
};

So far we have only found the node corresponding to our search key. What we actually need is the contacts matching the key: the blue nodes in the subtree rooted at the node returned by the search function. Running a DFS or BFS starting from this node should suffice for this example. You may want to limit the number of contacts returned, depending on your needs, so that you don’t have to search the entire subtree.
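As a sketch (the `collect` name and the pair-of-strings result are my own choices), the subtree walk could look like this:

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct TrieNode {
    std::unordered_map<char, TrieNode*> children;
    std::string phoneNumber;
    bool isContact = false;
};

// Depth-first walk of the subtree under `node`, collecting every
// contact found until `limit` results have been gathered.
void collect(TrieNode* node, const std::string& prefix,
             std::vector<std::pair<std::string, std::string>>& out,
             size_t limit) {
    if (out.size() >= limit) return;
    if (node->isContact) out.emplace_back(prefix, node->phoneNumber);
    for (auto& [ch, child] : node->children)
        collect(child, prefix + ch, out, limit);
}
```

Calling `collect` on the node returned by the search function for “ja” yields every contact under that prefix, and the `limit` parameter keeps us from walking an unnecessarily large subtree.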

So far, so good… Now we can turn our attention back to our search engine.

Search Engine Constraints

When designing a search engine algorithm, we need to consider additional constraints. The most important ones are:

  • We will be dealing with a massive amount of data
  • Search results shall be ordered from the most popular to the least

We will try to satisfy both constraints in this section.

Dealing with Data

How many people use Google daily? How many search queries are submitted every second? How many terabytes of data does Google store for users’ search history? It should be obvious that you cannot keep updating the trie for every single query under such a heavy load, even though the insert operation is very fast.

Luckily, nothing forces us to update the trie immediately for each incoming request. Real-time updates are not that important for a search engine: there won’t be any drastic changes in a dataset that big within an hour or a day.

Instead, you can build a new trie every week (or at whatever interval suits you) using the current data. The newly built trie then replaces the old one, and since it is built in the background, independently from the live system, it doesn’t really matter how much time or processing power it consumes. I mean, if you are Google.
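A minimal sketch of that swap, assuming a `Trie` type that wraps the root node (all the names here are hypothetical): readers grab a snapshot of the current trie for the duration of a query, and the background job publishes the fresh one when it is done.

```cpp
#include <memory>
#include <mutex>
#include <utility>

struct Trie { /* root node, insert, search, ... */ };

// The trie currently serving suggestions, guarded by a mutex so the
// pointer swap is safe against concurrent readers.
std::mutex trieMutex;
std::shared_ptr<Trie> currentTrie = std::make_shared<Trie>();

// Called by the periodic rebuild job once the new trie is fully built.
void publish(std::shared_ptr<Trie> freshTrie) {
    std::lock_guard<std::mutex> lock(trieMutex);
    currentTrie = std::move(freshTrie);
}

// Called per request; the snapshot stays valid for the whole query
// even if a rebuild swaps the pointer underneath it.
std::shared_ptr<Trie> snapshot() {
    std::lock_guard<std::mutex> lock(trieMutex);
    return currentTrie;
}
```

Because readers hold a `shared_ptr`, the old trie is freed automatically once the last in-flight query finishes with it.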

Ordering by Popularity

When you enter a keyword into Google search, you don’t see random results matching your query; you see the most popular things searched by other users. Therefore, we need to store how many times each phrase is searched in the database. I am over-simplifying here; I am pretty sure Google doesn’t have an SQL table with query and frequency columns.

This means we also need to update our nodes to store the frequency.

struct TrieNode {
    unordered_map<char, TrieNode*> children;
    int frequency;
    bool isWord;
};

Our trie is updated to store the frequencies. The frequency of each keyword is given next to it.
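The insert function from before only needs a small change to act as a query logger, bumping the counter every time the same phrase is searched again (a sketch; the `logQuery` name is my own):

```cpp
#include <string>
#include <unordered_map>

struct TrieNode {
    std::unordered_map<char, TrieNode*> children;
    int frequency = 0;
    bool isWord = false;
};

// Walk the trie for the query, creating nodes as needed, and bump
// the frequency counter on the terminal node.
void logQuery(TrieNode* root, const std::string& query) {
    TrieNode* current = root;
    for (char ch : query) {
        if (current->children.count(ch) == 0)
            current->children[ch] = new TrieNode();
        current = current->children[ch];
    }
    current->isWord = true;
    current->frequency++;
}
```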

The additional constraint of returning the most popular queries creates a new problem to solve. How will we return the most popular searches when a query is entered? Running a DFS or BFS might be an option for the phonebook, but not for the search engine: considering the size of the subtree to be searched, it would be a huge bottleneck.

One simple solution is to employ some sort of caching mechanism. Each node in the trie could store the n most popular searches in its subtree, so you can immediately return the cached values when a key phrase is entered. It wouldn’t be space-efficient, but it would be fast.
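One way to sketch that cache (`kTopN` and the function names are my own): keep a small sorted list of the subtree’s most popular phrases on every node, and refresh it along the path each time a query is logged.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

constexpr size_t kTopN = 5;  // suggestions cached per node (an assumption)

struct CachedNode {
    std::unordered_map<char, CachedNode*> children;
    // Most popular phrases in this subtree, sorted by descending frequency.
    std::vector<std::pair<int, std::string>> topSearches;
};

// Update one node's cache with the latest count for `phrase`.
void updateCache(CachedNode* node, const std::string& phrase, int freq) {
    auto& top = node->topSearches;
    auto it = std::find_if(top.begin(), top.end(),
                           [&](const auto& p) { return p.second == phrase; });
    if (it != top.end()) it->first = freq;
    else top.push_back({freq, phrase});
    std::sort(top.begin(), top.end(), std::greater<>());
    if (top.size() > kTopN) top.resize(kTopN);
}

// Log one search: bump the phrase's frequency and refresh the cache on
// every node along its path, so lookups can answer from the cache alone.
void logSearch(CachedNode* root, const std::string& phrase,
               std::unordered_map<std::string, int>& counts) {
    int freq = ++counts[phrase];
    CachedNode* current = root;
    updateCache(current, phrase, freq);
    for (char ch : phrase) {
        if (current->children.count(ch) == 0)
            current->children[ch] = new CachedNode();
        current = current->children[ch];
        updateCache(current, phrase, freq);
    }
}
```

A lookup then just finds the node for the typed prefix and returns its `topSearches` directly, with no subtree traversal at all.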

Ternary Search Tree

We have seen that the trie data structure is simple but space-inefficient. A ternary search tree is like an optimized version of a trie. Plus, its name is way cooler. In a trie, each node has up to 26 children (or however many characters you support), whereas in a ternary search tree each node has at most three children. It works similarly to a binary search tree.

It took me quite a while to understand how the ternary search tree works. Mainly because the nerds writing on the subject have terrible communication skills but I’m not here to judge. Here is how the tree is structured:

  • The character on the left child is alphabetically smaller than the parent.
  • The character on the right child is alphabetically larger than the parent.
  • The character on the middle child leads to the subtree for the next character on the string.

Here is our trie from before, reorganized as a ternary search tree:

The only thing you really need to see here is that whenever you go to the left or right from a node, you skip reading that node. In the left-most part of the tree, we skip the letter t when forming the string add.

Ternary search trees are more space-efficient than tries, but at the cost of speed. So if you are short on space, use a ternary search tree; otherwise, a trie should be the better option.
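The three rules above translate almost directly into code. Here is a minimal sketch of the node together with insertion and lookup (it assumes non-empty keys):

```cpp
#include <string>

struct TSTNode {
    char ch;
    bool endOfWord = false;
    TSTNode *left = nullptr, *middle = nullptr, *right = nullptr;
    explicit TSTNode(char c) : ch(c) {}
};

// Left/right follow binary-search-tree ordering on the current
// character; only the middle child advances to the next character.
TSTNode* insert(TSTNode* node, const std::string& key, size_t i = 0) {
    char c = key[i];
    if (!node) node = new TSTNode(c);
    if (c < node->ch)
        node->left = insert(node->left, key, i);
    else if (c > node->ch)
        node->right = insert(node->right, key, i);
    else if (i + 1 < key.size())
        node->middle = insert(node->middle, key, i + 1);
    else
        node->endOfWord = true;
    return node;
}

bool contains(TSTNode* node, const std::string& key, size_t i = 0) {
    if (!node) return false;
    char c = key[i];
    if (c < node->ch) return contains(node->left, key, i);
    if (c > node->ch) return contains(node->right, key, i);
    if (i + 1 == key.size()) return node->endOfWord;
    return contains(node->middle, key, i + 1);
}
```

Note that the index `i` only advances on a middle move, which is exactly the “skip reading that node” behaviour described above for left and right moves.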

There is one final constraint I would like to mention: spell-checking. A good search engine not only finds matches for your query but also suggests spelling corrections. Both the trie and the ternary search tree approaches can be extended to provide typo tolerance by pruning branches that differ too much from the query string during DFS/BFS.

The tree-based algorithms used in real systems are much more complex than the ones we have studied, but this should be a good starting point. By the way, everyone uses machine learning algorithms now, so most of the things we have talked about in this post are obsolete. Am I a bad person?


Can Bayar

Senior Software Engineer, Rock Vocalist & Guitarist