Algorithms In Context #10: Disjoint Sets

Can Bayar
8 min read · Jun 9, 2022

There is a certain kind of graph problem that troubled me for some time. These problems shared similar characteristics and although there were ways to solve them, none of those ways satisfied me for some reason. Until I discovered a data structure I was unaware of before… And yes, I draw the figures in PowerPoint.

Imagine an undirected graph consisting of multiple disconnected subgraphs as shown in the figure above. This graph could symbolize different types of networks: a connection map of people on social media, cities connected by railroads, countries divided by the oceans and so on… A graph can be represented in various ways and how you should represent it depends on the problem you are trying to solve. There is no one-size-fits-all solution when you are dealing with graphs.

We will only be focusing on certain types of problems in this post. Regarding our graph, an example question might be to find whether or not there is a path between two nodes. Or you may need to find how many nodes there are in each of the connected subgraphs. You can always run a DFS or BFS if you have a single query, but what if your graph is constantly evolving? Running a new search each time a node or edge is added does not seem like a refined solution. Plus, it is boring. Instead, we will be using a new data structure called disjoint sets.

Disjoint Set Data Structure

The idea for the disjoint set (also called union-find) data structure actually comes from set theory. In mathematics, two sets are disjoint if they have no element in common. Using the same analogy, we will represent each connected subgraph as a different set as shown in the figure below and ditch the standard graph representations for our specific purposes.

Let’s see how we can store a graph as a disjoint set then. I have drawn a simpler graph below to use as an example. In a disjoint set data structure, one node from each set is selected as a representative. The representative nodes are marked in red; nodes with the same representative belong to the same set.

Notice that the graph above is actually a forest, a collection of disjoint trees. In a disjoint set, you don’t actually need to store every edge in the graph. A single path from each node to the representative suffices, which forms a forest, a graph with no cycles. Each node points to its parent and if you follow the parents recursively up to the root node, you will eventually reach the representative from every node in the set. The representatives point to themselves. You can use an array to keep track of the parents:
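For instance, if the example forest has nodes 1, 2 and 3 in one set with representative 1 (3 pointing to 2, 2 pointing to 1) and nodes 4 and 5 in another with representative 5 (4 pointing to 5), which is my reading of the figure based on the operations later in the post, the parent array would look like this:

// index:  1  2  3  4  5
// parent: 1  1  2  5  5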

A disjoint set has two operations, find and union:

  1. Find: Finds and returns the representative for the given node.
  2. Union: Merges two disjoint sets into a single one.

In the example above the nodes are numbered sequentially, so we were able to use an array. In order to handle other cases, we will use a map to store the parents. It would be better to build an actual tree structure, but a map shall suffice for now.

unordered_map<int, int> parents;
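The post does not show the initialization step explicitly, but a minimal sketch, assuming a hypothetical makeSet helper of my own, could be:

void makeSet(int node) {
    // A brand new node starts as its own representative
    parents[node] = node;
}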

The find operation starts from the given node and recursively checks the parents until it reaches the representative. If the parent of a node is the node itself, it means we have found the representative.

int find(int node) {
    // If a node is its own parent, it is the representative
    if (parents[node] == node) {
        return node;
    }
    int parent = parents[node];
    return find(parent);
}
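For example, with the five-node forest above (and the parent layout I assumed earlier):

find(3);  // returns 1, the representative of {1, 2, 3}
find(4);  // returns 5, the representative of {4, 5}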

The union operation uses the find operation internally. It takes two nodes as input, finds their representatives and then merges the sets by setting one representative's parent to the other.

// Note: union is a reserved keyword in C++, so the function is named unionSets
void unionSets(int a, int b) {
    int rootA = find(a);
    int rootB = find(b);
    if (rootA != rootB) {
        // Merge the two sets by attaching one representative to the other
        parents[rootB] = rootA;
    }
}

Note that if two nodes share the same representative, they already belong to the same set, so no union operation is needed.
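This check is often exposed as a query of its own. A small helper, my own addition rather than part of the original post, could look like this:

bool connected(int a, int b) {
    // Two nodes belong to the same set exactly when they share a representative
    return find(a) == find(b);
}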

Let’s go through an example to understand how these operations work. In the beginning, each node is its own representative. As we see new edges, we make calls to the union function to merge the sets. Assume that we want to add a new edge between 3 and 4 in our example forest.

unionSets(3, 4);

The find calls return representative 1 for node 3 and representative 5 for node 4. The second set is merged under the first one: the parent of 5 is now 1 and all nodes now belong to a single set.
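In terms of the parents map, and still assuming the layout I described earlier, the call changes exactly one entry:

// before unionSets(3, 4): {1: 1, 2: 1, 3: 2, 4: 5, 5: 5}
// after  unionSets(3, 4): {1: 1, 2: 1, 3: 2, 4: 5, 5: 1}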

I am aware that the explanation is a little shallow, but the algorithm is really straightforward. If you compare the latest version of the array with the previous one, you will see that the only change is on node 5; everything else stays exactly the same. If you are struggling to follow, just draw an empty graph and add edges one by one while tracing the algorithm.

Performance Optimization

The performance of both operations depends on the depth of the node in its tree. You might assume that they run in O(log n), but that strictly depends on how balanced the trees are, and our current algorithm does not have any mechanism to keep them balanced. For example, if we change the order of the parameters in our previous example

unionSets(4, 3);

the resulting tree will be different

and it will not perform as well as the previous tree. In the worst-case scenario, performance drops to O(n) because the trees degenerate into longer and longer chains.

Ranked Union

This is obviously something we don’t want. What we want is to keep the trees as balanced as possible, and we can achieve this by making a simple change to the union operation: we attach the smaller tree to the bigger one according to a rank heuristic. The heuristic you choose can be the depth of the tree, the number of nodes in the tree or something entirely different depending on your needs. Whichever you choose, the essence of the change is to keep the trees balanced.

We will use the depth option to improve our algorithm. Again, it might be better to use an actual tree structure in the implementation but for the sake of simplicity, we will keep using maps. We will introduce a second map to store the ranks (depths). Initially, all ranks are set to 1, meaning a single node has a depth of 1.

unordered_map<int, int> parents;
unordered_map<int, int> ranks;
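The initialization sketch from earlier would then also seed the rank (makeSet is still my own hypothetical helper):

void makeSet(int node) {
    parents[node] = node;  // its own representative
    ranks[node] = 1;       // a single node has a depth of 1
}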

Now we will update the union function to always attach the smaller tree to the bigger one. If both trees have the same depth, we attach one to the other and increase the rank of the new root.

// Again named unionSets because union is a reserved keyword in C++
void unionSets(int a, int b) {
    int rootA = find(a);
    int rootB = find(b);
    if (rootA == rootB) {
        return;
    }
    // Attach the shallower tree under the deeper one
    if (ranks[rootA] > ranks[rootB]) {
        parents[rootB] = rootA;
    } else if (ranks[rootA] < ranks[rootB]) {
        parents[rootA] = rootB;
    } else {
        // Equal depths: the merged tree gets one level deeper
        parents[rootA] = rootB;
        ranks[rootB]++;
    }
}

Implementation with a different heuristic is very similar and you can easily adapt it yourself, as sketched below. Since the trees stay balanced, the runtime complexity drops to O(log n) in the worst-case scenario.
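As a sketch of one such alternative, here is a union-by-size variant that tracks the number of nodes instead of the depth; the sizes map and the function name are my own, not from the original post:

unordered_map<int, int> sizes;  // number of nodes in each tree, initialized to 1

void unionBySize(int a, int b) {
    int rootA = find(a);
    int rootB = find(b);
    if (rootA == rootB) {
        return;
    }
    // Always attach the smaller tree under the bigger one
    if (sizes[rootA] < sizes[rootB]) {
        swap(rootA, rootB);
    }
    parents[rootB] = rootA;
    sizes[rootA] += sizes[rootB];
}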

Path Compression

There is one more optimization we can do, this time on the find operation. It is called path compression and aims to flatten the tree on each call to find. When you call find, you recursively go through a number of nodes until you reach the representative. You can simply attach every node on this search path directly to the representative, so the following calls to find run faster.

Take the latest tree we discussed, for example. If you run the find operation on node 3, you will go through 3–2–1–5 in order. We attach every node on this path directly to representative 5 to obtain the tree below.
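In terms of the parents map (still assuming my earlier layout, where unionSets(4, 3) made 5 the root), that single find(3) call flattens things out:

// before find(3): {1: 5, 2: 1, 3: 2, 4: 5, 5: 5}
// after  find(3): {1: 5, 2: 5, 3: 5, 4: 5, 5: 5}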

This way we will acquire much flatter trees. Here is the updated version of the find operation:

int find(int node) {
    if (parents[node] == node) {
        return node;
    }
    int parent = parents[node];
    // Point the node directly to the representative on the way back
    parents[node] = find(parent);
    return parents[node];
}

Note that path compression changes the actual depths of your trees, so the stored ranks are no longer exact; they become upper bounds on the depth, which is still good enough for the union heuristic. You will now have near-constant time queries. A single operation can still cost O(log n), but if you run a lot of operations, the amortized time gets close to O(1).

DFS vs. Union-Find

I believe a comparison between DFS and disjoint sets is beneficial, so here it goes. First, assume that you have a static graph: no changes will occur on the graph whatsoever. You can use DFS to precompute the connected components in a single run and then answer queries in O(1). The DFS runs in O(V + E), where V is the number of vertices and E is the number of edges. Let’s say the number of queries you need to run is Q. Your total runtime to answer all queries will be O(V + E + Q).
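As a reference point, here is a minimal sketch of that precomputation: component labeling with a stack-based DFS over an adjacency list. The container and variable names are my own, not from the original post:

unordered_map<int, vector<int>> adjacent;   // adjacency list of the static graph
unordered_map<int, int> component;          // component label per node

void labelComponents(const vector<int>& nodes) {
    int label = 0;
    for (int start : nodes) {
        if (component.count(start)) {
            continue;  // already labeled in an earlier traversal
        }
        label++;
        stack<int> toVisit;
        toVisit.push(start);
        component[start] = label;
        while (!toVisit.empty()) {
            int node = toVisit.top();
            toVisit.pop();
            for (int next : adjacent[node]) {
                if (!component.count(next)) {
                    component[next] = label;
                    toVisit.push(next);
                }
            }
        }
    }
}

// each query then reduces to an O(1) check: component[a] == component[b]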

In the union-find approach, you start by creating a set of size 1 for each node, which takes O(V). You need to call the union function for each edge in the graph. We previously said each call takes near-constant time with the optimizations; more precisely, it takes O(α(V)), where α is the inverse Ackermann function, a function I won’t pretend to fully understand, but it grows so slowly that it can practically be neglected. Nevertheless, all union operations in total take O(α(V)·E). The same goes for the query operations, so the total becomes O(V + α(V)·(E + Q)), which is not faster than running a DFS.

However, as I have mentioned before, if your graph is being updated with new nodes and edges, the DFS approach becomes useless because you can no longer precompute all the connected components. So we can conclude that the union-find algorithm is the obvious choice if your graph changes dynamically, as in the sketch below. Note that disjoint sets do not support a deletion operation, so you may need a much more complex data structure if you run into that case.
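To make the dynamic scenario concrete, here is a rough sketch of how the structure could be driven as nodes, edges and queries arrive interleaved, reusing the hypothetical helpers from earlier:

makeSet(6);                       // a brand new node joins the graph
unionSets(3, 4);                  // a new edge between 3 and 4 merges two sets
bool sameSet = connected(3, 5);   // true: 3 and 5 now share a representative
unionSets(5, 6);                  // another edge, handled in near-constant time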

If you actually read up to this point, thanks for your patience. I will be in quarantine for five more days, but it has been quite productive so far. And if you want to practice the concepts here, there is a dedicated section on HackerRank for the disjoint set data structure. Take care!
