
Disjoint-set data structure

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets (replacing them by their union), and finding a representative member of a set. The last operation makes it possible to find out efficiently if any two elements are in the same or different sets.

Disjoint-set/Union-find forest
Type: multiway tree
Invented: 1964
Invented by: Bernard A. Galler and Michael J. Fischer

Time and space complexity in big O notation:

Operation | Average                | Worst case
Search    | O(α(n))[1] (amortized) | O(α(n))[1] (amortized)
Insert    | O(1)[1]                | O(1)[1]
Space     | O(n)[1]                | O(n)[1]

While there are several ways of implementing disjoint-set data structures, in practice they are often identified with a particular implementation called a disjoint-set forest. This is a specialized type of forest which performs unions and finds in near-constant amortized time. To perform a sequence of m addition, union, or find operations on a disjoint-set forest with n nodes requires total time O(mα(n)), where α(n) is the extremely slow-growing inverse Ackermann function. Disjoint-set forests do not guarantee this performance on a per-operation basis. Individual union and find operations can take longer than a constant times α(n) time, but each operation causes the disjoint-set forest to adjust itself so that successive operations are faster. Disjoint-set forests are both asymptotically optimal and practically efficient.

Disjoint-set data structures play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph. The importance of minimum spanning trees means that disjoint-set data structures underlie a wide variety of algorithms. In addition, disjoint-set data structures also have applications to symbolic computation, as well as in compilers, especially for register allocation problems.

History

Disjoint-set forests were first described by Bernard A. Galler and Michael J. Fischer in 1964.[2] In 1973, their time complexity was bounded to O(log* n), the iterated logarithm of n, by Hopcroft and Ullman.[3] In 1975, Robert Tarjan was the first to prove the O(m α(n)) (inverse Ackermann function) upper bound on the algorithm's time complexity, and he also proved it to be tight.[4] In 1979, he showed that this was the lower bound for a certain class of algorithms, which includes the Galler–Fischer structure.[5] In 1989, Fredman and Saks showed that Ω(α(n)) (amortized) words of O(log n) bits must be accessed by any disjoint-set data structure per operation,[6] thereby proving the optimality of the data structure in this model.

In 1991, Galil and Italiano published a survey of data structures for disjoint-sets.[7]

In 1994, Richard J. Anderson and Heather Woll described a parallelized version of Union–Find that never needs to block.[8]

In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a semi-persistent version of the disjoint-set forest data structure and formalized its correctness using the proof assistant Coq.[9] "Semi-persistent" means that previous versions of the structure are efficiently retained, but accessing previous versions of the data structure invalidates later ones. Their fastest implementation achieves performance almost as efficient as the non-persistent algorithm. They do not perform a complexity analysis.

Variants of disjoint-set data structures with better performance on a restricted class of problems have also been considered. Gabow and Tarjan showed that if the possible unions are restricted in certain ways, then a truly linear time algorithm is possible.[10]

Representation

Each node in a disjoint-set forest consists of a pointer and some auxiliary information, either a size or a rank (but not both). The pointers are used to make parent pointer trees, where each node that is not the root of a tree points to its parent. To distinguish root nodes from others, their parent pointers have invalid values, such as a circular reference to the node or a sentinel value. Each tree represents a set stored in the forest, with the members of the set being the nodes in the tree. Root nodes provide set representatives: Two nodes are in the same set if and only if the roots of the trees containing the nodes are equal.

Nodes in the forest can be stored in any way convenient to the application, but a common technique is to store them in an array. In this case, parents can be indicated by their array index. Every array entry requires Θ(log n) bits of storage for the parent pointer. A comparable or lesser amount of storage is required for the rest of the entry, so the number of bits required to store the forest is Θ(n log n). If an implementation uses fixed size nodes (thereby limiting the maximum size of the forest that can be stored), then the necessary storage is linear in n.
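As an illustration, the following is a minimal Python sketch of the array-based representation described above; the example forest and the helper name is_root are invented for this sketch, not taken from the article:

# Array-based forest: node i's parent is parent[i]; a root points to itself.
parent = [0, 0, 1, 3, 3]   # two trees: {0, 1, 2} rooted at 0, and {3, 4} rooted at 3

def is_root(i):
    # The "invalid" parent value chosen here is a self-loop, as suggested above.
    return parent[i] == i

# Nodes 0 and 3 are roots, so this forest stores the sets {0, 1, 2} and {3, 4}.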

Operations

Disjoint-set data structures support three operations: Making a new set containing a new element; Finding the representative of the set containing a given element; and Merging two sets.

Making new sets

The MakeSet operation adds a new element into a new set containing only the new element, and the new set is added to the data structure. If the data structure is instead viewed as a partition of a set, then the MakeSet operation enlarges the set by adding the new element, and it extends the existing partition by putting the new element into a new subset containing only the new element.

In a disjoint-set forest, MakeSet initializes the node's parent pointer and the node's size or rank. If a root is represented by a node that points to itself, then adding an element can be described using the following pseudocode:

function MakeSet(x) is
    if x is not already in the forest then
        x.parent := x
        x.size := 1     // if nodes store size
        x.rank := 0     // if nodes store rank
    end if
end function

This operation has constant time complexity. In particular, initializing a disjoint-set forest with n nodes requires O(n) time.

In practice, MakeSet must be preceded by an operation that allocates memory to hold x. As long as memory allocation is an amortized constant-time operation, as it is for a good dynamic array implementation, it does not change the asymptotic performance of the disjoint-set forest.

Finding set representatives

The Find operation follows the chain of parent pointers from a specified query node x until it reaches a root element. This root element represents the set to which x belongs and may be x itself. Find returns the root element it reaches.

Performing a Find operation presents an important opportunity for improving the forest. The time in a Find operation is spent chasing parent pointers, so a flatter tree leads to faster Find operations. When a Find is executed, there is no faster way to reach the root than by following each parent pointer in succession. However, the parent pointers visited during this search can be updated to point closer to the root. Because every element visited on the way to a root is part of the same set, this does not change the sets stored in the forest. But it makes future Find operations faster, not only for the nodes between the query node and the root, but also for their descendants. This updating is an important part of the disjoint-set forest's amortized performance guarantee.

There are several algorithms for Find that achieve the asymptotically optimal time complexity. One family of algorithms, known as path compression, makes every node between the query node and the root point to the root. Path compression can be implemented using a simple recursion as follows:

function Find(x) is
    if x.parent ≠ x then
        x.parent := Find(x.parent)
        return x.parent
    else
        return x
    end if
end function

This implementation makes two passes, one up the tree and one back down. It requires enough scratch memory to store the path from the query node to the root (in the above pseudocode, the path is implicitly represented using the call stack). This can be decreased to a constant amount of memory by performing both passes in the same direction. The constant memory implementation walks from the query node to the root twice, once to find the root and once to update pointers:

function Find(x) is
    root := x
    while root.parent ≠ root do
        root := root.parent
    end while

    while x.parent ≠ root do
        parent := x.parent
        x.parent := root
        x := parent
    end while

    return root
end function

Tarjan and Van Leeuwen also developed one-pass Find algorithms that retain the same worst-case complexity but are more efficient in practice.[4] These are called path splitting and path halving. Both of these update the parent pointers of nodes on the path between the query node and the root. Path splitting replaces every parent pointer on that path by a pointer to the node's grandparent:

function Find(x) is
    while x.parent ≠ x do
        (x, x.parent) := (x.parent, x.parent.parent)
    end while
    return x
end function

Path halving works similarly but replaces only every other parent pointer:

function Find(x) is
    while x.parent ≠ x do
        x.parent := x.parent.parent
        x := x.parent
    end while
    return x
end function

Merging two sets

MakeSet creates 8 singletons.

After some operations of Union, some sets are grouped together.

The operation Union(x, y) replaces the set containing x and the set containing y with their union. Union first uses Find to determine the roots of the trees containing x and y. If the roots are the same, there is nothing more to do. Otherwise, the two trees must be merged. This is done by either setting the parent pointer of x's root to y's, or setting the parent pointer of y's root to x's.

The choice of which node becomes the parent has consequences for the complexity of future operations on the tree. If it is done carelessly, trees can become excessively tall. For example, suppose that Union always made the tree containing x a subtree of the tree containing y. Begin with a forest that has just been initialized with elements 1, 2, 3, ..., n, and execute Union(1, 2), Union(2, 3), ..., Union(n − 1, n). The resulting forest contains a single tree whose root is n, and the path from 1 to n passes through every node in the tree. For this forest, the time to run Find(1) is O(n). The sketch below illustrates this degenerate case.
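The following self-contained Python sketch (an illustration written for this passage, not code from the article) shows the degenerate chain produced by careless linking:

# Naive forest with no balancing: always make x's root a child of y's root.
parent = list(range(8))          # 8 singleton sets; parent[i] == i marks a root

def find(x):                     # naive Find: no path compression
    while parent[x] != x:
        x = parent[x]
    return x

def union(x, y):                 # careless Union: x's root always becomes the child
    parent[find(x)] = find(y)

for i in range(7):               # Union(0, 1), Union(1, 2), ..., Union(6, 7)
    union(i, i + 1)

# The forest is now the single chain 0 -> 1 -> ... -> 7, so find(0)
# traverses every node: linear time for this degenerate shape.
print(find(0))                   # prints 7 after walking the whole chain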

In an efficient implementation, tree height is controlled using union by size or union by rank. Both of these require a node to store information besides just its parent pointer. This information is used to decide which root becomes the new parent. Both strategies ensure that trees do not become too deep.

Union by size

In the case of union by size, a node stores its size, which is simply its number of descendants (including the node itself). When the trees with roots x and y are merged, the node with more descendants becomes the parent. If the two nodes have the same number of descendants, then either one can become the parent. In both cases, the size of the new parent node is set to its new total number of descendants.

function Union(x, y) is
    // Replace nodes by roots
    x := Find(x)
    y := Find(y)

    if x = y then
        return  // x and y are already in the same set
    end if

    // If necessary, swap variables to ensure that
    // x has at least as many descendants as y
    if x.size < y.size then
        (x, y) := (y, x)
    end if

    // Make x the new root
    y.parent := x
    // Update the size of x
    x.size := x.size + y.size
end function

The number of bits necessary to store the size is clearly the number of bits necessary to store n. This adds a constant factor to the forest's required storage.

Union by rank

For union by rank, a node stores its rank, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots x and y, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of x and y do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct. In pseudocode, union by rank is:

function Union(x, y) is
    // Replace nodes by roots
    x := Find(x)
    y := Find(y)

    if x = y then
        return  // x and y are already in the same set
    end if

    // If necessary, rename variables to ensure that
    // x has rank at least as large as that of y
    if x.rank < y.rank then
        (x, y) := (y, x)
    end if

    // Make x the new root
    y.parent := x
    // If necessary, increment the rank of x
    if x.rank = y.rank then
        x.rank := x.rank + 1
    end if
end function

It can be shown that every node has rank ⌊log n⌋ or less.[11] Consequently each rank can be stored in O(log log n) bits and all the ranks can be stored in O(n log log n) bits. This makes the ranks an asymptotically negligible portion of the forest's size.

It is clear from the above implementations that the size and rank of a node do not matter unless a node is the root of a tree. Once a node becomes a child, its size and rank are never accessed again.
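Combining the pieces above, here is a runnable Python sketch of a disjoint-set forest using union by size with two-pass path compression; the class and method names are chosen for illustration and are not prescribed by the article:

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))   # each node starts as its own root
        self.size = [1] * n            # number of nodes in the tree rooted here

    def find(self, x):
        # Constant-memory, two-pass Find: locate the root, then compress the path.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        x, y = self.find(x), self.find(y)
        if x == y:
            return False               # already in the same set
        if self.size[x] < self.size[y]:
            x, y = y, x                # ensure x roots the larger tree
        self.parent[y] = x             # attach the smaller root under the larger
        self.size[x] += self.size[y]
        return True

For example, after ds = DisjointSet(10) and ds.union(1, 2), the test ds.find(1) == ds.find(2) returns True, while ds.find(1) == ds.find(3) returns False.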

Time complexity

A disjoint-set forest implementation in which Find does not update parent pointers, and in which Union does not attempt to control tree heights, can have trees with height O(n). In such a situation, the Find and Union operations require O(n) time.

If an implementation uses path compression alone, then a sequence of n MakeSet operations, followed by up to n − 1 Union operations and f Find operations, has a worst-case running time of Θ(n + f · (1 + log_{2+f/n} n)).[11]

Using union by rank, but without updating parent pointers during Find, gives a running time of Θ(m log n) for m operations of any type, up to n of which are MakeSet operations.[11]

The combination of path compression, splitting, or halving, with union by size or by rank, reduces the running time for m operations of any type, up to n of which are MakeSet operations, to Θ(m α(n)).[4][5] This makes the amortized running time of each operation Θ(α(n)). This is asymptotically optimal, meaning that every disjoint set data structure must use Ω(α(n)) amortized time per operation.[6] Here, the function α(n) is the inverse Ackermann function. The inverse Ackermann function grows extraordinarily slowly, so this factor is 4 or less for any n that can actually be written in the physical universe. This makes disjoint-set operations practically amortized constant time.

Proof of O(m log* n) time complexity of Union-Find

The precise analysis of the performance of a disjoint-set forest is somewhat intricate. However, there is a much simpler analysis that proves that the amortized time for any m Find or Union operations on a disjoint-set forest containing n objects is O(m log* n), where log* denotes the iterated logarithm.[12][13][14][15]

Lemma 1: As the Find function follows the path along to the root, the ranks of the nodes it encounters are increasing.

Proof

We claim that as Find and Union operations are applied to the data set, this fact remains true over time. Initially, when each node is the root of its own tree, it's trivially true. The only case when the rank of a node might be changed is when the Union by Rank operation is applied. In this case, a tree with smaller rank will be attached to a tree with greater rank, rather than vice versa. And during the Find operation, all nodes visited along the path will be attached to the root, which has larger rank than its children, so this operation won't change this fact either.

Lemma 2: A node u which is root of a subtree with rank r has at least 2^r nodes.

Proof

Initially, when each node is the root of its own tree, it's trivially true. Assume that a node u with rank r has at least 2^r nodes. Then when two trees with rank r are merged using the operation Union by Rank, a tree with rank r + 1 results, the root of which has at least 2^r + 2^r = 2^{r+1} nodes.

Lemma 3: The maximum number of nodes of rank r is at most n/2^r.

Proof

From Lemma 2, we know that a node u which is root of a subtree with rank r has at least 2^r nodes. We will get the maximum number of nodes of rank r when each node with rank r is the root of a tree that has exactly 2^r nodes. In this case, the number of nodes of rank r is n/2^r.

For convenience, we define "bucket" here: a bucket is a set that contains vertices with particular ranks.

We create some buckets and put vertices into the buckets according to their ranks inductively. That is, vertices with rank 0 go into the zeroth bucket, vertices with rank 1 go into the first bucket, and vertices with ranks 2 and 3 go into the second bucket. If the B-th bucket contains vertices with ranks from the interval [r, 2^r − 1] = [r, R − 1], then the (B+1)-st bucket will contain vertices with ranks from the interval [R, 2^R − 1]. Concretely, the buckets contain the rank sets {0}, {1}, {2, 3}, {4, ..., 15}, {16, ..., 65535}, and so on.

Proof of O(log* n) Union Find

We can make two observations about the buckets.

  1. The total number of buckets is at most log* n.
     Proof: When we go from one bucket to the next, we add one more two to the power, that is, the bucket following [B, 2^B − 1] is [2^B, 2^{2^B} − 1].
  2. The maximum number of elements in bucket [B, 2^B − 1] is at most 2n/2^B.
     Proof: The maximum number of elements in bucket [B, 2^B − 1] is at most n/2^B + n/2^{B+1} + ⋯ + n/2^{2^B − 1} ≤ 2n/2^B.

Let F represent the list of "find" operations performed, and let

  T1 = ∑_F (link to the root),
  T2 = ∑_F (number of links traversed where the buckets are different),
  T3 = ∑_F (number of links traversed where the buckets are the same).

Then the total cost of m finds is T = T1 + T2 + T3.

Since each find operation makes exactly one traversal that leads to a root, we have T1 = O(m).

Also, from the bound above on the number of buckets, we have T2 = O(m log* n).

For T3, suppose we are traversing an edge from u to v, where u and v have rank in the bucket [B, 2^B − 1] and v is not the root (at the time of this traversal; otherwise the traversal would be accounted for in T1). Fix u and consider the sequence v_1, v_2, ..., v_k of nodes that take the role of v in different find operations. Because of path compression and not accounting for the edge to a root, this sequence contains only different nodes, and because of Lemma 1 we know that the ranks of the nodes in this sequence are strictly increasing. Because both nodes are in the bucket, we can conclude that the length k of the sequence (the number of times node u is attached to a different root in the same bucket) is at most the number of ranks in the bucket [B, 2^B − 1], that is, at most 2^B − 1 − B < 2^B.

Therefore, T3 ≤ ∑_{[B, 2^B − 1]} ∑_u 2^B.

From Observations 1 and 2, we can conclude that T3 ≤ ∑_B 2^B · (2n / 2^B) ≤ 2n log* n.

Therefore, T = T1 + T2 + T3 = O(m log* n).

Other structures

Better worst-case time per operation

The worst-case time of the Find operation in trees with Union by rank or Union by weight is Θ(log n) (i.e., it is O(log n) and this bound is tight). In 1985, N. Blum gave an implementation of the operations that does not use path compression but compresses trees during Union. His implementation runs in O(log n / log log n) time per operation,[16] and thus in comparison with Galler and Fischer's structure it has a better worst-case time per operation, but inferior amortized time. In 1999, Alstrup et al. gave a structure that has optimal worst-case time O(log n / log log n) together with inverse-Ackermann amortized time.[17]

Deletion

The regular implementation as disjoint-set forests does not react favorably to the deletion of elements, in the sense that the time for Find will not improve as a result of the decrease in the number of elements. However, there exist modern implementations that allow for constant-time deletion and where the time bound for Find depends on the current number of elements.[18][19]

Applications

A demo of Union-Find when using Kruskal's algorithm to find a minimum spanning tree.

Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of unification.[20]

This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also a key component in implementing Kruskal's algorithm to find the minimum spanning tree of a graph.
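To make the Kruskal connection concrete, here is a hedged Python sketch built on the DisjointSet class sketched earlier; the function name and the small example graph are invented for illustration:

def kruskal(n, edges):
    """edges: list of (weight, u, v) tuples; returns the MST as (u, v, weight) edges."""
    ds = DisjointSet(n)
    mst = []
    for w, u, v in sorted(edges):      # examine edges in increasing order of weight
        if ds.union(u, v):             # union succeeds exactly when no cycle would form
            mst.append((u, v, w))
    return mst

edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
print(kruskal(4, edges))               # [(0, 1, 1), (1, 3, 2), (1, 2, 3)]

The same union test answers the cycle question mentioned above: an edge (u, v) would create a cycle exactly when find(u) == find(v) already holds.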

The Hoshen-Kopelman algorithm also makes use of a Union-Find data structure.

See also

  • Partition refinement, a different data structure for maintaining disjoint sets, with updates that split sets apart rather than merging them together
  • Dynamic connectivity

References

  1. ^ a b c d e f Tarjan, Robert Endre (1975). "Efficiency of a Good But Not Linear Set Union Algorithm". Journal of the ACM. 22 (2): 215–225. doi:10.1145/321879.321884. hdl:1813/5942. S2CID 11105749.
  2. ^ Galler, Bernard A.; Fischer, Michael J. (May 1964). "An improved equivalence algorithm". Communications of the ACM. 7 (5): 301–303. doi:10.1145/364099.364331. S2CID 9034016. The paper originating disjoint-set forests.
  3. ^ Hopcroft, J. E.; Ullman, J. D. (1973). "Set Merging Algorithms". SIAM Journal on Computing. 2 (4): 294–303. doi:10.1137/0202024.
  4. ^ a b c Tarjan, Robert E.; van Leeuwen, Jan (1984). "Worst-case analysis of set union algorithms". Journal of the ACM. 31 (2): 245–281. doi:10.1145/62.2160. S2CID 5363073.
  5. ^ a b Tarjan, Robert Endre (1979). "A class of algorithms which require non-linear time to maintain disjoint sets". Journal of Computer and System Sciences. 18 (2): 110–127. doi:10.1016/0022-0000(79)90042-4.
  6. ^ a b Fredman, M.; Saks, M. (May 1989). "The cell probe complexity of dynamic data structures". Proceedings of the twenty-first annual ACM symposium on Theory of computing - STOC '89. pp. 345–354. doi:10.1145/73007.73040. ISBN 0897913078. S2CID 13470414. Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Find's and n−1 Union's, beginning with n singleton sets.
  7. ^ Galil, Z.; Italiano, G. (1991). "Data structures and algorithms for disjoint set union problems". ACM Computing Surveys. 23 (3): 319–344. doi:10.1145/116873.116878. S2CID 207160759.
  8. ^ Anderson, Richard J.; Woll, Heather (1994). Wait-free Parallel Algorithms for the Union-Find Problem. 23rd ACM Symposium on Theory of Computing. pp. 370–380.
  9. ^ Conchon, Sylvain; Filliâtre, Jean-Christophe (October 2007). "A Persistent Union-Find Data Structure". ACM SIGPLAN Workshop on ML. Freiburg, Germany.
  10. ^ Gabow, Harold N.; Tarjan, Robert Endre (1985). "A linear-time algorithm for a special case of disjoint set union". Journal of Computer and System Sciences. 30 (2): 209–221. ISSN 0022-0000. doi:10.1016/0022-0000(85)90014-5.
  11. ^ a b c Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). "Chapter 21: Data structures for Disjoint Sets". Introduction to Algorithms (Third ed.). MIT Press. pp. 571–572. ISBN 978-0-262-03384-8.
  12. ^ Raimund Seidel, Micha Sharir. "Top-down analysis of path compression", SIAM J. Comput. 34(3):515–525, 2005
  13. ^ Tarjan, Robert Endre (1975). "Efficiency of a Good But Not Linear Set Union Algorithm". Journal of the ACM. 22 (2): 215–225. doi:10.1145/321879.321884. hdl:1813/5942. S2CID 11105749.
  14. ^ Hopcroft, J. E.; Ullman, J. D. (1973). "Set Merging Algorithms". SIAM Journal on Computing. 2 (4): 294–303. doi:10.1137/0202024.
  15. ^ Robert E. Tarjan and Jan van Leeuwen. Worst-case analysis of set union algorithms. Journal of the ACM, 31(2):245–281, 1984.
  16. ^ Blum, Norbert (1985). "On the Single-Operation Worst-Case Time Complexity of the Disjoint Set Union Problem". 2nd Symp. On Theoretical Aspects of Computer Science: 32–38.
  17. ^ Alstrup, Stephen; Ben-Amram, Amir M.; Rauhe, Theis (1999). "Worst-case and amortised optimality in union-find (Extended abstract)". Proceedings of the thirty-first annual ACM symposium on Theory of Computing. pp. 499–506. doi:10.1145/301250.301383. ISBN 1581130678. S2CID 100111.
  18. ^ Alstrup, Stephen; Thorup, Mikkel; Gørtz, Inge Li; Rauhe, Theis; Zwick, Uri (2014). "Union-Find with Constant Time Deletions". ACM Transactions on Algorithms. 11 (1): 6:1–6:28. doi:10.1145/2636922. S2CID 12767012.
  19. ^ Ben-Amram, Amir M.; Yoffe, Simon (2011). "A simple and efficient Union-Find-Delete algorithm". Theoretical Computer Science. 412 (4–5): 487–492. doi:10.1016/j.tcs.2010.11.005.
  20. ^ Knight, Kevin (1989). "Unification: A multidisciplinary survey" (PDF). ACM Computing Surveys. 21: 93–124. doi:10.1145/62029.62030. S2CID 14619034.

External links

  • C++ implementation, part of the Boost C++ libraries
  • Java implementation, part of JGraphT library
  • Javascript implementation
  • Python implementation
