A set union find algorithm


I have thousands of lines of 1 to 100 numbers, every line define a group of numbers and a relationship among them. I need to get the sets of related numbers.

Little Example: If I have this 7 lines of data

T1 T2
T6 T1
T5 T4
T3 T4 T7

I need a not so slow algorithm to know that the sets here are:

T1 T2 T6 (because T1 is related with T2 in the first line and T1 related with T6 in the line 5)
T3 T4 T5 T7 (because T5 is with T4 in line 6 and T3 is with T4 and T7 in line 7)

but when you have very big sets is painfully slow to do a search of a T(x) in every big set, and do unions of sets... etc.

Do you have a hint to do this in a not so brute force manner?

I'm trying to do this in Python.

1/10/2015 3:16:51 PM

Accepted Answer

Once you have built the data structure, exactly what queries do you want to run against it? Show us your existing code. What is a T(x)? You talk about "groups of numbers" but your sample data shows T1, T2, etc; please explain.

Have you read this: http://en.wikipedia.org/wiki/Disjoint-set_data_structure

Try looking at this Python implementation: http://code.activestate.com/recipes/215912-union-find-data-structure/

OR you can lash up something rather simple and understandable yourself, e.g.

[Update: totally new code]

class DisjointSet(object):

    def __init__(self):
        self.leader = {} # maps a member to the group's leader
        self.group = {} # maps a group leader to the group (which is a set)

    def add(self, a, b):
        leadera = self.leader.get(a)
        leaderb = self.leader.get(b)
        if leadera is not None:
            if leaderb is not None:
                if leadera == leaderb: return # nothing to do
                groupa = self.group[leadera]
                groupb = self.group[leaderb]
                if len(groupa) < len(groupb):
                    a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
                groupa |= groupb
                del self.group[leaderb]
                for k in groupb:
                    self.leader[k] = leadera
                self.leader[b] = leadera
            if leaderb is not None:
                self.leader[a] = leaderb
                self.leader[a] = self.leader[b] = a
                self.group[a] = set([a, b])

data = """T1 T2
T3 T4
T5 T1
T3 T6
T7 T8
T3 T7
T1 T9"""
# data is chosen to demonstrate each of 5 paths in the code
from pprint import pprint as pp
ds = DisjointSet()
for line in data.splitlines():
    x, y = line.split()
    ds.add(x, y)
    print x, y

and here is the output from the last step:

T1 T9
{'T1': 'T1',
 'T2': 'T1',
 'T3': 'T3',
 'T4': 'T3',
 'T5': 'T1',
 'T6': 'T3',
 'T7': 'T3',
 'T8': 'T3',
 'T9': 'T1',
 'TA': 'T1'}
{'T1': set(['T1', 'T2', 'T5', 'T9', 'TA']),
 'T3': set(['T3', 'T4', 'T6', 'T7', 'T8'])}
6/18/2010 1:02:03 PM

Treat your numbers T1, T2, etc. as graph vertices. Any two numbers appearing together on a line are joined by an edge. Then your problem amounts to finding all the connected components in this graph. You can do this by starting with T1, then doing a breadth-first or depth-first search to find all vertices reachable from that point. Mark all these vertices as belonging to equivalence class T1. Then find the next unmarked vertex Ti, find all the yet-unmarked nodes reachable from there, and label them as belonging to equivalence class Ti. Continue until all the vertices are marked.

For a graph with n vertices and e edges, this algorithm requires O(e) time and space to build the adjacency lists, and O(n) time and space to identify all the connected components once the graph structure is built.

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow