Cluster a stream of items with constraints

45 views Asked by At

I'm looking to partition a sequence (non-repeatable stream) of items coming in. I'm sure this is some standard k-partite algo from graph theory, but I cannot remember its name / approach – can you help?

Each item is of the form (x1, x2, x3, x4, …, xn), i.e. a tuple of n components, where x1 comes from a set of possible values X1.

Example: n=3, X1={A, B, C, -}, X2={0, 1, 2, -}, X3={green, red, blue, -}. Example items: (A, 0, green), (C, 2, blue), (A, 2, green), (A, -, -), (-, 1, red), etc.

The - value is a special value that means "this value is not known yet". Each item has at least one value known, i.e. (-, -, -) is not possible.

Now the clustering must group together all items that have no conflict in their values. For example (A, -, blue) can be grouped with (A, 2, -) because the only overlap is in the first component, where both share the same value A => both items should end up in the same cluster.

But (A, -, blue) cannot be grouped with (B, -, blue) because there's a conflict A != B.

Similarly (A, -, -) cannot be directly grouped with (-, -, green) because they share no known common value.

The clustering must be transitive, in the sense that (A, -, -) and (-, -, green) should end up in the same cluster if there's another item (A, -, green) that connects them. All three should become a part of the same cluster.

In practice, n is ~10 and each Xi set of possible values for component i is in the billions. So the potential set of all items is practically infinite. I have an input stream of a few hundred million of such items that I need to group together quickly, as they come in.

A greedy algo is OK, in case the constraints above don't lead to a unique clustering solution. But these constraint must be satisfied:

  1. No two items in a cluster contain conflicting values.
  2. Different clusters differ in at least one conflicting value.
  3. In order to merge two items, they must share at least one value.
1

There are 1 answers

4
micans On

Indeed, the clustering solution is not necessarily unique. Consider (A,-,blue) (A,2,-) and (A,3,-). The first pairs with either of the last, but the last two are in conflict.

This problem is known as stream clustering (wiki link) or online clustering (a useful query term to search for algorithms).

For a greedy approach, you could, given a new item, suffice with picking the first admissible cluster that you already have. Each cluster that you have can be represented by a single tuple since you allow no conflicts. A tuple may start out with any number of unknowns, e.g. (A, -, -). Any new item must be compatible with this, so the cluster representation can be updated to e.g. (A, -, green) and then later (A, 14, green).

The algorithm is then to check a new item against the current cluster representations (this sub-step may require and/or allow optimisation by using appropriate datastructures), add it to an existing cluster (and update the representation if needed) if it is not conflicting, start a new cluster otherwise.

Unless you have more specific requirements on the clustering process I don't think you need any graph-theory here. One aspect is whether you can encode all your sets using integers - if possible this will be quicker than dealing with string comparisons.