I want to group vertices into connected components using GraphFrames' connectedComponents.
My vertex and edge DataFrames look like this:
vertex = spark.createDataFrame(
    [
        ('a', 1),
        ('a', 2),
        ('a', 3),
        ('b', 1),
        ('b', 5),
        ('b', 6),
        ('b', 7),
        ('b', 8),
        ('b', 3),
        ('c', 2),
        ('c', 3),
    ],
    ['property', 'id'],
)
edge = spark.createDataFrame(
    [
        (1, 2),
        (2, 3),
        (1, 5),
        (5, 6),
        (6, 7),
        (7, 8),
        (5, 6),
    ],
    ['src', 'dst'],
)
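I create a GraphFrame from vertex and edge and run connectedComponents, but every id ends up in the same component, because the edges link all the ids together regardless of "property".
What I need is for components to be computed separately within each "property" value, like this: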
result.show()
+--------+---+---------+
|property| id|component|
+--------+---+---------+
|       a|  1|        1|
|       a|  2|        1|
|       a|  3|        1|
|       b|  1|        2|
|       b|  5|        2|
|       b|  6|        2|
|       b|  7|        2|
|       b|  8|        2|
|       b|  3|        3|
|       c|  2|        4|
|       c|  3|        4|
+--------+---+---------+
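To make the expected output precise, here is a small plain-Python sketch (not Spark, and not GraphFrames) of the semantics I have in mind. It assumes that an edge counts inside a property only when both of its endpoints appear under that property, and it treats each (property, id) pair as its own vertex. The helper name components_per_property and the union-find approach are only illustrative choices, not anything taken from the GraphFrames API.

```python
from collections import defaultdict

# Sample data copied from the DataFrames above.
vertices = [
    ("a", 1), ("a", 2), ("a", 3),
    ("b", 1), ("b", 5), ("b", 6), ("b", 7), ("b", 8), ("b", 3),
    ("c", 2), ("c", 3),
]
edges = [(1, 2), (2, 3), (1, 5), (5, 6), (6, 7), (7, 8), (5, 6)]


def components_per_property(vertices, edges):
    """Label every (property, id) pair with a component number.

    Hypothetical helper: a union-find over composite (property, id) keys.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Which properties does each id appear under?
    props_by_id = defaultdict(set)
    for prop, vid in vertices:
        props_by_id[vid].add(prop)

    # Assumption: an edge applies within a property only if BOTH
    # endpoints carry that property.
    for src, dst in edges:
        for prop in props_by_id[src] & props_by_id[dst]:
            parent[find((prop, src))] = find((prop, dst))

    # Number the components in order of first appearance.
    labels = {}
    result = []
    for prop, vid in vertices:
        root = find((prop, vid))
        labels.setdefault(root, len(labels) + 1)
        result.append((prop, vid, labels[root]))
    return result


result = components_per_property(vertices, edges)
for row in result:
    print(row)
```

On the sample data this reproduces the grouping in the table above: for example, id 3 under "b" is isolated because id 2 does not appear under "b". The sketch only pins down the semantics and says nothing about how to do this at scale.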
My naive idea is to run connectedComponents once per property in a for loop. But the real vertex data is very large (countDistinct('property') returns over 30M), so a for loop seems very inefficient.
Is there another way to do this?