I have some problems with some processing that I must do in an R environment in Databricks. I have two dataframes.
df1: contains the rows and columns (index1 and index2) and the value to be assigned
| index1 | index2 | value |
|---|---|---|
| 142342315 | 1423423341 | A |
| 142342315 | 14234235325 | A |
| 1423423166 | 14234235325 | C |
| 1423423166 | 1423423341 | C |
| ... | ... | ... |
df2: generated by an array with all combinations of index
| rowsname | 142342315 | 1423423166 | 1423423341 | 14234235325 |
|---|---|---|---|---|
| 142342315 | NA | NA | NA | NA |
| 1423423166 | NA | NA | NA | NA |
| 1423423341 | NA | NA | NA | NA |
| 14234235325 | NA | NA | NA | NA |
The first thing that came to mind is to make a loop to assign the value to each location in the dataframe. Something like that:
for (i in 1:nrow(df1)) {
df2[df1$index_1[i], df1$index_2[i]] <- df1$temp[i]
df2[df1$index_2[i], df1$index_1[i]] <- df1$temp[i]
}
But, the problem is that it involves a large number of combinations, several hours of execution.
I tried using sparseMatrix but my indices cannot be converted to integers in this case. Can you think of a way to do it optimally?
I tried for loops, it's so slowly. I tried using sparseMatrix but my indices cannot be converted to integers in this case.
If I understand it correctly, you are looking to produce a square matrix with all names across both dimensions. I think your initial approach should work, if you preallocate the output matrix and replace the for loop with some vectorized version. How about the following?
Content of variable "res":