Below is a JOIN that I have done. I believe it's not efficient. My u-sql job is taking a long time on this, it's been over six hours. I have lots of data too. My input is 2 GB (millions of folders) LHS & RHS in example below.
I need to find the sub directory count, depth wise where in my example the root is adl://x.azuredatalakestore.net/
Final outcome expected is below, where each level in the directory has the sum of all child directories.
| WorkDir| Depth | Dir | NumberOfSubDirectories | IsDirectory|
| -----------------------------------------------------------------------------------------|
| / | 0 |adl://x.azuredatalakestore.net/ | 5 | True |
| / | 1 |adl://x.azuredatalakestore.net/backup/ | 4 | True |
| / | 2 |adl://x.azuredatalakestore.net/backup/jenkins/ | 0 | True |
| / | 2 |adl://x.azuredatalakestore.net/backup/viewer/ | 2 | True |
| / | 3 |adl://x.azuredatalakestore.net/backup/viewer/2018/ | 1 | True |
| / | 4 |adl://x.azuredatalakestore.net/backup/viewer/2018/07/ | 0 | True |
Input for join. Left Hand side (LHS) and Right Hand Side (RHS) is same table
| WorkDir | Depth | Dir | IsDirectory | |
|---|---|---|---|---|
| / | 0 | adl://x.azuredatalakestore.net/ | TRUE | |
| / | 1 | adl://x.azuredatalakestore.net/backup/ | TRUE | |
| / | 2 | adl://x.azuredatalakestore.net/backup/jenkins/ | TRUE | |
| / | 2 | adl://x.azuredatalakestore.net/backup/viewer/ | TRUE | |
| / | 3 | adl://x.azuredatalakestore.net/backup/viewer/2018/ | TRUE | |
| / | 4 | adl://x.azuredatalakestore.net/backup/viewer/2018/07/ | TRUE |
My Join statement
// Get Sub Directory from Directory list
@DirWithSubDir =
SELECT
b.WorkDir AS WorkDirD,
b.Depth AS DepthD,
b.Dir AS DirD,
b.IsDirectory AS IsDirectoryD,
COUNT(b.Dir)-1 AS NumberOfSubDirectoriesD
FROM @stream_information AS a
JOIN
@stream_information AS b
ON a.IsDirectory == b.IsDirectory
WHERE a.Dir.Contains(b.Dir)
GROUP BY b.WorkDir,
b.Depth,
b.Dir,
b.IsDirectory
;
I have another join statement where I get the data in bytes of the folders, it's similar to this, that is slow too.
We can make combination of all Directories from each line and then do an inner join on the same recordset. The "Dir.Contains(b.Dir)" join was a cross join and hence very expensive.