I'm pretty new to pyspark, having moved over from mssql. I've done a lot of reading and understand the basics, but I wanted an opinion on how to approach this in Azure Synapse.
The parquet files are in a gen2 data lake. The cluster is medium-sized, with 3 nodes of 8 cores each.
The parquet files hold a dataset of over 300 columns. Now I'm only interested in 26 of these.
There is an ID column, 24 string columns where each entry is one of around 5000 possible codes (5-6 characters long), and a 26th column that is a comma-separated concatenation of the 24 code columns.
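Roughly the shape of the slice I care about (the path and column names here are just placeholders for my real ones):

```python
# placeholder path and column names, just to show the shape of the data
code_cols = [f"code_{i}" for i in range(1, 25)]               # the 24 code columns
df = (spark.read
           .parquet("abfss://container@storage.dfs.core.windows.net/data/")
           .select("id", *code_cols, "codes_concat"))         # 26 of the 300+ columns
```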
The transformation I want to perform is the following.
I have a lookup dataframe containing the 1000 codes I want to match against entries in the 24 columns. If a match is found, the row is written to a new dataframe/RDD. My aim is to eliminate as many rows as possible so that the process is fast and efficient and can be completed wholly in memory before being written to a new parquet file.
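Expressed as naively as possible, the end state I'm after is something like this (lookup_df is my 1000-code lookup with a single code column; all names are illustrative):

```python
from pyspark.sql import functions as F

# pull the ~1000 lookup codes into a plain python list
lookup_codes = [r["code"] for r in lookup_df.select("code").distinct().collect()]

# keep a row if any of the 24 code columns (code_cols from the sketch above)
# holds one of the lookup codes, and drop everything else
match_any = F.lit(False)
for c in code_cols:
    match_any = match_any | F.col(c).isin(lookup_codes)

matched_df = df.filter(match_any)
```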
What I've tried so far:
1. Split and explode the concatenated column alongside the ID column, then use regexp_extract to match against the new column before discarding the unmatched entries.
2. Use stack on the 24 columns, then use regexp_extract in the same way (rough sketches of both are below).
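Here lookup_pattern is just the 1000 lookup codes OR'd into one regex (the codes are plain 5-6 character strings, so I haven't worried about escaping), and the names carry on from the sketches above:

```python
from pyspark.sql import functions as F

lookup_pattern = "|".join(lookup_codes)       # ~1000 codes as one big alternation

# 1. split the concatenated column, explode it alongside the ID, then match
exploded = (df.select("id", F.explode(F.split("codes_concat", ",")).alias("code"))
              .withColumn("hit", F.regexp_extract("code", lookup_pattern, 0))
              .filter(F.col("hit") != ""))

# 2. unpivot the 24 columns with stack, then match the same way
stack_expr = "stack(24, " + ", ".join(f"'{c}', {c}" for c in code_cols) + ") as (src_col, code)"
stacked = (df.selectExpr("id", stack_expr)
              .withColumn("hit", F.regexp_extract("code", lookup_pattern, 0))
              .filter(F.col("hit") != ""))
```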
My issue with both methods is that the exploded and stacked versions produce close to 4 billion rows, 80% of which will be junk.
My final option, in an attempt to cut the rows down before exploding, is to run regexp_extract on the concatenated field, discard rows where no match is found, and only then perform the explode/stack on the remaining rows. The only caveat I can think of is that, with 24 columns, a matching entry is likely to be found in at least one of them, so this may not reduce the number of rows to process significantly enough.
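i.e. something along these lines before running either of the sketches above:

```python
from pyspark.sql import functions as F

# drop rows whose concatenated column contains none of the lookup codes,
# then run the explode/stack from the earlier sketch on what survives
prefiltered = df.filter(F.regexp_extract("codes_concat", lookup_pattern, 0) != "")
```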
Is there another method I can use? In SQL I would just pivot, join and discard the nulls produced by the join. But when I do this in pyspark and debug each step, I can see that this approach produces too many rows to be efficient. Out of the 5 billion entries, there are only around 20 million that I want to keep.
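For reference, that SQL-style route translated to pyspark comes out roughly like this for me (stack_expr and lookup_df as in the sketches above):

```python
from pyspark.sql import functions as F

# unpivot the 24 columns, left join to the lookup, then discard the nulls
unpivoted = df.selectExpr("id", stack_expr)                   # (id, src_col, code)
joined = (unpivoted
          .join(lookup_df.withColumnRenamed("code", "lookup_code"),
                F.col("code") == F.col("lookup_code"), "left")
          .filter(F.col("lookup_code").isNotNull()))
```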
I'm using broadcast joins and caching for the lookup dataframe, plus partitioning, although since regexp_extract is an embarrassingly parallel operation I'm resisting the urge to sort the data before writing it back out.
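That part looks roughly like this (paths are placeholders, unpivoted as in the sketch above):

```python
from pyspark.sql import functions as F

lookup_df = spark.read.parquet("/path/to/lookup/").cache()    # small, ~1000 codes

# broadcast the small side so only the lookup gets shipped to the executors
matched = unpivoted.join(F.broadcast(lookup_df), on="code", how="inner")

# write straight back out without an explicit sort
matched.write.mode("overwrite").parquet("/path/to/output/")
```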
I don't need syntax, just alternative methods for how I can approach this. Any pointers will be gratefully received. Thank you.