I have an input file with the following structure:
col1, col2, col3
line1field1,line1field2.1\
line1field2.2, line1field3
line2field1,line2field2.1\
line2field2.2, line2field3
line3field1, line3field2, line3field3
line4field1,line4field2,
line5field1,,line5field3
The output DataFrame should be:
col1, col2, col3
[line1field1,line1field2.1 line1field2.2, line1field3]
[line2field1,line2field2.1 line2field2.2, line2field3]
[line3field1, line3field2, line3field3]
[line4field1,line4field2, null]
[line5field1, null, line5field3]
This is what I'm trying:

val df = spark
  .read
  .option("multiLine", "true")
  .option("escape", "\\")
  .csv("path to file")
Some solutions suggest using wholeTextFiles, but it is also mentioned that wholeTextFiles is not an optimal solution, since it loads each file into memory as a single record.
What would be the right way to do this?
P.S.: My production input file is 50 GB.
I've tried the piece of code below. I think it can be improved, but maybe it can give you some clues for solving your problem.
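The idea is to read the file as plain text first, stitch every line ending in a backslash back onto the line that follows it, and only then hand the result to Spark's CSV parser. This is a rough sketch ("path to file" is the placeholder from the question, and the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-line-continuations").getOrCreate()
import spark.implicits._

val merged = spark.sparkContext
  .textFile("path to file")
  .mapPartitions { lines =>
    // Glue a line ending in "\" onto the line that follows it,
    // joining the two halves with a single space.
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    val sb = new StringBuilder
    for (line <- lines) {
      if (line.endsWith("\\")) sb.append(line.dropRight(1)).append(' ')
      else { sb.append(line); out += sb.toString; sb.clear() }
    }
    if (sb.nonEmpty) out += sb.toString // dangling continuation at the partition end
    out.iterator
  }

// Spark 2.2+ can parse a Dataset[String] as CSV directly.
val df = spark.read
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true") // the sample has spaces after some commas
  .csv(merged.toDS())

df.show(false)

The main weakness is that mapPartitions only merges lines within a single partition, so a record whose continuation crosses a partition boundary would still be split; for a 50 GB file that boundary case has to be handled (or the boundary lines pre-joined) before this is production-ready.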