I am reading data in Spark inside a for loop, performing joins, and writing the result to a path in append mode.
import org.apache.spark.sql.SaveMode

for (partition <- partitionlist) {
  val df = spark.read.parquet("path")
  // qualify the join columns so the condition is unambiguous
  val df2 = df.join(anotherdf, df("col1") === anotherdf("col1"))
  // SaveMode.Append is an enum value, not a string literal
  df2.write.mode(SaveMode.Append).partitionBy("partitionColumn").format("parquet").save("anotherpath")
}
The sample code above runs on Spark 2.x. Since the Spark 2 write APIs do not guarantee atomic commits, is the following scenario possible: in some iteration, the stages/tasks writing to the path go into retries, succeed after a few attempts, and leave duplicate (redundant) data in the output written by that iteration?
EDIT: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 is being used.
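For context, a minimal sketch of how that committer setting can be applied when building the session (the appName is hypothetical; the property can equally be passed via --conf to spark-submit):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-append-job") // hypothetical name, for illustration only
  // algorithm v2 moves each task's output directly into the destination
  // directory at task commit, instead of a single job-level rename at the end
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()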