How to determine the function of the tasks in each stage of an Apache Spark application?


I am conducting a study comparing the execution time of a Bloom Filter Join operation in two environments: Apache Spark Cluster and Apache Spark. I have already compared the overall execution time in both environments, but I want to compare specific tasks in each stage to see which computation shows the most significant difference.
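To compare task timings directly rather than only reading them off the web UI, one approach I am considering is a custom SparkListener that logs the stage ID, task ID, and run time of every finished task, so the two environments can be compared task by task. This is only a minimal sketch; the listener class name and the set of printed metrics are my own choices and not part of the program shown further below.

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs stage ID, task ID, locality, and executor run time for every finished task
class TaskTimingListener extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
            println(s"stage=${taskEnd.stageId} task=${info.taskId} locality=${info.taskLocality} " +
                s"runTimeMs=${metrics.executorRunTime} recordsRead=${metrics.inputMetrics.recordsRead}")
        }
    }
}

// Registered on the driver before the jobs run:
// spark.sparkContext.addSparkListener(new TaskTimingListener)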

I have taken screenshots of the DAG of Stage 0 and of the list of tasks executed in Stage 0:

[Screenshot: DAG of Stage 0 (DAG.png)]

[Screenshot: task list of Stage 0 (Task.png)]

Here is my program:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val appName = "scenario4-study2b"

val spark = SparkSession.builder()
    .appName(appName)
    .getOrCreate()

val s3accessKeyAws = "**"
val s3secretKeyAws = "**"
val connectionTimeOut = "600000"
val s3endPointLoc: String = "http://192.168.1.100"

val sc = spark.sparkContext

// Read files from S3 (about 70 GB in total) and union them into one RDD
var rddL: RDD[String] = spark.sparkContext.emptyRDD[String]
for (index <- 16 to 29) {
    val fileName = f"$index%02d"
    rddL = Tools.readS3A(sc, fileName).union(rddL)
}

// Build the Bloom filter (BF) from the first dataset
val BF = Tools.rdd2BF(rddL)

// Read files from S3 (about 80 GB in total), filter them through the Bloom filter, and count the matches
var countRS: Long = 0
for (index <- 0 to 15) {
    val fileName = f"$index%02d"
    countRS = countRS + Tools.readS3A(sc, fileName).filter(item => BF.contains(item)).count()
}

print("\nResult: " + countRS + "\n\n")
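To make it easier to tie the stages and tasks shown in the UI back to specific parts of this code, I am also considering labelling the jobs before each action with setJobGroup / setJobDescription. The sketch below is a variant of the loops above; the group ID and descriptions are arbitrary names I made up.

// Variant of the code above with labelled jobs, so the Spark UI shows which
// job (and therefore which stages and tasks) belongs to which part of the code
sc.setJobGroup("build-bloom-filter", "Build Bloom filter from the ~70 GB dataset")
val BF = Tools.rdd2BF(rddL)

for (index <- 0 to 15) {
    val fileName = f"$index%02d"
    sc.setJobDescription(s"Filter and count file $fileName against the Bloom filter")
    countRS = countRS + Tools.readS3A(sc, fileName).filter(item => BF.contains(item)).count()
}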

I have two questions:

  1. Can we determine which tasks are responsible for executing each step scheduled in the DAG while the job is running?
  2. Is it possible to know the function of each task (e.g., what is task ID 0 responsible for? What is task ID 1 responsible for? ...)?
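For per-task numbers outside the web UI, the Spark monitoring REST API exposes the task list of a stage attempt. The snippet below is only a sketch: the driver host, port, and application ID are placeholders that would have to be replaced with real values.

import scala.io.Source

val driverHost = "localhost"               // placeholder: driver or history-server host
val appId = "app-XXXXXXXXXXXXXX-XXXX"      // placeholder: real application ID from the UI
val stageId = 0
val attemptId = 0

// Returns a JSON array with one entry per task: taskId, executorRunTime, input and shuffle metrics, etc.
val url = s"http://$driverHost:4040/api/v1/applications/$appId/stages/$stageId/$attemptId/taskList"
val tasksJson = Source.fromURL(url).mkString
println(tasksJson)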