How to enable schema evolution in Auto Loader

I want to use schema evolution with Auto Loader so that new columns are added (addNewColumns) as they arrive and are ingested.

[Screenshots: the Auto Loader settings; the Spark readStream code; the error I get when I call display]

When I click display, the stream is supposed to fail and tell me that there are unknown columns.

It is supposed to give me the error shown in this picture.

1 Answer

Answered by DileeprajnarayanThumula

The schema location directory keeps track of your data schema over time.

For more detail, see "Configure schema inference and evolution in Auto Loader" in the Databricks documentation.

When you specify a target directory for the option cloudFiles.schemaLocation, it enables schema inference and evolution.

You can use the same directory for checkpointLocation if you prefer.

The following is the syntax:

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>")
  .writeStream
  .option("checkpointLocation", "<path-to-checkpoint>")
  .start("<path_to_target>")
)
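
Since the question asks for addNewColumns specifically, the evolution mode can also be set explicitly through the cloudFiles.schemaEvolutionMode option. The following is only a sketch: autoloader_options is a hypothetical helper (not part of any Databricks API) that assembles the reader options as a plain dict, and the path is a placeholder. As far as I know, addNewColumns is already the default when no schema is supplied, so setting it mainly documents the intent.

```python
# Hypothetical helper: builds the option dict for an Auto Loader reader.
# The option names are the documented cloudFiles.* options; the helper itself
# and the placeholder path below are assumptions for illustration.
ALLOWED_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def autoloader_options(file_format, schema_location, evolution_mode="addNewColumns"):
    """Return reader options that enable schema inference and evolution."""
    if evolution_mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported schema evolution mode: {evolution_mode}")
    return {
        "cloudFiles.format": file_format,
        # Directory where Auto Loader stores the inferred schema over time.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }


opts = autoloader_options("csv", "/tmp/schema_loc")  # placeholder path

# On Databricks, where `spark` is predefined, the dict could be passed as:
# stream_df = (spark.readStream.format("cloudFiles")
#     .options(**opts)
#     .load(Source_data_loc))
```

If the target is a Delta table, the writer usually also needs .option("mergeSchema", "true") so that the new columns can be added to the table; treat that as something to verify against your runtime.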

Results:

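# start() returns a StreamingQuery handle, not a DataFrame
query = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    # Enables schema inference and evolution
    .option("cloudFiles.schemaLocation", schema_loc)
    .load(Source_data_loc)
    .writeStream
    # Same directory reused for the checkpoint
    .option("checkpointLocation", schema_loc)
    .start(target_data_loc))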
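
The failure described in the question is expected behaviour in addNewColumns mode: when new columns show up, the stream stops with an unknown-field error after recording the new schema, and it picks the new columns up on the next start. A rough sketch of a restart loop follows. run_with_restarts is a hypothetical helper, and matching on the text "UnknownFieldException" in the error message is an assumption about how the failure surfaces in Python.

```python
# Hypothetical restart wrapper: re-runs a stream after a schema-change failure.
def run_with_restarts(start_and_wait, max_restarts=3):
    """Call start_and_wait() until it finishes; return the number of restarts.

    start_and_wait is a zero-argument callable that starts the stream and
    blocks until it terminates (for example by calling awaitTermination()).
    """
    restarts = 0
    while True:
        try:
            start_and_wait()
            return restarts
        except Exception as exc:
            # Assumption: the schema-change failure mentions this class name.
            is_schema_change = "UnknownFieldException" in str(exc)
            if not is_schema_change or restarts >= max_restarts:
                raise
            restarts += 1


# On Databricks, the callable could wrap the stream definition, e.g.:
# def start_and_wait():
#     (spark.readStream.format("cloudFiles")
#         .option("cloudFiles.format", "csv")
#         .option("cloudFiles.schemaLocation", schema_loc)
#         .load(Source_data_loc)
#         .writeStream
#         .option("checkpointLocation", schema_loc)
#         .start(target_data_loc)
#         .awaitTermination())
#
# run_with_restarts(start_and_wait)
```

In a scheduled job, configuring automatic retries on the job achieves the same effect without a loop in the notebook.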
