I am trying to run Archives Unleashed (which uses PySpark) in a Jupyter Notebook to work with some web archives. When I run the following code, I get the error message "'JavaPackage' object is not callable":
from aut import *
WebArchive(sc, sqlContext, "path/to/warcs") \
    .webpages() \
    .select("crawl_date", "domain", "url", "content") \
    .write \
    .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ") \
    .format("csv") \
    .option("escape", "\"") \
    .option("encoding", "utf-8") \
    .save("plain-text-df/")
Here is the link to the package documentation in case it is helpful: https://aut.docs.archivesunleashed.org/docs/text-analysis
I have tried following the PySpark setup steps from this notebook, to no avail (the same error message repeats): https://github.com/archivesunleashed/notebooks/blob/main/Parquet%20Examples/parquet_text_analyis.ipynb
I have also checked that the SparkContext is set up correctly, using this code I found in another Stack Overflow question:
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession.builder.appName("Basics").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
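For context, py4j returns a JavaPackage placeholder for any Java name it cannot load, so this error usually suggests that the Java class behind the Python wrapper is not on the JVM classpath. Below is a minimal sketch of a check that could narrow this down. It assumes the AUT loader class is named io.archivesunleashed.df.DataFrameLoader (an assumption, it may differ by AUT version), and the helper names are hypothetical, not part of AUT or PySpark.

```python
from functools import reduce

# Assumed entry point inside the AUT JAR; adjust if your AUT version differs.
AUT_LOADER_CLASS = "io.archivesunleashed.df.DataFrameLoader"


def resolve_jvm_name(jvm, dotted_name):
    """Walk a dotted Java name on a py4j JVM view, one attribute at a time."""
    return reduce(getattr, dotted_name.split("."), jvm)


def is_unresolved(jvm_object):
    """Return True if py4j handed back a JavaPackage placeholder.

    The type is compared by name so that this helper does not import py4j.
    """
    return type(jvm_object).__name__ == "JavaPackage"


def pyspark_submit_args(jar_path, py_files_path):
    """Build a PYSPARK_SUBMIT_ARGS value; both paths are supplied by the caller."""
    return f"--jars {jar_path} --py-files {py_files_path} pyspark-shell"


# In the notebook, with the `sc` created above:
#   is_unresolved(resolve_jvm_name(sc._jvm, AUT_LOADER_CLASS))
# True would mean the JVM cannot see the assumed AUT class.
```

If the check returns True, one option to try is setting os.environ["PYSPARK_SUBMIT_ARGS"] to the string from pyspark_submit_args(...) before the first SparkSession is created, with the paths pointing at your local AUT JAR and zip. Since getOrCreate() reuses an already running session, this would need a kernel restart to take effect.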