Install pandera[pyspark]-compatible version of pydantic on Databricks cluster


I have an issue with installing pandera[pyspark] and pydantic on my Databricks cluster.

Background

I have a data pipeline in Databricks notebooks. It is completely written in PySpark, so for my data validation I want to use the pyspark version of pandera.

Issue

When I include the latest PyPI versions of pandera[pyspark] (0.17.2) and pydantic (2.4.2) in my cluster configuration, they install successfully. However, when I try to import something from pandera.pyspark (for example, from pandera.pyspark import DataFrameSchema), I get the following error: AttributeError: Module 'pydantic' has no attribute '__version__'. I have looked at this SO question, but according to the GitHub issue linked in the answer, this should be fixed in newer versions of pydantic.
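For context, the failing lookup is just an attribute access on the imported module. A minimal, self-contained sketch of that failure mode (using the stdlib math module, which also lacks __version__, purely as a stand-in for the broken pydantic install on the cluster):

```python
import importlib


def report_version(module_name: str) -> str:
    """Return module.__version__ if present, else a diagnostic message."""
    mod = importlib.import_module(module_name)
    version = getattr(mod, "__version__", None)
    if version is None:
        return f"Module '{module_name}' has no attribute '__version__'"
    return version


# math, like the shadowed pydantic install on the cluster, has no __version__
print(report_version("math"))
# → Module 'math' has no attribute '__version__'
```

This suggests the import on the cluster resolves to a pydantic package (or namespace shadowing it) that does not expose __version__, rather than the PyPI 2.4.2 release, which does.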

Strangely, if I don't install the libraries on the cluster and instead run the following lines in my data validation notebook:

!pip install --upgrade pydantic
!pip install pandera[pyspark]

then I can import DataFrameSchema without any issues and run the validation script. Notably, these lines install the very same latest PyPI versions of pandera and pydantic.
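One way to verify that the notebook-scoped and cluster-scoped installs really resolve to the same distributions is to query the installed package metadata instead of the module attribute. A minimal diagnostic sketch (importlib.metadata is stdlib since Python 3.8; "pip" below is only a placeholder distribution that exists in any environment, on the cluster you would pass "pydantic" and "pandera"):

```python
import importlib.metadata as md


def installed_version(dist: str) -> str:
    """Look up a distribution's version from package metadata,
    independent of what the imported module itself exposes."""
    try:
        return md.version(dist)
    except md.PackageNotFoundError:
        return "not installed"


# On the cluster, run this for "pydantic" and "pandera" in both setups
# and compare; "pip" is used here only so the sketch runs anywhere.
print(installed_version("pip"))
```

If the reported versions match in both setups but the import only fails under the cluster configuration, that points to a path/ordering problem (a second pydantic on sys.path) rather than the versions themselves.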

Question

How can it be that the same combination of pandera and pydantic versions gives issues when installed via the cluster configuration, but not when installed in a notebook cell? Is there an alternative way to install these packages correctly on my cluster?
