Nesting DataFrame needed for validating data with Pandera DataFrameModel


I have a schema, created with Pandera's DataFrameModel, for a DataFrame called 'params'. This is basically a DataFrame of floats that I need to validate before it is used by my application.

I have another DataFrame, let's call it params2, which one of the validation checks run on params needs access to.

I cannot store params2 in the columns of params. How can I pass this DataFrame along to the validation checks run on params?

So far I have tried creating a custom class called SecretDF that inherits from the pandas DataFrame and keeps a second DataFrame hidden inside the class. I added methods for setting and accessing that hidden DataFrame, but inside the Pandera checks the DataFrame no longer has these methods or the hidden attribute.
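The loss of the attribute can be reproduced with pandas alone, which suggests it is not specific to Pandera. The sketch below assumes only that a new DataFrame object gets built from the subclass instance at some point before the check runs; the variable names `wrapped` and `rebuilt` are made up for illustration.

```python
import pandas as pd


class SecretDF(pd.DataFrame):
    """DataFrame subclass that keeps an extra frame as a plain instance attribute."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._secret_df = None


wrapped = SecretDF({"value": [1.0, 2.0, 3.0]})
wrapped._secret_df = pd.DataFrame({"reference_value": [0.5, 2.5, 3.5]})

# Building a new DataFrame from the subclass instance keeps the data,
# but not the subclass type or its instance attributes.
rebuilt = pd.DataFrame(wrapped)

print(type(wrapped).__name__, hasattr(wrapped, "_secret_df"))
print(type(rebuilt).__name__, hasattr(rebuilt, "_secret_df"))
```

So any step that re-wraps the data into another DataFrame class would hand the check an object without `_secret_df`.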

Below is a minimal reproducible example of the problem.

import pandas as pd
import pandera as pa

# Example dataframes
params = pd.DataFrame({
    'value': [1.0, 2.0, 3.0]
})

params2 = pd.DataFrame({
    'reference_value': [0.5, 2.5, 3.5]
})

# A custom class inheriting from pd.DataFrame to hold a secret dataframe
class SecretDF(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        self._secret_df = None
        super().__init__(*args, **kwargs)

    def set_secret(self, df):
        self._secret_df = df

    def get_secret(self):
        return self._secret_df

# Using DataFrameModel to define the schema
class ParamsSchema(pa.DataFrameModel):
    value: pa.typing.Series[float] = pa.Field(gt=0, check_name=True, nullable=False)

    # Check function to validate values of params based on params2
    @pa.dataframe_check
    def validate_based_on_secret_df(cls, df: pd.DataFrame):
        # This attribute lookup is what fails during validation
        if df._secret_df is None:
            return True
        return (df["value"] < df._secret_df["reference_value"]).all()

# Try to validate
params_with_secret = SecretDF(params)
params_with_secret.set_secret(params2)

try:
    # pandera.typing.DataFrame (not pd.DataFrame) validates on construction
    pa_params = pa.typing.DataFrame[ParamsSchema](params_with_secret)
    print("Validation passed")
except Exception as e:
    print(f"Validation failed: {e}")

This does not work, as df won't have the attribute _secret_df inside the check validate_based_on_secret_df. For my specific application, putting the reference column inside params is NOT an option; this is a simplified example from my repo. Creating the functions and checks on the fly is not an option either, as the schema is hardcoded in a .py file. What can be done here?
