Fill nulls in Python Polars lazyframe by groups, conditional on the number of unique values in each group

82 views Asked by At

I have a large (~300M rows x 44 cols) dataframe and I need to fill in null values in certain ways depending on the characteristics of each group.

For example, say we have

lf = pl.LazyFrame(
    {'group':(1,1,1,2,2,2,3,3,3),
     'val':('yes', None, 'no', '2', '2', '2', 'answer', None, 'answer')
     }
)
┌───────┬────────┐
│ group ┆ val    │
│ ---   ┆ ---    │
│ i64   ┆ str    │
╞═══════╪════════╡
│ 1     ┆ yes    │
│ 1     ┆ null   │
│ 1     ┆ no     │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 3     ┆ answer │
│ 3     ┆ null   │
│ 3     ┆ answer │
└───────┴────────┘

I want to fill in nulls if and only if the group contains a single non-null unique value in the other cells, since in my context that's the expectation of the data and the presense of more than one unique value (or all nulls) in the group signals another issue that will be handled differently.

I'm able to fill null values for each group with the following:

filled_lf = (
    lf
    .with_columns(
        pl.col('val')
        .fill_null(pl.col('val').unique().first().over('group')).alias('filled_val')
        )
    )

However, for one, it seems that pl.col('val').unique() includes 'null' as one of the values, and the ordering is stochastic so choosing the first value on the list has inconsistent results. Secondly, it doesn't include the condition I need.

Desired result:

┌───────┬────────┬────────────┐
│ group ┆ val    ┆ filled_val │
│ ---   ┆ ---    ┆ ---        │
│ i64   ┆ str    ┆ str        │
╞═══════╪════════╪════════════╡
│ 1     ┆ yes    ┆ yes        │
│ 1     ┆ null   ┆ null       │
│ 1     ┆ no     ┆ no         │
│ 2     ┆ 2      ┆ 2          │
│ 2     ┆ 2      ┆ 2          │
│ 2     ┆ 2      ┆ 2          │
│ 3     ┆ answer ┆ answer     │
│ 3     ┆ null   ┆ answer     │
│ 3     ┆ answer ┆ answer     │
└───────┴────────┴────────────┘

Pandas 3.12 Polars 0.20.1

Thanks in advance for your advice!

2

There are 2 answers

1
jqurious On BEST ANSWER

You can add:

unique = pl.col("val").drop_nulls().unique(maintain_order=True)

df.with_columns(
   pl.when(unique.len().over("group") == 1)
     .then(pl.col("val").fill_null(unique.first().over("group")))
     .otherwise(pl.col("val"))
     .alias("filled")
)
shape: (9, 3)
┌───────┬────────┬────────┐
│ group ┆ val    ┆ filled │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ str    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ yes    ┆ yes    │
│ 1     ┆ null   ┆ null   │
│ 1     ┆ no     ┆ no     │
│ 2     ┆ 2      ┆ 2      │
│ 2     ┆ 2      ┆ 2      │
│ 2     ┆ 2      ┆ 2      │
│ 3     ┆ answer ┆ answer │
│ 3     ┆ null   ┆ answer │
│ 3     ┆ answer ┆ answer │
└───────┴────────┴────────┘
3
Roman Pekar On

You can use drop_nulls() and n_unique() to calculate amount of unique values within the column. After that, you don't even need to use fill_null(), cause you know that you want your whole column to have the same value:

lf.with_columns(
    pl.when(
        pl.col('val').drop_nulls().n_unique() == 1
    ).then(
        pl.col('val').drop_nulls().first()
    ).otherwise(
        pl.col('val')
    ).over('group')
)

┌───────┬────────┐
│ group ┆ val    │
│ ---   ┆ ---    │
│ i64   ┆ str    │
╞═══════╪════════╡
│ 1     ┆ yes    │
│ 1     ┆ null   │
│ 1     ┆ no     │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 3     ┆ answer │
│ 3     ┆ answer │
│ 3     ┆ answer │
└───────┴────────┘

Alternatively, for your case you don't even need to calculate amount of unique values. You only care if there's one unique value within the group, so you can use min() and max() within the group:

lf.with_columns(
    pl.when(
        pl.col('val').min() == pl.col('val').max()
    ).then(
        pl.col('val').max()
        # alternatively you can use
        # pl.col('val').fill_null('max')
    ).otherwise(
        pl.col('val')
    ).over('group')
)

┌───────┬────────┐
│ group ┆ val    │
│ ---   ┆ ---    │
│ i64   ┆ str    │
╞═══════╪════════╡
│ 1     ┆ yes    │
│ 1     ┆ null   │
│ 1     ┆ no     │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 3     ┆ answer │
│ 3     ┆ answer │
│ 3     ┆ answer │
└───────┴────────┘

update

as mentioned in the comments, both ways fail on the full dataset. I don't know what exact problem is, but I also see possible logical problem in my code - the over expression is applied to the full when.then.otherwise expression, while it doesn't have to be applied to otherwise part of it. It seems to be working (kudos to polars expression parser) but I see how it might be problematic. So one suggestion might be to move over clause inside of the when.then.otherwise expression parts to make parser' life easier:

lf.with_columns(
    pl.when(
        pl.col('val').drop_nulls().n_unique().over('group') == 1
    ).then(
        pl.col('val').drop_nulls().first().over('group')
    ).otherwise(
        pl.col('val')
    )
)