Fill nulls in Python Polars lazyframe by groups, conditional on the number of unique values in each group

Question

Fill nulls in Python Polars lazyframe by groups, conditional on the number of unique values in each group

82 views Asked by epistemetrica At 27 January 2024 at 05:54

I have a large (~300M rows x 44 cols) dataframe and I need to fill in null values in certain ways depending on the characteristics of each group.

For example, say we have

lf = pl.LazyFrame(
    {'group':(1,1,1,2,2,2,3,3,3),
     'val':('yes', None, 'no', '2', '2', '2', 'answer', None, 'answer')
     }
)

┌───────┬────────┐
│ group ┆ val    │
│ ---   ┆ ---    │
│ i64   ┆ str    │
╞═══════╪════════╡
│ 1     ┆ yes    │
│ 1     ┆ null   │
│ 1     ┆ no     │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 3     ┆ answer │
│ 3     ┆ null   │
│ 3     ┆ answer │
└───────┴────────┘

I want to fill in nulls if and only if the group contains a single non-null unique value in the other cells, since in my context that's the expectation of the data and the presense of more than one unique value (or all nulls) in the group signals another issue that will be handled differently.

I'm able to fill null values for each group with the following:

filled_lf = (
    lf
    .with_columns(
        pl.col('val')
        .fill_null(pl.col('val').unique().first().over('group')).alias('filled_val')
        )
    )

However, for one, it seems that pl.col('val').unique() includes 'null' as one of the values, and the ordering is stochastic so choosing the first value on the list has inconsistent results. Secondly, it doesn't include the condition I need.

Desired result:

┌───────┬────────┬────────────┐
│ group ┆ val    ┆ filled_val │
│ ---   ┆ ---    ┆ ---        │
│ i64   ┆ str    ┆ str        │
╞═══════╪════════╪════════════╡
│ 1     ┆ yes    ┆ yes        │
│ 1     ┆ null   ┆ null       │
│ 1     ┆ no     ┆ no         │
│ 2     ┆ 2      ┆ 2          │
│ 2     ┆ 2      ┆ 2          │
│ 2     ┆ 2      ┆ 2          │
│ 3     ┆ answer ┆ answer     │
│ 3     ┆ null   ┆ answer     │
│ 3     ┆ answer ┆ answer     │
└───────┴────────┴────────────┘

Pandas 3.12 Polars 0.20.1

Thanks in advance for your advice!

Original Q&A

There are 2 answers

Roman Pekar On 27 January 2024 at 13:18

You can use drop_nulls() and n_unique() to calculate amount of unique values within the column. After that, you don't even need to use fill_null(), cause you know that you want your whole column to have the same value:

lf.with_columns(
    pl.when(
        pl.col('val').drop_nulls().n_unique() == 1
    ).then(
        pl.col('val').drop_nulls().first()
    ).otherwise(
        pl.col('val')
    ).over('group')
)

┌───────┬────────┐
│ group ┆ val    │
│ ---   ┆ ---    │
│ i64   ┆ str    │
╞═══════╪════════╡
│ 1     ┆ yes    │
│ 1     ┆ null   │
│ 1     ┆ no     │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 3     ┆ answer │
│ 3     ┆ answer │
│ 3     ┆ answer │
└───────┴────────┘

Alternatively, for your case you don't even need to calculate amount of unique values. You only care if there's one unique value within the group, so you can use min() and max() within the group:

lf.with_columns(
    pl.when(
        pl.col('val').min() == pl.col('val').max()
    ).then(
        pl.col('val').max()
        # alternatively you can use
        # pl.col('val').fill_null('max')
    ).otherwise(
        pl.col('val')
    ).over('group')
)

┌───────┬────────┐
│ group ┆ val    │
│ ---   ┆ ---    │
│ i64   ┆ str    │
╞═══════╪════════╡
│ 1     ┆ yes    │
│ 1     ┆ null   │
│ 1     ┆ no     │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 3     ┆ answer │
│ 3     ┆ answer │
│ 3     ┆ answer │
└───────┴────────┘

update

as mentioned in the comments, both ways fail on the full dataset. I don't know what exact problem is, but I also see possible logical problem in my code - the over expression is applied to the full when.then.otherwise expression, while it doesn't have to be applied to otherwise part of it. It seems to be working (kudos to polars expression parser) but I see how it might be problematic. So one suggestion might be to move over clause inside of the when.then.otherwise expression parts to make parser' life easier:

lf.with_columns(
    pl.when(
        pl.col('val').drop_nulls().n_unique().over('group') == 1
    ).then(
        pl.col('val').drop_nulls().first().over('group')
    ).otherwise(
        pl.col('val')
    )
)

**jqurious** · Accepted Answer · 2024-01-27T11:20:03+00:00

You can add:

.drop_nulls()
.unique() with maintain_order=True
when/then/otherwise to implement the conditional count/len logic

unique = pl.col("val").drop_nulls().unique(maintain_order=True)

df.with_columns(
   pl.when(unique.len().over("group") == 1)
     .then(pl.col("val").fill_null(unique.first().over("group")))
     .otherwise(pl.col("val"))
     .alias("filled")
)

shape: (9, 3)
┌───────┬────────┬────────┐
│ group ┆ val    ┆ filled │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ str    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ yes    ┆ yes    │
│ 1     ┆ null   ┆ null   │
│ 1     ┆ no     ┆ no     │
│ 2     ┆ 2      ┆ 2      │
│ 2     ┆ 2      ┆ 2      │
│ 2     ┆ 2      ┆ 2      │
│ 3     ┆ answer ┆ answer │
│ 3     ┆ null   ┆ answer │
│ 3     ┆ answer ┆ answer │
└───────┴────────┴────────┘

TechQA.

Fill nulls in Python Polars lazyframe by groups, conditional on the number of unique values in each group

There are 2 answers

Related Questions in PYTHON

Related Questions in WINDOW-FUNCTIONS

Related Questions in LARGE-DATA

Related Questions in PYTHON-POLARS

Related Questions in UNIQUE-VALUES

Popular Questions

Trending Questions