Not filling NaN values in df when using mode of df

36 views Asked by At

I have this dataframe for which I want to fill in the nan value:

quant = ['satisfaction_level', 'last_evaluation', 'average_monthly_hours']
cat = ['number_projects', 'time_spend_company', 'work_accident', 'promotion_last_5_years', 'position', 'salary']
X_train[quant] = X_train[quant].fillna(X_train[quant].median())
X_train[cat] = X_train[cat].fillna(X_train[cat].mode())

X_test[quant] = X_test[quant].fillna(X_test[quant].median())
X_test[cat] = X_test[cat].fillna(X_test[cat].mode())

Even when I use .loc instead, it does not fill the NaN and if there are not two modes as the mode()[0] returns an error.

However this does not fill in the categorical nan values though the documentation states that the value parameter can be a dataframe as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

Other methods such as doing it by column works.

Any idea ?

1

There are 1 answers

0
DataJanitor On BEST ANSWER

The reason is that .mode() returns a DataFrame, so you need to use .mode().iloc[0]:

for column in cat:
    X_train[column].fillna(X_train[column].mode().iloc[0], inplace=True)
    X_test[column].fillna(X_test[column].mode().iloc[0], inplace=True)

This .mode().iloc[0] method will not give an error if there are not multiple modes. The mode() function always returns a DataFrame, even if there is only one mode. The .iloc[0] simply selects the first mode from this DataFrame. So, if there is only one mode, .mode().iloc[0] will return that mode. If there are multiple modes, .mode().iloc[0] will return the first one. When you use .mode()[0] on a pandas.DataFrame, it can lead to issues if there are not multiple modes. This is because .mode() returns a DataFrame, and when you use [0], it tries to access the first column of this DataFrame, not the first row.

I tested this snipped with the following code:

import pandas as pd
import numpy as np

df_train = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11, 0.72, 0.37, np.nan],
    'last_evaluation': [0.53, 0.86, 0.88, 0.87, 0.52, 0.50],
    'average_monthly_hours': [157, np.nan, 272, 223, 159, 153],
    'number_projects': [2, 5, 7, 5, 2, np.nan],
    'time_spend_company': [3, 6, 4, 5, 3, 3],
    'work_accident': [0, 0, 0, 0, 0, 1],
    'promotion_last_5_years': [0, 0, 0, 0, 0, 0],
    'position': ['sales', 'sales', 'sales', 'sales', 'sales', np.nan],
    'salary': ['low', 'medium', 'medium', 'low', 'low', 'low']
})

df_test = pd.DataFrame({
    'satisfaction_level': [0.41, 0.10, 0.92, 0.89, 0.42, 0.45],
    'last_evaluation': [0.50, 0.77, 0.85, 0.91, 0.53, np.nan],
    'average_monthly_hours': [160, 255, 262, 234, 158, 145],
    'number_projects': [2, 6, 5, 5, 2, 2],
    'time_spend_company': [3, 4, 5, 5, 3, 2],
    'work_accident': [0, 0, 0, 0, 1, 0],
    'promotion_last_5_years': [0, 0, 0, 0, 0, 0],
    'position': ['sales', 'sales', 'sales', 'sales', 'sales', 'sales'],
    'salary': ['low', 'medium', 'medium', 'low', 'low', np.nan]
})

quant = ['satisfaction_level', 'last_evaluation', 'average_monthly_hours']
cat = ['number_projects', 'time_spend_company', 'work_accident', 'promotion_last_5_years', 'position', 'salary']

# Fill NaN values in quantitative columns with median
df_train[quant] = df_train[quant].fillna(df_train[quant].median())
df_test[quant] = df_test[quant].fillna(df_test[quant].median())

# Fill NaN values in categorical columns with mode
for column in cat:
    df_train[column].fillna(df_train[column].mode().iloc[0], inplace=True)
    df_test[column].fillna(df_test[column].mode().iloc[0], inplace=True)

print(df_train)
print(df_test)
satisfaction_level last_evaluation average_monthly_hours number_projects time_spend_company work_accident promotion_last_5_years position salary
0 0.38 0.53 157 2 3 0 0 sales low
1 0.8 0.86 159 5 6 0 0 sales medium
2 0.11 0.88 272 7 4 0 0 sales medium
3 0.72 0.87 223 5 5 0 0 sales low
4 0.37 0.52 159 2 3 0 0 sales low
5 0.38 0.5 153 2 3 1 0 sales low
satisfaction_level last_evaluation average_monthly_hours number_projects time_spend_company work_accident promotion_last_5_years position salary
0 0.41 0.5 160 2 3 0 0 sales low
1 0.1 0.77 255 6 4 0 0 sales medium
2 0.92 0.85 262 5 5 0 0 sales medium
3 0.89 0.91 234 5 5 0 0 sales low
4 0.42 0.53 158 2 3 1 0 sales low
5 0.45 0.77 145 2 2 0 0 sales low