Populate NaN cells in a DataFrame from a reference table, based on specific row and column values


I have two tables. The first reference table is below:

| Name | Target  | Bonus |
|------|--------:|------:|
| Joe  |      40 |    46 |
| Phil |      38 |    42 |
| Dean |      65 |    70 |

The Python code to generate the table is:

import pandas as pd

# Data for the reference table
data = {
    'Name': ['Joe', 'Phil', 'Dean'],
    'Target': [40, 38, 65],
    'Bonus': [46, 42, 70]
}

# Creating the DataFrame
ref = pd.DataFrame(data)

My second table is below:

| week       | Metrics | Joe | Dean |
|------------|---------|----:|-----:|
| 11/6/2023  | Target  |  40 |   65 |
| 11/6/2023  | Bonus   |  46 |   70 |
| 11/6/2023  | Score   |  33 |   71 |
| 11/13/2023 | Target  |  40 |  NaN |
| 11/13/2023 | Bonus   |  46 |  NaN |
| 11/13/2023 | Score   |  45 |  NaN |
| 11/20/2023 | Target  |  40 |   65 |
| 11/20/2023 | Bonus   |  46 |   70 |
| 11/20/2023 | Score   |  35 |   68 |
| 11/27/2023 | Target  | NaN |   65 |
| 11/27/2023 | Bonus   | NaN |   70 |
| 11/27/2023 | Score   | NaN |   44 |
| 12/4/2023  | Target  |  40 |   65 |
| 12/4/2023  | Bonus   |  46 |   70 |
| 12/4/2023  | Score   |  42 |   66 |

The Python code to generate this table is:

# Data for the new table
data = {
    'week': ['11/6/2023', '11/6/2023', '11/6/2023', '11/13/2023', '11/13/2023', '11/13/2023',
             '11/20/2023', '11/20/2023', '11/20/2023', '11/27/2023', '11/27/2023', '11/27/2023',
             '12/4/2023', '12/4/2023', '12/4/2023'],
    'Metrics': ['Target', 'Bonus', 'Score', 'Target', 'Bonus', 'Score',
                'Target', 'Bonus', 'Score', 'Target', 'Bonus', 'Score',
                'Target', 'Bonus', 'Score'],
    'Joe': [40, 46, 33, 40, 46, 45, 40, 46, 35, None, None, None, 40, 46, 42],
    'Dean': [65, 70, 71, None, None, None, 65, 70, 68, 65, 70, 44, 65, 70, 66]
}

# Creating the DataFrame
df = pd.DataFrame(data)

As you can see, Dean has a week where his Target, Bonus, and Score cells are blank, and so does Joe in a later week. Wherever a cell is NaN, I want to populate it using the following rules:

  • Get Target and Bonus cell values for each person from the first reference table and populate the NaN cell accordingly.
  • Set the Score cell equal to the Target cell value for the person.

My desired output table would look like this:

| week       | Metrics | Joe | Dean |
|------------|---------|----:|-----:|
| 11/6/2023  | Target  |  40 |   65 |
| 11/6/2023  | Bonus   |  46 |   70 |
| 11/6/2023  | Score   |  33 |   71 |
| 11/13/2023 | Target  |  40 |   65 |
| 11/13/2023 | Bonus   |  46 |   70 |
| 11/13/2023 | Score   |  45 |   65 |
| 11/20/2023 | Target  |  40 |   65 |
| 11/20/2023 | Bonus   |  46 |   70 |
| 11/20/2023 | Score   |  35 |   68 |
| 11/27/2023 | Target  |  40 |   65 |
| 11/27/2023 | Bonus   |  46 |   70 |
| 11/27/2023 | Score   |  40 |   44 |
| 12/4/2023  | Target  |  40 |   65 |
| 12/4/2023  | Bonus   |  46 |   70 |
| 12/4/2023  | Score   |  42 |   66 |

There are 4 answers

PaulS (accepted answer)

At most one block of NaN per column

Another possible solution: loop through the df columns corresponding to each person and, for each block of NaN (selected with loc), assign the corresponding block of values from ref (also selected with loc):

names = ['Joe', 'Dean']

# Add a Score column equal to Target, so d holds one value per metric row
d = ref.assign(Score = ref['Target'])

for x in names:
    # Assign the person's Target/Bonus/Score block to the NaN block in their column
    df.loc[df[x].isna(), x] = d.loc[d['Name'].eq(x), 'Target':'Score'].T.values

General case

In case there is more than a single block of NaN per person, we need to change the code slightly:

import numpy as np

names = ['Joe', 'Dean']

d = ref.assign(Score = ref['Target'])

for x in names:
    # Number of complete Target/Bonus/Score blocks missing for this person
    n_blocks = df[x].isna().sum() // 3
    # Repeat the person's [Target, Bonus, Score] values once per missing block
    df.loc[df[x].isna(), x] = np.tile(d.loc[d['Name'].eq(x), 'Target':'Score']
                                      .values.flatten(), n_blocks)
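
For context, np.tile simply repeats the person's flattened [Target, Bonus, Score] values once per missing block, so every NaN block for the same person receives identical fill values. A minimal illustration, using Dean's values from ref:

import numpy as np

# Dean's fill values (Target, Bonus, Score) from the reference table
vals = np.array([65, 70, 65])
print(np.tile(vals, 2))   # [65 70 65 65 70 65] -- enough for two NaN blocks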

Edit

To satisfy a new requirement from the OP (the metrics appear in the order Bonus, Target, Score rather than Target, Bonus, Score), the previous code needs a small adjustment:

names = ['Joe', 'Dean']

d = ref.assign(Score = ref['Target'])
d = d[['Name', 'Bonus', 'Target', 'Score']]

for x in names:
    n_blocks = df[x].isna().sum() // 3
    df.loc[df[x].isna(), x] = np.tile(d.loc[d['Name'].eq(x), 'Bonus':'Score']
                                      .values.flatten(), n_blocks)

Output:

          week Metrics   Joe  Dean
0    11/6/2023  Target  40.0  65.0
1    11/6/2023   Bonus  46.0  70.0
2    11/6/2023   Score  33.0  71.0
3   11/13/2023  Target  40.0  65.0
4   11/13/2023   Bonus  46.0  70.0
5   11/13/2023   Score  45.0  65.0
6   11/20/2023  Target  40.0  65.0
7   11/20/2023   Bonus  46.0  70.0
8   11/20/2023   Score  35.0  68.0
9   11/27/2023  Target  40.0  65.0
10  11/27/2023   Bonus  46.0  70.0
11  11/27/2023   Score  40.0  44.0
12   12/4/2023  Target  40.0  65.0
13   12/4/2023   Bonus  46.0  70.0
14   12/4/2023   Score  42.0  66.0
Saša Petaliankar

I have renamed the second DataFrame to df2 (and use df for the reference table), since the two cannot share the same name:

# Iterate over each row in df2
for i, row in df2.iterrows():
    # For each person
    for person in ['Joe', 'Dean']:
        # If the value is NaN
        if pd.isnull(row[person]):
            # If the metric is 'Score', use the 'Target' value
            if row['Metrics'] == 'Score':
                value = df.loc[df['Name'] == person, 'Target'].values[0]
            # Otherwise, check if the metric exists in df and use its value
            elif row['Metrics'] in df.columns:
                value = df.loc[df['Name'] == person, row['Metrics']].values[0]
            else:
                continue  # Skip if the metric is not in df and is not 'Score'
            # Replace the NaN value in df2
            df2.at[i, person] = value

This should work for your purpose.
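
For clarity, here is a sketch of the renaming this answer seems to assume, expressed with the question's original variable names:

# Sketch: map the question's variables onto this answer's names
df2 = df.copy()   # the weekly table (called `df` in the question)
df = ref          # the reference table (called `ref` in the question)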

ombk
data = pd.DataFrame({
        'Name': ['Joe', 'Phil', 'Dean'],
        'Target': [40, 38, 65],
        'Bonus': [46, 42, 70]
    })
data["Score"] = data["Target"]
transposed = data.set_index('Name').transpose().rename(columns=data['Name'].to_dict())
#
Name    Joe  Phil  Dean
Target   40    38    65
Bonus    46    42    70
Score    40    38    65
#

# data2 is the weekly table from the question (called `df` there)
data2 = data2.merge(transposed[["Joe","Dean"]], how="left", right_index=True, left_on="Metrics", suffixes=("","_filler"))
data2["Joe"] = data2["Joe"].fillna(data2["Joe_filler"])
data2["Dean"] = data2["Dean"].fillna(data2["Dean_filler"])
data2.drop(columns=["Joe_filler","Dean_filler"])
#
          week Metrics   Joe  Dean  Joe_filler  Dean_filler
0    11/6/2023  Target  40.0  65.0          40           65
1    11/6/2023   Bonus  46.0  70.0          46           70
2    11/6/2023   Score  33.0  71.0          40           65
3   11/13/2023  Target  40.0  65.0          40           65
4   11/13/2023   Bonus  46.0  70.0          46           70
5   11/13/2023   Score  45.0  65.0          40           65
6   11/20/2023  Target  40.0  65.0          40           65
7   11/20/2023   Bonus  46.0  70.0          46           70
8   11/20/2023   Score  35.0  68.0          40           65
9   11/27/2023  Target  40.0  65.0          40           65
10  11/27/2023   Bonus  46.0  70.0          46           70
11  11/27/2023   Score  40.0  44.0          40           65
12   12/4/2023  Target  40.0  65.0          40           65
13   12/4/2023   Bonus  46.0  70.0          46           70
14   12/4/2023   Score  42.0  66.0          40           65

I kept the last two columns (didn't drop them in place) so you can see what's happening with those left merges and how the fillna works.

There is probably a cleaner, more compact solution, but this may be useful as inspiration.

# solution 2
import numpy as np

data = pd.DataFrame({
    'Name': ['Joe', 'Phil', 'Dean'],
    'Target': [40, 38, 65],
    'Bonus': [46, 42, 70]
})
data["Score"] = data["Target"]

transposed = data.set_index('Name').transpose().rename(columns=data['Name'].to_dict())
data2["Joe"] = np.where(data2["Joe"].isna(), data2["Metrics"].map(transposed["Joe"].to_dict()),data2["Joe"])
data2["Dean"] = np.where(data2["Dean"].isna(), data2["Metrics"].map(transposed["Dean"].to_dict()),data2["Dean"])

ziying35

try this:

1. Create a reference map from 'ref':

ref['Score'] = ref['Target']
ref.set_index('Name', inplace=True)
ref_map = ref.to_dict('index')
print(ref_map)
>>>
{'Joe': {'Target': 40, 'Bonus': 46, 'Score': 40},
 'Phil': {'Target': 38, 'Bonus': 42, 'Score': 38},
 'Dean': {'Target': 65, 'Bonus': 70, 'Score': 65}}

2. Fill the NaN cells in each specified person's column of DataFrame 'df' with the corresponding 'Metrics' label:

columns_to_fill = ['Joe', 'Dean']

df[columns_to_fill] = df[columns_to_fill].apply(lambda x: x.fillna(df['Metrics']))
print(df)
>>>
          week Metrics     Joe    Dean
0    11/6/2023  Target    40.0    65.0
1    11/6/2023   Bonus    46.0    70.0
2    11/6/2023   Score    33.0    71.0
3   11/13/2023  Target    40.0  Target
4   11/13/2023   Bonus    46.0   Bonus
5   11/13/2023   Score    45.0   Score
6   11/20/2023  Target    40.0    65.0
7   11/20/2023   Bonus    46.0    70.0
8   11/20/2023   Score    35.0    68.0
9   11/27/2023  Target  Target    65.0
10  11/27/2023   Bonus   Bonus    70.0
11  11/27/2023   Score   Score    44.0
12   12/4/2023  Target    40.0    65.0
13   12/4/2023   Bonus    46.0    70.0
14   12/4/2023   Score    42.0    66.0

3. Replace values in 'df' using the reference map 'ref_map'

result = df.replace(ref_map)
print(result)
>>>
          week Metrics   Joe  Dean
0    11/6/2023  Target  40.0  65.0
1    11/6/2023   Bonus  46.0  70.0
2    11/6/2023   Score  33.0  71.0
3   11/13/2023  Target  40.0  65.0
4   11/13/2023   Bonus  46.0  70.0
5   11/13/2023   Score  45.0  65.0
6   11/20/2023  Target  40.0  65.0
7   11/20/2023   Bonus  46.0  70.0
8   11/20/2023   Score  35.0  68.0
9   11/27/2023  Target  40.0  65.0
10  11/27/2023   Bonus  46.0  70.0
11  11/27/2023   Score  40.0  44.0
12   12/4/2023  Target  40.0  65.0
13   12/4/2023   Bonus  46.0  70.0
14   12/4/2023   Score  42.0  66.0
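
The replace in the last step works because DataFrame.replace with a nested dict is applied column-wise: the outer keys select columns ('Joe', 'Dean') and the inner dicts map the metric labels inserted in step 2 to that person's numbers. A tiny sketch of that behaviour, on a hypothetical mini-frame:

# Sketch: nested-dict replace works column by column
mini = pd.DataFrame({'Joe': ['Target', 33], 'Dean': ['Score', 71]})
print(mini.replace({'Joe': {'Target': 40}, 'Dean': {'Score': 65}}))
#   Joe Dean
# 0  40   65
# 1  33   71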