Pandas DataFrame, get 3 max values in the row and their column names


There are many examples on the forum showing how to find the max value of a row together with the corresponding column name. Some examples are here or here

What I want to do is a specific modification of those examples. My DataFrame looks like this, with all columns numbered from left to right (this order is very important):

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
  0   0   1   2   2   0   0   0   0    0
  4   4   0   4   4   1   0   0   0    0
  0   0   1   2   3   0   0   0   0    0

Now I want to add 6 new columns at the end of every row, holding the 3 biggest values in the row and the names of their columns.

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
  0   0   1   2   2   0   0   0   0    0
  4   4   0   4   4   1   0   0   0    0
  0   0   1   2   3   0   0   0   0    0

If a row has more than one maximum (for example the value 2 in the first row), I want Max1 to hold only the column name with the smallest index. In this case the second biggest value is also 2, but its column has a greater index. That is, each "Max(y)" column must hold exactly one column name. This is the main condition. If a row has more than 3 maximum values, only the 3 column names with the smallest indices should be kept. So the final output should look like this DF:

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
  0   0   1   2   2   0   0   0   0    0  x_4       2  x_5       2  x_3       1
  4   4   0   4   4   1   0   0   0    0  x_1       4  x_2       4  x_4       4
  0   0   1   2   3   0   0   0   0    0  x_5       3  x_4       2  x_3       1

To summarize the result: in the first row 4 < 5, so x_4 comes first (and the second 2 follows immediately in the next column). In the second row 1 < 2 < 4 < 5, but there are only 3 result columns, so x_5 is missing from the final result. In the third row the indices play no role, because the three biggest values are strictly different. This is also the main condition.


There are 5 answers

0
canaytore On BEST ANSWER

The following code first creates a copy of the dataframe, df_copy, in which the column names are replaced with their numeric positions (since, as you mentioned, the order is important). It then applies a function to each row to get the positions of the top 3 values; ties are resolved in favor of the leftmost column. These positions are mapped back to the original column names. Finally, it looks up the values for these columns and reorders the columns as intended.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x_1': [0, 4, 0],
    'x_2': [0, 4, 0],
    'x_3': [1, 0, 1],
    'x_4': [2, 4, 2],
    'x_5': [2, 4, 3],
    'x_6': [0, 1, 0],
    'x_7': [0, 0, 0],
    'x_8': [0, 0, 0],
    'x_9': [0, 0, 0],
    'x_10': [0, 0, 0]
})

# Create a copy of the dataframe and replace column names with their corresponding numeric index
df_copy = df.copy()
df_copy.columns = np.arange(len(df.columns))

# Apply a function to each row (axis=1) to get the indices of the top 3 max values
df[['Max1', 'Max2', 'Max3']] = df_copy.apply(lambda row: row.nlargest(3).index, axis=1, result_type='expand')

# Map the numeric indices back to column names
# (applymap is deprecated since pandas 2.1; DataFrame.map is its replacement)
df[['Max1', 'Max2', 'Max3']] = df[['Max1', 'Max2', 'Max3']].applymap(lambda x: df.columns[int(x)])

# Get the values
df[['ValMax1', 'ValMax2', 'ValMax3']] = df.apply(lambda row: [row[row['Max1']], row[row['Max2']], row[row['Max3']]], axis=1, result_type='expand')

# Reorder the columns
column_order = ['x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'Max1', 'ValMax1', 'Max2', 'ValMax2', 'Max3', 'ValMax3']
df = df[column_order]
df

Result (as intended):

x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
  0   0   1   2   2   0   0   0   0    0  x_4       2  x_5       2  x_3       1
  4   4   0   4   4   1   0   0   0    0  x_1       4  x_2       4  x_4       4
  0   0   1   2   3   0   0   0   0    0  x_5       3  x_4       2  x_3       1
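
As a side note, the numeric-index copy may not be strictly needed: Series.nlargest keeps the first occurrences in order of appearance by default (keep='first'), so it can be applied to the original column names directly. The following is only a minimal sketch under that assumption; top3 is a hypothetical helper name, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'x_1': [0, 4, 0], 'x_2': [0, 4, 0], 'x_3': [1, 0, 1], 'x_4': [2, 4, 2],
    'x_5': [2, 4, 3], 'x_6': [0, 1, 0], 'x_7': [0, 0, 0], 'x_8': [0, 0, 0],
    'x_9': [0, 0, 0], 'x_10': [0, 0, 0],
})

def top3(row):
    # Hypothetical helper: the 3 largest values of one row.
    # nlargest uses keep='first' by default, so ties go to the leftmost column.
    top = row.nlargest(3)
    out = {}
    for i, (name, value) in enumerate(top.items(), start=1):
        out[f'Max{i}'] = name
        out[f'ValMax{i}'] = value
    return pd.Series(out)

# The dict insertion order already gives Max1, ValMax1, Max2, ValMax2, ...
result = df.join(df.apply(top3, axis=1))
print(result)
```

This is still a row-wise apply, so it is not expected to be fast on large frames.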
0
Scott Boston On

You can also try something like this:

import pandas as pd

# Using @canaytore dataframe setup
df = pd.DataFrame({
    'x_1': [0, 4, 0],
    'x_2': [0, 4, 0],
    'x_3': [1, 0, 1],
    'x_4': [2, 4, 2],
    'x_5': [2, 4, 3],
    'x_6': [0, 1, 0],
    'x_7': [0, 0, 0],
    'x_8': [0, 0, 0],
    'x_9': [0, 0, 0],
    'x_10': [0, 0, 0]
})

n = 4 #Top N values
dfr = df.T.rank(method='first', ascending=False)\
          .stack().astype('int')\
          .rename('place').loc[lambda x: x<=n]\
          .reset_index()\
          .pivot(index='level_1', columns='place', values='level_0')\
          .add_prefix('Max')

idx = dfr.stack().reset_index(level=0).to_numpy().tolist()

dfv = df.stack().loc[idx]
dfv = pd.DataFrame(dfv.to_numpy().reshape(-1,n), 
                   columns=[f'Max{i}Value' for i in range(1,n+1)])

df_out = pd.concat([df, pd.concat([dfr, dfv], axis=1).sort_index(axis=1)], axis=1)

print(df_out)

Output:

   x_1  x_2  x_3  x_4  x_5  x_6  x_7  x_8  x_9  x_10 Max1  Max1Value Max2  Max2Value Max3  Max3Value Max4  Max4Value
0    0    0    1    2    2    0    0    0    0     0  x_4          2  x_5          2  x_3          1  x_1          0
1    4    4    0    4    4    1    0    0    0     0  x_1          4  x_2          4  x_4          4  x_5          4
2    0    0    1    2    3    0    0    0    0     0  x_5          3  x_4          2  x_3          1  x_1          0
0
wjandrea On

It seems to make more sense to do this in NumPy, then get the column names at the end.

I wrote a function you can use to get the top n indexes of an array. It works by using np.nanargmax then masking the values as NaN before doing it again. (There may be better ways to do this, but this is just what I thought of first.)

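import numpy as np

def argmax_n(arr: np.ndarray, n: int, axis=None):
    # work on a float copy so the values can be masked with NaN
    arr = arr.astype('float')
    argmaxes = []
    for _ in range(n):
        argmax = np.nanargmax(arr, axis=axis, keepdims=True)
        argmaxes.append(argmax)
        # mask the current maximum so the next pass finds the next one
        np.put_along_axis(arr, argmax, np.nan, axis=axis)
    return argmaxes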

Used like so:

a = df.to_numpy()
argmax_3 = argmax_n(a, 3, axis=1)

Then you can build the DataFrame you want and .join it with the original if needed.

max_data = {}
for i, arg in enumerate(argmax_3, start=1):
    max_data[f'Max{i}'] = df.columns[arg.flatten()]
    max_data[f'ValMax{i}'] = np.take_along_axis(a, arg, axis=1).flatten()

pd.DataFrame(max_data)
  Max1  ValMax1 Max2  ValMax2 Max3  ValMax3
0  x_4        2  x_5        2  x_3        1
1  x_1        4  x_2        4  x_4        4
2  x_5        3  x_4        2  x_3        1
1
mozway On

For an efficient approach you need to vectorize; for that, use NumPy's argpartition and indexing:

import numpy as np

N = 3

# convert to arrays
# and reverse to preserve order
# of min index in case of a tie
cols = df.columns.to_numpy()[::-1]
a = df.loc[:, ::-1].to_numpy()

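# get the top N indices
# (note: argpartition does not guarantee any order inside the top-N block;
# use a stable argsort if the exact order and tie-breaking must be guaranteed)
idx = np.argpartition(a, -N)[:, :-N-1:-1]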

# get the top names 
names = cols[idx]

# get the top values
values = np.take_along_axis(a, idx, axis=1)
# or
values = a[np.arange(len(a))[:,None], idx]

# assign to new columns
df[[f'{x}{i+1}' for i in range(N) for x in ['Max', 'ValMax']]
  ] = (np.dstack([names,  values])
         .reshape(len(df), -1)
       )

Output:

   x_1  x_2  x_3  x_4  x_5  x_6  x_7  x_8  x_9  x_10 Max1 ValMax1 Max2 ValMax2 Max3 ValMax3
0    0    0    1    2    2    0    0    0    0     0  x_4       2  x_5       2  x_3       1
1    4    4    0    4    4    1    0    0    0     0  x_1       4  x_2       4  x_4       4
2    0    0    1    2    3    0    0    0    0     0  x_5       3  x_4       2  x_3       1
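
One caveat: np.argpartition leaves the order of elements inside each partition undefined, so the ordering of the top N and the tie-breaking are not guaranteed in general. If the exact order matters, a stable argsort on the negated values is one possible alternative. This is only a sketch, assuming signed numeric data without NaN; it costs a full sort per row instead of a partition.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x_1': [0, 4, 0], 'x_2': [0, 4, 0], 'x_3': [1, 0, 1], 'x_4': [2, 4, 2],
    'x_5': [2, 4, 3], 'x_6': [0, 1, 0], 'x_7': [0, 0, 0], 'x_8': [0, 0, 0],
    'x_9': [0, 0, 0], 'x_10': [0, 0, 0],
})

N = 3
a = df.to_numpy()
cols = df.columns.to_numpy()

# Stable sort of the negated values: descending by value,
# and equal values keep their left-to-right column order.
idx = np.argsort(-a, axis=1, kind='stable')[:, :N]

names = cols[idx]                             # top-N column names per row
values = np.take_along_axis(a, idx, axis=1)   # matching values

out = df.copy()
for i in range(N):
    out[f'Max{i+1}'] = names[:, i]
    out[f'ValMax{i+1}'] = values[:, i]

print(out)
```

Assigning the columns one pair at a time keeps the value columns numeric instead of mixing names and values in a single object array.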
2
BENY On

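You may try with rank:

# ascending rank >= 8 keeps the 3 highest-ranked of the 10 columns in each row
rnk = df.rank(method = 'first',axis=1)>=8
# boolean indexing returns the selected values row by row, in column order
value = df.to_numpy()[rnk].reshape(3,-1)
name = rnk.dot(rnk.columns+',').str[:-1].str.split(',')

To create the df, you can use the same method as mozway.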
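
A rank-based variant that returns the names in Max1, Max2, Max3 order could rank in descending order instead, so that rank 1 is the largest value and method='first' gives ties to the leftmost column. The following is only a sketch, assuming numeric data without NaN; ranks, is_kth and new_cols are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'x_1': [0, 4, 0], 'x_2': [0, 4, 0], 'x_3': [1, 0, 1], 'x_4': [2, 4, 2],
    'x_5': [2, 4, 3], 'x_6': [0, 1, 0], 'x_7': [0, 0, 0], 'x_8': [0, 0, 0],
    'x_9': [0, 0, 0], 'x_10': [0, 0, 0],
})

N = 3

# Descending rank per row; method='first' ranks tied values
# in their order of appearance (left to right).
ranks = df.rank(method='first', axis=1, ascending=False)

new_cols = {}
for k in range(1, N + 1):
    is_kth = ranks.eq(k)  # exactly one True per row (no NaN assumed)
    # idxmax on a boolean frame returns the column of the first True
    new_cols[f'Max{k}'] = is_kth.idxmax(axis=1)
    # boolean indexing yields one value per row, in row order
    new_cols[f'ValMax{k}'] = df.to_numpy()[is_kth.to_numpy()]

result = df.join(pd.DataFrame(new_cols, index=df.index))
print(result)
```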