Python: Generate range of values for groupby values

82 views Asked by At

I am trying to generate test data using python. Let me know if i can achieve something like this. I will input a number, say 20, i need 20 value of rows to be generated. one column contains range of values for each of the group. Values should be generated within that range. (some doesnt have range of values and only string or values with commas which should be separated) Sample input and expected output below.

Sample input:

  VarName           Label  Score
0    LtoV  0.00<=to>=0.68     20
1    LtoV     0.69<=to>=2     33
2    LtoV         2toHigh     40
3     Age           0to20     36
4     Age          21to40     15
5     Age          41to60     50
6     Indicator     A         30
7     Indicator     B         40
8     Indicator     C         40
9     Oc Code       100,20,30 10
10    Oc Code       5,10,16   20

Output:

when i give value as 20, i need 20 rows to be generated for LtoV with column2 having any values between 0.00 to high and based on the value the corresponding scores should be given. For string column, the values should just be repeated for the 20 rows. For column with values separated as ",", it should separate the values and have each value put in a row. Sameway 20 rows for age also.

SNo    VarName    column2     Score
    
  1       LtoV       0.00        20     
  2       LtoV       0.42        20
  3       LtoV       1           33
  4       LtoV       2.5         40
...
 20       LtoV      50           40
`1`    Indicator     A         30
 2    Indicator     B         40
 3    Indicator     C         40
 4    Indicator     A         30
 5    Indicator     B         40
 6     Indicator     C         40
 7     Indicator     A         30
 8     Indicator     B         40
....
 20     Indicator     C         40`
 1    Oc Code       100        10
 2    Oc Code       20         10
 3    Oc Code       30         10
 4    Oc Code       5          20
5    Oc Code       10          20
6    Oc Code       15         20
....
20    Oc Code       5          20


generate 20 such values for LtoV and 20 such values for Age and based on the range corresponding value for score.
Is this feasible?

********************
1

There are 1 answers

3
mozway On

You could extract the lower/upper bound of the range, optionally set up a default low/high if missing (else dropna). Then sample 20 rows per group with replace=True and vectorially generate the random values with numpy.random.uniform:

df = pd.DataFrame({'VarName': ['LtoV', 'LtoV', 'LtoV', 'Age', 'Age', 'Age'],
                   'Label': ['0.00<=to>=0.68', '0.69<=to>=2', '2toHigh', '0to20', '21to40', '41to60'],
                   'Score': [20, 33, 40, 36, 15, 50]})

ranges = (df['Label']
          .str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
          .astype(float)
          .fillna({0: 0, 1: 100}) # default low/high
          .groupby(df['VarName']).sample(n=20, replace=True)
         )

low, high = ranges.to_numpy().T

out = (df.reindex(ranges.index)
         .assign(column2=np.random.uniform(low, high))
         .sort_index()
      )

Example output:

  VarName           Label  Score    column2
0    LtoV  0.00<=to>=0.68     20   0.670814
0    LtoV  0.00<=to>=0.68     20   0.222490
0    LtoV  0.00<=to>=0.68     20   0.364282
0    LtoV  0.00<=to>=0.68     20   0.607478
0    LtoV  0.00<=to>=0.68     20   0.633055
0    LtoV  0.00<=to>=0.68     20   0.675987
0    LtoV  0.00<=to>=0.68     20   0.531158
1    LtoV     0.69<=to>=2     33   1.843399
1    LtoV     0.69<=to>=2     33   1.940886
1    LtoV     0.69<=to>=2     33   1.760578
1    LtoV     0.69<=to>=2     33   1.110946
1    LtoV     0.69<=to>=2     33   1.724802
2    LtoV         2toHigh     40  19.508126
2    LtoV         2toHigh     40  43.618950
2    LtoV         2toHigh     40  62.808270
2    LtoV         2toHigh     40  95.246445
2    LtoV         2toHigh     40   2.536909
2    LtoV         2toHigh     40  29.841093
2    LtoV         2toHigh     40  73.958545
2    LtoV         2toHigh     40  89.782031
3     Age           0to20     36  12.593195
3     Age           0to20     36   0.110518
3     Age           0to20     36   5.308487
3     Age           0to20     36  12.914926
3     Age           0to20     36  17.198194
3     Age           0to20     36  17.529561
3     Age           0to20     36   0.954200
4     Age          21to40     15  28.722856
4     Age          21to40     15  30.446264
4     Age          21to40     15  26.812708
4     Age          21to40     15  34.001827
4     Age          21to40     15  30.052331
4     Age          21to40     15  28.470239
4     Age          21to40     15  30.371726
5     Age          41to60     50  50.095150
5     Age          41to60     50  57.618241
5     Age          41to60     50  56.214911
5     Age          41to60     50  58.106158
5     Age          41to60     50  41.220661
5     Age          41to60     50  49.831084

Intermediate:

(df['Label']
.str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
.astype(float)
.fillna({0: 0, 1: 100}) # default low/high
)

       0       1
0   0.00    0.68
1   0.69    2.00
2   2.00  100.00
3   0.00   20.00
4  21.00   40.00
5  41.00   60.00