I am attempting to convert some code my organization uses from pandas DataFrames to pandas API on Spark DataFrames. We have run into a problem converting our pivot calls: pandas API on Spark does not allow pivot operations on string columns. I have recreated a small example to demonstrate the problem. The following pandas operation completes successfully.
import pandas as pd
pd_df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                      'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                      'baz': [1, 2, 3, 4, 5, 6],
                      'zoo': [10, 11, 12, 13, 14, 15]})
pd_df = pd_df.pivot(columns='bar').sort_index()
pd_df.head()
The output of this is the following:
| index | foo | - | - | baz | - | - | zoo | - | - |
|---|---|---|---|---|---|---|---|---|---|
| bar | A | B | C | A | B | C | A | B | C |
| 0 | one | NaN | NaN | 1.0 | NaN | NaN | 10.0 | NaN | NaN |
| 1 | NaN | one | NaN | NaN | 2.0 | NaN | NaN | 11.0 | NaN |
| 2 | NaN | NaN | one | NaN | NaN | 3.0 | NaN | NaN | 12.0 |
| 3 | two | NaN | NaN | 4.0 | NaN | NaN | 13.0 | NaN | NaN |
| 4 | NaN | two | NaN | NaN | 5.0 | NaN | NaN | 14.0 | NaN |
| 5 | NaN | NaN | two | NaN | NaN | 6.0 | NaN | NaN | 15.0 |
Note that the bar, A, B, C row represents our multi-indexed column names and is part of the header, not the data.
While the pandas version runs without a problem, our pandas API on Spark pivot fails with the error "values should be a numeric type". This is because our 'foo' column contains strings.
import pyspark.pandas as ps
ps_df = ps.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                      'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                      'baz': [1, 2, 3, 4, 5, 6],
                      'zoo': [10, 11, 12, 13, 14, 15]})
ps_df = ps_df.pivot(columns='bar', values=['foo', 'baz', 'zoo']).sort_index()
ps_df.head()
I am trying to figure out how to reproduce the output of the pandas pivot call using pandas API on Spark. Is there any way to do this? I have seen this question (Pivot String column on Pyspark Dataframe), but it uses PySpark rather than pandas API on Spark.
I want to stick to pandas API on Spark as much as possible, since much of our team isn't familiar with Spark, so I was hoping to find a solution that does not require it. Does anyone know a way to perform this operation in pandas API on Spark?
I was able to find a way around this by creating a dictionary for each string column containing that column's unique values, and mapping each unique value to an integer. From there we can run the regular pivot function on the encoded columns and then convert back. This has only been tested with one pivot column, which was sufficient for my purposes. It was also only tested with an explicit column named 'index' passed as the pivot index, so that may be necessary too.
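Here is a minimal sketch of the encoding step, using the toy frame from the question. The names (string_cols, encode_maps, decode_maps) and the integer-code scheme are just illustrative; our real code loops over every non-numeric column.

import pyspark.pandas as ps
ps_df = ps.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                      'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                      'baz': [1, 2, 3, 4, 5, 6],
                      'zoo': [10, 11, 12, 13, 14, 15]})
string_cols = ['foo']  # in our real code: every non-numeric column
encode_maps = {}       # column name -> {original value: integer code}
decode_maps = {}       # column name -> {integer code: original value}
for col in string_cols:
    # Pulling the distinct values back to the driver is the slow step on large data.
    uniques = ps_df[col].drop_duplicates().to_pandas().tolist()
    encode_maps[col] = {val: code for code, val in enumerate(uniques)}
    decode_maps[col] = {code: val for val, code in encode_maps[col].items()}
    # Replace the strings with their integer codes so the pivot sees numeric values.
    ps_df[col] = ps_df[col].map(encode_maps[col])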
Note that the runtime is quite slow, mainly due to the dictionary creation step. It took about 20 minutes to create 4 dictionaries from our non-numeric columns, with a total of about 4,000 elements across all dictionaries. The remaining steps run in under a minute, so aside from this the approach works and produces the desired answer.
Also note that we rename the columns so they do not contain underscores, then rename them back to the original names at the end. The pivot operation did not handle column names containing underscores, so the rename works around this.
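Putting the rest together, here is a hedged sketch of the pivot-and-decode step, continuing from the variables in the sketch above. The reset_index/'index' column and the tuple-keyed column assignment reflect what worked for me, not a verified general recipe; the toy columns have no underscores, so the rename step is only indicated in a comment.

# Our real columns contain underscores, so we rename them to underscore-free
# placeholders here and restore the original names at the very end; the toy
# columns have none, so that step is omitted.
ps_df = ps_df.reset_index()  # gives an explicit 'index' column to pivot on
pivoted = ps_df.pivot(index='index',
                      columns='bar',
                      values=['foo', 'baz', 'zoo']).sort_index()
# Map the integer codes in the pivoted string columns back to the original
# values. This assumes assigning through a (top level, sub level) tuple key
# behaves as it does in plain pandas.
for col, decode in decode_maps.items():
    for sub in pivoted[col].columns:  # e.g. 'A', 'B', 'C'
        pivoted[(col, sub)] = pivoted[(col, sub)].map(decode)
pivoted.head()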