pandas dataset transformation to normalize the data


I have a CSV file like this: Input DataFrame

I want to transform it into a pandas dataframe like this: Output DataFrame

Basically, I'm trying to normalize the dataset so that I can populate a SQL table.

I have used json_normalize to create a separate dataset from the genres column, but I'm at a loss over how to transform both columns as shown in the depiction above.

Some suggestions would be highly appreciated.

ManojK On BEST ANSWER

If the genre_id is the only numeric value (as shown in the picture), you can use the following:

# Find every run of digits in the genres column and join each row's matches into one comma-separated string.
df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)

# Split that string back into a list, then use pandas.DataFrame.explode to get one row per genre_id.
df = df.assign(genre_id=df.genre_id.str.split(',')).explode('genre_id')

# Finally, strip the leading space left over from the ', ' separator (genre_id stays a string).
df['genre_id'] = df['genre_id'].str.lstrip()

# If required, keep only these two columns in the DataFrame.
df = df[['id', 'genre_id']]
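
If the genres column might contain other digits (for example in a genre name), parsing the cell instead of using a regex may be safer. The following is only a sketch, not the method above: it assumes each genres cell is a string holding a Python-style list of dicts with 'id' and 'name' keys, which is a guess because the input is only shown as an image. The sample rows and the names movie_genres and genres are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical sample data; the real CSV layout is assumed, not known.
df = pd.DataFrame({
    "id": [862, 8844],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 12, 'name': 'Adventure'}]",
    ],
})

# Parse each string into a list of dicts, then emit one row per (movie, genre).
exploded = (
    df.assign(genres=df["genres"].apply(ast.literal_eval))
      .explode("genres")
      .dropna(subset=["genres"])   # rows whose genre list was empty
      .reset_index(drop=True)
)

# Turn the dicts into columns ('id' and 'name' under the assumption above).
genre_fields = pd.json_normalize(exploded["genres"].tolist())

# Junction table: one row per movie id / genre id pair.
movie_genres = pd.DataFrame({
    "id": exploded["id"],
    "genre_id": genre_fields["id"].astype(int),
})

# Lookup table: one row per distinct genre.
genres = (
    genre_fields.rename(columns={"id": "genre_id"})
                .drop_duplicates()
                .reset_index(drop=True)
)
```

Here movie_genres corresponds to the two-column result above, with genre_id as an integer rather than a string, and genres could populate a separate SQL lookup table.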