I'm extracting information from URLs (RSS feeds) to create one big data frame with all the data I need for sentiment analysis. I wrote a function that takes each URL from a dictionary, parses it, and puts the entries into a DataFrame. After 5 iterations, I get the error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I'm using a dictionary like {'name': 'url'} with the code below:
import feedparser
import pandas as pd

def extract_content(urls):
    df_final = pd.DataFrame()
    for url in urls.values():
        # Parse the feed and load its entries into a DataFrame
        xml = feedparser.parse(url)
        entries = xml['entries']
        df = pd.DataFrame(entries)
        # Normalize the column holding the article text to 'content'
        if 'media_content' in df.columns:
            df.rename(columns={'media_content': 'content'}, inplace=True)
        if 'content' not in df.columns:
            df.rename(columns={'summary': 'content'}, inplace=True)
        # Keep only the columns needed for the sentiment analysis
        df = df[['title', 'link', 'published', 'published_parsed', 'content']]
        df_final = pd.concat([df_final, df]).reset_index(drop=True)
    return df_final
How can I fix it?
I tried reset_index(), but it still doesn't work.
Possible duplicated column name
I think it comes from a duplicate column name. For example, code along the following lines reproduces the error (the sample values are only illustrative):
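import pandas as pd

# Two toy DataFrames; the values are only illustrative
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df2 = pd.DataFrame({'A': [7, 8], 'B': [9, 10]})

# Renaming 'C' to 'A' silently creates two columns named 'A'
df.rename(columns={'C': 'A'}, inplace=True)

# The duplicate only causes trouble at concatenation time:
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
df_final = pd.concat([df, df2]).reset_index(drop=True)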
In this code, I first rename column 'C' to 'A' in the DataFrame df. The renaming itself doesn't throw any error even though there is already a column named 'A', but the concatenation then fails with 'InvalidIndexError: Reindexing only valid with uniquely valued Index objects' because of the duplicated column name. I think that is what happens in your case when you rename the column 'media_content' to 'content': you never check whether a column named 'content' already exists in the DataFrame df. If it does, the rename creates a duplicate, which produces the reported error during concatenation.
I see two possible solutions here:
Solution 1
You remove the duplicated column before concatenation:
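With the toy DataFrames above, that looks like this (in your function, the same deduplication line would go just before the pd.concat call):

# Keep only the first occurrence of each duplicated column name
df = df.loc[:, ~df.columns.duplicated()]
df_final = pd.concat([df, df2]).reset_index(drop=True)
print(df_final)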
This runs without error and, with the illustrative values above, prints the expected output (only the first column named 'A' is kept):
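   A   B
0  1   3
1  2   4
2  7   9
3  8  10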
Solution 2
You only rename a column if the desired name doesn't already exist as a column name in the DataFrame df:
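On the toy example, and then applied to your extract_content function, that could look like this (a sketch of the idea, keeping the rest of your loop unchanged):

# Toy example: rename 'C' to 'A' only if there is no 'A' column yet
if 'A' not in df.columns:
    df.rename(columns={'C': 'A'}, inplace=True)

# In your loop: only promote 'media_content' to 'content'
# when no 'content' column exists yet
if 'media_content' in df.columns and 'content' not in df.columns:
    df.rename(columns={'media_content': 'content'}, inplace=True)
if 'content' not in df.columns:
    df.rename(columns={'summary': 'content'}, inplace=True)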