Translating the whole dataframe with googletrans is taking too long


I'm trying to translate all the elements of a dataframe from Persian to English. I'm using the code below, but it takes a long time to run. Is there a quicker way?

import pandas as pd
from googletrans import Translator

exl_file = 'data.xlsx'
df = pd.read_excel(exl_file)

translator = Translator()

df_en = df.copy()

# Translate the column names
df_en.rename(columns=lambda x: translator.translate(x).text, inplace=True)

df_en.columns

# Translate each unique value once per column, then replace across the frame
translations = {}
for column in df_en.columns:
    unique_elements = df_en[column].unique()
    for element in unique_elements:
        translations[element] = translator.translate(element).text

df_en.replace(translations, inplace=True)

df_en.to_csv('en_data.csv', index=False)

There are 2 answers

Anas Altarazi

I'm not an expert in this domain, but I can suggest a few enhancements:

  • If the Excel file is very large, split it into segments so you can translate one segment and push it to the next step while the following segment is being processed.
  • Don't copy the df; instead, use a mapping function to transform the values.
  • If there is a way to add a caching feature around the translator calls, it would be great for performance (see the sketch after this list).
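For the caching point, here is a minimal sketch using functools.lru_cache, assuming the df from the question and that every cell is a Persian string (the function name translate_cached is illustrative, not part of googletrans):

from functools import lru_cache
from googletrans import Translator

translator = Translator()

@lru_cache(maxsize=None)
def translate_cached(text):
    # Each distinct string hits the API once; repeats are served from the cache
    return translator.translate(text, src='fa', dest='en').text

# Apply to every cell; duplicated values are only translated once
df_en = df.applymap(lambda x: translate_cached(x) if isinstance(x, str) else x)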
valentinmk

Besides the pandas optimisations that could be made, I believe the slowest part is fetching results from the Google Translate API.

But the library already provides a bulk-translate feature; try using it in your script.

https://py-googletrans.readthedocs.io/en/latest/#advanced-usage-bulk
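As an example, here is a sketch of how the bulk call could replace the per-element loop, assuming the same data.xlsx as in the question and that every cell is a Persian string:

import pandas as pd
from googletrans import Translator

df = pd.read_excel('data.xlsx')
translator = Translator()

# All distinct cell values across the frame
unique_texts = pd.unique(df.values.ravel()).tolist()

# One bulk call instead of one API round-trip per element;
# each result keeps the original text in .origin and the translation in .text
results = translator.translate(unique_texts, src='fa', dest='en')
mapping = {r.origin: r.text for r in results}

df_en = df.replace(mapping)
df_en.to_csv('en_data.csv', index=False)

Column names can still be translated separately (e.g. with the rename call from the question), but moving the cell values to a single bulk request removes most of the per-element network latency.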