I have a name column containing text in different languages.
In the end, I need everything in English.
Translating a single variable works, but how can I do this for a complete dataframe, for one or more columns?
from deep_translator import GoogleTranslator
reader = "df"
NAME_ORIG = "ich suche den namen"
translated = GoogleTranslator(source='auto', target='en').translate(NAME_ORIG)
Simple approach using UDF
As suggested here, use a UDF:
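A minimal sketch of the UDF approach, assuming a PySpark DataFrame whose text column is called NAME_ORIG (the column name, the `_EN` suffix and the helper names `translate_to_english` and `with_translation_udf` are placeholders, not taken from the question). The optional `translator_factory` argument is only there so the function can be tried without network access:

```python
def translate_to_english(text, translator_factory=None):
    """Translate one string to English; None and blank strings pass through."""
    if text is None or not text.strip():
        return text
    if translator_factory is None:
        # Imported lazily so the import also happens on the Spark executors.
        from deep_translator import GoogleTranslator
        translator_factory = lambda: GoogleTranslator(source="auto", target="en")
    # A new translator object is created on every call, i.e. once per row.
    return translator_factory().translate(text)


def with_translation_udf(df, column="NAME_ORIG"):
    """Add a '<column>_EN' column to a PySpark DataFrame via a UDF."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    translate_udf = F.udf(translate_to_english, StringType())
    return df.withColumn(column + "_EN", translate_udf(F.col(column)))
```

Calling `with_translation_udf(df)` should return the DataFrame with an additional NAME_ORIG_EN column. To translate more than one column, the helper can be applied once per column.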
Using mapPartitions
The UDF approach creates a new GoogleTranslator object for each row. The documentation of deep-translator points out that a translator object can be reused.
Reusing the translator object can be achieved by using mapPartitions. Only one translator object per partition will be created:
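A sketch of how this could look, again assuming a PySpark DataFrame with a NAME_ORIG column; `translate_partition` and `with_translation_partitions` are made-up helper names, and the `translator` parameter exists only so that a stand-in object can be passed for a dry run:

```python
def translate_partition(rows, translator=None):
    """Yield (original, translated) pairs for one partition of single-column rows."""
    if translator is None:
        from deep_translator import GoogleTranslator
        # Created once per partition and reused for every row in it.
        translator = GoogleTranslator(source="auto", target="en")
    for row in rows:
        text = row[0]
        # Skip the remote call for None or empty strings.
        translated = translator.translate(text) if text else text
        yield (text, translated)


def with_translation_partitions(df, column="NAME_ORIG"):
    """Return a DataFrame with the original column and its translation."""
    result = df.select(column).rdd.mapPartitions(translate_partition)
    return result.toDF([column, column + "_EN"])
```

Because Spark calls `translate_partition` once per partition with an iterator over the rows, the number of GoogleTranslator objects equals the number of partitions rather than the number of rows.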
Using mapPartitions with batches
The deep-translator API offers a translate_batch function. This function can also be used by preparing the batches inside of mapPartitions.
Using GoogleTranslator.translate_batch instead of GoogleTranslator.translate may or may not improve the performance further.
In all three approaches the output is the same.
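The batched variant could be sketched roughly as follows. The helper name `translate_partition_batched` and the batch size of 20 are arbitrary choices for illustration, and the sketch assumes the column contains no null values, since they would be passed to translate_batch unchanged:

```python
from itertools import islice


def translate_partition_batched(rows, translator=None, batch_size=20):
    """Yield (original, translated) pairs, translating batch_size rows per call."""
    if translator is None:
        from deep_translator import GoogleTranslator
        # One translator object per partition, as in the plain mapPartitions variant.
        translator = GoogleTranslator(source="auto", target="en")
    rows = iter(rows)
    while True:
        # Pull up to batch_size values of the single selected column.
        batch = [row[0] for row in islice(rows, batch_size)]
        if not batch:
            break
        translations = translator.translate_batch(batch)
        yield from zip(batch, translations)
```

It plugs into Spark the same way as the unbatched function, for example `df.select("NAME_ORIG").rdd.mapPartitions(translate_partition_batched).toDF(["NAME_ORIG", "NAME_ORIG_EN"])`.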