I'm trying to translate a DataFrame column using the Google Translate API (via the googletrans package). When I apply the UDF, I get the following error: PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object.
However, when I call the translator on a single string, it works properly.
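The TypeError part of the message can be reproduced without Spark or googletrans. The following is a minimal sketch, assuming the failure comes from an object that holds a thread lock internally; `FakeClient` and `try_pickle` are hypothetical names used only for illustration, and whether the real Translator fails for exactly this reason is an assumption.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for an object that keeps a lock as an attribute."""

    def __init__(self):
        self._lock = threading.RLock()


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


# Pickling the instance also pickles its attributes, including the RLock.
error_message = try_pickle(FakeClient())
print(error_message)
```

Spark has to serialize the UDF together with every object it references before sending it to the workers, which would explain why a direct call on the driver works while the UDF does not.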
Here is my current code:
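!pip install googletrans==3.1.0a0
from googletrans import Translator
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructField, StructType
# Translator instance used inside translate_to_english below
translator = Translator()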
# Sample data
data = [("Dies ist der Text auf Deutsch.",),
        ("Este es el texto en español.",),
        ("Ceci est le texte en français.",),
        ("Questo è il testo in italiano.",)]
# Define the schema with a single StringType column named "TEXT"
schema = StructType([StructField("TEXT", StringType(), True)])
# Create the DataFrame with the specified schema
df = spark.createDataFrame(data, schema)
# Function to translate to english
def translate_to_english(text):
    # Tokenize text (nlp is assumed to be a spaCy-style tokenizer pipeline loaded earlier)
    tokens = nlp(text)
    # Initialize an empty string to store the translated text
    translated_text = ''
    for token in tokens:
        try:
            # Translate the token text
            translated = translator.translate(token.text, dest='en')
            # Append the translated word and a space
            translated_text += translated.text + ' '
        except Exception:
            # If translation fails, use the original word
            translated_text += token.text + ' '
    # Remove the trailing space and return the translated text
    return translated_text.strip()
# Register the UDF
translate_to_english_udf = udf(translate_to_english, StringType())
# Apply the UDF to dataframe
df = df.withColumn("TRANSLATED_TEXT", translate_to_english_udf(df["TEXT"]))
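One possible direction, sketched under the assumption that the error comes from the driver-side `translator` object being captured by the UDF, is to build the translator inside the function that runs on the workers. The names `translate_text`, `translator_factory`, and `FakeTranslator` are hypothetical; the fake class only stands in for googletrans so the sketch runs without network access, and the Spark wiring in the trailing comments is untested here.

```python
from types import SimpleNamespace


def translate_text(text, translator_factory, dest="en"):
    """Translate one string, creating the translator inside the call.

    translator_factory is a zero-argument callable (for example the
    Translator class itself), so no translator instance has to be
    serialized along with the function.
    """
    if text is None:
        return None
    translator = translator_factory()
    try:
        return translator.translate(text, dest=dest).text
    except Exception:
        # Fall back to the original text if translation fails
        return text


class FakeTranslator:
    """Stand-in for googletrans.Translator, for local demonstration only."""

    def translate(self, text, dest="en"):
        return SimpleNamespace(text=text.upper())


demo = translate_text("hallo", FakeTranslator)
print(demo)

# Assumed Spark wiring (not executed here; needs pyspark and googletrans):
#   from googletrans import Translator
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   translate_udf = udf(lambda t: translate_text(t, Translator), StringType())
#   df = df.withColumn("TRANSLATED_TEXT", translate_udf(df["TEXT"]))
```

Creating a translator per row is likely slow. The same idea applies to any tokenizer used inside the UDF, which would also need to be created or loaded on the workers rather than captured from the driver.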