I'm trying to clean up a messy database in which the first-name field contains a lot of middle names and initials. I'm having trouble with about 200k Vietnamese names. Special characters aren't the problem, as they're not present; the issue is the structure of the names. Here are some examples of the names giving me trouble:
- Anh Quoc Hoang
- Anh Tuan Dao
- Anh Tuyet Thi
I'm currently using HumanName (from nameparser import HumanName) to help, but it does a terrible job with the Vietnamese names. I don't know the culture well enough to know whether they are correct, and there are too many edge cases, so my current script just ignores the name altogether.
Can you offer any guidance on how to better handle Vietnamese names? Here is my current script:
import glob
import pandas as pd
import numpy as np
import os
from nameparser import HumanName
extended_last_names = [
    "Nguyen", "Tran", "Le", "Pham", "Hoang", "Huynh", "Phan", "Vu", "Vo",
    "Dang", "Bui", "Do", "Ho", "Ngo", "Duong", "Ly", "Luu", "Quach", "Trinh",
    "Phung", "Dinh", "Doan", "Dai", "Giang", "Ha", "Han", "Kieu", "Lai",
    "Lam", "Luong", "Mai", "Nghiem", "Phi", "Quang", "Quoc", "Ta", "Thach",
    "Thai", "Thao", "Thieu", "Tho", "Tong", "Trang", "Trieu", "Truong",
    "Tuan", "Van", "Vinh", "Vuong", "Xuan", "Thanh", "Tung", "Quyen", "Chau",
    "Kha", "Khanh", "Minh", "Nam", "Quan", "Hieu", "Hai", "Hien", "Hung",
    "Huong", "Khang", "Khoi", "Linh", "Nhat", "Quynh", "Son", "Thuy", "Tien",
    "Anh", "Bach", "Bang", "Binh", "Chien", "Cong", "Cuong", "Duy", "Gia",
    "Hao", "Kiet", "Loc", "Long", "Nhan", "Phuc", "Sang", "Tam", "Thang",
    "Thien", "Toan", "Trung", "Tuyet", "Vien", "Yen", "Wang", "Li", "Zhang",
    "Liu", "Chen", "Yang", "Huang", "Zhao", "Wu", "Zhou", "Xu", "Sun", "Ma",
    "Zhu", "Hu", "Guo", "He", "Gao", "Lin", "Luo"
]
# Function to parse a name
def parse_name(first_name, last_name):
    try:
        # If the last name is in the extended list, ignore the first name
        if last_name in extended_last_names:
            return np.nan, np.nan, last_name
        # Parse the first name using HumanName
        parsed = HumanName(first_name)
        # Return the parsed names
        return parsed.first, parsed.middle, parsed.last
    except Exception as e:
        print(f"Error parsing name '{first_name} {last_name}': {e}")
        return np.nan, np.nan, np.nan
# Any file ending in csv
files = glob.glob('*.csv')
for file in files:
    # Skip files with '_clean' in the name
    if '_clean' in file:
        continue
    df = pd.read_csv(file, encoding='ISO-8859-1', low_memory=False)
    # Apply the function and create new columns for first, middle, and last names
    df[['CnBio_First_Name', 'CnBio_Middle_Name', 'CnBio_Last_Name']] = df.apply(
        lambda x: pd.Series(parse_name(x['CnBio_First_Name'], x['CnBio_Last_Name'])), axis=1)
    df[['CnSpSpBio_First_Name', 'CnSpSpBio_Middle_Name', 'CnSpSpBio_Last_Name']] = df.apply(
        lambda x: pd.Series(parse_name(x['CnSpSpBio_First_Name'], x['CnSpSpBio_Last_Name'])), axis=1)
    base, ext = os.path.splitext(file)
    # Insert '_clean' before the extension
    new_file = base + '_clean' + ext
    # Save the DataFrame to the new file name in the current working directory
    df.to_csv(new_file, index=False, encoding='ISO-8859-1')
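For reference, here is a minimal sketch of one possible alternative to skipping these rows. It rests on assumptions: Vietnamese names are conventionally written family name first, then middle name(s), then given name, so in Western-formatted data the family name may end up at either end of the string. The helper `split_by_family_name` and the small `FAMILY_NAMES` set are hypothetical and purely illustrative (they are not part of nameparser, and the set would need review by someone who knows the names). Anything ambiguous is flagged for manual review instead of being guessed.

```python
# Hypothetical helper, not part of nameparser: look for a known family name
# at either end of a whitespace-split name and flag ambiguous cases.

# Illustrative subset only (an assumption); a real list needs expert review.
FAMILY_NAMES = {"nguyen", "tran", "le", "pham", "hoang", "dao"}


def split_by_family_name(full_name, family_names=FAMILY_NAMES):
    """Return a dict with the guessed family name, the remaining tokens,
    and a flag that says whether the row needs manual review.

    Assumes full_name is a string (NaN values must be filtered out first).
    """
    tokens = full_name.split()
    if len(tokens) < 2:
        # A single token cannot be split into family and given parts.
        return {"family": None, "other": tokens, "needs_review": True}

    first_hit = tokens[0].lower() in family_names
    last_hit = tokens[-1].lower() in family_names

    if first_hit == last_hit:
        # No known family name at either end, or one at both ends: ambiguous.
        return {"family": None, "other": tokens, "needs_review": True}
    if first_hit:
        # Family name first (Vietnamese order).
        return {"family": tokens[0], "other": tokens[1:], "needs_review": False}
    # Family name last (Western order).
    return {"family": tokens[-1], "other": tokens[:-1], "needs_review": False}


for name in ["Anh Quoc Hoang", "Anh Tuan Dao", "Anh Tuyet Thi"]:
    print(name, "->", split_by_family_name(name))
```

The sketch deliberately does not try to decide which of the remaining tokens is the middle name and which is the given name, since that depends on the original ordering of the data.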