Data wrangling Vietnamese names with Python


I'm trying to wrangle a messy database that has a plethora of middle names and initials in the first-name field. I'm having trouble with about 200k Vietnamese names. Special characters are not the problem, as none are present; the issue is the structure of the names. Here are some examples of the names giving me trouble:

  • Anh Quoc Hoang
  • Anh Tuan Dao
  • Anh Tuyet Thi

I'm currently using HumanName from the nameparser package (from nameparser import HumanName) to assist, but it does a terrible job with the Vietnamese names. I don't know the culture well enough to tell whether its output is correct, and there are too many edge cases, so my current script just ignores these names altogether.

Can you offer any guidance on how to better handle Vietnamese names? Here is my current script:

import glob
import pandas as pd
import numpy as np
import os
from nameparser import HumanName

extended_last_names = [
    "Nguyen", "Tran", "Le", "Pham", "Hoang", "Huynh", "Phan", "Vu", "Vo",
    "Dang", "Bui", "Do", "Ho", "Ngo", "Duong", "Ly", "Luu", "Quach", "Trinh",
    "Phung", "Dinh", "Doan", "Dai", "Giang", "Ha", "Han", "Kieu", "Lai",
    "Lam", "Luong", "Mai", "Nghiem", "Phi", "Quang", "Quoc", "Ta", "Thach",
    "Thai", "Thao", "Thieu", "Tho", "Tong", "Trang", "Trieu", "Truong",
    "Tuan", "Van", "Vinh", "Vuong", "Xuan", "Thanh", "Tung", "Quyen", "Chau",
    "Kha", "Khanh", "Minh", "Nam", "Quan", "Hieu", "Hai", "Hien", "Hung",
    "Huong", "Khang", "Khoi", "Linh", "Nhat", "Quynh", "Son", "Thuy", "Tien",
    "Anh", "Bach", "Bang", "Binh", "Chien", "Cong", "Cuong", "Duy", "Gia",
    "Hao", "Kiet", "Loc", "Long", "Nhan", "Phuc", "Sang", "Tam", "Thang",
    "Thien", "Toan", "Trung", "Tuyet", "Vien", "Yen", "Wang", "Li", "Zhang",
    "Liu", "Chen", "Yang", "Huang", "Zhao", "Wu", "Zhou", "Xu", "Sun", "Ma",
    "Zhu", "Hu", "Guo", "He", "Gao", "Lin", "Luo"
]

# Function to parse a name
def parse_name(first_name, last_name):
    try:
        # If the last name is in the extended list, ignore the first name
        if last_name in extended_last_names:
            return np.nan, np.nan, last_name
        
        # Parse the first name using HumanName
        parsed = HumanName(first_name)
        # Return the parsed names
        return parsed.first, parsed.middle, parsed.last
    except Exception as e:
        print(f"Error parsing name '{first_name} {last_name}': {e}")
        return np.nan, np.nan, np.nan

# Any file ending in csv
files = glob.glob('*.csv')
for file in files:
    # Skip files with '_clean' in the name
    if '_clean' in file:
        continue

    df = pd.read_csv(file, encoding='ISO-8859-1', low_memory=False)

    # Apply the function and create new columns for first, middle, and last names
    df[['CnBio_First_Name', 'CnBio_Middle_Name', 'CnBio_Last_Name']] = df.apply(
        lambda x: pd.Series(parse_name(x['CnBio_First_Name'], x['CnBio_Last_Name'])), axis=1)
    df[['CnSpSpBio_First_Name', 'CnSpSpBio_Middle_Name', 'CnSpSpBio_Last_Name']] = df.apply(
        lambda x: pd.Series(parse_name(x['CnSpSpBio_First_Name'], x['CnSpSpBio_Last_Name'])), axis=1)

    base, ext = os.path.splitext(file)

    # Insert '_clean' before the extension
    new_file = base + '_clean' + ext
    # Save the DataFrame to the new file name in the current working directory
    df.to_csv(new_file, index=False, encoding='ISO-8859-1')
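
For comparison, here is a rough sketch of a token-based split that does not go through HumanName. It rests on assumptions the data may not satisfy: that the names are stored in Western order (given name first, family name last), as the three examples above seem to be, and that the final token can be checked against a reviewed list of family names. `split_name_field` and `KNOWN_FAMILY_NAMES` are hypothetical names made up for this illustration, and the set below is only a tiny sample, not a vetted list.

```python
# Hypothetical helper, not part of nameparser or pandas.
# Assumption: names are in Western order (given name first, family name last).

# Assumption: illustrative subset only; a real list needs review by someone
# familiar with Vietnamese naming conventions. Stored lowercase in a frozenset
# so lookups are case-insensitive and fast across ~200k rows.
KNOWN_FAMILY_NAMES = frozenset({"nguyen", "tran", "le", "pham", "hoang", "dao"})


def _clean(value):
    """Return a stripped string, or '' for NaN/None/non-string cells."""
    return value.strip() if isinstance(value, str) else ""


def split_name_field(first_field, last_field=None):
    """Split a first-name field into (first, middle, last).

    If no last name is supplied and the final token of the first-name field
    is a known family name, that token is moved to the last-name slot.
    Missing parts are returned as None.
    """
    tokens = _clean(first_field).split()
    last = _clean(last_field)

    if not tokens:
        return None, None, last or None

    # Only peel off the trailing token when the last-name column is empty.
    if not last and len(tokens) > 1 and tokens[-1].lower() in KNOWN_FAMILY_NAMES:
        last = tokens.pop()

    first = tokens[0]
    middle = " ".join(tokens[1:]) or None
    return first, middle, last or None


if __name__ == "__main__":
    for name in ["Anh Quoc Hoang", "Anh Tuan Dao", "Anh Tuyet Thi"]:
        print(name, "->", split_name_field(name))
```

Under these assumptions, "Anh Tuyet Thi" keeps "Tuyet Thi" as the middle part and gets no last name, because "Thi" is not in the sample set. Whether that is the right outcome is exactly the kind of cultural judgment call I can't make myself.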
