Conditional Concatenation for Exchange Contact Imports

54 views Asked by At

*CSV contains fake information

Python on Kaggle w/ Pandas + Numpy

Context: Goal is to import/update Exchange contacts. Users want to search for parent email by their student(s).

Input (see CSV screenshot):

 - 'Student ID' column (7-digit number)
 - 'Parent ID' column (7-digit number)
 - 'Student First Name' column (str)
 - 'Student Last Name' column (str)
 - 'Parent First Name' column (str)
 - 'Parent Last Name' column (str)

Desired Output (see CSV screenshot):

 - 'Contacts' column
 *'Contacts' column follows this pattern:
      - One Student: "studentLastName, studentFirstName (parentLastName, parentFirstName)"
      - Multiple Students: "studentLastName, student1FirstName, student2FirstName, & student3FirstName (parentLastName,> parentFirstName)"

Input CSV

Assumptions:

  • use student/parent IDs to evaluate conditions
    • Many students/guardians share last names and are not related
    • Many students have >2 guardians
    • Many students do not share last names with one or more guardians
    • Some students are siblings with different legal last names

#1. What I tried: Nested for loop with an if statement

for i in range(14):
    for j in range(14):
        if contacts.iloc[i]['parentID'] == contacts.iloc[j]['parentID'] and i!=j:
            hold = [contacts.iloc[i]['studentFirstName'], contacts.iloc[j]['studentFirstName']]
    print(hold)

#What I was expecting: list of tuples of student first names of siblings where the corresponding parentIDs match in different rows

#Example Output:

[[Uganda, Tiramisu], [Uganda, Tiramisu], [Ethernet, Persimmon], [Ethernet, Niagara], [Persimmon, Niagara]]

#I am new to DataFrames. I want to treat everything like lists.

#2.What I tried: Nested for loop with numpy.where():

for i in range(14):
    for j in range(14):
        print(i, j)
        print(contacts.iloc[i]['parentID'], contacts.iloc[j]['parentID'])
        np.where(contacts.iloc[i]['parentID'] == contacts.iloc[j]['parentID'],
                    print(contacts.iloc[i]['studentFirstName'] + ' and ' + contacts.iloc[j]['studentFirstName']), print('1'))

#What I was expecting: student1FirstName and student2FirstName where parentIDs match, else 1

#I don't understand broadcasting, and feel like that and/or printing results is breaking this.

#Considering alternatives: pd.agg(), pd.groupby(), lambda functions, list comprehension

#I am a beginner to Pandas/Numpy/Python. When practicing I think more easily with for loops, then try to convert them to list comprehension. It seems there are many ways to solve this problem.

1

There are 1 answers

0
Max888 On

This is my understanding of what you are asking for:

import pandas as pd

data = {
    'Student ID': [1, 1, 4, 4, 4, 5, 5, 5, 9, 9, 10, 10, 22, 22],
    'Parent ID':  [2, 3, 6, 7, 8, 6, 7, 8, 3, 4, 3,  4,  3, 4],
    'Student First Name': ['Native', 'Native', 'Uganda', 'Uganda','Uganda','Tiramisu', 'Tiramisu', 'Tiramisu', 'Ethernet', 'Ethernet', 'Persimon', 'Persimon', 'Niagra', 'Niagra'],
    'Student Last Name': ['Americans', 'Americans', 'Socrates', 'Socrates','Socrates','Socrates', 'Socrates', 'Socrates', 'Regression', 'Regression', 'Regression', 'Regression', 'Regression', 'Regression'],
    'Parent First Name': ['Femur', 'Gigantic', 'USBC', 'Lei','Expo','USBC', 'Lei', 'Expo', 'Monitor', 'Isopropyl', 'Monitor', 'Isopropyl', 'Monitor', 'Isopropyl'],
    'Parent Last Name': ['Americans', 'Peanut', 'Socrates', 'Socrates','Socrates','Socrates', 'Socrates', 'Socrates', 'Chocolate Chips', 'Bits n Bits', 'Chocolate Chips', 'Bits n Bits', 'Chocolate Chips', 'Bits n Bits'],
    'parent_email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

# Group parent emails by students
grouped_emails = df.groupby(['Student ID', 'Student First Name', 'Student Last Name'])['parent_email'].apply(list).reset_index()
grouped_emails.to_excel("grouped_emails.xlsx", index=False)

# Display grouped emails
print(grouped_emails)

enter image description here