Extract strings from dataframe object column with .split() and .strip()

200 views Asked by At

I have a dataframe 'merged_tstbr' with over 17000 rows and a column 'ID'. In this column multiple entries appear as one long string, separated by ';' and \n.

Here an example of what a cell in this column looks like:

XDX_AbCd_XY_2415;\
XDX_AbCd_XY_3335;\
XDX_AbCd_XY_3351;\
XDX_AbCd_XY_3354

I have used the below script to check for all rows in 'ID' column and add to a new list 'tstbr_anf' if the ID is in my_list:

my_list = ['XDX_AbCd_XY_3351', 'XDX_AbCd_XY_3335'] # (and so on, 250 elements in total)
for index, row in merged_tstbr.iterrows():
    row_values = [value.strip() for value in row['ID'].split('\n')]       
    matching = [value for value in row_values if value in my_list]
    tstbr_anf.extend(matching)

Out of 250 elements, 6 are not showing in txtbr_anf.

I checked the presence of these in the column 'ID' in excel, and they are here, but somehow not getting added to txtbr_anf. They are present in my_list as well.

What can be the possible reasons for this.

Follow-up from How to search in a column of a dataframe for any and all value in a column another dataframe in Python using Pandas

2

There are 2 answers

0
OCa On

Input data I've used to try and reproduce your issue

  • It is based on the content of a cell that you posted,
  • I have added a second row with simply modified digits.
my_list = ['XDX_AbCd_XY_3351', 'XDX_AbCd_XY_3335']

merged_tstbr = pd.DataFrame({'ID': ['XDX_AbCd_XY_2415;\nXDX_AbCd_XY_3335;\nXDX_AbCd_XY_3351;\nXDX_AbCd_XY_3354',
                                    'XDX_AbCd_XY_1234;\nXDX_AbCd_XY_4565;\nXDX_AbCd_XY_3579;\nXDX_AbCd_XY_7592']})

Each cell in your input dataframe contains a number of IDs together in a string:

                                                  ID
0  XDX_AbCd_XY_2415;\nXDX_AbCd_XY_3335;\nXDX_AbCd...
1  XDX_AbCd_XY_1234;\nXDX_AbCd_XY_4565;\nXDX_AbCd...

Please confirm this matches your input, otherwise provide a dataframe constructor.


Problem:

You seem to believe that .strip() removes semicolons, but it does not, unless you force it to. You could verify that with a print (commented out in the code below): the row_values you were testing membership for, still have their semicolon about them:

['XDX_AbCd_XY_2415;', 'XDX_AbCd_XY_3335;', 'XDX_AbCd_XY_3351;', 'XDX_AbCd_XY_3354']

Quick fix:

  1. Either add it in the .split(), then .strip() is not needed:
for index, row in merged_tstbr.iterrows():
    row_values = [value for value in row['ID'].split(';\n')]       
    #print(row_values)
    matching = [value for value in row_values if value in my_list]
    tstbr_anf.extend(matching)
  1. Or force the .strip() to care:
    row_values = [value.strip(';') for value in row['ID'].split('\n')]

Both will output tstbr_anf as:

['XDX_AbCd_XY_3335', 'XDX_AbCd_XY_3351']

Now why this problem only arises with 6 IDs, we can only know when you provide larger input data. It could be that semicolons are present in only few cells.

If your data looks anything like in your previous question How to search in a column of a dataframe for any and all value in a column another dataframe in Python using Pandas, then indeed not all IDs have semicolons but it could also be other undesired characters.


Alternative suggestion

The two-step process below will extract those IDs and retain, in a list, those only present in my_list. You can then directly assign to tstbr_anf.

  1. Clean up IDs
    • generates a list of lists, one sublist by cell from that dataframe column)
    • acounts for semicolons possibly not being present in all cases
L = merged_tstbr['ID'].str.split(pat = ';\n|\n', regex=True)

0    [XDX_AbCd_XY_2415, XDX_AbCd_XY_3335, XDX_AbCd_...
1    [XDX_AbCd_XY_1234, XDX_AbCd_XY_4565, XDX_AbCd_...
  1. Next step does two things at once:
    • flatten the "list of sublists" into a simple list
    • and filter according to my_list
my_set = set(my_list) # sets allow faster membership search

tstbr_anf = [item 
             for sublist in L
             for item in sublist 
             if item in my_set]

Final output:

['XDX_AbCd_XY_3335', 'XDX_AbCd_XY_3351']

As you can see, we've gotten rid of the intermediate matching list, which was the temporary product of each row iteration in your code.

References

9
mozway On

Assuming you want to find the global intersection of your list and all items in the Series, I would use a set intersection and a regex split:

my_list = ['XDX_AbCd_XY_3351', 'XDX_AbCd_XY_3335']
my_set = set(my_list)

out = my_set.intersection(merged_tstbr['ID'].str.split('[;\n]+').explode())

# or
out = my_set.intersection(set().union(*merged_tstbr['ID'].str.split('[;\n]+')))

Example output: {'XDX_AbCd_XY_3335', 'XDX_AbCd_XY_3351'}

If you want the intersection per row:

import re

my_set = set(my_list)
out = [my_set.intersection(re.split('[;\n]+', s)) for s in merged_tstbr['ID']]

For a new column:

merged_tstbr['new'] = [my_set.intersection(re.split('[;\n]+', s))
                       for s in merged_tstbr['ID']]

Example output:

                                                                          ID                                   new
0  XDX_AbCd_XY_2415;\nXDX_AbCd_XY_3335;\nXDX_AbCd_XY_3351;\nXDX_AbCd_XY_3354  {XDX_AbCd_XY_3335, XDX_AbCd_XY_3351}
1                     XDX_AbCd_XY_2416;\nXDX_AbCd_XY_3336;\nXDX_AbCd_XY_3351                    {XDX_AbCd_XY_3351}

Used input:

merged_tstbr = pd.DataFrame({'ID': ['XDX_AbCd_XY_2415;\nXDX_AbCd_XY_3335;\nXDX_AbCd_XY_3351;\nXDX_AbCd_XY_3354',
                                    'XDX_AbCd_XY_2416;\nXDX_AbCd_XY_3336;\nXDX_AbCd_XY_3351']})