Import a dirty CSV file with unwanted characters and strings


I would like to import CSV files with pandas. Normally my data is given in this form:

a,b,c,d
a1,b1,c1,d1
a2,b2,c2,d2

where a,b,c,d is the header. I can easily use pandas.read_csv here. However, now I have data stored like this:

"a;b;c;d"
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""

How can I clean this up in the most efficient way? How can I remove the quotes wrapped around each entire row so that the columns can be detected? And how do I then remove all the remaining \" sequences?

Thanks a lot for any help!!

I am not sure what to do.


There are 3 answers

Timeless On BEST ANSWER

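Here is one option with read_csv (and I'm sure it can be improved):

import pandas as pd

df = (
    # With a regex separator the Python engine is used and the quote characters stay in the fields
    pd.read_csv("input.csv", sep=r";|;\\?", engine="python")
    # Strip the leftover outer quotes from the first and last header names
    .pipe(lambda df_: df_.set_axis(df_.columns.str.strip('"'), axis=1))
    # Drop every remaining backslash and double quote from the values
    .replace(r'[\\"]', "", regex=True)
)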

Output:

print(df)

    a   b   c   d
0  a1  b1  c1  d1
1  a2  b2  c2  d2
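
To see why the pipeline above needs the extra cleanup steps, here is a small sketch that imitates the split on plain strings with the standard re module. It is only an approximation of what read_csv does with a regex separator, and the helper name split_and_clean is made up for illustration. Unlike the answer, it applies the same character removal to the header and to the data rows.

```python
import re

# Same separator pattern as in the answer above
SEP = re.compile(r";|;\\?")

# Two lines exactly as they appear in the dirty file
raw_lines = [
    '"a;b;c;d"',
    r'"a1;\"b1\";\"c1\";\"d1\""',
]

def split_and_clean(line):
    """Hypothetical helper: split on the separator, then drop \\ and " from each field."""
    fields = SEP.split(line.strip())
    return [re.sub(r'[\\"]', "", field) for field in fields]

header = split_and_clean(raw_lines[0])
row = split_and_clean(raw_lines[1])
print(header)
print(row)
```

Before the cleanup, the first header field is still "a with its leading quote, which is why the answer strips quotes from the column names separately.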
crabpeople On

You can use sed to break the file down into your chosen format.

For a simple example matching your issue using sed:

$ cat file 
"a1a1;"a1a1";"a1a1";"a1a1""
$ cat file | sed 's/"//g'
a1a1;a1a1;a1a1;a1a1

The command sed 's/"//g' replaces every " character with nothing. The g at the end tells sed to do this for every " on a line and not just the first one found.
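
If you would rather stay in Python, the effect of the g flag can be sketched with re.sub. This is only an analogy to the sed command, not a drop-in replacement: re.sub replaces all matches by default, and count=1 mimics sed without the g.

```python
import re

line = '"a1a1;"a1a1";"a1a1";"a1a1""'

# Like sed 's/"//g': remove every double quote on the line
all_removed = re.sub('"', "", line)

# Like sed 's/"//' (no g): remove only the first double quote
first_removed = re.sub('"', "", line, count=1)

print(all_removed)
print(first_removed)
```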

I see you edited the question; here is an update for the new text:

$ cat file
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""
$ cat file | sed 's/"//g' | sed 's|\\||g' 
a1;b1;c1;d1
a2;b2;c2;d2
Luuk On

When you need/want to do it in Python:

Just removing the leading and ending quotes:



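# Copy abcd.csv to abcd-new.csv without the quotes that wrap each line
with open('abcd.csv', "r") as file1, open('abcd-new.csv', "w") as file2:
    for line in file1:
        # Lines read from a file keep their trailing newline, so strip it before testing
        line = line.rstrip("\n")
        if len(line) >= 2 and line.startswith("\"") and line.endswith("\""):
            line = line[1:-1]
        print(line)
        file2.write(line + "\n")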

and when you also need to replace the \":



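# Same as above, but also remove every remaining quote and backslash
with open('abcd.csv', "r") as file1, open('abcd-new.csv', "w") as file2:
    for line in file1:
        # Lines read from a file keep their trailing newline, so strip it before testing
        line = line.rstrip("\n")
        if len(line) >= 2 and line.startswith("\"") and line.endswith("\""):
            line = line[1:-1]
        line = line.replace("\"", "")
        line = line.replace("\\", "")
        print(line)
        file2.write(line + "\n")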
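
If you do not want a second file on disk, the same cleanup can be done in memory. The following is a minimal sketch under a few assumptions: the sample text is copied from the question, clean_line is a hypothetical helper name, and the standard csv module stands in for the final parsing step.

```python
import csv
import io

def clean_line(line):
    """Hypothetical helper: strip the wrapping quotes, then drop " and \\ characters."""
    line = line.rstrip("\r\n")
    if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
        line = line[1:-1]
    return line.replace('"', "").replace("\\", "")

# Sample content as shown in the question
raw = (
    '"a;b;c;d"\n'
    '"a1;\\"b1\\";\\"c1\\";\\"d1\\""\n'
    '"a2;\\"b2\\";\\"c2\\";\\"d2\\""\n'
)

# Clean every line and keep the result as one in-memory text block
cleaned = "\n".join(clean_line(line) for line in raw.splitlines())

# Parse the cleaned text with ; as the delimiter
rows = list(csv.reader(io.StringIO(cleaned), delimiter=";"))
print(rows)
```

Since read_csv accepts file-like objects, the cleaned text could also be handed to pandas with pd.read_csv(io.StringIO(cleaned), sep=";").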