SQLAlchemy - PostgreSQL - UnicodeDecodeError

I am trying to load data from a CSV file into a PostgreSQL database using the Python library SQLAlchemy, but I always receive the error 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 79: invalid start byte' when executing

df.drop_duplicates().to_sql('main', engine, if_exists='replace', index=False)

(full code below).

What I did so far:

  • changed client_encoding = utf8 in postgresql.conf
  • tried with and without ?charset=utf8 after the database URI, as suggested here (see also the connection-level sketch after the Python code below)
  • added encoding_errors="replace" to read_csv(…) (thanks for the comment!)
  • loaded test data not from the CSV, but created it directly in pandas:
    data = {
        'PROJECT_NUMBER': ['R01HL093399', 'R01HL093399', 'R01HL093399', 'R01HL093399', 'R01HL095924'],
        'PRODUCT_ID': [8840150, 10022225, 9283301, 10293082, 9556432],
        'CUSTOMERID': [3132660, 3132660, 3132660, 3132660, 4121125],
        'UNIQUE_ID': ['R01HL093399_8840150_3132660', 'R01HL093399_10022225_3132660', 'R01HL093399_9283301_3132660', 'R01HL093399_10293082_3132660', 'R01HL095924_9556432_4121125']
    }
    
    df = pd.DataFrame(data)
  • used chardet to 'guess' the encoding of the CSV (it reports utf-8)
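
For reference, the chardet check in the last bullet was roughly this sketch (same placeholder path as in the full code below):

import chardet

with open('../../Data/customer_data.csv', 'rb') as f:
    raw = f.read()

# chardet inspects the raw bytes and reports its best guess plus a confidence value
print(chardet.detect(raw))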

PostgreSQL db/table creation:

CREATE DATABASE Post_db
    WITH 
    OWNER = admin
    ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.UTF-8'
    LC_CTYPE = 'en_US.UTF-8'
    TEMPLATE = template0;

CREATE TABLE main (
    PROJECT_NUMBER VARCHAR(255),
    PRODUCT_ID INT,
    CUSTOMERID INT,
    UNIQUE_ID VARCHAR(255) PRIMARY KEY,
    UNIQUE (PRODUCT_ID),
    UNIQUE (CUSTOMERID)
);

Python code:

import pandas as pd
import os
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

DATA_PATH = '../../Data/'
CSV_FILE = 'customer_data.csv'

url = URL.create(
    drivername="postgresql",
    username="admin",
    host="localhost",
    database="Post_db",
    password="PASSWORT"
)

engine = create_engine(url)
df = pd.read_csv(os.path.join(DATA_PATH, CSV_FILE), encoding='utf8')
df.drop_duplicates().to_sql('main', engine, if_exists='replace', index=False)
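
For reference, the connection-level encoding attempts mentioned in the list above looked roughly like the sketch below (psycopg2 is assumed as the driver, and client_encoding is used instead of charset, which I believe is the option name the PostgreSQL driver actually understands):

from sqlalchemy import create_engine
from sqlalchemy.engine import URL

url = URL.create(
    drivername="postgresql+psycopg2",
    username="admin",
    password="PASSWORT",
    host="localhost",
    database="Post_db"
)

# Variant 1: let the SQLAlchemy psycopg2 dialect set the client encoding
engine = create_engine(url, client_encoding="utf8")

# Variant 2: pass the setting straight through to the DBAPI connection
engine = create_engine(url, connect_args={"client_encoding": "utf8"})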

Example of the CSV file (as read into pandas):

  PROJECT_NUMBER  PRODUCT_ID  CUSTOMERID                     UNIQUE_ID
0    R01HL093399    8840150  3132660   R01HL093399_8840150_3132660
1    R01HL093399   10022225  3132660  R01HL093399_10022225_3132660

Copy-paste from the CSV file:

,PROJECT_NUMBER,PRODUCT_ID,CUSTOMERID,UNIQUE_ID
0,R01HL093399,8840150,3132660,R01HL093399_8840150_3132660
1,R01HL093399,10022225,3132660,R01HL093399_10022225_3132660
2,R01HL093399,9283301,3132660,R01HL093399_9283301_3132660
3,R01HL093399,10293082,3132660,R01HL093399_10293082_3132660
4,R01HL095924,9556432,4121125,R01HL095924_9556432_4121125

Even if I run the to_sql command on only a subset of the rows, I still get exactly the same error as before.
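
To rule out the CSV file itself, the raw bytes can also be decoded directly; the sketch below (same placeholder path as above) prints the bytes around the first position that fails to decode, if any:

import os

DATA_PATH = '../../Data/'
CSV_FILE = 'customer_data.csv'

with open(os.path.join(DATA_PATH, CSV_FILE), 'rb') as f:
    raw = f.read()

try:
    raw.decode('utf-8')
    print('CSV decodes cleanly as UTF-8')
except UnicodeDecodeError as exc:
    # exc.start and exc.end mark the offending byte range reported by the codec
    print(exc)
    print(raw[max(0, exc.start - 20):exc.end + 20])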
