I have written this quick test script to compare the performance of bcp (via the bcpandas package) with that of the regular pandas df.to_sql. See the script below:
import time
import pandas as pd
import os
import numpy as np
from dotenv import load_dotenv
from db import create_con
from bcpandas import SqlCreds, to_sql
load_dotenv(override=True)
creds = SqlCreds(
    os.environ['SQL_SERVER'],
    os.environ['SQL_DB'],
    os.environ['SQL_USER'],
    os.environ['SQL_PW']
)
# np.ndarray allocates uninitialized memory here, so the column values are arbitrary ints
df = pd.DataFrame(
    data=np.ndarray(shape=(100000, 6), dtype=int),
    columns=[f"col_{x}" for x in range(6)]
)
# Using BCP
start = time.time()
print('Starting...')
to_sql(df, 'bcp_test_src', creds, index=False, if_exists='replace', schema='test', batch_size=10000)
print('Ending...')
end = time.time()
elapsed = end - start
print(f'Elapsed time: {elapsed} seconds')  # 15.5 seconds
# Using Pandas to_sql
start = time.time()
print('Starting...')
con = create_con()
df.to_sql('bcp_test_src', con=con, index=False, if_exists='replace', schema='test')
print('Ending...')
end = time.time()
elapsed = end - start
print(f'Elapsed time: {elapsed} seconds')  # 17.2 seconds
However, the difference between BCP and pandas is very small: BCP takes 15.5 seconds and df.to_sql takes 17.2 seconds. From what I have read about BCP, I was expecting it to take at most a few seconds to load all of this. What am I doing wrong? The database I am writing to is an Azure SQL Database on the Standard S0 pricing tier (10 DTUs). Could that have anything to do with it?
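
In case it helps diagnose this, my plan is to check the database's resource usage right after the load with something like the sketch below (assuming sys.dm_db_resource_stats is the right DMV for an Azure SQL Database, and that my create_con() helper returns a SQLAlchemy connectable, as it must for df.to_sql above):

# Rough sketch: read Azure SQL's ~15-second resource snapshots to see whether
# CPU, data IO, or log write percentages were near 100% during the load,
# which would point at the S0 DTU limit rather than at the insert method.
import pandas as pd
from db import create_con  # same helper as in the test script

con = create_con()
stats = pd.read_sql(
    """
    SELECT TOP (10)
        end_time,
        avg_cpu_percent,
        avg_data_io_percent,
        avg_log_write_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC
    """,
    con,
)
print(stats)

If avg_log_write_percent sits at 100% while either insert runs, I assume that would mean the S0 tier is the bottleneck rather than the loading method, but I have not verified that interpretation.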