Fast data loading from Pandas Dataframe to Azure SQL Database

Writing pandas dataframes to an Azure SQL database requires either a cursor or the SQLAlchemy library (specifically, the to_sql function). However, these methods are extremely slow. We have written about to_sql in this article: Spark Dataframe performance benefits.
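For reference, a minimal sketch of the to_sql approach looks like the following; the connection details and table name are placeholders, not values from this article:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder Azure SQL connection string for the mssql+pyodbc dialect
engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<server>.database.windows.net/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})  # small example frame
# Without fast_executemany, each row results in a separate round trip
# to the server, which is why this is slow for large frames
df.to_sql("my_table", engine, if_exists="append", index=False)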

Not everyone has access to a Spark cluster, though. So isn't there a way to write pandas dataframes efficiently to a SQL database? It turns out there is: by using a cursor with fast_executemany.

To fully appreciate fast_executemany, let's first look at the code without it. In this example, we will use the California housing dataset from scikit-learn and load it as-is into an Azure SQL DB from a Python pandas dataframe.

0. Create a destination table in Azure SQL DB

Before writing any Python code, create a destination table in Azure SQL Database. Following is the structure of the table:

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[tbl_california_housing](
    [MedInc] [nvarchar](50) NULL,
    [HouseAge] [nvarchar](50) NULL,
    [AveRooms] [nvarchar](50) NULL,
    [AveBedrms] [nvarchar](50) NULL,
    [Population] [nvarchar](50) NULL,
    [AveOccup] [nvarchar](50) NULL,
    [Latitude] [nvarchar](50) NULL,
    [Longitude] [nvarchar](50) NULL,
    [target] [nvarchar](50) NULL
) ON [PRIMARY]
GO

1. Load Data into a Pandas Dataframe

Load the California Housing Dataset into a Pandas Dataframe:

import pandas as pd
from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing()
pd_df_california_housing = pd.DataFrame(california_housing.data,
                                        columns=california_housing.feature_names)
pd_df_california_housing['target'] = pd.Series(california_housing.target)
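A quick sanity check confirms the dataframe loaded as expected (8 feature columns plus the target, 20640 rows):

# Verify the frame before loading it into the database
print(pd_df_california_housing.shape)   # (20640, 9)
print(pd_df_california_housing.head())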

2. Create an ODBC connection using pyodbc

Programming languages connect to SQL databases using the ODBC standard. In Python, this is implemented by the pyodbc library. Here is the code to create an ODBC connection:

import pyodbc

conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                      server='<mydatabaseserver.database.windows.net>',
                      database='<mydatabase>',
                      uid='<my database user>',
                      pwd='<my database password>')
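If you prefer a single connection string, pyodbc accepts the equivalent form (same placeholder values as above):

# Equivalent single connection string; placeholders as in the keyword form above
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<mydatabaseserver.database.windows.net>;"
    "DATABASE=<mydatabase>;"
    "UID=<my database user>;"
    "PWD=<my database password>"
)
conn = pyodbc.connect(conn_str)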

3. Load data into Azure SQL Database without fast execution

Next, here is the code to write the data to Azure SQL Database without fast execution.

import time

# Form the INSERT statement
insert_to_tmp_tbl_stmt = (
    "INSERT INTO tbl_california_housing "
    "([MedInc],[HouseAge],[AveRooms],[AveBedrms],[Population],"
    "[AveOccup],[Latitude],[Longitude],[target]) "
    "VALUES (?,?,?,?,?,?,?,?,?)"
)

cursor = conn.cursor()  # Create a cursor

start = time.time()  # Note start time
cursor.executemany(insert_to_tmp_tbl_stmt,
                   pd_df_california_housing.values.tolist())  # Load data into Azure SQL DB
end = time.time()    # Note end time

print(f'{len(pd_df_california_housing)} rows inserted into tbl_california_housing table')
print(f'{(end - start)/60} minutes elapsed')

cursor.commit()  # Commit the transaction
cursor.close()   # Close the cursor
conn.close()     # Close the connection

The following message was displayed:

20640 rows inserted into tbl_california_housing table
8.287326637903849 minutes elapsed

4. Load data into Azure SQL Database ‘with’ fast execution

Now, the fundamental difference between this snippet and the previous section is one line of code, i.e. fast_executemany:

import time

# Re-open the connection from step 2 before running this snippet,
# since it was closed at the end of the previous section.

# Form the INSERT statement
insert_to_tmp_tbl_stmt = (
    "INSERT INTO tbl_california_housing "
    "([MedInc],[HouseAge],[AveRooms],[AveBedrms],[Population],"
    "[AveOccup],[Latitude],[Longitude],[target]) "
    "VALUES (?,?,?,?,?,?,?,?,?)"
)

cursor = conn.cursor()  # Create a cursor

cursor.fast_executemany = True  # Set fast_executemany = True

start = time.time()  # Note start time
cursor.executemany(insert_to_tmp_tbl_stmt,
                   pd_df_california_housing.values.tolist())  # Load data into Azure SQL DB
end = time.time()    # Note end time

print(f'{len(pd_df_california_housing)} rows inserted into tbl_california_housing table')
print(f'{(end - start)/60} minutes elapsed')

cursor.commit()  # Commit the transaction
cursor.close()   # Close the cursor
conn.close()     # Close the connection

The following message was displayed:

20640 rows inserted into tbl_california_housing table
0.06369909445444742 minutes elapsed
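As a side note, if you would rather keep using to_sql, SQLAlchemy's mssql+pyodbc dialect can pass the same flag down to pyodbc (SQLAlchemy 1.3 and later); here is a minimal sketch with placeholder connection details:

from sqlalchemy import create_engine

# Placeholder connection string; fast_executemany is forwarded to pyodbc
engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<server>.database.windows.net/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)
pd_df_california_housing.to_sql(
    "tbl_california_housing", engine, if_exists="append", index=False
)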

Conclusion

This technique is not limited to Azure SQL DB, so we encourage you to try it with other databases and techniques as well. Please note that this article is for informational purposes only; we make no guarantees regarding its accuracy or completeness and encourage readers to use it at their own discretion.




