Machine Learning with Azure Synapse SQL

2292

Machine Learning is not only about building models but consuming them. ML Models are consumed in two ways: Batch Inferencing and Real-Time Inferencing.

In Real-Time Inferencing, results are expected in real-time. Typically, a user submits a query, and the model responds. On the other hand, in Batch Inferencing, the model does not respond in real-time. Instead, a batch or batches of data is passed to the model for scoring.

In enterprises, batches of tabular data are stored in Data warehouses, suffice traditional, descriptive analytics. However, advanced forms of analytics like Predictive Modeling need tighter integration with Machine Learning platforms. Fortunately, platforms like Azure Synapse Analytics provide a unified analytics platform. To know more about the philosophy behind Azure Synapse, here is an article.

 

So what’s the scenario here? We will demonstrate a batch prediction scenario. A model will be trained on the California Housing Dataset, stored in Azure Synapse Data Lake, using Azure Machine Learning. This model will be scored on a batch of data from another table of the dedicated Azure Synapse SQL Pool (formerly Azure SQL DW).

For an introduction to SQL pool, read this article of ours. We assume that you have Azure Synapse Analytics workspace, a Dedicated SQL Pool and an Azure Machine Learning workspace set for this lab. Let’s layout the high-level steps:

  1. Load California Housing Dataset.
  2. Grant access to Azure Machine Learning to Azure Synapse Analytics.
  3. Create a Linked Service to AML workspace in Azure Synapse.
  4. Build the Model.
  5. Score the Model.

1. Load California Housing Dataset

Data forms the bedrock of Machine Learning. Hence, we will create our datasets in this first step, using Azure Synapse databases.

Azure Synapse Analytics comprises two types of Databases viz. Lake Database and SQL Database. By design, model training happens on lake database tables. On the other hand, batch inferencing happens on SQL Pool tables. For demonstration, let’s load the training data into the default lake database while the inferencing data will be loaded into the SQL pool. Let’s get started:

First, open a new notebook in the Azure Synapse studio and import all the required libraries.

import pandas as pd
from sklearn.datasets import fetch_california_housing

Next, we will load the built-in scikit learn dataset, California Housing, in a Pandas Dataframe. To know more about the California dataset, read this article of ours.

# load the  dataset

california_housing = fetch_california_housing()
pd_df_california_housing = pd.DataFrame(california_housing.data, columns = california_housing.feature_names)
pd_df_california_housing['target'] = pd.Series(california_housing.target)

Now, let’s sample and split the dataset into two. 99 per cent of the data will be used for the Model Training process, while 1 per cent will be used for inference. Hence, we will use the pandas split function to do so.

pd_df_california_housing_inferencing = pd_df_california_housing.sample(frac=0.01)
pd_df_california_housing_training = pd_df_california_housing.drop(pd_df_california_housing_inferencing.index)

Furthermore, we set the target column or the label column of the inferencing dataset to the default value of 0.0.

pd_df_california_housing_inferencing['target'] = 0.0

Further, we convert the pandas dataframes to spark dataframes and create a view on top of the inferencing table.

spark_df_california_housing_inferencing = spark.createDataFrame(pd_df_california_housing_inferencing)
spark_df_california_housing_training = spark.createDataFrame(pd_df_california_housing_training)

spark_df_california_housing_inferencing.createOrReplaceTempView("CaliforniaInference")

Next, we load the training table into the default lake database.

spark_df_california_housing_training.write.mode("overwrite").saveAsTable("default.CaliforniaTrain")

Lastly, we use scala context to load those dataframes directly into our Synapse SQL Pool.

%%spark

val scala_df_train = spark.sqlContext.sql ("select * from CaliforniaTrain")
scala_df_train.write.synapsesql("<your-database-name>.dbo.CaliforniaTrain", Constants.INTERNAL)

val scala_df_inference = spark.sqlContext.sql ("select * from CaliforniaInference")
scala_df_inference.write.synapsesql("SQLPool01.dbo.CaliforniaInference", Constants.INTERNAL)

To cross-check, go to the Data section of the Azure Synapse studio.

2. Grant access to Azure Machine Learning to Azure Synapse Analytics

With the dataset in place, let’s get to the heart of the subject i.e. Azure Machine Learning integration with Azure Synapse Analytics. Although Azure Synapse Analytics uses Spark, which can be used for building ML models, it cannot be a replacement to a specialized ML platform like Azure Machine Learning. Moreover, the latter offers tools for citizen data scientists like Auto ML and Azure Machine Learning Designer. Hence, it is good to have the combined power of Azure Machine Learning and Azure Synapse Analytics at your disposal. However, before that, you need to do some preparatory work.

Go to Azure Machine Learning workspace’s access control and click on Add Role Assignment and assign the Contributor role to the synapse workspace:

3. Create a Linked Service to AML workspace in Azure Synapse.

Next, create a linked service to AML in Azure Synapse:

4. Build the Model.

Now, we come to the ML model training part. In this section, we will train a model using Auto ML to build an ML model. Please note that you can build a custom model in Azure Machine Learning. However, to simplify model building and to introduce the readers to AutoML, we are taking a shortcut. Here are the steps.

Firstly, click on the training dataset californiatrain and choose train a new model:

Further, choose Regression and select the below configurations before running the Auto ML:

 

 

After running the AutoML job, wait for around 30 minutes to have the best model. You can view your best model registered under the Models section of Azure Machine Learning.

4. Score the Model.

Once the model is ready, click on the CaliforniaInference table. Select the Machine Learning>Predict with a model.

Next, select your registered model. This will open up an input/output mapping window.

Lastly, name your batch prediction script and model registry table. In our case, the batch prediction script will be a stored procedure. Click on the Deploy model + open script.  This will open a procedure script:
This creates a stored procedure for batch scoring and the database table (registry) in the SQL Pool, which contains the model details. Here is the registry table i.e. dbo.CaliforniaModel.
Furthermore, this is the generated stored procedure script:
-- Create a stored procedure for storing the scoring script.

CREATE PROCEDURE dbo.predictHousingPrices

AS

BEGIN

-- Select input scoring data and assign aliases.

WITH InputData AS
(
    SELECT
        CAST([MedInc] AS [real]) AS [MedInc],
        CAST([HouseAge] AS [real]) AS [HouseAge],
        CAST([AveRooms] AS [real]) AS [AveRooms],
        CAST([AveBedrms] AS [real]) AS [AveBedrms],
        CAST([Population] AS [real]) AS [Population],
        CAST([AveOccup] AS [real]) AS [AveOccup],
        CAST([Latitude] AS [real]) AS [Latitude],
        CAST([Longitude] AS [real]) AS [Longitude]
    FROM [dbo].[CaliforniaInference]

)

-- Using T-SQL Predict command to score machine learning models. 

SELECT *
FROM PREDICT (MODEL = (SELECT [model] FROM dbo.CaliforniaModel WHERE [ID] = '<YOUR-MODEL-NAME>'),
              DATA = InputData,
              RUNTIME = ONNX) WITH ([variable_out1] [real])  END
GO
Note that the key function here is PREDICT. Having said that, this stored procedure can display results. To store the results of the batch scoring back to the CaliforniaInference table, the procedure needs a little tweaking. Here is the tweaked procedure:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROC [dbo].[predictHousingPrices] AS
BEGIN

-- Select input scoring data and assign aliases.

IF OBJECT_ID(N'tempdb..#CaliforniaInference') IS NOT NULL
DROP TABLE [dbo].[#CaliforniaInference]

SELECT
    CAST([MedInc] AS [real]) AS [MedInc],
    CAST([HouseAge] AS [real]) AS [HouseAge],
    CAST([AveRooms] AS [real]) AS [AveRooms],
    CAST([AveBedrms] AS [real]) AS [AveBedrms],
    CAST([Population] AS [real]) AS [Population],
    CAST([AveOccup] AS [real]) AS [AveOccup],
    CAST([Latitude] AS [real]) AS [Latitude],
    CAST([Longitude] AS [real]) AS [Longitude]
INTO [dbo].[#CaliforniaInference] 
FROM [dbo].[CaliforniaInference]

-- Using T-SQL Predict command to score machine learning models.

IF OBJECT_ID(N'tempdb..#CaliforniaPred') IS NOT NULL
DROP TABLE [dbo].[#CaliforniaPred]

SELECT *
    INTO [dbo].[#CaliforniaPred]
    FROM PREDICT (MODEL = (SELECT [model] FROM dbo.CaliforniaModel WHERE [ID] = '<YOUR-MODEL-ID>'),
    DATA = [dbo].[#CaliforniaInference],
    RUNTIME = ONNX) WITH ([variable_out1] [real])

-- UPDATE/MERGE the Inference table.

MERGE [dbo].[CaliforniaInference] AS tgt 
         USING (select * from [dbo].[#CaliforniaPred]) AS source 
      ON 

      (CAST(tgt.MedInc AS [real]) = source.MedInc 
       and CAST(tgt.HouseAge AS [real]) = source.HouseAge 
       and CAST(tgt.AveRooms AS [real])= source.AveRooms
       and CAST(tgt.AveBedrms AS [real])= source.AveBedrms
       and CAST(tgt.Population AS [real])= source.Population
       and CAST(tgt.AveOccup AS [real]) = source.AveOccup
       and CAST(tgt.Latitude AS [real]) = source.Latitude
       and CAST(tgt.Longitude AS [real]) = source.Longitude) 
            WHEN MATCHED THEN
                UPDATE SET tgt.target = CAST(source.variable_out1 AS [real]) ;

END
Finally, run the stored procedure and view the results i.e. the target column:

Conclusion

This article is only for information. We do not claim any guarantees regarding its completeness or accuracy. Nonetheless, we hope this article is useful. Lastly, the feature image is taken from a Microsoft blog, since the image aptly summarizes the contents of the article.



I am a Data Scientist with 6+ years of experience.


Leave a Reply