Run the Same Databricks Notebook Multiple Times in Parallel (Concurrently) Using Python

In this blog, I would like to discuss how you can use Python to run a Databricks notebook multiple times in parallel. Of course, the whole purpose of a service like Databricks is to execute your code on multiple nodes (the workers) in parallel, but there are times where you need to implement your own parallelism logic on top of that to fit your needs.

To follow along, you need a Databricks workspace, a Databricks cluster and two notebooks. The parent notebook orchestrates the parallelism and the child notebook is the one executed in parallel. The idea is that the parent notebook passes a parameter to the child notebook, and the child notebook uses that parameter to execute a given task. Without further ado, let’s get to it.

For simplicity, let’s design a child notebook that takes a number as input and prints that number multiplied by 10. To make sure we actually exercise our parallelism logic, we will also add a 20-second sleep to the child notebook. So open up your child notebook and enter the following code in Command 1 (this code lets us pass a parameter from the parent notebook).

dbutils.widgets.text("numberToProcess","","")  # define the widget that receives the parameter
numberToProcess = int(dbutils.widgets.get("numberToProcess"))  # read the parameter value as an integer

Open a new command in the child notebook and enter the following code, which multiplies our number of interest by 10, sleeps for 20 seconds and then prints the output.

import time
outputNumber = numberToProcess * 10
time.sleep(20) # sleep for 20 seconds 
print('The desired output is {}'.format(outputNumber))

So your child notebook will look something like this

Now that your child notebook is set up, go to your parent notebook and paste the following code to import the thread pool executor

from concurrent.futures import ThreadPoolExecutor

Then we define the function that will execute the child notebook, passing in the number parameter

def processAnIntegerNumber(numberToProcess):
  dbutils.notebook.run(path = "/Shared/ChildNotebook",
                       timeout_seconds = 300,
                       arguments = {"numberToProcess": numberToProcess})

Finally, the magic command, which takes a list of numbers, spins up a multithreading executor and then maps the list of numbers onto that executor. This command executes the child notebook five times (because the list contains five numbers) in parallel.

listOfNumbers = [87,90,12,34,78]
with ThreadPoolExecutor() as executor:
  results = executor.map(processAnIntegerNumber, listOfNumbers)

Executing the parent notebook, you will notice that five Databricks jobs run concurrently, each one executing the child notebook with one of the numbers in the list. This is a snapshot of the parent notebook after execution

Notice how the overall time to execute the five jobs is about 40 seconds. Now, if we open the link to any of these jobs, we will notice that each individual job also took around 32 seconds, confirming that the jobs ran in parallel rather than sequentially (five sequential 20-second runs plus overhead would have taken far longer).
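
If you also need results back in the parent notebook, the child notebook can return a value with dbutils.notebook.exit and the parent can collect whatever executor.map yields. A minimal sketch, reusing the notebook path and parameter name from above:

# In the child notebook, return the computed value as a string
# dbutils.notebook.exit(str(outputNumber))

# In the parent notebook, collect the values returned by each run
def processAnIntegerNumber(numberToProcess):
  # dbutils.notebook.run returns whatever the child passed to dbutils.notebook.exit
  return dbutils.notebook.run(path = "/Shared/ChildNotebook",
                              timeout_seconds = 300,
                              arguments = {"numberToProcess": numberToProcess})

listOfNumbers = [87,90,12,34,78]
with ThreadPoolExecutor() as executor:
  results = list(executor.map(processAnIntegerNumber, listOfNumbers))
print(results)  # e.g. ['870', '900', '120', '340', '780']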

Hope this blog helps you run your jobs faster and satisfies your “Need for Speed”.

You can follow this blog for Machine learning, data engineering, devops, dynamics and power apps tips and tricks.

Mount a Blob Storage in Azure DataBricks Only if Not Mounted Already (Using Python)

As discussed in this article by Databricks, during your work in a notebook you can mount a Blob Storage container, or a folder inside a container, to the Databricks File System. The whole point of mounting a blob storage container is simply to use an abbreviated path to your data through the Databricks file system rather than having to refer to the full URL of your blob container every time you need to read or write data. More details on mounting and its usage can be found in the article referenced above.

The purpose of this article is to suggest a way, using Python, to check whether the mount point has already been created and only attempt to create it if it doesn’t exist.

This can easily be done if we know how to list the existing mount points using Python. Luckily, Databricks offers this to us through the dbutils.fs.mounts() command. To access the actual mount points we can do something like this:

for mount in dbutils.fs.mounts():
  print (mount.mountPoint)

Knowing how to access the mount points enables us to write some Python that only mounts if the mount point doesn’t already exist. The code should look like the following:

storageAccountName = "your storage account name"
storageAccountAccessKey = "your storage account access key"
blobContainerName = "your blob container name"
if not any(mount.mountPoint == '/mnt/FileStore/MountFolder/' for mount in dbutils.fs.mounts()):
  try:
    dbutils.fs.mount(
      source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
      mount_point = "/mnt/FileStore/MountFolder/",
      extra_configs = {'fs.azure.account.key.' + storageAccountName + '.blob.core.windows.net': storageAccountAccessKey}
    )
  except Exception as e:
    print("already mounted. Try to unmount first")

Or, you can rely on an error handler alone to print an error message if the blob is already mounted, as such:

storageAccountName = "your storage account name"
storageAccountAccessKey = "your storage account access key"
blobContainerName = "your blob container name"

try:
  dbutils.fs.mount(
    source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
    mount_point = "/mnt/FileStore/MountFolder/",
    extra_configs = {'fs.azure.account.key.' + storageAccountName + '.blob.core.windows.net': storageAccountAccessKey}
  )
except Exception as e:
  print("already mounted. Try to unmount first")
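
If you do need to re-mount (for example with a new access key), you can unmount first and then run the mount code again. A minimal sketch, using the same mount point as above:

mountPoint = "/mnt/FileStore/MountFolder/"
if any(mount.mountPoint == mountPoint for mount in dbutils.fs.mounts()):
  dbutils.fs.unmount(mountPoint)  # remove the existing mount before mounting again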

Enable a Service Principal to Create Access Policies for an Azure Resource Using DevOps

If your deployment strategy is mature enough, you might have gotten to the point where you use a service principal in DevOps to do your deployments. One of the deployment tasks I needed to do was to create an access policy using a service principal with PowerShell. I simply wanted to do that with the Set-AzKeyVaultAccessPolicy command; more specifically, I wanted to grant an access policy so my data factory could get/list secrets in my key vault. Running my code with my Windows account, everything worked as expected. Running the same code with the service principal, it failed with the following error

Exception: An error occurred: System.Exception: Insufficient privileges to complete the operation.at Set-KeyVaultAccessPolicy,

This occurred even though my service principal was granted Contributor access to the resource group where the key vault lives, which is the same level of permissions my own account had. So why was my account able to run Set-AzKeyVaultAccessPolicy while my service principal didn’t have enough privileges to do it?

It turns out that in order for a service principal to create an access policy, it needs to validate the object ID of your resource against Azure Active Directory. To fix this, you can try one of the following two solutions:

Solution 1: Change the API permissions for your service principal to allow reading the Azure Active Directory Graph.

This can be done by going to your service principal (app registration) and selecting the API permissions blade. Click Add a permission as can be seen in the image below

Once there, scroll all the way to the bottom and select “Azure Active Directory Graph”. Then, under Application permissions, grant the Directory.Read.All permission as can be seen below

This will enable your service principal to verify the object ID and therefore create the necessary access policies.

Solution 2: Add the BypassObjectIdValidation flag to your command

This lets the command skip the object ID validation step and create the access policy (even though this solution has worked for many people, it didn’t work for me). Your command will look something like

Set-AzKeyVaultAccessPolicy -ResourceGroupName $resourceGroupName `
                           -VaultName $keyVaultName `
                           -ObjectId $servicePrincipal.Id  `
                           -PermissionsToSecrets get,list `
                           -ErrorAction Stop `
                           -BypassObjectIdValidation

Hope one of these two solutions helps you get past this problem.

Azure API Management and Dynamics 365 Web API

When you have SaaS systems and custom systems all over the place in your organization, there is a need for unification. The more systems you accumulate over a long period of time, the less standardization you have among them; different technologies and different architectural styles only add to that. If those systems expose any kind of API that is needed internally (by your developers) or externally (by your customers), then giving those APIs a consistent look and feel, along with a set of unified policies, becomes important.

Azure API Management (APIM) is one solution to this problem. APIM is an Azure resource that you can provision and have it sit between your API consumers and the APIs exposed by your systems.

Azure API Management

My focus here is on one example: the D365 Web API. The first question that comes to mind is why we would need to expose the D365 API through APIM when the API is already modern and well documented. Here are a few reasons:

  1. The Common Data Service has API limits. If you have systems that read data from your Dynamics 365 instance through APIM, you can cache the results and save on some API calls. APIM comes equipped with a built-in cache, and if you need a bigger cache you can attach an external cache system like Redis.
  2. Because of the imposed CDS per-user limits, with APIM you can limit the number of API calls made by your consumer systems. As of this writing, Microsoft provides 25,000, 50,000 or 100,000 API calls per application or non-interactive user, based on your licensing model (see details here).
  3. With APIM, you can set up authentication to the D365 Web API using an Azure AD application instead of a licensed user. Anyone who calls the API doesn’t need a user set up in D365 or a security role assigned, which makes user management easier. This ties back to number 2 above: the API calls here don’t count against a user quota because we are not authenticating as a licensed user. The type of user we will use is called an Application User, and this type doesn’t need a licence.
  4. If the systems that talk to the D365 Web API expect results in XML format, you can transform the JSON output of the D365 API into XML. Other transformations are also supported by APIM.
  5. APIM provides many other policies that you can put in place between your consumers and the D365 API; to mention a few, you can change headers, add more data to the request/response, and so on.
  6. The D365 API is extensive, and it is time consuming for its consumers to learn it quickly. In APIM, you can expose endpoints only for what your API consumers actually need.
  7. Analytics on who is calling your APIs and how many requests each endpoint receives.
  8. You can package your API endpoints into products (groups of APIs that serve some defined purpose), and you can provide your consumers with subscription keys to track who is calling which API endpoint.

To this point, we haven’t done any real work. In summary, here is what we want to do:

  1. Provision an APIM instance.
  2. Create a simple API in APIM that calls our D365 Web API.
  3. Set up the authentication between APIM and the D365 Web API using Azure AD, without consuming a Dynamics licence.
  4. Add a send-request policy for token generation to implicitly obtain a token and send it to D365 API.
  5. Call the APIs from APIM.

Provisioning an APIM Instance in Azure

Azure APIM comes in different pricing tiers. In this blog I opted for the Developer tier, but the steps below should work on all other tiers, including the Consumption tier. Head to Create a new resource in Azure, search for API Management and create it as below. The name needs to be globally unique. With the Developer tier, expect a wait time of at least 30 minutes for this resource to provision; if you want much faster provisioning, select the Consumption tier. Once you provision the APIM instance, its gateway will be accessible at https://{Name of your APIM resource}.azure-api.net/, and this URL will serve as the base URL for all the APIs that sit behind this APIM resource.

Create A simple APIM API that Calls D365 API

We can create APIs in APIM in different ways. If your API has an OpenAPI definition (previously called Swagger), you can import that file and APIM will create the operations for you. Unfortunately, with the Dynamics API we don’t have that luxury, so we need to create the API manually, starting from a blank API.

In your provisioned instance, click APIs in the left navigation, then click Blank API.

In the dialog that appears, fill in a friendly name for your API and add the base URL of the Dynamics API. If you have many APIs that you want to hide behind a single APIM resource, give each API a URL suffix; in my case I call it d365api. So now, calling https://{your crm org}.crm3.dynamics.com/api/data/v9.1/ is equivalent to calling your APIM instance at the URL https://apimd365.azure-api.net/d365api

Once the API is created, we need to add its operations. Assume the ESRI maps team at your organization wants to draw all of your contacts and accounts on a map and needs their data. You can easily expose two operations like this:

https://apimd365.azure-api.net/d365api/contacts

https://apimd365.azure-api.net/d365api/accounts

The way to do this is by adding an operation as shown below; you need to specify the path of the operation. Because Dynamics 365 exposes a REST API, you can get all the contacts simply by appending “contacts” to the base URL, and the same goes for accounts. Of course, if you want to return only specific data about those contacts or accounts, you can pass OData filters in the query part of the operation being created.
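
For example, assuming the maps team only needs a handful of standard columns, the usual D365 OData query options ($select, $filter, $top and friends) pass straight through APIM in the query string; the attribute names below are just illustrative:

https://apimd365.azure-api.net/d365api/contacts?$select=fullname,emailaddress1&$top=50

https://apimd365.azure-api.net/d365api/accounts?$select=name&$filter=statecode eq 0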

At this point, we have a fully non-functional API :). The reason is that Dynamics 365 still doesn’t know about APIM and doesn’t trust any request that comes from it. Next, let’s see how we can implement a server-to-server authentication without any intervention from the API caller.

Set the Authentication between APIM and D365

Note: Application and Client are used interchangeably in Azure AD terminology

APIM is very flexible from an API authentication/authorization point of view. It allows you to add OAuth 2.0 or OpenID Connect authorization servers as part of its configuration to get access tokens that you can later supply to your API calls. As you may know, the D365 API uses OAuth 2.0 to authenticate: you call an authorization endpoint that provides you with an access token, then you pass this token in a header called “Authorization” with every call you make to the D365 API. Here we have two options: the first is letting the APIM consumers be responsible for calling the authorization endpoint, getting the token and providing it as a header; the second is making their life easy and abstracting the authorization away from them completely. I prefer the latter option because it is less of a headache for the end users of your APIM.

In the case we are tackling now, I have my Dynamics 365 and my APIM in the same tenant, which means they both operate under the same Azure Active Directory. To let APIM authenticate with the Dynamics API, we need to create an app registration in Azure AD, give it permission to access Dynamics 365 and let the Dynamics 365 organization know about it. This app registration is like a user identity that APIM uses to authenticate with the D365 API.

To create an app registration in Azure AD, click App registrations, then New registration.

Give the app registration a friendly name and select Web as its type. In the redirect URL, fill in any value as this is not important; for example, use https://localhost

Take a note of the application ID and the tenant ID as we will need them later.

To complete the credentials, we need to generate a secret that acts as the password for the app registration. In a production environment, try to have this secret stored in a key vault, but for now just create a secret, set its expiry and copy its value (this is your only chance to copy the secret; after that it will be masked forever, and if you lose it you need to generate a new one).

Our app registration is created, but two things are missing: the permission to access the Dynamics API from this app registration, and letting the Dynamics system know about this app registration. To give the app registration a permission, click API permissions and add Dynamics CRM

Select the User_Impersonation permission and click Add Permission.

To make Dynamics 365 aware of this app registration, you need to create what is called an Application User (you need to be a Dynamics administrator to do this). The application user is the representative inside Dynamics 365 that maps to the app registration. Navigate to Settings -> Security -> Users, switch to the Application Users view and click New User. You only need the Application ID of the app registration that you collected earlier; fill in the other required fields and save. If everything is successful, the Application ID URI and the Object ID should auto-populate. Give this user a security role that grants access only to the data your API consumers need and nothing more. For our example to work, grant at least read access on the account and contact entities.

The last piece of information we need from Azure AD is the token generation URL; this is the URL APIM will call to get a token that authorizes requests to the D365 API. In the Overview section of either Azure AD or your app registration, click Endpoints in the command bar and copy the token URL

By now, you should have the application ID, tenant ID, secret value and the token generation URL.
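
Before wiring anything into APIM, it is worth confirming that these four values actually produce a token. Here is a hedged sketch using Python and the requests library (the placeholder values are the ones collected above); it performs the same client_credentials call that the APIM policy will make later:

import requests

tenantId = "{your tenant guid}"
clientId = "{your application (client) id}"
clientSecret = "{your secret value}"
resource = "https://{your org name}.crm3.dynamics.com/"

tokenUrl = "https://login.microsoftonline.com/{}/oauth2/token".format(tenantId)
payload = {
    "grant_type": "client_credentials",
    "client_id": clientId,
    "client_secret": clientSecret,
    "resource": resource
}
response = requests.post(tokenUrl, data=payload)
response.raise_for_status()
accessToken = response.json()["access_token"]  # this is what ends up in the Authorization header
print(accessToken[:40] + "...")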

Add a send-request Policy For Token Generation

To make things easy for the APIM consumers, we want to implicitly authenticate with the D365 API using the app registration and Azure AD. One of the most amazing features of APIM is the ability to do almost anything between receiving a call and forwarding it to the back-end service (the D365 API). In our case, we need to issue a call to the token generation URL with the proper app registration credentials, get a token and insert it into a header called “Authorization”.

Click APIs, select the API we created above, click All operations, and in the Inbound processing designer click the little edit icon. The reason we do this on All operations is that D365 requires a token on every operation, so we do it once here and it applies to each individual operation.

By default, your inbound processing looks like this, which means: do nothing on inbound calls, nothing at the back-end and nothing with the outbound results. In our case, we want to add a request to the inbound stage that gets us a token.

Policies in APIM are very flexible and let you manipulate the call pipeline in many ways. In our case, we want to send a request to the token URL, and that’s done with a send-request policy. You will need most of the values you collected in the app registration step. Notice that the send-request policy stores the result in a variable called bearerToken, and just after it we use a set-header policy that creates a header called “Authorization” and populates it with the value of the access token. (Note: the send-request stores a response object in bearerToken that, among other things, contains the access token that needs to go into the Authorization header; that explains the casting logic you see in the set-header policy.)

<inbound>
    <base />
    <send-request mode="new" response-variable-name="bearerToken" timeout="20" ignore-error="true">
        <set-url>https://login.microsoftonline.com/{your tenant guid here}/oauth2/token</set-url>
        <set-method>POST</set-method>
        <set-header name="Content-Type" exists-action="override">
            <value>application/x-www-form-urlencoded</value>
        </set-header>
        <set-body>@{
            return "client_id={your application (client) id guid}&resource=https://{your org name}.crm3.dynamics.com/&client_secret={your secret here}&grant_type=client_credentials";
        }</set-body>
    </send-request>
    <set-header name="Authorization" exists-action="override">
        <value>@("Bearer " + (String)((IResponse)context.Variables["bearerToken"]).Body.As<JObject>()["access_token"])</value>
    </set-header>
</inbound>

Call The API from APIM

Now is when the hard work “should” pay off. Open a browser tab (since we are doing a GET request, a browser is enough; otherwise, use Postman) and paste this URL

https://apimd365.azure-api.net/d365api/contacts

If you see a list of contacts returned, then you have done everything right. If you don’t, check the returned error; most of the time it is security related, and a review of all the values in the app registration and the send-request policy is a good place to start debugging.

Subscription Keys and Products

APIM comes equipped with two powerful features: Products and Subscriptions. A product is a way to group specific APIs together so they can be managed as a unit. Subscriptions are basically a method of tracking API consumers by asking them to provide a subscription key with each request. You can control API limits and analyse usage by subscription key; rate-limiting and throttling policies are very common policies in APIM (more on that here). This adds a simple layer of security and tracking to your APIM; for example, our call to get contacts would become something like this

https://apimd365.azure-api.net/d365api/contacts?subscription-key={some key}

Each subscription has a primary and a secondary key that you can provide to your APIM consumers, and if they are compromised, you can regenerate them. Subscription keys can also be passed as a header if you don’t want to expose them in the URL.
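
For consumers who prefer the header approach, the default header name APIM expects is Ocp-Apim-Subscription-Key. A minimal sketch in Python (the URL is the example API above and the key is a placeholder):

import requests

response = requests.get(
    "https://apimd365.azure-api.net/d365api/contacts",
    headers = {"Ocp-Apim-Subscription-Key": "{your subscription key}"}
)
response.raise_for_status()
print(response.json())  # the contacts returned by D365 through APIM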

Summary

We have created an APIM resource between your D365 API callers and the API itself. APIM provides a lot of control over what happens to a request over its lifetime. We showed the importance of authenticating/authorizing with an app registration and an application user, and how this approach saves us from consuming the CDS API limits of licensed users. We also saw how we can use policies to politely hijack the request, issue a call to the token endpoint and set the Authorization header with its response. This post is in no way an exhaustive guide to APIM; to learn more about its wide feature set, visit the official documentation here.

Embed Power BI Visuals in Power App Portals for External Customers

Exposing a report or a dashboard tile from Power BI on a page in your Power Apps Portal is a supported feature. Not only do you get the read-only visuals on your page, you also get a good amount of power such as natural language Q&A, exporting data, drilling, filtering, slicing, etc.

Power Apps Portal now has a liquid tag called powerbi; with this simple tag we can embed reports and dashboards in a single line of liquid anywhere on the portal. To get the powerbi liquid tag working, you need to enable Power BI visualization from the portal admin center (you need to be a global admin).

The powerbi tag accepts the authentication_type, a path parameter for a report or a dashboard, a roles parameter for row-level security roles and a tileid for a specific dashboard tile if needed. Previously, the authentication type allowed only AAD (Azure Active Directory) and Anonymous values. In the AAD case, only users in the company AAD who have the report shared with them can view it (a sign-in may be required to load the report); this is a good fit for an employee portal but hardly for any other portal type, or if your customers are not in your AAD. With the Anonymous case, you need to make your report fully public in the Power BI service, which means no authentication whatsoever and can be a no-no for some organizations. In the October 2019 release, a new authentication type called powerbiembedded became available; with it, the Power Apps portal itself authenticates with the Power BI service on the user’s behalf, so we can easily support viewing a Power BI report on the portal.

{% powerbi authentication_type:"" path:"" roles:"" tileid:"" %}

The very first step to enable this mode of authentication is to go to the portal admin center and enable the Power BI Embedded service.

You will be asked to select the workspace(s) that include the reports you wish to expose. Move the workspace to Selected workspaces and click Enable. This may trigger a portal restart that can take a few minutes.

Up until now, the Power BI service and the Power Apps Portal don’t trust each other. To enable the portal to authenticate with the Power BI service, we need to make AAD aware of that. We create a security group in AAD and add the portal application to it as a member; this allows us to tell the Power BI service that anyone in this group is allowed to see what’s inside the workspace. To do this, log in to https://portal.azure.com and open AAD.

Select Groups

Select New Group

Fill in the group details and click Create

Open the recently created group and click on Members, Add Members.

Search for the portal application and add it as a member. To save time, you can paste the Application ID found in the Portal Admin center into the new member search box and it should filter down to that portal application only.

Now, we need to tell Power BI that this security group is trusted. To do this, sign in to https://powerbi.microsoft.com and, from the gear icon, select Admin Portal.

Under Tenant settings->Developer Settings, add the group you created in the previous step. Any member in this group will have access to Power BI APIs. Click Apply when done.

Our work here is done; what we need now is to expose a visual on the portal. Create a web page with a custom page template, and let the page template point to a custom web template. To show a working example, I created a sample web template called PowerBI that contains the powerbi liquid tag.

Notice that the authentication_type is now powerbiembedded and the path points to a report in my Power BI Service. Note the roles parameter here: I assigned it to a dummy role I created in my report (see https://docs.microsoft.com/en-us/power-bi/service-admin-rls), as it seems at least one role has to be passed in the liquid tag for the visualization to work. A really good use case for the roles parameter is to have the roles in your Power BI report match the names of your web roles on the Power Portal; then you can control who can see what data based on their web role.

The final result? A report on the Power Portal, with the portal taking care of authentication with the Power BI Service.

Select Top n Records For Each Group In Python (Pandas)

Say that you have a dataframe in Pandas and you are interested in finding the top n records for each group. Depending on your need, top n can be defined based on a numeric column in your dataframe or it can simply be based on the count of occurrences for the rows in that group.

For example, suppose (god forbid) that I have a retail store with four branches (A, B, C, D). Also suppose that for each day, I would like to get the three branches with the highest number of items sold. So, I would like to take the day as a group, count the number of items sold in each branch for that day and then pick the three branches with the highest counts. This is our problem; now, without further ado, let’s see the code for our example.

import pandas as pd

sales = {
    'branch_name': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D',
                    'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D', 'D'],

    'date': ['01/01/2020','01/01/2020','01/01/2020','01/01/2020','01/01/2020','01/01/2020',
             '01/01/2020','01/01/2020','01/01/2020','02/01/2020','02/01/2020','02/01/2020',
             '02/01/2020','02/01/2020','02/01/2020','02/01/2020','02/01/2020','02/01/2020'],

    'item_no': ['I1','I2','I3','I4','I5','I6','I7','I8','I9',
                'I10','I11','I12','I13','I14','I15','I16','I17','I18']
}
sales = pd.DataFrame(sales)
sales['date'] = pd.to_datetime(sales['date'], format='%d/%m/%Y')

Sweet, let’s look at how our dataframe looks

        branch_name 	date 	item_no
0 	   A       	2020-01-01 	I1
1 	   A       	2020-01-01 	I2
2 	   B       	2020-01-01 	I3
3 	   B       	2020-01-01 	I4
4 	   C       	2020-01-01 	I5
5 	   C       	2020-01-01 	I6
6 	   C       	2020-01-01 	I7
7 	   C       	2020-01-01 	I8
8 	   D       	2020-01-01 	I9
9 	   A       	2020-01-02 	I10
10 	   A       	2020-01-02 	I11
11 	   B       	2020-01-02 	I12
12 	   B       	2020-01-02 	I13
13 	   C       	2020-01-02 	I14
14 	   D       	2020-01-02 	I15
15 	   D       	2020-01-02 	I16
16 	   D       	2020-01-02 	I17
17 	   D       	2020-01-02 	I18

Great, from the dataframe above you can see that for 2020-01-01 most sales come from branches A, B and C, whereas for 2020-01-02 most sales come from branches A, B and D. Let’s write the expression to return what we just concluded with our eyes.

Solution

The solution goes as follows:
1- We need to count the number of items sold for each day and each branch.

sales_agg = sales.groupby(['date', 'branch_name']).agg({'item_no':'nunique'}).reset_index()
print(sales_agg)
      date 	   branch_name 	item_no
0 	2020-01-01 	A 	         2
1 	2020-01-01 	B 	         2
2 	2020-01-01 	C 	         4
3 	2020-01-01 	D 	         1
4 	2020-01-02 	A 	         2
5 	2020-01-02 	B 	         2
6 	2020-01-02 	C 	         1
7 	2020-01-02 	D 	         4

2- For each day, we need to sort the branches by the number of items sold in descending order

sales_sorted = sales_agg.groupby(['date']).apply(lambda x: x.sort_values(['item_no'],ascending = False)).reset_index(drop = True)
print(sales_sorted)
 	date 	    branch_name 	item_no
0 	2020-01-01   	C 	         4
1 	2020-01-01   	A 	         2
2 	2020-01-01   	B 	         2
3 	2020-01-01   	D 	         1
4 	2020-01-02   	D 	         4
5 	2020-01-02   	A 	         2
6 	2020-01-02   	B 	         2
7 	2020-01-02   	C 	         1

3- Now for each date, we need to pick the top n records (3 in our case)

sales_sorted.groupby(['date']).head(3)
      date 	   branch_name 	item_no
0 	2020-01-01 	   C 	      4
1 	2020-01-01 	   A 	      2
2 	2020-01-01 	   B 	      2
4 	2020-01-02 	   D 	      4
5 	2020-01-02 	   A 	      2
6 	2020-01-02 	   B 	      2

Great, we are done. This returns the top 3 branches based on the total number of items sold for each day. Hope this helps you solve your own problem.
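
As a side note, the three steps above can be chained into a single expression if you prefer a more compact, equivalent version:

top3_per_day = (sales.groupby(['date', 'branch_name'])
                     .agg({'item_no':'nunique'})
                     .reset_index()
                     .sort_values(['date','item_no'], ascending = [True, False])
                     .groupby(['date'])
                     .head(3))
print(top3_per_day)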

Your Fully Equipped and Most Simple Machine Learning Regression Template

In this blog, I would like to go through a full machine learning project using regression modeling, following common-sense data cleaning and processing. This project should cover most of the checkpoints of any real-world machine learning project, using some of the most powerful and commonly used regression algorithms. I will be covering:

1- The process of data exploration and why it is not just a fancy thing to do to impress business users.
2- How to divide your numbers (numeric variables) from your letters (categorical variables) and then bring them back together.
3- The ultimate painful experience of missing values and how to cheat on them.
4- Letters, oh god I hate letters, show me the numbers from these categories.
5- Let’s bring our best gladiators (support vector machines, random forests, gradient boosting algorithm …etc) and see what they can learn.
6- Which algorithm was best at learning? Are they consistent in their learning? Can I simply visualize the performance for each model and let my eye do the comparison because my brain is busy with other things?

Data Ingestion and Exploration

The first thing we need to do, of course, is to load/read our data. For the purpose of this blog, I decided to use one of the most popular datasets for regression purposes: the house pricing dataset. If you want to follow along, feel free to download the file from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/download/EGpnHbJBO1XGF5B35WNU%2Fversions%2FHOSha0bPv4HEBGjo9O9x%2Ffiles%2Ftrain.csv. This dataset contains several numerical and categorical predictors to help predict the sale price of a house. So, the objective is to use the sale price of historical houses to come up with rules that map our predictors to the sale price.

Once the file is downloaded to your local computer, you can read it and show the first few rows using the following code, after loading all the required libraries of course

""" Import required libraries Main Libraries"""
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyodbc
import os
import sklearn
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
%matplotlib inline

# Basic Statistics Packages
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# ML Libraries
from mlxtend.regressor import StackingCVRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, scale, StandardScaler,RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, ElasticNetCV, BayesianRidge, Lasso
from sklearn.svm import SVR
from mlxtend.regressor import StackingCVRegressor
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
house = pd.read_csv("C:\\train.csv")
house.head()

After having a first look at the data, it is always best to split your variables into numeric and categorical. The reason is that categorical variables mostly need to be handled differently than numeric variables for ML purposes. This matters in particular because 1) we impute their missing values differently, 2) we apply transformations such as one-hot encoding to categorical variables, and 3) numeric variables use different visuals than categorical variables during the exploration phase.

# In addition to splitting variables into numeric and categorical, we need to remove
# the response variable and the id variable from the list
response_var = 'SalePrice'
id_var = 'Id'
num_vars = [col for col in house.columns if house[col].dtype in ['int64','float64']]
cat_vars = [col for col in house.columns if house[col].dtype in ['object']]
num_vars.remove(response_var)
num_vars.remove(id_var)
print(num_vars)
print(cat_vars)

Okay, looks good. We have a better idea of what our numeric and categorical predictors are from a 30,000-foot view. Now, let’s have a closer look at our most important variable, the response variable SalePrice. Let’s check its distribution

sns.distplot(house[response_var])
Shows the distribution for the sales price response variable.

From the plot above we can see that ‘SalePrice’ is fairly positively skewed (there are a few houses with a very high SalePrice compared to the rest) but not extremely so, which is a good thing. Do we have to treat this variable? Not necessarily, but it is highly recommended. Why? Simply because most of the machine learning algorithms we will use here work better when the target is roughly normally distributed. The most obvious way to treat a skewed variable is to get rid of the observations (houses) with these very high SalePrice values, but other commonly used approaches, such as a log transformation or taking the square root of the variable, are usually preferred for positively skewed variables. So, let’s pick the log transformation before we delve any deeper into our analysis.

house[response_var] = np.log1p(house[response_var])
sns.distplot(house[response_var])
Log-transformed SalePrice. It looks much nicer now, with more normally distributed values than the absolute values (the values before the log transformation)

Awesome, now that we have a fairly normally distributed response variable, let’s have a look at how this variable correlates with the other numerical predictors. The best visual for the relationship between two numeric variables is the scatter plot (yes, Mr. Obvious is here). Scatter plots help tremendously in four main aspects:
1) Which predictors are most strongly correlated with our response variable?
2) Which predictors don’t show any variability in their values?
3) What does the relationship between each predictor and the response variable look like: linear or non-linear?
4) Are there very obvious outliers in any of the predictor variables?
Without further ado, let’s plot our data and check for these four points

# Define a figure with a defined size; each variable gets its own subplot on a grid
fig = plt.figure(figsize=(12, 120))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)
# Iterate through the numeric columns of the house dataframe
for i, feature in enumerate(list(house[num_vars]), 1):
    # Define the location of each plot on a grid with 3 columns
    # (number of rows, number of columns, position of the plot on the grid)
    plt.subplot(len(list(num_vars)), 3, i)
    # Do the actual plot for the given feature
    sns.scatterplot(x=feature, y=response_var, data=house)
    # x and y labels
    plt.xlabel('{}'.format(feature), size=15, labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)

plt.show()

Okay, this results in a fairly large picture, so I will cut and paste the output into multiple images

Okay, there is a hell of a lot to talk about here. As I mentioned, scatter plots never fail you in how important and revealing they are. A very powerful visualization indeed.

Outliers
This blog is not intended to show you all the fancy ways to detect outliers. Here I am following the simplest possible approach: I look at the scatter plots and see where observations don’t make much sense visually. Of course, these outliers can be important and can occur regularly in your population, but based on the sample we have here, the points I highlighted with my fancy yellow circles seem to be outliers we can live without. I wasn’t conservative in my selection, so as a first order of business I will get rid of all houses highlighted in the yellow circles above.

Based on what we have highlighted in yellow, we need to get rid of houses with the following values:
– LotFrontage is more than 150
– Lot area is more than 50000
– MasVnrArea is more than 1000
– BsmtFinSF1 is more than 3000
– BsmtFinSF2 is more than 1200
– TotalBsmtSF is more than 4000
– 1stFlrSF is more than 3000
– 2ndFlrSF is more than 1750
– GrLivArea is more than 4000
– BedroomAbvGr is more than 7
– TotRmsAbvGrd is more than 12
– GarageArea is more than 1200
– WoodDeckSF is more than 750
– OpenPorchSF is more than 400
– EnclosedPorch is more than 500
Of course, depending on your project and your data, you might need to be more or less conservative.

# Write code to remove outliers, as we decided from above
print('data frame dimensions before removing outliers is {}'.format(house.shape))
house = house[house['LotFrontage'] <= 150]
house = house[house['LotArea'] <= 20000]
house = house[house['MasVnrArea']<= 1000]
house = house[house['BsmtFinSF1']<= 3000]
house = house[house['BsmtFinSF2']<= 1200]
house = house[house['TotalBsmtSF']<= 4000]
house = house[house['1stFlrSF'] <= 3000]
house = house[house['2ndFlrSF']<= 1750]
house = house[house['GrLivArea']<= 4000]
house = house[house['BedroomAbvGr'] <= 7]
house = house[house['TotRmsAbvGrd'] <= 12]
house = house[house['GarageArea'] <= 1200]
house = house[house['WoodDeckSF']<= 750]
house = house[house['OpenPorchSF'] <= 400]
house = house[house['EnclosedPorch'] <= 500]
print('data frame dimensions after removing outliers is {}'.format(house.shape))

Great, now that the outliers have been removed, we can see the relationships between our predictors and the response variable more clearly, which helps us define visually which variables matter most to us. For example, below is the ‘LotArea’ vs ‘SalePrice’ visual before and after the outliers were removed; you can see how much clearer the relationship has become. And trust me on this: ML algorithms pick up patterns in a way fairly similar to how we define patterns ourselves, so by removing these outliers we first help our own eyes see what is important, and we definitely make it easier for the algorithm to define the relationship.

Separate the wheat from the chaff
Now that we have removed the most obvious outliers, let’s look at the correlation between the different predictors and ‘SalePrice’ so we keep the important variables in mind as we continue with this project. There are two approaches here: the manual approach, where we look at the scatter plots above and try to read the correlation visually, and the more formal way of building a correlation plot specifically to understand correlation. Let’s try both here. By looking at the scatter plots generated after removing the outliers, we can conclude the following:
– LotArea is highly correlated with SalePrice.
– LotFrontage is highly correlated with SalePrice.
– OverallQual is highly correlated with SalePrice.
– TotalBsmtSF is highly correlated with SalePrice.
– 1stFlrSF is highly correlated with SalePrice.
– GrLivArea is highly correlated with SalePrice.
– Many other variables are highly correlated with SalePrice.
– BsmtFinSF2 doesn’t show much variability; we don’t lose much by getting rid of it.
– LowQualFinSF doesn’t show much variability; we don’t lose much by getting rid of it.
– PoolArea, ScreenPorch, 3SsnPorch and MiscVal don’t show much variability; we wouldn’t lose much by getting rid of them.

In addition to seeing which variables are correlated and which are not, we can see that many variables are linearly correlated with SalePrice. This indicates that a linear model might do a very good job predicting SalePrice. We shall see later whether this observation holds.

# Remove the unwanted variables from the list of numeric variables
remove_vars = ['BsmtFinSF2','LowQualFinSF','PoolArea','ScreenPorch','3SsnPorch','MiscVal']
num_vars = [item for item in num_vars if item not in remove_vars]
print(num_vars)

Missing Values

Even though different ML algorithms can differ significantly in how they map inputs to outputs, they all seem to agree on one thing: they all hate missing values. Again, this article is not about the latest and greatest strategies for dealing with missing values; rather, it is about showing that dealing with them is an essential step of the ML process. That said, in the projects I have worked on, I have never seen a better strategy for imputing missing values than common sense and awareness of the data and the problem at hand. For instance, sometimes a missing value simply means a zero for a particular variable. Knowing something like this is far more powerful, and far easier, than implementing an algorithmic approach (yes, even a KNN approach) to estimate the missing values.
Okay, let’s see how bad the problem is. Let’s first run a script to check which variables have missing values and what percentage of values is missing in each.

missing_dict = {}
for var in house.columns:
    percent_missing = (house[var].isnull().sum()/ house.shape[0]) * 100
    if percent_missing > 0:
        missing_dict[var] = percent_missing
        print('variable: {} , missing_values: {}'.format(var, percent_missing))

Let’s visualize missing values per variable

# Sort the dictionary for smallest to biggest value
missing_dict = {k: v for k, v in sorted(missing_dict.items(), key=lambda item: item[1])}
plt.figure(figsize=(12,8))
plt.bar(range(len(missing_dict)), list(missing_dict.values()), align='center')
plt.xticks(range(len(missing_dict)), list(missing_dict.keys()), rotation = 70)
plt.show()

Great, now we know that we have a total of 16 variables with missing values, and only 5 of them have a very high percentage of missing values (more than 50% of the whole sample size). The rest have less than 10% missing. So, what do we need to do here?
For this problem, we can check the documentation and see whether a missing value means anything, and whether we can derive the missing values for the variables with a large percentage missing. If we can’t, then we might sadly need to get rid of those variables. We follow the same approach for variables with a low percentage of missing values; if reading about a variable doesn’t tell us how to fill it in, we need a method to find a proxy for the missing value.

# The data description states that a missing garage value means there is no garage (we replace them with None / 0)
for var in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    house[var] = house[var].fillna('None')
house['GarageYrBlt'] = house['GarageYrBlt'].fillna(0)

# The data description states that a missing basement value means there is no basement (we replace them with None)
for var in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    house[var] = house[var].fillna('None')

# Other variables where a missing value indicates the feature doesn't exist; replace them with None as well
for var in ('PoolQC', 'Alley', 'Fence', 'FireplaceQu', 'Electrical'):
    house[var] = house[var].fillna('None')

# MiscFeature only adds information not covered by the other categories, so let's just get rid of it
cat_vars.remove('MiscFeature')

Okay, great. This should have gotten rid of all the missing values. You can verify that none remain by running the following script again

for var in house.columns:
    percent_missing = (house[var].isnull().sum()/ house.shape[0]) * 100
    if percent_missing > 0:
        print('variable: {} , missing_values: {}'.format(var, percent_missing))

Feature Engineering

Awesome. We have done most of the daunting tasks required to clean and improve the quality of our dataset; most of it was just meeting the basic requirements for an ML algorithm to work correctly and smoothly. With all of that behind you, let’s unleash the artist that lives inside (please don’t unleash your singing or drawing skills, keep those leashed for now). Feature engineering, in its simplest form, is the process of updating existing features or deriving new features from them in the hope of building smarter rules that map features to our response variable. That is to say, sometimes two independent variables don’t show a strong correlation with the response variable, but if we multiply them, the resulting variable might show a very strong linear or non-linear correlation. This step of the ML process is usually 20% data scientist and 80% subject matter expert (SME): to excel at feature engineering you need to know the data very well, know the problem at hand, and have a solid understanding of your variables, how they are measured and what exactly they mean from a business perspective. For our problem here, we have a fairly strong sense of the different variables and the house sale problem, but if you are working on a harder business problem, in an engineering field for example, then you need a lot of communication with your SME at this step. Explain to them the power of feature engineering and let them flush all the information they have in their heads right into your model. Never underestimate the importance of this step.
For the purpose of this article, I want to reference the ideas discussed in this very good read on Kaggle (https://www.kaggle.com/lavanyashukla01/how-i-made-top-0-3-on-a-kaggle-competition). You can also find many interesting notebooks there with more specific details on this dataset.

house_num = house[num_vars]
house_num.index = house.index

house_cat = house[cat_vars]
house_cat.index = house.index
house_y = house[response_var]

# Binary flags: 1 if the house has the feature, 0 otherwise
house_num['HasWoodDeck'] = (house_num['WoodDeckSF'] > 0) * 1
house_num['HasOpenPorch'] = (house_num['OpenPorchSF'] > 0) * 1
house_num['HasEnclosedPorch'] = (house_num['EnclosedPorch'] > 0) * 1

house_num['Total_Home_Quality'] = house_num['OverallQual'] + house_num['OverallCond']
house_num['TotalSF'] = house_num['TotalBsmtSF'] + house_num['1stFlrSF'] + house_num['2ndFlrSF']
house_num['Total_sqr_footage'] = (house_num['BsmtFinSF1']  + house_num['1stFlrSF'] + house_num['2ndFlrSF'])
house_num['Total_Bathrooms'] = (house_num['FullBath'] + (0.5 * house_num['HalfBath']) +
                               house_num['BsmtFullBath'] + (0.5 * house_num['BsmtHalfBath']))


house_num['TotalBsmtSF'] = house_num['TotalBsmtSF'].apply(lambda x: np.exp(6) if x <= 0.0 else x)
house_num['2ndFlrSF'] = house_num['2ndFlrSF'].apply(lambda x: np.exp(6.5) if x <= 0.0 else x)
house_num['GarageArea'] = house_num['GarageArea'].apply(lambda x: np.exp(6) if x <= 0.0 else x)
house_num['GarageCars'] = house_num['GarageCars'].apply(lambda x: 0 if x <= 0.0 else x)
house_num['LotFrontage'] = house_num['LotFrontage'].apply(lambda x: np.exp(4.2) if x <= 0.0 else x)
house_num['MasVnrArea'] = house_num['MasVnrArea'].apply(lambda x: np.exp(4) if x <= 0.0 else x)
house_num['BsmtFinSF1'] = house_num['BsmtFinSF1'].apply(lambda x: np.exp(6.5) if x <= 0.0 else x)

house_num['has2ndfloor'] = house_num['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
house_num['hasgarage'] = house_num['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
house_num['hasbsmt'] = house_num['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
house_num['hasfireplace'] = house_num['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

In addition to the transformations done above, you may also want to experiment with applying additional functions to your features; for example, you can take the log, square root, square or cube of at least your most important features.
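
As a minimal illustration of that idea (the choice of columns and transforms below is just an assumption; pick whatever features matter most in your own data):

# Derive a few simple non-linear versions of some important numeric features
for col in ['GrLivArea', 'TotalSF', 'LotArea']:
    house_num[col + '_log'] = np.log1p(house_num[col])   # log transformation
    house_num[col + '_sqrt'] = np.sqrt(house_num[col])    # square root
    house_num[col + '_sq'] = house_num[col] ** 2          # square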

From Letters to Numbers

Great. So far we have taken hopefully easy and understandable steps to reach a fairly advanced stage of our ML project life-cycle. As you may have noticed, we focused mostly on our numeric variables, simply because of the flexibility of the transformations we can apply to them; even our ML algorithms prefer to work with numbers, so don’t feel guilty about it. Now we need to pivot, or spread out, our categorical variables to move the information from the row format to the column format. In other words, if feature X contains the values YES/NO, we will create a column called X_YES that has a value of 1 for rows with a YES and 0 for rows with a NO. The one thing we need to watch for here is the number of unique values these categorical features have: if a variable has 100 unique values, we end up with 100 extra columns, a very expensive addition to the dataset that doesn’t necessarily add much value. So, the first order of business is to check the number of unique values; if it is large, we use a label encoder, otherwise we use a one-hot encoder. A label encoder simply maps the categories from letters to numbers, keeping a single column in the output. A one-hot encoder creates a new column for each category, as discussed above. Let’s visualize how many unique values each variable has using this code:

# Define a figure with a defined size; each variable gets its own subplot on a grid
fig = plt.figure(figsize=(10, 120))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)
# Iterate through the categorical columns of the house_cat dataframe
for i, var in enumerate(list(house_cat[cat_vars]), 1):
    # Define the location of each plot on a grid with 5 columns
    # (number of rows, number of columns, position of the plot on the grid)
    plt.subplot(len(list(cat_vars)), 5, i)
    value_counts = house_cat[var].value_counts()
    df_counts = value_counts.rename_axis('unique_values').reset_index(name='counts')
    # Do the actual plot for the given variable
    sns.barplot(x='unique_values', y='counts', data=df_counts)
    # Title and axis labels
    plt.title('{}'.format(var), size=12)
    plt.xlabel('')
    plt.ylabel('')
    plt.xticks(rotation=70)
plt.show()

Great, I guess we got really lucky: none of the variables shows an extremely large number of unique values. This makes the decision easy on us; we create a boolean (1/0) column for each unique value of each variable. This can be accomplished either with the OneHotEncoder class or with the get_dummies function in pandas. For simplicity, I will go ahead and use the get_dummies function.

house_cat = pd.get_dummies(house_cat)
house_x = pd.concat([house_cat, house_num], axis = 1 )
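
For completeness, if this dataset did contain high-cardinality categorical columns, a label-encoding fallback could look like the sketch below; it would run on house_cat before the pd.get_dummies call above, and the threshold of 20 unique values is an arbitrary assumption:

# Hypothetical fallback for high-cardinality columns (not needed for this dataset)
high_card_vars = [col for col in cat_vars if house[col].nunique() > 20]  # the threshold is an assumption
for col in high_card_vars:
    house_cat[col] = LabelEncoder().fit_transform(house_cat[col].astype(str))
# ...then one-hot encode only the remaining low-cardinality columns with pd.get_dummies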

Getting to the Cool Stuff: The Gladiators at Work


By now, we have at our disposal a clean dataset with all the previous steps applied. Now it is time to put the machine learning algorithms to work and see how they perform in predicting the ‘SalePrice’ response variable.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(house_x, house_y, test_size = 0.2, random_state = 0)

Before we do any prediction, we need to define our prediction error functions. That is to say, for any prediction we get, we need to know how far it is from the actual value; this tells us how good our model is. For our experiment we need two error functions: the first calculates the error between predicted and actual values for a given fitted model, and the second calculates the error using cross-validation. 10-fold cross-validation, for example, splits the training data into 10 different folds. In the first round, one fold is used for validation and the remaining nine folds are used to train the model, and we estimate the prediction error for this round. In the second round we pick the second fold for validation, train on the remaining nine, and calculate the prediction error again. We repeat this ten times and therefore get ten different prediction error measures, which gives us a more comprehensive estimate of the prediction error along with its mean and standard deviation. Check out the image below (with the provided link) to better understand the cross-validation process.

Reference: https://transform365home.files.wordpress.com/2020/01/b1ae4-1rgba1biouys7wqcxcl4u5a.png
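
The error functions themselves are used further down (cv_rmse, and a kf fold object that RidgeCV relies on) without being shown, so here is a minimal sketch of what they might look like. The 10 folds and the random seed are assumptions, not necessarily the exact settings used in the original notebook.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

# Fold splitter shared by cv_rmse below and by RidgeCV later on (10 folds is an assumption)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

def rmse(y_actual, y_predicted):
    # Root mean squared error between actual and predicted values
    return np.sqrt(mean_squared_error(y_actual, y_predicted))

def cv_rmse(model, X, Y):
    # RMSE estimated with k-fold cross validation; returns one score per fold
    return np.sqrt(-cross_val_score(model, X, Y,
                                    scoring='neg_mean_squared_error',
                                    cv=kf))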

Let’s define our models and, once defined, fit them using our training dataset.

# Imports for the estimators below (these may already appear earlier in the notebook)
import numpy as np
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from mlxtend.regressor import StackingCVRegressor

# Setup all models
# Light Gradient Boosting Regressor
lightgbm = LGBMRegressor(objective='regression', 
                       num_leaves=6,
                       learning_rate=0.01, 
                       n_estimators=7000,
                       max_bin=200, 
                       bagging_fraction=0.8,
                       bagging_freq=4, 
                       bagging_seed=8,
                       feature_fraction=0.2,
                       feature_fraction_seed=8,
                       min_sum_hessian_in_leaf = 11,
                       verbose=-1,
                       random_state=42)

# XGBoost Regressor
xgboost = XGBRegressor(learning_rate=0.01,
                         n_estimators=6000,
                         max_depth=4,
                         min_child_weight=0,
                         gamma=0.6,
                         subsample=0.7,
                         colsample_bytree=0.7,
                         objective='reg:squarederror',
                         nthread=-1,
                         scale_pos_weight=1,
                         seed=27,
                         reg_alpha=0.00006,
                         random_state=42)

# Ridge Regressor
ridge_alphas = [1e-15, 1e-10, 1e-8, 9e-4, 7e-4, 5e-4, 3e-4, 1e-4, 1e-3, 5e-2, 1e-2, 0.1, 0.3, 1, 3, 5, 10, 15, 18, 20, 30, 50, 75, 100]
ridge = make_pipeline(RobustScaler(), RidgeCV(alphas=ridge_alphas, cv=kf))

# Support Vector Regressor
svr = make_pipeline(RobustScaler(), SVR(C= 20, epsilon= 0.008, gamma=0.0003))

# Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=6000,
                                learning_rate=0.01,
                                max_depth=4,
                                max_features='sqrt',
                                min_samples_leaf=15,
                                min_samples_split=10,
                                loss='huber',
                                random_state=42)  

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=1200,
                          max_depth=15,
                          min_samples_split=5,
                          min_samples_leaf=5,
                          max_features=None,
                          oob_score=True,
                          random_state=42)

# Stack up all the models above, optimized using xgboost
stack_gen = StackingCVRegressor(regressors=(lightgbm, svr, ridge, gbr, rf),
                                meta_regressor=xgboost,
                                use_features_in_secondary=True)
models = []
models.append(('lightgbm', lightgbm))
models.append(('xgboost', xgboost))
models.append(('ridge',ridge))
models.append(('svr', svr))
models.append(('gbr', gbr))
models.append(('rf', rf))
models.append(('stack_gen', stack_gen))
# Fit the models 
print('stack_gen')
stack_gen_model_fit = stack_gen.fit(np.array(X_train), np.array(y_train))

print('lightgbm')
lgb_model_fit = lightgbm.fit(X_train, y_train)

print('Svr')
svr_model_fit = svr.fit(X_train, y_train)

print('Ridge')
ridge_model_fit = ridge.fit(X_train, y_train)

print('RandomForest')
rf_model_fit = rf.fit(X_train, y_train)

print('GradientBoosting')
gbr_model_fit = gbr.fit(X_train, y_train)

print('XGBoost')
xgb_model_fit = xgboost.fit(X_train, y_train)

The first thing we need to do is visualize our cross validation results. Let’s compute the cross validation scores and plot them in the same code snippet.

## Calculate and plot the cross validation error rate
cv_results = []
cv_names = []
for name, model in models:
    cv_scores = cv_rmse(model = model, X = X_train, Y = y_train)
    cv_results.append(cv_scores)
    cv_names.append(name)
    msg = "%s: %f (%f)" % (name, cv_scores.mean(), cv_scores.std())
    print(msg)

fig = plt.figure()
fig.suptitle('Algorithm Comparison for House Price Cross Validation Results')
ax = fig.add_subplot(111)
plt.boxplot(cv_results)
ax.set_xticklabels(cv_names)
plt.show()

Great! We are seeing some really good results. Most algorithms show similar error rates, and the lowest error rate, perhaps surprisingly, belongs to the linear ridge regression. If we were to deploy this to production, I would definitely choose the ridge regression. In addition to its low error measure compared to the other algorithms, ridge regression has one powerful advantage over all of them: ease of interpretability. That is to say, it is easy to understand how much each feature contributes to the final predicted score for SalePrice. This lets us go beyond the predicted value and understand which features contribute negatively or positively to that score, a very powerful tool indeed in the field of machine learning. Interpretability (or explainability) is out of the scope of this article, though, so I won’t delve much into that concept here.
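
As a small taste of that interpretability, here is a sketch that pulls the coefficients out of the fitted ridge pipeline and ranks the features by their impact. It assumes the ridge_model_fit pipeline and the house_x column names from above; showing the top 15 features is an arbitrary choice.

import pandas as pd

# The RidgeCV step inside the pipeline holds one coefficient per feature
ridge_cv = ridge_model_fit.named_steps['ridgecv']
coefs = pd.Series(ridge_cv.coef_, index=house_x.columns)

# Largest absolute coefficients = features with the biggest influence on the predicted SalePrice
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(15))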

Very sweet. Now that we have seen that our cross validation results are reasonable, let’s make predictions with our fitted models and see how the predicted values compare to the actual values.

## Make prediction on the training data for each model
y_train_stack = stack_gen_model_fit.predict(X_train)
y_train_lgb = lgb_model_fit.predict(X_train)
y_train_svr = svr_model_fit.predict(X_train)
y_train_ridge = ridge_model_fit.predict(X_train)
y_train_rf = rf_model_fit.predict(X_train)
y_train_gbr = gbr_model_fit.predict(X_train)
y_train_xgb = xgb_model_fit.predict(X_train)

## Make prediction on the test data for each model
y_test_stack = stack_gen_model_fit.predict(X_test)
y_test_lgb = lgb_model_fit.predict(X_test)
y_test_svr = svr_model_fit.predict(X_test)
y_test_ridge = ridge_model_fit.predict(X_test)
y_test_rf = rf_model_fit.predict(X_test)
y_test_gbr = gbr_model_fit.predict(X_test)
y_test_xgb = xgb_model_fit.predict(X_test)
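
Before plotting, it can also be handy to print a single test-set error number per model. This sketch reuses the rmse helper assumed earlier; the dictionary simply collects the test predictions computed above.

test_predictions = {
    'stack_gen': y_test_stack, 'lightgbm': y_test_lgb, 'svr': y_test_svr,
    'ridge': y_test_ridge, 'rf': y_test_rf, 'gbr': y_test_gbr, 'xgboost': y_test_xgb,
}
# Root mean squared error on the held-out test set for each model
for name, preds in test_predictions.items():
    print('{}: test RMSE = {:.4f}'.format(name, rmse(y_test, preds)))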

Now that we have the predicted values from all the models, let’s visualize actual versus predicted values. Below are the plots for the ridge and light gradient boosting models; the red line marks perfect predictions.

# Ridge regression: predicted vs. actual values
plt.scatter(y_train_ridge, y_train, c="blue", marker="s", label="Training data")
plt.scatter(y_test_ridge, y_test, c="lightgreen", marker="s", label="Testing data")
plt.title("Ridge Regression For Sale Price")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc="upper left")
# Identity line: points on this line are predicted perfectly
plt.plot([10.5, 13], [10.5, 13], c="red")
plt.show()

# Light gradient boosting: predicted vs. actual values
plt.scatter(y_train_lgb, y_train, c="blue", marker="s", label="Training data")
plt.scatter(y_test_lgb, y_test, c="lightgreen", marker="s", label="Testing data")
plt.title("Light Gradient Boosting Regression For Sale Price")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc="upper left")
# Identity line: points on this line are predicted perfectly
plt.plot([10.5, 13], [10.5, 13], c="red")
plt.show()

Conclusion

If you made it this far, then congratulations! You have built a fairly sophisticated model to predict house prices without using any of the more advanced techniques in machine learning. This article was meant to show that, by applying common sense to your data, you can achieve fairly high accuracy with your model. That is to say, we achieved good results without implementing any hyperparameter tuning, pipelines or advanced feature selection. This is not to underestimate the value of those steps, but rather to show that good results can still be achieved by following the steps discussed in this article. I will be blogging about these topics in future posts.

References:

In addition to the links above, some of the ideas and code snippets used in this article were adapted from the following great articles and notebooks:

Read a CSV file stored in blob container using python in DataBricks

Let’s say that you have a csv file, a blob container and access to a Databricks workspace. The purpose of this mini blog is to show how easy the process is, from having a file on your local computer to reading the data into Databricks. I will go through the process of uploading the csv file manually to an Azure blob container and then reading it in Databricks using Python code.

Step 1: Upload the file to your blob container

This can be done simply by navigating to your blob container. From there, click the upload button and select the file you are interested in. Once selected, click the upload button in the upload blade. See the screenshot below.

Once uploaded, you will be able to see the file available in your blob container as shown below:

Step 2: Get credentials necessary for databricks to connect to your blob container

From your Azure portal, navigate to All resources, select your blob storage account and, under Settings, select Access keys. Once there, copy the key under key1 to a local notepad.

Step 3: Configure DataBricks to read the file

Here, you need to navigate to your Databricks workspace (create one if you don’t have one already) and launch it. Once launched, go to Workspace and create a new Python notebook.

To start reading the data, you first need to configure your Spark session with the credentials for your blob container. This can simply be done through the spark.conf.set command. More precisely, we start with the following:

storage_account_name = 'nameofyourstorageaccount' 
storage_account_access_key = 'thekeyfortheblobcontainer'
spark.conf.set('fs.azure.account.key.' + storage_account_name + '.blob.core.windows.net', storage_account_access_key)


Once done, we need to build the file path in the blob container and read the file as a spark dataframe.

blob_container = 'yourblobcontainername'
filePath = "wasbs://" + blob_container + "@" + storage_account_name + ".blob.core.windows.net/Sales/SalesFile.csv"
salesDf = spark.read.format("csv").load(filePath, inferSchema = True, header = True)

And congrats, we are done. You can use the display command to have a sneak peek at the data, as shown below.
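
For example (display is the built-in Databricks notebook helper; the dataframe name matches the snippet above):

# Preview the first rows of the dataframe in the notebook's interactive grid
display(salesDf)

# Inspect the schema that was inferred from the csv header and data
salesDf.printSchema()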

Canvas App Performance Quick Wins

Canvas Apps are relatively easy to implement once you become familiar with their overall structure and their light (but many) formulas. What you don’t want is to make your boss love them and then hate them when they turn out to be too slow in production. Let’s go over some very simple ways to make your current and next Canvas App faster, more responsive and, most importantly, make your users happy. Keep in mind that performance is a huge subject and this is just the tip of the iceberg.

Caching

If you have a region list in your CDS and you want to populate it in a drop down on a few screens in your app, you may use the Filter formula like this:
Filter(RegionList, Status="Active")

and set this as the data source for your drop down. This is all good until you need the drop down on many screens, which causes multiple network calls to CDS. That means longer load times and more requests to CDS, which now count against your request limits.

A quick fix is to load all of the data that doesn’t change much (regions, departments, etc.) once, at the start of the app, and store it in collections. For regions, you can write something like this in the OnStart of the app:

Collect(collectRegionList, Filter(RegionList, Status = "Active"))

Now, point your drop downs to the collectRegionList collection and the data is no longer loaded multiple times.

Concurrency

If you need to load regions, departments, account types and other data at the start of the app, you can write something like this in the app OnStart:

Collect(collectRegionList, Filter(RegionList, Status = "Active"));
Collect(collectDepartmentList, Filter(DepartmentList, Status = "Active"));
Collect(collectAccountTypesList, Filter(AccountTypesList, Status = "Active"))

This caches your data, which is good, but the load time of your app will be long and the app will sit on the loading screen for a while. Say each statement takes 2 seconds to complete: you end up waiting about 6 seconds for the 3 statements to execute. A better way is to use the super easy Canvas App concurrency capability, which will take only a bit more than 2 seconds to execute.

Concurrent(
    Collect(collectRegionList, Filter(RegionList, Status = "Active")),
    Collect(collectDepartmentList, Filter(DepartmentList, Status = "Active")),
    Collect(collectAccountTypesList, Filter(AccountTypesList, Status = "Active"))
)

Delegation

This is an important concept. When a canvas app executes a function against a data source, who does the processing of the records? It depends on the function and on the data source. For example, if you use the Filter function with a SharePoint data source and ask for all active records in a set of 10,000 records, the Canvas App sends the condition to SharePoint and SharePoint returns only the records that are active. The Canvas App does not loop through all the records to check which ones are active. If your users are on a mobile network, this makes a huge difference in performance. In this case, we say the Filter function is delegable with SharePoint data sources.

On the other hand, other functions are not delegable. Take Search, for example, with the same SharePoint data source: when you search for something, the canvas app pulls the records down from SharePoint (up to the data row limit) and searches each record locally. Here we say that the Search function is not delegable with SharePoint data sources.

So, when you choose which function to use, also consider which data source you are using, as both determine whether your functions are delegable. Below is a summary of common functions and popular data sources with information about their delegability.

https://docs.microsoft.com/en-us/learn/modules/work-with-data-source-limits-powerapps-canvas-app/2-functions-predicates-data-source

Refresh? No thanks.

I used to do this a lot back in the Windows Millennium days (a long time ago). While Windows was dying trying to do something for me, I would hit refresh on my desktop over and over; I think I was helping it die faster. Fast forward to now: don’t use Refresh unless you absolutely need it. In many cases Canvas Apps will do the refresh for you, and if that’s the case, don’t double the work by calling the Refresh function again.

Use Power Platform with On-Premise SQL Databases

The Power Platform is all built around data. Luckily, this data can reside anywhere, thanks to the Connections feature provided by the platform. If you want your apps to interact with data from a local database hosted on a server somewhere, this is how you do it. It has a lot of potential, from master data management to regulatory compliance and more. I will go through a simple example, starting from creating the database all the way to building a Canvas App around the data.

Step 1: Create the database and a table (if you don’t have one already). On your server, create a database or use an existing one. In this case, I created a DB called Master Accounts.

To simulate CRUD operations later, I created a table called Accounts in this database using the following script. Make sure to specify a primary key, or the Canvas App will be read only, without the ability to add or delete records.

Step 2: Create a user that can access this database. This is achieved by creating a login under Security on your server and mapping that login to the created database.

Step 3: To give the Power Platform the ability to interact with your local database, you need to install the On-premises data gateway. Notice that this gateway can be configured to work with all Power Platform apps or only with Power BI; of course, we need the former option. After you are done, the gateway interface will look like the following image.

Step 4: Sign in with your Power Platform admin account:

When done, this is what you should get:

Step 5: Connect to the database from the Power Platform. Visit make.powerapps.com and, on the left pane, select Data and then Connections. Select New connection from the command bar.

In the new Connection wizard, select the type of connection to be SQL Server and you should see a window similar to the following:

Of course, you can authenticate in different ways, but I chose the SQL Server Authentication mechanism. Plug in the values we created in the first two steps and make sure the correct gateway is selected.

If all is well, you should see a new connection in the list, mine looks like this:

Now let’s go and create a canvas app from data and see the magic!

Step 6: Create a Canvas App starting from SQL Server Data.

Select the “Accounts” table we created before and hit Connect:

And now you should end up with a Canvas App that can perform CRUD operations on the local database. It will require some redesign though :)

The nice thing is that this connection is a Power Platform Connection now so you can use it with Flow!

This capability opens a lot of doors for organizations that are hesitant to move their data to the cloud, so go experiment with it and use it!