Your Fully Equipped and Most Simple Machine Learning Regression Template

In this blog, I would like to go through a full machine learning project using regression modeling and following common-sense data cleaning and processing. This project should cover most of the checkpoints for any real-world machine learning project, using some of the most powerful and commonly used regression algorithms. I will be covering:

1- The process of data exploration and why it is not just a fancy thing to do to impress business users.
2- How to divide your numbers (numeric variables) from your letters (categorical variables) and then bring them back together.
3- The ultimate painful experience of missing values and how to cheat on them.
4- Letters, oh god I hate letters, show me the numbers from these categories.
5- Let’s bring our best gladiators (support vector machines, random forests, gradient boosting algorithm …etc) and see what they can learn.
6- Which algorithm was best at learning? Are they consistent in their learning? Can I simply visualize the performance for each model and let my eye do the comparison because my brain is busy with other things?

Data Ingestion and Exploration

First thing we need to do, of course, is to load/read our data. For the purpose of this blog, I decided to use one of the most popular datasets for regression purposes: the house pricing dataset. If you want to follow along, feel free to download the file. This dataset contains several numerical and categorical predictors to help predict the sale price of a house. So, the objective here is to use the sale prices of historical houses to come up with rules that map our given predictors to the sale price.

Once the file is downloaded to your local computer, you can read it and show the first few rows using the following code, of course, after loading all the required libraries:

""" Import required libraries Main Libraries"""
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyodbc
import os
import sklearn
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from scipy import stats
import warnings
pd.set_option('display.max_columns', None)
%matplotlib inline

# Basic Statistics Packages
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# ML Libraries
from mlxtend.regressor import StackingCVRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, scale, StandardScaler,RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, ElasticNetCV, BayesianRidge, Lasso
from sklearn.svm import SVR
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

# Read the training file and show the first few rows
house = pd.read_csv("C:\\train.csv")
house.head()

After having a first look at the data, it is always best to split your data into numeric and categorical variables. The reason for doing this is that categorical variables mostly need to be handled differently than numeric variables for ML purposes. This is particularly important 1) when we impute missing values, 2) when we do some basic data transformation like applying a one hot encoder on categorical variables and 3) during the exploration phase, where numeric variables use different visuals than categorical variables.

# In addition to splitting variables into numeric and categorical, we need to remove
# the response variable and the id variable from the list
response_var = 'SalePrice'
id_var = 'Id'
num_vars = [col for col in house.columns
            if house[col].dtype in ['int64', 'float64'] and col not in [response_var, id_var]]
cat_vars = [col for col in house.columns if house[col].dtype in ['object']]

Okay, looks good. We have a better idea about what our numeric predictors and categorical predictors are from a 30,000 foot view. Now, let’s have a closer look at our most important variable, our response variable, the SalePrice variable. Let’s check the distribution of this variable.

Shows the distribution for the sales price response variable.

From the plot above we can see that the variable ‘SalePrice’ is positively skewed (a few houses have a very high SalePrice in comparison to the other houses), but not strongly so, which is a good thing. Do we have to treat this variable? Not necessarily, but it is highly recommended. Why? Simply because several of the algorithms that we will be using here, the linear ones in particular, tend to work better when the response variable is roughly normally distributed. The most obvious way to treat a skewed variable is to get rid of the observations (or houses) that have a very high ‘SalePrice’. But other, more commonly used approaches can be applied to positively skewed variables, like a log transformation or taking the square root of the variable. So, let’s pick the log transformation for now before we delve any deeper into our analysis.

house[response_var] = np.log1p(house[response_var])
Log-transformed SalePrice. It looks way nicer now, with more normally distributed values than the absolute values (values before log transformation).
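One practical consequence of this step: the models will now predict log(1 + price), so predictions have to be mapped back with `np.expm1` before you report them in dollars. Here is a minimal sketch of the round trip and of the drop in skewness. The prices are made-up values for illustration, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed sale prices (illustrative values only)
prices = pd.Series([100000.0, 120000.0, 135000.0, 150000.0,
                    180000.0, 250000.0, 755000.0])

# log1p computes log(1 + x); expm1 is its inverse
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

# Skewness before and after the transformation
skew_before = prices.skew()
skew_after = log_prices.skew()
print('skewness before: {:.2f}, after: {:.2f}'.format(skew_before, skew_after))
```

The same `np.expm1` call applies to any prediction made by the models fitted later in this post.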

Awesome, now that we have a fairly normally distributed response variable, let’s have a look at how this variable is correlated with the other numerical predictors. The best visual for the relationship between two numeric variables is the scatter plot (yes, Mr. Obvious is here). Scatter plots can help tremendously with four main questions:
1) Which predictors are most strongly correlated with our response variable?
2) Which predictors don’t show any variability in their values?
3) What does the relationship between each predictor and the response variable look like: linear or non-linear?
4) Are there any very obvious outliers in any of these predictor variables?
Without further ado, let’s plot our data and check for these four points.

# Define a figure with a grid of subplots that is 3 plots wide
n_cols = 3
n_rows = int(np.ceil(len(num_vars) / n_cols))
fig = plt.figure(figsize=(12, 4 * n_rows))
# Iterate through the numeric columns from the data frame house
for i, feature in enumerate(num_vars, 1):
    # Define the location for the plot of each variable using 3 coordinates:
    # number of rows, number of columns, location of the plot on the grid
    plt.subplot(n_rows, n_cols, i)
    # do the actual plot for the given feature
    sns.scatterplot(x=feature, y=response_var, data=house)
    # y and x labels
    plt.xlabel(feature, size=15, labelpad=12.5)
    plt.ylabel(response_var, size=15, labelpad=12.5)
    plt.tick_params(axis='both', labelsize=12)
plt.tight_layout()

Okay, this results in a fairly large picture, so I will cut and paste the output in multiple images.

Okay, there is a hell of a lot to talk about here. As I mentioned, scatter plots never fail you in how important and revealing they are. A very powerful visualization indeed.

This blog is not intended to show you all the different fancy ways to detect outliers. Here I am following the simplest possible approach there is: I simply look at the scatter plots and see where observations don’t make much sense visually. Of course, these outliers can be important and can occur regularly in your population, but based on the sample we have here, the points I have highlighted, using my fancy yellow circles, seem to be outliers that we can live without. I wasn’t conservative in my selection, so as a first order of business I will get rid of all the houses highlighted in the yellow circles above.

Based on what we have highlighted in yellow, we need to get rid of houses with the following values:
– LotFrontage is more than 150
– LotArea is more than 20000
– MasVnrArea is more than 1000
– BsmtFinSF1 is more than 3000
– BsmtFinSF2 is more than 1200
– TotalBsmtSF is more than 4000
– 1stFlrSF is more than 3000
– 2ndFlrSF is more than 1750
– GrLivArea is more than 4000
– BedroomAbvGr is more than 7
– TotRmsAbvGrd is more than 12
– GarageArea is more than 1200
– WoodDeckSF is more than 750
– OpenPorchSF is more than 400
– EnclosedPorch is more than 500
Of course, depending on your project and your data, you might need to be more or less conservative.

# Remove the outliers, as we decided above.
# Note: a comparison with NaN evaluates to False, so rows where any of these
# columns is missing (for example LotFrontage) are dropped by these filters as well.
outlier_limits = {
    'LotFrontage': 150, 'LotArea': 20000, 'MasVnrArea': 1000,
    'BsmtFinSF1': 3000, 'BsmtFinSF2': 1200, 'TotalBsmtSF': 4000,
    '1stFlrSF': 3000, '2ndFlrSF': 1750, 'GrLivArea': 4000,
    'BedroomAbvGr': 7, 'TotRmsAbvGrd': 12, 'GarageArea': 1200,
    'WoodDeckSF': 750, 'OpenPorchSF': 400, 'EnclosedPorch': 500,
}
print('data frame dimensions before removing outliers is {}'.format(house.shape))
for col, limit in outlier_limits.items():
    # Keep only the houses at or below the limit for this column
    house = house[house[col] <= limit]
print('data frame dimensions after removing outliers is {}'.format(house.shape))

Great, now that the outliers have been removed, we can see the relationships between our predictor variables and the response variable more clearly, which helps us define visually which variables are most important to us. For example, below I want to show the plot of ‘LotArea’ vs ‘SalePrice’ before and after the outliers were removed. You can see how, visually, the relationship became way clearer. And trust me on this, ML algorithms pick up patterns in a way fairly similar to how we ourselves do, so by removing these outliers we first help our minds see what is important, and we also make it easier for the algorithm to define the relationship.

Separate the wheat from the chaff
Now that we have removed the most obvious outliers, let’s look at the correlation between the different predictors and the ‘SalePrice’, so we can keep the important variables in mind while we continue with this project. There are two approaches here. The manual approach is to look at the scatter plots from above and try to decipher the correlation visually. The other, more formal way is to build some sort of correlation plot that caters specifically to our desire to understand correlation. By looking at the scatter plots that we generated after removing the outliers, we can conclude the following:
– LotArea is highly correlated with the SalePrice.
– LotFrontage is highly correlated with the SalePrice.
– OverallQual is highly correlated with SalePrice.
– TotalBsmtSF is highly correlated with SalePrice.
– 1stFlrSF is highly correlated with SalePrice.
– GrLivArea is highly correlated with SalePrice.
– Many other variables are highly correlated with SalePrice.
– BsmtFinSF2 doesn’t show much variability. We don’t lose much by getting rid of it.
– LowQualFinSF doesn’t show much variability. We don’t lose much by getting rid of it.
– PoolArea, ScreenPorch, 3SsnPorch and MiscVal don’t show much variability. We wouldn’t lose much by getting rid of them.

In addition to seeing which variables are correlated and which are not, we can see that many variables are linearly correlated with the SalePrice variable. This indicates that a linear model might be able to do a very good job predicting the value of the SalePrice. We shall see later if this is a valid observation.

# Remove the unwanted variables from the list of numeric variables
remove_vars = ['BsmtFinSF2','LowQualFinSF','PoolArea','ScreenPorch','3SsnPorch','MiscVal']
num_vars = [item for item in num_vars if item not in remove_vars]
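The more formal approach mentioned above, ranking predictors by their correlation with the response, could look like the sketch below. To keep it self-contained it runs on a small synthetic frame: the column names mirror the dataset, but the values are randomly generated and the strength of the relationship is something I made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the housing data (values are made up)
toy = pd.DataFrame({
    'GrLivArea': rng.normal(1500, 400, n),   # built to drive the response
    'PoolArea': rng.normal(0, 1, n),         # pure noise, unrelated to the response
})
toy['SalePrice'] = 0.001 * toy['GrLivArea'] + rng.normal(0, 0.1, n)

# Pearson correlation of every predictor with the response
corr = toy.corr()['SalePrice'].drop('SalePrice')

# Order the predictors by absolute correlation, strongest first
order = corr.abs().sort_values(ascending=False).index
corr_with_response = corr.loc[order]
print(corr_with_response)
```

On the real data you would replace `toy` with `house[num_vars + [response_var]]`, and `sns.heatmap(toy.corr())` gives the usual correlation plot of the same matrix.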

Missing Values

Even though different ML algorithms differ, sometimes significantly, in how they map their inputs to outputs, they all seem to agree on one thing: they all hate missing values. Again, this article is not about discussing the latest and greatest strategies to deal with missing values, but rather about showing that dealing with missing values is an essential step in the ML process. That being said, in the projects I have worked on, I have never seen a better strategy for imputing missing values than common sense and awareness of the data itself and of the problem at hand. For instance, sometimes a missing value might simply mean a zero for a particular variable. Knowing something like this is way more powerful and way easier than implementing an algorithmic approach to impute those missing values. And yes, it is way better than implementing a KNN approach to estimate them.
Okay, let’s see how bad the problem is. Let’s first run a script to find the variables with missing values and the percentage of missing values in each of them.

missing_dict = {}
for var in house.columns:
    percent_missing = (house[var].isnull().sum()/ house.shape[0]) * 100
    if percent_missing > 0:
        missing_dict[var] = percent_missing
        print('variable: {} , missing_values: {}'.format(var, percent_missing))

Let’s visualize missing values per variable

# Sort the dictionary from smallest to biggest value
missing_dict = {k: v for k, v in sorted(missing_dict.items(), key=lambda item: item[1])}
plt.figure(figsize=(12,8))
plt.bar(range(len(missing_dict)), list(missing_dict.values()), align='center')
plt.xticks(range(len(missing_dict)), list(missing_dict.keys()), rotation = 70)

Great, now we know that we have a total of 16 variables with missing values, and only 5 of them have a very high percentage of missing values, more than 50% of the whole sample. The rest of the variables have less than 10% of their values missing. So, what do we need to do here?
For this problem, we can check the documentation and see if the missing values mean anything. We first see if we can work out the missing values for the variables that have a big percentage of their values missing. If we can’t do that, then we might sadly need to get rid of these variables. We follow the same approach for the variables with a low percentage of missing values, but if reading about these variables doesn’t tell us what the missing values are, then we need to use a method that finds a proxy for each missing value.

# Data description states that missing values for the garage represent no garage (we replace them with None)
for var in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    house[var] = house[var].fillna('None')
house['GarageYrBlt'] = house['GarageYrBlt'].fillna(0)
# Data description states that missing values for the basement represent no basement (we replace them with None)
for var in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    house[var] = house[var].fillna('None')

# Other variables where a missing value indicates "doesn't exist", which we can replace with None
for var in ('PoolQC', 'Alley','Fence','FireplaceQu', 'Electrical'):
    house[var] = house[var].fillna('None')
# Other variables:
# MiscFeature is a variable to add further information that was not covered by other categories. So, let's just get rid of that
house = house.drop(columns=['MiscFeature'])
cat_vars = [col for col in cat_vars if col != 'MiscFeature']

Okay, great. This should have helped us to get rid of all missing values. You can verify that no more missing values are there by running the following script again

for var in house.columns:
    percent_missing = (house[var].isnull().sum()/ house.shape[0]) * 100
    if percent_missing > 0:
        print('variable: {} , missing_values: {}'.format(var, percent_missing))
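If the check above still reports variables whose missing values have no documented meaning, a simple proxy is the usual fallback. The sketch below fills numeric columns with the median and categorical columns with the most frequent value. This is a generic approach, not something the data description prescribes, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values only)
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0, np.nan],
    'Electrical': ['SBrkr', 'SBrkr', None, 'FuseA', 'SBrkr'],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric column: use the median as a proxy, it is robust to outliers
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        # Categorical column: use the most frequent category as a proxy
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

To avoid leaking information from the test set, the medians and modes should ideally be computed on the training split only.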

Feature Engineering

Awesome. We did most of the daunting tasks that are a must for cleaning and improving the quality of our dataset. Most of what we did was just to meet the most basic requirements for an ML algorithm to work correctly and smoothly. Now that all of that is behind you, let’s unleash the artist that lives inside (please don’t unleash your singing or drawing skills, keep those leashed for now). Feature engineering, in its simplest form, is the process of updating existing features or deriving new features from existing ones in the hope of building smarter rules that map features to our response variable. For example, sometimes two independent variables don’t show a strong correlation with the response variable on their own, but if we multiply them, the resulting variable might show a very strong linear or non-linear correlation. This step of the ML process is usually 20% data scientist and 80% subject matter expert (SME). That’s to say, to excel in feature engineering you need to know the data very well, know the problem at hand, and have a solid understanding of your variables, how they are measured and what exactly they mean from a business perspective. For our problem here, we have a pretty strong sense of what the different variables mean and of the house sale problem. But if you are working on a harder business problem, in an engineering field for example, then you need to communicate a lot with your SME in this step. Explain to them the power of feature engineering and let them pour all the information they have in their heads right into your model. Never underestimate the importance of this step.
For the purpose of this article, I want to reference the ideas discussed in a very good read at Kaggle. You can also find many interesting notebooks there for more specific details on this dataset.

# Work on copies so that adding columns does not modify a view of house;
# the response and id columns must not end up among the predictors
house_num = house[[col for col in num_vars if col not in (response_var, id_var)]].copy()
house_cat = house[cat_vars].copy()
house_y = house[response_var]

# Flags for what a house has (1 = present, 0 = absent)
house_num['HasWoodDeck'] = (house_num['WoodDeckSF'] > 0) * 1
house_num['HasOpenPorch'] = (house_num['OpenPorchSF'] > 0) * 1
house_num['HasEnclosedPorch'] = (house_num['EnclosedPorch'] > 0) * 1
house_num['has2ndfloor'] = house_num['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
house_num['hasgarage'] = house_num['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
house_num['hasbsmt'] = house_num['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
house_num['hasfireplace'] = house_num['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

# Combined features
house_num['Total_Home_Quality'] = house_num['OverallQual'] + house_num['OverallCond']
house_num['TotalSF'] = house_num['TotalBsmtSF'] + house_num['1stFlrSF'] + house_num['2ndFlrSF']
house_num['Total_sqr_footage'] = (house_num['BsmtFinSF1']  + house_num['1stFlrSF'] + house_num['2ndFlrSF'])
house_num['Total_Bathrooms'] = (house_num['FullBath'] + (0.5 * house_num['HalfBath']) +
                               house_num['BsmtFullBath'] + (0.5 * house_num['BsmtHalfBath']))

# Replace zeros with a fixed positive value. This runs after the flags above,
# otherwise every flag derived from these columns would always be 1
house_num['TotalBsmtSF'] = house_num['TotalBsmtSF'].apply(lambda x: np.exp(6) if x <= 0.0 else x)
house_num['2ndFlrSF'] = house_num['2ndFlrSF'].apply(lambda x: np.exp(6.5) if x <= 0.0 else x)
house_num['GarageArea'] = house_num['GarageArea'].apply(lambda x: np.exp(6) if x <= 0.0 else x)
house_num['GarageCars'] = house_num['GarageCars'].apply(lambda x: 0 if x <= 0.0 else x)
house_num['LotFrontage'] = house_num['LotFrontage'].apply(lambda x: np.exp(4.2) if x <= 0.0 else x)
house_num['MasVnrArea'] = house_num['MasVnrArea'].apply(lambda x: np.exp(4) if x <= 0.0 else x)
house_num['BsmtFinSF1'] = house_num['BsmtFinSF1'].apply(lambda x: np.exp(6.5) if x <= 0.0 else x)

In addition to the transformations done above, you may also want to experiment with applying additional functions on your features; for example you can do log transformations, square root, square and cubic functions on your most important features at least.
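As a rough illustration of these extra transformations, and of the "multiply two variables" idea from earlier in this section, here is a small sketch. The new column names and the toy values are hypothetical; which features deserve this treatment is up to you and your SME.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values; column names mirror the dataset)
toy = pd.DataFrame({
    'GrLivArea': [900.0, 1500.0, 2400.0],
    'OverallQual': [4, 6, 9],
})

# Simple function transformations of an important feature
toy['GrLivArea_log'] = np.log1p(toy['GrLivArea'])
toy['GrLivArea_sqrt'] = np.sqrt(toy['GrLivArea'])
toy['GrLivArea_sq'] = toy['GrLivArea'] ** 2

# Interaction feature: the product of two existing features
toy['Qual_x_Area'] = toy['OverallQual'] * toy['GrLivArea']

print(toy)
```

Tree-based models are mostly indifferent to monotonic transformations like log or square root, so these tend to matter more for the linear models and the SVR.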

From Letters to Numbers

Great. So far we have done a great job, through hopefully easy and understandable steps, of getting to a very advanced stage of our ML project life-cycle. But as you may have noticed, we focused mostly on our numeric variables. This is simply because of the flexibility of the transformations that we can do on numeric features. Even our ML algorithms prefer to work with numbers, so don’t feel guilty about it. For that purpose, we need to pivot or spread our categorical variables to move the information from the row format to the column format. In other words, if feature X contains the values YES/NO, then we need to create a column called X_YES that has a value of 1 for rows that have a YES value and 0 for rows that have a NO value. The only thing that we need to watch for here is the number of unique values these categorical features have. For instance, if a variable has 100 unique values, then we will end up with 100 extra columns. This is a very expensive addition to the dataset that doesn’t necessarily add much value. So, the first order of business here is to check the number of unique values: if it is large then we use a label encoder, otherwise we use a one hot encoder. A label encoder simply changes the categories from letters into numbers, keeping one column in the output (keep in mind that this imposes an arbitrary order on the categories). A one hot encoder creates a new column for each category, as we discussed above. Let’s visualize how many unique values each variable has using this code:

# Define a figure with a grid of subplots that is 5 plots wide
n_cols = 5
n_rows = int(np.ceil(len(cat_vars) / n_cols))
fig = plt.figure(figsize=(20, 4 * n_rows))
# Iterate through the categorical columns from the data frame house_cat
for i, var in enumerate(cat_vars, 1):
    # Define the location for the plot of each variable using 3 coordinates:
    # number of rows, number of columns, location of the plot on the grid
    plt.subplot(n_rows, n_cols, i)
    value_counts = house_cat[var].value_counts()
    df_counts = value_counts.rename_axis('unique_values').reset_index(name='counts')
    # do the actual plot for the given feature
    sns.barplot(x='unique_values', y='counts', data=df_counts)
    plt.title(var, size=12)
plt.tight_layout()

Great, I guess we are really lucky. None of the variables shows an extremely large number of unique values. This makes the decision really easy for us: we need to create a boolean (1/0) column for each unique value of each variable. This can be accomplished either with the OneHotEncoder class from scikit-learn or with the get_dummies function from pandas. For simplicity, I will go ahead and use the get_dummies function.

house_cat = pd.get_dummies(house_cat)
house_x = pd.concat([house_cat, house_num], axis = 1 )
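To make the difference between the two encodings concrete, here is a tiny sketch on a toy frame. The column names mirror the dataset but the rows are made up. `get_dummies` produces one 0/1 column per category, while `LabelEncoder` keeps a single column of integer codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical frame (illustrative values only)
toy = pd.DataFrame({
    'CentralAir': ['Y', 'N', 'Y'],
    'Street': ['Pave', 'Grvl', 'Pave'],
})

# One hot encoding: one 0/1 column per (variable, category) pair
one_hot = pd.get_dummies(toy, dtype=int)

# Label encoding: one integer code per category, still one column per variable
label_encoded = pd.DataFrame(
    {col: LabelEncoder().fit_transform(toy[col]) for col in toy.columns}
)

print(one_hot)
print(label_encoded)
```

The integer codes follow the alphabetical order of the categories, which is why label encoding can suggest an ordering that does not exist in the data.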

Getting to the Cool Stuff: The Gladiators at Work

By now, we have at our disposal a clean dataset with all the previous steps applied. Now it is time to put the machine learning algorithms to work and see how they perform in predicting the ‘SalePrice’ response variable.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(house_x, house_y, test_size = 0.2, random_state = 0)

Before we do any prediction, we need to define our prediction error functions. That’s to say, for any prediction that we get, we need to know how far our prediction is from the actual value. This will give us an idea of how good our model is. For the purpose of our experiment here, we need two error functions. The first one calculates the error between predicted values and actual values for any given model. The second one calculates the error using cross validation. 10-fold cross validation, for example, splits the training data into 10 different folds. In the first round, 1 fold is used for validation and the remaining 9 folds are used to train the model, and we estimate the prediction error for this round. In the second round, we pick the second fold for validation, train the model on the remaining 9 folds and calculate the prediction error again. We repeat this experiment ten times and therefore get 10 different prediction error measures. This gives us a more comprehensive estimate of the prediction error, and we can calculate its mean and standard deviation. Check out the image below with the provided link to better understand the cross validation process.
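The two error functions described above can be sketched as follows. The name `cv_rmse` and its `model`, `X`, `Y` arguments match how the function is called later in this post, and `kf` is the fold object the ridge model refers to. The name `rmse`, the fold settings (shuffling and the seed) and the demo data are my assumptions, so treat this as one reasonable way to write them rather than the only one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# 10-fold cross validation (shuffle and random_state are assumptions)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

def cv_rmse(model, X, Y, cv=kf):
    """One RMSE value per fold, estimated with cross validation."""
    # scikit-learn returns the negated MSE so that higher is always better
    neg_mse = cross_val_score(model, X, Y, scoring='neg_mean_squared_error', cv=cv)
    return np.sqrt(-neg_mse)

# Demo on synthetic data, only to show the shape of the output
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
scores = cv_rmse(model=Ridge(alpha=1.0), X=X_demo, Y=y_demo)
print('mean: {:.3f}, std: {:.3f}'.format(scores.mean(), scores.std()))
```

Because the response was log-transformed earlier, the RMSE values you get on the housing data are in log units, not in dollars.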


Let’s define our models for now and after defining them, let’s fit them using our training dataset

# Setup all models
# Only the hyperparameters shown are set; everything else uses the library defaults

# 10-fold cross validation, as described above (shuffling and the seed are assumptions)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Light Gradient Boosting Regressor
lightgbm = LGBMRegressor(objective='regression',
                         min_sum_hessian_in_leaf=11)

# XGBoost Regressor
xgboost = XGBRegressor(learning_rate=0.01)

# Ridge Regressor
ridge_alphas = [1e-15, 1e-10, 1e-8, 9e-4, 7e-4, 5e-4, 3e-4, 1e-4, 1e-3, 5e-2, 1e-2, 0.1, 0.3, 1, 3, 5, 10, 15, 18, 20, 30, 50, 75, 100]
ridge = make_pipeline(RobustScaler(), RidgeCV(alphas=ridge_alphas, cv=kf))

# Support Vector Regressor
svr = make_pipeline(RobustScaler(), SVR(C=20, epsilon=0.008, gamma=0.0003))

# Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=6000)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=1200)

# Stack up all the models above, optimized using xgboost as the meta-regressor
stack_gen = StackingCVRegressor(regressors=(lightgbm, svr, ridge, gbr, rf),
                                meta_regressor=xgboost)

# Collect the models by name for the cross validation comparison
models = []
models.append(('lightgbm', lightgbm))
models.append(('xgboost', xgboost))
models.append(('ridge', ridge))
models.append(('svr', svr))
models.append(('gbr', gbr))
models.append(('rf', rf))
models.append(('stack_gen', stack_gen))
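# Fit the models
stack_gen_model_fit = stack_gen.fit(np.array(X_train), np.array(y_train))

lgb_model_fit = lightgbm.fit(X_train, y_train)

svr_model_fit = svr.fit(X_train, y_train)

ridge_model_fit = ridge.fit(X_train, y_train)

rf_model_fit = rf.fit(X_train, y_train)

gbr_model_fit = gbr.fit(X_train, y_train)

xgb_model_fit = xgboost.fit(X_train, y_train)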

So the first thing we need to do is to visualize our cross validation results. Let’s get the cross validation results and plot them in the same code snippet:

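## Calculate and plot the cross validation error rate
cv_results = []
cv_names = []
for name, model in models:
    cv_scores = cv_rmse(model = model, X = X_train, Y = y_train)
    # Keep the per-fold scores and the model name for the plot
    cv_results.append(cv_scores)
    cv_names.append(name)
    msg = "%s: %f (%f)" % (name, cv_scores.mean(), cv_scores.std())
    print(msg)

# One box per model, showing the spread of the error across folds
fig = plt.figure()
fig.suptitle('Algorithm Comparison for House Price Cross Validation Results')
ax = fig.add_subplot(111)
plt.boxplot(cv_results)
ax.set_xticklabels(cv_names)
plt.show()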

Great! We are seeing some really good results. Most algorithms show similar error rates. Surprisingly, the lowest error rate seems to belong to the linear ridge regression. If we were to deploy this to production, I would definitely choose the ridge regression. In addition to its low error in comparison to the other algorithms, ridge regression has the most powerful advantage over all of them, which is the ease of interpretability. That’s to say, it is really easy to understand the magnitude of each feature’s contribution to the final predicted score for SalePrice. This enables us to go beyond the predicted value and understand which features contribute negatively or positively to this score. A very powerful tool indeed in the field of machine learning. But interpretability or explainability is out of the scope of this article, and therefore I won’t delve much into that concept here.

Very sweet. Now that we have seen that our cross validation results are reasonable, let’s make predictions using our fitted models and see how the predicted values compare to the actual values.

## Make prediction on the training data for each model
y_train_stack = stack_gen_model_fit.predict(X_train)
y_train_lgb = lgb_model_fit.predict(X_train)
y_train_svr = svr_model_fit.predict(X_train)
y_train_ridge = ridge_model_fit.predict(X_train)
y_train_rf = rf_model_fit.predict(X_train)
y_train_gbr = gbr_model_fit.predict(X_train)
y_train_xgb = xgb_model_fit.predict(X_train)

## Make prediction on the test data for each model
y_test_stack = stack_gen_model_fit.predict(X_test)
y_test_lgb = lgb_model_fit.predict(X_test)
y_test_svr = svr_model_fit.predict(X_test)
y_test_ridge = ridge_model_fit.predict(X_test)
y_test_rf = rf_model_fit.predict(X_test)
y_test_gbr = gbr_model_fit.predict(X_test)
y_test_xgb = xgb_model_fit.predict(X_test)
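Before plotting, it can help to put the training and test errors side by side, since a large gap between the two is a quick sign of overfitting. The sketch below does this for two hypothetical models using made-up predictions. The names `model_a` and `model_b` and all the numbers are placeholders, and with the real objects you would plug in pairs like `y_train_ridge` and `y_test_ridge`.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up actual values on the log scale (illustrative only)
y_train_demo = np.array([12.0, 11.5, 12.5, 11.0])
y_test_demo = np.array([12.2, 11.8])

# Made-up predictions: model_a memorizes the training data, model_b does not
predictions = {
    'model_a': (np.array([12.0, 11.5, 12.5, 11.0]), np.array([12.7, 11.3])),
    'model_b': (np.array([12.1, 11.4, 12.6, 10.9]), np.array([12.3, 11.7])),
}

rows = []
for name, (train_pred, test_pred) in predictions.items():
    rows.append((name, rmse(y_train_demo, train_pred), rmse(y_test_demo, test_pred)))

summary = pd.DataFrame(rows, columns=['model', 'train_rmse', 'test_rmse']).set_index('model')
# A large positive gap means the model does much worse on unseen data
summary['gap'] = summary['test_rmse'] - summary['train_rmse']
print(summary)
```

In this toy example `model_a` has a perfect training score but the worse test score, which is exactly the pattern to watch for.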

Now that we got our predicted values from all models let’s visualize our actual values versus predicted values across different models

# Ridge regression: predicted vs actual values
plt.figure()
plt.scatter(y_train_ridge, y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_ridge, y_test, c = "lightgreen", marker = "s", label = "Testing data")
plt.title("Ridge Regression For Sale Price")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc = "upper left")
# Red diagonal marks a perfect prediction
plt.plot([10.5, 13], [10.5, 13], c = "red")
plt.show()

# Light gradient boosting: predicted vs actual values, on its own figure
plt.figure()
plt.scatter(y_train_lgb, y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_lgb, y_test, c = "lightgreen", marker = "s", label = "Testing data")
plt.title("Light Gradient Boosting Regression For Sale Price")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc = "upper left")
plt.plot([10.5, 13], [10.5, 13], c = "red")
plt.show()


If you made it this far, then I have to say congratulations! You have built a fairly sophisticated model to predict house prices without using any of the advanced techniques used in machine learning. This article was meant to show you that by applying common sense to your data, you can achieve fairly high accuracy with your model. That’s to say, we achieved some fairly good results without implementing any hyperparameter tuning, pipelines or advanced feature selection. This is not to underestimate the value of these steps, but rather to show that good results can still be achieved by following the steps discussed in this article. I will be blogging about these topics in my future posts.


In addition to the links above, some of the ideas and code snippets used in this article were adopted from the following great articles and notebooks:

Read a CSV file stored in blob container using python in DataBricks

Let’s say that you have a csv file, a blob container and access to a DataBricks workspace. The purpose of this mini blog is to show how easy the process is from having a file on your local computer to reading the data into DataBricks. I will go through the process of uploading the csv file manually to an Azure blob container and then reading it in DataBricks using python code.

Step 1: Upload the file to your blob container

This can be done simply by navigating to your blob container. From there, you can click the upload button and select the file you are interested in. Once it is selected, you need to click the upload button in the upload blade. See screenshot below.

Once uploaded, you will be able to see the file available in your blob container as shown below:

Step 2: Get credentials necessary for databricks to connect to your blob container

From your Azure portal, navigate to all resources, then select your blob storage account and, under the settings, select account keys. Once there, copy the key under Key1 to a local notepad.

Step 3: Configure DataBricks to read the file

Here, you need to navigate to your DataBricks workspace (create one if you don’t have one already) and launch it. Once launched, go to workspace and create a new python notebook.

To start reading the data, first, you need to configure your spark session to use credentials for your blob container. This can simply be done through the spark.conf.set command. More precisely, we start with the following

storage_account_name = 'nameofyourstorageaccount' 
storage_account_access_key = 'thekeyfortheblobcontainer'
spark.conf.set('' + storage_account_name + '', storage_account_access_key)

Once done, we need to build the file path in the blob container and read the file as a spark dataframe.

blob_container = 'yourblobcontainername'
filePath = "wasbs://" + blob_container + "@" + storage_account_name + ""
salesDf = spark.read.format("csv").load(filePath, inferSchema = True, header = True)
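
For reference, here is a minimal sketch of how the configuration key and the file path are conventionally assembled for Azure Blob Storage. The helper names and the file name sales.csv are hypothetical, and the fs.azure.account.key.<account>.blob.core.windows.net and wasbs://<container>@<account>.blob.core.windows.net/<file> shapes are the usual Azure conventions, assumed here rather than taken from the notebook, so check them against your own workspace.

```python
def blob_conf_key(storage_account_name: str) -> str:
    """Spark configuration key that holds the access key for a storage account."""
    return "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net"


def blob_file_path(blob_container: str, storage_account_name: str, file_name: str) -> str:
    """wasbs:// URL for a file stored in a blob container."""
    return (
        "wasbs://" + blob_container + "@" + storage_account_name
        + ".blob.core.windows.net/" + file_name
    )


# Placeholder values; replace them with your own account, container and file.
print(blob_conf_key("nameofyourstorageaccount"))
print(blob_file_path("yourblobcontainername", "nameofyourstorageaccount", "sales.csv"))
```

In a notebook, the first string would be the first argument of spark.conf.set and the second would be the path passed to load.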

And congrats, we are done. You can use the display command to have a sneak peek at the data as shown below.

Canvas App Performance Quick Wins

Canvas Apps are relatively easy to implement once you become familiar with their overall structure and their light (but many) formulas. What you don’t want is for your boss to love them at first and then hate them when they turn out to be too slow in production. Let’s go over some very simple ways to make your current and next Canvas App faster and more responsive and, most importantly, make your users happy. Keep in mind that performance is a huge subject and this is just the tip of the iceberg.


Say you have a region list in your CDS and you want to show it in a drop down on a few screens in your app. You may use the Filter formula like this:
Filter(RegionList, Status = "Active")

and set this as the data source for your drop down. This is all good until you need this drop down on many screens, each of which makes its own network call to CDS. This means longer load times and more requests to CDS, which are now counted (we have a limit).

A quick fix for this is to load all of the data that doesn’t change much (like regions, departments, etc.) once, at the start of the app, and store it in a collection. For regions, you can write something like this in the OnStart of the app:

Collect(collectRegionList, Filter(RegionList, Status = "Active"))

Now, point your drop downs to collectRegionList, and the data is no longer loaded multiple times.
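
The load-once idea is not specific to Canvas Apps. Here is a minimal Python sketch of it, where CachedSource and fetch_active_regions are hypothetical names and the region data is made up for illustration; it simply counts how many times the "network" is hit when five screens ask for the same list.

```python
class CachedSource:
    """Loads a list once and serves later reads from memory."""

    def __init__(self, fetch):
        self._fetch = fetch        # function that performs the (slow) network call
        self._cache = None
        self.network_calls = 0

    def get(self):
        if self._cache is None:    # only the first read hits the data source
            self._cache = self._fetch()
            self.network_calls += 1
        return self._cache


def fetch_active_regions():
    # Hypothetical stand-in for filtering the region list on Status = "Active"
    regions = [("East", "Active"), ("West", "Inactive"), ("North", "Active")]
    return [name for name, status in regions if status == "Active"]


regions = CachedSource(fetch_active_regions)
for _screen in range(5):           # five screens, five drop downs
    dropdown_items = regions.get()

print(dropdown_items, regions.network_calls)
```

However many screens read the list, the data source is only queried once.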


If you need to load regions, departments, account types and other data at the start of the app, you can write something like this in the app OnStart:

Collect(collectRegionList, Filter(RegionList, Status = "Active")) ; Collect(collectDepartmentList, Filter(DepartmentList, Status = "Active")) ; Collect(collectAccountTypesList, Filter(AccountTypesList, Status = "Active"))

This will cache your data, which is good, but the load time of your app will be long and the app will be loading for some time. Say each statement takes 2 seconds to complete: you end up waiting 6 seconds for the 3 statements to execute. A better way is to use the super easy Canvas App concurrency capability, the Concurrent function, which runs the statements side by side and takes only a bit more than 2 seconds to execute.

Concurrent( Collect(collectRegionList, Filter(RegionList, Status = "Active")), Collect(collectDepartmentList, Filter(DepartmentList, Status = "Active")), Collect(collectAccountTypesList, Filter(AccountTypesList, Status = "Active")) )
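
To see why running the loads side by side helps, here is a small Python sketch of the same arithmetic. It is only an illustration: load is a hypothetical function that sleeps instead of calling a data source, and the delay is scaled down from 2 seconds to 0.2 seconds.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # seconds per simulated call (stand-in for the 2 seconds in the text)


def load(name):
    time.sleep(DELAY)  # pretend to wait on the network
    return name + " loaded"


names = ["regions", "departments", "account types"]

# One after another: the waits add up.
start = time.perf_counter()
sequential = [load(n) for n in names]
sequential_time = time.perf_counter() - start

# Side by side: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    parallel = list(pool.map(load, names))
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, parallel: {parallel_time:.2f}s")
```

The sequential run takes roughly three times the delay, while the parallel run takes a little more than one delay, which is the same saving described above.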


This is an important concept. When a canvas app executes a function on a data source, who does the processing of the records? It depends on the function and the data source. For example, if you use the Filter function with a SharePoint data source and ask for all records that are active in a set of 10000 records, the canvas app sends this condition to SharePoint and SharePoint returns only the records that are active. Basically, the canvas app doesn’t loop through all the records to see which one is active and which one is not. If your users are on a mobile network, this makes a huge difference in performance. In this case, we say the Filter function is delegable with SharePoint data sources.

On the other hand, other functions are not delegable. Take Search with the same SharePoint data source: when you search for something, the canvas app has to pull the records down from SharePoint and search each one locally (and it only pulls up to the app’s data row limit, so results on large lists can be incomplete). Here we say that the Search function is not delegable with SharePoint data sources.
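
The difference can be made concrete with a small Python sketch. DataSource is a hypothetical stand-in for a remote list, and the sketch is simplified: it ships every record in the non-delegated case, whereas a real app caps how many rows it pulls.

```python
RECORDS = [{"id": i, "active": i % 4 == 0} for i in range(10_000)]


class DataSource:
    """Hypothetical stand-in for a remote list such as SharePoint."""

    def __init__(self, records):
        self._records = records
        self.records_sent = 0  # how many records crossed the "network"

    def query(self, predicate=None):
        # With a predicate the source filters before sending (delegation);
        # without one it ships every record to the caller.
        rows = [r for r in self._records if predicate is None or predicate(r)]
        self.records_sent += len(rows)
        return rows


# Delegated: the condition travels to the source.
delegated_source = DataSource(RECORDS)
delegated = delegated_source.query(lambda r: r["active"])

# Not delegated: everything travels to the app, which filters locally.
local_source = DataSource(RECORDS)
local = [r for r in local_source.query() if r["active"]]

print(delegated_source.records_sent, local_source.records_sent)
```

Both approaches end up with the same active records, but the delegated query moves a quarter of the data in this made-up set.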

So when you choose which function to use, also consider which data source you are using, as both determine whether the operation can be delegated. Below is a summary of the common functions and popular data sources with information about their delegability.

Refresh? No thanks.

I used to do this a lot with Windows Millennium (a long time ago). While Windows was dying doing something for me, I was hitting refresh all the time on my desktop; I think I was helping it die faster. Fast forward to now: don’t use refresh unless you absolutely need it. In many cases, Canvas Apps will do the refresh for you, and if that’s the case, don’t double the work by calling the Refresh function again.

Use Power Platform with On-Premise SQL Databases

Power Platform is all built around data. Luckily, this data can reside anywhere, thanks to the Connections feature provided by the platform. If you want your apps to interact with data from a local database hosted on a server somewhere, then this is how you do it. This has a lot of potential, from master data management to regulatory compliance and many more. I will go through a simple example, starting from creating the database and ending with building a Canvas App around the data.

Step 1: Create the database and a table (if you don’t have one already). In your server, create a database or use an existing one. In this case, I created a DB called Master Accounts.

To simulate a CRUD operation later, I created a table called Accounts in this database using the following script. Make sure to specify a Primary Key or Canvas App will make your app read only without the ability to add/delete records.
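
For illustration, here is a minimal sketch of such a table and the four CRUD operations. It uses Python’s built-in sqlite3 module as a stand-in for SQL Server, and the column names and sample values are hypothetical.

```python
import sqlite3

# In-memory stand-in for the Master Accounts database
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Accounts (
        AccountId   INTEGER PRIMARY KEY,  -- the primary key the note above asks for
        AccountName TEXT NOT NULL,
        City        TEXT
    )
    """
)

# Create, update, read, then delete a record
conn.execute("INSERT INTO Accounts (AccountName, City) VALUES (?, ?)", ("Contoso", "Seattle"))
conn.execute("UPDATE Accounts SET City = ? WHERE AccountName = ?", ("Redmond", "Contoso"))
rows = conn.execute("SELECT AccountId, AccountName, City FROM Accounts").fetchall()
print(rows)

conn.execute("DELETE FROM Accounts WHERE AccountId = ?", (rows[0][0],))
remaining = conn.execute("SELECT COUNT(*) FROM Accounts").fetchone()[0]
print(remaining)
```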

Step 2: Create a user that can access this database. This is achieved by creating a “login” under Security in your server and mapping this login to the created database.

Step 3: To give the Power Platform the ability to interact with your local database, you need to install the On-Premises Data Gateway. Notice that this gateway can be configured to work with all the Power Platform apps or only with Power BI; of course, we need the former option. After you are done, the gateway interface will look like the following image.

Step 4: Sign in with your Power Platform Admin Account:

When done, this is what you should get:

Step 5: Connect to the database from Power Platform. Visit and on the left pane, select Data and Connections. Select the New Connection from the command bar.

In the new Connection wizard, select the type of connection to be SQL Server and you should see a window similar to the following:

Of course, you can authenticate in different ways, but I chose the SQL Server Authentication mechanism. Plug in the values we created in the first two steps. Make sure that the correct gateway is selected.

If all is well, you should see a new connection in the list, mine looks like this:

Now let’s go and create a canvas app from data and see the magic!

Step 6: Create a Canvas App starting from SQL Server Data.

Select the “Accounts” table we created before and hit Connect:

And now you should end up with a Canvas App that can perform CRUD operations on the local database. It will require some redesign though :)

The nice thing is that this connection is a Power Platform Connection now so you can use it with Flow!

This capability opens a lot of doors for the organizations that are hesitant to move their data up in the clouds, so go experiment with it and use it!

Power Platform and Change Management

Let’s face it, switching users from their Excel sheets or Access databases to one monolithic Dynamics 365 application can be a hard change management process if you have many users to convince. Sometimes, even upper management can’t force that change, depending on what type of organization it is.

With the new Power Platform capabilities, change management seems to be getting easier and easier because now we have options that we didn’t have before (or that we did have but are now improved). Once the organization decides that this is the platform to go with, here are some options that will make it easier to convince the user base to switch.

The simple approach that can be used right away is the model-driven apps capability of dividing your applications into verticals. If you have one huge application with many entities, create multiple apps that are used by different business units or groups of users. Each business unit or group should only see what they need to see. This way, the probability of users getting lost in the application is reduced, and so is the amount of training they need. This also means that the error rate will go down because their options are limited to what they need.

With model-driven apps, in addition to limiting what entities a user can see, you can also limit which forms, views, charts, dashboards and business process flows they see. So when you have an entity (like the Case) that is used by multiple groups, each group can see its own forms, views and charts without being overwhelmed by everything else. I wouldn’t call this a security layer but a way of organizing components.


If model-driven apps are not enough, then Canvas Apps come to the rescue. Canvas Apps are new and their concept is new. Unlike model-driven apps, which seem intuitive to someone who knows the previous versions of Dynamics, Canvas Apps require a shift in the design mentality. Now we are not talking about a single application that can do many things, but about an application and many other separate little helper applications around it that all feed the same data layer (Common Data Model). So when you create data using a Canvas App, it is possible to view it from Dynamics and vice versa.

The introduction of Canvas Apps adds a new question to the design process: “Should we implement this module in Dynamics or using a Canvas App?” This question is becoming an important one because it doesn’t only affect the application architecture but also the user on-boarding experience, training time, error rate and user confidence.

Canvas Apps are great when there is a user or group of users who perform a limited set of functions that can be separated out. Take the example of a service call center agent who just answers calls, logs a ticket and tries to solve it or escalate it. You don’t need to train this agent on the whole almighty Dynamics for Customer Service but only on a screen or two of the Canvas App that she and her team have access to. Keep in mind that Canvas Apps can handle more complicated use cases too.

So to make the change management process easier, you don’t need to take the users away from their Excel sheet into an application that is 100 times the size of their Excel sheet but into an application that is almost the same size as their Excel sheet. Success is almost guaranteed in this case.

Using the Calendar Control View in the Unified Interface

Often, we get asked to show records in a calendar view. I personally used the JavaScript-based Full Calendar many times in the past to do that. If your requirement is just showing the records on a calendar with basic functionality then the Calendar control in the unified interface might be your answer.

In the classic interface, we used to have a calendar control on the entity that only worked in the Phone and Tablet layouts. This control basically allows us to view the records on a calendar instead of just showing them in a list.

Moving to the unified interface, the “Web” option is now available. To test that, I created a dummy event entity with Start date, End date and Description fields.

A custom Entity with Start date, end date and description fields.

Then, from the controls section on the new entity (use the classic interface designer, as this is not available yet in the new designer), add a calendar view, enable it for web and bind the start, end and description values to the fields we just created above. Note that the description field is what shows on the calendar; you can either bind it to the name of the record or to a custom description field if you want to show more information. Save and publish your changes.

Add the calendar control and bind the values

Now when you go to view the events, instead of the classical view, you will see a nice calendar view.

The calendar control shows instead of the classical view.

If you would like to go back to the normal list view, you can do that from the top right corner.

Business Rules for PowerApps Portals – v1

When it comes to customizing Dynamics 365, I don’t care how we do it; I care about enabling the customers to use the system easily after it gets delivered to them. This of course means that if we can get things done with OOB configuration and customization wizards, then that is the way to go, and writing code is the last option. One example is the use of Business Rules instead of client side scripting: for simple to medium needs, a business rule can save us (and the customer) from nasty JavaScript code and enable them to change it later without worry.

The same problem applies to the Portals side of Dynamics. I’ve never worked on a portal project where the OOB features satisfied the client’s needs. This means any small change like hiding a field or a section needs to be backed by some JavaScript that lives inside the Entity Form or the Web Form Step. Even though the needed JavaScript can be simple, not everyone is comfortable writing it, especially if the Dynamics admin is not a technical person, and honestly, they shouldn’t need to know JavaScript.

I thought of a configuration-based solution that I call Portal Business Rules. This solution doesn’t have a fancy designer like the Business Rules in Dynamics forms, but it is configuration based and it is capable of producing/modifying JavaScript without the need to write it yourself. It covers many of the common functionalities that a project needs. That being said, and similar to how client side scripting is still needed on the Dynamics side even with the existence of Business Rules, complex needs will still require JavaScript on the portal. The good news is that this complex JavaScript can coexist with my proposed solution.

The current functionality of the solution is limited to:

  1. Each rule is governed by a single IF/ELSE condition.
  2. The rule works with Entity forms and Web form steps.
  3. Each rule can have an unlimited number of actions. Actions include Show/Hide Fields, Disable/Enable Fields, Make Fields Required/Not Required, Set Field Value, Prevent Past Date and Prevent Future Date (for Datetime fields), Show/Hide Sections and Show/Hide Tabs.
  4. A rule will parse the XML of the related form or tab and suggest the fields/sections/tabs to be used in the rule logic.
  5. For some of the field types (Option sets and two option sets), a suggested value table shows up for ease of use. So instead of figuring out the integer value of an option set field, they will be listed for the user to select from.
  6. The ability to use “In” and “Not In” Operators. For example you can say if an option set value is in “2^3^4” which means if the option set is either of these 3 values, then the condition will hold true.
  7. You can see the generated JavaScript directly in a special tab.
  8. The generated JavaScript for all the rules gets injected into the Custom JavaScript field of the Entity Form or Web Form Step, and it is decorated with special comments to make it clear that it was generated by the solution and not written by hand.
  9. When a rule is deleted or set back to draft, its logic gets removed automatically from the corresponding Entity Form or Web Form Step.
  10. Basic error handling is added so that when an operand has the wrong value format, an error shows up to tell the user to fix it.
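
To make the “In” and “Not In” operators and the operand format check concrete, here is a minimal Python sketch of the semantics described in the list. It is not the solution’s actual code (which generates JavaScript); the function names are hypothetical and the caret-delimited format is taken from the “2^3^4” example.

```python
def parse_operand(raw):
    """Split a caret-delimited operand such as "2^3^4" into integer values."""
    try:
        return {int(part) for part in raw.split("^")}
    except ValueError:
        raise ValueError(f"Operand {raw!r} must be integers separated by '^'") from None


def evaluate(value, operator, raw_operand):
    """Evaluate an "In" / "Not In" condition against an option set value."""
    allowed = parse_operand(raw_operand)
    if operator == "In":
        return value in allowed
    if operator == "Not In":
        return value not in allowed
    raise ValueError(f"Unsupported operator: {operator}")


print(evaluate(3, "In", "2^3^4"))      # True: 3 is one of the listed values
print(evaluate(5, "In", "2^3^4"))      # False: 5 is not listed
print(evaluate(5, "Not In", "2^3^4"))  # True
```

An operand such as "2^x" fails in parse_operand with a message telling the user what format is expected, which mirrors the basic error handling in the last item.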

Here is a quick video showing the installation steps:

Here is a simple rule creation demo that shows/hides a tab based on a two option set value:

Another demo of multi action rule, where the Job Title field is shown and becomes required if the Company Name field is populated:

Another demo shows how an option set is used in a rule and how error handling works if the operand value is in the wrong format:

And finally, the “In” Operator is one of the advanced operators. Here is an example of how we can populate a field if the condition falls into one of a predetermined list of values:

Of course, there are many other features that you may want to check out if you install the solution: manipulating section visibility, field states (enabled and disabled) and many more.

Many will notice that we can only have one condition in a single rule for now. I’m currently thinking about the best way to associate other conditions with a rule using either AND or OR logical operators between them, similar to how Dynamics 365 Business Rules behave.

To be fair, the best solution for this problem is not my proposed solution but to make the Business Rules that currently exist for Dynamics forms work on the portal forms as well. I would say this needs to be done by Microsoft itself, as there is not much visibility into the Business Rules engine for us developers. Based on my knowledge, the business rules in Dynamics seem to be built using the Windows Workflow Foundation (from looking at their XAML).

In summary, the problem I’m trying to solve is reducing the need for code further, similar to how Business Rules reduced the need for client side scripting on the Dynamics 365 side. If code is still needed, then my solution and custom code can still live together.

Please refer to my repository on Github for installation steps. Feedback is really appreciated.

NOTE: For the JavaScript functions that I call in the back-end, I use this existing library on GitHub developed by Aung Khaing.

Update October 16, 2019

While searching, I found out that a company called North52 has a similar solution that was built earlier. They inject JavaScript the same way I do, but of course with a nicer interface :). My solution provides a bit more functionality. Here is the Link

Slim Solution, a Plugin for XrmToolbox

Recently, I was given many unmanaged Dynamics 365 solutions to maintain. The thing I hate about solutions is that they can become messy with the click of a button when you add an existing component to them. By messy I mean that a lot of things that are not needed get added to the solution. When adding the Case entity, many developers add the whole entity even though only one or two fields need to be modified. The rest of the information is confusing, and it is not very straightforward to clean up.

The problem gets worse when you want to build a managed solution out of the unmanaged one. The managed solution needs to be very clear about what it does to which parts of the system. If the managed solution changes one case field and adds a new relationship, then those are the only changes that need to exist in the solution.

While cleaning up the solutions manually by looking at their managed exports (you can tell whether a component has changed by looking at its managed XML export), I decided to write a very small XrmToolbox plugin to help me with that. I wrote the plugin some time ago, but it took a while to validate it as XrmToolbox has a new, lengthy validation process.

The basic idea of the plugin is that it checks all the managed entities that were added to the solution and finds which fields, forms or views were either customized or added to each managed entity. Then it tells you which components need to be in the solution so that you can remove the rest.

The plugin is called SlimSolution and it is currently available for download in the XrmToolbox. It is still lacking many of the features I want such as checking other component types but I will be adding those in the near future.

An example of the usage of this plugin would be something like this. You create a solution (or other developers do) and you want to clean it up from the unwanted components. As an example, I created the below solution that has:

  1. A custom unmanaged entity
  2. Three managed entities, added with all of their components and metadata.
  3. I modified/added some fields in the account and KB article entities and did nothing to the Agreement entity.

What I want is to clean up this solution by only keeping the managed entities that have been customized.

When you open the SlimSolution plugin, you first load the solutions and hit Check Solution. A somewhat nice summary appears on the right with some details and suggestions on which changes need to stay in those managed entities. As I mentioned above, the plugin only checks Forms, Views and Fields for now and gives you the list of components that need to stay in the solution.

You can see that the unmanaged entity is not mentioned because the plugin assumes unmanaged entities are created to be included in the solution (not always the case though but this is the assumption here). You will see information about what needs to stay from the account and KB entities because they were modified. You don’t see anything for the agreement entity which means that whole entity can be removed from the solution.
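
The decision logic described above can be sketched in a few lines of Python. This is not the plugin’s code, just an illustration under stated assumptions: the Entity class, the component names and the customized flags are all hypothetical, and in practice the flags would come from inspecting the managed export.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    managed: bool
    # Components present in the solution, mapped to True when customized or added
    components: dict = field(default_factory=dict)


def slim(entities):
    """Return (components to keep per managed entity, managed entities to remove)."""
    keep, removable = {}, []
    for entity in entities:
        if not entity.managed:
            continue  # unmanaged entities are assumed to belong in the solution
        changed = sorted(c for c, customized in entity.components.items() if customized)
        if changed:
            keep[entity.name] = changed
        else:
            removable.append(entity.name)
    return keep, removable


solution = [
    Entity("new_custom", managed=False, components={"new_name": True}),
    Entity("account", managed=True, components={"name": False, "new_tier": True}),
    Entity("knowledgearticle", managed=True, components={"title": True, "content": False}),
    Entity("agreement", managed=True, components={"name": False}),
]

keep, removable = slim(solution)
print(keep)
print(removable)
```

With this sample data, the account and KB article entities keep only their changed components, the agreement entity is flagged as removable and the unmanaged entity is not mentioned at all.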

In addition to the above, if the solution contains some inactive processes/BPFs/Dialogs, the plugin will alert you to remove them from the solution. The code for this plugin is constructed in a way that makes adding component validators an easy task, which I will do in the near future as I have some other validator ideas in mind.

Tips on Dynamics 365 Plugin Code validation for AppSource Submissions

Not long ago, I was involved in submitting a really complex application built on top of Dynamics 365 to Microsoft AppSource. The application contains a lot of plugins and code activities that perform some complex tasks and automation. The team faced some issues that I think are worth sharing with others to save you time if you are working on such a submission.

Microsoft provides us with tools such as the Solution Checker that validate your solution, including your plugin and web resource code. The problem is, that’s not all. When you submit an application to the AppSource team, it goes through rigorous manual and automatic checks using tools that are not publicly available to us developers. If there are issues in your code, your submission will be rejected with an explanation of what to fix and a list of issues ordered by priority. To pass the submission, all critical and high priority issues need to be fixed (if you can convince the AppSource team that something needs to be done a certain way and can’t be done another way, they will mostly make an exception).

After the first submission, the app got rejected with tons of things to modify/fix (even after running the solution checker on all the solutions). To be honest, the documents they sent were scary (1000+ pages) with explanations of the issues. After looking at the issue list, it turned out that 90% of the critical/high priority issues were related to writing thread-safe plugins. Luckily, the fix was very easy for those issues, but it cost us around 2 weeks to do another submission and get it verified again. The following are the most common critical issues.

Variables May Cause Threading Issues

A plugin in Dynamics is a simple class that implements the IPlugin interface and thus has a single Execute method as a minimum. Almost always, you need to create the organization service, the tracing service, the context and maybe other objects. A bare-bones plugin that builds will look something like this:

public class SomePlugin : IPlugin {
    public void Execute(IServiceProvider serviceProvider) {
        throw new NotImplementedException();
    }
}

A useful plugin will have extra objects created so that we can communicate with the Dynamics organization:

public class SomePlugin: IPlugin {
 // Class-level fields, kept on the plugin instance between runs
 ITracingService tracingService = null;
 IPluginExecutionContext context = null;

 public void Execute(IServiceProvider serviceProvider) {
  // Obtain the tracing service
  tracingService =
   (ITracingService) serviceProvider.GetService(typeof(ITracingService));

  // Obtain the execution context from the service provider.
  context = (IPluginExecutionContext)
   serviceProvider.GetService(typeof(IPluginExecutionContext));
 }
}

Now what’s wrong with the above plugin code? In a normal .NET application, this is a normal thing to do, but in a Dynamics plugin, it is not. To understand why, we need to understand how plugins get executed on our behalf behind the scenes. When a plugin runs for the first time (because of some trigger), its constructor is executed and the platform caches the plugin instance, along with its global (class-level) variables. This means that the same tracing service and context “may” be shared with the next run. This applies to any variable you define outside your function as a global variable in your plugin class. Ultimately, this causes threading issues (multiple runs of the same plugin instance compete for the same cached variables) and you may end up with extremely difficult-to-debug errors and unexplained deadlocks. The fix is very simple: create your variables locally in the Execute function, so each run of the plugin works with its own set of local variables.

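public class SomePlugin: IPlugin {
 public void Execute(IServiceProvider serviceProvider) {
  // Local variables: each run of the plugin gets its own copies
  ITracingService tracingService =
   (ITracingService) serviceProvider.GetService(typeof(ITracingService));

  IPluginExecutionContext context = (IPluginExecutionContext)
   serviceProvider.GetService(typeof(IPluginExecutionContext));
 }
}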

By default, this means that any helper function in your plugin should get what it needs from its parameters and not from global variables. Assume you have a function that needs the tracing service and gets called from the Execute method: pass the tracing service that was created in the Execute method to that function instead of making it a global object.

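public class SomePlugin: IPlugin {
 public void Execute(IServiceProvider serviceProvider) {
  ITracingService tracingService =
   (ITracingService) serviceProvider.GetService(typeof(ITracingService));

  IPluginExecutionContext context = (IPluginExecutionContext)
   serviceProvider.GetService(typeof(IPluginExecutionContext));

  // do work here, handing the local tracing service to the helper
  HelperFunction(tracingService, 1, 2, "some value");
 }

 private void HelperFunction(ITracingService tracingService, int param1, int param2, string param3) {
  // use tracing service here
  tracingService.Trace("param1={0}, param2={1}, param3={2}", param1, param2, param3);
 }
}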

On the other hand, anything that is read only (config string, some constant number) is safe to stay as a global class member.
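
The effect of sharing state between overlapping runs can be reproduced outside Dynamics. Below is a minimal Python sketch, not Dynamics code: one object plays the role of the cached plugin instance, four threads play overlapping runs, and a barrier forces them to overlap so the outcome is repeatable. All class and function names are hypothetical.

```python
import threading

RUNS = 4
barrier = threading.Barrier(RUNS)  # forces all runs to overlap


class SharedStatePlugin:
    """Keeps the context in a field on the (cached, reused) instance."""

    def execute(self, run_id, results):
        self.context = run_id           # every run overwrites the same field
        barrier.wait()                  # by now all runs have written
        results[run_id] = self.context  # may be another run's context


class LocalStatePlugin:
    """Keeps the context in a local variable, private to each run."""

    def execute(self, run_id, results):
        context = run_id
        barrier.wait()
        results[run_id] = context


def run_all(plugin):
    """Run the plugin RUNS times in parallel and count runs that saw the wrong context."""
    results = {}
    threads = [threading.Thread(target=plugin.execute, args=(i, results)) for i in range(RUNS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(1 for run_id, seen in results.items() if seen != run_id)


shared_mismatches = run_all(SharedStatePlugin())
local_mismatches = run_all(LocalStatePlugin())
print(shared_mismatches, local_mismatches)
```

With the shared field, every run ends up reading whichever context was written last, so all but one of them see the wrong value. With local variables, no run is affected by the others.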

Plugins That Trigger on Any Change

This problem is more common. The filtering attributes of a plugin are a way to limit when that plugin executes. Specify as few filtering attributes as possible, only the ones the plugin actually depends on, rather than selecting all of them. At the time I was involved in that submission, the solution checker wasn’t able to detect this problem, but it may have improved since.


Plugins That Update the Record Or Retrieve Attributes Again

This is also a common issue. When a plugin is triggered on an update of an entity record, it is a really bad idea to issue another update request to the same record. An example is the need to update fieldX based on the value of fieldY: when the plugin triggers on a fieldY change, you issue a service.Update(entity) with the new value of fieldX. This hurts the performance of the whole organization and, even worse, it can cause an infinite loop if the filtering attributes are not set properly. Another bad use case is issuing a retrieve query for attributes of the same record when pre-images and post-images can supply them.

To be clear, sometimes, there is no way around issuing another retrieve inside the plugin or sending a self-update request, we had some of those cases and we were able to convince the AppSource team that our way was the only way.
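
How filtering attributes and self-updates interact can be shown with a toy simulation in Python. Nothing here is the Dynamics API: Record is a hypothetical class, and MAX_DEPTH is an assumed guard added only so the runaway case terminates in this sketch.

```python
MAX_DEPTH = 8  # assumed guard for this simulation, so the loop terminates


class Record:
    def __init__(self, filtering_attributes=None):
        self.data = {"fieldX": 0, "fieldY": 0}
        self.filtering_attributes = filtering_attributes  # None = trigger on any change
        self.plugin_runs = 0

    def update(self, changes, depth=1):
        self.data.update(changes)
        triggered = (
            self.filtering_attributes is None
            or any(name in self.filtering_attributes for name in changes)
        )
        if triggered and depth <= MAX_DEPTH:
            self.plugin_runs += 1
            # The plugin reacts by sending another update to the same record.
            self.update({"fieldX": self.data["fieldY"] * 2}, depth + 1)


# No filtering attributes: the self-update re-triggers the plugin again and again.
unfiltered = Record()
unfiltered.update({"fieldY": 21})

# Filtered on fieldY only: the self-update of fieldX does not re-trigger it.
filtered = Record(filtering_attributes={"fieldY"})
filtered.update({"fieldY": 21})

print(unfiltered.plugin_runs, filtered.plugin_runs, filtered.data["fieldX"])
```

Without filtering attributes the plugin keeps firing until the guard stops it, while the filtered version runs exactly once and still ends with the right value in fieldX.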

Slow Plugins

As a general rule of thumb, your plugin should be slim: it should do one very small thing and do it fast. Plugins have an upper limit on how long they can run, and your plugin should never come close to it (ideally not even half of it). When your plugin does exceed the time allocated for it, it is time to redesign it.


While those issues have simple fixes in general, they can cause slowness and unexplained errors and a rejection from AppSource. Even if you are not submitting anything to AppSource, make sure that you set some ground rules for the developers working on the same code base on how to write good plugins. More on plugins best practices can be found here.

What is this blog about?

Digitally transforming your business is a daunting task. It requires revamping the whole organization’s processes and systems to meet the never-ending demands of today’s digital world. We all face problems with this kind of transformation, and since many of those problems are technical, we would love to help!

We are a group of professionals working with the latest technologies in the Business Intelligence, Business Applications, Cloud Computing, Data Science, Machine Learning and Software Development fields. Our goal is to help the community by providing solutions to problems we face in our day-to-day jobs that we think other people may face too.