Monday, January 8, 2018

Azure Machine Learning Workbench: Getting Started

Today, we're going to take a look at one of the newest Data Science offerings from Microsoft.  Of course, we're talking about the Azure Machine Learning (AML) Workbench!  Join us as we dive in and see what this new tool is all about.

Before we install the AML Workbench, let's talk about what it is.  The AML Workbench is a local environment for developing data science solutions that can be easily deployed and managed using Microsoft Azure.  It doesn't appear to be related to AML Studio in any way.  Throughout this series, we'll walk through all of the different things we can do with the AML Workbench.  For today, we're just going to get our feet wet.

Now, we need to create an Azure Machine Learning Experimentation resource in the Azure portal.  You can find complete instructions here.  We will also include a Workspace and a Model Management Account.  This appears to be free for the first two users.  However, we're not sure whether they charge separately for the storage account.  Maybe someone can let us know in the comments.  With that out of the way, let's boot this baby up!
Azure Machine Learning Workbench
New Project
In the top-left corner, we can see the Workspace we created in the Azure portal.  Let's add a new Project to this.
Create New Project
Now, we have to add the details for our new project.  Strangely, the project name can't include spaces.  We felt like we were past the point where names had to be simple, but maybe it's a Git thing.  Either way, we'll call our new project "Classifying_Iris" and use the "Classifying Iris" template at the bottom of the screen.  Let's see what's inside this project.
Project Dashboard
The first thing we see is the Project Dashboard.  This is a great place to create (or read) quality documentation on exactly what the project does, links to external resources, etc.
iris_sklearn
Following the QuickStart instructions, we were able to run the "iris_sklearn.py" code.  Unfortunately, it's not immediately obvious what this does.  Fortunately, the Exploring Results section tells us to check the Run History.  We can find this icon on the left side of the screen.
Run History
iris_sklearn Run History
This is pretty cool stuff actually.  This view lets us see how long our code takes to run, as well as which parameters were passed in.  That would be extremely helpful if we were running repeated experiments.  In our case, it doesn't show much though.
Job History
If we click on the Job Name in the Jobs section on the right side of the screen, we can see a more detailed result set.
Run Properties
This is what we were looking for!  This gives us all kinds of information about the run.  This could be extremely useful for showing the results of an experiment to bosses or colleagues.
Logs
Further down the page, we see the Logs section.  This is where we can access all the granular information we would need if we needed to debug a particular issue.

The next section of the instructions is the Quick CLI Reference.  This gives us a bunch of code we can use to run these scripts from the command line (or PowerShell).  Let's open a new command line window.
Open Command Prompt
In the top-left corner of the window, we can select "Open Command Prompt" from the "File" menu.
Command Prompt
In the command prompt, we can copy the first line of code from the instructions.

pip install matplotlib
This command installs the Python library "matplotlib", which contains quite a few functions for creating graphs and charts in Python.  You can read more about it here.
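Just to prove to ourselves that the install worked, we can run a tiny plotting script.  This is purely illustrative and isn't part of the template; the file name and values are ours.

# quick_plot.py - a minimal check that matplotlib is installed and working
import matplotlib
matplotlib.use("Agg")            # render to a file so no display is needed from the command prompt
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, marker="o")       # simple line chart with point markers
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("matplotlib sanity check")
plt.savefig("quick_plot.png")    # writes the chart to the current folder

Now that we have the library installed, let's copy the next line of code.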

az login
This command logs the Azure Command Line Interface (CLI) into our Azure subscription.  When we run it, we get the following response.
To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code ######### to authenticate.
When we follow the instructions, we can log into our Azure subscription.
Azure Login
The next piece of code we need to run is as follows.

python run.py
This piece of code will run the "run.py" script from our project.  We'll look at this script in a later post.  For now, let's see the output from this script.  Please note that the "run.py" script is iterative and creates a large amount of output.  You can skip to the OUTPUT END header if you don't want to see the output.

OUTPUT BEGIN

RunId: Classifying_Iris_1509457170414

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 10.0
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6415094339622641

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 0 31 19]
 [ 0  4 46]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457170414

RunId: Classifying_Iris_1509457188739

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 5.0
LogisticRegression(C=0.2, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6415094339622641

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 0 32 18]
 [ 0  4 46]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457188739

RunId: Classifying_Iris_1509457195895

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 2.5
LogisticRegression(C=0.4, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.660377358490566

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 0 33 17]
 [ 0  4 46]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457195895

RunId: Classifying_Iris_1509457203051

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 1.25
LogisticRegression(C=0.8, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6415094339622641

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 33 16]
 [ 0  5 45]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457203051

RunId: Classifying_Iris_1509457210237

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 0.625
LogisticRegression(C=1.6, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.660377358490566

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 36 13]
 [ 0  5 45]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457210237

RunId: Classifying_Iris_1509457217482

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 0.3125
LogisticRegression(C=3.2, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.660377358490566

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 36 13]
 [ 0  4 46]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457217482

RunId: Classifying_Iris_1509457225704

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 0.15625
LogisticRegression(C=6.4, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6792452830188679

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 36 13]
 [ 0  3 47]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457225704

RunId: Classifying_Iris_1509457234132

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 0.078125
LogisticRegression(C=12.8, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6792452830188679

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 36 13]
 [ 0  3 47]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457234132

RunId: Classifying_Iris_1509457242301

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 0.0390625
LogisticRegression(C=25.6, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6981132075471698

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 37 12]
 [ 0  3 47]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457242301

RunId: Classifying_Iris_1509457249742

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 0.01953125
LogisticRegression(C=51.2, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6981132075471698

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 37 12]
 [ 0  3 47]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================
RunId: Classifying_Iris_1509457249742

RunId: Classifying_Iris_1509457257076

Executing user inputs .....
===========================

Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]

Iris dataset shape: (150, 5)
Regularization rate is 0.009765625
LogisticRegression(C=102.4, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy is 0.6792452830188679

==========================================
Serialize and deserialize using the outputs folder.

Export the model to model.pkl
Import the model from model.pkl
New sample: [[3.0, 3.6, 1.3, 0.25]]
Predicted class is ['Iris-setosa']
Plotting confusion matrix...
Confusion matrix in text:
[[50  0  0]
 [ 1 37 12]
 [ 0  4 46]]
Confusion matrix plotted.
Plotting ROC curve....
ROC curve plotted.
Confusion matrix and ROC curve plotted. See them in Run History details page.

Execution Details
=================

RunId: Classifying_Iris_1509457257076

OUTPUT END
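Before we move on, it's worth noting what the output implies about the script.  Each run loads the Iris data, trains a scikit-learn logistic regression whose C parameter is the inverse of the displayed regularization rate, prints an accuracy and confusion matrix, and pickles the model to an outputs folder.  Here's a rough sketch of that idea.  To be clear, this is our own reconstruction, not the template's actual code; the data loading and split are assumptions.

# iris_sketch.py - our reconstruction of what each run in the output above appears to do
import os
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

reg_rate = 10.0                                   # the output shows this halving on each run

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=0)

clf = LogisticRegression(C=1.0 / reg_rate)        # C is the inverse of the regularization rate
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy is", accuracy_score(y_test, y_pred))
print("Confusion matrix in text:")
print(confusion_matrix(y_test, y_pred))

os.makedirs("outputs", exist_ok=True)             # the template serializes to an outputs folder
with open("outputs/model.pkl", "wb") as f:
    pickle.dump(clf, f)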

Like we said before, we'll dig more into this code in a later post.  For now, let's take a look at the run history again.

Run History 2
Now, we can see all of the runs that just took place.  This is a really easy way to get a visual of what our code was accomplishing.

This seems like a good place to stop for today.  At first glance, the AML Workbench is much more developer-oriented than its Studio counterpart.  There's a ton of information here, and it's going to take some more time for us to get comfortable with it.  Stay tuned for the next post where we'll dig into the rest of the pre-built code, focusing on executing our code in different environments.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Science Consultant
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, December 18, 2017

Azure Machine Learning in Practice: Productionalization

Today, we're going to finish up our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing, Model Selection, Model Evaluation, Threshold Selection, Feature Selection and Feature Engineering phases of the experiment.  In this post, we're going to walk through the Productionalization process.

Productionalization is the process of taking the work we've done so far and making it accessible to the end user.  This is by far the most important process.  If we are unable to connect the end user to the model, then everything up until now was for nothing.  Fortunately, this is where Azure Machine Learning really differentiates itself from the rest of the data science tools on the market.  First, let's create a simple experiment that takes our testing data and scores that data using our trained model.  Remember that we investigated the use of some basic engineered features, but found that they didn't add value.
Productionalization
Now, let's take a minute to talk about web services.  A web service is a simple resource that sits on the Internet.  A user or application can send a set of data to this web service and receive a set of data in return, assuming they have the permissions to do so.  In our case, Azure Machine Learning makes it incredibly simple to create and deploy our experiment as an Azure Web Service.
Set Up Web Service
On the bar at the bottom of the Azure Machine Learning Studio, there's a button for "Set Up Web Service".  If we click it, we get a neat animation and a few changes to our experiment.
Predictive Experiment
We can see that we now have two new modules, "Web Service Input" and "Web Service Output".  When the user or application hits the web service, these are what they interact with.  The user or application passes a data set to the web service as a JSON payload.  Then, that payload flows into our Predictive Experiment and is scored using our model.  Finally, that scored data set is passed back to the user or application as a JSON payload.  The simplicity and flexibility of this type of model means that virtually any environment can easily integrate with Azure Machine Learning experiments.  However, we need to deploy it first.
Deploy Web Service
Just like with creating the web service, deployment is as easy as clicking a button on the bottom bar.  Unless you have a reason not to, it's good practice to deploy a new web service, as opposed to a classic one.
Web Service Deployment
Now, all we have to do is link it to a web service plan and we're off!  You can find out more about web service plans and their pricing here.  Basically, you can pay-as-you-go or you can buy a bundle at a discount and pay for any overages.  Now, let's take a look at a brand new portal, the Azure Machine Learning Web Services Portal.
Azure Machine Learning Web Services Portal
This is where we can manage and monitor all of our Azure Machine Learning Web Services.  We'll gloss over this for now, as it's not the subject of this post.  However, we may venture back in a later post.  Let's move over to the "Consume" tab.
Azure Machine Learning Web Service Consumption Information
On this tab, we can find the keys and URIs for our new web services.  However, there's something far more powerful lurking further down on the page.
Sample Web Service Code
Azure Machine Learning provides sample code for calling the web service using four languages: C#, Python, Python 3+ and R.  This is amazing for us because we're not developers.  We couldn't code our way out of a box.  But, Azure Machine Learning makes it so easy that we don't have to.
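To give a flavor of what that sample code looks like, here's a rough sketch of calling a request-response web service from Python.  The URI, API key and column names below are placeholders; the real values, and the exact payload layout, should be copied from the Consume tab.

import json
import urllib.request

# Placeholder values - copy the real ones from the Consume tab of the Web Services Portal
url = "https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0"
api_key = "<your api key>"

# One row of input data; the column names must match the web service's input schema
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Time", "V1", "V2", "Amount"],
            "Values": [["0", "-1.36", "-0.07", "149.62"]]
        }
    },
    "GlobalParameters": {}
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key
    }
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))       # the scored labels and probabilities come back as JSON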

Hopefully, this post sparked your imagination for all the ways that you could utilize Azure Machine Learning in your organization.  Azure Machine Learning is one of the best data science tools on the market because it drastically slashes the amount of time it takes to build, evaluate and productionalize your machine learning algorithms.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Science Consultant
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, November 27, 2017

Azure Machine Learning in Practice: Feature Engineering

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing, Model Selection, Model Evaluation, Threshold Selection and Feature Selection phases of the experiment.  In this post, we're going to walk through the feature engineering process.

Feature Engineering is the process of adding new features or transforming existing features in the input dataset.  The goal of Feature Engineering is to create features that will greatly strengthen the model in terms of performance or accuracy.  This is a huge area within Machine Learning.  We can't possibly do it justice in just one post.  However, we can talk about a few different ways we would approach this problem using our dataset.

Just like with Feature Selection, traditional machine learning would usually put this among the first steps of any machine learning process.  However, Azure Machine Learning makes it easy to throw the raw data at a large pool of models and see what happens.  We often find that we can create good models without much feature transformation.  This leaves us in the situation where we can decide whether the additional performance or accuracy is worth the time required to engineer more features.

In this case, let's assume that 93% precision and 90% recall are not high enough.  That means it's worth seeing whether feature engineering can buy us some additional accuracy.  Let's take a look at the dataset again for a refresher.
Credit Card Fraud Data 1

Credit Card Fraud Data 2
The following description is lifted from the first post in this series.

We can see that this data set has the following columns: "Row Number", "Time", "V1"-"V28", "Amount" and "Class".  The "Row Number" column is simply used as a row identifier and should not be included in any of the models or analysis.  The "Time" column represents the number of seconds between the current transaction and the first transaction in the dataset.  This information could be very useful because transactions that occur very rapidly or at constant increments could be an indicator of fraud.  The "Amount" column is the value of the transaction.  The "Class" column is our fraud indicator.  If a transaction was fraudulent, this column would have a value of 1.

Finally, let's talk about the "V1"-"V28" columns.  These columns represent all of the other data we have about these customers and transactions combined into 28 numeric columns.  Obviously, there were far more than 28 original columns.  However, in order to reduce the number of columns, the creator of the data set used a technique known as Principal Component Analysis (PCA).  This is a well-known mathematical technique for compressing a large number of correlated columns into a small number of dense, information-rich columns.  Fortunately for the creators of this data set, it also has the advantage of anonymizing any data you use it on.  While we won't dig into PCA in this post, there is an Azure Machine Learning module called Principal Component Analysis that will perform this technique for you.  We may cover this module in a later post.  Until then, you can read more about it here.
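For anyone curious what that looks like in code, here's a minimal sketch of PCA using scikit-learn.  This is purely illustrative; the data and the number of components are made up, and the creators of this data set obviously had their own process.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Pretend we had 200 original, correlated columns for 1,000 transactions
rng = np.random.RandomState(0)
raw = rng.normal(size=(1000, 200))

# PCA is sensitive to scale, so standardize the columns first
scaled = StandardScaler().fit_transform(raw)

# Compress the 200 columns down to 28 dense components, like the "V" columns
pca = PCA(n_components=28)
components = pca.fit_transform(scaled)

print(components.shape)                          # (1000, 28)
print(pca.explained_variance_ratio_.sum())       # share of the original variance retained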

Now, let's talk about three different types of engineered features.  The first type is the "Discretized" feature.  Discretization, also known as Bucketing or Binning, is the process of taking a continuous feature (generally a numeric value) and turning it into a categorical value by applying thresholds.  Let's take a look at the "Amount" feature in our data set.
Amount Statistics
Amount Histogram
Obviously, we aren't going to get any information out of this histogram without some serious zooming.  We did spend some time on this, but there's really not much to see there.  Therefore, we can just use the information from the Statistics.

We see that the mean is four times larger than the median.  This indicates that this feature is heavily right-skewed, which is exactly what we're seeing in the histogram.  Most of the records belong to a proportionally small set of values.  This is an extremely common pattern with dollar amounts.  So, how do we choose the thresholds we want to use?  First and foremost, we should use domain knowledge.  There is no way to replace a good set of domain knowledge.  We firmly believe that the best feature engineers are the ones who know a data set and a business problem very well.  Unfortunately, we're not Credit Card Fraud experts.  However, we do have some other tools in our toolbox.

One technique for discretizing a heavily skewed feature is to create buckets by using powers.  For instance, we can create buckets for "<$1", "$1-$10", "$10-$100" and ">$100".  In general, we like to handle these using the "Apply SQL Transformation" module.  You could easily do the same using R or Python.  Here's the code we used and the resulting column.

SELECT
    *
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 10 THEN '2 - Between $1 and $10'
        WHEN [Amount] < 100 THEN '3 - Between $10 and $100'
        ELSE '4 - Greater Than $100'
    END AS [Amount (10s)]
FROM t1
Amount (10s) Histogram
Using this technique, we were able to take an extremely skewed numeric feature and turn it into an interpretable discretized feature.  Did this help our model?  We won't know that until we build a new model.  In general, the goal is to create as many new features as possible, then see which ones are truly valuable at the end.  In fact, there are other techniques out there for choosing optimal thresholds that can provide tremendous business value.  We encourage you to investigate this.  In this experiment, we'll create new features for Amount (2s) and Amount (5s), then move on.

SELECT
    *
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 2 THEN '2 - Between $1 and $2'
        WHEN [Amount] < 4 THEN '3 - Between $2 and $4'
        WHEN [Amount] < 8 THEN '4 - Between $4 and $8'
        WHEN [Amount] < 16 THEN '5 - Between $8 and $16'
        WHEN [Amount] < 32 THEN '6 - Between $16 and $32'
        WHEN [Amount] < 64 THEN '7 - Between $32 and $64'
        WHEN [Amount] < 128 THEN '8 - Between $64 and $128'
        ELSE '9 - Greater Than $128'
    END AS [Amount (2s)]
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 5 THEN '2 - Between $1 and $5'
        WHEN [Amount] < 25 THEN '3 - Between $5 and $25'
        WHEN [Amount] < 125 THEN '4 - Between $25 and $125'
        ELSE '5 - Greater Than $125'
    END AS [Amount (5s)]
    ,CASE
        WHEN [Amount] < 1 THEN '1 - Less Than $1'
        WHEN [Amount] < 10 THEN '2 - Between $1 and $10'
        WHEN [Amount] < 100 THEN '3 - Between $10 and $100'
        ELSE '4 - Greater Than $100'
    END AS [Amount (10s)]
FROM t1
Amount (2s) Histogram
Amount (5s) Histogram
The next type is the "Transformation" feature.  Transformation, at least in our minds, is the process of taking an existing feature and applying some type of basic function to it.  In the previous example, we tried to alleviate the skew of the "Amount" feature by discretizing it.  However, there are other options.  For instance, what if we were to create a new feature, "Amount (Log)", by taking the logarithm of the existing feature?

SELECT
    *
    ,LOG( [Amount] + .01 ) AS [Amount (Log)]
FROM t1
Amount (Log) Histogram
We can see that this feature now strongly resembles a bell curve.  Honestly, this makes us question whether this data set is actually real or if it's faked.  In the real world, we don't often find features that look this clean.  Alas, that's not the purpose of this post.

There are plenty of other functions we can use for Transformation features.  For instance, if a feature has a large number of negative values, and we don't care about the sign, it could help to apply a square or absolute value function.  If a feature has a "wavy" distribution, it could help to apply a sine or cosine function.  There are a lot of options here that we won't explore in depth in this post.  For now, we'll just keep the Amount (Log) feature and move on.
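As an aside, if we wanted to try a handful of these transformations at once, the Execute Python Script module makes that easy with pandas.  The sketch below is illustrative only; we used SQL above and didn't actually add these particular features to our experiment.

import numpy as np

# Entry point for an Execute Python Script module; dataframe1 is the input dataset as a pandas DataFrame
def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()

    # A few candidate transformation features (column names from our dataset)
    df["Amount (Log)"] = np.log(df["Amount"] + 0.01)   # tame the right skew
    df["Amount (Sqrt)"] = np.sqrt(df["Amount"])        # a gentler alternative to the log
    df["V1 (Abs)"] = df["V1"].abs()                    # ignore the sign of a feature
    df["V2 (Squared)"] = df["V2"] ** 2                 # emphasize large magnitudes

    return df,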

The final type is the "Interaction" feature.  In some cases, it can be helpful to see what effect the combination of multiple fields has.  In the binary world, this is really simple.  We can create an "AND" feature by taking [Feature1] * [Feature2].  This is what people generally mean when they say "Interactions".  Less commonly, we also see "OR" features built by taking MAX( [Feature1], [Feature2] ).  There's nothing stopping us from taking these binary concepts and applying them to continuous values as well.  For instance, let's create two-way interaction variables for ALL of the "V" features in our dataset.  Since there are 28 "V" features in our dataset, there are 28 * 27 / 2 = 378 possibilities.  This is way too many to manually write using SQL.  This is where R and Python come in handy.  Let's write a simple R script to loop through all of the features, creating the interactions.

#####################
## Import Data
#####################

dat.in <- maml.mapInputPort(1)
temp <- dat.in[,1]
dat.int <- data.frame(temp)

################################################
## Loop through all possible combinations
################################################

for(i in 1:27){
    for(j in (i+1):28){
     
#########################################
## Choose the appropriate columns
#########################################

        col.1 <- paste("V", i, sep = "")
        col.2 <- paste("V", j, sep = "")
     
        val.1 <- dat.in[,col.1]
        val.2 <- dat.in[,col.2]

#######################################
## Create the interaction column
#######################################

        dat.int[,length(dat.int) + 1] <- val.1 * val.2
        names(dat.int)[length(dat.int)] <- paste("V", i, "*V", j, sep="")
    }
}

###################
## Outputs Data
###################

dat.out <- data.frame(dat.in, dat.int[,-1])
maml.mapOutputPort("dat.out");

Interaction Features
We can see that we now have 410 columns, with all of the interaction columns coming at the end of the data set.  Now, we need to determine whether the features we added were helpful.  How do we do that?  We simply run everything through our Tune Model Hyperparameters experiment again.  Good thing we saved that.  An important note here is that Neural Networks, Support Vector Machines and Locally-Deep Support Vector Machines are all very sensitive to the size of the data.  Therefore, if we are doing large-scale feature engineering like this, we may need to exclude those models, reduce the number of random sweeps or use the faster models to perform feature selection BEFORE we create our final model.

After running the Tune Model Hyperparameters experiment on the new dataset, we've determined that the engineered features did not help our model.  This is not entirely surprising to us as the features we added were mostly for educational purposes.  When engineering features, we should generally take into account existing business knowledge about which types of features seem to be important.  Unfortunately, we have no business knowledge of this dataset.  So we ended up blindly engineering features to see if we could find something interesting.  In this case, we did not.

Hopefully, this post opened your minds to the world of feature engineering.  Did we cover all of the types of engineered features in this post?  Not by a long shot.  There are plenty of other types, such as Summarization, Temporal and Statistical features.  We encourage you to look into these on your own.  We may even have another post on them in the future.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Science Consultant
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, November 6, 2017

Azure Machine Learning in Practice: Feature Selection

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing, Model Selection, Model Evaluation and Threshold Selection phases of the experiment.  In this post, we're going to walk through the feature selection process.

In the traditional data science space, feature selection is generally one of the first phases of a modelling process.  A large reason for this is that, historically, building models using a hundred features would take a long time.  Also, an individual would have to sort through all of the features after modeling to determine what impact each feature was having.  Sometimes, they would even find that including certain features made the model less accurate.  There's also the concept of parsimony to consider.  Basically, fewer variables were generally considered better.

However, technology and modelling techniques have come a long way over the last few decades.  We would be doing a great disservice to modern Machine Learning to say that it resembles traditional statistics in a major way.  Therefore, we try to approach feature selection from a more practical perspective.

First, we found that we were able to train over 1000 models in about an hour and a half.  Therefore, removing features for performance reasons is not necessary.  However, in other cases, it may be.  In those cases, paring down features initially could be beneficial.

Now, we need to determine which variables are having no (or even negative) impact on the resulting model.  If they aren't helping, then we should remove them.  To do this, we can use a technique known as Permutation Feature Importance.  Azure Machine Learning even has a built-in module for this.  Let's take a look.
Feature Selection Experiment
Permutation Feature Importance
This module requires us to input our trained model, as well as a testing dataset.  We also have to decide which metric we would like to use.  With that, it will output a dataset showing us the impact that each feature has on the model.

So, how does Permutation Feature Importance work?  Honestly, it's one of the more clever algorithms we've come across.  The module takes one feature at a time, randomly shuffles that feature's values across the rows of the test dataset, and then rescores the data with the trained model.  It can then evaluate the importance of that feature by seeing how much the model's performance metric dropped when the values were shuffled.  A very important feature would cause a large drop if its values were shuffled; a less important feature would have less impact.  In our case, we want to measure impact using Precision and Recall.  Unfortunately, the module only gives us the option to use one at a time.  Therefore, we'll have to be more creative.
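To make the mechanics concrete, here's a rough sketch of the permutation idea in Python with scikit-learn.  This is not the module's actual implementation, just the general technique, and the model and data are stand-ins.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
baseline = precision_score(y_test, model.predict(X_test))

rng = np.random.RandomState(0)
for i in range(X_test.shape[1]):
    shuffled = X_test.copy()
    shuffled[:, i] = rng.permutation(shuffled[:, i])          # scramble one feature in the test set
    drop = baseline - precision_score(y_test, model.predict(shuffled))
    print("Feature", i, "importance:", drop)                  # a big drop means an important feature

Let's start by looking at the output of the Precision module.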
Feature Importance (Precision) 1
Feature Importance (Precision) 2
We can see that there are a few features that are important.  The rest of the features have no impact on the model.  Let's look at the output from the Recall module.
Feature Importance (Recall) 1
Feature Importance (Recall) 2
Now, let's compare the results of the two modules.
Feature Importance
We can see that the two modules have almost the same output, except for V4, which is only important for Precision.  This means that we should be able to remove all of the other features without affecting the model.  Let's try it and see what happens.
Feature Reduction Experiment
Tune Model Hyperparameters Results
R Script Results
We can see that removing those features from the model did reduce the Precision and Recall.  There was not a dramatic reduction in these values, but there was a reduction nonetheless.  This is likely caused by rounding error.  What was originally a very small decimal value assigned to the importance of each feature was rounded to 0, causing us to think that there was no importance.  Therefore, we are at a decision point.  Do we remove the features knowing that they slightly hurt the model?  Since we're having no performance issues and model understandability is not a factor, we would say no.  It's better to keep the original model than it is to make a slimmer version.

It is important to note that in practice, we've never seen the "Permutation Feature Importance" module throw out the majority of the features.  Usually, there are a few features that have a negative impact.  As we slowly remove them one at a time, we eventually find that most of the features have a positive impact on the model.  While we won't get into the math behind the scenes, we will say that we highly suspect this unusual case was caused by the fact that we are only given a subset of the Principal Components created using Principal Component Analysis.

Hopefully, this post enlightened you to some of the thought process behind Feature Selection in Azure Machine Learning.  Permutation Feature Importance is a fast, simple way to improve the accuracy and performance of your model.  Stay tuned for the next post where we'll be talking about Feature Engineering.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, October 16, 2017

Azure Machine Learning in Practice: Threshold Selection

Today, we're going to continue with our Fraud Detection experiment.  If you haven't read our previous posts in this series, it's recommended that you do so.  They cover the Preparation, Data Cleansing, Model Selection and Model Evaluation phases of the experiment.  In this post, we're going to walk through the threshold selection process.

So far in this experiment, we've taken the standard Azure Machine Learning evaluation metrics without much thought.  However, an important thing to note is that all of these evaluation metrics assume that the prediction should be positive when the predicted probability is greater than .5 (or 50%), and negative otherwise.  This doesn't have to be the case.
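As a quick illustration (with made-up scores, not our actual experiment), moving the threshold trades precision against recall.

from sklearn.metrics import precision_score, recall_score

# Made-up actuals and scored probabilities for six transactions
actual = [0, 1, 0, 1, 0, 1]
probs = [0.10, 0.35, 0.45, 0.60, 0.70, 0.90]

for threshold in (0.5, 0.3):
    predicted = [1 if p >= threshold else 0 for p in probs]
    print(threshold,
          "precision:", precision_score(actual, predicted),
          "recall:", recall_score(actual, predicted))

# At .5 we get precision 0.67 and recall 0.67; at .3, precision falls to 0.60 while recall rises to 1.0.

Lowering the threshold catches more of the positives at the cost of more false alarms, and vice versa.  The trick is finding the threshold that best fits the business problem.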

In order to optimize the threshold for our data, we need a data set, a model and a module that optimizes the threshold for the data set and model.  We already have a data set and a model, as we've spent the last few posts building those.  What we're missing is a module to optimize the threshold.  For this, we're going to use an Execute R Script.

Execute R Script
Execute R Script Properties
We talked about R scripts in one of our very first Azure Machine Learning posts.  This is one of the ways in which Azure Machine Learning allows us to expand its functionality.  Since Azure Machine Learning doesn't have the threshold selection capabilities we're looking for, we'll build them ourselves.  Take a look at this R Script.

CODE BEGIN

dat <- maml.mapInputPort(1)

###############################################################
## Actual Values must be 0 for negative, 1 for positive.
## String Values are not allowed.
##
## You must supply the names of the Actual Value and Predicted
## Probability columns in the name.act and name.pred variables.
##
## In order to hone in on an optimal threshold, alter the
## values for min.out and max.out.
##
## If the script takes longer than a few minutes to run and the
## results are blank, reduce the num.thresh value.
###############################################################

name.act <- "Class"
name.pred <- "Scored Probabilities"
name.value <- c("Scored Probabilities")
num.thresh <- 1000
thresh.override <- c()
num.out <- 20
min.out <- -Inf
max.out <- Inf
num.obs <- length(dat[,1])
cost.tp.base <- 0
cost.tn.base <- 0
cost.fp.base <- 0
cost.fn.base <- 0

#############################################
## Choose an Optimize By option.  Options are
## "totalcost", "precision", "recall" and
## "precisionxrecall".
#############################################

opt.by <- "precisionxrecall"

act <- dat[,name.act]
act[is.na(act)] <- 0
pred <- dat[,name.pred]
pred[is.na(pred)] <- 0
value <- -dat[,name.value]
value[is.na(value)] <- 0

#########################
## Thresholds are Defined
#########################

if( length(thresh.override) > 0 ){
thresh <- thresh.override
num.thresh <- length(thresh)
}else if( num.obs <= num.thresh ){
thresh <- sort(pred)
num.thresh <- length(thresh)
}else{
thresh <- sort(pred)[floor(1:num.thresh * num.obs / num.thresh)]
}

#######################################
## Precision/Recall Curve is Calculated
#######################################

prec <- c()
rec <- c()
true.pos <- c()
true.neg <- c()
false.pos <- c()
false.neg <- c()
act.true <- sum(act)
cost.tp <- c()
cost.tn <- c()
cost.fp <- c()
cost.fn <- c()
cost <- c()

for(i in 1:num.thresh){
thresh.temp <- thresh[i]
pred.temp <- as.numeric(pred >= thresh.temp)
true.pos.temp <- act * pred.temp
true.pos[i] <- sum(true.pos.temp)
true.neg.temp <- (1-act) * (1-pred.temp)
true.neg[i] <- sum(true.neg.temp)
false.pos.temp <- (1-act) * pred.temp
false.pos[i] <- sum(false.pos.temp)
false.neg.temp <- act * (1-pred.temp)
false.neg[i] <- sum(false.neg.temp)
pred.true <- sum(pred.temp)
prec[i] <- true.pos[i] / pred.true
rec[i] <- true.pos[i] / act.true
cost.tp[i] <- cost.tp.base * true.pos[i]
cost.tn[i] <- cost.tn.base * true.neg[i]
cost.fp[i] <- cost.fp.base * false.pos[i]
cost.fn[i] <- cost.fn.base * false.neg[i]
}

cost <- cost.tp + cost.tn + cost.fp + cost.fn
prec.ord <- prec[order(rec)]
rec.ord <- rec[order(rec)]

plot(rec.ord, prec.ord, type = "l", main = "Precision/Recall Curve", xlab = "Recall", ylab = "Precision")

######################################################
## Area Under the Precision/Recall Curve is Calculated
######################################################

auc <- c()

for(i in 1:(num.thresh - 1)){
                auc[i] <- prec.ord[i] * ( rec.ord[i + 1] - rec.ord[i] )
}

#################
## Data is Output
#################

thresh.out <- 1:num.thresh * as.numeric(thresh >= min.out) * as.numeric(thresh <= max.out)
num.thresh.out <- length(thresh.out[thresh.out > 0])
min.thresh.out <- min(thresh.out[thresh.out > 0])

if( opt.by == "totalcost" ){
opt.val <- cost
}else if( opt.by == "precision" ){
opt.val <- prec
}else if( opt.by == "recall" ){
opt.val <- rec
}else if( opt.by == "precisionxrecall" ){
opt.val <- prec * rec
}

ind.opt <- order(opt.val, decreasing = TRUE)[1]

ind.out <- min.thresh.out + floor(1:num.out * num.thresh.out / num.out) - 1
out <- data.frame(rev(thresh[ind.out]), rev(true.pos[ind.out]), rev(true.neg[ind.out]), rev(false.pos[ind.out]), rev(false.neg[ind.out]), rev(prec[ind.out]), rev(rec[ind.out]), rev(c(0, auc)[ind.out]), rev(cost.tp[ind.out]), rev(cost.tn[ind.out]), rev(cost.fp[ind.out]), rev(cost.fn[ind.out]), rev(cost[ind.out]), rev(c(0,cumsum(auc))[ind.out]), thresh[ind.opt], prec[ind.opt], rec[ind.opt], cost.tp[ind.opt], cost.tn[ind.opt], cost.fp[ind.opt], cost.fn[ind.opt], cost[ind.opt])
names(out) <- c("Threshold", "True Positives", "True Negatives", "False Positives", "False Negatives", "Precision", "Recall", "Area Under P/R Curve", "True Positive Cost", "True Negative Cost", "False Positive Cost", "False Negative Cost", "Total Cost", "Cumulative Area Under P/R Curve", "Optimal Threshold", "Optimal Precision", "Optimal Recall", "Optimal True Positive Cost", "Optimal True Negative Cost", "Optimal False Positive Cost", "Optimal False Negative Cost", "Optimal Cost")

maml.mapOutputPort("out");

CODE END

We created this piece of code to help us examine the area under the Precision/Recall curve (AUC in Azure Machine Learning Studio refers to the area under the ROC curve) and to determine the optimal threshold for our data set.  It even allows us to input a custom cost function to determine how much money would have been saved and/or generated using the model.  Let's take a look at the results.
Experiment
Execute R Script Outputs
We can see that there are two different outputs from the "Execute R Script" module.  The first output, "Result Dataset", is where the data that we are trying to output is sent.  The second output, "R Device", is where any console output or graphics are sent.  Let's take a look at our results.
Execute R Script Results 1
Execute R Script Results 2
Execute R Script Graphics
We can see that this script outputs a subset of the thresholds evaluated, as well as the statistics using that threshold.  Also, it outputs the optimal threshold evaluated using a particular parameter that we will look at shortly.  Finally, it draws a graph of the Precision/Recall Curve.  For technical reasons, this graph is an approximation of the curve, not a complete representation.  A more accurate version of this curve can be seen in the "Evaluate Model" module.  We can choose our optimization value using the following set of code.

#############################################
## Choose an Optimize By option.  Options are
## "totalcost", "precision", "recall" and
## "precisionxrecall".
#############################################

opt.by <- "precisionxrecall"

In our case, we chose to optimize by using Precision * Recall.  Looking back at the results from our "Execute R Script", we see that our thresholds jump all the way from .000591 to .983291.  This is because the "Scored Probabilities" output from the "Score Model" module is very heavily skewed towards zero.  In turn, this skew is caused by the fact that our "Class" variable is heavily imbalanced.
Scored Probabilities Statistics
Scored Probabilities Histogram
Because of the way the R code is built, it determined that the optimal threshold of .001316 has a Precision of 82.9% and a Recall of 90.6%.  These values are worse than those originally reported by the "Tune Model Hyperparameters" module.  So, we can override the thresholds in our R code using the following line near the top of the script:

thresh.override <- (10:90)/100

This will tell the R script to forcibly use thresholds from .10 to .90.  Let's check out the results.
Overridden Threshold Results 1
Overridden Threshold Results 2
We can see that by moving our threshold down to .42, we can squeeze slightly more value out of our model.  However, this is such a small amount of value that it's not worth any amount of effort to do it in this case.

So, when would this be useful?  As with everything, it all comes down to dollars.  We can talk to stakeholders and clients all day about how much Accuracy, Precision and Recall our models have.  In general, they aren't interested in that type of information.  However, if we can input an estimate of their cost function into this script, then we can tie real dollars to the model.  We were able to use this script to show a client that they could save $200k per year in lost product using their model.  That had far more impact than a Precision value ever would.

Hopefully, this post sparked your interest in tuning your Azure Machine Learning models to maximize their effectiveness.  We also want to emphasize that you can use R and Python to greatly extend the usefulness of Azure Machine Learning.  Stay tuned for the next post where we'll be talking about Feature Selection.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com