Monday, September 19, 2016

Azure Machine Learning: HTTP Data, R Scripts, and Summarize Data

Today, we're going to take a look at Sample 1: Download dataset from UCI: Adult 2 class dataset from Azure ML.  Since we're all new to Azure ML, this is a great way to learn some of the neat functionality.  Let's walk through and learn some stuff!
Sample 1 Workflow
We can see that this workflow (Azure ML calls it an "experiment") contains four items, which Azure ML calls "modules".  There are two data input items leading into an R script, which is then passed through to a Summarize Data item.  Let's start with the "Enter Data Manually" item on the left.
Enter Data Manually
We can see that the data is a bunch of column names being input in CSV format.  We're not sure how this will be used just yet, but it could definitely be helpful for naming the columns how we want them named.  We can also look at start, end and elapsed times.  This would be great for debugging slow-running items.  Let's take a look at the output log.

Record Starts at UTC 07/13/2016 15:48:39:

Run the job:"/dll "Microsoft.Analytics.Modules.EnterData.Dll, Version=, Culture=neutral, PublicKeyToken=69c3241e6f0468ca;Microsoft.Analytics.Modules.EnterData.Dll.EnterData;Run" /Output0 "..\..\dataset\dataset.dataset" /dataFormat "CSV" /data "empty" /hasHeader "True"  /ContextFile "..\..\_context\ContextFile.txt""
[Start] Program::Main
[Start]     DataLabModuleDescriptionParser::ParseModuleDescriptionString
[Stop]     DataLabModuleDescriptionParser::ParseModuleDescriptionString. Duration = 00:00:00.0047673
[Start]     DllModuleMethod::DllModuleMethod
[Stop]     DllModuleMethod::DllModuleMethod. Duration = 00:00:00.0000228
[Start]     DllModuleMethod::Execute
[Start]         DataLabModuleBinder::BindModuleMethod
[Verbose]             moduleMethodDescription Microsoft.Analytics.Modules.EnterData.Dll, Version=, Culture=neutral, PublicKeyToken=69c3241e6f0468ca;Microsoft.Analytics.Modules.EnterData.Dll.EnterData;Run
[Verbose]             assemblyFullName Microsoft.Analytics.Modules.EnterData.Dll, Version=, Culture=neutral, PublicKeyToken=69c3241e6f0468ca
[Start]             DataLabModuleBinder::LoadModuleAssembly
[Verbose]                 Loaded moduleAssembly Microsoft.Analytics.Modules.EnterData.Dll, Version=, Culture=neutral, PublicKeyToken=69c3241e6f0468ca
[Stop]             DataLabModuleBinder::LoadModuleAssembly. Duration = 00:00:00.0081428
[Verbose]             moduleTypeName Microsoft.Analytics.Modules.EnterData.Dll.EnterData
[Verbose]             moduleMethodName Run
[Information]             Module FriendlyName : Enter Data Manually
[Information]             Module Release Status : Release
[Stop]         DataLabModuleBinder::BindModuleMethod. Duration = 00:00:00.0111598
[Start]         ParameterArgumentBinder::InitializeParameterValues
[Verbose]             parameterInfos count = 3
[Verbose]             parameterInfos[0] name = dataFormat , type = Microsoft.Analytics.Modules.EnterData.Dll.EnterData+EnterDataDataFormat
[Verbose]             Converted string 'CSV' to enum of type Microsoft.Analytics.Modules.EnterData.Dll.EnterData+EnterDataDataFormat
[Verbose]             parameterInfos[1] name = data , type = System.IO.StreamReader
[Verbose]             parameterInfos[2] name = hasHeader , type = System.Boolean
[Verbose]             Converted string 'True' to value of type System.Boolean
[Stop]         ParameterArgumentBinder::InitializeParameterValues. Duration = 00:00:00.0258120
[Verbose]         Begin invoking method Run ... 
[Verbose]         End invoking method Run
[Start]         DataLabOutputManager::ManageModuleReturnValue
[Verbose]             moduleReturnType = System.Tuple`1[T1]
[Start]             DataLabOutputManager::ConvertTupleOutputToFiles
[Verbose]                 tupleType = System.Tuple`1[Microsoft.Numerics.Data.Local.DataTable]
[Verbose]                 outputName Output0
[Start]                 DataTableDatasetHandler::HandleOutput
[Start]                     SidecarFiles::CreateVisualizationFiles
[Information]                         Creating dataset.visualization with key visualization...
[Stop]                     SidecarFiles::CreateVisualizationFiles. Duration = 00:00:00.1242780
[Start]                     SidecarFiles::CreateDatatableSchemaFile
[Information]                         SidecarFiles::CreateDatatableSchemaFile creating "..\..\dataset\dataset.schema"
[Stop]                     SidecarFiles::CreateDatatableSchemaFile. Duration = 00:00:00.0121113
[Start]                     SidecarFiles::CreateMetadataFile
[Information]                         SidecarFiles::CreateMetadataFile creating "..\..\dataset\dataset.metadata"
[Stop]                     SidecarFiles::CreateMetadataFile. Duration = 00:00:00.0055093
[Stop]                 DataTableDatasetHandler::HandleOutput. Duration = 00:00:00.5321402
[Stop]             DataLabOutputManager::ConvertTupleOutputToFiles. Duration = 00:00:00.5639918
[Stop]         DataLabOutputManager::ManageModuleReturnValue. Duration = 00:00:00.5668404
[Verbose]         {"InputParameters":{"Generic":{"dataFormat":"CSV","hasHeader":true},"Unknown":["Key: data, ValueType : System.IO.StreamReader"]},"OutputParameters":[{"Rows":15,"Columns":1,"estimatedSize":0,"ColumnTypes":{"System.String":1},"IsComplete":true,"Statistics":{"0":[15,0]}}],"ModuleType":"Microsoft.Analytics.Modules.EnterData.Dll","ModuleVersion":" Version=","AdditionalModuleInfo":"Microsoft.Analytics.Modules.EnterData.Dll, Version=, Culture=neutral, PublicKeyToken=69c3241e6f0468ca;Microsoft.Analytics.Modules.EnterData.Dll.EnterData;Run","Errors":"","Warnings":[],"Duration":"00:00:00.8298274"}
[Stop]     DllModuleMethod::Execute. Duration = 00:00:00.8603897
[Stop] Program::Main. Duration = 00:00:01.0831653
Module finished after a runtime of 00:00:01.1406311 with exit code 0

Record Ends at UTC 07/13/2016 15:48:40.

Yikes!  This is the raw execution log for the item.  Type names like System.Boolean and System.Tuple suggest that Azure ML is running .NET code under the hood.  There are some cool things to notice.  Every [Stop] entry reports a Duration, so you can see exactly which step took the longest.  This would be great for debugging.  Let's stay away from these outputs as they seem to be above our pay grade (for now!).
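If you ever do need to dig into one of these logs, the Duration values are easy to pull out with a few lines of R.  This is only a rough sketch: the log lines are copied from the output above, while to_seconds and the regular expression are our own hypothetical helpers, not anything Azure ML provides.

```r
# A few [Stop] lines copied from the log above
log_lines <- c(
  "[Stop]     DataLabModuleDescriptionParser::ParseModuleDescriptionString. Duration = 00:00:00.0047673",
  "[Stop]             DataLabModuleBinder::LoadModuleAssembly. Duration = 00:00:00.0081428",
  "[Stop]                 DataTableDatasetHandler::HandleOutput. Duration = 00:00:00.5321402",
  "[Stop] Program::Main. Duration = 00:00:01.0831653"
)

# Hypothetical helper: convert "hh:mm:ss.fffffff" strings to seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)))
}

# Capture the step name and the duration from each [Stop] line
pattern <- "^\\[Stop\\]\\s+(.*)\\. Duration = ([0-9:.]+)$"
steps <- data.frame(
  step = sub(pattern, "\\1", log_lines),
  seconds = to_seconds(sub(pattern, "\\2", log_lines)),
  stringsAsFactors = FALSE
)

# Slowest steps first
steps <- steps[order(steps$seconds, decreasing = TRUE), ]
print(steps)
```

Sorting by seconds puts the outermost step (Program::Main) at the top, followed by the nested steps that account for most of its time.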

For those of you that have experience with ETL tools like SSIS or Alteryx, you'll recognize that it's a pain sometimes to have to store the output of every single item in case you have to debug. Well, Azure ML makes this really easy.  Many of the items have a Visualize option that you can access by right-clicking on the item after a successful run.
Enter Data Manually (Visualize)
Let's see what's underneath!
Enter Data Manually (Visualization)
This is definitely the coolest browse feature we've seen.  On the left side, we can see the raw data (rows and columns).  In this case, we only have one column and fifteen rows.  However, above the column we can see a histogram of its values.  This is not very useful for a column of unique text values, but it would definitely be great for a refined data set.  On the right side, we can see some summary statistics: Unique Values, Missing Values, and Feature Type (Data Type).  We can also see a larger version of our histogram in this pane.  For this input, this isn't too enlightening, but we're already excited about what this will do when we have a real data set to throw at it.
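For comparison, here is roughly how we would compute the same three statistics for a single column in base R.  This is a sketch with made-up values, and we are assuming that Unique Values counts distinct non-missing values, which we haven't verified against Azure ML.

```r
# Made-up text column with one repeated value and one missing value
column <- c("red", "green", "red", NA, "blue")

profile <- list(
  unique_values  = length(unique(column[!is.na(column)])),  # distinct non-missing values
  missing_values = sum(is.na(column)),                      # count of NA entries
  feature_type   = class(column)                            # R's name for the data type
)
str(profile)

# Counts per value, which is what a histogram of a text column would plot
print(table(column, useNA = "ifany"))
```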

Let's move on to the other input item, Import Data.
Import Data
We can see that this item is using a Web URL as its data source.  This is a really cool area that's becoming more popular along with Data Science.  We can also see that this data is being read in as a CSV with no header row.  That's why we needed to assign our own column names in the other input.  It even gives us the option of using cached results.  Why would you ever want to use cached results?  We're not sure, but maybe one of the readers could drop a comment explaining it to us.  In case you want to see the raw data, here's the link.  Let's move on to the visualization.
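To see why a headerless CSV needs column names from somewhere else, here is a small base R sketch.  The inline text is a made-up stand-in for the downloaded file so that the example runs without a network connection.

```r
# Made-up CSV content with no header row
raw_csv <- "1,a,10.5\n2,b,20.0\n3,c,30.5"

# header = FALSE tells R that the first line is data, not column names
no_header <- read.csv(text = raw_csv, header = FALSE, stringsAsFactors = FALSE)

# R falls back to placeholder names, which say nothing about the columns
print(names(no_header))
print(no_header)
```

read.csv also accepts a URL in place of the text argument, so the same call pattern should work against a file hosted on the web.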
Import Data (Visualization)
Now this is what we've been waiting for!  We can look at all of the histograms to easily get a sense of how each column is distributed.  We can also click the "View As" option on the left side to change from histograms to box plots.
Import Data (Box Plots)
As you can see, this option only applies to numeric columns.  Looking at the Summary Statistics panel, we can see that we have a few more values for numeric columns than we did for the text column.  It gives us five statistics: Mean (Arithmetic Average), Median, Min, Max, and Standard Deviation.  Note that this isn't quite the classic "5-Number Summary", which is Min, 1st Quartile, Median, 3rd Quartile, and Max.  The Missing Values field is also pretty interesting for checking data quality.  All in all, this is the best data visualization screen we've seen, and it comes built into all of these tools.  Now that we've got a dataset we like, we can look at another option called "Save as Dataset".
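The numbers in the Summary Statistics panel are easy to reproduce in base R.  The sketch below uses a made-up numeric column; fivenum is included for comparison because it returns the classic five-number summary (min, lower hinge, median, upper hinge, max).

```r
# Made-up numeric column with one missing value
x <- c(2, 4, 4, 5, 7, 9, NA)

stats <- c(
  mean   = mean(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  min    = min(x, na.rm = TRUE),
  max    = max(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE)
)
print(stats)

# Classic five-number summary; fivenum drops NA values by default
print(fivenum(x))

# Missing Values count
print(sum(is.na(x)))

# boxplot(x) would draw the box plot version of the same information
```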
Import Data (Save As Dataset)
This would allow us to save this dataset so that we can use it later.  All we have to do is give it a name and it will show up in our "Saved Datasets" list.
Saved Datasets
Let's move on to the R script.
Execute R Script
This item gives us the option of creating an R Script to run, as well as defining a Random Seed and the version of R we would like to use.  The Random Seed is useful if you want to be able to replicate the results of a particular random experiment.  The R Version is great if you have a particular function or package that only works with certain versions.  This leads us to another question.  How do you manage R packages?  Is there a way to store R packages in your Azure Workspace so that you can use them in your scripts or do you need to have the script download the package every time?  Perhaps a reader could let us know.  Let's take a look at the script itself.
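To illustrate why a seed matters, here is a minimal base R sketch.  We are assuming the Random Seed setting has the same effect as calling set.seed at the top of the script, which we haven't confirmed.

```r
# Same seed, same "random" draw
set.seed(42)
first <- sample(1:100, 5)

set.seed(42)
second <- sample(1:100, 5)

print(identical(first, second))

# Without resetting the seed, the next draw comes from a different point in the stream
third <- sample(1:100, 5)
print(third)
```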

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
dataset2 <- maml.mapInputPort(2) # class: data.frame

# Contents of optional Zip port are in ./src/
# source("src/yourfile.R");
# load("src/yourData.rdata");

# Sample operation
colnames(dataset2) <- c(dataset1['column_name'])$column_name;
data.set = dataset2;

# You'll see this output in the R Device port.
# It'll have your stdout, stderr and PNG graphics device(s).

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set");

Whoever wrote this code did a good job of commenting, which makes it really easy to see what this does.  The two input data sets (column names and data) are stored as dataset1 and dataset2.  These two data sets are combined into a single data frame with the column names as the headers and the data as the data.  In R terminology, a data frame is very similar to a table in Excel or SQL.  It has a table name, column names, as well as values within each column.  Also, the values in different columns can be of different data types, as long as values within a single column are of a single data type.  Finally, this script outputs the data frame.  So, if we use the Visualize feature, we should see an identical data set to what we saw from Import Data, albeit with proper column names attached.
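If you want to try the renaming step outside of Azure ML, the sketch below swaps the maml.mapInputPort calls (which only exist inside Azure ML) for two small made-up data frames.  The values and column names are our own stand-ins, not the real data.

```r
# Stand-in for input port 1: a single column holding the desired names
dataset1 <- data.frame(column_name = c("id", "label"), stringsAsFactors = FALSE)

# Stand-in for input port 2: headerless data with placeholder names
dataset2 <- data.frame(V1 = c(1, 2), V2 = c("a", "b"), stringsAsFactors = FALSE)

# Same renaming idea as the script above, written with as.character()
colnames(dataset2) <- as.character(dataset1$column_name)
data.set <- dataset2

# Each column keeps its own data type
print(sapply(data.set, class))
print(data.set)
```

We used as.character() so that the names come through as plain text even if the column of names was read in as a factor.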
Execute R Script (Visualization)
Indeed this is the case.  There is another item in Azure ML that can handle this type of procedure called "Edit Metadata".  Perhaps they used an R Script as a display of functionality.  Either way, let's look at a specific feature of the "Execute R Script" item called "R Device".  This is basically a way to look at the R log from within Azure.
Execute R Script (R Device)
Execute R Script (R Log)
While this looks pretty simple, it's actually amazing.  One of our biggest frustrations with using R from other tools is that they make it difficult to debug code.  This log would make that just as easy as using the R Console.

Before we move on to the final item, we'd like to point out the "Run Selected" option you can see by right-clicking on any tool in the workspace.  When we initially saw this, we thought it would allow us to run only this tool using a cached set of data.  This would be a game changer when you are dealing with lengthy data import times.  However, this option runs the selected item, as well as any necessary items preceding it.  This is still really cool as it allows you to run segments of your experiment, but not as groundbreaking as we initially thought.  Let's move on to the "Summarize Data" item.
Summarize Data
Unfortunately, this item does not have any customization options.  It gives you every statistic, every time.  It effectively gives you the same values you would see if you looked at every column individually in the earlier Visualization windows.  It also gives you a few more values like 1st Quartile, 3rd Quartile, Mode and Mean Deviation.  We're not quite sure what Mean Deviation is.  There are a few similarly named statistical concepts like Mean Squared Error and the Standard Deviation of the Sample Mean.  Our best guess is that it's the mean absolute deviation (the average absolute distance of each value from the column mean), but we can't confirm that from the output alone.  Again, maybe a knowledgeable reader can enlighten us.  Regardless, this view is really interesting for looking at things like Missing Values.  It's immediately apparent which columns have an issue with missing values.  You can also see which columns have unusual mins, maxes, or means.  At the end of the day, this is a useful visualization if you want a high-level view of your dataset.
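Here is how we would compute the extra values ourselves in base R.  Treat this as a sketch: the data is made up, the "Mean Deviation" line assumes it means mean absolute deviation, and quantile uses R's default method, which may not match whatever Azure ML uses.

```r
# Made-up numeric column
x <- c(2, 4, 4, 5, 7, 9)

# 1st and 3rd quartiles (R's default quantile method)
q <- quantile(x, probs = c(0.25, 0.75))

# Mode: the most frequent value (the first one wins if there is a tie)
counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])

# Assumed meaning of "Mean Deviation": average absolute distance from the mean
mean_abs_dev <- mean(abs(x - mean(x)))

print(q)
print(mode_value)
print(mean_abs_dev)
```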

Hopefully, this piqued your interest in Azure ML (it definitely did for us!).  Join us next time when we walk through the next Sample to see what cool stuff Azure ML has in store.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
BI Engineer
Valorem Consulting