Monday, April 2, 2018

Azure Machine Learning Workbench: Built-In Data Preparation, Part 2

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments, Built-In Data Sources and started looking at Built-In Data Preparation.  In this post, we're going to continue to focus on the built-in data preparation options that the AML Workbench provides.  If you haven't read the previous post, it's recommended that you do so now as it provides context around how to navigate the Data Preparation tab.
Iris Data
So far, the columns in this Dataflow have been given names, records with no "Species" have been filtered out and we've created some Inspectors.  Let's see what other options are available.
Dataflows Menu
The first relevant menu we find in the top bar is the "Dataflows" menu.  This menu has options for "Create Dataflow", "Rename Dataflow" and "Remove Dataflow".  "Rename Dataflow" and "Remove Dataflow" aren't very exciting.  However, "Create Dataflow" is more interesting than we initially thought.
Create Dataflow
This option allows us to create a new Dataflow using an existing Dataflow.  In fact, the existing Dataflow doesn't even need to be part of the same Data Preparation Package or Workspace.  We only need to be able to provide a DPREP file (this is what AML Workbench creates on the backend when we create a Data Preparation Package).  This means that we can have one set of steps stored in the first Dataflow, then use the result of those steps as the start of another Dataflow.  Why would we want this?  The major reason is that this allows us to create a "base" Dataflow that contains clean data, then use that Dataflow multiple times for different purposes.  We could even send this DPREP file to other people so they can see the same data we're seeing.  Please note that we are not recommending this as the preferred approach for sharing data.  We have databases, SQL and otherwise, for this task.

This raises an interesting question for us.  If we reference an external Dataflow using a DPREP file, then that Dataflow becomes an implicit component of our Dataflow.  However, if we update that DPREP file, does our Dataflow update as well, or is it a one-time import?
Update Test
We tested this on our own.  When we reference an external Dataflow, a refreshable, but NOT LIVE, connection is created.  When we update the external Dataflow, we need to refresh our Dataflow to see the changes propagate.

Just for fun, let's create a new Dataflow using the existing "Iris" Dataflow.

The next menu is the "Transformations" menu.  We could easily write an entire blog series on just the contents of this menu.  Obviously, there are the basics you would expect from any tool, such as changing data types, joins/appends and summarizations.  We'll touch on a couple of the more interesting features and leave the rest up to you to explore.

The most common type of transformation is adding a new column.  In AML Workbench, we have the option of creating a new column with the "Add Column (Script)" option from the "Transformations" menu.
Add Column (Script)
This opens up a new window where we can name our column, decide where it will be located in the dataset and create an expression.  It also provides a hint at the bottom in case we aren't sure how to use the script functionality.
Add Column (Script) Window 1

Add Column (Script) Window 2
The scripting language we will use here is Python v3.5.  As a quick showcase of functionality, we created a column called "Random", which contains a random value between 0 and 1.  In order to do this, we need a Python library called "random".  Since this library is not included by default, we need to import it first.  This leads us to the following code:
import random    # the "random" library isn't loaded by default, so import it first
random.random()  # returns a random float in the range [0, 1)
For those unfamiliar with Python, it is a case-sensitive language.  In this case, we are importing the "random" library, and calling the "random" function from within that library.

Another interesting thing to note about this window is that we don't have to use a custom expression.  The "Code Block Type" dropdown gives us the option of using a "module".  This would allow us to save a large, shareable block of code as a .py file.  Then we could use that module in the script by using the "import" command.  This is another victory for shareable code.
Code Block Type
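As a quick sketch, a module-based version of our "Random" column might look like the following.  Note that the file name "my_expressions.py" and the function name "random_value" are hypothetical names we chose for illustration; they are not anything AML Workbench provides.

# my_expressions.py (a hypothetical, shareable module file)
import random

def random_value():
    # Return a random float between 0 and 1.
    return random.random()

Then, the column script itself shrinks to two lines:

import my_expressions
my_expressions.random_value()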
Some of you may be thinking "Why did you create a useless column like Random?"  Turns out, it's not entirely useless for our purposes.  It allows us to show off our next piece of functionality, Sort.
Sort
The Sort option is also found within the "Transformations" menu.  If we select the "Random" column, then the "Sort" option, we can sort the dataset by the "Random" column.  This gives us a clean look at the different values in the columns.  While this has no analytical value, it does give us a quick overview of what we are looking at.
Iris Data (Sorted)
We could have also accomplished this by right-clicking the "Random" column header, and selecting the "Sort" option.
Sort (Right-Click)
Moving on, another interesting option within the "Transformations" menu is the "Advanced Filter (Script)" option.
Advanced Filter (Script)
This option allows us to use Python code to filter our dataset with any logic we can express in Python (which is almost anything we would ever want to do).
Advanced Filter (Script) Window
In this case, we decided to randomly filter out half of our data by using the "Random" column.  We could just as easily have filtered on one of the other columns.  We could even have created a brand new expression in this window and filtered on it.  The possibilities are almost endless.
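For reference, the filter we described looks something like the line below.  This is only a sketch; it assumes the filter script exposes each record through a variable named "row", so check the hint text in your version of the window for the exact form.

row['Random'] < 0.5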

The final script option we'll look at is "Transform Dataflow (Script)".  This can also be found in the "Transformations" menu.
Transform Dataflow (Script)
This option is basically the mother of all transformations.  This allows us to use Python to completely redesign our dataset in virtually any way we want.
Transform Dataflow (Script) Window
As a showcase of functionality, we used Python to create a new column, filter the dataset further and remove a column.  Here's the code we used:

import numpy as np  # numpy may not be pre-imported in the script window, so import it explicitly

df['Sepal Length (log)'] = np.log(df['Sepal Length'])  # add a log-transformed column
df = df[df['Random'] < .75]                             # keep only rows where "Random" is below .75
del df['Random']                                        # remove the helper column
As you can see, the Script transformations are incredibly powerful.  There's almost nothing they can't do.  We encourage you to look through the rest of the transformations on your own.

There is one final transformation we want to touch on, "Derive Column By Example".  This can also be found in the "Transformations" menu.  Instead of providing a strict formula, this option, as well as the other "By Example" transformations, allows us to provide examples of how the new field should work.  Then, AML Workbench will deduce what we're trying to accomplish.  This can be extremely beneficial when the equation is complex or we simply want to quickly showcase something.  Let's quickly turn the "Species" column into a numeric column by using this technique.
Derived Column
By simply providing the numeric labels for three rows in the dataset, AML Workbench was able to deduce exactly what we wanted.  This is some very cool functionality that really separates AML Workbench from most of the competition.
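In code, the effect is roughly equivalent to an explicit mapping like the sketch below.  This is only an illustration: the species names assume the standard iris labels, the numeric values are the examples we supplied, and the actual deduced logic lives inside AML Workbench rather than in a script we can see.

# A hypothetical, explicit version of what "Derive Column By Example" deduced for us.
species_to_num = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['Species (numeric)'] = df['Species'].map(species_to_num)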

This post showcased just how much power we can get from the Built-in Data Preparation tools within AML Workbench.  Hopefully, this piqued your interest to go try it on your own.  Stay tuned for the next post where we'll continue with the Built-In Data Preparation by looking at Inspectors.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, March 12, 2018

Azure Machine Learning Workbench: Built-In Data Preparation, Part 1

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started, Utilizing Different Environments and Built-In Data Sources.  In this post, we're going to focus on the built-in data preparation options that the AML Workbench provides.
Prepare Data
Let's start by navigating over to the "Data" pane and selecting the "Iris" Data Source.  From here, we have two options.  First, we could use an existing Data Preparation by selecting it from the "Data Preparations" list in the "Data" pane.  We also have the option of selecting the "Prepare" option at the top of the "iris" tab.  This option will allow us to start from scratch.  Let's go with the second option for now.
Prepare
In the "Prepare" window, we need to choose with "Data Preparation Package" we want to use.  Basically, a Data Preparation Package is a grouping of transformations that are run as a unit.  This package will import all of the necessary data and run any number of transformations on any number of data sources, with only a single line of code.
Data Preparation
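As an aside, this is where the "single line of code" claim comes from.  Once a package exists, a Python script can run it and get back a fully prepared pandas DataFrame.  The sketch below follows the pattern used by the Classifying_Iris template; treat the package name and Dataflow index as assumptions for illustration.

from azureml.dataprep import package

# Run the Data Preparation Package and return its first Dataflow as a pandas DataFrame.
df = package.run('iris.dprep', dataflow_idx=0)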
If we choose to use the existing iris.dprep package, we end up with a new tab that looks very similar to the Data Source tab, with a few additions.  On the left side of the tab, we have the "Dataflows" pane.  A Dataflow is a set of transformations performed on a single Data Source.  On the right side of the tab, we see the "Steps" pane.  A Step is a single transformation.  Therefore, we can see that multiple Steps are grouped into a single Dataflow and multiple Dataflows are grouped into a single Data Preparation Package.

Some of you may have noticed something strange about the table in the middle of the tab.  The columns suddenly have names, despite us never having supplied them.  We can even see these steps reflected in the "Steps" pane.  This is because we aren't looking at the Dataflow we just created.  Instead, we are looking at the existing Dataflow.  Our Dataflow is the second one on the list, with the identical name.
Empty Dataflow
If we select the second Dataflow in the "Dataflows" pane, we find our empty Dataflow.  In most cases, it's not very useful to have two different sets of transformations using the same data.  So, we'll throw away this Dataflow and use the one that's been provided to us.  However, it is important to note that it is possible to use the same Data Source multiple times within the same Data Preparation Package.
Remove Dataflow
Let's take a look at the existing Dataflow again.
Existing Dataflow
Steps Pane
The "Steps" pane shows us that three sets of transformation have been applied in this Dataflow.  First, the Dataflow was created.  Then, the column names were added since they did not exist in the original csv.  Finally, the Dataflow was filtered based on the "Species" column.  We can edit any of these steps by selecting the arrow next to them, then selecting "Edit".
Edit
Reference Dataflow
Rename Column
Filter Column
We can use these windows to see exactly what each step is doing and make any changes if we would like.
Step Options
Outside of editing the individual step, we also have the option of moving the step up or down in the Dataflow or deleting it entirely.  It's important to note that modifying steps further back in the Dataflow could potentially break steps that occur after it.  For instance, assume we are using column "A" in the calculation of column "B", then deleting column "A".  If we were to move the delete step before the calculation step, then the calculation of column "B" would break because column "A" no longer exists.
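To make the dependency concrete, here's a minimal pandas sketch of the same situation, using hypothetical columns "A" and "B":

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df['B'] = df['A'] * 2  # column "B" is calculated from column "A"
del df['A']            # deleting "A" after the calculation is safe

# If the delete step ran before the calculation, df['A'] * 2 would raise
# a KeyError, because column "A" would no longer exist.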
Halfway Complete
Another very important thing to notice about the "Steps" pane is that we can choose to look at the Dataflow after ANY number of steps.  For instance, if we select the "Rename Column3 to Petal Length" step, we can see the Dataflow as it looked after that step.  Notice that Column4 and Column5 have not been renamed yet.  Using this technique, we can add a new Step at this point, thereby adding it to the middle of the existing Dataflow.  This can be useful if we find that we missed a transformation along the line.
Inspectors
Let's finish by talking about "Inspectors".  Inspectors are nothing more than charts that give us a live view of our Dataflow.  We can see that we have a few different options to choose from.  More importantly, Inspectors are completely independent of Steps.  This means that we can create an Inspector to look at our data, then see how that particular data point changes with each step.  For instance, we have an Inspector that shows us the "Top 6 values of 'Species'".  If we move back to a step before the column "Species" existed, we see that this Inspector is no longer valid.  Obviously, this could be extremely helpful for examining the impact of certain filters or calculations.
No Species
There's way more to cover here than we have time for in this post.  Hopefully, this post opened your eyes to how easy it is to use the Built-In Data Preparation options in the Azure Machine Learning Workbench.  If you're eager to see more about Data Preparation in AML Workbench, read this.  Stay tuned for the next post where we'll be walking through some of the transformation options available in this extremely powerful tool.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, February 19, 2018

Azure Machine Learning Workbench: Built-In Data Sources

Today, we're going to continue our walkthrough of the "Classifying_Iris" template provided as part of the AML Workbench.  Previously, we've looked at Getting Started and Utilizing Different Environments.  In this post, we're going to focus on the built-in data source options that the AML Workbench provides.
Data
Let's start by opening the Data pane from the left side of the window.
Data Pane
Here, we have the option of selecting "Data Sources" or "Data Preparations".  "Data Sources" are what we use to bring our data in, as well as perform some basic transformations on it.  "Data Preparations" give us quite a few more options for transforming and cleaning our data.  Additionally, we can have multiple "Data Preparations" using the same "Data Source".  This allows us to create separate datasets for testing different types of transformations, or for creating entirely different datasets from the same base dataset.  Let's take a look at the "Data Source".
iris Data Source
In this tab, we can see a few different areas.  First, we can see the result of our data import in tabular format.  This gives us a quick glance at what the data looks like.  On the right side of the screen, we can see the steps that were taken to generate this data source.  For those familiar with the Query Editor in Power BI (formerly known as Power Query), this is a very similar interface.  We can alter any of the steps by clicking on the arrow beside them and selecting "Edit".  Let's do this for the first step, "Load iris.csv".
Edit Data Source
Edit Data Source Path
In this situation, the only option is to edit the location of the Data Source.  You can read more about supported data formats here.

Despite its spreadsheet feel and list of applied steps, the "Data Source" section has very few options.  In fact, the steps we see utilized are ALL of the steps available.  We cannot do any data transformation or manipulation in this tab.  However, we do have an interesting option at the top of the tab called "Metrics".
Metrics
iris Metrics
In this view, we can see a quick profile of the data (either a histogram or a bar chart based on the type of column), as well as a long list of metrics.  Here's a summary of the metrics provided.

Max Value: Largest Value in the Column

Min Value: Smallest Value in the Column

Count: Number of Records with Values in this column

Quantile at 50%: A measure of the "central" value in a dataset.  If the dataset was sorted, 50% of the values would be equal to or below this value.

Median: Same as Quantile at 50%

Kurtosis: Heaviness of the distribution's tails, i.e. a measure of how prone the distribution is to producing extreme observations.

Quantile at 75%: If the dataset was sorted, 75% of the values would be equal to or below this value.

Number of Missing Values: Number of Records with No Value in this column

Column Data Type: The type of values that appears in the column.

Standard Deviation: Spread of the distribution, i.e. a measure of the distance between values in the column.

Variance: Spread of the distribution, i.e. a measure of the distance between values in the column.  This is the square of the Standard Deviation.

Quantile at 25%: If the dataset was sorted, 25% of the values would be equal to or below this value.

Is Numeric Column: Whether the values in the columns are numbers and have the appropriate data type to match.

Number of NaNs: The number of records that contain values which are not numbers.  This does not consider the data type of the column and does not include missing values.

Mean Value: A measure of the "central" value in a dataset.  Calculated as the sum of all values in the column, divided by the number of values.  Commonly referred to as the "Average".

Unbiased Standard Error of the Mean: A measure of the stability of the "Sample Mean" across samples.  If we assume our dataset is a sample of a larger distribution, then that distribution likely has a Mean.  However, since our sample is only part of the overall distribution, the mean of the sample will take different values based on which records are included in the sample.  Therefore, we can say that the sample mean has its own distribution, known as a "Sampling Distribution".  The standard deviation of this Sampling Distribution is known as the "Standard Error of the Sample Mean".

Skewness: A measure of how NOT symmetric the distribution is, i.e. positive values signify the data has outliers larger than the mean, negative values signify the data has outliers smaller than the mean.

Most Common: The most common value in the column.  Commonly called the "Mode".  This only applies to non-numeric columns.

Count of Most Common: The number of Records that contain the most common value.  This only applies to non-numeric columns.

Number of Unique Values: The number of distinct values within the column, i.e. every distinct value is counted only once, regardless of how many times it appears in the column.
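If you ever want to reproduce these metrics outside the Workbench, pandas offers close equivalents for most of them.  The sketch below loads the iris data and supplies the standard column names; it's an illustration, not Workbench functionality.

import pandas as pd

df = pd.read_csv('iris.csv', header=None,
                 names=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species'])

col = df['Sepal Length']
print(col.max(), col.min(), col.count())  # Max Value, Min Value, Count
print(col.quantile([0.25, 0.5, 0.75]))    # Quantiles at 25%, 50% (Median) and 75%
print(col.mean(), col.std(), col.var())   # Mean, Standard Deviation, Variance
print(col.sem())                          # Unbiased Standard Error of the Mean
print(col.skew(), col.kurtosis())         # Skewness, Kurtosis
print(df['Species'].mode()[0])            # Most Common (Mode)
print(df['Species'].nunique())            # Number of Unique Values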
Metrics Filter
At the top of Metrics view, we also have the ability to filter the metrics we see.  This is a great way to get a view of only the information we are interested in.

Prepare (from Data view)

Prepare (from Metrics view)
At the top of the Data Source tab, regardless of whether we are looking at the Data or Metrics view, there's another option called "Prepare".  This will take us to the Data Preparations tab, which functions very similarly to the Query Editor in Power BI.  We'll cover this in the next post.

Instead of going through the rest of the options available for editing in the Data Sources tab, let's create a new data source to see it from scratch.  We've made a copy of the "iris.csv" file on our local machine.
Add Data Source
At the top of the Data pane, we can click on the "+" button, then "Add Data Source".
Data Store
On the next screen, we have the option of choosing where our data comes from.  We have a few different options.  We can read files from our local machine or blob storage.  We can read Parquet files from our local machine.  Parquet is an open-source columnar file format common with big data solutions.  You can read more about it here.  We can also read from Excel files in local or blob storage.  Finally, we can read from a SQL Server Database as well.  For more information on Data Sources, read this.  In our case, we want to pull from a csv on our local machine.  Therefore, we want to use the File(s)/Directory option.
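As a quick aside on the Parquet option mentioned above, Parquet files are just as easy to explore outside the Workbench.  Here's a minimal pandas sketch, assuming a hypothetical "iris.parquet" file and an installed Parquet engine such as pyarrow:

import pandas as pd

# Read a columnar Parquet file into a DataFrame.
df = pd.read_parquet('iris.parquet')
print(df.head())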
File Selection
In the next tab, we select the file(s) that we want to import.  In our case, this is the "iris - Copy.csv" file from our local machine.
File Details 1

File Details 2

File Details 3
The next tab, "File Details" has some awkward scrolling.  So, we had to make some interesting screenshot decisions to capture everything.  On this tab, AML Workbench has already looked at the file and guessed at most of these properties.  However, if we want to, we can change them.  We get to define our File Type.  The options are "Delimited File", "Fixed-Width File", "Plain Text File" and "JSON File".  In our case, we want to use a "Delimited File" with a Separator of "Comma".  We also get to decide whether we want to skip the first few lines.  In our case we do not want to do this.  We can also choose our File Encoding.  There are a number of options here.  In our case, we want to use utf-8.  We can also let the tool know whether our file(s) have headers, i.e. column names.  In this case, we do not.  The last option is whether we want to handle Quoted Line Breaks.  Honestly, we have no idea what this means as we've never seen it before.  Perhaps someone in the comments can let us know.  Finally, we get to see the results of our selections at the bottom of the tab.
Data Types 1

Data Types 2
On the next tab, AML Workbench automatically detects data types.  However, it's always a good idea to check them in case we have to make some changes.  The built-in options are "Numeric", "Date", "Boolean" (TRUE/FALSE) and "String".  These options all look correct.  It's also important to notice that AML Workbench added a field to the front of our dataset called "Path".  Since we only used one file, this field is pretty worthless.  However, if we were to pull in files from a directory, this could be a great way to determine which file a record came from.
Sampling
Depending on the size of the file, we may want to work with a sample to improve performance.  The available options are "Top N", "Full File", "Random N%" and "Random N", with a default of "Top 10,000".  Our file is small, so there's no harm in pulling in the full file.  As we see above, we can also create multiple sampling rules, then choose which one we want to be active.  This is an easy way to develop using a sample, then swap over to "Full File" when we're ready for full-scale testing.  Depending on your Python skills, you may also consider utilizing more complex sampling techniques later using a Python notebook, as sketched below.  We'll cover these in a later post.
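For anyone curious, here's a minimal pandas sketch of what those sampling rules might look like in code.  The file name and parameters are just illustrations:

import pandas as pd

df = pd.read_csv('iris - Copy.csv', header=None)  # the file has no header row

# "Random N%" style sample: a random 10% of the rows (fixed seed for repeatability)
sample = df.sample(frac=0.1, random_state=42)

# "Top N" style sample: just the first 10,000 rows
top_n = df.head(10000)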
Path Column
Finally, we get to choose whether we want to include the path column in our output data.  As we talked about earlier, there's no reason to include it in this case.
iris - Copy Data Source
Hopefully this post opened your eyes to the different possibilities for importing data into Azure Machine Learning Workbench.  This is an extremely powerful tool and we've only begun to scratch the surface.  Stay tuned for the next post where we'll walk through the Built-In Data Preparations.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Senior Analytics Associate - Data Science
Syntelli Solutions
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com