Monday, May 19, 2014

Data Mining in Excel Part 3: Cleaning Outliers

Today, we will talk about the second feature in the Data Preparation Segment, Clean Data.
Data Preparation Segment
Most data that a business analyst would work with would probably not be perfectly clean.  There would be empty values, misspelled values, and worse.  This feature is designed to alleviate some of that.  When you click on "Clean Data", you have two options, Outliers and Re-Label.  In this post, we will be talking about the Outliers portion of this feature.
Clean Data
We added some bad income values to the data set.  So, how do I know if they are there?
Income (with -1)
The Explore Data feature we talked about in the previous post is perfect for this.  We can easily see that there are a large number of bad values in the Income column.  Now, let's get rid of them using the Outliers portion of the Clean Data feature.
Income Outliers
This feature shows us the distribution of our data again and we can see the same spike at -1 we saw before.  Unfortunately, this tool doesn't show us the values on the bottom axis, which is very disappointing.  However, it is pretty cool because you can move the sliders and see what data you would be removing.
Income Outliers (with Sliders)
However, in our case, we only need to remove the -1 values.  So, if we set the minimum at 0 and click "Next", we get a set of options.
Outlier Handling
Each of these options is useful in its own way.  For instance, if you were dealing with percentages, then you would want to cap the percentages at 0% and 100%.  So, you would use the "Change value to specified limits" feature in that case.  If you were interested in looking at the distribution of this column without this data, then you would either want to change these values to Null or delete the rows altogether.  You should only delete the rows if you have no need for any of the other information.  In some very rare cases, you want want to replace the values with the mean (average) value.  This process as a whole is called "Imputation".  If you're interested, you can learn more here.  For our case, we want to remove these observations because we only care about income.
Select Destination
Lastly, we need to decide whether or not we want to copy the data to a new sheet or change the data where it is.  You should be extremely careful when working with real data because you typically don't want to delete data.  If you're experimenting with Data Mining, it's best to always use a copy of the data.
Income (Clean)
Now that we've cleaned the data, we see that -1 values are gone.  Now, we can move on with our analysis.  Stay tuned for our next post where we'll talk about the Re-Label portion of the Clean Data feature.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn

No comments:

Post a Comment