Monday, March 24, 2014

Partial Highlighting on Charts in Tableau

Today, we will talk about a concept that we're calling "Partial Highlighting."  In Tableau, you can link charts on a dashboard in a number of ways.  When you click on one chart, you can have the matching values in another chart either be highlighted or be filtered.  Let's examine these two options individually.  As usual, we will use the Superstore Sales sample data set from Tableau.

First, when you use a highlight action in Tableau, it will only highlight values from the same field as the one you clicked on.
Highlighting
As you can see, when we click on Jumbo Box, the Jumbo Box values in Sheets 1 and 3 are highlighted.  However, there is no highlighting on Sheet 2 because the Container field is not in the chart.  Let's move on to filtering.

When you use a filter action in Tableau, it will add a filter to the target chart so that only the rows matching the values you selected are shown.
Filtering
Now, when we click on Jumbo Box, the other Container values disappear completely from the other charts.  These are extremely useful options.  However, what if I wanted to click on Jumbo Box and have the bars in Sheet 2 highlight the portion of each bar that corresponds to Container = Jumbo Box?  Power View, the visual analytics portion of Microsoft's Power BI stack, actually does this very well.
Partial Highlighting
As you can see, when I click on Jumbo Drum, the top right bar graph highlights the appropriate portion, while still allowing us to see what the total was so that we don't lose perspective.  We recently had an epiphany about accomplishing this in Tableau and it's actually pretty easy!

In order to accomplish this, we're going to use what's known as a "Scaffold Data Source."  This idea was pioneered by Joe Mako.  You can find out more information by watching his "Think Data Thursday" presentation here.  Let's start by creating a regular bar graph.
Sales by Order Priority
If we place this on a dashboard using the filter action, then this chart will filter entirely based on another selection.  Next, we need to duplicate our data source so that we have two copies of it within Tableau.
Duplicate Data Source
Now, let's use the dual-axis feature to place the same set of bars on this chart again.  But, this time we're going to use SUM( [Sales] ) from our duplicated data source.
Sales by Order Priority (Hidden Dual Axis)
We renamed the [Sales] field in the secondary data source so that we knew which [Sales] was the primary.  Then, we created a dual axis bar graph, made sure the primary Sales was in front, synchronized the axes, hid the Blended Sales axis, and colored the Blended Sales bar a lighter shade of blue.  This way, it looks exactly like a regular bar graph.  Next, we need to make it so that these data sources blend on every field EXCEPT Container.
Unlinking Container
Once we do this, the secondary data source will not be filtered if we try to filter the primary data source on Container.  See where this is going?
Partial Highlighting (Tableau)
Now, when we click on Jumbo Drum in Sheet 1, the primary bar filters for Sheet 4.  However, since the secondary data source is not linked over Container, the underlying bar does not filter.  This gives the appearance of partial highlighting!  The really cool thing about this technique is that you can do it for lots of other types of charts.  All you need is a little imagination.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Analytics Consultant
Mariner, LLC
llewellyn.wb@gmail.com
https://www.linkedin.com/in/bradllewellyn

Monday, March 10, 2014

Predictive Analytics in Tableau Part 11: How Data Moves Between R and Tableau

Today, we will end this series on an exploratory note.  We've wondered throughout this series precisely how data travels between R and Tableau and what precautions need to be taken to ensure that we are getting accurate results.  So, we've designed a sample data set that will allow us to test a couple of scenarios and develop solutions to this problem.  Let's take a look at our data.
Data
First, we have a complete data set with years ranging from 2000 to 2014.  We also have five different numeric fields.  The first is ordered the same as the years.  The second is ordered in reversed/descending order.  The third is simply random.  The fourth is the same as the first, but with the years 2003 and 2008 missing.  The fifth is the same as the random one, but with the years 2003 and 2008 missing.  Now, let's start by passing ascending data and seeing if we get it back in the same order.
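For reference, a data set like the one just described can be sketched in R.  The specific values below are illustrative, not the post's exact numbers:

```r
# Sketch of the test data set: 15 years, with ascending, descending, random,
# and two "missing" variants (NA in 2003 and 2008)
years       <- 2000:2014
ascending   <- seq_along(years)          # ordered the same as the years
descending  <- rev(ascending)            # reversed/descending order
set.seed(42)                             # arbitrary seed, for reproducibility
random      <- sample(ascending)         # simply random
missing_asc <- replace(ascending, years %in% c(2003, 2008), NA)
missing_rnd <- replace(random,    years %in% c(2003, 2008), NA)
dat <- data.frame(years, ascending, descending, random, missing_asc, missing_rnd)
```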
Ascending (No Sort)
Ascending (No Sort) by Year
As expected, R returned the data back exactly like it should have.  Now, let's try it with the random data to see if the values matter (they shouldn't).
Random (No Sort)
Random (No Sort) by Year
Again, no issues.  Finally, let's see if the missing values cause any problems.
Missing (No Sort)
Missing (No Sort) by Year
We finally have an interesting result.  Apparently, the SCRIPT_REAL function can't return NULL values.  R refers to these null values as NA.  This leads us to an interesting dilemma.  Is Tableau passing a Null and receiving a 0, or is Tableau passing a 0 initially?  We can test this.  First, let's show a simple example.
NA + 1
This picture shows that R is just like most languages in that you can't perform arithmetic on Null values.  So, let's use this to test what value R is receiving.
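You can verify this in a standalone R session, outside of Tableau.  NA propagates through arithmetic:

```r
x <- c(1, 2, NA, 4)
x + 1         # NA propagates through arithmetic: 2 3 NA 5
is.na(x + 1)  # FALSE FALSE TRUE FALSE
```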
Missing (+ 1)
If R is receiving NAs, then this function should return NA, which Tableau would read as 0.  However, if Tableau is passing 0, then this function should return 1.  Let's see what happens.
Missing (+ 1) by Year
We see that Tableau is in fact passing 0 instead of Null.  This seems awkward to us.  Before we jump to any conclusions, let's confirm using a different method.
is.na
R has this nifty function called is.na() that returns true if the value is null.  So, let's run this test using Tableau.
Missing (is.na)
Missing (is.na) by Year
We have confirmed that Tableau is not passing Null values to R.  We don't see any reason why Tableau couldn't just pass NA in place of the Nulls.  Perhaps it has something to do with Rserve?  If you know anything, please let us know in the comments.  This leads us to another issue.  How do we combat this?  We have an idea, but let's see how R handles it.
Coercing NA
This is great!  It seems that if we pass the values to R as strings, then anything that isn't a number should be turned into NA.  This leads us to another question.  What does Tableau pass in place of a Null for Strings?
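Again, this is easy to confirm in a standalone R session.  `as.numeric()` coerces anything it can't parse as a number into NA:

```r
# "Null" cannot be parsed as a number, so it is coerced to NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning R emits.
nums <- suppressWarnings(as.numeric(c("1", "2", "Null", "4")))
nums  # 1 2 NA 4
```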
Missing (String)
Missing (String) by Year
We see that Tableau coerces Null values into the string value "Null".  This is great.  Now, let's see what happens when we pass it to R and coerce it back to a number.
Missing (String Passed)
Missing (String Passed) by Year
Success!  We now know that if you have missing data, you can fix the problem by coercing the values to strings before you pass them to R, then coercing them back to numbers in R.  However, this is a VERY serious issue.  It's shocking to us that a tool like Tableau would lack such simple functionality as Null handling.  We've put a post out on the Tableau forums to see if other people see it the same way.  You can see the post here.
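Here's a sketch of the round trip on the R side.  Assuming Tableau passes the field wrapped in STR(), the Nulls arrive as the literal string "Null", which `as.numeric()` turns back into NA:

```r
# Simulating what R receives when Tableau passes STR([Missing]):
# Null rows arrive as the string "Null" rather than a true null
passed <- c("5", "Null", "7", "Null", "9")
vals   <- suppressWarnings(as.numeric(passed))  # "Null" -> NA
vals   # 5 NA 7 NA 9
```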

EDIT:  Apparently this issue has been fixed in Tableau 8.1.3.

Now, let's move on to ordering.  What would happen if we passed the descending values to R, then sorted them and passed them back to Tableau?
Descending (Ascending Sort)
Descending (Ascending Sort) by Year
As you can see, Tableau was completely unaware that the data was sorted and sent it back in the order it was received.  This is a scary, yet unsurprising, result.  There are ways to handle this though.  Let's propose a real scenario for this.  Imagine that you want to create a time series that predicts the next value.  As we've seen in the last few posts, the data must be ordered in ascending order if you want to create a time series.  So, what would happen if we tried to pass in an unordered time series?
Unordered Time Series (Year)
Unordered Time Series (Value)
Unordered Time Series (Table)
As you can see, the time series gets returned in ascending order, regardless of the order in which it was received.  You might ask "Who would look at a time series out of order like you have in this table?"  That's where the interesting idea comes in.  Consider the traditional way you would look at time series data.
Time Series (Ordered)
How do you know what order the data is in?  Can you assume that the data is in ascending order?  Well, if we overlay Unordered Time Series (Year), we will see if any of the years don't match up.
Time Series (Ordered) with Labels
Following the labels, we see that everything is in order.  Is this always the case?  Our data was originally in ascending order by Year.  What if we were to reorder it?  When Tableau creates an extract, it automatically sorts the data.  We're not sure of the precise algorithm, but we do know that it was sorting our data by date.  So, we added another date in a different order in an attempt to trick Tableau into ordering the data improperly.
Data (Disordered)
Now, will our line chart still be in the correct order?
Time Series (Attempted Disordered)
As you can see, it doesn't matter whether the data is ordered or not.  These line charts always seem to be ordered by Date.  This is a very good thing.  It adds robustness to our procedures.  Now that we know these charts are always ordered, we can actually prevent a possible catastrophe.
Null Sorted
Notice how the NA gets removed when we sort?  If we were to sort data in R when we had missing data, we would be removing our missing data, which would have huge repercussions.  Namely, you can pass ten values to R and receive nine in return.  Tableau will toss an error if this happens.  Fortunately, this isn't a problem for our situations.  However, keep this in mind if you do find yourself in a situation where you are sorting data that may be missing.
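The length change is easy to demonstrate, since R's `sort()` drops NAs by default (its `na.last` parameter controls this):

```r
x <- c(3, NA, 1)
sort(x)                  # NA is silently dropped: 1 3 -- length 2, not 3!
sort(x, na.last = TRUE)  # keeps the NA at the end: 1 3 NA
```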

In short, Tableau expects to receive data in the order in which it passes it.  Therefore, if you need to order your data, order it in Tableau BEFORE you pass it to R.  Tableau also doesn't pass Null numeric values to R; it passes zeros.  If you want to pass numeric data that may be missing, you may need to coerce the data to strings first.  Thanks for reading.  We hope you found this informative.
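If you do need to sort inside R, one defensive pattern (a sketch of ours, not something from the posts above) is to remember the sort permutation and scatter the results back into the original positions before returning, so the output stays aligned with the rows Tableau passed in:

```r
vals <- c(30, 10, 20)                # the order Tableau passed the values in
ord  <- order(vals)                  # permutation that sorts vals: 2 3 1
result_sorted <- cumsum(vals[ord])   # placeholder computation on sorted data
out <- numeric(length(vals))
out[ord] <- result_sorted            # scatter results back to original positions
out                                  # 60 10 30, aligned with the input rows
```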

P.S.  Zen Master Jonathan Drummey has done a similar analysis and it can be found here.


Monday, March 3, 2014

Predictive Analytics in Tableau Part 10: Seasonal Decomposition

Today, we will talk about performing Seasonal Decomposition using Tableau 8.1's new R functionality.  Seasonal Decomposition is the process by which you determine what patterns over time exist in your data.  As usual, we will be using the Superstore Sales sample data set from Tableau.

Let's start by looking at our data.
Sales by Month
We can see that there is a definite "up-and-down" motion to the data.  This motion is known as "seasonality".  There also seems to be a very subtle downward trend to the data.  However, all of this relies on our eyes.  Let's use R to give us an actual decomposition.  It's important to note that this technique isn't necessarily used for prediction.  Therefore, the code is somewhat simplified because we won't be holding back observations.  Here's the code for Sales (Trend):

SCRIPT_REAL("
    library(forecast)

    year <- .arg1
    month <- .arg2
    sales <- .arg3
    len <- length(sales)

    ## Sorting the Data

    date <- year + month / 12
    dat <- cbind(year, month, sales)[sort(date, index.return = TRUE)$ix,]

    ## Decomposing the Time Series

    timeser <- ts(dat[,3], start = c(dat[1,1], dat[1,2]), end = c(dat[len,1], dat[len,2]), frequency = 12)
    as.vector(stl(timeser, s.window = 'period')[[1]][,2])
",

ATTR( YEAR( [Order Date] ) ), ATTR( MONTH( [Order Date] ) ), SUM( [Sales] ) )
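To see why the code indexes `[[1]][,2]`, here is a standalone look at what `stl()` returns.  This uses R's built-in monthly `co2` series purely for illustration; the post itself uses the Superstore sales:

```r
# stl() returns its decomposition as the first list element, a matrix called
# time.series with columns seasonal, trend, remainder -- hence [[1]][,2]
dec <- stl(co2, s.window = "periodic")
colnames(dec$time.series)   # "seasonal" "trend" "remainder"
trend <- dec[[1]][, 2]      # same matrix, trend column
```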

Now, let's see what it looks like.
Sales by Month (with Tableau and R Trends)
On this chart, we chose to add Tableau's "Trend Line" as well for comparison.  What Tableau calls a "Trend Line" is actually a Regression line, which we saw quite a bit of in the first few posts of this series.  R's seasonal decomposition trend is an entirely different type of trend.  As you can see, the seasonal decomposition trend adds much more information than the "Trend Line" does.  For any time series model, you don't want to consider the very early observations because they are based on only a few observations.  So, let's filter out the year 2009 using a lookup (this way it doesn't affect our R calculations) so that we can really see what's going on.
Sales by Month (with Tableau and R Trends) (without 2009)
Now, we can see that Tableau's "Trend Line" seems to imply that sales are increasing.  However, the seasonal decomposition trend shows that it's more of a subtle wave than an increasing line.  In cases of time series data, the Regression-based "Trend Line" that Tableau uses could lead to some faulty assessments.  To be fair to Tableau, this "Trend Line" is perfect for scatterplots.
Sales by Quantity (with Trend)
Now that we've seen the overall trend, let's look at the seasonality.  As usual, the code is identical except for the last line.  The rest of the code can be found in the appendix at the end of the post.

as.vector(stl(timeser, s.window = 'period')[[1]][,1])

Let's see what it looks like.
Sales by Month (with Seasonality)
We see that the bottom chart repeats every year.  So, we can use this to say that our sales peak in late autumn and winter (September - January).  Strangely enough, our sales are average in November.  Perhaps there's some driving factor to this.  Maybe people are saving up money in November to buy Christmas presents in December.  Alas, we digress.  We can also see that sales are low in late spring and summer (April - August).  This is hugely impactful information that you might not have been able to determine using other methods.  On another note, imagine that we ran a huge sale in November 2011.  How do we tell if the sale gave us higher than expected returns?  Seasonal Decomposition can help here too.  Here is the last line of code for Sales (Remainder):

as.vector(stl(timeser, s.window = 'period')[[1]][,3])

Sales by Month (with Remainder)
This chart shows us what was left once the Trend and Seasonality were removed, in other words, the remainder.  So, we see that November 2011 had sales significantly higher than expected.  Can this be explained by the sale we were running?  It certainly seems so.  Of course, our analysis wouldn't end here if we were actually analyzing this.  Alas, this is just a demonstration.

Finally, let's look at a way to predict new observations using seasonal decomposition.  With seasonal decomposition, our predictions would always be equal to the trend plus the seasonal effect.  For new observations, we will always have the seasonal effect because it repeats every year.  However, we won't have the trend.  There are plenty of different methods you could come up with to determine the trend.  For this method, we'll just say that the trend for new observations is equal to the last trend we know (December 2012).  So, once we add a few blank lines to our data set (which we've done before to create new observations), we can create our calculation.  Here are the last few lines:

    seasdec <- stl(timeser, s.window = 'period')
    for(i in 1:len.new){if(dat.new[i,2] == month.orig[len.new + 1]){start <- i; break;}}
    pred <- seasdec[[1]][start:(start + hold.orig[1] - 1),1] + seasdec[[1]][len.new,2]
    c(rep(NA,len.new),pred)
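The same idea (the seasonal effect repeats every 12 months, and the trend is held flat at its last known value) can be sketched on a standalone series.  Again, this uses the built-in `co2` data rather than the Superstore extract:

```r
# Naive 12-month-ahead forecast from a seasonal decomposition:
# repeat the last full seasonal cycle and hold the trend flat
dec        <- stl(co2, s.window = "periodic")
ts_mat     <- dec$time.series
n          <- nrow(ts_mat)
last_trend <- ts_mat[n, "trend"]                   # last known trend value
fc <- ts_mat[(n - 11):n, "seasonal"] + last_trend  # next 12 months
```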

Let's see what it looks like.
Sales by Month (With Forecast)
We can see that the forecasting process seems to be working.  It spikes in January and falls towards the middle of the year, which is what we expected.  In fairness, we're convinced that the usefulness of this procedure is in the analysis using remainders and seasonality, not in the forecasting.  Hopefully, this piqued your interest to give it a whirl.  Thanks for reading.  We hope you found this informative.


Appendix

Sales (Trend)

SCRIPT_REAL("
    library(forecast)

    year <- .arg1
    month <- .arg2
    sales <- .arg3
    len <- length(sales)

    ## Sorting the Data

    date <- year + month / 12
    dat <- cbind(year, month, sales)[sort(date, index.return = TRUE)$ix,]

    ## Decomposing the Time Series

    timeser <- ts(dat[,3], start = c(dat[1,1], dat[1,2]), end = c(dat[len,1], dat[len,2]), frequency = 12)
    as.vector(stl(timeser, s.window = 'period')[[1]][,2])
",


ATTR( YEAR( [Order Date] ) ), ATTR( MONTH( [Order Date] ) ), SUM( [Sales] ) )

Sales (Seasonality)

SCRIPT_REAL("
    library(forecast)

    year <- .arg1
    month <- .arg2
    sales <- .arg3
    len <- length(sales)

    ## Sorting the Data

    date <- year + month / 12
    dat <- cbind(year, month, sales)[sort(date, index.return = TRUE)$ix,]

    ## Decomposing the Time Series

    timeser <- ts(dat[,3], start = c(dat[1,1], dat[1,2]), end = c(dat[len,1], dat[len,2]), frequency = 12)
    as.vector(stl(timeser, s.window = 'period')[[1]][,1])
",


ATTR( YEAR( [Order Date] ) ), ATTR( MONTH( [Order Date] ) ), SUM( [Sales] ) )

Sales (Remainder)

SCRIPT_REAL("
    library(forecast)

    year <- .arg1
    month <- .arg2
    sales <- .arg3
    len <- length(sales)

    ## Sorting the Data

    date <- year + month / 12
    dat <- cbind(year, month, sales)[sort(date, index.return = TRUE)$ix,]

    ## Decomposing the Time Series

    timeser <- ts(dat[,3], start = c(dat[1,1], dat[1,2]), end = c(dat[len,1], dat[len,2]), frequency = 12)
    as.vector(stl(timeser, s.window = 'period')[[1]][,3])
",


ATTR( YEAR( [Order Date] ) ), ATTR( MONTH( [Order Date] ) ), SUM( [Sales] ) )

Sales (Forecast)

SCRIPT_REAL("
    library(forecast)

## Creating vectors

    hold.orig <- .arg4
    len.orig <- length( hold.orig )
    len.new <- len.orig - hold.orig[1]

    year.orig <- .arg1
    month.orig <- .arg2
    sales.orig <- .arg3

    ## Sorting the Data

    date.orig <- year.orig + month.orig / 12
    dat.orig <- cbind(year.orig, month.orig, sales.orig)[sort(date.orig, index.return = TRUE)$ix,]
    dat.new <- dat.orig[1:len.new,]

    ## Decomposing the Time Series

    timeser <- ts(dat.new[,3], start = c(dat.new[1,1], dat.new[1,2]), end = c(dat.new[len.new,1], dat.new[len.new,2]), frequency = 12)
    seasdec <- stl(timeser, s.window = 'period')

    ## Predicting the new Observations

    for(i in 1:len.new){if(dat.new[i,2] == month.orig[len.new + 1]){start <- i; break;}}
    pred <- seasdec[[1]][start:(start + hold.orig[1] - 1),1] + seasdec[[1]][len.new,2]
    c(rep(NA,len.new),pred)
",


ATTR( YEAR( [Order Date] ) ), ATTR( MONTH( [Order Date] ) ), SUM( [Sales] ), [Months to Forecast] )