Aug 242009
 
Last month I blogged about histograms, types of histograms and how to implement them in SSAS cubes. Since the entity-based ones are enabled in OLAP per se, I’ve focused on the event-based histograms and explained how to achieve them using a step-by-step approach. The inspiration for that article originated from this thread.
 
At the same time, another thread was describing a similar problem – how to make a dimension based on calculated measure’s values? At first, I thought it can be solved with a slight modification of the Histogram solution. The only difference should have been the use of a real utility dimension (instead of a dummy one). In other words, scoping members of that dimension with a proper calculation instead of defining a new calculated measure. However, not everything went smoothly. Finally, I’ve decided to make an example in Adventure Works, solve this interesting problem and elaborate the distilled solution here in details. What’s the topic this time? Buckets!
 
 
 
INTRODUCTION
 
If you ask me, Wikipedia doesn’t have a good explaination of what buckets are. In order to present this solution in a proper manner, first I’ll try to explaine buckets, then provide a solution on how to implement them in OLAP cubes.
 
Bucketization is a common process when dealing with data. Normally, you want to be able to perform analysis on every possible value of an entity, for example product, store, date, and their attributes as well. However, there are times when such a detailed analysis is either not advisable or not required. Loans, for example, are usually a huge dimension, customers also in some industries. Implementing an attribute based on the primary key of the respective dimension table can have serious consequences on the performance.
 
Continues values (or significant number of discrete values) are also sometimes bucketized. Values are split into discrete regions, bands, groups or ranges, which are nothing but various phrases for the same. Since the number of buckets is smaller than the number of distinct values, manipulation is easier and performance vs storage ratio potentially better. Yet, these buckets still represent the data in satisfactory way, meaning, we don’t lose information about it, we just reduce the resolution of analysis on that entity, since it suits us that way.
 
Typical examples can be found in Customer dimension (Yearly Income, Commute Distance) and Employee dimension (Base rate, Sick Leave Hours, Vacation Hours). Adventure Works, of course.
 
Some of the attributes mentioned above are implemented directly in dimension tables, as a range of values with their unique ID. The other using DiscretizationMethod / DiscretizationBucketCount properties of an attribute. The latter are wonderfully explained in a series of articles by William Pearson (Part 79-83, see the list below the article). A search on the web shows us that Chris Webb also blogged about buckets and how to build them dynamically few years ago.
 
The principle with histograms was this – we could see the frequency of occurence of the key attribute of a dimension per any of its attributes, as long as there was a distinct count measure defined on that key attribute. Those were entity-based histograms. Event-based histograms were also possible, but we had to introduce a dummy dimension with a certain number of members and a new calculated measure that calculates the frequency of events in a fact per each of them. The question is – are there entity and event-based buckets too? And the answer is – yes!
 
Buckets are reduced set of values of an attribute. Besides that, they are attribute like any other. Any measure can be analyzed across them. Such buckets are entity-based. They are built using previously mentioned techniques. What’s also true is that they are static, they don’t change in fact table. But, that’s not what we’re after. We want our buckets to be dynamic, calculated from a fact. And for that we need an utility dimension. Only this time calculations must be placed on it, using scope command, so that we’re able to analyze all measures, whether regular or calculated, across these new virtual buckets. Here’s how.
 
 
 
BUILDING BUCKETS
 
1. Open a copy of Adventure Works 2008 database in BIDS. Adjust Deployment part of Project properties in order not to overwrite your original database. Specify Buckets (or anything you like) for the name of the database.
 
2. Open Adventure Works DW.dsv. Add New Named Query.
 
 

 
Specify Bucket for name and then add this T-SQL script:
 
SELECT        0 AS KeyColumn, ‘Bucket 0’ AS NameColumn
UNION ALL
SELECT        1 AS KeyColumn, ‘Bucket 1’ AS NameColumn
UNION ALL
SELECT        2 AS KeyColumn, ‘Bucket 2’ AS NameColumn
UNION ALL
SELECT        3 AS KeyColumn, ‘Bucket 3’ AS NameColumn
UNION ALL
SELECT        4 AS KeyColumn, ‘Bucket 4’ AS NameColumn
UNION ALL
SELECT        5 AS KeyColumn, ‘Bucket 5’ AS NameColumn
UNION ALL
SELECT        6 AS KeyColumn, ‘Bucket 6’ AS NameColumn
UNION ALL
SELECT        7 AS KeyColumn, ‘Bucket 7’ AS NameColumn
UNION ALL
SELECT        8 AS KeyColumn, ‘Bucket 8’ AS NameColumn
UNION ALL
SELECT        9 AS KeyColumn, ‘Bucket 9’ AS NameColumn
UNION ALL
SELECT        10 AS KeyColumn, ‘Bucket 10’ AS NameColumn
 
Test, execute, save.
 
3. Build a new dimension from that query. Name the dimension Bucket. Use KeyColumn and NameColumn for the same attribute (one and only) and provide MemberValue also (it should be KeyColumn as well). Order by key.
 
 

 
4. Add that dimension to the cube. Don’t link to any measure group.
 
 

 
5. Go to Calculation tab. Position cursor just below the Calculate; command. Paste this script:
 

Create

 

 

 

 

 

Member CurrentCube.[Measures].[Weighted Discount Buckets] As
Int( [Measures].[Discount Amount] /
     
[Measures].[Reseller Sales Amount] * 10 )
, Format_String = ‘#,##0’
, Associated_Measure_Group = ‘Reseller Sales’
, Display_Folder = ‘WA’;

 

Scope

 

 

 

 

 

(
       [Bucket].[Bucket].[Bucket].
MEMBERS 
     );
This = Aggregate( Existing [Promotion].[Promotion].[Promotion].MEMBERS,
            iif( [Bucket].[Bucket].CurrentMember.MemberValue =
                 (
                   [Measures].[Weighted Discount Buckets],
                   [Bucket].[Bucket].[All]
                 ),
                 (
                 
 [Measures].CurrentMember,
                   [Bucket].[Bucket].[All]
                 ),
                
null
              
 )
          );
End Scope;

 

 

 

 

 

 

6. Build and deploy.
 
7. Jump to Cube Browser tab.
 
 

 
Put Bucket dimension on rows. Put Weighted Discount Buckets measure from WA folder first, then couple of measures from Reseller Sales measure group. I selected some having Sum aggregation, others being calculated, in order to show how buckets behave. You can also try to slice per country in Sales Territories dimension or any other related dimension – results adjust accordingly.
 
That’s it. Buckets are implemented!
 
 
 
EXPLAINATION
 
First we need to have a calculation for the buckets. In this example, we used weighted average of a discount percentage (which is Discount Amount over Reseller Sales Amount – we already have those two in our cube :-)). Discount percentage is related to a DimPromotion table. These two queries show us that discounts can be found only in reseller sales process – there are 12 distinct promotions having various discount percentages ranging from 0 to 0.5 (0% – 50%).
 
 

select

PromotionKey, UnitPriceDiscountPct
from dbo.FactResellerSales
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct

  

select PromotionKey, UnitPriceDiscountPct
from dbo.FactInternetSales
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct


Our buckets will be discount percentages, grouped in bands of ten (0-9 = 0, 10-19 = 1, … 91 – 100 = 10). All together, we need 10 members in our utility dimension and that’s what initial named query does.

Since that dimension is not linked to any measure group, by bringing it in cube browser, we only see the same values per each member – total values. In order to have correct values per each member, we need to reach for MDX script and write proper expressions. The important thing to notice is that expressions should be written before any other statement in MDX script in order to allow all calculated measures (that follow) to be aggregated the way they’re suppose to (after our scope!). What once was solve order, is now the order of statements in MDX script.

The first thing we do is we define our calculated measure that will be used to determine buckets. It’s the same definition as in [Measures].[Discount Percentage] calculated measure, except we rounded the values and grouped them in 10 buckets. That measure can later be made hidden, if wanted so.

The following scope handles how each bucket is to be calculated per any measure – as an aggregate of values across existing promotions where dimension bucket key is equal to bucket measure. That expression is actually our transformation of data, which allocates value on buckets. It’s sort of a projection of calculated bucket measure on bucket dimension. That’s how it can be visualized and understood. Inside, it is important to refer to the root member of bucket dimension, because it’s the only member that has data before transformations get applied. That’s why we used tuples.

CONCLUSION

We reached the end of this article. Let’s summarize it one more time.

It is possible to make analysis of cube data based on dynamically calculated measure/buckets. The solution is to build an utility dimension with predetermined number of members (buckets) and then assign a value for each bucket. Values are assigned using scope statement where the key of the bucket is compared to the value of the calculation (calculated bucket measure). Effectively, that is a transformation of data from a measure to a dimension. Speaking of which, what better way to finish this article than to provide a vivid example.

We all love music and have our favorite media player, either a device or an application on a computer. Small and simple ones provide only basic operations like Play, Pause, etc. The other, more advanced feature multimedia database (music catalog, tracking the number of times a song is being played) and visualization (spectrum analyzer) among other things.

Last time we’ve focused on PlayedCount field, that is, on the frequency of occurence of an event (a song being played). A histogram of that data would show how many songs (height of a vertical bar) have been played each number of times (points on x axis). Not a very interesting thing, right? A drill-down/drill-through on top of that would be more interesting (which songs are played N times). But there wouldn’t be drill if it weren’t for histogram in the first place, to separate the data. Therefore, histograms are important, although not always used as the final result.

In this article, we’ve focused on the visualization. Spectrum analyzer to be precise. Those small panels where several vertical bands show how loud the bass, middle and treble are. Or more bands, depending on the number of them it displays (i.e. 100 Hz, 300 kHz, 1 KHz, 3 kHz, 10 kHz). Bands are ranges of frequency which display the amplitude for any given part of a song.

Well, did you know that a song can also be considered a small fact? The amplitude in that case is the measure. Dimensions? Well, there’s a Time dimension for sure. Others? Not likely. What about the Frequency dimension? Doesn’t exist there. It is however required to be present in the end result. How come? Well, bands are nothing but buckets of frequency range, where the number of them is predetermined in advance. The only thing that’s missing is the calculation that projects the amplitude across buckets in spectrum analyzer. Ring a bell?

FFT (Fast Fourier Transformation). How about it Smješko namiguje?

 
Aug 142009
 
Chris Webb recently blogged about a strange phenomenon regarding sets in where clause and auto-exists. Today it occured to me that I had experienced a similar issue while doing some testing a year from now. A quick search on the right words (Tomislav Darren set slicer) and the exact thread popped up first in SSAS forum search engine.
 
After I’ve read it, I was very much intrigued by Chris’ post, so I’ve read it and tested it too, and then decided to spend few hours in order to find out what’s going on in the examples he provided, hoping to discover a general rule of this “weird” behavior and, most of all, the answer “why”.
 
Whether I succeeded or not, I’ll leave up to you. The explaination follows.
 
 
 
First, let’s make the summary of facts.
 
Here is the list of queries, how the MDX looks like, what do they return (not values, we focus on members), as well as short description of behavior/mechanism behind.
 
 
 
Query 1 – returns the value of 01-07-2001 (member in slicer wins in advance by forcing fixed and explicit coordinate on 20010701)
select [Measures].[Internet Sales Amount] on 0,
[Date].[Fiscal].[All Periods] on 1
from [Adventure Works]
where( [Date].[Calendar].[Date].&[20010701] )
 
Query 2 – returns the value of All Periods (axis wins because there’s a set in slicer)
select [Measures].[Internet Sales Amount] on 0,
[Date].[Fiscal].[All Periods] on 1
from [Adventure Works]
where( { [Date].[Calendar].[Date].&[20010701] } )
 
Query 3 – returns the value of 2001 (“nobody” wins, a tuple is formed from axis and slicer members, which in the end manifests as 2001 has won)
select [Measures].[Internet Sales Amount] on 0,
[Date].[Fiscal].[All Periods] on 1
from [Adventure Works]
where( { [Date].[Calendar Year].&[2001] } )
 
Query 4 – returns the value of 01-07-2001 (“nobody” wins, a tuple is formed from axis and slicer members, which in the end manifests as 20010701 has won)
select [Measures].[Internet Sales Amount] on 0,
[Date].[Fiscal Year].&[2002] on 1
from [Adventure Works]
where( { [Date].[Calendar].[Date].&[20010701] } )
 
Query 5 – returns the value of 01-07-2001 (again, axis wins because there’s a set in slicer)
select [Measures].[Internet Sales Amount] on 0,
[Date].[Calendar].[Date].&[20010701] on 1
from [Adventure Works]
where( { [Date].[Fiscal].[All Periods] } )
 
Query 6 – returns the value of All Periods (again, axis wins because there’s a set in slicer, similar to Query 2)
select [Measures].[Internet Sales Amount] on 0,
[Date].[Fiscal].[All Periods] on 1
from [Adventure Works]
where( { [Date].[Calendar].[Date].&[20010701], [Date].[Calendar].[Date].&[20010702] } )
 
 
OK, now something about colors, which will help us explain this faster and easier.
 
All queries except the first one have { } in the slicer. All hierarchies preserve their color throughout the queries. Fiscal hierarchies are warm-colored (orange, maroon), calendar hierarchies are cold-colored (green, blue). Each hierarchy has it’s own color, each member a nuance of that color.
 
Queries 1, 2 and 5 are similar. Let’s call them core queries. Query 6 is modified core query, it has another member in slicer. Queries 3 and 4 are special, they have one of the members present in core queries, and another one from a separate hierarchy. Attribute hierarchy, to be precise (one level only). On the other hand, core queries have user hierarchies (many levels).
 
 
 
Two things should be separated: autoexists and evaluation of cells.
 
The first of those two mechanism triggers in all queries because hierarchies in slicer and on axis belong to the same dimension. Therefore, only “existing” members on axis survive. In these examples it is not so obvious, but if you replace .[All members] with .MEMBERS (on axis only, not in slicer), you’ll see what I mean. Besides the root member, you’ll also get all members that a related. For example, this query:

select

[Measures].[Internet Sales Amount] on 0,
[Date].[Fiscal].
members on 1
from [Adventure Works]
where({[Date].[Calendar].[Date].&[20010701]})

gives you this result:
Date.Fiscal       Internet Sales Amount
All Periods              $29,358,677.22
· FY 2002                 $7,072,084.24
· · H1 FY 2002            $3,266,373.66
· · · Q1 FY 2002          $1,453,522.89
· · · · July 2001           $473,388.16
· · · · · July 1, 2001       $14,477.34
 
 
The issue in this weird case is not about autoexists. It performs perfectly. Therefore we will not focus on that. We will focus on something else – the evaluation of cells.
 
The second mechanism is the evaluation of cells. It means that for each cell of resulting table a tuple is being formed and then evaluated. Normally, a tuple consists of members. All hierarchies members. One from each. Some of them are explicitely there, some not (but are taken into account during tuple construction). If they are there, it also depends on where they come from (calculated member, slicer and axes are typical areas). In the same order they win over each other. Forcing the weaker to implicitely adjust to them. Members on axes adjust to slicer (autoexist), and slicer might get overwritten by a calculated member (if the formula says so). Except when we have sets in slicer. Then the situation is different.
 
From now on it’s all about my speculation.
 
 
 
They do participate in autoexist (as we stated before, no problem there), but I believe they do not force the coordinate the way a single member in a tuple in slicer does.
 
I believe set in tuple is made weaker than an explicit member from an axis on purpose. Because of that, if there comes a member in a tuple that is related to the hierarchy of that set, a set is implicitely replaced with a proper, related, single member in that hierarchy. If with nothing below, then with the root member.
 
And why is it made so?
 
Sets can have more than one member. In SSAS 2000, you would have to construct a calculated member and use that in slicer. Since there was no attribute relations in SSAS 2000, you could combine it freely. In SSAS 2005+ it’s different. Attribute relations exist. A potential calculated member (aggregate of set, maybe it is being done internally), just like all calculated members, is never related to any real member on any hierarchy. When selected, it forces the rest of hierarchies (from same dimension only, to be precise) to shift their current members to their root.
 
If you only have that hierarchy in context, then fine, all other shift to All, there’s no conflict.
 
But what if another hierarchy is forcing its member in a tuple as an explicit member (i.e. a member coming from an axis)? That means we would have a conflict and would have to choose how to solve it, who’s to win. In SSAS 2005 and 2008 attributes are related. Inside a tuple they must all be aligned. As I stated earlier, there are rules who wins. Calculated members, then slicer, then axes. In this case, I guess it has been decided that a member is stronger than a set. Somebody had to step back.
 
Why? Because there might have been problems otherwise. Members in a set may have different related members on other hierarchies. More than one per same hierarchy. If the set was stronger, then we would end up with more than one “currentmember” on other hierarchies, which a tuple doesn’t support. We would end up with a subcube. Therefore, sets must be weaker and adjust.
 
 
 
How about they intersect, find a common area in multidimensional space and then evaluate? Well, it’s not that easy.
 
Some combinations change before they have a chance to intersect (form a tuple) because hierarchies are related (Query 1, 2, 5, 6). Others do intersect. In Query 3 namely, there are two fiscal years that match 2001 calendar year. It can’t be so, therefore, that hierarchy is shifted to root member, not to interfere. Likewise, Query 4 is ok, members form a tuple, intersect and get evaluated.
 
 
 
And if they would intersect before the attribute relations mechanism? OK, but how? Who should adjust to whom when you have an intersection that is not a tuple but a subcube (set of tuples)?
 
Here’s an example. Construct a set in slicer that causes SSAS to puke. As I witnessed a year ago, but failed to understand why.
 
SELECT

{   [Measures].[Sales Amount] } on 0
FROM
 
    [Adventure Works]
WHERE
( { [Date].[Calendar].[Calendar Quarter].&[2003]&[4],
    [Date].[Calendar].[Calendar Quarter].&[2002]&[4] },
    [Date].[Calendar Year].&[2003] )

 

Attribute relations are being triggered in order to get the rest of the tuple. The error was about a fiscal hierarchy (not present in a query!). Of course, it couldn’t be evaluated differently. Hierarchies in slicer are equally good, none of them should be prefered to win. One of them is having a set that implies 2 implicite currentmembers on another hierarchy. The result? An error:
 

Executing the query …

The MDX function failed because the coordinate for the ‘Fiscal Year’ attribute contains a set.

Execution complete

 
 
 
Ups, I haven’t explained how come the queries evaluate the way they did.
 
Query 1 is understandable, a coordinate is set and is kept throughout iteration over axes. Because a member in slicer is also the intersection of itself with axis members, the result is its value.
 
Query 2 is evaluated as All Periods because there was a set in slicer (no matter it has one member only, it is still a set for SSAS). There were no calculated members, no members in slicer, so there was nobody to stop axis members to force the coordinate and make the others (set from slicer) to adjust to it this time. It’s not the whole story, but I don’t want to bring all the details yet. Follow on.
 
In Query 3 there was a member, year 2001, from another hierarchy. Attribute hierarchy. Again, it was a set in slicer. However, it won. How come? The same thing happened as in Query 2. But after the end of the tuple formation process (attribute relations mechanism in action), by some miracle 2001 survived. Or it wasn’t such a miracle?
 
If you take a look at user hierarchies in Adventure Works database, you’ll see that Date and Month are common attributes in both Calendar and Fiscal user hierarchy. That means that in the Query 2 the date in July was overwritten because it was used in the Fiscal hierarchy as well as in Calendar hierarchy. Currentmember of Fiscal hierarchy was All Period. That we had on axis. Having established one and only one current member per that hierarchy caused Date hierarchy (as well as Date.Calendar hierarchy and many others) to shift their current member to All member (or related if any).
 
Now, in Query 3 we have another situation. We have a member from a hiearchy that is not directly related to All Periods on Fiscal hierarchy. Therefore, there was no overwriting. Both currentmembers, All Periods from Fiscal hierarchy and 2001 from Year attribute hierarchy came in the tuple and the result was evaluated as result for 2001. As there was nothing on axis.
 
Query 4 is a similar case. Both members end up in tuple, tuple is evaluated as intersection of them, meaning – 1st of July.
 
Query 5 is the opposite of Query 2. Members have switched their places. Since both can be found in the same hierarchy (Fiscal hierarchy), member on axis wins. One more time, Date attribute is the lowest level in both Fiscal and Calendar hierarchies. By selecting a date, we select date in both hierarchies. Now, it only matters who comes in tuple first (axis member in case of set in slicer) as an explicite member, and who gets adjusted (set from slicer) because that hierarchy current member place is already taken.
 
Query 6? Same thing. No matter the number of members in set in slicer, axis member wins.
 
 
 
Is there a solution? Is it a bug afterall? Is there a workaround?
 
From what I analyzed and concluded, it was a decision. Which means – by design. However, a big part of this analysis is bases on assumptions.
 
Can it be improved?
 
I believe it can. Sets in where block can be tested if they contain a single member. MDX is full of shortcuts, different ways of writing this and that, one more wouldn’t hurt – single-member in a set in slicer is converted to a member. Afterall, the reverse path is present. A member placed on axis without {} around it, a member in lots of functions and so on, is converted to a set. Downconversion from a set to a single member would make Query 2 evaluate the same way Query 1 did.
 
 
 
But what if sets have multiple members in slicer? What coordinate should be set up?
 
Can’t tell. That’s the problem I guess.
 
Attribute relations mechanism is so strong and eager to chew everything first, that it breaks some tooths along the way, in situation like these. From time to time it might be better to consider letting autoexists be the first in line. In my last example, the one with an error, autoexisting the slicer (intersecting, so that what survives – survives) would leave only one quarter and the year, not causing the error. However, in more complex examples it wouldn’t help.