Implementing buckets in OLAP

Aug 242009

Last month I blogged about histograms, types of histograms and how to implement them in SSAS cubes. Since the entity-based ones are enabled in OLAP per se, I’ve focused on the event-based histograms and explained how to achieve them using a step-by-step approach. The inspiration for that article originated from this thread.

At the same time, another thread was describing a similar problem – how to make a dimension based on calculated measure’s values? At first, I thought it can be solved with a slight modification of the Histogram solution. The only difference should have been the use of a real utility dimension (instead of a dummy one). In other words, scoping members of that dimension with a proper calculation instead of defining a new calculated measure. However, not everything went smoothly. Finally, I’ve decided to make an example in Adventure Works, solve this interesting problem and elaborate the distilled solution here in details. What’s the topic this time? Buckets!

INTRODUCTION

If you ask me, Wikipedia doesn’t have a good explaination of what buckets are. In order to present this solution in a proper manner, first I’ll try to explaine buckets, then provide a solution on how to implement them in OLAP cubes.

Bucketization is a common process when dealing with data. Normally, you want to be able to perform analysis on every possible value of an entity, for example product, store, date, and their attributes as well. However, there are times when such a detailed analysis is either not advisable or not required. Loans, for example, are usually a huge dimension, customers also in some industries. Implementing an attribute based on the primary key of the respective dimension table can have serious consequences on the performance.

Continues values (or significant number of discrete values) are also sometimes bucketized. Values are split into discrete regions, bands, groups or ranges, which are nothing but various phrases for the same. Since the number of buckets is smaller than the number of distinct values, manipulation is easier and performance vs storage ratio potentially better. Yet, these buckets still represent the data in satisfactory way, meaning, we don’t lose information about it, we just reduce the resolution of analysis on that entity, since it suits us that way.

Typical examples can be found in Customer dimension (Yearly Income, Commute Distance) and Employee dimension (Base rate, Sick Leave Hours, Vacation Hours). Adventure Works, of course.

Some of the attributes mentioned above are implemented directly in dimension tables, as a range of values with their unique ID. The other using DiscretizationMethod / DiscretizationBucketCount properties of an attribute. The latter are wonderfully explained in a series of articles by William Pearson (Part 79-83, see the list below the article). A search on the web shows us that Chris Webb also blogged about buckets and how to build them dynamically few years ago.

The principle with histograms was this – we could see the frequency of occurence of the key attribute of a dimension per any of its attributes, as long as there was a distinct count measure defined on that key attribute. Those were entity-based histograms. Event-based histograms were also possible, but we had to introduce a dummy dimension with a certain number of members and a new calculated measure that calculates the frequency of events in a fact per each of them. The question is – are there entity and event-based buckets too? And the answer is – yes!

Buckets are reduced set of values of an attribute. Besides that, they are attribute like any other. Any measure can be analyzed across them. Such buckets are entity-based. They are built using previously mentioned techniques. What’s also true is that they are static, they don’t change in fact table. But, that’s not what we’re after. We want our buckets to be dynamic, calculated from a fact. And for that we need an utility dimension. Only this time calculations must be placed on it, using scope command, so that we’re able to analyze all measures, whether regular or calculated, across these new virtual buckets. Here’s how.

BUILDING BUCKETS

1. Open a copy of Adventure Works 2008 database in BIDS. Adjust Deployment part of Project properties in order not to overwrite your original database. Specify Buckets (or anything you like) for the name of the database.

2. Open Adventure Works DW.dsv. Add New Named Query.

Specify Bucket for name and then add this T-SQL script:

SELECT        0 AS KeyColumn, ‘Bucket 0’ AS NameColumn
UNION ALL
SELECT        1 AS KeyColumn, ‘Bucket 1’ AS NameColumn
UNION ALL
SELECT        2 AS KeyColumn, ‘Bucket 2’ AS NameColumn
UNION ALL
SELECT        3 AS KeyColumn, ‘Bucket 3’ AS NameColumn
UNION ALL
SELECT        4 AS KeyColumn, ‘Bucket 4’ AS NameColumn
UNION ALL
SELECT        5 AS KeyColumn, ‘Bucket 5’ AS NameColumn
UNION ALL
SELECT        6 AS KeyColumn, ‘Bucket 6’ AS NameColumn
UNION ALL
SELECT        7 AS KeyColumn, ‘Bucket 7’ AS NameColumn
UNION ALL
SELECT        8 AS KeyColumn, ‘Bucket 8’ AS NameColumn
UNION ALL
SELECT        9 AS KeyColumn, ‘Bucket 9’ AS NameColumn
UNION ALL
SELECT        10 AS KeyColumn, ‘Bucket 10’ AS NameColumn

Test, execute, save.

3. Build a new dimension from that query. Name the dimension Bucket. Use KeyColumn and NameColumn for the same attribute (one and only) and provide MemberValue also (it should be KeyColumn as well). Order by key.

4. Add that dimension to the cube. Don’t link to any measure group.

5. Go to Calculation tab. Position cursor just below the Calculate; command. Paste this script:

Create

Member CurrentCube.[Measures].[Weighted Discount Buckets] As
Int( [Measures].[Discount Amount] /
[Measures].[Reseller Sales Amount] * 10 )
, Format_String = ‘#,##0’
, Associated_Measure_Group = ‘Reseller Sales’
, Display_Folder = ‘WA’;

Scope

(
      [Bucket].[Bucket].[Bucket].MEMBERS
     );
This = Aggregate( Existing [Promotion].[Promotion].[Promotion].MEMBERS,
            iif( [Bucket].[Bucket].CurrentMember.MemberValue =
                 (
                  [Measures].[Weighted Discount Buckets],
                  [Bucket].[Bucket].[All]
                 ),
                 (
                  [Measures].CurrentMember,
                   [Bucket].[Bucket].[All]
                 ),
                 null
               )
          );
End Scope;

6. Build and deploy.

7. Jump to Cube Browser tab.

Put Bucket dimension on rows. Put Weighted Discount Buckets measure from WA folder first, then couple of measures from Reseller Sales measure group. I selected some having Sum aggregation, others being calculated, in order to show how buckets behave. You can also try to slice per country in Sales Territories dimension or any other related dimension – results adjust accordingly.

That’s it. Buckets are implemented!

EXPLAINATION

First we need to have a calculation for the buckets. In this example, we used weighted average of a discount percentage (which is Discount Amount over Reseller Sales Amount – we already have those two in our cube :-)). Discount percentage is related to a DimPromotion table. These two queries show us that discounts can be found only in reseller sales process – there are 12 distinct promotions having various discount percentages ranging from 0 to 0.5 (0% – 50%).

select

PromotionKey, UnitPriceDiscountPct
from dbo.FactResellerSales
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct

select PromotionKey, UnitPriceDiscountPct
from dbo.FactInternetSales
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct

Our buckets will be discount percentages, grouped in bands of ten (0-9 = 0, 10-19 = 1, … 91 – 100 = 10). All together, we need 10 members in our utility dimension and that’s what initial named query does.

Since that dimension is not linked to any measure group, by bringing it in cube browser, we only see the same values per each member – total values. In order to have correct values per each member, we need to reach for MDX script and write proper expressions. The important thing to notice is that expressions should be written before any other statement in MDX script in order to allow all calculated measures (that follow) to be aggregated the way they’re suppose to (after our scope!). What once was solve order, is now the order of statements in MDX script.

The first thing we do is we define our calculated measure that will be used to determine buckets. It’s the same definition as in [Measures].[Discount Percentage] calculated measure, except we rounded the values and grouped them in 10 buckets. That measure can later be made hidden, if wanted so.

The following scope handles how each bucket is to be calculated per any measure – as an aggregate of values across existing promotions where dimension bucket key is equal to bucket measure. That expression is actually our transformation of data, which allocates value on buckets. It’s sort of a projection of calculated bucket measure on bucket dimension. That’s how it can be visualized and understood. Inside, it is important to refer to the root member of bucket dimension, because it’s the only member that has data before transformations get applied. That’s why we used tuples.

CONCLUSION

We reached the end of this article. Let’s summarize it one more time.

It is possible to make analysis of cube data based on dynamically calculated measure/buckets. The solution is to build an utility dimension with predetermined number of members (buckets) and then assign a value for each bucket. Values are assigned using scope statement where the key of the bucket is compared to the value of the calculation (calculated bucket measure). Effectively, that is a transformation of data from a measure to a dimension. Speaking of which, what better way to finish this article than to provide a vivid example.

We all love music and have our favorite media player, either a device or an application on a computer. Small and simple ones provide only basic operations like Play, Pause, etc. The other, more advanced feature multimedia database (music catalog, tracking the number of times a song is being played) and visualization (spectrum analyzer) among other things.

Last time we’ve focused on PlayedCount field, that is, on the frequency of occurence of an event (a song being played). A histogram of that data would show how many songs (height of a vertical bar) have been played each number of times (points on x axis). Not a very interesting thing, right? A drill-down/drill-through on top of that would be more interesting (which songs are played N times). But there wouldn’t be drill if it weren’t for histogram in the first place, to separate the data. Therefore, histograms are important, although not always used as the final result.

In this article, we’ve focused on the visualization. Spectrum analyzer to be precise. Those small panels where several vertical bands show how loud the bass, middle and treble are. Or more bands, depending on the number of them it displays (i.e. 100 Hz, 300 kHz, 1 KHz, 3 kHz, 10 kHz). Bands are ranges of frequency which display the amplitude for any given part of a song.

Well, did you know that a song can also be considered a small fact? The amplitude in that case is the measure. Dimensions? Well, there’s a Time dimension for sure. Others? Not likely. What about the Frequency dimension? Doesn’t exist there. It is however required to be present in the end result. How come? Well, bands are nothing but buckets of frequency range, where the number of them is predetermined in advance. The only thing that’s missing is the calculation that projects the amplitude across buckets in spectrum analyzer. Ring a bell?

FFT (Fast Fourier Transformation). How about it ?

22 Responses to “Implementing buckets in OLAP”

Daniel Upton says:

17.09.2009 at 00:00

Great. It works! Now, I’ll review the MDX. I look forward to applying these MDX buckets. Thank you, Tomislav.

Reply
Sam Kane says:

18.03.2011 at 00:04

Here are this and some other articles on Attribute Discretization:

http://ssas-wiki.com/w/Articles#Attribute_Discretization

Reply
Puneet says:

22.07.2011 at 11:50

Rating Dimension:

(ID, MainVariance, MaxVariance, Rating, StartDate, EndDate)

Rating Dimension values:

SELECT 1 AS ID, 0 AS MinVariance, 0.10 AS MaxVariance, 1 AS Rating, ’01/01/1900′ AS StartDate, ’06/06/2079′ AS EndDate FROM ApplDim

UNION

SELECT 2 AS ID, 0.11 AS MinVariance, 0.30 AS MaxVariance, 2 AS Rating, ’01/01/1900′ AS StartDate, ’06/06/2079′ AS EndDate FROM ApplDim AS ApplDim_5

UNION

SELECT 3 AS ID, 0.31 AS MinVariance, 1 AS MaxVariance, 3 AS Rating, ’01/01/1900′ AS StartDate, ’06/06/2079′ AS EndDate

FROM ApplDim AS ApplDim_4

UNION

SELECT 4 AS ID, 0 AS MinVariance, 0.20 AS MaxVariance, 1 AS Rating, ’01/01/2011′ AS StartDate, ’12/31/2011′ AS EndDate

FROM ApplDim AS ApplDim_3

UNION

SELECT 5 AS ID, 0.21 AS MinVariance, 0.50 AS MaxVariance, 2 AS Rating, ’01/01/2011′ AS StartDate, ’12/31/2011′ AS EndDate

FROM ApplDim AS ApplDim_2

UNION

SELECT 6 AS ID, 0.51 AS MinVariance, 1 AS MaxVariance, 3 AS Rating, ’01/01/2011′ AS StartDate, ’12/31/2011′ AS EndDate

——–

Created a set in Cube as “RatingSet” :

RatingSet

{

[Rating].[Rating].[Rating] *

[Rating].[End Date].[End Date] *

[Rating].[Max Variance].[Max Variance] *

[Rating].[Min Variance].[Min Variance] *

[Rating].[Start Date].[Start Date]

}

Calculation which needs to be made dynamic:-

IIF( ([Measures].[Variance Percent]>=0 and [Measures].[Variance Percent]0.10 and [Measures].[Variance Percent]=[Rating].[Min Variance].[Min Variance].members.Item(0)

and [Measures].[Ownership Variance Percent][Rating].[Min Variance].[Min Variance].members.Item(1)

and [Measures].[Ownership Variance Percent]<= [Rating].[Max Variance].[Max Variance].members.Item(1))
,[Rating].[Rating].[Rating].members.item(1)

,[Rating].[Rating].[Rating].members.item(2)

)

)

——

Now, How to make it dynamic as above???????

Any help will be appreciated.

Please suggest asap.

Many Thanks in Advance.

Reply
- Puneet says:
  
  22.07.2011 at 11:59
  
  Calculation which needs to be made dynamic:-
  
  IIF( ([Measures].[Variance Percent]>=0 and [Measures].[Variance Percent]0.10 and [Measures].[Variance Percent]=[Rating].[Min Variance].[Min Variance].members.Item(0)
  
  and [Measures].[Ownership Variance Percent][Rating].[Min Variance].[Min Variance].members.Item(1)
  
  and [Measures].[Ownership Variance Percent]<= [Rating].[Max Variance].[Max Variance].members.Item(1))
  ,[Rating].[Rating].[Rating].members.item(1)
  
  ,[Rating].[Rating].[Rating].members.item(2)
  
  )
  
  )
  
  ——
  
  Now, How to make it dynamic as above???????
  
  Any help will be appreciated.
  
  Please suggest asap.
  
  Many Thanks in Advance.
  
  Reply
- Tomislav Piasevoli says:
  
  26.07.2011 at 01:14
  
  Hi Puneet, it’s not clear what exactly you need, so I suggest we continue this via e-mail, ok?
  
  Reply
  - Monday says:
    
    08.03.2012 at 17:44
    
    Tomislav/Puneet,
    
    Where you able to resolve this because I have a similar problem and was wondering if you can share the solution. I’m using SSAS 2005 Standard Edition
    
    Reply
    - Tomislav Piasevoli says:
      
      09.03.2012 at 01:13
      
      No.
      
      Reply
Puneet says:

22.07.2011 at 12:01

Calculation which needs to be made dynamic:-

IIF( ([Measures].[Variance Percent]>=0 and [Measures].[Variance Percent]0.10 and [Measures].[Variance Percent]<= 0.30)

,2

,3 –Rating

)

)

i.e. Variance Percent will be 0, 10, 30, 50 etc… and Rating will be 0, 1, 2 3 etc…

Reply
Monday says:

08.03.2012 at 17:50

Tomislav,

is this solution edition specific? Because I cannot get this to work in SSAS 2005 Standard Edition. Just using it just as specified. My result is equivalent of a cross join. To try to get it to work I added Bucket dimension exactly as specified and worked it and Its still a cross join

Reply
- Tomislav Piasevoli says:
  
  09.03.2012 at 01:12
  
  As far as I know, it shouldn’t be edition specific. Maybe version specific, because of potentially different AW sample databases on 2005 and 2008. Can you verify that?
  
  Reply
Arnon says:

15.04.2012 at 22:10

Hi Tomislav,
Thanks! Your solution works for me! this was exactly what I was looking for.
What I think I miss here is the ability to relate another mesure group to the bucket diminsion.
Lets say that i have one fact table with one mesure group – so far so good. What will happend if i creat another measure group from that fact for the distinct count measure. How can this measure “projected” on the buckets?

Thanks,
Arnon

Reply
- Tomislav Piasevoli says:
  
  01.05.2012 at 22:11
  
  Hi Arnon, I haven’t tested it, but I guess it should be the same if there is a common dimension in both measure groups (Promotion in my case). Have you tried that?
  
  Reply
nina says:

18.07.2012 at 21:29

how to implementation Histogram using JavaScript?pleas help

Reply
- Tomislav Piasevoli says:
  
  18.07.2012 at 23:45
  
  I guess you just need to fire the corresponing MDX queries against the SSAS cube. First you need to modify the cube as explained in this article. Then, to communicate with your SSAS server, you will need to use XMLA. See Books Online for more info: http://msdn.microsoft.com/en-us/library/ms186604 .
  
  Reply
Mike Honey says:

29.08.2012 at 09:39

Hi Tomislav, nice technique and thanks for sharing it.

I’m trying to apply it at scale (my equivalent of your Promotions dimension has 370,000 members). At query time it just chews up all the server RAM before failing with memory pressure errors. This happens even when the query is otherwise filtered to a small set of members. I can get the Calculated Measure back fairly quickly, but using the Buckets dimension is the trigger for the memory pressure error.

Do you have any ideas on how I could tune your MDX for this scenario? I’m imagining there is some magic NonEmpty function or similar I could apply to reduce the workload?

Thanks – Mike

Reply
- Mike Honey says:
  
  13.03.2013 at 05:49
  
  I’ve got this performing to where it seems to be viable over my largish dimension. It runs a reasonable query (over a few hundred members) in around 10 secs, and can do the whole dimension in about 3mins.
  
  The key seemed to be adding NonEmpty e.g Aggregate ( NonEmpty ( Existing …
  
  I also added a Non_Empty_Behaviour property inside the SCOPE, against the driving measure, which seems very important for performance.
  
  My buckets are irregularly shaped, so I used a Member Property to pass in the lower and upper boundary of each bucket.
  
  Again, thanks for the tip!
  
  Reply
  - Tomislav Piasevoli says:
    
    14.03.2013 at 00:32
    
    Thank you for sharing your improvements.
    
    Reply
Mick Wagner says:

18.03.2013 at 20:27

Hi Tomislav,

Great method and presentation! Thanks for sharing.

This code works great for me but I feel that the names of the buckets, (“Bucket 1”, “Bucket2”, etc) are not very user friendly and don’t provide much information. I would like to provide the value ranges that each represents as its name. Since all of these values are dynamic based on the MDX query and the set involved, they can’t be hardcoded.

I’ve done some research and it looks like the undocumented MDX function “CreatePropertySet” may be the key. Chris Webb did something similar to this but I’m not sure if he’s doing the same thing as you.

http://cwebbbi.wordpress.com/2005/02/04/bucketisation/
http://www.sqlmag.com/article/quering/mdx-secret-code-revealed-16302

Thanks in advance,
Mick

Reply
- Tomislav Piasevoli says:
  
  04.04.2013 at 00:21
  
  I might have a look. I am not familiar with that function.
  
  Thanks you.
  
  Reply
Thomas Vande Sijpe says:

14.11.2013 at 14:32

Hi Tomislav,

I tried your solution and I almost got it working, but not quite.
I followed your solution and I guess that I’m doing something wrong somewhere along the way, but I can’t figure out what.

I’m trying to get an employee count by length of of employment, discretized by the above mentioned solution, but all I get is a value for the ‘All’-member of my buckets.

When I use the following statement:
“SELECT
[Measures].[Employee – Count] ON COLUMNS
, [Length Of Employment].[Length Of Employment].MEMBERS ON ROWS
FROM [myCube]”

, the result I get is the following:
Employee – Count
All 66
0Y – <2Y (null)
2Y – <4Y (null)
4Y – <6Y (null)
6Y – <8Y (null)
8Y – <10Y (null)
+10Y (null)
N/A (null)

I placed the calculated member and the scope script immediately after the CALCULATE as mentioned above, but apparently it doesn't to the aggregation as I expect.

I would expect a result like the following:
Employee – Count
All 66
0Y – <2Y 15
2Y – <4Y 20
4Y – <6Y 15
6Y – <8Y 10
8Y – <10Y 5
+10Y 1
N/A (null)

Can you share your thoughts on what I might be doing wrong?

Thanks,
Thomas

Reply
Andrew says:

27.03.2014 at 14:43

Hi Tomislav,

Great article, looks like something i am looking for.

But just can’t get it working. The screenshots in youre article dissappeared http 404 error. Maybe there is some thing I missed. Can you fix the screenshots..

Andrew

Reply
- Tomislav Piasevoli says:
  
  31.03.2014 at 00:53
  
  Hi Andrew, thank you for pointing to the problem with pictures. Looks like something went wrong with them over time. I’ll try to fix them.
  
  Reply

Tomislav Piasevoli – BI Blog

Implementing buckets in OLAP

22 Responses to “Implementing buckets in OLAP”

Leave a Reply Cancel reply