Last month I blogged about histograms
, types of histograms and how to implement them in SSAS cubes. Since the entity-based ones are enabled in OLAP per se
, I’ve focused on the event-based histograms and explained how to achieve them using a step-by-step approach. The inspiration for that article originated from this thread
At the same time, another thread
was describing a similar problem – how to make a dimension based on calculated measure’s values? At first, I thought it can be solved with a slight modification of the Histogram solution. The only difference should have been the use of a real utility dimension (instead of a dummy one). In other words, scoping members of that dimension with a proper calculation instead of defining a new calculated measure. However, not everything went smoothly. Finally, I’ve decided to make an example in Adventure Works, solve this interesting problem and elaborate the distilled solution here in details. What’s the topic this time? Buckets
If you ask me, Wikipedia doesn’t have a good explaination
of what buckets are. In order to present this solution in a proper manner, first I’ll try to explaine buckets, then provide a solution on how to implement them in OLAP cubes.
Bucketization is a common process when dealing with data. Normally, you want to be able to perform analysis on every possible value of an entity, for example product, store, date, and their attributes as well. However, there are times when such a detailed analysis is either not advisable or not required. Loans, for example, are usually a huge dimension, customers also in some industries. Implementing an attribute based on the primary key of the respective dimension table can have serious consequences on the performance.
Continues values (or significant number of discrete values) are also sometimes bucketized. Values are split into discrete regions, bands, groups or ranges, which are nothing but various phrases for the same. Since the number of buckets is smaller than the number of distinct values, manipulation is easier and performance vs storage ratio potentially better. Yet, these buckets still represent the data in satisfactory way, meaning, we don’t lose information about it, we just reduce the resolution of analysis on that entity, since it suits us that way.
Typical examples can be found in Customer dimension (Yearly Income, Commute Distance) and Employee dimension (Base rate, Sick Leave Hours, Vacation Hours). Adventure Works, of course.
Some of the attributes mentioned above are implemented directly in dimension tables, as a range of values with their unique ID. The other using DiscretizationMethod
properties of an attribute. The latter are wonderfully explained in a series of articles by William Pearson
(Part 79-83, see the list below the article). A search on the web shows us that Chris Webb also blogged about buckets and how to build them dynamically
few years ago.
The principle with histograms was this – we could see the frequency of occurence of the key attribute of a dimension per any of its attributes, as long as there was a distinct count measure defined on that key attribute. Those were entity-based histograms. Event-based histograms were also possible, but we had to introduce a dummy dimension with a certain number of members and a new calculated measure that calculates the frequency of events in a fact per each of them. The question is – are there entity and event-based buckets too? And the answer is – yes!
Buckets are reduced set of values of an attribute. Besides that, they are attribute like any other. Any measure can be analyzed across them. Such buckets are entity-based. They are built using previously mentioned techniques. What’s also true is that they are static, they don’t change in fact table. But, that’s not what we’re after. We want our buckets to be dynamic, calculated from a fact. And for that we need an utility dimension. Only this time calculations must be placed on it, using scope command, so that we’re able to analyze all measures, whether regular or calculated, across these new virtual buckets. Here’s how.
1. Open a copy of Adventure Works 2008 database in BIDS. Adjust Deployment part of Project properties in order not to overwrite your original database. Specify Buckets (or anything you like) for the name of the database.
2. Open Adventure Works DW.dsv. Add New Named Query.
Specify Bucket for name and then add this T-SQL script:
SELECT 0 AS KeyColumn, ‘Bucket 0’ AS NameColumn
SELECT 1 AS KeyColumn, ‘Bucket 1’ AS NameColumn
SELECT 2 AS KeyColumn, ‘Bucket 2’ AS NameColumn
SELECT 3 AS KeyColumn, ‘Bucket 3’ AS NameColumn
SELECT 4 AS KeyColumn, ‘Bucket 4’ AS NameColumn
SELECT 5 AS KeyColumn, ‘Bucket 5’ AS NameColumn
SELECT 6 AS KeyColumn, ‘Bucket 6’ AS NameColumn
SELECT 7 AS KeyColumn, ‘Bucket 7’ AS NameColumn
SELECT 8 AS KeyColumn, ‘Bucket 8’ AS NameColumn
SELECT 9 AS KeyColumn, ‘Bucket 9’ AS NameColumn
SELECT 10 AS KeyColumn, ‘Bucket 10’ AS NameColumn
Test, execute, save.
3. Build a new dimension from that query. Name the dimension Bucket. Use KeyColumn and NameColumn for the same attribute (one and only) and provide MemberValue also (it should be KeyColumn as well). Order by key.
4. Add that dimension to the cube. Don’t link to any measure group.
5. Go to Calculation tab. Position cursor just below the Calculate; command. Paste this script:
Member CurrentCube.[Measures].[Weighted Discount Buckets] As
Int( [Measures].[Discount Amount] /
[Measures].[Reseller Sales Amount] * 10 )
, Format_String = ‘#,##0’
, Associated_Measure_Group = ‘Reseller Sales’
, Display_Folder = ‘WA’;
This = Aggregate( Existing [Promotion].[Promotion].[Promotion].MEMBERS,
iif( [Bucket].[Bucket].CurrentMember.MemberValue =
[Measures].[Weighted Discount Buckets],
6. Build and deploy.
7. Jump to Cube Browser tab.
Put Bucket dimension on rows. Put Weighted Discount Buckets measure from WA folder first, then couple of measures from Reseller Sales measure group. I selected some having Sum aggregation, others being calculated, in order to show how buckets behave. You can also try to slice per country in Sales Territories dimension or any other related dimension – results adjust accordingly.
That’s it. Buckets are implemented!
First we need to have a calculation for the buckets. In this example, we used weighted average of a discount percentage (which is Discount Amount over Reseller Sales Amount – we already have those two in our cube :-)). Discount percentage is related to a DimPromotion table. These two queries show us that discounts can be found only in reseller sales process – there are 12 distinct promotions having various discount percentages ranging from 0 to 0.5 (0% – 50%).
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct
select PromotionKey, UnitPriceDiscountPct
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct
Our buckets will be discount percentages, grouped in bands of ten (0-9 = 0, 10-19 = 1, … 91 – 100 = 10). All together, we need 10 members in our utility dimension and that’s what initial named query does.
Since that dimension is not linked to any measure group, by bringing it in cube browser, we only see the same values per each member – total values. In order to have correct values per each member, we need to reach for MDX script and write proper expressions. The important thing to notice is that expressions should be written before any other statement in MDX script in order to allow all calculated measures (that follow) to be aggregated the way they’re suppose to (after our scope!). What once was solve order, is now the order of statements in MDX script.
The first thing we do is we define our calculated measure that will be used to determine buckets. It’s the same definition as in [Measures].[Discount Percentage] calculated measure, except we rounded the values and grouped them in 10 buckets. That measure can later be made hidden, if wanted so.
The following scope handles how each bucket is to be calculated per any measure – as an aggregate of values across existing promotions where dimension bucket key is equal to bucket measure. That expression is actually our transformation of data, which allocates value on buckets. It’s sort of a projection of calculated bucket measure on bucket dimension. That’s how it can be visualized and understood. Inside, it is important to refer to the root member of bucket dimension, because it’s the only member that has data before transformations get applied. That’s why we used tuples.
We reached the end of this article. Let’s summarize it one more time.
It is possible to make analysis of cube data based on dynamically calculated measure/buckets. The solution is to build an utility dimension with predetermined number of members (buckets) and then assign a value for each bucket. Values are assigned using scope statement where the key of the bucket is compared to the value of the calculation (calculated bucket measure). Effectively, that is a transformation of data from a measure to a dimension. Speaking of which, what better way to finish this article than to provide a vivid example.
We all love music and have our favorite media player, either a device or an application on a computer. Small and simple ones provide only basic operations like Play, Pause, etc. The other, more advanced feature multimedia database (music catalog, tracking the number of times a song is being played) and visualization (spectrum analyzer) among other things.
Last time we’ve focused on PlayedCount field, that is, on the frequency of occurence of an event (a song being played). A histogram of that data would show how many songs (height of a vertical bar) have been played each number of times (points on x axis). Not a very interesting thing, right? A drill-down/drill-through on top of that would be more interesting (which songs are played N times). But there wouldn’t be drill if it weren’t for histogram in the first place, to separate the data. Therefore, histograms are important, although not always used as the final result.
In this article, we’ve focused on the visualization. Spectrum analyzer to be precise. Those small panels where several vertical bands show how loud the bass, middle and treble are. Or more bands, depending on the number of them it displays (i.e. 100 Hz, 300 kHz, 1 KHz, 3 kHz, 10 kHz). Bands are ranges of frequency which display the amplitude for any given part of a song.
Well, did you know that a song can also be considered a small fact? The amplitude in that case is the measure. Dimensions? Well, there’s a Time dimension for sure. Others? Not likely. What about the Frequency dimension? Doesn’t exist there. It is however required to be present in the end result. How come? Well, bands are nothing but buckets of frequency range, where the number of them is predetermined in advance. The only thing that’s missing is the calculation that projects the amplitude across buckets in spectrum analyzer. Ring a bell?
FFT (Fast Fourier Transformation). How about it ?