Aug 24, 2009
 
Last month I blogged about histograms: the types of histograms and how to implement them in SSAS cubes. Since the entity-based ones are enabled in OLAP per se, I focused on event-based histograms and explained how to achieve them using a step-by-step approach. The inspiration for that article originated from this thread.
 
At the same time, another thread described a similar problem – how to make a dimension based on a calculated measure's values. At first, I thought it could be solved with a slight modification of the histogram solution. The only difference should have been the use of a real utility dimension (instead of a dummy one) – in other words, scoping the members of that dimension with a proper calculation instead of defining a new calculated measure. However, not everything went smoothly. In the end, I decided to build an example in Adventure Works, solve this interesting problem, and elaborate the distilled solution here in detail. What's the topic this time? Buckets!
 
 
 
INTRODUCTION
 
If you ask me, Wikipedia doesn't have a good explanation of what buckets are. In order to present this solution properly, I'll first try to explain buckets, then provide a solution for implementing them in OLAP cubes.
 
Bucketization is a common process when dealing with data. Normally, you want to be able to perform analysis on every possible value of an entity – for example product, store, or date – and on its attributes as well. However, there are times when such a detailed analysis is either not advisable or not required. Loans, for example, usually form a huge dimension; in some industries, customers do too. Implementing an attribute based on the primary key of the respective dimension table can have serious consequences for performance.
 
Continuous values (or a significant number of discrete values) are also sometimes bucketized. Values are split into discrete regions, bands, groups or ranges – different names for the same thing. Since the number of buckets is smaller than the number of distinct values, manipulation is easier and the performance-to-storage ratio potentially better. Yet these buckets still represent the data in a satisfactory way; we don't lose information about it, we just reduce the resolution of analysis on that entity, since it suits us that way.
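For illustration, here is a minimal T-SQL sketch of such a reduction (assuming the DimCustomer table of the Adventure Works DW database; the 10K band width is arbitrary):

-- group a continuous yearly income into 10K-wide bands
SELECT
    CustomerKey,
    YearlyIncome,
    FLOOR(YearlyIncome / 10000) * 10000 AS IncomeBandLow  -- e.g. 37500 falls into the 30000 band
FROM dbo.DimCustomer;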
 
Typical examples can be found in the Customer dimension (Yearly Income, Commute Distance) and the Employee dimension (Base Rate, Sick Leave Hours, Vacation Hours). In Adventure Works, of course.
 
Some of the attributes mentioned above are implemented directly in dimension tables, as ranges of values with their own unique IDs. Others use the DiscretizationMethod / DiscretizationBucketCount properties of an attribute. The latter are wonderfully explained in a series of articles by William Pearson (Parts 79-83, see the list below the article). A search on the web shows that Chris Webb also blogged about buckets and how to build them dynamically a few years ago.
 
The principle with histograms was this – we could see the frequency of occurrence of the key attribute of a dimension per any of its attributes, as long as there was a distinct count measure defined on that key attribute. Those were entity-based histograms. Event-based histograms were also possible, but we had to introduce a dummy dimension with a certain number of members and a new calculated measure that calculates the frequency of events in a fact per each of them. The question is – are there entity-based and event-based buckets too? And the answer is – yes!
 
Buckets are a reduced set of values of an attribute. Aside from that, they are an attribute like any other: any measure can be analyzed across them. Such buckets are entity-based. They are built using the previously mentioned techniques. They are also static – they don't change with the fact table. But that's not what we're after. We want our buckets to be dynamic, calculated from a fact. And for that we need a utility dimension. Only this time the calculations must be placed on it, using the scope statement, so that we're able to analyze all measures, whether regular or calculated, across these new virtual buckets. Here's how.
 
 
 
BUILDING BUCKETS
 
1. Open a copy of the Adventure Works 2008 database in BIDS. Adjust the Deployment section of the project properties so you don't overwrite your original database. Specify Buckets (or anything you like) as the name of the database.
 
2. Open Adventure Works DW.dsv. Add New Named Query.
 
 

 
Specify Bucket as the name and then add this T-SQL script:
 
SELECT 0 AS KeyColumn, 'Bucket 0' AS NameColumn
UNION ALL
SELECT 1 AS KeyColumn, 'Bucket 1' AS NameColumn
UNION ALL
SELECT 2 AS KeyColumn, 'Bucket 2' AS NameColumn
UNION ALL
SELECT 3 AS KeyColumn, 'Bucket 3' AS NameColumn
UNION ALL
SELECT 4 AS KeyColumn, 'Bucket 4' AS NameColumn
UNION ALL
SELECT 5 AS KeyColumn, 'Bucket 5' AS NameColumn
UNION ALL
SELECT 6 AS KeyColumn, 'Bucket 6' AS NameColumn
UNION ALL
SELECT 7 AS KeyColumn, 'Bucket 7' AS NameColumn
UNION ALL
SELECT 8 AS KeyColumn, 'Bucket 8' AS NameColumn
UNION ALL
SELECT 9 AS KeyColumn, 'Bucket 9' AS NameColumn
UNION ALL
SELECT 10 AS KeyColumn, 'Bucket 10' AS NameColumn
 
Test, execute, save.
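As a side note, the same eleven rows can be generated more compactly with a table value constructor (a sketch; it requires SQL Server 2008 or later, while the UNION ALL version above works everywhere):

SELECT n AS KeyColumn,
       'Bucket ' + CAST(n AS varchar(2)) AS NameColumn
FROM ( VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10) ) AS v(n);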
 
3. Build a new dimension from that query. Name the dimension Bucket. Use KeyColumn and NameColumn for the same (one and only) attribute, and provide a MemberValue as well (it should also be KeyColumn). Order by key.
 
 

 
4. Add that dimension to the cube. Don't link it to any measure group.
 
 

 
5. Go to the Calculations tab. Position the cursor just below the Calculate; command. Paste this script:
 

Create Member CurrentCube.[Measures].[Weighted Discount Buckets] As
    Int( [Measures].[Discount Amount] /
         [Measures].[Reseller Sales Amount] * 10 )
, Format_String = '#,##0'
, Associated_Measure_Group = 'Reseller Sales'
, Display_Folder = 'WA';

Scope
    (
      [Bucket].[Bucket].[Bucket].MEMBERS
    );
    This = Aggregate( Existing [Promotion].[Promotion].[Promotion].MEMBERS,
                iif( [Bucket].[Bucket].CurrentMember.MemberValue =
                     (
                       [Measures].[Weighted Discount Buckets],
                       [Bucket].[Bucket].[All]
                     ),
                     (
                       [Measures].CurrentMember,
                       [Bucket].[Bucket].[All]
                     ),
                     null
                   )
            );
End Scope;
6. Build and deploy.
 
7. Jump to Cube Browser tab.
 
 

 
Put the Bucket dimension on rows. Put the Weighted Discount Buckets measure from the WA folder first, then a couple of measures from the Reseller Sales measure group. I selected some with Sum aggregation and some calculated ones, in order to show how buckets behave. You can also try to slice per country in the Sales Territory dimension, or any other related dimension – the results adjust accordingly.
 
That’s it. Buckets are implemented!
 
 
 
EXPLANATION
 
First we need a calculation for the buckets. In this example, we used the weighted average of the discount percentage (which is Discount Amount over Reseller Sales Amount – we already have those two in our cube :-)). The discount percentage is related to the DimPromotion table. These two queries show us that discounts can be found only in the reseller sales process – there are 12 distinct promotions with various discount percentages ranging from 0 to 0.5 (0% – 50%).
 
 

select PromotionKey, UnitPriceDiscountPct
from dbo.FactResellerSales
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct

select PromotionKey, UnitPriceDiscountPct
from dbo.FactInternetSales
group by PromotionKey, UnitPriceDiscountPct
order by PromotionKey, UnitPriceDiscountPct

Our buckets will be discount percentages grouped in bands of ten (0-9% = bucket 0, 10-19% = bucket 1, …, 90-99% = bucket 9, 100% = bucket 10). All together, we need 11 members in our utility dimension, and that's what the initial named query provides.
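To make the mapping concrete: a weighted discount of 27% gives Int( 0.27 * 10 ) = 2, so those values land in Bucket 2. A quick sanity check (a sketch, run against the deployed cube) is to list the bucket number per promotion:

SELECT
    [Measures].[Weighted Discount Buckets] ON COLUMNS,
    NON EMPTY [Promotion].[Promotion].[Promotion].MEMBERS ON ROWS
FROM [Adventure Works];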

Since that dimension is not linked to any measure group, bringing it into the cube browser shows the same values for each member – the total values. In order to get correct values per member, we need to reach for the MDX script and write the proper expressions. The important thing to notice is that these expressions should be written before any other statement in the MDX script, so that all calculated measures (which follow) are aggregated the way they're supposed to be (after our scope!). What once was solve order is now the order of statements in the MDX script.

The first thing we do is define the calculated measure that will be used to determine the buckets. It has the same definition as the [Measures].[Discount Percentage] calculated measure, except that we rounded the values and grouped them into 11 buckets. That measure can later be hidden, if desired.

The following scope handles how each bucket is to be calculated for any measure – as an aggregate of values across existing promotions where the bucket dimension key is equal to the bucket measure. That expression is actually our transformation of the data, which allocates values to buckets. It's a sort of projection of the calculated bucket measure onto the bucket dimension. That's how it can be visualized and understood. Inside, it is important to refer to the root member of the bucket dimension, because it's the only member that has data before the transformation is applied. That's why we used tuples.
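Once deployed, a query like this one (a sketch; the slicer member is only an example) returns any measure broken down by bucket, and the breakdown adjusts to the slicer:

SELECT
    { [Measures].[Reseller Sales Amount],
      [Measures].[Discount Amount] } ON COLUMNS,
    NON EMPTY [Bucket].[Bucket].[Bucket].MEMBERS ON ROWS
FROM [Adventure Works]
WHERE ( [Sales Territory].[Sales Territory Country].&[United States] );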

CONCLUSION

We reached the end of this article. Let’s summarize it one more time.

It is possible to analyze cube data based on a dynamically calculated measure, using buckets. The solution is to build a utility dimension with a predetermined number of members (buckets) and then assign a value to each bucket. Values are assigned using a scope statement in which the key of the bucket is compared to the value of the calculation (the calculated bucket measure). Effectively, that is a transformation of data from a measure to a dimension. Speaking of which, what better way to finish this article than to provide a vivid example.

We all love music and have our favorite media player, either a device or an application on a computer. Small and simple ones provide only basic operations like Play, Pause, etc. Others, more advanced, feature a multimedia database (music catalog, tracking the number of times a song has been played) and visualizations (spectrum analyzer), among other things.

Last time we’ve focused on PlayedCount field, that is, on the frequency of occurence of an event (a song being played). A histogram of that data would show how many songs (height of a vertical bar) have been played each number of times (points on x axis). Not a very interesting thing, right? A drill-down/drill-through on top of that would be more interesting (which songs are played N times). But there wouldn’t be drill if it weren’t for histogram in the first place, to separate the data. Therefore, histograms are important, although not always used as the final result.

In this article, we focused on the visualization – the spectrum analyzer, to be precise. Those small panels where several vertical bands show how loud the bass, middle and treble are. Or more bands, depending on how many the display supports (e.g. 100 Hz, 300 Hz, 1 kHz, 3 kHz, 10 kHz). Bands are ranges of frequency which display the amplitude for any given part of a song.

Well, did you know that a song can also be considered a small fact? The amplitude in that case is the measure. Dimensions? Well, there's a Time dimension for sure. Others? Not likely. What about a Frequency dimension? It doesn't exist there. It is, however, required to be present in the end result. How come? Well, bands are nothing but buckets of a frequency range, where their number is predetermined in advance. The only thing that's missing is the calculation that projects the amplitude across the buckets of the spectrum analyzer. Ring a bell?

FFT (Fast Fourier Transform). How about it? ;-)

 
Jul 24, 2009
 

 

Histograms are an important tool in data analysis and as such are used in a variety of fields: digital photography, databases, business, and so on. They represent the distribution of frequencies of an entity or event.

By its design, OLAP handles entities much better than events, and the same holds for histograms – entity-based histograms are supported easily, event-based ones are not. Yet, just like there was a way to count "in-flight" events using Mosha's solution, and a way to handle begin and end times using the approach of Richard Tkachuk and Thomas Kejser, there is certainly a way to score one more time over events in OLAP, this time by enabling event-based histograms. And here's how…

In order to use histograms, the first requirement is to have a DistinctCount measure, either physical, defined in a separate measure group, or calculated, defined in the MDX script. The first approach is faster, although it requires extra space.
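For reference, the calculated variant could look roughly like this (a sketch only – the measure name is made up, and it assumes the Internet Sales measure group and the Customer attribute hierarchy of Adventure Works):

Create Member CurrentCube.[Measures].[Customer Count (calculated)] As
    -- counts customers having rows in the Internet Sales measure group,
    -- within the current context
    Count( Exists( [Customer].[Customer].[Customer].MEMBERS,
                   , 'Internet Sales' ) )
, Format_String = '#,#';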

The second requirement is to have an entity (dimension), preferably with some attributes. The third is to have a fact table (cube).

All in all, nothing special except the DistinctCount measure, one per each entity of interest (customers, products, …). A fact and a dimension with its attributes are something we can expect to find in every cube.

Take Adventure Works 2008, for example. It meets all three requirements. Therefore, we'll use the Standard version of it as our example here (it can be downloaded from CodePlex in case you don't have it yet). But let's first return to the definition we used for a histogram, to analyze the difference between the two types of histogram.

We said there are entity-based and event-based frequency distributions.

When we say entity, we're actually thinking about an attribute of an entity, like shoe size, weight, height, number of children, marital status, and so on. When we say event, we're thinking about the occurrence of an entity (or its attribute, since they are related) in a series of events. And a series of events is another name for a fact table.

Adventure Works 2008 features the Customer Count measure, which can be found in the Internet Customers measure group. By using that measure over any attribute of the Customer dimension (particularly those found in the Demographic folder) we get entity-based histograms. Those histograms show us how many customers are single and how many are married, how many earn 10-30K$ and how many 130-170K$, and so on. If the histogram has higher values on the left side in both cases, we can conclude there are more single than married persons in our pool, and more of those earning less than average. Of course, in another pool (sliced per customer city or some other attribute) the situation might be different.
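An entity-based histogram is then just an ordinary query, for instance (a sketch using standard Adventure Works names):

SELECT
    [Measures].[Customer Count] ON COLUMNS,
    [Customer].[Marital Status].[Marital Status].MEMBERS ON ROWS
FROM [Adventure Works];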

The important thing to notice about entity-based histograms is this – they can only be sliced by additional attributes of that entity (dimension), not by other dimensions or their attributes. This means this type of histogram is static, time-irrelevant, and entity-oriented.

The other type of histogram is therefore dynamic, time-sensitive, and fact-oriented. Actually, any-dimension-sensitive, since it is fact-oriented. An example of such a histogram is this: count of customers on the y-axis, distinct number of orders on the x-axis. In other words, we have 0, 1, 2, and so on on the x-axis and our DistinctCount measure on the y-axis. With one problem – numerical values (measures) seem to be on both axes. And that's something OLAP doesn't support.

0, 1, 2 and so on represent the distinct number of orders. Although it looks like they should be calculated in MDX using some count expression, we cannot do that. Our task is to display a table/graph. That means we need to iterate over some dimension (rows) and provide the values (measure) across it (columns). We're talking about a 2D area, a grid.

Although it is possible to generate N calculated measures named 0, 1, 2 and so on and to provide a proper calculation for each of them, there are many obstacles in that approach. First, there can be many bands/intervals, and MDX doesn't support loops in its DDL (you can't generate calculated measures in a loop, only one by one, explicitly), which would require a lot of manual work. Secondly, new intervals might appear over time – some customers might score more than the designer of those calculated measures planned for in advance.
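To see why this doesn't scale, here is what the manual approach would look like (a sketch only, borrowing the Reseller Order Count measure used later in this article):

Create Member CurrentCube.[Measures].[Resellers With 0 Orders] As
    Sum( EXISTING [Reseller].[Reseller].[Reseller].MEMBERS,
         iif( [Measures].[Reseller Order Count] = 0, 1, null ) );

Create Member CurrentCube.[Measures].[Resellers With 1 Order] As
    Sum( EXISTING [Reseller].[Reseller].[Reseller].MEMBERS,
         iif( [Measures].[Reseller Order Count] = 1, 1, null ) );

-- ... and so on, one member per interval, written out by hand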

All that leads to a search for another approach. And there is one.

A better solution is to implement the distinct number of orders as a new dimension – a dummy dimension. It is similar to a shell/utility dimension (for those of you who have used one), but also quite different.

Normally, you make a shell/utility dimension with one regular member only and many calculated ones. Or you do it with one member and many dummy members, in order to avoid problems with calculated members in some tools :-).

Here we'll have all regular members, no calculated ones, and we'll have N+1 of them, where N is the maximum number of orders – all natural numbers, including zero as the first record (that's the +1). In contrast to a shell/utility dimension, our dummy dimension won't have any calculations on its members. Hence we call it "dummy."

We’ll start by creating a named query in DSV, naming it “Reseller Order Histogram” (don’t worry about the pop-up with an error, just click OK):

Create new Named Query in DSV

SELECT
    0 AS Interval

UNION ALL

SELECT
    TOP
    (
        SELECT TOP 1
            Frequency AS Interval
        FROM
            (
                SELECT
                    ResellerKey,
                    COUNT(SalesOrderNumber) AS Frequency
                FROM
                    (
                        SELECT DISTINCT
                            ResellerKey,
                            SalesOrderNumber
                        FROM
                            dbo.FactResellerSales
                    ) AS t1
                GROUP BY
                    ResellerKey
            ) AS t2
        ORDER BY
            Interval DESC
    )
    ROW_NUMBER() OVER(ORDER BY ResellerKey) AS Interval
FROM
    dbo.FactResellerSales

That T-SQL gives us N+1 rows, where N is the maximum number of orders made by any reseller in the fact. In Adventure Works 2008, that's 12.

It's a somewhat complex query, but if you execute its inner parts one by one, separately, you'll understand what's going on. Also, I'm no expert in T-SQL – perhaps it could be optimized :-).
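If you only want to verify N, the maximum itself can be obtained with a much simpler query (a sketch; it only checks the count, it doesn't generate the interval rows):

SELECT MAX(Frequency) AS MaxOrders
FROM ( SELECT ResellerKey,
              COUNT(DISTINCT SalesOrderNumber) AS Frequency
       FROM dbo.FactResellerSales
       GROUP BY ResellerKey
     ) AS t;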

Then we'll build a simple dimension, "Reseller Orders Histogram", and set the name and value columns to that same Interval field (see the picture below).

Reseller Orders Histogram dimension

We'll fully process that dimension, then add it to the Adventure Works cube without linking it to any measure group. Finally, we'll add the code below at the end of the MDX script, in order to create the calculations (new measures) that we'll use in combination with the Reseller Orders Histogram dimension:

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency] AS
     Sum( EXISTING [Reseller].[Reseller].[Reseller].MEMBERS,
          iif( [Reseller Orders Histogram].[Interval].CurrentMember.MemberValue =
               [Measures].[Reseller Order Count],
               1,
               null
             )
        )
, Format_String = '#,#'
, Associated_Measure_Group = 'Reseller Orders'
, Display_Folder = 'Histogram'
;

Scope( ( [Measures].[Reseller Orders Frequency],
         Root( [Reseller Orders Histogram] ) ) );
    This = Sum( [Reseller Orders Histogram].[Interval].[Interval].MEMBERS,
                [Measures].[Reseller Orders Frequency]
              );
End Scope;

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency Total] AS
     ( [Measures].[Reseller Orders Frequency],
       [Reseller Orders Histogram].[Interval].DefaultMember )
, Visible = 0
, Format_String = '#,#'
, Associated_Measure_Group = 'Reseller Orders'
, Display_Folder = 'Histogram'
;

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency %] AS
    iif( [Measures].[Reseller Orders Frequency Total] = 0,
         null,
         [Measures].[Reseller Orders Frequency] /
         [Measures].[Reseller Orders Frequency Total]
       )
, Format_String = 'Percent'
, Associated_Measure_Group = 'Reseller Orders'
, Display_Folder = 'Histogram'
;
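Once deployed, the histogram can also be queried directly, outside the cube browser (a sketch):

SELECT
    { [Measures].[Reseller Orders Frequency],
      [Measures].[Reseller Orders Frequency %] } ON COLUMNS,
    [Reseller Orders Histogram].[Interval].[Interval].MEMBERS ON ROWS
FROM [Adventure Works];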

After deploying the cube, we can test it in the cube browser (reconnecting first) in order to see the changes.

Open the measures, then the Reseller Orders measure group, then the Histogram folder. Drop both visible measures in the data area. Drop the Reseller Orders Histogram dimension on rows. See how the values populate the grid. Then test by adding years from the Date dimension on columns. See how the total stays the same (unless some years are unchecked) while individual years have their own data. You should know that you might experience some delays, depending on what you bring into the pivot (don't overdo it).

Using Histogram dimension in cube browser

That's it. You've made an event-based histogram. Hope you liked it :-).

In the end, it's worth saying that this solution can be reduced to a fixed number of intervals (to speed up the results, or to get better, smaller graphs when there are many intervals, e.g. > 50). One way would be to keep the first intervals while grouping the last ones: 0, 1, 2, … 25, 25+. A variant of this would be to group on both sides, in case all the data concentrates in the middle. Finally, the third option would be to reduce the resolution of the x-axis (sampling, making bands) by combining several adjacent intervals into one: 0, 1-10, 11-20, 21-30, etc. Keep in mind that the SQL and MDX calculations should be adjusted in each of these scenarios.
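For the first variant (0, 1, 2, … 25, 25+), the named query could be capped along these lines (a sketch; 25 is an arbitrary cap, and the catch-all member gets key 26):

SELECT n AS Interval,
       CASE WHEN n <= 25 THEN CAST(n AS varchar(5)) ELSE '25+' END AS IntervalName
FROM ( VALUES  (0), (1), (2), (3), (4), (5), (6), (7), (8), (9),(10),
              (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
              (21),(22),(23),(24),(25),(26) ) AS v(n);

On the MDX side, the comparison would then have to clamp the count as well, e.g. iif( [Measures].[Reseller Order Count] > 25, 26, [Measures].[Reseller Order Count] ).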

A final note: I deliberately switched the axes when I made the cube browser example, because the data can be seen much better this way (otherwise it spans too much to the right). Normally, when you design a graph for it, you should use a bar chart, preferably oriented vertically, as in the screenshot below.

Histogram in CubePlayer (bar chart)