Implementing histograms in OLAP

Jul 242009

Histograms are an important tool in data analysis and as such are used in a variety of fields: digital photography, databases, business, and so on. They represent the distribution of frequencies of an entity or event.

By its design, OLAP handles entities much better than events and the same is with histograms – entity-based histograms are supported easily, event-based not. Yet, just like there was a way to count “in-flight” events using Mosha’s solution and handle begin and end times using the approach of Richard Tkachuk and Thomas Kejser, there is certainly a way how to score one more time over events in OLAP, this time by enabling event-based histograms. And here’s how …

In order to use histograms, the first requirement is to have DistinctCount measure, either physical, defined in a separate measure group, or calculated, defined in MDX script. The first approach is faster, although requires extra space.

The second requirement is to have an entity (dimension), preferably with some attributes. The third is to have a fact table (cube).

All in all, nothing special except DistinctCount measure, one per each entity of interest (customers, products, …). A fact and a dimension with its attributes is something we can expect to find in every cube.

Take Adventure Works 2008, for example. It meets all 3 requirements. Therefore, we’ll use the Standard version of it as our example here (which can be downloaded from Codeplex in case you don’t have it yet). But let’s first return to the definition we used for histogram, to analyze the difference between two types of histogram.

We said there are entity-based frequency distribution and event-based frequency distribution.

When we say entity, we’re actually thinking about an attribute of an entity, like shoe size, weight, height, number of children, maritual status, and so on. When we say event, we’re thinking about occurrence of an entity (or its attribute, since they are related) in a series of events. And series of events is another name for fact table.

Adventure Works 2008 features Customer Count measure, which can be found in Internet Customers measure group. By using that measure over any attribute of Customer dimension (particulary those found in Demographic folder) we get entity-based histograms. Those histograms show us how many customers are single and how many are married, how many earn 10-30K$ and how many 130-170K$, and so on. If the histogram has higher values on the left side in both cases, we can conclude there are more single than married persons in our pool and more of those earning less than average. Of course, in another pool (slice per customer city or some other attribute) the situation might be different.

The important thing to notice about entity-based histogram is this – they can only be sliced by additional attributes of that entity (dimension), not by other dimensions or their attributes. Which means this type of histogram is static, time-irrelevant, entity-oriented.

The other type of histogram is therefore dynamic, time-sensitive, fact-oriented. Actually, any-dimension-sensitive, since it is fact-oriented. An example of such histogram is this: count of customers on y-axis, distinct number of orders on x-axis. In other words, we have 0, 1, 2, and so on on x-axis and our DistinctCount measure on y-axis. With one problem – numerical values (measures) seems to be on both axis. And that’s something OLAP doesn’t support.

0, 1, 2 and so on represent distinct number of orders. Although it looks like they should be calculated in MDX using some count expression, we can not do that. Our task is to display a table/graph. That means we need to iterate over some dimension (rows) and provide the values (measure) across it (columns). We’re talking about 2D area, a grid.

Although it is possible to generate N calculated measures named as 0, 1, 2 and so on and to provide a proper calculation for each of them, there are many obstacles in that approach. First, there can be many bands/intervals and MDX doesn’t support loops in its DDL (can’t generate calculated measures in a loop, only one by one explicitely). Which would require a lot of manual work. Secondly, new intervals might appear over time. Some customers might score more than the designer of those calculated measures have planned in advance.

All that leads to a search for another approach. And there is one.

A better solution would be to implement the distinct number of orders as a new dimension, a dummy dimension. Which is similar to shell/utility dimension (for those of you who used it sometime), but quite different also.

Normally, you make shell/utility dimension with one member only and many calculated ones. Or, you do it with one member and many dummy members, in order to avoid problems with calculated members in some tools :-).

Here we’ll have all regular members, no calculated ones, and we’ll have N+1 of them, where N is the maximum number of orders. All natural numbers, including zero as the first record (that’s +1). In contrast to shell/utility dimension, our dummy dimension won’t have any calculations on its members. Hence we call it “dummy.”

We’ll start by creating a named query in DSV, naming it “Reseller Order Histogram” (don’t worry about the pop-up with an error, just click OK):

SELECT
0 AS Interval

UNION ALL

SELECT
    TOP
    (
        SELECT
            TOP 1
            Frequency AS Interval
        FROM
            (
            SELECT
                ResellerKey,
                COUNT(SalesOrderNumber) AS Frequency
            FROM
                ( SELECT
                      Distinct
                      ResellerKey,
                      SalesOrderNumber
                  FROM
                      dbo.FactResellerSales
               ) AS t1
        GROUP BY
            ResellerKey
             ) AS t2
        ORDER BY
            Interval DESC
    )
    ROW_NUMBER() OVER(ORDER BY ResellerKey) AS Interval
FROM
    dbo.FactResellerSales

That T-SQL gives us N+1 rows where N is the maximum number of orders made by any reseller in a fact. In Adventure Works 2008, that’s 12.

It’s a bit complex query, but if you execute inner parts of it one by one, separately, you’ll understand what’s going on. Also, I’m not expert on T-SQL, perhaps it could be optimized :-).

Then we’ll build a simple dimension “Reseller Orders Histogram”, and set the name and value columns to that same Interval field (see the picture below).

We’ll process full that dimension, then add it to Adventure Works cube without linking it to any measure group. Finally, we’ll add this part of code below at the end of MDX script in order to create calculation (new measures) that we’ll use in combination with Reseller Orders Histogram dimension:

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency] AS
     Sum( EXISTING [Reseller].[Reseller].[Reseller].MEMBERS,
          iif( [Reseller Orders Histogram].[Interval]
               .CurrentMember.MemberValue =
               [Measures].[Reseller Order Count],
               1,
               null
             )
        )
, Format_String = ‘#,#’
, Associated_Measure_Group = ‘Reseller Orders’
, Display_Folder = ‘Histogram’
;

Scope( ( [Measures].[Reseller Orders Frequency],
         Root( [Reseller Orders Histogram] ) ) );
    This = Sum( [Reseller Orders Histogram].[Interval].[Interval].MEMBERS,
                [Measures].[Reseller Orders Frequency]
              );
End Scope;

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency Total] AS
( [Measures].[Reseller Orders Frequency],
[Reseller Orders Histogram].[Interval].DefaultMember )
, Visible = 0
, Format_String = ‘#,#’
, Associated_Measure_Group = ‘Reseller Orders’
, Display_Folder = ‘Histogram’
;

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency %] AS
    iif( [Measures].[Reseller Orders Frequency Total] = 0,
         null,
         [Measures].[Reseller Orders Frequency] /
         [Measures].[Reseller Orders Frequency Total]
       )
, Format_String = ‘Percent’
, Associated_Measure_Group = ‘Reseller Orders’
, Display_Folder = ‘Histogram’
;

After deploying the cube, we can test it in cube browser (by reconnecting first) in order to see changes.

Open measures, then measure group Reseller Orders, then folder Histogram. Drop both measures in data area. Drop Reseller Orders Histogram dimension on rows. See how the values populate the grid. Then test by adding years from Date dimension on columns. See how total stays the same (unless some years are unchecked) while individual years have their own data. You should know that you might experience some delays, depending on what you bring in pivot (don’t exaggerate).

That’s it. You made event-based histogram. Hope you liked it :-).

In the end, it’s worth saying that this solution can be reduced to a fixed number of intervals (to speed up the results or to have better, smaller graphs in cases when there are many intervals, i.e. > 50). One way would be to keep the first intervals, while grouping the last ones: 0, 1, 2, … 25, 25+. Variant of it would be to group on both sides, in case all data focuses in middle. Finally, the third option would be to reduce the resolution of x-axis (sampling, making bands) by combining several adjecent intervals to one: 0, 1-10, 11-20, 21-30, etc.. Keep in mind that SQL and MDX calculations should be adjusted in each of these scenarios.

Final note: I deliberately switched the axes when I made cube browser example. It’s because data can be seen much better this way (otherwise it spans to much on the right). Normally, when you design a graph for it, you should use bar chart, preferably oriented vertically, as in screenshot below.

25 Responses to “Implementing histograms in OLAP”

Tomislav Piasevoli says:

24.07.2009 at 00:00

If you’re wondering how much did it take to make a post like this, I’d tell ya it took one strong spark of inspiration and one night less of sleep. Yeah, that’s right, the pleasure of achievement is so great, I can hardly feel I haven’t positioned myselft horizontally lately. Dues gotta be paid sooner or later, but luckily today is Friday, so they don’t have to be postponed.

Anyway, what I wanted to add is this. I showed this post to my collegue, Aleksandar Ratkovi?, who happens to be good with relational databases. We could say he polishes relational databases while I polish multidimensional ones. He kindly helped me optimize T-SQL for my named set, so I bring you the improved version:

SELECT
0 AS Interval
UNION ALL
SELECT
TOP
(
SELECT
MAX(Frequency) AS Interval
FROM
(
SELECT
ResellerKey,
COUNT(SalesOrderNumber) AS Frequency
FROM
( SELECT Distinct ResellerKey, SalesOrderNumber
FROM dbo.FactResellerSales
) AS t1
GROUP BY
ResellerKey
) AS t2
)
ROW_NUMBER() OVER(ORDER BY ResellerKey) AS Interval
FROM
dbo.FactResellerSales

In short, Top 1 and Order By DESC construct was replaced by a simple Max function.

Thanks Saša!

Reply
Someone on Windows Live says:

27.07.2009 at 01:00

Excellent solution, very well explained! 🙂

Reply
Classic Classic says:

14.10.2009 at 01:00

ASSOCIATED_MEASURE_GROUP is new in 2008,

So how to achieve that in SSAS 2005? Thank you very much!

Reply
Tomislav Piasevoli says:

11.11.2009 at 13:00

Hi, just omit that line if you’re on 2005.

Sorry for late response.

Reply
Michael Acosta says:

17.10.2010 at 02:16

Outrageously good article that elegantly solves a difficult problem

Reply
Shankar says:

17.11.2010 at 10:49

Hi Tomislav,

It’s a very nice concept. You’ve mentioned about grouping in this. Could you please describe more on how to do the grouping?

” In the end, it’s worth saying that this solution can be reduced to a fixed number of intervals (to speed up the results or to have better, smaller graphs in cases when there are many intervals, i.e. > 50). One way would be to keep the first intervals, while grouping the last ones: 0, 1, 2, … 25, 25+. Variant of it would be to group on both sides, in case all data focuses in middle. Finally, the third option would be to reduce the resolution of x-axis (sampling, making bands) by combining several adjecent intervals to one: 0, 1-10, 11-20, 21-30, etc.. Keep in mind that SQL and MDX calculations should be adjusted in each of these scenarios. ”

It would be really of good help.

Thanks,
Shankar.B

Reply
- Tomislav Piasevoli says:
  
  23.11.2010 at 03:23
  
  In short, the solution would be to prepare adequate table in DW/DSV and modify the calculation accordingly.
  For example, if you want to limit the upper value with 25+, then you’d need to generate a table with 26 rows: 0, 1, 2, … 25, 25+. Of course the key for 25+ should be 26. 25+ and other values could be used as a name for members. Again, MemberValue should be the same as the key, an integer.
  The calculation must have a second iif() (wrapped around the existing one) to test for the case of 25+ member (that is, to test for 26). In that case you’d use “… .CurrentMember.MemberValue <= Measures. ..." condition. The false part would be equal to what's currently there in the iif() inside the sum. That's the only change in MDX script. I believe you'll manage to deduce the other variants based on this.
  
  Reply
  - Airborne Geek says:
    
    15.03.2013 at 18:47
    
    Hi Tomislav,
    
    I’m attempting to implement the “open ended” 25+ bucket as described here, but am having problems geting it to work right. Everything else is in good shape, and the ope-ended top bucket’s result is correct, but all of the other buckets’ results contain the sum of all of the buckets above them (so it’s acting like some kind of running total).
    
    I thought I had a handle on what was going on, and have tried various things like making the outer iff() reference the top bucket’s value specifically like this:
    Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency +10] AS
    Sum( EXISTING [Reseller].[Reseller].[Reseller].MEMBERS,
    iif( [Reseller Orders Histogram–Max10].[Interval]
    .CurrentMember.MemberValue <=
    &[26].[Reseller Order Count],
    1,
    iif( [Reseller Orders Histogram–Max10].[Interval]
    .CurrentMember.MemberValue =
    [Measures].[Reseller Order Count],
    1,
    null
    )
    )
    )
    …and that did stop the "running total" behavior, but instead the top bucket's contents are now added to the result of all of the other buckets. In some ways, closer, but still far away 😉
    
    I'm pretty new at this, so I apologize if there's an easy solution that I'm missing. I appreciate this post and its companion Bucketization one very much.
    
    Thank you!
    
    Reply
    - Tomislav Piasevoli says:
      
      04.04.2013 at 00:31
      
      Hi, that’s because you specified < = in the first iif() and a dynamic value in the second one. Here, try this, I think it might work: Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency +10] AS Sum( EXISTING [Reseller].[Reseller].[Reseller].MEMBERS, iif( [Reseller Orders Histogram--Max10].[Interval] .CurrentMember.MemberValue = [Measures].[Reseller Order Count], 1, iif( [Reseller Orders Histogram--Max10].[Interval] .CurrentMember.MemberValue = 25 AND [Measures].[Reseller Order Count] >= 25
      1,
      null
      )
      )
      )
      
      Only for the last bucket when the value is greater (or equal may be redundant) than 25 collect the 1’s in the sum, but on that bucket only.
      
      Regards,
      
      T
      
      Reply
Jon says:

08.12.2010 at 20:22

It looks as though you used a calculated member to build the histogram. Do you have any examples for doing the same with a DistinctCount measure?

Reply
- Tomislav Piasevoli says:
  
  03.03.2011 at 02:10
  
  No, that’s by design in SSAS cubes.
  It’s relatively simple to make an example.
  Suppose you have a distinct count measure over a dimension key in the fact table. In that case, put any attribute of that dimension in the pivot table and that measure will give you the histogram. In other words, if you build a distinct count measure over CustomerID column in your fact table, the measure will give you the number of customers for each region, city, group or other attributes in the Customer dimension. Notice it’s not the same as asking for count of customers in the dimension table. That one is always fixed, it doesn’t change in time. Contrary to that, count over a fact column returns the number for the current context (relative to other dimensions like time, etc.).
  OK?
  
  Reply
Mark says:

01.12.2011 at 23:10

Hi Tomislav,
I want to thank you for your books and blog, they are very helpful!

I am trying to use your histogram approach on a problem I have. I am trying to see the number of transactions for each customer using a year to date calculation, so the first month should show the frequency (0 to 10) just within that month; the second should show the frequency for the previous month and the current month, and so on.

So I’ve taken your approach and added a calculated member and a frequency dimension as you described above, so my query is as follows:
WITH MEMBER [Measures].[Freq to Date] AS Sum ( YTD([Dim Date].[TxnDate].CurrentMember) ,[Measures].[Transaction Frequency] ) SELECT [Measures].[Freq to Date] ON 0 , [Dim Date].[Year INT].Children* [Dim Date].[Month INT].Children* [DM Interval].[Interval].Children ON 1 FROM [Cube];

Two problems with this query, it performs horribly and my 0 frequency continually adds each month. So if the first month the frequency for 0 was 10, then the second month the frequency should not be 20 (which is what sum does and this works for all frequency except 0), I tried aggregate, but I can’t use that with a calculated member.

Any suggestions?

Reply
- Tomislav Piasevoli says:
  
  02.12.2011 at 00:16
  
  Hi Mark, what is the definition of Transaction Frequency measure?
  
  Reply
  - Mark says:
    
    02.12.2011 at 00:39
    
    That would be the following :
    
    Create MEMBER CurrentCube.[Measures].[Transaction Frequency] AS SUM ( EXISTING [Dim Customer].[Customer ID].[Customer ID].MEMBERS, iif( [DM Interval].[Interval].CurrentMember.MemberValue = [Measures].[Txn Count],1,null ) ) ,Format_String = '#,#' ,Associated_Measure_Group = 'MG01' ,Display_Folder = 'Histogram' ;
    
    Reply
    - Tomislav Piasevoli says:
      
      02.12.2011 at 01:32
      
      Mark, have you considered placing YTD part in this base calculation instead? For example, like this:
      
      Create MEMBER CurrentCube.[Measures].[Transaction Frequency] AS SUM ( EXISTING [Dim Customer].[Customer ID].[Customer ID].MEMBERS, iif( [DM Interval].[Interval].CurrentMember.MemberValue = YTD( …, [Measures].[Txn Count] ),1,null
      )
      )
      ,Format_String = ‘#,#’
      ,Associated_Measure_Group = ‘MG01’
      ,Display_Folder = ‘Histogram’
      ;
      
      And another thing. In the book there’s a better code (I think faster) doing the same thing. See page 266.
      
      Reply
      - Mark says:
        
        02.12.2011 at 17:29
        
        Interestingly, I tried the approch in your book (which I will plug as the best MDX learning tool I have come across to date) and performance was still not so good. I am fairly certain the query is getting held up in the forumla engine, since I see a profiler event for Query Dimension from cache and 1 CPU goes to 100% utilization.
        
        I did not think of adding the YTD in the calculated member, but am experimenting with that now.
      - Tomislav Piasevoli says:
        
        03.12.2011 at 00:13
        
        Yes, everything points to FE. OK, we’ll deal with that later, let’s get the correct formula first and then we’ll try to optimize it.
        I’m waiting to hear more about whether moving YTD function deeper in calculation helps (that is, in the other measure). If not, please explain the required calculation behavior again.
      - Mark says:
        
        03.12.2011 at 00:50
        
        Yes and no, Let me share my mdx. (I took it out of the cube calculation for testing).
        
        What I really want is the frequency of the events from the beginning of the year through the month.
        
        So if 10 events happened for 10 customers in the first month. The 1 frequency would should 10 for that month, but if in the next month 5 of those same customers had another event, the next month would show 5 for frequency 1 and 5 for frequency 2.
        
        with MEMBER [Measures].[Transaction Fq] AS Sum(EXISTING [Dim Customer].[Customer ID].[Customer ID].MEMBERS, iif( [DM Interval].[Interval].CurrentMember.MemberValue = [Measures].[Txn Count], 1 ,null ) ) MEMBER [Measures].[Freq YTD] AS Sum ( ytd(),[Measures].[Transaction FQ] )
        select {[Measures].[Freq YTD],[Measures].[Transaction Fq]} on 0, NON EMPTY [Dim Date].[TxnDate].[Month] * [DM Interval].[Interval].Children ON 1 FROM [Cube]
      - Tomislav Piasevoli says:
        
        03.12.2011 at 01:30
        
        OK, now I think I understand it better. And see where the problem is – calculations are mixed.
        The first measure in your query (Transaction Fq) is generally done right except for one thing – the Txn Count measure should be defined as a calculated measure that returns the YTD value of the number of events for a particular customer. On the other hand, we have a measure like that, it’s the Freq YTD measure. But inside that measure you should use the Txn Count measure. You see, simply change those measures inside existing calculated measures and switch the order of calculated measures (because the latter will now refer to the former one). And I think the solution should work as required now.
        Explanation. The logic of the Transaction Fq measure says that for each interval we want to count the number of customers that had the same number of monthly events/transactions up to that date. In your example, in the first month there will be only 5 customers because the other 5 will have 2 as their count of events YTD, so they will match the second interval.
        Here are the new definitions, to make it easy for you:
        
        WITH
        MEMBER [Measures].[Count YTD] AS Sum ( ytd(), [Measures].[Txn Count] )
        MEMBER [Measures].[Transaction Fq] AS Sum(EXISTING [Dim Customer].[Customer ID].[Customer ID].MEMBERS, iif( [DM Interval].[Interval].CurrentMember.MemberValue = [Measures].[Count YTD], 1 ,null ) )
        
        One more thing. If you won’t slice by other attributes of your customer dimension, you can omit the EXISTING keyword and gain on some speed perhaps.
        
        Hope it helps this time.
      - Mark says:
        
        03.12.2011 at 01:43
        
        This is exactly what I intended, and your explaination makes perfect sense! I hope it helps others as well.
hua yang says:

04.03.2012 at 17:09

Hi Tomislav,
Thank you for your MDX cookbook and your previous help.

I am trying to use your histograms example in chapter 6 to solve my problem. We have five product categories we call it QBA Buckets. If a customer made purchase in all five categories, we say the customer hit all QBA Buckets, the customer is a QBA. If a customer’s purchase made only in three categories, we say the customer hit QBA 3. We need to see how many customers hit 5 buckets, how many hit 4,how many hit 3….

I have a DistinctCount measure called QBABuchet Count. Following your example, I have come put the following script.

Scope( [Measures].[QBA Customer Count],
[Customer QBA Histogram].[Frequency].[Frequency].Members,
[Customer].[InvoiceAccount].[InvoiceAccount].Members
);
This = iif( [Customer QBA Histogram].[Frequency].CurrentMember.MemberValue =[Measures].[QBABucket Count],
1,
null
);
End scope;
Scope( [Measures].[QBA Customer Count],
[Customer QBA Histogram].[Frequency].[All]
);
This = Sum([Customer QBA Histogram].[Frequency].[Frequency].Members,
[Measures].[QBA Customer Count]
);
End Scope;

I have only total around 45000 customers, the result in 0 buckets is 2387498. Something is not right. And the performance is terrible. Any suggestion, please?

Thank you.
Hua

Reply
- Tomislav Piasevoli says:
  
  05.03.2012 at 11:55
  
  Hm, I think your case is not a classic histogram case described in the book. Take a look at this blog post:
  
  https://tomislav.piasevoli.com/2009/08/24/implementing-buckets-in-olap
  
  Maybe it gives you an idea.
  
  Reply
  - hua yang says:
    
    05.03.2012 at 14:58
    
    Thank you Tomislav for your reply.
    
    I read the post you mentioned above a couple of times, I do not know why I understand that is for entity base histograms?
    You introduced two solutions in your book, one is using Dummy Dimension and other is using a physical measure as a placeholder. The first method works for my case, I got the result I have expected; but not the second one.
    It seems to me that the scope statement iterated my rows in the fact table instead of entities of my customers. Can I use existing key somewhere in the Scope statement to force the statement to evaluate in the context? If so, where should I put that keyword?
    And I am using SQL 2012 RC0. Do you think SQL RC0 behaves differently from SQL 2008?
    
    Thank you again for your time.
    Hua
    
    Reply
    - Tomislav Piasevoli says:
      
      13.03.2012 at 01:21
      
      Hi Hua,
      the version shouldn’t make a difference. Maybe the physical measure solution doesn’t work because you have a distinct count measure and that doesn’t add up as other aggregations like Sum or Count? If that’s the case, then there is no way to calculate that other than recalculating it again on a higher grain (entities) as a calculated measure – which brings us back where we were.
      
      T
      
      Reply
Kasper de Jonge PowerPivot Blog » Implementing histograms in Excel 2013 using DAX query tables and PowerPivot says:

24.07.2012 at 18:34

[…] got a great question last month on how to implement a histogram in PowerPivot, similar to this great blog post by Tomislav Piasevoli for Multi Dimensional. I decided to solve the problem in Excel 2013, PowerPivot and DAX and show […]

Tomislav Piasevoli – BI Blog

Implementing histograms in OLAP

25 Responses to “Implementing histograms in OLAP”

Leave a Reply Cancel reply