Jul 27 2009

The reason I’m discussing this often mentioned topic is the following. After many hours spent analyzing how dynamic sets behave in SSAS front-ends, I’ve been asked a perfectly reasonable question: why do we have dynamic sets and when should they be used? I believe this article explains the story behind dynamic sets and their purpose, as well as when and how they should be used. It also warns the reader that the problem is not the implementation of dynamic sets in SSAS, but something else. What? You’ll see when you read it through.

INTRODUCING DYNAMIC SETS

SSAS 2005 and 2008 feature the MDX script as a place to provide calculations. The reason is obvious – concentrating calculations on the server results in better query performance (precomputation, caching, etc.) and easier maintenance.

SSAS 2005 MDX script featured support for cube-based calculated members and sets. While cube-based calculated members were equivalent in behavior to their query-based counterparts (except for some newly available internal optimization possibilities), cube-based calculated sets were not. They were useful, but many times that was not enough. They didn’t behave exactly the same way query-based sets did and, for many, this was a big issue, because they couldn’t provide the same logic in the cube as in a query. In order to comply with these requests, the SSAS team introduced a new kind of set in the next version of SSAS – dynamic sets.

Dynamic sets, unlike static sets, are fully equivalent to calculated sets defined in a query, which means BI developers were finally able to achieve the same functionality in MDX script as in a query.

It’s important to notice that sets defined in SSAS 2005 were not converted to dynamic sets once implemented in SSAS 2008. Today, both kinds of sets exist and both are useful. One variant was named STATIC, the other DYNAMIC, and it’s solely up to the BI developer to prefix them appropriately in their definition. The default is a static set (for historic reasons, 2005 -> 2008 compatibility), which means the prefix can be omitted.
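
To make the distinction concrete, here is a minimal sketch of both variants as they might appear in MDX script (the set names and the expression are illustrative, not taken from any particular cube; Adventure Works names are assumed):

Create STATIC SET CurrentCube.[Top 10 Products Static] AS
    // evaluated once, when the MDX script runs – a fixed collection of members
    TopCount( [Product].[Product].[Product].MEMBERS,
              10,
              [Measures].[Internet Sales Amount] );

Create DYNAMIC SET CurrentCube.[Top 10 Products] AS
    // re-evaluated for every query, in the context of its slicer (and subselect)
    TopCount( [Product].[Product].[Product].MEMBERS,
              10,
              [Measures].[Internet Sales Amount] );

Omitting the keyword is the same as writing STATIC.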

DESIGN AND USAGE

The rule for usage is simple. If a BI developer needs a hardcoded collection of members, one that doesn’t change no matter what happens in the query/pivot, he/she should use static sets. In all other scenarios, a dynamic set can be used. So, how come there are problems here and there when using them?

We said that query-based sets should have their cube-based equivalent. And they do, at least in SSAS 2008. When we implement cube-based sets, we enjoy the comfort of being able to refer to them in any query. Equally so, end-users are able to use them in a pivot table (SSAS front-ends), which is an extra plus. That is, unless we hide them, which is rarely the case. Part of our intention is to provide those sets directly to end-users, not just to be able to refer to them in our queries.

Dynamic sets are context-sensitive. Not row, not column, just slicer (and subselect) sensitive. Which means that end-users can experiment by slicing them in a pivot table and see how they change. A typical example is a “Top 10 something” set. When defined as dynamic in a cube, it should cover a whole range of scenarios for end-users. Whether sliced by additional attributes or by different dimensions, that set should adjust accordingly. A dream come true for any BI developer – a cube-based solution covering many reports. Because the best solutions are those covered by design, just like a pivot table on top of an OLAP cube, a solution by design covering endless T-SQL queries in the relational world. However, not everything went smoothly, at least not on the other side …

SUPPORTING DYNAMIC SETS

Queries are written by BI developers, not end-users. Which means they and their sets are under tight control, hidden and preferably tested enough. Pivot tables in front-ends are not. In contrast, they are fully exposed to end-users and therefore often limited, not just regarding cube-based sets, but other features as well. Why? It’s simple: the more you restrict your front-end, the fewer worries you have later, knowing that potential end-user actions and their combinations are beyond your imagination. One common example is hiding MDX – you play it safe with your pivot and the features you support in it.

Each front-end tries to keep things under control by balancing between end-user requests and features it would rather skip than support. In a multidimensional world, possibilities are endless and therefore – once you have decided to support an SSAS feature, you’ve opened a Pandora’s box, whether you know it or not, because everything’s related and all features should fit together just fine.

Every front-end has these dilemmas – do we support this feature, and if so, in what scope, in what situations, what additional actions do we support on it, what combinations of existing features are supported with it, etc. (our CubePlayer is no exception there). It’s hard to implement everything properly, to foresee things. I’m not saying it’s impossible, but, let’s say, experience plays a big role here.

Things went wrong when, little by little, it became obvious that the favorite front-end of many (also a relatively young one, so we shouldn’t be too harsh on it) had a problem with dynamic sets. The cause of the problem? Intense use of the subselect.

When an end-user placed a set on rows (or columns), a peculiar syntax was formed and that same set also ended up in the subselect.

By now you must have guessed – we’re talking about Excel 2007.

HOUSTON, WE’VE HAD A PROBLEM

Each front-end generates one or more MDX statements in the background for every end-user action in its pivot (almost every action). Regardless of the areas and their names in the front-end’s pivot, which are up to a particular software vendor, there are rows, columns, slicer and subselect areas as the 4 basic parts of every MDX query (some of them can be skipped, though).
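
For orientation, a simplified sketch of such a generated statement, with all four areas present, might look like this (hierarchy and member names are illustrative and may differ in your cube):

SELECT
    { [Measures].[Internet Sales Amount] } ON COLUMNS,         // columns area
    [Product].[Subcategory].[Subcategory].MEMBERS ON ROWS      // rows area
FROM
    ( SELECT { [Product].[Category].[Bikes] } ON COLUMNS       // subselect area
      FROM [Adventure Works] )
WHERE
    ( [Sales Territory].[Sales Territory Country].[Canada] );  // slicer area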

The MDX query is the thing that is sent to the SSAS server, so it is up to the front-end to properly convert what the end user provided in the pivot into correct MDX. The problem is, there can be several solutions to it. It’s all about how to pick the best solution, the one that will work not just in this case, but also with everything we support in our front-end (which brings us back to the front-end dilemma). Using the subselect so often as an approach was not a good decision, because dynamic sets, if nothing else, were obviously neglected. Or not tested properly. The artificial Adventure Works database, a reference for many things, may be misleading (see this SSAS forum thread for more info).

Two of those areas, the subselect and the slicer, are of importance here. They are similar in nature but also quite different. In order not to lose focus here, we will only state what’s relevant for dynamic sets – the subselect is evaluated before the slicer!

The SSAS server’s behavior can be simplified like this. Internally, it first asks what we have in the subselect and evaluates it. It is important to state here that this evaluation is performed in the context of the cube dimensions’ default members (or of another subselect inside it, if they are cascaded, but let’s keep things simple). Then the slicer gets evaluated. Then the axes. Then the cells, either in block mode or cell by cell. Finally, NON EMPTY and HAVING operations are performed on the axes.

Our set is supposed to be on an axis, where the end-user has placed it. That way it would be evaluated after the slicer, adjusting accordingly. Excel 2007, on the other hand, placed it in the subselect. There it can’t be evaluated against the slicer, because the slicer doesn’t exist yet (that would be like knowing the future). But what does exist?

Default members! A dynamic set placed in the subselect is evaluated against the cube hierarchies’ default members. And what is a static set evaluated against in MDX script? Also default members. Which makes dynamic and static sets the same in Excel 2007 – both static in nature.
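
A sketch of the difference, reusing the hypothetical [Top 10 Products] dynamic set from above (the second query only approximates the shape of statement such a front-end generates):

// set on an axis, slice in WHERE – the dynamic set adjusts to the slicer
SELECT
    { [Measures].[Internet Sales Amount] } ON COLUMNS,
    [Top 10 Products] ON ROWS
FROM [Adventure Works]
WHERE ( [Sales Territory].[Sales Territory Country].[Canada] );

// set pushed into the subselect – it is evaluated against default members,
// so it resolves to the overall Top 10, no matter what the end-user sliced by
SELECT
    { [Measures].[Internet Sales Amount] } ON COLUMNS,
    [Top 10 Products] ON ROWS
FROM
    ( SELECT ( { [Top 10 Products] },
               { [Sales Territory].[Sales Territory Country].[Canada] } ) ON COLUMNS
      FROM [Adventure Works] );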

CONCLUSION

Dynamic sets are implemented just fine in SSAS 2008. That is, they are equal to query-based sets, adjusting to the slicer promptly.

It is up to the front-end you’re going to use them in and its support for them. That’s where problems might occur, either because of a lack of support or, worse, because of improper support.

Finally, it’s up to us, consultants, to educate ourselves properly and then communicate with potential users, advising them on what is and isn’t supported in the solutions we intend to implement. The final decision, of course, is the client’s.

Jul 24 2009

Histograms are an important tool in data analysis and as such are used in a variety of fields: digital photography, databases, business, and so on. They represent the distribution of frequencies of an entity or event.

By its design, OLAP handles entities much better than events, and the same goes for histograms – entity-based histograms are supported easily, event-based ones are not. Yet, just like there was a way to count “in-flight” events using Mosha’s solution and to handle begin and end times using the approach of Richard Tkachuk and Thomas Kejser, there is certainly a way to score one more time over events in OLAP, this time by enabling event-based histograms. And here’s how …

In order to build histograms, the first requirement is to have a DistinctCount measure, either physical, defined in a separate measure group, or calculated, defined in MDX script. The first approach is faster, although it requires extra space.

The second requirement is to have an entity (dimension), preferably with some attributes. The third is to have a fact table (cube).

All in all, nothing special except the DistinctCount measure, one per each entity of interest (customers, products, …). A fact and a dimension with its attributes are something we can expect to find in every cube.
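
If a physical distinct count measure group is not an option, the calculated variant mentioned earlier could be sketched in MDX script roughly like this (the measure name is made up, and the non-empty condition is simplified to a single sales measure):

Create MEMBER CurrentCube.[Measures].[Customer Count (calc)] AS
    // count the customers that have at least one fact row in the current context
    Count(
        NonEmpty( [Customer].[Customer].[Customer].MEMBERS,
                  [Measures].[Internet Sales Amount] )
    )
, Format_String = '#,#'
;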

Take Adventure Works 2008, for example. It meets all 3 requirements. Therefore, we’ll use the Standard version of it as our example here (it can be downloaded from Codeplex in case you don’t have it yet). But let’s first return to the definition we used for histograms, to analyze the difference between the two types of histogram.

We said there are entity-based frequency distributions and event-based frequency distributions.

When we say entity, we’re actually thinking about an attribute of an entity, like shoe size, weight, height, number of children, marital status, and so on. When we say event, we’re thinking about the occurrence of an entity (or of its attribute, since they are related) in a series of events. And a series of events is another name for a fact table.

Adventure Works 2008 features the Customer Count measure, which can be found in the Internet Customers measure group. By using that measure over any attribute of the Customer dimension (particularly those found in the Demographic folder) we get entity-based histograms. Those histograms show us how many customers are single and how many are married, how many earn 10-30K$ and how many 130-170K$, and so on. If the histogram has higher values on the left side in both cases, we can conclude there are more single than married persons in our pool and more of those earning less than average. Of course, in another pool (a slice per customer city or some other attribute) the situation might be different.
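
In MDX, such an entity-based histogram is a trivial query; for example, the distribution per marital status (the attribute is chosen for illustration only):

SELECT
    { [Measures].[Customer Count] } ON COLUMNS,
    [Customer].[Marital Status].[Marital Status].MEMBERS ON ROWS
FROM [Adventure Works];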

The important thing to notice about entity-based histograms is this – they can only be sliced by additional attributes of that entity (dimension), not by other dimensions or their attributes. Which means this type of histogram is static, time-irrelevant, entity-oriented.

The other type of histogram is therefore dynamic, time-sensitive, fact-oriented. Actually, any-dimension-sensitive, since it is fact-oriented. An example of such a histogram is this: count of customers on the y-axis, distinct number of orders on the x-axis. In other words, we have 0, 1, 2, and so on on the x-axis and our DistinctCount measure on the y-axis. With one problem – numerical values (measures) seem to be on both axes. And that’s something OLAP doesn’t support.

0, 1, 2 and so on represent the distinct number of orders. Although it looks like they should be calculated in MDX using some count expression, we cannot do that. Our task is to display a table/graph. That means we need to iterate over some dimension (rows) and provide the values (measure) across it (columns). We’re talking about a 2D area, a grid.

Although it is possible to generate N calculated measures named 0, 1, 2 and so on and to provide a proper calculation for each of them, there are many obstacles in that approach. First, there can be many bands/intervals, and MDX doesn’t support loops in its DDL (we can’t generate calculated measures in a loop, only one by one, explicitly), which would require a lot of manual work. Secondly, new intervals might appear over time. Some customers might score more than the designer of those calculated measures planned in advance.

All that leads to a search for another approach. And there is one.

A better solution is to implement the distinct number of orders as a new dimension, a dummy dimension. It is similar to a shell/utility dimension (for those of you who have used one at some point), but also quite different.

Normally, you make a shell/utility dimension with only one member and many calculated ones. Or, you do it with one member and many dummy members, in order to avoid problems with calculated members in some tools :-).

Here we’ll have all regular members, no calculated ones, and we’ll have N+1 of them, where N is the maximum number of orders – all natural numbers, including zero as the first record (that’s the +1). In contrast to a shell/utility dimension, our dummy dimension won’t have any calculations on its members. Hence we call it “dummy.”

We’ll start by creating a named query in the DSV, naming it “Reseller Orders Histogram” (don’t worry about the pop-up with an error, just click OK):

Create new Named Query in DSV

SELECT
    0 AS Interval                        -- interval 0 (that's the +1 row)

UNION ALL

SELECT
    TOP                                  -- intervals 1..N
    (
        -- N = the maximum number of orders made by any single reseller
        SELECT TOP 1
            Frequency AS Interval
        FROM
            (
                -- number of distinct orders per reseller
                SELECT
                    ResellerKey,
                    COUNT(SalesOrderNumber) AS Frequency
                FROM
                    (
                        SELECT DISTINCT
                            ResellerKey,
                            SalesOrderNumber
                        FROM
                            dbo.FactResellerSales
                    ) AS t1
                GROUP BY
                    ResellerKey
            ) AS t2
        ORDER BY
            Interval DESC
    )
    ROW_NUMBER() OVER(ORDER BY ResellerKey) AS Interval
FROM
    dbo.FactResellerSales

That T-SQL gives us N+1 rows, where N is the maximum number of orders made by any reseller in the fact table. In Adventure Works 2008, that’s 12.

It’s a bit of a complex query, but if you execute its inner parts one by one, separately, you’ll understand what’s going on. Also, I’m no expert in T-SQL, so perhaps it could be optimized :-).

Then we’ll build a simple dimension, “Reseller Orders Histogram”, and set both the name and value columns to that same Interval field (see the picture below).

Reseller Orders Histogram dimension

We’ll fully process that dimension, then add it to the Adventure Works cube without linking it to any measure group. Finally, we’ll add the code below at the end of the MDX script in order to create the calculations (new measures) that we’ll use in combination with the Reseller Orders Histogram dimension:

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency] AS
     // for the current interval, count the resellers whose number of orders
     // equals that interval's value
     Sum( EXISTING [Reseller].[Reseller].[Reseller].MEMBERS,
          iif( [Reseller Orders Histogram].[Interval].CurrentMember.MemberValue =
               [Measures].[Reseller Order Count],
               1,
               null
             )
        )
, Format_String = '#,#'
, Associated_Measure_Group = 'Reseller Orders'
, Display_Folder = 'Histogram'
;

// on the All member of the histogram dimension, show the grand total
// (the sum over all intervals)
Scope( ( [Measures].[Reseller Orders Frequency],
         Root( [Reseller Orders Histogram] ) ) );
    This = Sum( [Reseller Orders Histogram].[Interval].[Interval].MEMBERS,
                [Measures].[Reseller Orders Frequency]
              );
End Scope;

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency Total] AS
     // helper: the grand total, read from the default (All) member of the dimension
     ( [Measures].[Reseller Orders Frequency],
       [Reseller Orders Histogram].[Interval].DefaultMember )
, Visible = 0
, Format_String = '#,#'
, Associated_Measure_Group = 'Reseller Orders'
, Display_Folder = 'Histogram'
;

Create MEMBER CurrentCube.[Measures].[Reseller Orders Frequency %] AS
    // share of each interval in the total, guarded against division by zero
    iif( [Measures].[Reseller Orders Frequency Total] = 0,
         null,
         [Measures].[Reseller Orders Frequency] /
         [Measures].[Reseller Orders Frequency Total]
       )
, Format_String = 'Percent'
, Associated_Measure_Group = 'Reseller Orders'
, Display_Folder = 'Histogram'
;

After deploying the cube, we can test it in the cube browser (reconnecting first) in order to see the changes.

Open the measures, then the Reseller Orders measure group, then the Histogram folder. Drop both measures in the data area. Drop the Reseller Orders Histogram dimension on rows. See how the values populate the grid. Then test by adding years from the Date dimension on columns. See how the total stays the same (unless some years are unchecked) while individual years have their own data. You should know that you might experience some delays, depending on what you bring into the pivot (don’t exaggerate).

Using Histogram dimension in cube browser
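
If you prefer testing outside the cube browser, a query along these lines, using the measures defined above, should return the same grid:

SELECT
    { [Measures].[Reseller Orders Frequency],
      [Measures].[Reseller Orders Frequency %] } ON COLUMNS,
    [Reseller Orders Histogram].[Interval].[Interval].MEMBERS ON ROWS
FROM [Adventure Works];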

That’s it. You’ve made an event-based histogram. Hope you liked it :-).

In the end, it’s worth saying that this solution can be reduced to a fixed number of intervals (to speed up the results or to have better, smaller graphs in cases where there are many intervals, i.e. > 50). One way would be to keep the first intervals while grouping the last ones: 0, 1, 2, … 25, 25+. A variant of that would be to group on both sides, in case all the data concentrates in the middle. Finally, the third option would be to reduce the resolution of the x-axis (sampling, making bands) by combining several adjacent intervals into one: 0, 1-10, 11-20, 21-30, etc. Keep in mind that the SQL and MDX calculations should be adjusted in each of these scenarios.

A final note: I deliberately switched the axes when I made the cube browser example, because the data can be seen much better this way (otherwise it spans too much to the right). Normally, when you design a graph for it, you should use a bar chart, preferably oriented vertically, as in the screenshot below.

Histogram in CubePlayer (bar chart)