On the 2nd of September I watched a marathon of SQL Server sessions called “24 Hours of PASS”. At least a small part of it. One session that made me think was Steve Jones’ Building a better blog. I wanted to learn a few tricks about blogging in order to make mine a better place. What surprised me was Steve’s advice to blog about everything you do, in order to show others what topics you have covered so far, regardless of whether someone else has covered them before or not. Now, I simply couldn’t agree with that. For me, writing a blog article, and by that I don’t mean a short announcement or similar, means writing about something others haven’t done so far, or at least not in that way. It’s about creating something new, a challenge to turn ideas into practice and demonstrate it in public. Otherwise, what’s the point of duplicating content? Then again, maybe I just don’t understand the point of having a blog.
And so I thought to myself, surely there must be several types of bloggers. Some, for instance, act as messengers, concentrating mostly on what’s new and what’s considered best practice. News and knowledge aggregators, we might call them. Then there are occasional practitioners, people who don’t blog often and rarely do so about news. When they blog, it’s mostly about something they’ve done recently and found interesting enough to share. Next, there are regular bloggers, who post weekly if not more often, regardless of the topic, just to keep the pace. And so on. Which makes me think. Could we perform a real-life analysis on this topic? Could we analyze bloggers writing about Business Intelligence? Why not? I know a good place where this kind of information can be obtained. Our precious: the SSAS-Info.com portal, hosted by Vidas Matelis.
STARTING THE PROJECT
Although the strategic decision to go ahead with this project was made the moment the idea came to me, the tactical decision followed a month later.
I mean, I saw this right from the start as a community project, this being merely its initial phase. A project where we could all participate, explore possibilities, share the findings and the project’s know-how, and have fun at the same time. Because this time it would not be about Northwinds, Foodmarts or AdventureWorks; this time it would be about us. Just imagine, us inside a cube. Not as analysts, but as the ones being analyzed, for a change. Feels weird, yet tempting.
Considering all the circumstances, it had to be done before PASS Summit 2009, because that’s a great place to gather (at least some of us will be there) and discuss the findings and further possibilities. Therefore, upon receiving an emergency note from tactical management, the operational team had to act fast. And so I did.
I contacted Vidas, who kindly sent me a CSV dump of his site: one big table containing all the records, such as articles, news, FAQs, etc. High five for cooperativeness!
Since my virtual environment wasn’t ready at that moment, I started in Excel 2007 and managed to parse, clean, distill and consolidate the data using Excel functions, formulas and the copy-paste technique. Like a real BA :-). I also transposed some of the data (the tags) into an M2M model using Query. Details from that experience in some other article, perhaps. I’m glad I did it in the end. Does that say enough?
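For the curious, here is a minimal sketch of what that tag transposition boils down to, written in Python/pandas rather than Excel, and assuming a purely hypothetical layout (an Id column plus a comma-separated Tags column); the real dump may well look different.

```python
import pandas as pd

# Hypothetical layout: every record has an Id and a comma-separated Tags column.
records = pd.DataFrame({
    "Id": [1, 2],
    "Title": ["Some MDX article", "A design tip"],
    "Tags": ["MDX, Performance", "Design"],
})

# Split the Tags column and explode it into one row per (record, tag) pair,
# which is exactly the bridge table an M2M model needs.
bridge = (
    records.assign(Tag=records["Tags"].str.split(","))
           .explode("Tag")
           .assign(Tag=lambda df: df["Tag"].str.strip())
           [["Id", "Tag"]]
)

# Distinct tags become their own small dimension table.
tags_dim = bridge[["Tag"]].drop_duplicates().reset_index(drop=True)

print(bridge)
print(tags_dim)
```

The same split-and-unpivot idea is what the Excel query did, only by hand.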
Before going any further, I sent it back to Vidas. And I recommend this as good practice for future PowerPivot users: once they get what they want, loop back to IT so the data can be cleaned/consolidated at the very source. Make no mistake, data will always be dirty. Yes, I know, it’s an extra step, but that way the data becomes better and better for analysis. Love thy brother, thou Gemini.
He gladly cleaned it, following the findings I had included. By then my virtual environment was ready, so I moved to Excel 2010 and the Gemini add-in.
I couldn’t build a model directly from the same Excel workbook that held the tables for analysis, so I saved that workbook and used it from a new one as an ODBC data source.
Connecting went smoothly. I couldn’t see the data, just the headers, but I assumed it had something to do with this still being a CTP (and perhaps something else). I simply ignored it, returned to Excel, added a new pivot, filled it with some fields, and the data was there. It works!
PERFORMING THE ANALYSIS
I made a Four Charts view (under the Pivot dropdown) and started searching for interesting and meaningful results (and also for the ones where I appear in the chart, yet it doesn’t look suspicious). Finally, I settled on this. What do you think?
I connected the orange Section slicer to the charts on the left (colored orange too). That slicer separates the different parts of the SSAS-Info site, articles being the most fun part to analyze (hence selected). The upper radar graph shows the number of articles per category. The three most blogged-about categories (as seen in the radar graph) are MDX, Management and Design. The bottom bar graph shows the Top 10 bloggers.
The blue slicer, Category, represents the topics being blogged about. The top 3 are selected manually, so that the blue stacked bar graph shows their total as well as each Top 10 blogger’s share in those topics. As you can see, almost all of the Top 10 bloggers overall are among the Top 10 bloggers in these categories too. Vidas Matelis and Teo Lachev focus their attention primarily on other subjects, while Microsoft and I blog above average in the top 3 categories. The blue graph is linked to both the orange and the blue slicer, and it supports multi-select quite beautifully, thanks to the stacked bar graph being used here.
The green pie graph is linked to the green Author slicer (and to the orange slicer too). Here multi-select is not recommended, because this graph is not designed for it. However, it shows the distribution of categories for a single author. Chris Webb, seen in this beautiful graph, appears to be a true multivitamin of consultancy, because of the wide variety of topics he blogs about and their nice distribution.
Which brings us back to my initial dilemma. Using this graph it is relatively easy to analyze how versatile someone chooses to be in his blog (based on the categories he decides to write about). Mosha Pasumansky, I and many other BI bloggers lean toward one or two areas, partially because most of us have fewer than 10 posts, which is a relatively small number compared to the number of categories.
The same analysis can be done with a radar graph, either by hooking the orange radar graph to the Author slicer, or by changing the bottom-right graph type from pie to radar, since it is already connected as required. That shows us that some BI bloggers, like Carl Rabeler and Dave Fackler, have a square in the radar graph (a focus on 4 categories). Many Microsoft employees have a more or less wide wedge, with SSAS 2008 as the longest spoke. Other bloggers have triangles, and so on.
Chris Webb, Vidas Matelis, Thomas Ivarsson, Dan English and others show a more balanced pie covering many categories. And so on, and so on.
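Purely to illustrate that versatility idea outside of PowerPivot, here is a minimal Python/pandas sketch of two simple measures: how many distinct categories an author covers, and how dominant his favourite category is. The authors and posts below are made up for the example.

```python
import pandas as pd

# Hypothetical (author, category) rows, one per post, extracted from the dump.
posts = pd.DataFrame({
    "Author": ["A", "A", "A", "B", "B", "C"],
    "Category": ["MDX", "MDX", "Design", "Management", "MDX", "MDX"],
})

# Post count and number of distinct categories per author.
per_author = posts.groupby("Author").agg(
    Posts=("Category", "size"),
    Categories=("Category", "nunique"),
)

# Share of posts falling into the author's favourite category:
# close to 1.0 means a narrow focus, lower means a more balanced pie.
top_share = (
    posts.groupby(["Author", "Category"]).size()
         .groupby("Author").max()
         / per_author["Posts"]
)
per_author["TopCategoryShare"] = top_share

print(per_author.sort_values("Categories", ascending=False))
```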
Btw, with so many celebrities in my article (without them even knowing about it), I feel like Michael Jackson shooting the Liberian Girl video. And I apologize in advance for not mentioning all of you (~150). Don’t worry, I have plenty of ideas for the sequel 😉
DRAWING THE CONCLUSION
Having seen many of these charts, I conclude that my assumption from the beginning of this article is true. Patterns do exist. Therefore, they should be analyzed.
My comparison with vitamins might not be bad at all. I mean, both help society. Next, each of us targets one or more areas. Finally, each of us, like vitamins from different vendors, has better or worse perception/absorption; that is, what we write about and how we write about it. Yet, unlike vitamins, we have the option to evolve.
My (not first, but second) impression of PowerPivot is that it is envisioned as a very pleasant analytical environment, with slicers as its most impressive component.
Although the same analysis could have been done in Excel 2007 or in SSAS 200*, I chose Gemini because I participate in the CTP program, haven’t posted anything about it so far and wanted to learn. And what better way to learn than to practice by setting up a task and then fulfilling it, incrementally, this being the first step.
As far as in-memory cube generation is concerned, that part is still to come. I made some tests, but that’s not it yet. A careful reader will remember that my standards prevent me from posting unless it is something valuable or unique. In CTP3, perhaps.
WHERE TO NEXT?
This experiment showed interesting results. At least to me. And it pointed out one important thing: data is everywhere! Yet it’s not being analyzed, just collected. Opportunities are numerous. Business Intelligence, at least its technology part, enables everyone to gain fast insight into data, encouraging the creative process, which then leads to better decisions and actions.
Opportunities are numerous here too. This was only a small portion of what can be done. I see this project as a potentially flourishing area for the whole community, where each of us could cover his own area of interest. Btw, this is just one of several projects I have in mind, and I intend to start them when the time comes.
For example, this data is static (captured a week ago). One possible project would make it come alive, enabling real-time (or daily) analysis (although weekly could be fine too).
The next one is time analysis. Records in the database behind the SSAS-Info portal don’t hold the original article’s date, but rather the date/time when the item was manually entered into the database. I’ve omitted that in this example; I might make another one soon. Now, that difference means we have an uncertainty of a couple of days for the more recent items, and an unpredictable offset for the older ones. The original dates could be collected and merged into the DW, as sketched below.
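Just to show what that merge could look like, here is a minimal Python/pandas sketch; the URLs, column names (Url, EnteredDate, PublishedDate) and dates are all hypothetical, since the publish dates would first have to be scraped from the individual blogs.

```python
import pandas as pd

# Hypothetical site dump: only the manual entry date is available.
site = pd.DataFrame({
    "Url": ["http://example.com/post-1", "http://example.com/post-2"],
    "EnteredDate": pd.to_datetime(["2009-09-20", "2009-09-22"]),
})

# Hypothetical scraped publish dates, keyed by the same URL.
scraped = pd.DataFrame({
    "Url": ["http://example.com/post-1"],
    "PublishedDate": pd.to_datetime(["2009-09-15"]),
})

# Merge the scraped dates into the warehouse-bound table and measure
# the offset between publishing and manual entry.
merged = site.merge(scraped, on="Url", how="left")
merged["EntryLagDays"] = (merged["EnteredDate"] - merged["PublishedDate"]).dt.days
print(merged)
```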
Then there are numerous data-mining possibilities. For example, clustering authors into groups based on categories or tags. We could also classify authors as night owls or daylight writers, or by the regularity of their posts (regular, occasional), day-of-week affinity, etc.
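As a small sketch of that clustering idea (not part of the analysis above), here is what it could look like with scikit-learn’s KMeans on a hypothetical author-by-category count matrix; the authors, counts and the choice of two clusters are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical post counts per author and category; the real matrix would be
# pivoted out of the SSAS-Info dump.
counts = pd.DataFrame(
    {"MDX": [12, 1, 0, 7], "Management": [2, 9, 1, 5], "Design": [0, 3, 8, 6]},
    index=["Author A", "Author B", "Author C", "Author D"],
)

# Normalize each author's row so clusters reflect topic mix, not post volume.
profiles = counts.div(counts.sum(axis=1), axis=0)

# Group authors with similar category profiles; 2 clusters is arbitrary here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
counts["Cluster"] = kmeans.fit_predict(profiles)
print(counts)
```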
This could also be a nice case for data-cleansing experts to show their strength and ideas in that field, in Excel as well as on the SQL Server platform, perhaps comparing the good and bad sides of each path.
Finally, there are M2M challenges not to be neglected. Altogether, a nice playground for the upcoming CTP3.
Yes, I know what you’re thinking. OK, fine, no problem. Feedback would, however, be very much appreciated.
If, on the other hand, you are interested in participating in this project, Vidas and I have no objection to you using the raw or prepared data, which, btw, can be found below.
Feel free to contact us regarding cooperation, advice, anything.
And again, don’t be shy: give a thumbs up or down, feedback, anything. Let me know what you’re thinking. That way I can make another analysis. Just kidding.
PS: A small note to the Redmond team: the same idea could be applied to the MSDN forums, extending the text-mining of tags and posts recently introduced there, if I’m not mistaken.