We have an "age" dimension in our SSAS Cube. It's basically just the one attribute that's the person's whole number age at the time an event happened. We've had a requirement to further break it down into adult/child with a sub group of adult/geriatric and pediatric/neonatal.
When adding these new attributes to the dimension and a hierarchy, do I have to go into the aggregation designs and rebuild the ones that reference the dimension?
We aren't changing the key of the patient age, just adding the extra data.
Unfortunately, your aggregations won't include the new level automatically, but they will still help: the engine can use aggregations of the same dimension at a level below your new one, which is faster than retrieving from the data files.
Please also remember the '1/3 rule': aggregations should be less than one-third the size of the fact table.
You can find the details in the excellent white paper 'Analysis Services 2008 R2 Performance Guide' (section 3.4 Aggregations, page 60): http://download.microsoft.com/download/6/5/6/6567C845-FC8D-4D62-920F-C027A349C889/SSASPerfGuide2008R2.pdf
The project requires a usable data warehouse (DW) in SQL Server tables. The client prefers no Analysis Services, with the SQL Server DW providing everything they need.
They are starting to use Power BI and have expressed the desire to provide all facts and measures in SQL Server tables, as opposed to a multidimensional cube. The client has also used SSRS (to a large degree), and some Excel (by users).
At first they only required the Revenue FACT for Period x Product x Location. This uses a periodic snapshot type of fact, not a transactional grain fact.
So, then, to provide YTD measures for all periods, my first challenge was filling in the "empty" facts, for which there was no revenue, but there was revenue in prior (and/or subsequent) periods. The empty fact table includes a column for the YTD measure.
I solved that by creating empty facts -- no revenue -- for the no revenue periods, such as this:
Period 1, Loc1, Widget1, $1 revenue, $1 YTD
Period 2, Loc1, Widget1, $0 revenue, $1 YTD (this $0 "fact" created for YTD)
Period 3, Loc1, Widget1, $1 revenue, $2 YTD
I'm just using YTD as an example, but the requirements include measures for Last 12 Months, and Annualized, in addition to YTD.
I found that the best way to tally YTD was actually to create records to hold the measure (YTD) where there is no fact coming from the transactional data (meaning: no revenue for that combination of dimensions).
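For concreteness, here is a minimal T-SQL sketch of that "fill in the empty facts" approach. The table and column names (DimPeriod, FactRevenue, FiscalYear) are illustrative assumptions, and it assumes PeriodKey sorts chronologically within a fiscal year:

    -- Build the dense Period x Location x Product grid for combinations that
    -- ever had revenue, fill missing periods with $0, then compute YTD with a
    -- windowed SUM.
    ;WITH Grid AS (
        SELECT p.PeriodKey, c.LocationKey, c.ProductKey
        FROM dbo.DimPeriod AS p
        CROSS JOIN (SELECT DISTINCT LocationKey, ProductKey
                    FROM dbo.FactRevenue) AS c
    )
    SELECT
        g.PeriodKey,
        g.LocationKey,
        g.ProductKey,
        COALESCE(f.Revenue, 0) AS Revenue,
        SUM(COALESCE(f.Revenue, 0)) OVER (
            PARTITION BY g.LocationKey, g.ProductKey, p.FiscalYear
            ORDER BY g.PeriodKey
            ROWS UNBOUNDED PRECEDING) AS RevenueYTD
    INTO dbo.FactRevenueYTD
    FROM Grid AS g
    JOIN dbo.DimPeriod AS p ON p.PeriodKey = g.PeriodKey
    LEFT JOIN dbo.FactRevenue AS f
        ON  f.PeriodKey   = g.PeriodKey
        AND f.LocationKey = g.LocationKey
        AND f.ProductKey  = g.ProductKey;

The Last 12 Months and Annualized measures follow the same pattern with a different window frame or divisor.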
Now the requirements need the revenue fact by two more dimensions, Market Segment and Customer. This means I need to refactor my existing stored procedures to do the same process, but now for a more granular fact:
Period x Widget x Location x Market x Customer
This will result in creating many more records to hold the YTD (and other) measures. There will be far more records for these measures than there are real facts.
Here are what I believe are possible solutions:
Just do it in SQL DW table(s). This makes it easy to use wherever needed. (like it is now)
Do this in Power BI -- assume DAX expression in PBIX?
SSAS Tabular -- is Tabular an appropriate place to calculate the YTD, etc. measures, or should it be handled at the reporting layer?
For what it's worth, the client is reluctant to use SSAS Tabular because they want to keep the number of layers to a minimum.
Follow up questions:
Is there a SQL Server architecture to provide this sort of solution as I did it, maybe reducing the number of records necessary?
If they use PowerBI for YTD, 12M, Annualized measures, what do I need to provide in the SQL DW, anything more than the facts?
Is this something that SSAS Tabular solves, inherently?
This is my experience:
I have always maintained that the DW should be where all of your data is. Then any client tool can use that DW and get the same answer.
I came across the same issue in a recent project: generating "same day last year" type calcs (along with YTD, financial YTD, etc.). SQL Server seemed like the obvious place to put these 'sparse' facts, but as I discovered (as you have), the sparsity just gets bigger and bigger and more complicated as the dimensions increase. You end up blowing out the size, continually coming back to chase down those missing sparse facts, and, worst of all, having to come up with weird 'allocation' rules to push measures down to the required level of detail.
IMHO DAX is the place to do this, but there is a lot of pain in learning the language, especially if you come from a traditional relational background. But I really do think it's the best thing since SQL, if you can just get past the learning curve.
One of the most obvious advantages of using DAX, rather than the DW, is that DAX recognises what your current filters are in the client tool (Power BI, Excel, or whatever) at run time and can adjust its calculation automatically. Obviously you can't do that with figures stored in the DW. For example, it can recognise that the page, chart, or row is filtered to a given date, so your current-year/prior-year calcs automatically compute the correct YTD based on that date.
DAX has a number of 'calendar' type functions (called "time intelligence"), but they only work for a particular type of calendar and there are a lot of constraints, so usually you end up needing to create your own calendar table and build functions around that calendar table.
My advice is to start here: https://www.daxpatterns.com/ and try generating some YTD calcs in DAX.
For what it's worth, the client is reluctant to use SSAS Tabular because they want to keep the number of layers to a minimum.
Power BI already has a (required) modelling layer that effectively uses SSAS Tabular internally, so you already have an additional logical layer; it's just in the same tool as the reporting layer. The difference is that doing modelling only in Power BI currently isn't an "Enterprise" approach: features such as model version control, partitioned loads, and advanced row-level security aren't supported by Power BI (although who knows what next month will bring).
Layers are not bad things as long as you keep them under control. Otherwise we should just go back to monolithic COBOL programs.
It's certainly possible to start doing your modelling purely in Power BI and then, at a later stage when you need the features, control and scalability, migrate to SSAS Tabular.
One thing to consider is that the SSAS Tabular PaaS offering in Azure can get pretty pricey, but if you ever need partitioned loads (i.e. loading just this week's data into a very large cube with a lot of history), you'll need to use it.
Is there a SQL Server architecture to provide this sort of solution as I did it, maybe reducing the number of records necessary?
I guess that architecture would be defining the records in views. That has a lot of obvious downfalls. There is a 'SPARSE' column designator, but that just optimises storage for columns that have lots of NULLs, which may not even be the case here.
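To illustrate the point about SPARSE: it is purely a per-column storage optimisation, so it shrinks NULL-heavy columns but does nothing to reduce the number of generated rows. A hypothetical example, with illustrative names:

    -- SPARSE only optimises storage of NULLs in the Revenue column; every
    -- Period x Product x Location x Market x Customer row still has to exist.
    CREATE TABLE dbo.FactRevenueDense (
        PeriodKey   int   NOT NULL,
        ProductKey  int   NOT NULL,
        LocationKey int   NOT NULL,
        MarketKey   int   NOT NULL,
        CustomerKey int   NOT NULL,
        Revenue     money SPARSE NULL,   -- NULL where no real transaction exists
        RevenueYTD  money NOT NULL
    );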
If they use PowerBI for YTD, 12M, Annualized measures, what do I need to provide in the SQL DW, anything more than the facts?
You definitely need a comprehensive calendar table defining the fiscal year.
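For illustration, a minimal sketch of such a calendar table follows; the exact column set is an assumption, and a real one would usually carry many more attributes (week numbers, holidays, current-period flags, etc.):

    -- One row per calendar date; the fiscal columns drive YTD / fiscal YTD logic.
    CREATE TABLE dbo.DimDate (
        DateKey         int      NOT NULL PRIMARY KEY,  -- e.g. 20240315
        [Date]          date     NOT NULL,
        CalendarYear    smallint NOT NULL,
        CalendarMonth   tinyint  NOT NULL,
        FiscalYear      smallint NOT NULL,
        FiscalPeriod    tinyint  NOT NULL,
        FiscalYearStart date     NOT NULL   -- used for fiscal YTD boundaries
    );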
Is this something that SSAS Tabular solves, inherently?
If you only want to report by calendar periods (1st Jan to 31st Dec) then the built-in time intelligence is "inherent", but if you want to report by fiscal periods, the built-in time intelligence can't be used. Regardless, you still need to define the DAX calcs, and they can get really big.
First, SSAS Tabular and Power BI use the same engine. So they are equally applicable.
And the ability to define measures that can be calculated across any slice of the data, using any of a large number of categorical attributes, is one of the main reasons why you want something like SSAS Tabular or Power BI in front of your SQL Server. (The others are caching, simplified end-user reporting, the ability to mash up data across sources, and custom security.)
Ideally, SQL Server should provide the Facts, along with single-column joins to any dimension tables, including a Date Dimension table. Power BI / SSAS Tabular would then layer on the DAX Measure definitions, the Filter Flow behavior and perhaps Row Level Security.
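As a hypothetical illustration of that split, the DW-side fact might look like this (names and grain are assumptions); note there are no YTD/12M/Annualized columns, because those become DAX measures:

    -- Periodic snapshot fact: one row per period/product/location/market/customer.
    -- Single-column surrogate keys join to the dimension tables; derived
    -- measures are defined in the Tabular/Power BI model, not stored here.
    CREATE TABLE dbo.FactRevenueSnapshot (
        DateKey     int   NOT NULL,  -- period end date, joins to dbo.DimDate
        ProductKey  int   NOT NULL,  -- joins to dbo.DimProduct
        LocationKey int   NOT NULL,  -- joins to dbo.DimLocation
        MarketKey   int   NOT NULL,  -- joins to dbo.DimMarketSegment
        CustomerKey int   NOT NULL,  -- joins to dbo.DimCustomer
        Revenue     money NOT NULL
    );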
I have an 'Employee' dimension which is changed (modified) every day. I made monthly partitions in the cube and only run a full process on the current month's partition. Lately I found that the past months' aggregations will not be dropped. I tried 'ProcessUpdate' on this dimension and 'ProcessIndexes' on the partition, but it remained the same. I also tried the 'ProcessAffectedObjects' setting and 'ProcessIndexes' again, still the same; I tried both with lazy processing true and false, with no luck.
So my question is: how do I drop the stale aggregations for past months and rebuild them explicitly?
It is a distinct count measure, and no aggregations were designed via the wizard.
I tried dropping the indexes by using 'ProcessClearIndexes' in an XMLA command, which worked fine, and using 'ProcessIndexes' did rebuild the indexes and aggregations, as seen in the SSMS query execution messages.
So might it only be related to the distinct count, just because it is a non-additive measure?
"Non-additive measures create the following problems on a typical OLAP system:
Roll-ups are not possible. When pre-calculating results during cube processing, the system cannot deduce summaries from other summaries. All results must be calculated from the detail data. This situation places a heavy burden in processing time.
All results must be pre-calculated. With non-additive measures, there is no way to deduce the result for a higher-level summary query from one pre-calculated aggregation. Failure to pre-calculate the results in advance means that the results are not available. It is impossible to perform and maintain incremental updates to the system. A single transaction added to the cube usually invalidates huge portions of previously pre-calculated results. In order to recover from this, a complete recalculation is needed."
"Aggregations
As mentioned before, DISTINCT COUNTs are not additive (and this is the main reason why these measures are so problematic). Therefore, the aggregations, which are all derived from additive operators, are completely useless;"
Someone answered my question on MSDN:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/7302227f-11b8-4adc-98ff-72b6c395775b/ssas-update-a-dimension-wont-drop-aggregation-process-index-wont-rebuild-aggregation?forum=sqlanalysisservices
If you use materialized reference dimensions ensure you do ProcessFull to reprocess the fact tables again fully. The reason is that the join to the intermediate dimension happens in the measure group partition processing query:
http://sqlblog.com/blogs/alberto_ferrari/archive/2009/02/25/ssas-reference-materialized-dimension-might-produce-incorrect-results.aspx
In SQL Server 2008+, we'd like to enable tracking of historical changes to a "Customers" table in an operational database.
It's a new table and our app controls all writing to the database, so we don't need evil hacks like triggers. Instead we will build the change tracking into our business object layer, but we need to figure out the right database schema to use.
The number of rows will be under 100,000 and number of changes per record will average 1.5 per year.
There are at least two ways we've been looking at modelling this:
As a Type 2 Slowly Changing Dimension table called CustomersHistory, with columns for EffectiveStartDate, EffectiveEndDate (set to NULL for the current version of the customer), and auditing columns like ChangeReason and ChangedByUsername. Then we'd build a Customers view over that table which is filtered to EffectiveEndDate = NULL. Most parts of our app would query using that view, and only the parts that need to be history-aware would query the underlying table. For performance, we could materialize the view and/or add a filtered index on EffectiveEndDate = NULL. (A sketch of this option appears after the second option below.)
With a separate audit table. Every change to a Customer record writes once to the Customer table and again to a CustomerHistory audit table.
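For concreteness, a minimal T-SQL sketch of option #1; the table, column names, and data types are illustrative assumptions rather than a finished design:

    -- Type 2 history table: one row per version of a customer.
    CREATE TABLE dbo.CustomersHistory (
        CustomerHistoryId  int IDENTITY(1,1) PRIMARY KEY,
        CustomerId         int           NOT NULL,  -- durable business key
        Name               nvarchar(200) NOT NULL,
        -- ...other tracked attributes...
        EffectiveStartDate datetime2     NOT NULL,
        EffectiveEndDate   datetime2     NULL,       -- NULL = current version
        ChangeReason       nvarchar(400) NULL,
        ChangedByUsername  nvarchar(128) NOT NULL
    );
    GO
    -- Current-version view used by most of the app.
    CREATE VIEW dbo.Customers
    AS
    SELECT CustomerId, Name, EffectiveStartDate, ChangedByUsername
    FROM dbo.CustomersHistory
    WHERE EffectiveEndDate IS NULL;
    GO
    -- Filtered index so "current row" lookups stay cheap.
    CREATE UNIQUE INDEX IX_CustomersHistory_Current
        ON dbo.CustomersHistory (CustomerId)
        WHERE EffectiveEndDate IS NULL;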
From a quick review of StackOverflow questions, #2 seems to be much more popular. But is this because most DB apps have to deal with legacy and rogue writers?
Given that we're starting from a blank slate, what are pros and cons of either approach? Which would you recommend?
In general, the issue with SCD Type II is that if the average number of changes to the attribute values is very high, you end up with a very fat dimension table. This growing dimension table, joined with a huge fact table, gradually slows down query performance. It's like slow poisoning: initially you don't see the impact, and when you realize it, it's too late!
Now I understand that you will create a separate materialized view with EffectiveEndDate = NULL and that will be used in most of your joins. Additionally for you, the data volume is comparatively low (100,000). With average changes of only 1.5 per year, I don't think data volume / query performance etc. are going to be your problem in the near future.
In other words, your table is truly a slowly changing dimension (as opposed to a rapidly changing dimension, where your option #2 is a better fit). In your case, I would prefer option #1.
Every year we keep a historical copy of one of our cubes. This year someone decided they wanted to pay us money to add an attribute to the cube which did not previously exist. Fine, I like money, but the issue is we don't have a backup of the database that we built this cube from.
So a question arises in my head: do we need that original database to add a new attribute to this cube? Is it possible for us to add a new attribute to the cube and only process this attribute without having the cube's original data source?
Not having a great understanding of what is happening under the hood when I add an attribute to an SSAS cube and process it, I can't say whether this is or isn't possible. I could imagine that the cube has a snapshot in memory of the data source that it can work off of. I can also imagine that this would be ridiculously inefficient, so there is a chance there is no way in heck this is possible.
EDIT: It at least would seem feasible to add a calculated member that makes use of existing data in the cube.
I also should mention that I tried to add an attribute to such a cube and received an error:
"Dimension [Partner] cannot be saved File system error failed to copy
file C:\\MYSQLSERVER\OLAP\DATA\2013_Cube.db\\.dim\.dstore to C:\\MYSQLSERVER\OLAP\DATA\2013_Cube.db\\.dim\.dstore file exists"
Sorry I faked those filepaths a little.
This task is very difficult. The only way I can imagine would be to manually reconstruct the original database based on the Data Source View (it has cached metadata), and then try to generate the data to populate it using a SSAS query tool (e.g. Excel, SSRS, OLE DB Provider for Analysis Services).
If you want to add one attribute in a dimension, you might be able to limit that effort to the source data for the dimension in question.
First, let me explain how a cube stores the data, based on the steps of the process!
Get the data source data! That is, get access to the original databases/files, etc. At this point all the data are at the primary source. All data are normalized one way or the other.
Construct a data warehouse (the ELT process). At this point you combine all your data in a denormalized warehouse, without foreign keys or any constraints. All data are now in an intermediate state in a denormalized SQL database and ready to be used in the cube.
Construct the OLAP cube. The data warehouse is now your data source. All data are now aggregated in rows inside the cube with their corresponding values. The redundancy is enormous and the data are 100% denormalized; they hardly follow a pattern (of course they do, but it is not always easily understandable).
An example at this stage would be a row like this:
Company -> Department -> Room | Value(Employees)
ET LTD -> IT -> Room 4 | 4
The exact same row would exist for Value(Revenue).
So in essence all data exist inside the SSAS Database (The cube).
Reconstructing the Database would mean a Great Deal of reverse engineering.
You could write a new C# program that uses MDX connectors and queries to get the data out, and SQL Server connectors to save it into an OLTP database. MDX has a steep learning curve and few references on websites, so the above method is not advisable.
There is no way that I know of to get the data from Excel, as Excel gets the pivot table data dynamically from the data connection.
I have an SSAS cube in which one of my dimensions has 5 million records. When I try to view data for the dimension, the report or Excel pivot becomes lengthy and the performance is poor. I can't categorize that particular dimension's data. The only way I can think of to restrict the data is to select the top 10K rows from the dimension that have metric values. Apart from restricting it to the top 10K dimension records, can anyone please suggest other possibilities?
Have you set up aggregations? I would venture to guess that the majority of the time being spent getting your data to a viewing point has to do with your measures. If I were you, I would try adding aggregations or upping the aggregation percentage in order to alleviate some of the pressure at query time, by passing this workload to the processing time of the dimension/cube.
Generally, people set their aggregation levels at about 30% to start.
If you have done this already, I would think about upgrading your hardware on the server that your cube sits on. (depending on what you already have)
These are just suggestions as it could also be an issue in your cube design that is causing a lengthy runtime.
I would suggest you create a hierarchy for showing the 5 million records. Group by a substring in Level 1 (and, if required, by some more characters in Level 2), with the data falling under that group. For example:
Level 1 | Value
A | Apple
A | Ant
This would mean that you won't be showing all 5 million records at once, and it is now very effective to use aggregations too.
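If the grouping attributes don't already exist in the source, one hypothetical way to derive them is in a view (or in the data source view) feeding the dimension; the names here are illustrative:

    -- Derive Level1/Level2 grouping columns from the member name so the
    -- dimension can expose a Level1 -> Level2 -> Value hierarchy.
    CREATE VIEW dbo.vBigDimension
    AS
    SELECT
        LEFT(MemberName, 1) AS Level1,   -- e.g. 'A'
        LEFT(MemberName, 3) AS Level2,   -- optional finer grouping
        MemberName          AS [Value]
    FROM dbo.DimBigDimension;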