SSAS Tabular - Multiple Models?

We're starting to build an SSAS tabular model and wondering if most people have one model or multiple. If multiple, do you duplicate tables that are needed by each, or is there a way to share tables between models? I think I know the answer, but I'm hoping those with more experience can confirm what we've found...
From what I've researched I think...
- you can't share tables across models - any "common" tables would have to be duplicated in and deployed with each model and would take up memory
- we should create one model, use perspectives to organize the tables and make it easier to work with
- multiple models could be acceptable if there is little or no common data across models
thanks

You are correct, there is no way to share tables between models.
Perspectives can help.
The question of whether to have one model or more depends on the user audience. Who are the users? How analytically savvy are they? Will they have a reasonable understanding of the model structure?
One issue that affects my rather unsophisticated users is when a dimension does not relate to all fact tables. In this scenario, as is expected, measures on the fact table calculate identical values for every member of the unrelated dimension. For less knowledgeable users, this situation is confusing.

I agree with Ari's answer, and am posting this answer to explain my own experience.
We use a few large models for more sophisticated users that are in memory and processed once a week. We have agreed with the business that these models will not be available during processing, so we are able to process without holding a transaction open, which lets us keep many more smaller models because we do not need to keep 2x the size of our largest model available to the instance. We use perspectives to simplify the presentation and reduce the confusion around the multiple fact tables. Even with perspectives, the models are rather complex, and it takes some training to get users used to working with the different facts.
We also use smaller models, usually more targeted to a specific audience/need. Many are processed daily, and use transactional processing to ensure they are available to users as much as possible. There are several dimensions that are used in several of our small models, but we are able to filter them so that users do not see the full list of members, which reduces size and has been a huge benefit for my users: they only see members that have a fact they are analyzing instead of a list of every member associated with any fact.
We use views to ensure conformity between models when a dimension is used in multiple models. In my opinion this is very important, as it is very confusing when the same dimension appears with slightly different attribute names in different models.
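For illustration, here is a minimal sketch of the kind of views we mean; the table and column names are hypothetical, not our actual schema. One view defines the dimension once with fixed attribute names, and a model-specific view filters it down to members that actually have facts.

```sql
-- Conformed customer dimension: defined once, imported by every model that needs it
-- (hypothetical names throughout).
CREATE VIEW dbo.vDimCustomer AS
SELECT
    c.CustomerKey,
    c.CustomerName,   -- attribute names are fixed here, so every model sees the same ones
    c.Region
FROM dbo.DimCustomer AS c;
GO

-- Model-specific view: only members that appear in this model's fact table,
-- which keeps the in-memory dimension small and hides irrelevant members from users.
CREATE VIEW dbo.vDimCustomer_Sales AS
SELECT d.CustomerKey, d.CustomerName, d.Region
FROM dbo.vDimCustomer AS d
WHERE EXISTS (SELECT 1 FROM dbo.FactSales AS f WHERE f.CustomerKey = d.CustomerKey);
GO
```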
To sum up (pun intended)...
I like developing and working with large models. I think they answer more questions with less work.
Most users I have worked with prefer smaller, more concise models. Your server hardware/processing requirements may direct you to smaller models as well, even though some of the dimensions will be duplicated.

Related

Dimensional and cube development data models

I have created a dimensional model which is similar in structure to the financial reporting design in the AdventureworksDW environment, where the value of each account is held as a single value column in the fact table and the dimensions give the data its semantic meaning.
There are over a thousand columns represented in this model, so it works well for adding or deleting columns. Here is a really good blog on this design: http://garrettedmondson.wordpress.com/2011/10/26/dimensional-modeling-financial-data-in-ssas/
Although this design works well for querying the dimensional model, and there are examples supporting it for dimensional analysis, I'm concerned that it is not standard for cube development or data mining, which seem to prefer wider tables.
Questions:
Is this design categorized as Entity-Attribute-Value (EAV)?
Would a design using multiple fact tables be better? That is, several wide fact tables (up to 10) with 200-300 columns each, but fewer rows.
Should I expect more performance issues with the much wider tables?
You are right, that specific design is considered an EAV model.
With such a design you can easily add new accounts, hierarchies, etc. You don't need to update your model.
I would not recommend the one-column-per-measure approach. Most accounts will be null in most of the rows. Also, with such a design, you need to read all of your measures even if you need to retrieve only one of them.
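To make the trade-off concrete, here is a rough sketch of the two shapes being compared; all table and column names are hypothetical. The account-per-row design only touches the rows for the account you ask for, while the wide design carries mostly-NULL columns and, without a covering index, reads whole rows back even when you want a single measure.

```sql
-- Account-per-row ("EAV"-style) fact: a new account is just a new row.
CREATE TABLE dbo.FactFinance_Narrow (
    DateKey    int            NOT NULL,
    OrgKey     int            NOT NULL,
    AccountKey int            NOT NULL,   -- which measure this row holds
    Amount     decimal(19, 4) NOT NULL
);

-- One-column-per-measure fact: adding an account means altering the table.
CREATE TABLE dbo.FactFinance_Wide (
    DateKey      int            NOT NULL,
    OrgKey       int            NOT NULL,
    GrossRevenue decimal(19, 4) NULL,     -- most of these columns are NULL on most rows
    NetRevenue   decimal(19, 4) NULL,
    Depreciation decimal(19, 4) NULL
    -- ... potentially hundreds more columns
);

-- Retrieving a single account in each design.
SELECT SUM(Amount)
FROM dbo.FactFinance_Narrow
WHERE AccountKey = 42;        -- only the matching rows are needed

SELECT SUM(NetRevenue)
FROM dbo.FactFinance_Wide;    -- a scan reads entire wide rows to return one column
```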
We heavily use the account dimension in our cubes. Unfortunately, things like shared members are not as easy to handle in SSAS as they are in Essbase.
You need to create an Account dimension that is parent-child, and you also need to have the key of this account dimension in the fact table as usual.
By using the account dimension, you get nice support for time balance functionality. The time balance functionality of SSAS is supposed to be faster than custom MDX code.
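A rough sketch of what that usually looks like in the relational source; the names are hypothetical, not the actual AdventureWorksDW columns. The Account dimension is self-referencing (parent-child), and the fact table carries the account key as usual.

```sql
-- Parent-child Account dimension: each account points at its parent account.
CREATE TABLE dbo.DimAccount (
    AccountKey        int           NOT NULL PRIMARY KEY,
    ParentAccountKey  int           NULL REFERENCES dbo.DimAccount (AccountKey),
    AccountName       nvarchar(100) NOT NULL,
    UnaryOperator     char(1)       NOT NULL DEFAULT '+',   -- how the member rolls up to its parent
    AggregateFunction nvarchar(20)  NOT NULL DEFAULT 'Sum'  -- e.g. Sum, LastNonEmpty (time balance)
);

-- Fact rows reference the posting-level account as usual.
CREATE TABLE dbo.FactGeneralLedger (
    DateKey    int            NOT NULL,
    AccountKey int            NOT NULL REFERENCES dbo.DimAccount (AccountKey),
    Amount     decimal(19, 4) NOT NULL
);
```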
We are converting unary operators and parent-child relationships to formulas at the moment.
So basically we have normal formulas, and parents in hierarchies also work as formulas.
At the end we are flattening the hierarchy, so it is not possible to drill down in the account dimension. We are using the account dimension as a calculation engine only.
It is possible to have proper hierarchies as well, but we decided not to mix custom rollup members and unary operators at the same time.
Shared members and all our formulas are implemented as custom rollup members.

Where should I begin with this database design?

I have 5 tables, all unnormalised, and I need to create an ER model, a logical model, normalise the data, and also write a bunch of queries.
Where would you begin? Would you normalise the data first? Create the ER model and the relationships?
There are two ways to start data modelling: top-down and bottom-up.
The top-down approach is to ask what things (tangible and intangible) are important to your system. These things become your entities. You then go through the process of figuring out how your entities are related to each other (your relationships) and then flesh out the entities with attributes. The result is your conceptual or logical model. This can be represented in ERD form if you like.
Either along the way or after your entities, relationships and attributes are defined, you go through the normalization process and make other implementation decisions to arrive at your physical model - which can also be represented as an ERD.
The bottom-up approach is to take your existing relations - i.e. whatever screens, reports, datastores, or other existing data representations you have - and then perform a canonical synthesis to reduce the entire set of data representations into a single, coherent, normalized model. This is done by normalizing each of your views of data and looking for commonalities that let you bring items together into a single model.
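As a tiny, hypothetical illustration of that synthesis for a single denormalised view of data (all names are made up): a flat order listing that repeats customer and product details on every row reduces to the entities it implies.

```sql
-- Unnormalised input (conceptually): one row per order line with everything repeated:
-- (CustomerName, CustomerCity, ProductName, ProductPrice, OrderDate, Quantity)

-- After normalization, the repeated groups become their own relations.
CREATE TABLE Customer (
    CustomerId   int PRIMARY KEY,
    CustomerName varchar(100),
    CustomerCity varchar(100)
);

CREATE TABLE Product (
    ProductId    int PRIMARY KEY,
    ProductName  varchar(100),
    ProductPrice decimal(10, 2)
);

CREATE TABLE CustomerOrder (
    OrderId    int PRIMARY KEY,
    CustomerId int REFERENCES Customer (CustomerId),
    OrderDate  date
);

CREATE TABLE OrderLine (
    OrderId   int REFERENCES CustomerOrder (OrderId),
    ProductId int REFERENCES Product (ProductId),
    Quantity  int,
    PRIMARY KEY (OrderId, ProductId)
);
```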
Which approach you use depends a little bit on personal choice, and quite a bit on whether you have existing data views to start from.
I think you should first prepare the list of entities and attributes, so that you get a complete picture of the data.
Once you are clear on the data, you can start creating the master tables and normalizing them.
Then, after the complete database is designed with normalization, you can create the ER diagram very easily.
I would start by evaluating the data and then preparing the list of entities and attributes within it.
I would do it in this order:
- Identify the relationships.
- Create the ER model.
- Normalise the data.
I know many others will have a different opinion but this is the way I would go ahead with it :)

Thoughts on dimension measures for BI

I am working with a consultant who recommends creating a measure dimension and then adding the measure dimension key to our fact table.
I can see how this can make adding new measures easier by just adding rows instead of physically creating columns in the fact table. I can also see how this adds work to the ETL process, adds another join to the star schema, and forces one generic column in the fact table to hold all measure data, etc.
I'm interested in how others have dealt with this situation. We currently have close to twenty measures.
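For concreteness, here is a rough sketch of the proposed layout (hypothetical names, not our actual schema) and the pivoting it forces on queries; with twenty measures, each source row becomes up to twenty rows here, one per populated measure.

```sql
-- The proposed measure dimension: one row per measure instead of one fact column.
CREATE TABLE dbo.DimMeasure (
    MeasureKey  int           NOT NULL PRIMARY KEY,
    MeasureName nvarchar(100) NOT NULL       -- 'Sales Amount', 'Order Count', ...
);

-- Generic fact: MeasureKey says what MeasureValue means.
CREATE TABLE dbo.FactGeneric (
    DateKey      int            NOT NULL,
    ProductKey   int            NOT NULL,
    MeasureKey   int            NOT NULL REFERENCES dbo.DimMeasure (MeasureKey),
    MeasureValue decimal(19, 4) NULL         -- every measure forced into one data type
);

-- Even a simple report now needs the extra join and a pivot back to columns.
SELECT f.DateKey,
       SUM(CASE WHEN m.MeasureName = 'Sales Amount' THEN f.MeasureValue END) AS SalesAmount,
       SUM(CASE WHEN m.MeasureName = 'Order Count'  THEN f.MeasureValue END) AS OrderCount
FROM dbo.FactGeneric AS f
JOIN dbo.DimMeasure  AS m ON m.MeasureKey = f.MeasureKey
GROUP BY f.DateKey;
```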
Instinctively, I don't like it: it's the EAV model, which is not very popular (you can Google the reasons why).
- The EAV model is generally considered to be a headache to query and maintain.
- Different measures go together with different dimensions; this approach could easily turn into "one giant fact table for everything" instead of multiple smaller fact tables for specific reporting areas.
- I suspect you would end up creating views to give the appearance of multiple fact tables anyway.
- You will multiply the number of rows in your fact table by the number of measures, resulting in a much bigger physical table.
- Even with a good indexing/partitioning scheme, queries that include more than one measure will have to read a lot more rows to get the data.
- What about measures with different data types?
- Is this easily supported in your reporting tool?
I'm sure there are other issues, but those are the ones that come to mind immediately. As a rule of thumb, if someone suggests an EAV implementation in any context, you should be very wary and ask them exactly what advantages it offers and how it will be managed as the data and complexity increase. But I think you've already identified some key areas of concern.
SSAS will do this, and I know of a major vendor of insurance policy administration software that provided an M.I. solution for their system that works like this. You do get some flexibility from the approach, in that you can add measures without having to deploy a build of the cube, although for 20 measures I don't think you need to worry about that.
'Measures' is essentially another dimension (and often referred to as such in the documentation). I believe SSAS uses a largely column-oriented structure behind the scenes.
However, a naive application of this approach does have some issues that could come and bite you to a greater or lesser extent.
You only have one measure, [Value], [Amount] or whatever it's called. If your tool won't let you inject calculated measures at the front-end then you can't sort the whole data set on the value of one of your attribute types. ProClarity and report builder >=2.0 will do this but Excel won't.
You can't do ratios or other calculated measures in this way. You will have to either embed them in the cube script (meaning you need to deploy a build to add them) or use a tool that lets you define them in the client.
Although it doesn't make a lot of difference to the cube, it will be slower to query on the database and will increase storage requirements. It's also fiddly to query on the database directly.

Is OLAP/MDX a good way to process data w/ unknown values at various aggregation levels

I'm new to OLAP, so perhaps I don't know the right terminology to use for this question, but bear with me here.
I work with lots of hierarchical, multidimensional data where parent/aggregated cells mostly have data, but child/leaf cells are often missing data (attribute values are unknown but non-zero). I currently use a combination of scripting and SQL to work with it, but that's getting unwieldy. It seems like OLAP cubes and MDX are better suited to the structure of the data, but not necessarily to tasks I need to do with it. For example:
- OLAP seems mainly designed for read-only reporting; I do a lot of modifications to the data in batch processes
- OLAP seems to like having complete leaf-level data to calculate aggregates; my data has missing values at various levels
Examples of what I want to do:
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
- Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
- Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
- Queries and calculations need to be able to handle the unknowns properly. Ideally be able to easily query how much of an aggregated cell's value is made up of estimated vs. known values, possibly compute confidence/error statistics, or check whether we can derive an exact value for an unknown when it has a known parent and all known siblings, etc.
- Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
Could an OLAP server and MDX be a good tool for this type of work? Are there any other tools that would work well for manipulating hierarchical/multidimensional/gap-filled data?
Those are quite some requirements for an OLAP system - interesting and challenging :-) Taking them one by one:
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
You can change the way cubes aggregate values in a hierarchy. Doing this in one hierarchy is fine; doing it in multiple hierarchies might start to get complicated. It's worth checking twice whether there is a mathematically 'unique' solution to the problem with multiple 'special' hierarchies.
- Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
Here you can use writeback (the MDX UPDATE CUBE statement), but I think it's a bit too simple for your needs. Implementations depend on the vendor. Pay attention: creating cells can kill your memory, as for large cubes you can quickly have millions of cells in a subcube.
What is the sparsity of your model (number of cells with data / total number of cells)? Some models have sparsities of 1e-30; with figures like that, it's easy to explode if you're updating all cells ;-).
- Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
This is looking complicated. The issues here are the complexity of the algorithms, whether a solution can be expressed in the MDX language, and how well it matches the OLAP engine (fast enough). You're taking the risk that it explodes, but have a look at the SCOPE statement.
- Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
That should not be a real challenge.
To answer your question, I don't think so. We have a similar problem - in the field of genetics - and we are going to solve it by adding a dedicated calculation module to our OLAP solution. It's an interesting ongoing project.

Efficient Ad-hoc SQL OLAP Structure

Over the years I have read a lot of people's opinions on how to get better performance out of their SQL (Microsoft SQL Server, just so we are all on the same page...) queries. However, they all seem to be tightly tied to either a high-performance OLTP setup or a data warehouse OLAP setup (cubes-galore...). However, my situation today is kind of in the middle of the 2, hence my indecision.
I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating to just the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating to just the sites. Basically it’s a pretty big flattened table or view… Most of the columns are int, bit, char(3), or short varchar(s). My problem is that a good portion of these columns are available to be used in ad-hoc queries by the user, and those queries need to return as quickly as possible because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast… This data is read-only; the data doesn’t change at all during the day and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation as an OLAP database with the read requirements of an OLTP database.
I see 3 options:
1. Break the table into smaller divisible units and sub-query everything.
2. Make one flat table and really go to town on the indexing.
3. Create an OLAP cube and sub-query the rest based on whatever filter values I don’t put as the cube dimensions.
I have not done much with OLAP cubes, so I frankly don’t even know if the third is an option, but from what I’ve done with them in the past I think it might be. Also, just to clarify, what I mean when I say “sub-query everything” is that instead of having a WHERE clause on the outer select, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed, to eliminate a really large Cartesian product. As for the second option of the one large table, I have heard and seen conflicting results with that approach: it saves on joins, but at the same time a table scan takes much longer.
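To show what I mean by “sub-query everything”, here is a hedged sketch with made-up column names: each table gets its own filter inside a derived table before the INNER JOINs, so the join only sees already-reduced sets.

```sql
-- Filter each side first, then join the reduced sets (hypothetical filter columns).
SELECT c.ContactID, c.LastName, s.SiteID, s.SiteName
FROM (
    SELECT ContactID, LastName
    FROM dbo.Contacts
    WHERE State = 'TX'             -- contact-level filters applied here
) AS c
INNER JOIN dbo.SiteContacts AS sc
        ON sc.ContactID = c.ContactID
INNER JOIN (
    SELECT SiteID, SiteName
    FROM dbo.Sites
    WHERE SiteType = 'Retail'      -- site-level filters applied here
) AS s
        ON s.SiteID = sc.SiteID;
```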
Ideas anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea if that’s the case, I’m new to that stuff too.
Thanks in advance to any and all opinions and help with this dilemma I’ve found myself in.
You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.
In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though. The fact may be the association of sites to contacts.
The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
It wouldn't be a very big problem to add numerous indexes to your architecture. The database is mostly read only, except for the refresh time. You won't have to worry about read performance while indexes are being updated. So, the architecture can accommodate all indexes that are needed (as long as you can dedicate enough downtime to refresh the data).
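A minimal sketch of that star layout, with hypothetical names; one plausible fact grain here is the association of a contact with a site.

```sql
-- Dimension tables describe the things being analysed.
CREATE TABLE dbo.DimContact (
    ContactKey  int           NOT NULL PRIMARY KEY,
    ContactName nvarchar(100) NOT NULL,
    State       char(2)       NULL
    -- ... the remaining contact attributes
);

CREATE TABLE dbo.DimSite (
    SiteKey  int           NOT NULL PRIMARY KEY,
    SiteName nvarchar(100) NOT NULL,
    SiteType varchar(20)   NULL
    -- ... the remaining site attributes
);

-- Fact table at the grain of one contact associated with one site.
CREATE TABLE dbo.FactSiteContact (
    SiteKey      int  NOT NULL REFERENCES dbo.DimSite (SiteKey),
    ContactKey   int  NOT NULL REFERENCES dbo.DimContact (ContactKey),
    AssociatedOn date NULL,
    PRIMARY KEY (SiteKey, ContactKey)
);

-- Since the data is read-only between refreshes, indexes on the common
-- filter columns cost little to keep.
CREATE INDEX IX_DimContact_State ON dbo.DimContact (State);
CREATE INDEX IX_DimSite_SiteType ON dbo.DimSite (SiteType);
```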
I agree with bobs' answer: put an OLAP front end on it and query through the cube. The reason this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates by multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.
The relational data underneath the cube will be great for detail drill-ins to find the individual facts that make up a certain aggregate value. But querying the relational data directly will always be slow, because the aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.
OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.
What are the most common queries? For single pieces of data or aggregates?
If the granularity of SiteContacts is pretty close to that of Contacts (i.e. circa 3 million records - most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).
On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.
OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.
Star schemas consist of fact tables with dimensions hanging off them - depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions with a factless fact table (sounds like an oxymoron, but is covered in Kimball's methodology) linking them.