Dimensional and cube development data models - SSAS

I have created a dimensional model which is similar in structure to the financial reporting design in the AdventureWorksDW environment, where the value of each account is held in a single value column in the fact table and the dimensions give the data its semantic meaning.
There are over a thousand accounts in this model, each of which would otherwise be its own column, so the design works well for adding or removing accounts. Here is a really good blog post on this design: http://garrettedmondson.wordpress.com/2011/10/26/dimensional-modeling-financial-data-in-ssas/
Although this model works well for querying the dimensional model, and there are examples supporting this model for dimensional analysis, I'm concerned that this model is not standard for cube development or data mining which seem to prefer wider tables.
Questions:
Is this design categorized as Entity-Attribute-Value (EAV)?
Would a design using multiple fact tables be better? That is, several wide fact tables (up to 10) with 200-300 columns each, but fewer rows.
Should I expect more performance issues with the much wider tables?

You are right: that design is considered an EAV model.
By using such a design, you can easily add new accounts, hierarchies, etc. without updating your model.
I would not recommend the one-column-per-measure approach. Most accounts will be null in most of the rows. Also, with such a design you need to read all of your measures even if you only need to retrieve one of them.
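To make the "mostly null" point concrete, here is a minimal Python sketch that lays the same postings out both ways. The periods, departments, account names and amounts are all made up for illustration, and the grain (period and department) is an assumption, not something stated in the question.

```python
# Hypothetical postings: (period, department, account, amount).
postings = [
    ("2024-01", "Sales", "Revenue", 1200.0),
    ("2024-01", "Sales", "Travel", -150.0),
    ("2024-01", "IT", "Salaries", -800.0),
    ("2024-02", "IT", "Rent", -300.0),
]
accounts = ["Revenue", "Travel", "Salaries", "Rent"]  # imagine 1,000+ of these

# Narrow layout: one row per posting, one shared value column, no NULLs.
narrow_rows = list(postings)

# Wide layout: one row per (period, department), one column per account.
wide = {}
for period, dept, account, amount in postings:
    row = wide.setdefault((period, dept), dict.fromkeys(accounts))
    row[account] = amount

total_cells = len(wide) * len(accounts)
null_cells = sum(v is None for row in wide.values() for v in row.values())

print(f"narrow: {len(narrow_rows)} rows, every value populated")
print(f"wide:   {len(wide)} rows, {null_cells} of {total_cells} measure cells are NULL")
```

With four accounts two thirds of the wide cells are already empty; the proportion only grows as accounts are added.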
We use an account dimension heavily in our cubes. Unfortunately, things like shared members are not as easy to handle in SSAS as they are in Essbase.
You need to create an Account dimension that is parent-child, and you need the key of this account dimension in the fact table as usual.
By using an account dimension, you get nice support for time balance functionality. The built-in time balance functionality of SSAS is supposed to be faster than custom MDX code.
We are converting unary operators and parent-child relationships to formulas at the moment.
So basically we have normal formulas, and parents in hierarchies also work as formulas.
In the end we flatten the hierarchy, so it is not possible to drill down in the account dimension. We use the account dimension as a calculation engine only.
It is possible to have proper hierarchies as well, but we decided not to mix custom rollup members and unary operators at the same time.
Shared members and all our formulas are implemented as custom rollup members.
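For readers who have not met unary operators before, the sketch below shows the general idea of rolling up a parent-child account hierarchy where each child carries an operator. It is plain Python, not SSAS and not our actual implementation; the account names and values are hypothetical, and only the "+", "-" and "~" (ignore) operators are handled.

```python
# (account, parent, unary_operator); "~" means "leave out of the parent's total".
accounts = [
    ("Net Income", None, "+"),
    ("Revenue", "Net Income", "+"),
    ("Expenses", "Net Income", "-"),
    ("Rent", "Expenses", "+"),
    ("Salaries", "Expenses", "+"),
    ("Headcount", "Net Income", "~"),
]
# Leaf values as they would arrive from the fact table (made-up numbers).
leaf_values = {"Revenue": 1000.0, "Rent": 200.0, "Salaries": 500.0, "Headcount": 12.0}

children = {}
for name, parent, op in accounts:
    children.setdefault(parent, []).append((name, op))


def rollup(account):
    """Return an account's value, combining its children by unary operator."""
    if account not in children:  # leaf: take the value from the fact data
        return leaf_values.get(account, 0.0)
    total = 0.0
    for child, op in children[account]:
        value = rollup(child)
        if op == "+":
            total += value
        elif op == "-":
            total -= value
        # "~": the child is excluded from this parent's total
    return total


print(rollup("Expenses"))    # Rent + Salaries
print(rollup("Net Income"))  # Revenue - Expenses, Headcount ignored
```

Turning each parent into an explicit formula over its children, as described above, amounts to writing out what `rollup` computes for that parent.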

Related

Factless Fact Table, but with Facts?

Problem: I am working with a SaaS company that provides monthly services. We are trying to create a data model to track customer related metrics such as count, signups, cancellations, and reactivations. I’ve done extensive research online, but the closest I’ve found is accumulating snapshots with start/end dates, which doesn’t make sense with a SaaS company where a customer can reactivate an account.
My initial thought is to create a factless fact table for customers. However, this table would also have keys to event dimension tables, i.e. DimSignupType, DimCancellationType, DimReactivationType, etc., and boolean measures for isSignup, isCancellation, and isReactivation. I think this is counterintuitive because a factless fact table shouldn’t have facts, but I need to track those, and I feel multiple fact tables would be worse because I would have to join them together in the view.
Is there a better approach to this problem?
Edit based on feedback: The main goal is to create a dimensional model that is maintainable, but also one I can build a view over, together with other dimension tables, so that less technical users can discover insights with tools like Tableau. At the end of the day I need to provide a large flat view with multiple measures and dimensions that allows for easy analytical discovery. Common questions may be, "How many signups do we have MTD for this customer type vs. last MTD?", "How many cancellations did we have due to Non-Payment this month compared to last?", "How many reactivations from Non-Payment did we have this month compared to last?", etc. A lot of this metadata comes from dimension tables I would join to the factless fact table based on keys; however, it still requires Signups, Cancellations, and Reactivations to be tracked as facts for reporting purposes. So I don't know the best modelling approach that abides by traditional standards. It almost seems like a snapshot fact table that contains keys to dimension tables describing the events to be aggregated. I just don't know what that would be called.
I feel the most flexible solution in terms of data management and ease of use would be a factless fact table modeled in a daily snapshot manner with "facts" for signups, cancellations, and reactivations that link to types.
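To show what I have in mind, here is a rough sketch using Python's built-in sqlite3. The table and column names are hypothetical, and for brevity it collapses the three type dimensions into a single dim_event_type, which is a simplification of the design described above rather than a recommendation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_event_type (
    event_type_key INTEGER PRIMARY KEY, event TEXT, reason TEXT
);
CREATE TABLE fact_customer_event (
    date_key TEXT, customer_key INTEGER, event_type_key INTEGER,
    is_signup INTEGER, is_cancellation INTEGER, is_reactivation INTEGER
);
INSERT INTO dim_event_type VALUES
    (1, 'Signup', 'Web'),
    (2, 'Cancellation', 'Non-Payment'),
    (3, 'Reactivation', 'Non-Payment');
INSERT INTO fact_customer_event VALUES
    ('2024-01-05', 101, 1, 1, 0, 0),
    ('2024-01-20', 102, 1, 1, 0, 0),
    ('2024-02-03', 101, 2, 0, 1, 0),
    ('2024-02-25', 101, 3, 0, 0, 1);
""")

# Monthly counts: the 0/1 flags are additive, so SUM gives the event counts.
monthly = con.execute("""
    SELECT substr(date_key, 1, 7) AS month,
           SUM(is_signup), SUM(is_cancellation), SUM(is_reactivation)
    FROM fact_customer_event
    GROUP BY month
    ORDER BY month
""").fetchall()

# "How many cancellations due to Non-Payment in February?"
non_payment_cancellations = con.execute("""
    SELECT SUM(f.is_cancellation)
    FROM fact_customer_event f
    JOIN dim_event_type d ON d.event_type_key = f.event_type_key
    WHERE d.reason = 'Non-Payment' AND substr(f.date_key, 1, 7) = '2024-02'
""").fetchone()[0]

print(monthly)
print(non_payment_cancellations)
```

Because customer 101 cancels and later reactivates as two separate rows, the same customer can cycle through these events any number of times, which is the case the start/end-date accumulating snapshot does not handle well.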

SSAS Tabular - Multiple Models?

We're starting to build an SSAS tabular model and wondering if most people have one model or multiple. If multiple, do you duplicate tables that are needed by each, or is there a way to share tables between models? I think I know the answer, but I'm hoping those with more experience can confirm what we've found...
From what I've researched I think...
- you can't share tables across models - any "common" tables would have to be duplicated in and deployed with each model and would take up memory
- we should create one model, use perspectives to organize the tables and make it easier to work with
- multiple models could be acceptable if there is little or no common data across models
thanks

You are correct, there is no way to share tables between models.
Perspectives can help.
The question of whether to have one model or more depends on the user audience. Who are the users? How analytically savvy are they? Will they have a reasonable understanding of the model structure?
One issue that affects my rather unsophisticated users is when a dimension does not relate to all fact tables. In this scenario, as is expected, measures on the fact table calculate identical values for every member of the unrelated dimension. For less knowledgeable users, this situation is confusing.
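A toy illustration of that repeated-value effect, in plain Python rather than in a tabular model. The fact rows and dimension members are made up; the point is only that a dimension with no relationship to the fact table cannot filter it.

```python
# Hypothetical fact rows: (product, amount). Note there is no region key.
sales = [("Widget", 100.0), ("Gadget", 250.0), ("Widget", 50.0)]

products = ["Widget", "Gadget"]        # dimension related to the fact table
regions = ["North", "South", "East"]   # dimension with no relationship to it

# A related dimension filters the fact rows, so each member gets its own value.
by_product = {p: sum(a for prod, a in sales if prod == p) for p in products}

# An unrelated dimension filters nothing, so every member shows the grand total.
grand_total = sum(a for _, a in sales)
by_region = {r: grand_total for r in regions}

print(by_product)
print(by_region)
```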

I agree with Ari's answer, and am posting this answer to explain my own experience.
We use a few large models for more sophisticated users; these are held in memory and processed once a week. We have agreed with the business that these models will not be available during processing, so we are able to process them without holding a transaction open. That lets us keep many more smaller models, because we do not need to keep 2x the size of our largest model available to the instance. We use perspectives to simplify the presentation and reduce the confusion caused by the multiple fact tables. Even with perspectives, the models are rather complex, and it takes some training to get users used to working with the different facts.
We also use smaller models, usually more targeted to a specific audience/need. Many are processed daily, and use transactional processing to ensure they are available to users as much as possible. There are several dimensions that are used in several of our small models, but we are able to filter them so that users do not see the full list of members. This reduces size and has been a huge benefit for my users, because they only see members that have a fact they are analyzing instead of a list of every member associated with any fact.
We use views to ensure conformity between models when a dimension is used in multiple models. In my opinion this is very important, as it is very confusing when I have the same dimension with slightly different attribute names.
To sum up (pun intended)...
I like developing and working with large models. I think they answer more questions with less work.
Most users I have worked with prefer smaller, more concise models. Your server hardware/processing requirements may direct you to smaller models as well, even though some of the dimensions will be duplicated.

SSAS - data in three places?

New to DW concepts and SSAS. I'm reading a lot that normalized relational DBs are optimal for OLTP due to a typical workload of many one-transaction batches, and that denormalization is generally better for DW/BI applications because the queries used for reporting are more batch-based... there were other reasons that I don't recall right now.
It sounds like the advice is to create a denormalized model, populate it from the base relational model, and then build your cubes off the denormalized model. Assuming you're using the MOLAP storage type, your cube will store and incrementally update your data in a multidimensional model that it builds behind the scenes.
So now we have essentially the same data stored three times!
Am I reading that right? Why do we even need that intermediate denormalized table? It can't be to optimize report queries because those are being run against the multidimensional SSAS data store. Why not just build your cubes against a dsv whose definition is basically a view of the relational db?

The multidimensional model needs the relational model to be available as star schemas (that is what you call the "denormalized model") for loading the data. In many cases there is also some processing involved: combining data from different sources, keeping data for reporting longer than it is needed in the OLTP world, and keeping historical views, such as old regional or department structures, available for analysis even though they are no longer needed, and hence overwritten, in the OLTP world. Hence, this intermediate step makes sense in many cases. You might also want a clear cut-off in time, i.e. always report data for complete days (or, in some cases, months) rather than have some data for the last day available and some not. That makes comparing numbers between days easier than comparing, e.g., today's sales containing only the data up to 10 o'clock with the sales of the whole day yesterday.
In some simple cases, the intermediate relational data structure need not be available physically. A few days ago I prepared a prototype cube where the star schema was just a set of views on the source data. In this case, of course, the data was only physically available in the original source form and in the cube. The structure of the source data did not make the views that inefficient, and thus data loading to the cube was fast enough for the prototype.
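A small sketch of that "star schema as views" idea, using Python's built-in sqlite3 in place of the real source database. All table, view and column names are hypothetical and are not the ones from my prototype.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical normalized source tables.
CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, amount REAL
);

INSERT INTO city VALUES (1, 'Berlin', 'Germany'), (2, 'Lyon', 'France');
INSERT INTO customer VALUES (10, 'Acme', 1), (11, 'Globex', 2);
INSERT INTO orders VALUES
    (1, 10, '2024-03-01', 50.0),
    (2, 10, '2024-03-02', 70.0),
    (3, 11, '2024-03-02', 30.0);

-- The star schema exists only as views: one flattened dimension, one fact.
CREATE VIEW dim_customer AS
    SELECT c.id AS customer_key, c.name AS customer,
           ci.name AS city, ci.country AS country
    FROM customer c JOIN city ci ON ci.id = c.city_id;

CREATE VIEW fact_sales AS
    SELECT customer_id AS customer_key, order_date AS date_key, amount
    FROM orders;
""")

# What a cube load (or a report) would read from the "star schema".
by_country = con.execute("""
    SELECT d.country, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()

print(by_country)
```

Nothing is stored twice here, but every load re-runs the joins inside the views, which is why this only works when the source structure keeps those views reasonably cheap.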

Thoughts on dimension measures for BI

I am working with a consultant who recommends creating a measure dimension and then adding the measure dimension key to our fact table.
I can see how this can make adding new measures easier by just adding rows instead of physically creating columns in the fact table. I can also see how this can add work to the ETL process, adds another join to the star schema, one generic column in fact table to hold all measure data etc.
I'm interested in how others have dealt with this situation. We currently have close to twenty measures.

Instinctively, I don't like it: it's the EAV model, which is not very popular (you can Google the reasons why).
- The EAV model is generally considered to be a headache to query and maintain
- Different measures go together with different dimensions; this approach could easily turn into "one giant fact table for everything" instead of multiple smaller fact tables for specific reporting areas
- I suspect you would end up creating views to give the appearance of multiple fact tables anyway
- You will multiply the number of rows in your fact table by the number of measures, resulting in a much bigger physical table
- Even with a good indexing/partitioning scheme, queries that include more than one measure will have to read a lot more rows to get the data
- What about measures with different data types?
- Is this easily supported in your reporting tool?
I'm sure there are other issues, but those are the ones that come to mind immediately. As a rule of thumb, if someone suggests an EAV implementation in any context, you should be very wary and ask them exactly what advantages it offers and how it will be managed as the data and complexity increase. But I think you've already identified some key areas of concern.

SSAS will do this, and I know of a major vendor of insurance policy administration software that provided an M.I. solution for their system that works like this. You do get some flexibility from the approach in that you can add measures without having to deploy a build of the cube, although for 20 measures I don't think you need to worry about that.
'Measures' is essentially another dimension (and often referred to as such in the documentation). I believe SSAS uses a largely column-oriented structure behind the scenes.
However, a naive application of this approach does have some issues that could come and bite you to a greater or lesser extent.
- You only have one measure, [Value], [Amount] or whatever it's called. If your tool won't let you inject calculated measures at the front end, then you can't sort the whole data set on the value of one of your attribute types. ProClarity and Report Builder >= 2.0 will do this, but Excel won't.
- You can't do ratios or other calculated measures in this way. You will have to either embed them in the cube script (meaning you need to deploy a build to add them) or use a tool that lets you define them in the client.
- Although it doesn't make a lot of difference to the cube, it will be slow to query on the database and will increase storage requirements. It's also fiddly to query on the database.
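As an illustration of the "fiddly to query" point, here is roughly what a simple ratio looks like against a measure-dimension fact table, sketched with Python's built-in sqlite3. The table, the measure names and the numbers are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One generic value column; the measure name lives in its own column.
CREATE TABLE fact_narrow (date_key TEXT, measure TEXT, value REAL);
INSERT INTO fact_narrow VALUES
    ('2024-01', 'Sales', 1000.0), ('2024-01', 'Cost', 600.0),
    ('2024-02', 'Sales', 1200.0), ('2024-02', 'Cost', 900.0);
""")

# A margin ratio needs both measures side by side, so the rows have to be
# pivoted back into columns with conditional aggregation first.
margin = con.execute("""
    SELECT date_key,
           (SUM(CASE WHEN measure = 'Sales' THEN value END)
            - SUM(CASE WHEN measure = 'Cost' THEN value END))
           / SUM(CASE WHEN measure = 'Sales' THEN value END) AS margin_pct
    FROM fact_narrow
    GROUP BY date_key
    ORDER BY date_key
""").fetchall()

for date_key, margin_pct in margin:
    print(date_key, round(margin_pct, 2))
```

With one column per measure the same ratio would be a plain `(sales - cost) / sales` over a single row.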

Is OLAP/MDX a good way to process data w/ unknown values at various aggregation levels

I'm new to OLAP, so perhaps I don't know the right terminology to use for this question, but bear with me here.
I work with lots of hierarchical, multidimensional data where parent/aggregated cells mostly have data, but child/leaf cells are often missing data (attribute values are unknown but non-zero). I currently use a combination of scripting and SQL to work with it, but that's getting unwieldy. It seems like OLAP cubes and MDX are better suited to the structure of the data, but not necessarily to tasks I need to do with it. For example:
- OLAP seems mainly designed for read-only reporting; I do a lot of modifications to the data in batch processes
- OLAP seems to like having complete leaf-level data to calculate aggregates; my data has missing values at various levels
Examples of what I want to do:
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
- Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
- Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
- Queries and calculations need to be able to handle the unknowns properly. Ideally be able to easily query how much of an aggregated cell's value is made up of estimated vs. known values, possibly compute confidence/error statistics, or check whether we can derive an exact value for an unknown when it has a known parent and all known siblings, etc.
- Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
Could an OLAP server and MDX be a good tool for this type of work? Are there any other tools that would work well for manipulating hierarchical/multidimensional/gap-filled data?
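To give a feel for the simplest of those checks (an unknown with a known parent and all known siblings), here is a minimal Python sketch for a single parent in a single hierarchy. The function name and the data are hypothetical, and it deliberately ignores the much harder multi-dimensional case.

```python
def fill_single_unknown(parent_value, children):
    """Derive the one missing child value, if exactly one child is missing.

    `children` maps child name -> value, with None marking an unknown.
    Returns a new dict; the input is left untouched.
    """
    unknown = [name for name, value in children.items() if value is None]
    if parent_value is None or len(unknown) != 1:
        return dict(children)  # nothing can be derived exactly
    known_sum = sum(v for v in children.values() if v is not None)
    filled = dict(children)
    filled[unknown[0]] = parent_value - known_sum
    return filled


# One unknown sibling under a known parent: the gap is fully determined.
print(fill_single_unknown(100, {"a": 30, "b": None, "c": 50}))

# Two unknowns: only their sum is known, so both stay as None.
print(fill_single_unknown(100, {"a": 30, "b": None, "c": None}))
```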

Those are some interesting and challenging requirements for an OLAP system :-)
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
You can change the way cubes aggregate values in a hierarchy. Doing this in one hierarchy is fine; doing it in multiple hierarchies might start to get complicated. It's worth double-checking whether there is a mathematically 'unique' solution to the problem with multiple 'special' hierarchies.
- Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
Here you can use writeback (the MDX UPDATE CUBE statement), but I think it's a bit too simple for your needs. The implementation depends on the vendor. Be careful: creating cells can kill your memory, since for large cubes you can quickly have millions of cells in a subcube.
What is the sparsity of your model? Sparsity = number of cells with data / total number of cells.
Some models have sparsities of 1e-30; with those it's easy to explode if you're updating all cells ;-).
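As a back-of-the-envelope check, the sparsity can be estimated from the dimension sizes and the fact row count. The figures below are invented for illustration and assume one populated cell per fact row.

```python
from math import prod

# Hypothetical leaf-level member counts per dimension.
dimension_sizes = {"product": 5_000, "customer": 200_000, "day": 3_650, "store": 400}
populated_cells = 40_000_000  # assumed: roughly one cell per fact table row

total_cells = prod(dimension_sizes.values())
sparsity = populated_cells / total_cells

print(f"total cells: {total_cells:.3e}")
print(f"sparsity:    {sparsity:.3e}")
```

Even this modest four-dimension example has on the order of 1e15 possible cells, so anything that materializes a value in every cell of a subcube will run out of memory long before it finishes.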
- Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
This looks complicated. The issues here are the complexity of the algorithms, whether they can be expressed in MDX, and how well they fit the OLAP engine (i.e. whether it is fast enough). You're taking the risk that it explodes, but have a look at the SCOPE statement.
- Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
That should not be a real challenge.
To answer your question, I don't think so. We have a similar problem, in the genetics field, and we are going to solve it by adding a dedicated calculation module to our OLAP solution. It's an interesting ongoing project.
