I've been struggling with a scenario for a while which I want to solve properly. In the image below I've created a model which contains the relevant tables from a more complex data model.
The requirement
Calculate this:
DIVIDE
SUM(Fact_Marketvalue.Marketvalue) x SUM(Fact_Rating.Carbonemission)
where the Fact_Ratings.Carbonemission > 5)
BY
SUM(Fact_Rating.Carbonemission)
where the Fact_Ratings.Carbonemission > 5
The result should by sliceable by Dim_Category, Dim_Customer, Dim_Security and more dimension tables in my model.
Right now I created many-many relationships between Fact_MarketValue and Fact_Ratings on the Security ID, but I want to avoid this relationship. It does compute the correct result, but at query time in the report it is quite slow, and the model becomes complex.
What would be the best thing to do? Create a view in SQL which holds all information needed? Or should I create a bridge table, if so, with what information? Or maybe I am missing some DAX magic..
I have tried solving this by creating the following DAX Measure. But I am not happy with the many-many relationship. It does compute the correct result but it it is slow and complex in my model.
Related
I am testing Pentaho 5.3 with Impala. My schema defines a dimension by relating it to its dimension table. However, when I try to use the Filter mondrian tries to obtain the dimensions from the fact table and since the table it is big it takes a long time just to load the dimensions to filter by. I do use the approxRowCount in my dimension definition.
I also have an instillation of Pentaho 5.0 with PG using a similar dataset and exact same schema and when I use the filter dimensions are loaded instantaneously. So it seems to me the issue is not schema related.
Could anyone tell me if this behavior (when trying to use Filter mondrian aggregating dimension data from the fact table instead of dimension table) is due to Pentaho settings or what could cause it?
Thank you in advance!
If anyone else ever wonders about this behavior it could be related to the fact that joins are much more efficient in PG than a NoSql database like Impala. I though that when using the Filter mondrian obtains the dimensions from the dimension table if provided with one. However, it seems it does a join of dimension and fact table before displaying dimensions.
Hope that helps!
Though I have relatively good exposer in SQL Server, but I am still a newbie in SSAS.
We are to create set of reports in SSRS and have the Data source as SSAS CUBE.
Some of the reports involves data from atleast 3 or 4 tables and also would involve Grouping and all possible things from SQL Environment (like finding the max record for a day and joining that with 4 more tables and apply filtering logic on top of it)
So the actual question is should I need to have these logics implemented in Cubes or have them processed in SQL Database (using Named Query in SSAS) and get the result to be stored in Cube which would be shown in the report? I understand that my latter option would involve creation of more Cubes depending on each report being developed.
I was been told to create Cubes with the data from Transaction Tables and do entire logic creation using MDX queries (as source in SSRS). I am not sure if that is a viable solution.
Any help in this would be much appreciated; Thanks for reading my note.
Aru
EDIT: We are using SQL Server 2012 version for our development.
OLAP cubes are great at performing aggregations of data, effectively grouping over the majority of columns all at once. You should not strive to implement all the grouping at the named query or relational views level as this will prevent you from being able to drill down through the data in the cube and will result in unnecessary overhead on the relational database when processing the cube.
I would start off by planning to pull in the most granular data from your relational database into your cube and only perform filtering or grouping in the named queries or views if data volumes or processing time are a concern. SSAS will perform some default aggregations of the data to allow for fast queries at the most grouped level.
More complex concerns such as max(someColumn) for a particular day can still be achieved in the cube by using different aggregations, but you do get into complex scenarios if you want to aggregate by one function (MAX) only to the day level and then by another function across other dimensions (e.g. summing the max of each day per country). In that case it may well be worth performing the max-per-day calculation in a named query or view and loading that into its own measure group to be aggregated by SUM after that.
It sounds like you're at the beginning of the learning path for OLAP, so I'd encourage you to look at resources from the Kimball Group (no affiliation) including, if you have time, the excellent book "The Data Warehouse Toolkit". At a minimum, please look into Dimensional Modelling techniques as your cube design will be a good deal easier if you produce a dimensional model (likely a star schema) in either views or named queries.
I would look at BISM Tabular if your model is not complicated. It compresses and stores data in memory. As for data processing I would suggest to keep all calculations and grouping in database layer (create views).
All the calculations and grouping should be done at database level atleast in form of views.
There are mainly two ways to store data (MOLAP and ROLAP). Use MOLAP storage model for deal with tables that store transactions kind of data.
The customer's expectation from transaction data (from my experience) would be to understand the sales based upon time dimension. Something like Get total sales in last week or last quarter. etc.
MDX scripts are basically kind of SQL scripts that Cube can understand. No logic should be there. based upon what Parameters are chosen in SSRS report, MDX query should be prepared. Small analytical functions such as subtotal, average can be dome by MDX but not complex calculations.
We have a problem with a Fact table in our cube.
We know it is not the best practice of developing a Dimensional db but we have dimension and fact table combined in 1 table.
This is because Dimensional data isn't much (5 fields). But moving on to the problem.
We have added this table to our cube and for testing we added 1 measure(count of rows). . Like the image says we have the grand total for every sub category this isn't correct.
.
Does anyone have an idea where we have to look for the problem.
Kind regards,
Phoenix
You have not defined a relationship between your sub category dimension and your fact table. That has resulted in the full count mapping to all of the sub category attributes, hence the same value repeating
Add a relation between cube measure group and your dimension on the second tab (Dimension Usage) in cube (It's 'regular' and key-level on both sides in most cases).
If this relation is exist try to recreate it again. Sometimes it happens after several manual changes in 'advanced' mode.
Check dimension mapping in fact table. If everything is ok, try to add new dimension with only one level the first time, than add another one etc. I know it sounds like shaman tricks but still...
And always use SQL Server Profiler on both servers (SQL, SSAS) to capture exact query that returns wrong value. Maybe the mistake is somewhere else.
I think I know what a domain table is (it basically contains all the possible values that some other column can contain), and I've looked up dimension table in Wikipedia. Unfortunately, I'm having a hard time understanding the description they have there, because they explain it in terms of another piece of jargon: "fact table", which is explained to "consists of the measurements, metrics or facts of a business process." To me, that's very tautological, which is not helpful. Can someone explain this in plain English?
Short version:
Domains represent data you've pulled out of your fact table to make the fact table smaller.
Dimensions represent axes that you've pre-aggregated along for faster querying.
Here is a long version in plain English:
You start with some set of facts. For instance every single sale your company has received, with date, product, price, geographical location, customer name - whatever your full combination of information might be - for each sale. You can put these facts into a huge table.
A large variety of queries that you want to run are in principle some fairly simple query on your fact table. However, your fact table is freaking huge. You need to make the queries faster.
(1) The first trick to making it faster is to move data out of it so it is smaller. So you can take every column that is "long text", put its possible values into a domain table, and replace the original column with an id into that table. This will make your fact table much smaller, and you can still get at your original data if you need it. This makes it much faster to query all rows since they take up less data.
That's fine if you have a small enough data set that querying the whole fact table is acceptably fast. But a lot of companies have too much data for this to be enough, so they have to be smarter.
(2) The second trick to making it faster is to pre-compute queries. Here is one way to do this. Identify a set of dimensions, and then pre-compute along dimensions and combinations of dimensions.
For instance customer name is one dimensions, some queries are per customer name, and others are across all customers. So you can add to your fact table pre-computed facts that have pre-aggregated data across all customers, and customer name has become a dimension.
Another good candidate for a dimension is geographical location. You can add summary records that aggregate, by county, by state, and across all locations. This summarizing is done after you've done the customer name aggregation, and so it will automatically have a record for total sales for all customers in a given zip code.
Repeat for any number of other dimensions.
Now when someone comes along with a query, odds are good that their query can be rewritten to take advantage of your pre-aggregated dimensions to only look at a few pre-aggregated facts, rather than all of the individual sales records. This will greatly speed up queries.
In practice, this winds up pre-aggregating more than you really need. So people building data warehouses do clever things which let them trade off effort spent pre-aggregating combinations that nobody may want versus run-time effort of having to compute a combination on the fly that could have been done in advance.
You can start with http://en.wikipedia.org/wiki/Star_schema if you
want to dig deeper on this topic.
Fact Tables and Dimension Tables, taken together, make up a Star Schema. A Star Schema is a representation, in SQL tables, of a Multidimensional data model. A multidimensional data model stores statistics, "facts", as values in a multidimensional space, where the "location" in each dimension establishes part of the context for the fact. The multidimensional data model was developed in the context of advancing the concept of data warehousing.
Dimension tables provide a key to each dimension, and attributes relevant to that dimension.
An MDDB can be stored in a data cube specially built for that purpose instead of using an SQL (relational) database. Cognos is one vendor that has its own data cube product out there. There are some advantages to using an SQL database and a star schema, over using a special purpose data cube product. There are other advantages to using a data cube product. Sometimes the advantages to the SQL plus Star schema approach outweigh the advantages of a data cube product.
Some of the advantages obtained by normalization can be obtained by designing a Snowflake Schema instead of a Star schema. However, neither star schema nor snowflake schema are going to be free from update anomalies. They are generally used in data warehousing or reporting databases, and copying data from an operational DB into one of these databases is something of a programming challenge. There are tools sold for this purpose.
A Fact table is a table which contains measurements or metrics or facts of business processes. Example:
"monthly sales number" in the Sales business process
"monthly profit amount" in the Profit business process
Most of them are additive (sales, profit), some are semi-additive (balance as of) and some are not additive (unit price).
The level of detail in Fact table is called the "grain" of the table i.e. the granularity can be fine or coarse. Fact table also contains Foreign Keys for the dimension tables.
Whereas Dimension Tables are those tables which contain attributes that helps in describing the facts of the fact table.
The following are types of Dimension Tables:
Slowly Changing Dimensions
Junk Dimensions
Confirmed Dimensions
Degenerated Dimensions
To know more you can go through the Data Warehousing Tutorials
I am building a poor man's data warehouse using a RDBMS. I have identified the key 'attributes' to be recorded as:
sex (true/false)
demographic classification (A, B, C etc)
place of birth
date of birth
weight (recorded daily): The fact that is being recorded
My requirements are to be able to run 'OLAP' queries that allow me to:
'slice and dice'
'drill up/down' the data and
generally, be able to view the data from different perspectives
After reading up on this topic area, the general consensus seems to be that this is best implemented using dimension tables rather than normalized tables.
Assuming that this assertion is true (i.e. the solution is best implemented using fact and dimension tables), I would like to seek some help in the design of these tables.
'Natural' (or obvious) dimensions are:
Date dimension
Geographical location
Which have hierarchical attributes. However, I am struggling with how to model the following fields:
sex (true/false)
demographic classification (A, B, C etc)
The reason I am struggling with these fields is that:
They have no obvious hierarchical attributes which will aid aggregation (AFAIA) - which suggest they should be in a fact table
They are mostly static or very rarely change - which suggests they should be in a dimension table.
Maybe the heuristic I am using above is too crude?
I will give some examples on the type of analysis I would like to carryout on the data warehouse - hopefully that will clarify things further.
I would like to aggregate and analyze the data by sex and demographic classification - e.g. answer questions like:
How does male and female weights compare across different demographic classifications?
Which demographic classification (male AND female), show the most increase in weight this quarter.
etc.
Can anyone clarify whether sex and demographic classification are part of the fact table, or whether they are (as I suspect) dimension tables.?
Also assuming they are dimension tables, could someone elaborate on the table structures (i.e. the fields)?
The 'obvious' schema:
CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));
may not be the correct one.
Not sure why you feel that using RDBMS is poor man's solution, but hope this may help.
Tables dimGeography and dimDemographic are so-called mini-dimensions; they allow for slicing based on demographic and geography without having to join dimUser, and also to capture user's current demographic and geography at the time of measurement.
And by the way, when in DW world, verbose -- Gender = 'female', AgeGroup = '30-35', EducationLevel = 'university', etc.
Star schema searches are the SQL equivalent of the intersection points of Venn Diagrams. As your sample queries clearly show, SEX_TYPE and DEMOGRAPHIC_CATEGORY are sets you want to search by and hence must be dimensions.
As for the table structures, I think your design for SEX_TYPE is misguided. For starters it is easier, more intuitive, to design queries on the basis of
where sex_type.name = 'FEMALE'
than
where sex_type.is_male = 1
Besides, in the real world sex is not a boolean. Most applications should gather UNKNOWN and TRANSGENDER as well, and that's certainly true for health/medical apps which is what you seem to be doing. Furthermore, it will avoid some unpleasant office arguments if you have any female co-workers.
Edit
"I am thinking of how to deal with
cases of new sex_types and demographic
categories not already in the
database"
There was a vogue for not having foreign keys in Data Warehouses. But they provide useful metadata which a query optimizer can use to derive the most efficient search path. This is particularly important when there is a lot of data and ad hoc queries to process. Dealing with new dimension values is always going to be hard, unless your source systems provide you with notification. This really depends on your set-up.
Generally, all numeric quantities and measures are columns in the fact table(s). Then everything else is a dimensional attribute. Which dimension they belong in is rather pragmatic and depends on the data.
In addition to the suggestions you have already received, I saw no mention of degenerate dimensions. In these cases, things like an invoice number or sequence number timestamp which is different for every fact needs to be stored in the fact, otherwise the dimension table will become 1-1 with the fact table.
A key design decision in your case is probably the analysis of data related to age if the study is ongoing. Because people's ages change with time, they will move to another age group at some point. Depending on whether the groups are fixed at the beginning of a study or not, this may determine how you want to aggregate. I'm not necessarily saying you should have a group dimension and get to age through that, but that you may need to determine the correct age/demographic dimension during the ETL. But this depends on the end use (or accommodate both with two dimension roles linked from the fact table - initial demographics, which never changes, and current demographics which will change with time).
A similar thing could apply with geography. Although you can obviously track a person's geography by analysing current geography changes over time, the point of a dimensional DW is to have all the relevant dimensions linked straight to the fact (things which you might normally derive in a normalized model through the network of an Entity-Relationship model - these get locked in at the time of ETL). This redundancy makes analysis quicker on the dimensional model in traditional RDBMSes.
Note that a lot of this does not apply in massively parallel DW like Teradata, which don't perform well with star schemas - they like all the data normalized and linked up to the same primary index because they the primary index to distribute the data over the processing units.
What OLAP / presentation tier tool are you intending to use? These often have features of their own to support building of cubes, hierarchies, aggregations, etc.
Normal Form is usually the most sound basis for a flexible and efficient Data Warehouse, although Marts are sometimes denormalized to support a specific set of reporting requirements. In the absence of any other information I suggest you aim to ensure your database is in at least Boyce-Codd / 5th Normal Form.