I understand that MOLAP organizes and pre-calculates your data in a multi-dimensional structure: each dimension is an axis, and the aggregated measures are stored in each cell. ROLAP queries the data directly from the underlying relational database.
As the title describes, the user can set a dimension's storage mode to MOLAP while the cube's storage mode is ROLAP. How does SSAS handle this situation?
If the aggregation (pre-calculation) only contains dimension data and no measures from the fact table, how does it improve query performance?
If your dimension data is already in memory, including the primary keys, then you don't need to query the dimensions to determine parents and descendants, populate levels, etc.
Also, queries that retrieve only a handful of dimension keys from the fact table can be answered with a select from the fact table alone, without joining to the dimension: you filter directly on the foreign key values.
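For example, if the handful of keys has already been resolved from the in-memory dimension, a query along the lines of this sketch (table and column names are purely illustrative) never has to touch the dimension table:

```sql
-- Hypothetical star schema: FactSales keyed by dimension surrogate keys.
-- The wanted keys are known up front, so the fact table is filtered directly.
SELECT SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
WHERE f.ProductKey IN (101, 102, 103)          -- keys resolved from the cached dimension
  AND f.DateKey BETWEEN 20240101 AND 20240131; -- date range expressed as surrogate keys
```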
It won't solve every performance problem, but it will greatly simplify many common queries where you would otherwise need to join several dimensions. Even reducing the number of joins by one or two can have an impact.
I'm curious whether there are any trade-offs between creating a child table to hold a set of data, compared to just placing all the data in the main table in the first place.
My scenario is that I have data for various metrics, such as LastUpdated and AmtOfXXX. I'm curious whether it would be better to place all this data in a separate table (specifically for metrics) and reference it by foreign key, or to place these fields directly in the main table and forgo any foreign keys. Are there trade-offs? Performance considerations?
I'm referring to relational database management systems such as SQL Server; specifically, I'll be using Entity Framework Core with MS SQL Server.
Your question appears to be about the considerations between the two approaches rather than about which is specifically better. The latter is more a matter of opinion; this addresses the former.
The major advantage of a separate table that is 1-to-1 is to isolate the metrics from other information about the entities. There is a name for this type of data model: vertical partitioning (or at least that's what it was called when I first learned about it).
This has certain benefits:
The width of the data rows is smaller. So queries that only need the "real" data (or only the metrics) are faster.
The metrics are isolated. So adding new metrics does not require rewriting the "real" data.
A query such as select * on the "real" data only returns the real data.
Queries that modify only the metrics do not lock the "real" data.
There might also be an edge case where you have so many columns that they fit within the row/column limits of two tables but not one.
Of course, there is overhead:
You need a JOIN to connect the two tables. (Although with the same primary key, the join will be quite fast).
Queries that modify both the "real" data and the metrics are more complicated, having to lock both tables.
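To make the trade-offs above concrete, here is a minimal T-SQL sketch of the vertical split, using hypothetical Widget/WidgetMetrics tables that share a primary key:

```sql
-- Vertical partitioning sketch (illustrative names): the metrics live in a
-- 1-to-1 table keyed by the main table's primary key.
CREATE TABLE Widget (
    WidgetId INT           NOT NULL PRIMARY KEY,
    Name     NVARCHAR(100) NOT NULL
);

CREATE TABLE WidgetMetrics (
    WidgetId    INT            NOT NULL PRIMARY KEY
                               REFERENCES Widget (WidgetId),
    LastUpdated DATETIME2      NOT NULL,
    AmtOfXXX    DECIMAL(18, 2) NOT NULL
);

-- Only queries that need the metrics pay for the join, and the shared
-- primary key keeps that join cheap:
SELECT w.Name, m.LastUpdated, m.AmtOfXXX
FROM Widget AS w
JOIN WidgetMetrics AS m ON m.WidgetId = w.WidgetId;
```

In Entity Framework Core this shape maps naturally to a one-to-one relationship between two entities sharing a key.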
I just want to get my head around dimensional modelling in terms of creating a fact table. My understanding so far, from the Kimball book, is that a fact table is a transaction table that relates to dimension tables through foreign key (parent key) constraints. That's part 1 of my question.
Part 2: my confusion is around the fact that fact tables only store foreign keys and numeric values. What if the base transaction table stores dimensional data? What happens to those columns/attributes?
Do they get moved to whichever dimension table they relate to? How would you determine which one, if the transaction table has foreign key constraints to more than one table?
Thank you.
With Kimball dimensional models, everything that you'd want to categorize, split, filter or otherwise order your data by goes into dimensions, leaving only the numeric fields that you'd sum, average, etc. in the fact table.
Ideally, your dimensions can be reused for all fact tables across your business, giving a consistent view of all the attributes your data has available, and giving the correct results when people combine data from different fact tables through shared dimensions.
The second benefit of taking all the text/attributes out of fact tables is increased performance as the number of rows grows. This used to be a bigger concern when storage and RAM were much more expensive, and it has of course been overtaken by the whole Big Data paradigm, but it is still valid in any RDBMS.
Regarding part 2 of your question: Operational systems group their data for optimal (write) performance, which generally means storing data together in one table if it is used together in a transaction, and especially not bothering with many lookups and updates to secondary tables. The analysis/DWH side has completely different priorities.
Finally, you will end up with dimension-like attributes from the original transaction that only make sense for that one table. They can go into junk dimensions or, rarely, the fact table itself (a degenerate dimension). Both concepts are in the book.
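As a rough sketch of where things end up (dimension tables like DimDate/DimProduct/DimCustomer are assumed, and all names are illustrative), a fact table then looks something like this:

```sql
-- Kimball-style fact table sketch: surrogate keys pointing at dimensions,
-- additive numeric measures, and the source order number kept on the fact
-- row as a degenerate dimension.
CREATE TABLE FactSales (
    DateKey     INT            NOT NULL,  -- FK to DimDate
    ProductKey  INT            NOT NULL,  -- FK to DimProduct
    CustomerKey INT            NOT NULL,  -- FK to DimCustomer
    OrderNumber NVARCHAR(20)   NOT NULL,  -- degenerate dimension from the transaction
    Quantity    INT            NOT NULL,  -- additive measure
    SalesAmount DECIMAL(18, 2) NOT NULL   -- additive measure
);
```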
Hi, I have a question regarding querying a star schema in an MS SQL data warehouse.
I have a fact table and 8 dimensions, and I am confused: to get the metrics from the fact table, do we have to join all the dimensions to it, even if I am not getting data from them? Is this required to get the right metrics?
My fact table is huge, so I am asking for performance reasons and to learn the right way to query it.
Thanks!
No, you do not have to join all 8 dimensions. You only need to join the dimensions that contain data you need for analyzing the metrics in the fact table. Also, to increase performance, make sure to include only the columns from the dimension tables that are needed for the analysis; including all columns from the dimensions you join will decrease performance.
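For instance, with an (illustrative) FactSales table and eight dimensions, a query that analyzes sales by year and product category only needs two of them, and only a column or two from each:

```sql
-- Join only the dimensions used for grouping/filtering; the other six
-- dimensions are simply left out of the query.
SELECT d.CalendarYear,
       p.Category,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimDate    AS d ON d.DateKey    = f.DateKey
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
GROUP BY d.CalendarYear, p.Category;
```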
It is not necessary to include all the dimensions. Indeed, while exploring fact tables, it is very important to be able to select only some dimensions to join and drop the others. Performance issues must not be an excuse to give up this capability.
You have a number of different techniques to solve performance issues, depending on the database you are using. Some common ones:
Aggregate tables: this is one of the best ways to solve performance issues. If you have a huge fact table, you can create an aggregated version of it using only the most frequently queried columns, so it should be much smaller. Users (or the ad hoc query application) then use the aggregate table instead of the original fact table whenever possible. The good news is that most databases know how to manage aggregate tables automatically (materialized views, for example): queries that initially target the original fact table are transparently redirected to the aggregate table whenever possible (see the sketch after this list).
Indexing: bitmap indexing, for example, can be an efficient way to increase performance on a star schema fact table.
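On SQL Server, for example, the aggregate table can be implemented as an indexed view. A minimal sketch, assuming a hypothetical dbo.FactSales table with a non-nullable SalesAmount column:

```sql
-- Aggregate of the fact table, materialized as an indexed view.
CREATE VIEW dbo.vSalesByDayProduct
WITH SCHEMABINDING
AS
SELECT f.DateKey,
       f.ProductKey,
       SUM(f.SalesAmount) AS SalesAmount,
       COUNT_BIG(*)       AS RowCnt   -- required in an indexed view with GROUP BY
FROM dbo.FactSales AS f
GROUP BY f.DateKey, f.ProductKey;
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDayProduct
    ON dbo.vSalesByDayProduct (DateKey, ProductKey);
```

On Enterprise Edition the optimizer can redirect matching queries against the base fact table to the indexed view automatically; on other editions the query needs to reference the view directly (with the NOEXPAND hint) to use it.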
I'm looking to essentially have a centralized table with a number of lookup tables that surround it. The central table is going to be used to store 'Users' and the lookup tables will be user attributes, like 'Religion'. The central table will store an Id, like ReligionId, and the lookup table would contain a list of religions.
Now, I've done a lot of digging into this and I've seen many people comment saying that a UserAttribute table might be the best way to go, essentially using an EAV pattern. I'm not looking to do this. I realize that my strategy will be join-heavy and that's why I ask this question here. I'm looking for a way to optimize those joins.
If the central table references 100 lookup tables, how could it be optimized to be faster than just doing a massive 100-table inner join? Some ideas come to mind, like using many smaller joins, sub-selects and views. I'm open to anything, including a combination of these strategies. Again, just to note, I'm not looking to do anything that's EAV-related. I need the lookup tables for other reasons, and I like normalized data.
All suggestions considered!
Here's a visual look:
Edit: Is this insane?
Optimization techniques will likely depend on the size of the center table and intended query patterns. This is very similar to what you get in data warehousing star schemas, so approaches from that paradigm may help.
For one, make the size of each row as small as possible. Disk space may be cheap, but disk throughput, memory, and CPU are potential bottlenecks. You want small rows so the engine can read them quickly and cache as much as possible in memory.
A materialized/indexed view with the joins already performed allows the joins to be essentially precomputed. This may not work well if the center table is written to a lot or is very large.
Anything you can do to optimize a single join should be done for all 100. Appropriate indexes based on the selectivity of the column, etc.
Depending on what kind of queries you are performing, other techniques from data warehousing or OLAP may apply. If you are doing lots of GROUP BYs then this is likely an area to look into. Data warehousing techniques can be applied within SQL Server with no additional tooling.
Ask yourself why so many attributes are being queried and how they are being presented. For most analysis, it is not necessary to join with the lookup tables until the final step where you materialize a report, at which point you may have grouped by only a subset of columns and thus need only some of the lookup tables.
GROUP BYs should generally be able to group on the lookup IDs without needing the text/description from the lookup table, so a join is not necessary. If your lookups have other information relevant to the query at hand, then consider denormalizing it into the central table to eliminate the join, and/or make that discrete value its own lookup, essentially splitting the existing lookup ID into another ID.
You could implement a master code table that combines the code tables into a single table with a CodeType column. This is not the same as EAV, because you'd still have a column in the center table for each code type and a join for each, whereas EAV is usually used to normalize out an arbitrary number of attributes. (Note: I personally hate master code tables.)
Lastly, consider normalizing the center table if you are not doing data warehousing.
Are there lots of null values in certain lookup ID columns? Is the table sparse? This is an indication that you can pull some columns out into 1-to-1/0 relationships to reduce the size of the center table. For example, a Person table that includes address information can have a PersonAddress table pulled out of it.
Partitioning the table may improve performance if there is a large number of rows and you can determine that certain rows, perhaps those with a datetime from a couple of years in the past, would rarely be queried.
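A minimal SQL Server sketch of that idea, with hypothetical names and boundaries, partitioning an event-style table by date:

```sql
-- Date-based partitioning: older partitions are rarely touched by queries.
CREATE PARTITION FUNCTION pfByYear (DATETIME2(0))
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

CREATE PARTITION SCHEME psByYear
    AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- The clustered key must include the partitioning column.
CREATE TABLE dbo.UserEvent (
    UserEventId BIGINT       NOT NULL,
    UserId      INT          NOT NULL,
    CreatedAt   DATETIME2(0) NOT NULL,
    CONSTRAINT PK_UserEvent PRIMARY KEY CLUSTERED (UserEventId, CreatedAt)
) ON psByYear (CreatedAt);
```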
Update: See "Ask yourself why so many attributes are being queried and how they are being presented?" above. Consider that a user wants to know the number of sales grouped by year, department, and product. You should have IDs for each of these, so you can group by those IDs on the center table and, in an outer query, join the lookups only for the columns that remain. This ensures the aggregation doesn't pull in unnecessary information from lookups that aren't needed anyway.
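In SQL that pattern looks roughly like the sketch below (Sales and the lookup tables are hypothetical): the inner query aggregates on the raw IDs, and the outer query joins the lookups only to label the result.

```sql
-- Aggregate on lookup IDs first, then join lookups only for display columns.
SELECT y.YearName,
       d.DepartmentName,
       p.ProductName,
       agg.SalesCount
FROM (
    SELECT s.YearId, s.DepartmentId, s.ProductId,
           COUNT(*) AS SalesCount
    FROM dbo.Sales AS s
    GROUP BY s.YearId, s.DepartmentId, s.ProductId
) AS agg
JOIN dbo.LookupYear       AS y ON y.YearId       = agg.YearId
JOIN dbo.LookupDepartment AS d ON d.DepartmentId = agg.DepartmentId
JOIN dbo.LookupProduct    AS p ON p.ProductId    = agg.ProductId;
```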
If you aren't doing aggregations, then you probably aren't querying large numbers of records at a time, so join performance is less of a concern and should be taken care of with appropriate indexes.
If you're querying large numbers of records at a time and pulling in all of this information, I'd look hard at the business case. No one sits down at their desk, opens a report with a million rows and 100 columns in it, and does anything meaningful with all of that data that couldn't be accomplished in a better way.
The only case for such a query would be a dump of all data intended for export to another system, in which case performance shouldn't be as much of a concern, since it can be scheduled overnight.
Since you are set on your approach, you can consider duplicating data in order to join fewer times, in a similar way to what is done in an OLAP database.
http://en.wikipedia.org/wiki/OLAP_cube
With that said, I don't think this is the best way to do it if you have 100 properties.
Have you tried exporting it to Microsoft Excel Power Pivot with Power Query? You can do fast data analysis, with some pretty awesome ways to present it using Power View (video sample).
I think I know what a domain table is (it basically contains all the possible values that some other column can contain), and I've looked up dimension table in Wikipedia. Unfortunately, I'm having a hard time understanding the description they have there, because they explain it in terms of another piece of jargon: "fact table", which is explained to "consists of the measurements, metrics or facts of a business process." To me, that's very tautological, which is not helpful. Can someone explain this in plain English?
Short version:
Domains represent data you've pulled out of your fact table to make the fact table smaller.
Dimensions represent axes that you've pre-aggregated along for faster querying.
Here is a long version in plain English:
You start with some set of facts. For instance every single sale your company has received, with date, product, price, geographical location, customer name - whatever your full combination of information might be - for each sale. You can put these facts into a huge table.
A large variety of queries that you want to run are in principle some fairly simple query on your fact table. However, your fact table is freaking huge. You need to make the queries faster.
(1) The first trick to making it faster is to move data out of it so it is smaller. Take every column that is "long text", put its possible values into a domain table, and replace the original column with an id pointing into that table. This makes your fact table much smaller, and you can still get at your original data if you need it. Querying all rows becomes much faster because each row takes up less data.
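A minimal sketch of trick (1), with illustrative names: the long customer name text lives once in a domain table, and each fact row carries only a small integer id.

```sql
CREATE TABLE DomCustomer (
    CustomerId   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(200) NOT NULL
);

CREATE TABLE FactSale (
    SaleDate   DATE           NOT NULL,
    CustomerId INT            NOT NULL REFERENCES DomCustomer (CustomerId),
    LocationId INT            NOT NULL,  -- id into another domain table
    Price      DECIMAL(18, 2) NOT NULL
);
```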
That's fine if you have a small enough data set that querying the whole fact table is acceptably fast. But a lot of companies have too much data for this to be enough, so they have to be smarter.
(2) The second trick to making it faster is to pre-compute queries. Here is one way to do this: identify a set of dimensions, and then pre-compute along those dimensions and combinations of dimensions.
For instance, customer name is one dimension: some queries are per customer name, and others are across all customers. So you can add to your fact table pre-computed facts that aggregate data across all customers, and customer name has become a dimension.
Another good candidate for a dimension is geographical location. You can add summary records that aggregate, by county, by state, and across all locations. This summarizing is done after you've done the customer name aggregation, and so it will automatically have a record for total sales for all customers in a given zip code.
Repeat for any number of other dimensions.
Now when someone comes along with a query, odds are good that their query can be rewritten to take advantage of your pre-aggregated dimensions to only look at a few pre-aggregated facts, rather than all of the individual sales records. This will greatly speed up queries.
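One way to build those pre-aggregated records in a single pass is GROUPING SETS. A sketch, assuming a hypothetical FactSale table with CustomerId and StateId columns: one pass produces per-customer-per-state rows, per-state rollups, and a grand total, which can then be stored as the summary facts.

```sql
SELECT CustomerId, StateId, SUM(Price) AS TotalSales
FROM FactSale
GROUP BY GROUPING SETS ((CustomerId, StateId), (StateId), ());
```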
In practice, this winds up pre-aggregating more than you really need. So people building data warehouses do clever things that let them trade off the effort spent pre-aggregating combinations that nobody may ever want against the run-time cost of computing on the fly a combination that could have been done in advance.
You can start with http://en.wikipedia.org/wiki/Star_schema if you want to dig deeper on this topic.
Fact Tables and Dimension Tables, taken together, make up a Star Schema. A Star Schema is a representation, in SQL tables, of a Multidimensional data model. A multidimensional data model stores statistics, "facts", as values in a multidimensional space, where the "location" in each dimension establishes part of the context for the fact. The multidimensional data model was developed in the context of advancing the concept of data warehousing.
Dimension tables provide a key to each dimension, and attributes relevant to that dimension.
An MDDB can be stored in a data cube specially built for that purpose instead of using an SQL (relational) database. Cognos is one vendor that has its own data cube product out there. There are some advantages to using an SQL database and a star schema, over using a special purpose data cube product. There are other advantages to using a data cube product. Sometimes the advantages to the SQL plus Star schema approach outweigh the advantages of a data cube product.
Some of the advantages obtained by normalization can be obtained by designing a Snowflake Schema instead of a Star schema. However, neither star schema nor snowflake schema are going to be free from update anomalies. They are generally used in data warehousing or reporting databases, and copying data from an operational DB into one of these databases is something of a programming challenge. There are tools sold for this purpose.
A fact table is a table that contains the measurements, metrics, or facts of business processes. Examples:
"monthly sales number" in the Sales business process
"monthly profit amount" in the Profit business process
Most of them are additive (sales, profit), some are semi-additive (balance as of a date), and some are non-additive (unit price).
The level of detail in a fact table is called the "grain" of the table, i.e. the granularity, which can be fine or coarse. The fact table also contains foreign keys to the dimension tables.
Dimension tables, on the other hand, contain attributes that help describe the facts in the fact table.
The following are types of Dimension Tables:
Slowly Changing Dimensions
Junk Dimensions
Conformed Dimensions
Degenerate Dimensions
To learn more, you can go through the Data Warehousing Tutorials.