Can you create pre-aggregated Dimensions/Measurements like OLAP in BigQuery with Tableau? - google-bigquery

During the cloud migration of an on-premise Microsoft SQL Server DB, the OLAP cube that is part of it should also be replaced (but not migrated directly). There is a business requirement to keep the functionality in Tableau where you can select different measurements and dimensions with their corresponding aggregations, as is possible now when connecting to the OLAP cube in Tableau.
The underlying Data Source View includes roughly 10 tables (e.g. customer, sales, payment-method, customer-segmentation, time). Via OLAP, the analysis "give me the average sales per payment method per customer segment for every week" is a couple of clicks; in pure SQL it already takes some effort.
How can you offer predefined aggregations for some BigQuery tables without users having to write the joins and aggregations themselves, mainly because that takes much more time than simple drag & drop (SQL skills and query-execution time are not the issue)?

The answer turns out to be pretty straightforward:
Join all the source data together and write it into one flat table in BigQuery that includes the same information as the Data Source View in the OLAP cube. Tableau then connects to this table. The "measurements" logic from the cube is implemented as calculations in Tableau; the table columns are the dimensions.
Some caution is needed when replicating the measurements, because 1:n relations in the Data Source View result in multiplied rows in the flat table. This can be solved with the correct use of distinct functions (e.g. "Count (Distinct)") in the measurement definitions.
The table ends up quite large, but queries on it are very fast, resulting in a performance increase compared to the OLAP Cube with the same user experience as using a cube in Tableau.
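As a minimal sketch of this approach (dataset, table, and column names are invented for illustration, not from the original setup), the flat table can be (re)built in BigQuery with a single statement:

CREATE OR REPLACE TABLE reporting.sales_flat AS
SELECT
  s.sale_id,                 -- one row per sale line
  s.sale_date,
  c.customer_id,
  seg.segment_name,          -- customer segmentation
  pm.payment_method_name,
  s.sales_amount
FROM source.sales s
JOIN source.customer c ON c.customer_id = s.customer_id
JOIN source.customer_segmentation seg ON seg.customer_id = c.customer_id
JOIN source.payment_method pm ON pm.payment_method_id = s.payment_method_id;

In Tableau, a measurement such as "number of customers" would then be defined as COUNTD([customer_id]) rather than a plain row count, so the 1:n fan-out from the joins does not inflate the numbers.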

Related

Analysis services cube in Tableau

I'm trying to analyse data from my Analysis Services cube in Tableau. My cube schema is: [diagram not included]
Now I'd like to count the number of facts that happened in each state (states are connected through cities). But when I choose StateID from DimStates, Tableau shows that the data are incompatible. Is there any way to join them? Counting the number of facts in each city works well.
Please help
Tableau's functionality is severely limited when working with cube data sources. I can't see exactly what you're trying to do, but it looks like you're trying to add a table calculation to the measure in your view which (depending on the aggregation) isn't supported:
Cube data sources are pre-aggregated and thus do not support aggregation functions, such as SUM(), AVG(), and CNT().
It may be possible to use Table Calculations to perform aggregation operations on the cell-level results from the cube in Tableau.
https://help.tableau.com/current/pro/desktop/en-us/cubes.htm

SSAS Data Source View Relationship

I need to get all the relationships in my SSAS cube Data Source View between fact and dimension tables. I have around 15 fact tables with dimensions linked to them. Is there any MDX query to get the relationships other than doing it manually?
I suspect that you want to export a list of relationships between measure groups and dimensions as they are represented in the Dimension Usage tab of the cube designer. (The relationships in the DSV don't much matter unless SSAS needs to figure out how to join two tables in a SQL query it generates. You can have a cube with no DSV relationships at all. As long as the Dimension Usage tab has the right relationships then the cube will work.)
So please install the free BI Developer Extensions and run the Printer Friendly Dimension Usage Report. I believe it will contain the info you need.
I would recommend the above. If you want to look at the appropriate dynamic management view (DMV), run the MDSCHEMA_MEASUREGROUP_DIMENSIONS DMV. It is harder to use and interpret, but it has what you need in terms of representing the Dimension Usage tab:
SELECT * FROM $system.MDSCHEMA_MEASUREGROUP_DIMENSIONS
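The rowset supports simple WHERE restrictions, so if you only need the measure-group-to-dimension pairs for one cube, a narrower query along these lines should work (the cube name is a placeholder):

SELECT [MEASUREGROUP_NAME], [DIMENSION_UNIQUE_NAME], [DIMENSION_GRANULARITY]
FROM $system.MDSCHEMA_MEASUREGROUP_DIMENSIONS
WHERE [CUBE_NAME] = 'MyCube'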

COLUMN STORE INDEX vs CLUSTERED INDEX: which one to use?

I'm trying to evaluate which type of indexes to use on our tables in the SQL Server 2014 data mart, which we are using to power our OLAP cube in SSAS. I have read the documentation on MSDN and am still a bit unclear which is the right strategy for our use case, with the ultimate goal of speeding up the queries that the cube issues against SQL Server when people browse the cube.
The tables are related to each other as shown in the following snowflake dimensional model: [diagram not included]. The majority of the calculations we are going to do in the cube are COUNT DISTINCT of the users (UserInfoKey) based on different combinations of dimensions (both filters and pivots). Keeping that in mind, what would the SQL experts suggest I do in terms of creating indexes on the tables? I have the option of creating columnstore indexes on all my tables (partitioned by the hash of the primary keys) or creating the regular primary keys (clustered indexes) on all my tables. Which one is better for my scenario? From my understanding, the cube will be doing a lot of joins and GROUP BYs under the covers based on the dimensions selected by the user.
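For reference, the two options under discussion would be created roughly like this in T-SQL (a sketch only; the key column name is assumed, and in SQL Server 2014 a clustered columnstore index replaces the rowstore clustered index entirely, so the two are alternatives):

-- Option 1: clustered columnstore index on the fact table
CREATE CLUSTERED COLUMNSTORE INDEX CCI_ActivityData ON dbo.ActivityData;

-- Option 2: conventional rowstore clustered primary key
ALTER TABLE dbo.ActivityData
ADD CONSTRAINT PK_ActivityData PRIMARY KEY CLUSTERED (ActivityDataKey);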
I tried both versions with some sample data and the performance isn't that different in either case. Now, before I do the same experiment with real data (it's going to take a lot of time to produce the real data and load it into our data mart), I wanted to check with the experts about their suggestions.
We are also evaluating whether we should use PDW (Parallel Data Warehouse) as our data mart instead of vanilla SQL Server 2014.
To give an idea of the scale of data we are dealing with, the two largest tables are:
ActivityData fact table: 784+ million rows
DimUserInfo dimension table: 30+ million rows
Any help or pointers are appreciated

SSAS Environment or CUBE creation methodology

Though I have relatively good exposure to SQL Server, I am still a newbie in SSAS.
We are to create a set of reports in SSRS with an SSAS cube as the data source.
Some of the reports involve data from at least 3 or 4 tables and would also involve grouping and all the usual things from a SQL environment (like finding the max record for a day, joining that with 4 more tables, and applying filtering logic on top of it).
So the actual question is: should I have this logic implemented in the cubes, or have it processed in the SQL database (using a named query in SSAS) and store the result in the cube to be shown in the report? I understand that the latter option would involve creating more cubes depending on each report being developed.
I have been told to create cubes with the data from the transaction tables and implement the entire logic using MDX queries (as the source in SSRS). I am not sure if that is a viable solution.
Any help in this would be much appreciated; Thanks for reading my note.
Aru
EDIT: We are using SQL Server 2012 for our development.
OLAP cubes are great at performing aggregations of data, effectively grouping over the majority of columns all at once. You should not strive to implement all the grouping at the named query or relational views level as this will prevent you from being able to drill down through the data in the cube and will result in unnecessary overhead on the relational database when processing the cube.
I would start off by planning to pull in the most granular data from your relational database into your cube and only perform filtering or grouping in the named queries or views if data volumes or processing time are a concern. SSAS will perform some default aggregations of the data to allow for fast queries at the most grouped level.
More complex concerns such as max(someColumn) for a particular day can still be achieved in the cube by using different aggregations, but you do get into complex scenarios if you want to aggregate by one function (MAX) only to the day level and then by another function across other dimensions (e.g. summing the max of each day per country). In that case it may well be worth performing the max-per-day calculation in a named query or view and loading that into its own measure group to be aggregated by SUM after that.
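A named query or view for that intermediate step might look something like this (all table and column names here are illustrative, not from the question):

SELECT
    CAST(t.TransactionTime AS date) AS TransactionDate,
    t.CountryKey,
    MAX(t.SomeColumn) AS MaxPerDay   -- MAX resolved at the day level
FROM dbo.FactTransactions t
GROUP BY CAST(t.TransactionTime AS date), t.CountryKey;

The resulting measure group can then use SUM on MaxPerDay to roll the daily maxima up across other dimensions such as country.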
It sounds like you're at the beginning of the learning path for OLAP, so I'd encourage you to look at resources from the Kimball Group (no affiliation) including, if you have time, the excellent book "The Data Warehouse Toolkit". At a minimum, please look into Dimensional Modelling techniques as your cube design will be a good deal easier if you produce a dimensional model (likely a star schema) in either views or named queries.
I would look at BISM Tabular if your model is not complicated. It compresses and stores data in memory. As for data processing, I would suggest keeping all calculations and grouping in the database layer (create views).
All the calculations and grouping should be done at the database level, at least in the form of views.
There are mainly two ways to store data (MOLAP and ROLAP). Use the MOLAP storage mode for tables that store transaction-type data.
The customer's expectation for transaction data (in my experience) would be to understand sales along the time dimension, something like "get total sales for the last week or last quarter", etc.
MDX scripts are basically the cube's equivalent of SQL scripts; no heavy business logic should live there. Based on the parameters chosen in the SSRS report, the MDX query should be prepared. Small analytical functions such as subtotals and averages can be done in MDX, but not complex calculations.
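As an illustration, a parameter-driven MDX query behind an SSRS report might look like this (cube, hierarchy, and measure names are placeholders; STRTOMEMBER turns the SSRS parameter value into a cube member):

SELECT
    { [Measures].[Sales Amount] } ON COLUMNS,
    [Date].[Calendar].[Calendar Quarter].Members ON ROWS
FROM [SalesCube]
WHERE ( STRTOMEMBER(@RegionParam, CONSTRAINED) )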

Are dimension tables and domain tables the same thing?

I think I know what a domain table is (it basically contains all the possible values that some other column can contain), and I've looked up dimension table on Wikipedia. Unfortunately, I'm having a hard time understanding the description there, because it is explained in terms of another piece of jargon: "fact table", which is said to "consist of the measurements, metrics or facts of a business process." To me, that's very tautological, which is not helpful. Can someone explain this in plain English?
Short version:
Domains represent data you've pulled out of your fact table to make the fact table smaller.
Dimensions represent axes that you've pre-aggregated along for faster querying.
Here is a long version in plain English:
You start with some set of facts. For instance every single sale your company has received, with date, product, price, geographical location, customer name - whatever your full combination of information might be - for each sale. You can put these facts into a huge table.
A large variety of the queries you want to run are, in principle, fairly simple queries on your fact table. However, your fact table is freaking huge. You need to make the queries faster.
(1) The first trick to making it faster is to move data out of it so it is smaller. So you can take every column that is "long text", put its possible values into a domain table, and replace the original column with an id into that table. This will make your fact table much smaller, and you can still get at your original data if you need it. This makes it much faster to query all rows since they take up less data.
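A sketch of that trick in SQL (all names hypothetical): the long-text column is moved into a domain table and the fact table keeps only a small integer id:

-- Domain table: the distinct long-text values, stored once
CREATE TABLE payment_method_domain (
    payment_method_id   INT PRIMARY KEY,
    payment_method_name VARCHAR(100)
);

-- Fact table: stores the compact id instead of the repeated text
CREATE TABLE sales_fact (
    sale_id           BIGINT,
    sale_date         DATE,
    payment_method_id INT REFERENCES payment_method_domain (payment_method_id),
    amount            DECIMAL(12, 2)
);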
That's fine if you have a small enough data set that querying the whole fact table is acceptably fast. But a lot of companies have too much data for this to be enough, so they have to be smarter.
(2) The second trick to making it faster is to pre-compute queries. Here is one way to do this. Identify a set of dimensions, and then pre-compute along dimensions and combinations of dimensions.
For instance, customer name is one dimension: some queries are per customer name, and others are across all customers. So you can add to your fact table pre-computed facts that pre-aggregate data across all customers, and customer name has become a dimension.
Another good candidate for a dimension is geographical location. You can add summary records that aggregate by zip code, by county, by state, and across all locations. This summarizing is done after the customer-name aggregation, and so it will automatically include a record for total sales for all customers in a given zip code.
Repeat for any number of other dimensions.
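As a concrete (made-up) illustration, one such pre-computed slice could simply be a summary table built once and queried many times:

-- Pre-aggregated along the geography dimension, across all customers
CREATE TABLE sales_by_state AS
SELECT state, SUM(amount) AS total_sales
FROM sales_fact
GROUP BY state;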
Now when someone comes along with a query, odds are good that their query can be rewritten to take advantage of your pre-aggregated dimensions to only look at a few pre-aggregated facts, rather than all of the individual sales records. This will greatly speed up queries.
In practice, this winds up pre-aggregating more than you really need. So people building data warehouses do clever things which let them trade off effort spent pre-aggregating combinations that nobody may want versus run-time effort of having to compute a combination on the fly that could have been done in advance.
You can start with http://en.wikipedia.org/wiki/Star_schema if you want to dig deeper on this topic.
Fact Tables and Dimension Tables, taken together, make up a Star Schema. A Star Schema is a representation, in SQL tables, of a Multidimensional data model. A multidimensional data model stores statistics, "facts", as values in a multidimensional space, where the "location" in each dimension establishes part of the context for the fact. The multidimensional data model was developed in the context of advancing the concept of data warehousing.
Dimension tables provide a key to each dimension, and attributes relevant to that dimension.
An MDDB (multidimensional database) can be stored in a data cube specially built for that purpose instead of in an SQL (relational) database. Cognos is one vendor with its own data cube product. There are some advantages to using an SQL database and a star schema over a special-purpose data cube product, and other advantages to using a data cube product; sometimes the former outweigh the latter.
Some of the advantages obtained by normalization can be obtained by designing a Snowflake Schema instead of a Star schema. However, neither star schema nor snowflake schema are going to be free from update anomalies. They are generally used in data warehousing or reporting databases, and copying data from an operational DB into one of these databases is something of a programming challenge. There are tools sold for this purpose.
A Fact table is a table which contains measurements or metrics or facts of business processes. Example:
"monthly sales number" in the Sales business process
"monthly profit amount" in the Profit business process
Most measures are additive (sales, profit), some are semi-additive (balance as of a date), and some are not additive (unit price).
The level of detail in a fact table is called the "grain" of the table, i.e. the granularity can be fine or coarse. A fact table also contains foreign keys to the dimension tables.
Dimension tables, on the other hand, contain attributes that help describe the facts in the fact table.
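Putting the two definitions together in hypothetical DDL, with the grain spelled out as a comment:

CREATE TABLE dim_date (
    date_key      INT PRIMARY KEY,   -- surrogate key for the dimension
    calendar_date DATE,
    month_name    VARCHAR(20),
    year_number   INT
);

CREATE TABLE fact_sales (
    -- grain: one row per product per day (for example)
    date_key     INT REFERENCES dim_date (date_key),  -- FK to the dimension
    product_key  INT,                                 -- FK to a product dimension
    sales_amount DECIMAL(12, 2)                       -- additive measure
);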
The following are types of Dimension Tables:
Slowly Changing Dimensions
Junk Dimensions
Conformed Dimensions
Degenerate Dimensions
To learn more, you can go through data warehousing tutorials.