SSAS Environment or CUBE creation methodology

Though I have relatively good exposure to SQL Server, I am still a newbie in SSAS.
We are to create a set of reports in SSRS with an SSAS cube as the data source.
Some of the reports involve data from at least 3 or 4 tables, and would also involve grouping and all the usual things from a SQL environment (like finding the max record for a day, joining that with 4 more tables, and applying filtering logic on top of it).
So the actual question is: should I have this logic implemented in the cubes, or have it processed in the SQL database (using a named query in SSAS) and store the result in the cube to be shown in the report? I understand that the latter option would involve creating more cubes, one for each report being developed.
I have been told to create cubes from the data in the transaction tables and to do all the logic in MDX queries (as the source in SSRS). I am not sure that is a viable solution.
Any help with this would be much appreciated; thanks for reading my note.
Aru
EDIT: We are using SQL Server 2012 for our development.

OLAP cubes are great at performing aggregations of data, effectively grouping over the majority of columns all at once. You should not strive to implement all the grouping at the named query or relational views level as this will prevent you from being able to drill down through the data in the cube and will result in unnecessary overhead on the relational database when processing the cube.
I would start off by planning to pull in the most granular data from your relational database into your cube and only perform filtering or grouping in the named queries or views if data volumes or processing time are a concern. SSAS will perform some default aggregations of the data to allow for fast queries at the most grouped level.
More complex concerns such as max(someColumn) for a particular day can still be achieved in the cube by using different aggregations, but you do get into complex scenarios if you want to aggregate by one function (MAX) only to the day level and then by another function across other dimensions (e.g. summing the max of each day per country). In that case it may well be worth performing the max-per-day calculation in a named query or view and loading that into its own measure group to be aggregated by SUM after that.
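As a minimal sketch of that named-query/view approach (all table and column names here are invented for illustration, not from the question):

    -- Illustrative T-SQL only: pre-compute the max reading per day in a view,
    -- load it into its own measure group, and let the cube aggregate it by SUM.
    -- dbo.SensorReadings, ReadingValue and ReadingTime are assumed names.
    CREATE VIEW dbo.vwMaxReadingPerDay
    AS
    SELECT
        CAST(r.ReadingTime AS date) AS ReadingDate,   -- day grain of the measure group
        r.SensorKey,
        MAX(r.ReadingValue)         AS MaxReadingOfDay
    FROM dbo.SensorReadings AS r
    GROUP BY CAST(r.ReadingTime AS date), r.SensorKey;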
It sounds like you're at the beginning of the learning path for OLAP, so I'd encourage you to look at resources from the Kimball Group (no affiliation) including, if you have time, the excellent book "The Data Warehouse Toolkit". At a minimum, please look into Dimensional Modelling techniques as your cube design will be a good deal easier if you produce a dimensional model (likely a star schema) in either views or named queries.

I would look at BISM Tabular if your model is not complicated. It compresses and stores data in memory. As for data processing, I would suggest keeping all calculations and grouping in the database layer (create views).
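A minimal sketch of such a database-layer view (every object name below is hypothetical):

    -- Hypothetical example of keeping grouping/calculations in the database
    -- layer, so the model (Tabular or Multidimensional) loads pre-shaped data.
    CREATE VIEW dbo.vwDailySales
    AS
    SELECT
        CAST(s.OrderDate AS date) AS OrderDate,
        s.ProductKey,
        SUM(s.SalesAmount)        AS TotalSales,
        COUNT(*)                  AS OrderLineCount
    FROM dbo.FactSales AS s
    GROUP BY CAST(s.OrderDate AS date), s.ProductKey;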

All the calculations and grouping should be done at the database level, at least in the form of views.
There are mainly two ways to store data (MOLAP and ROLAP). Use the MOLAP storage mode for tables that store transaction-type data.
The customer's expectation of transaction data (in my experience) would be to understand the sales along the time dimension: something like getting the total sales for the last week or the last quarter, etc.
MDX scripts are essentially the cube's counterpart to SQL scripts; no heavy logic should live there. Based on which parameters are chosen in the SSRS report, the MDX query should be prepared. Small analytical functions such as subtotals and averages can be done in MDX, but not complex calculations.
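A minimal sketch of such a parameterized SSRS dataset query (the cube, hierarchy, and parameter names are all hypothetical):

    // Hypothetical names throughout; @TimePeriod would be an SSRS report
    // parameter whose value is a member unique name on the date hierarchy.
    SELECT
        { [Measures].[Total Sales] } ON COLUMNS,
        NON EMPTY { [Product].[Category].Members } ON ROWS
    FROM [SalesCube]
    WHERE ( STRTOMEMBER(@TimePeriod) )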


Can you create preaggregated Dimensions/Measurements like OLAP in BigQuery with Tableau?

During the cloud migration of an on-premise Microsoft SQL DB, the OLAP cube that is part of it should also be replaced (but not migrated directly). There is a business requirement to keep the functionality in Tableau of selecting different measures and dimensions with their corresponding aggregations, as is possible now when connecting to the OLAP cube in Tableau.
The underlying data source view includes ca. 10 tables (e.g. customer, sales, payment-method, customer-segmentation, time). So via OLAP, the analysis "give me the average sales per payment method per customer segment for every week" is a couple of clicks; in pure SQL it is already some effort.
How can you offer predefined aggregations on some BigQuery tables without the user having to write the joins and aggregations themselves, mainly because that takes much more time than simple drag & drop (SQL skills and query-execution time are not the issue)?
The answer turns out to be pretty straightforward:
Join all source data together and write it into one flat table in BigQuery that includes the same information as the data source view in the OLAP cube. Tableau then connects to this table. The "measures" logic from the cube is implemented as calculations in Tableau; the table columns are the dimensions.
Some caution needs to be applied when replicating the measures, because 1:n relations in the data source view result in multiplied rows in the flat table. This can be solved with the correct use of distinct functions (e.g. "Distinct Count") in the measure definitions.
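A rough sketch of that flattening step in BigQuery Standard SQL (every dataset, table, and column name here is invented):

    -- Denormalize the source tables into one flat table for Tableau.
    -- All names below are illustrative, not from the original question.
    CREATE OR REPLACE TABLE analytics.flat_sales AS
    SELECT
        s.sale_id,
        s.sale_date,
        s.amount,
        c.customer_id,
        seg.segment_name,
        pm.payment_method_name
    FROM source.sales AS s
    JOIN source.customer AS c
        ON c.customer_id = s.customer_id
    JOIN source.customer_segmentation AS seg
        ON seg.customer_id = c.customer_id
    JOIN source.payment_method AS pm
        ON pm.payment_method_id = s.payment_method_id;
    -- In Tableau, measures such as COUNTD(customer_id) guard against rows
    -- multiplied by the 1:n joins being double-counted.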
The table will end up quite large, but the queries on it are very fast, resulting in a performance increase compared to the OLAP Cube with the same user experience as using a cube in Tableau.

Analysis services cube in Tableau

I'm trying to analyse data from my Analysis Services cube in Tableau. My cube looks like this (schema screenshot omitted; the fact table joins to a city dimension, which in turn joins to DimStates).
Now I'd like to count the number of facts that happened in each state (state is connected through city). But when I choose StateID from DimStates, Tableau shows that the data are incompatible. Is there any way to join them? Counting the number of facts in each city works well.
Please help
Tableau's functionality is severely limited when working with cube data sources. I can't see exactly what you're trying to do, but it looks like you're trying to add a table calculation to the measure in your view which (depending on the aggregation) isn't supported:
Cube data sources are pre-aggregated and thus do not support aggregation functions, such as SUM(), AVG(), and CNT().
It may be possible to use Table Calculations to perform aggregation operations on the cell-level results from the cube in Tableau.
https://help.tableau.com/current/pro/desktop/en-us/cubes.htm

COLUMN STORE INDEX vs CLUSTERED INDEX: which one to use?

I'm trying to evaluate which type of index to use on the tables in our SQL Server 2014 data mart, which we are using to power our OLAP cube in SSAS. I have read the documentation on MSDN and am still a bit unclear which strategy is right for our use case, with the ultimate goal of speeding up the queries the cube issues against SQL Server when people browse the cube.
The tables are related to each other as shown in the following snowflake dimensional model (schema diagram omitted). The majority of the calculations we are going to do in the cube are COUNT DISTINCT of users (UserInfoKey) based on different combinations of dimensions (both filters and pivots). Keeping that in mind, what would the SQL experts suggest I do in terms of creating indexes on the tables? I have the option of creating COLUMNSTORE INDEXES on all my tables (partitioned by the hash of the primary keys) or creating the regular primary keys (clustered indexes) on all my tables. Which one is better for my scenario? From my understanding, the cube will be doing a lot of joins and GROUP BYs under the covers based on the dimensions selected by the user.
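For concreteness, the two alternatives would look roughly like this in T-SQL (the key column name is an assumption, and a table cannot have both at once):

    -- Option 1: clustered columnstore index (available from SQL Server 2014),
    -- suited to large scans, joins and GROUP BYs across the whole fact table.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_ActivityData
        ON dbo.ActivityData;

    -- Option 2: a conventional rowstore clustered primary key, suited to
    -- selective seeks on the key. ActivityDataKey is a placeholder name.
    ALTER TABLE dbo.ActivityData
        ADD CONSTRAINT PK_ActivityData PRIMARY KEY CLUSTERED (ActivityDataKey);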
I tried both versions with some sample data and the performance isn't that different in either case. Now, before I do the same experiment with real data (it's going to take a lot of time to produce the real data and load it into our data mart), I wanted to check with the experts about their suggestions.
We are also evaluating whether we should use PDW (Parallel Data Warehouse) as our data mart instead of vanilla SQL Server 2014.
Just to give an idea of the scale of data we are dealing with, the 2 largest tables are:
- ActivityData fact table: 784+ million rows
- DimUserInfo dimension table: 30+ million rows
Any help or pointers are appreciated

Why use the SQL language instead of the MDX language in SSRS?

I was looking at Pluralsight's SSRS training and they used regular SQL to get data into the datasets. I am just learning MDX, and when I work with datasets I have so far only used MDX to get data. Should/could I mix these? Should I use SQL instead of MDX? I don't want to, now that I've started to enjoy MDX.
MDX is used against multidimensional cubes and has some commands specifically for this purpose which SQL does not have. If your data source is a database and not a cube, however, SQL is most commonly used, as far as I know.
Comparison of SQL and MDX: http://msdn.microsoft.com/en-us/library/aa216779%28v=sql.80%29.aspx
MDX language = OLAP Cubes (SSAS)
SQL language = Relational databases.
OLAP cubes are used for reporting and performance reasons. When the data or information needed involves large aggregations or calculations over large amounts of data from a relational database, an OLAP cube can sometimes better handle the demands of the data requirements. MDX is the query language used to pull data from the cube.
Here's an example to help. You need to pull some data for a report. You could use a SQL statement or a cube (MDX) for this data. You test using a SQL statement, but the query takes 5 minutes to run. With a cube, you could add the equivalent of the SQL statement into the cube design, where the equivalent of the SQL query's results is available almost instantly. How is this possible? Cubes are multidimensional databases full of pre-computed calculations and aggregations of data. Pre-computed, meaning they were run or processed at some earlier time, likely at night when everyone was home.
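To make that concrete, here is an illustrative pair (every object name is invented): the same yearly total computed live in T-SQL versus read from an already-processed cube in MDX.

    -- T-SQL against the relational database: the aggregation is computed
    -- at query time, every time the report runs.
    SELECT d.CalendarYear, SUM(f.SalesAmount) AS TotalSales
    FROM dbo.FactSales AS f
    JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear;

    -- Equivalent MDX against the cube: the totals were pre-computed when
    -- the cube was processed, so the query returns almost immediately.
    SELECT { [Measures].[Sales Amount] } ON COLUMNS,
           { [Date].[Calendar Year].Members } ON ROWS
    FROM [SalesCube]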
MDX is specific to multidimensional (OLAP) sources such as SSAS cubes, whereas SQL is used by many different database programs. Usually people who know MDX are already experts in, or at least very familiar with, SQL. I'd learn SQL first, since there are many more applications for it than for MDX.

To aggregate or not to aggregate, that is the database schema design question

If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?
This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
I have done both and am torn. On one hand, aggregation tables have given me significantly faster queries, but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping entirely back to the raw data table or combining more fine-grained aggregations. I have found that keeping track in the application code of which aggregation table to query when is more work than you'd think, and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
On the other hand, querying the raw data can be punishingly slow, but it lets me be very flexible about the data ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise, the application code requires fewer updates. I suspect that if I were smarter about my indexing (i.e. always having good covering indexes), I would be able to reduce the penalty of selecting from the raw data, but that's by no means a panacea.
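As an illustration of that covering-index idea (SQL sketch; the table and column names are made up to match the schema described above):

    -- A covering index over (user_id, used_at, amount) lets range
    -- aggregations be answered entirely from the index.
    CREATE INDEX ix_usage_user_date_amount
        ON usage_raw (user_id, used_at, amount);

    -- This query never has to touch the base rows:
    SELECT MIN(amount), MAX(amount), AVG(amount)
    FROM usage_raw
    WHERE user_id = 42
      AND used_at >= '2010-01-01' AND used_at < '2010-04-01';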
Is there any way I can have the best of both worlds?
We had that same problem and ran into the same issues you did. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis Services itself, but it's been great. Some of the benefits we have found are:
- You have a lot of flexibility for querying any way you want. Before, we had to build specific aggregates, but now one cube answers all our questions.
- Storage in a cube is far smaller than the detailed data.
- Building and processing the cubes takes less time and produces less load on the database servers than the aggregates did.
Some CONS:
- There is a learning curve around building cubes and learning MDX.
- We had to create some tools to automate working with the cubes.
UPDATE:
Since you're using MySQL, you could take a look at Pentaho Mondrian, which is an open-source OLAP solution that supports MySQL. I've never used it, though, so I don't know whether it will work for you. I'd be interested to know if it does.
It helps to pick a good primary key (i.e. [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range condition on used_date.
But as the table grows, you can reduce the table size by aggregating into a table like [user_id, used_date]. For every range where the time of day doesn't matter, you can then use that table. Another way to reduce the table size is to archive old data that you no longer allow to be queried.
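A minimal sketch of such a rollup table (SQL; the raw-table and column names are assumptions):

    -- Daily rollup keyed on (user_id, used_date); storing SUM and COUNT
    -- means AVG can still be recombined correctly across date ranges,
    -- and MIN/MAX combine across days as well.
    CREATE TABLE usage_daily (
        user_id    INT           NOT NULL,
        used_date  DATE          NOT NULL,
        total_amt  DECIMAL(12,2) NOT NULL,
        min_amt    DECIMAL(12,2) NOT NULL,
        max_amt    DECIMAL(12,2) NOT NULL,
        row_count  INT           NOT NULL,
        PRIMARY KEY (user_id, used_date)
    );

    -- Rebuild (or periodically refresh) from the raw table:
    INSERT INTO usage_daily
    SELECT user_id, DATE(used_at), SUM(amount), MIN(amount), MAX(amount), COUNT(*)
    FROM usage_raw
    GROUP BY user_id, DATE(used_at);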
I always lean towards raw data. Once aggregated, you can't go back.
It's nothing to do with deletion - unless it's the simplest of aggregated data sets, you can't accurately revert/transpose the data back to raw.
Ideally, I'd use a materialized view (assuming the data can fit within the constraints) because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with the computed columns, or a trigger to update an actual table.
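A sketch of that trigger approach (MySQL syntax; it reuses the hypothetical usage_raw/usage_daily tables from the rollup sketch above):

    -- Keep the aggregate table current as raw rows arrive. Note that
    -- min/max cannot be maintained this way if rows are ever deleted.
    CREATE TRIGGER trg_usage_rollup
    AFTER INSERT ON usage_raw
    FOR EACH ROW
        INSERT INTO usage_daily
            (user_id, used_date, total_amt, min_amt, max_amt, row_count)
        VALUES
            (NEW.user_id, DATE(NEW.used_at), NEW.amount, NEW.amount, NEW.amount, 1)
        ON DUPLICATE KEY UPDATE
            total_amt = total_amt + NEW.amount,
            min_amt   = LEAST(min_amt, NEW.amount),
            max_amt   = GREATEST(max_amt, NEW.amount),
            row_count = row_count + 1;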
This is an old question, but for anyone reading now, I found this useful; it was answered by a MicroStrategy engineer.
BTW, there are already existing solutions (like cube.dev or Dremio), so you don't have to build this yourself.