We are migrating from Microsoft Analysis Services (SSAS) to HP Vertica as our OLAP solution. Part of that involves changing our query language from MDX to SQL. We had a custom MDX query generator which allowed others to query for data through an API or user interface by specifying the needed dimensions and facts (outputs). The generated MDX queries were fine, and we didn't have to handle joins manually.
However, now we are using SQL, and since we keep data in separate fact tables, we need to find a way to generate those queries from the same dimension and fact primitives.
E.g. if a user wants to see a client name together with the number of sales, we might take a request:
dimensions: { 'client_name' }
facts: { 'total_number_of_sales' }
and generate a query like this:
select clients.name, sum(sales.total)
from clients
join sales on clients.id = sales.client_id
group by 1
And it gets more complicated really quickly.
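For instance, once facts come from two different fact tables, a naive three-way join would fan out and double-count both measures, so each fact has to be pre-aggregated at the shared grain. A sketch, assuming a hypothetical payments fact table alongside sales:

select c.name, s.total_sales, p.total_payments
from clients c
join (select client_id, sum(total) as total_sales
      from sales group by client_id) s on s.client_id = c.id
join (select client_id, sum(amount) as total_payments
      from payments group by client_id) p on p.client_id = c.id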
I am thinking about a graph-based solution which would store the relations between dimension and fact tables, so that I could build the required joins by finding the shortest path between the nodes in the graph.
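A minimal sketch of that idea in standard SQL, assuming a hypothetical table_relations metadata table with one row per joinable pair (the graph's edges). Support for WITH RECURSIVE varies by engine and Vertica version; the same breadth-first search is also easy to implement in application code:

with recursive join_path (start_table, end_table, path, join_clause, depth) as (
    select from_table, to_table,
           from_table || ' -> ' || to_table,
           join_condition, 1
    from table_relations
    where from_table = 'clients'
    union all
    select p.start_table, r.to_table,
           p.path || ' -> ' || r.to_table,
           p.join_clause || ' and ' || r.join_condition,
           p.depth + 1
    from join_path p
    join table_relations r on r.from_table = p.end_table
    where p.depth < 5                    -- cycle guard / search depth limit
)
select path, join_clause                 -- the join chain needed for the query
from join_path
where end_table = 'sales'
order by depth                           -- shortest path first
limit 1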
I would really appreciate any information on this subject, including any keywords I should use when searching for a solution to this type of problem, or third-party products which could solve it. I have tried searching for it, but the problems I found were quite different.
You can use the free Mondrian OLAP engine, which can execute queries written in the MDX language on top of a relational database (RDBMS).
For reporting you can try Saiku or Pentaho BI Server on top of Mondrian OLAP.
I have implemented Row-Level Security on SQL Server 2016. I think I have a fairly complex setup, but our security requirement is complex.
This is in the context of a data warehouse. I have basic fact and dimension tables. I applied row-level security to one of my dimension tables with the following setup:
Table 1 : dimDataSources (standard physical table)
Table 2 : dimDataSources_Secured (Memory Optimized table)
I created a Security Policy on dimDataSources_Secured (In-Memory) that uses a natively compiled function. That function reads another Memory Optimized table that contains lookup values and the Active Directory groups that can read each record.
The function uses is_member() to return 1 for all records that are allowed for my groups.
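A simplified, disk-based sketch of that kind of policy (the object names here are hypothetical, and the real setup uses a natively compiled function against memory-optimized tables):

-- hypothetical lookup table: dbo.DataSourceAccess(DataSourceKey int, AllowedGroup sysname)
CREATE FUNCTION dbo.fn_SecurityPredicate (@DataSourceKey int)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
    SELECT 1 AS fn_result
    FROM dbo.DataSourceAccess a
    WHERE a.DataSourceKey = @DataSourceKey
      AND is_member(a.AllowedGroup) = 1;   -- 1 when the caller is in the AD group
GO
CREATE SECURITY POLICY dbo.DataSourceFilter
ADD FILTER PREDICATE dbo.fn_SecurityPredicate(DataSourceKey)
ON dbo.dimDataSources_Secured
WITH (STATE = ON);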
So the context seems a bit complex but so far it works.
But... now I need to use this in joins with the fact tables, and we get a performance hit. Here, I am not applying row-level security directly on the fact table... only on the dimension table.
So my problem is if I run this:
SELECT SUM(Sales) FROM factSales
It returns quickly, let's say 2 seconds.
If I run the same query but with a join on the secured table (or view), it will take 5-6 times longer:
SELECT SUM(Sales) FROM factSales f
INNER JOIN dimDataSources_Secured d ON f.DataSourceKey = d.DataSourceKey
This retrieves only the source I have access to based on my AD groups.
Looking at the execution plan, it seems it retrieves the fact table data quickly, but then does a nested loop lookup against the In-Memory table to get the allowed records.
Is that behavior caused by the usage of the Filter Predicate functions?
Anyone had good or bad experiences using Row Level Security?
Is it mature enough to put in production?
Is it a good candidate for data warehousing (i.e. processing big volumes of data)?
It is hard to give more detail on my actual function and queries without writing a novel. I'm mostly looking for guidelines or alternatives.
Yes, you'll take a performance hit when using RLS. Aaron Bertrand wrote a good piece on it in March 2017. Ben Snaidero wrote a good one in 2016. Microsoft has also provided guidance on patterns for limiting the performance impact.
I've never seen RLS implemented for an OLAP schema, so I can't comment on that. Without seeing your filter predicates it's tough to say more, but that's usually where the devil is.
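One mitigation pattern along the lines of Microsoft's guidance, sketched here with hypothetical names: resolve the expensive membership check once per session into SESSION_CONTEXT, so the inlined predicate becomes a plain comparison instead of a table lookup plus is_member().

-- run once when the session starts (e.g., from the application):
EXEC sp_set_session_context @key = N'AllowedDataSourceKey', @value = 42;
GO
CREATE FUNCTION dbo.fn_SessionPredicate (@DataSourceKey int)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
    SELECT 1 AS fn_result
    WHERE @DataSourceKey = CAST(SESSION_CONTEXT(N'AllowedDataSourceKey') AS int);

As written this assumes a session maps to a single allowed key; for a set of allowed keys you would join a small lookup table keyed by the context value instead.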
I am developing SSRS reports that require a user selection (via a parameter) to retrieve either live data or historical data.
The sources for live and historical data are separate objects in a SQL Server database (views for live data; table-valued functions accepting a date parameter for historical data), but their schemas - the columns they return - are the same, so other than the dataset definition, the rest of the report doesn't need to know what its source is.
The dataset query draws from several database objects, and it contains joins and case statements in the select.
There are several approaches I can take to surfacing data from different sources based on the parameter selection I've described (some of which I've tested), listed below.
The main goal is to ensure that performance for retrieving the live data (the primary use case) is not unduly affected by the presence of logic and harnessing to support the history use case. In addition, ease of maintenance of the solution (including database objects and RDL) is a secondary, but important, factor.
1. Use an expression in the dataset query text to conditionally return the full SQL query text, with the correct sources included, using string concatenation. Pros: can resolve to a straight query that isn't polluted by the 'other' use case for any given execution; all logic for the report is housed in the report. Cons: awful to work with, and has limitations for lengthy SQL.
2. Use a function in the report's code module to do the same as 1. Pros: as per 1, but a marginally better design-time experience. Cons: as per 1, but also adds another layer of abstraction that reduces ease of maintenance.
3. Implement multi-statement TVFs on the database that process the parameter and retrieve the correct data using logic in T-SQL. Pros: flexibility of T-SQL functionality; no string building/substitution involved; can select * from the results and apply further report parameters in the report's dataset query. Cons: big performance hit compared to in-line queries; moves some logic outside the RDL.
4. Implement stored procedures to do the same as 3. Pros: as per 3, but without the ease of select *. Cons: as per 3.
5. Implement in-line TVFs that union together live and history data, using a dummy input parameter that adds something resolving to 1=0 in the where clause of the source that isn't relevant (see the sketch after this list). Pros: keeps the in-line query approach; other pros as per 3. Cons: feels like a hack, and adds a performance hit just for a query component that is known to return 0 rows; adds complexity to the query.
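A rough shape of that last option, with hypothetical object names; each branch's flag predicate resolves to an empty result for the source that isn't relevant:

CREATE FUNCTION dbo.fn_SalesData (@UseHistory bit, @AsOfDate date)
RETURNS TABLE
AS RETURN
    SELECT s.SaleID, s.Amount, s.SaleDate
    FROM dbo.vwLiveSales s
    WHERE @UseHistory = 0          -- live branch: empty when history is requested
    UNION ALL
    SELECT h.SaleID, h.Amount, h.SaleDate
    FROM dbo.fn_HistoricalSales(@AsOfDate) h
    WHERE @UseHistory = 1;         -- history branch: empty for live runs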
I am leaning towards options 3 or 4 at this point, but I am eager to hear what the preferred approach would be (even if not listed here), and why.
What's the difference between live and historical? Is "live" data data that changes, while historical data does not?
Is it not possible to replicate or push live/historical data into a Data Warehouse built specifically for reporting?
Though I have relatively good exposure to SQL Server, I am still a newbie in SSAS.
We are to create a set of reports in SSRS with an SSAS cube as the data source.
Some of the reports involve data from at least 3 or 4 tables and would also involve grouping and all the other things possible in a SQL environment (like finding the max record for a day, joining that with 4 more tables, and applying filtering logic on top of it).
So the actual question is: should I have this logic implemented in the cube, or have it processed in the SQL database (using a named query in SSAS) and store the result in the cube to be shown in the report? I understand that the latter option would involve creating more cubes depending on each report being developed.
I have been told to create cubes with the data from the transaction tables and do the entire logic creation using MDX queries (as the source in SSRS). I am not sure whether that is a viable solution.
Any help with this would be much appreciated; thanks for reading my note.
Aru
EDIT: We are using SQL Server 2012 version for our development.
OLAP cubes are great at performing aggregations of data, effectively grouping over the majority of columns all at once. You should not strive to implement all the grouping at the named query or relational views level as this will prevent you from being able to drill down through the data in the cube and will result in unnecessary overhead on the relational database when processing the cube.
I would start off by planning to pull in the most granular data from your relational database into your cube and only perform filtering or grouping in the named queries or views if data volumes or processing time are a concern. SSAS will perform some default aggregations of the data to allow for fast queries at the most grouped level.
More complex concerns such as max(someColumn) for a particular day can still be achieved in the cube by using different aggregations, but you do get into complex scenarios if you want to aggregate by one function (MAX) only to the day level and then by another function across other dimensions (e.g. summing the max of each day per country). In that case it may well be worth performing the max-per-day calculation in a named query or view and loading that into its own measure group to be aggregated by SUM after that.
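For example, a named query along these lines (table and column names here are hypothetical) could feed a separate max-per-day measure group, which the cube would then aggregate by SUM across the other dimensions:

SELECT CAST(t.TransactionDate AS date) AS DateKey,
       t.CountryKey,
       MAX(t.Amount) AS MaxDailyAmount
FROM dbo.factTransactions t
GROUP BY CAST(t.TransactionDate AS date), t.CountryKey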
It sounds like you're at the beginning of the learning path for OLAP, so I'd encourage you to look at resources from the Kimball Group (no affiliation) including, if you have time, the excellent book "The Data Warehouse Toolkit". At a minimum, please look into Dimensional Modelling techniques as your cube design will be a good deal easier if you produce a dimensional model (likely a star schema) in either views or named queries.
I would look at BISM Tabular if your model is not complicated. It compresses and stores data in memory. As for data processing, I would suggest keeping all calculations and grouping in the database layer (create views).
All the calculations and grouping should be done at the database level, at least in the form of views.
There are mainly two ways to store data (MOLAP and ROLAP). Use the MOLAP storage model to deal with tables that store transaction-type data.
The customer's expectation for transaction data (in my experience) would be to understand sales along the time dimension: something like getting total sales in the last week or last quarter, etc.
MDX scripts are basically the cube's equivalent of SQL scripts; no heavy logic should live there. Based on which parameters are chosen in the SSRS report, the MDX query should be prepared. Small analytical functions such as subtotals and averages can be done in MDX, but not complex calculations.
I was looking at Pluralsight's SSRS training and they used regular SQL to get data into the datasets. I am just learning MDX, and when I work with datasets I have so far only used MDX to get data. Should/could I mix these? Should I use SQL instead of MDX? I don't want to, now that I have started to enjoy MDX.
MDX is often used against multidimensional cubes and has some commands specifically for this purpose which SQL does not have. If your data source is a relational database rather than a cube, however, SQL is most commonly used, as far as I know.
Comparison of SQL and MDX: http://msdn.microsoft.com/en-us/library/aa216779%28v=sql.80%29.aspx
MDX language = OLAP Cubes (SSAS)
SQL language = Relational databases.
OLAP cubes are used for reporting and performance reasons. When the data or information needed involves large aggregations or calculations over large amounts of data from a relational database, an OLAP cube can sometimes better handle the demands of the data requirements. MDX is the query language used to pull data from the cube.
Here's an example to help. You need to pull some data for a report. You could use a SQL statement or a cube (MDX) for this data. You test using a SQL statement, but the query takes 5 minutes to run. With a cube, you could add the equivalent of the SQL statement into the cube design, where the equivalent of the SQL query's results becomes available almost instantly. How is this possible? Cubes are multidimensional databases full of pre-run calculations and aggregations of data. Pre-run, meaning they were run or processed at some earlier time, likely at night when everyone was home.
MDX is specific to multidimensional OLAP sources such as SSAS, whereas SQL is used by many different database products. Usually people who know MDX are already experts in, or very familiar with, SQL. I'd learn SQL first, since there are many more applications for it than for MDX.
I have access to an OLAP catalog, but I am not familiar with MDX. I am looking for the MDX equivalent of SQL:
SHOW DATABASES;
SHOW TABLES;
I was looking at the MDX language reference, but I could not find a way of getting the schema (the cube metadata). Thanks for helping.
You can use the $SYSTEM database to query your objects.
Use SELECT * FROM $SYSTEM.DISCOVER_SCHEMA_ROWSETS to get a list of things you can query. In your case the rowsets you want are most likely DBSCHEMA_CATALOGS, DBSCHEMA_TABLES and MDSCHEMA_CUBES.
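For example, using the standard SSAS dynamic management views (run one at a time in an MDX query window):

SELECT [CATALOG_NAME] FROM $SYSTEM.DBSCHEMA_CATALOGS        -- roughly SHOW DATABASES
SELECT [CUBE_NAME] FROM $SYSTEM.MDSCHEMA_CUBES              -- cubes in the current catalog
SELECT [DIMENSION_NAME] FROM $SYSTEM.MDSCHEMA_DIMENSIONS    -- roughly SHOW TABLES, for dimensions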
This is very rough information, and using the tools Preet suggests might be preferable in the end.
There is an answer, List dimension members with MDX query, that shows how to list dimensions.
This open-source project (TSSASM) shows how to query the cube structure from a T-SQL database.
However, I think you may need XMLA commands to get everything you need.