Performance Implications of Calculated Members in SSAS - ssas

I'm wondering if there are any performance implications from adding a lot of calculated members to my cube. On one hand, it's nice to have things defined once, located centrally, tested, and available for use from any client which doesn't support MDX. On the other hand, some of these members I'm adding might not be used very frequently, so I could just inline them in the one or two reports that might need them.
Aside from the clutter of having unnecessary members hanging around, should I keep the number of calculated members as small as possible? Will more increase cube processing time? Will they slow down queries which don't use those calculated members?

Calculated members have little to no effect on processing nor on other queries. Add as many as you'd like!
The reason is that they're just defined on the cube, but actually evaluated at runtime. Therefore, the only queries that will be slowed or affected by them are queries that use them. Expect them to return a bit slower than native members for this reason, also.
Look for every opportunity to make the calculated member an actual part of your cube if it's used very frequently. Also, learn and love the scope statement. While a calculated member that's scoped is still calculated at runtime, the scope statement provides it a ready-made execution plan, so it tends to be faster. I will often create a member in the DSV and then scope it for my high-volume calculated members.

Anytime you can push the calculations back into the relational model, It will increase MDX query performance; but also have a negative impact on processing performance.
If you can pre-calculate some measures using in-row sql logic, then expose these as measures in your data source view. The storage engine can build aggregates and the formula engine will have less work to do. You are basically pushing the the heavy lifting down to sql. This works really well for static calculation and conversion factors and things like simple arithmetic etc.
Another thing you can do is create any intermediate calculated members that shouldn't be used by end user as hidden, This won't have any affect on performance; but will de-clutter the cube from the end users perspective.

Related

Best way to report comparisons of one agency to the rest of the state/nation

When attempting to do some benchmarking type reports, I run into the issue of extreme slowness due to the amount of data residing in the database, and this will get incrementally worse. I'm curious of what would be considered the best approach for reports that show for example a percentage of patients entering the hospital within a certain date range that were there due to a specific condition, as well as how that particular hospital compares to the state percentage and also the national percentage. Of course this is all based on the hospitals whose data resides in the database. I have just been writing stored procedures to calculate these percentages, but I know this isn't the best approach. I'm curious how other more experienced reporting professionals would tackle this. I'm currently using SSRS for reporting. I know a little about SSAS, but not enough to know if I should consider it for this type of reporting.
This all depends on the data-structure and the kind of calculations you have to do.
You try to narrow down the amount of data you have to process and the complexity of operations in every possible way
If you have lots of data on a slow system you first try to select the needed data, transfer it to the calculation point and then keep it cached as long as you can.
If you have huge amounts of data you try to preprocess it as much as you can. E.g. for datawarehouses you have a datetime-table with year/month/day/day-of-week/week-of-year etc in it and just constraints to them in the other tables. Like this you can avoid timeconsuming calculations.
If the operations are complex you have to analyze them to make them simpler/faster but on this point it is impossible to predict how much (if at all) there is some room.
It all depends on your understanding of the data-structure and processes you need them for, in order to improve everything as much as you can.
I myself haven't worked with SSAS yet but this is also a great tool but (imho) more for lots of different analysis.

Thoughts on dimension measures for BI

I am working with a consultant who recommends creating a measure dimension and then adding the measure dimension key to our fact table.
I can see how this can make adding new measures easier by just adding rows instead of physically creating columns in the fact table. I can also see how this can add work to the ETL process, adds another join to the star schema, one generic column in fact table to hold all measure data etc.
I'm interested in how others have dealt with this situation. We currently have close to twenty measures.
Instinctively, I don't like it: it's the EAV model, which is not very popular (you can Google the reasons why).
The EAV model is generally considered to be a headache to query and maintain
Different measures go together with different dimensions; this approach could easily turn into "one giant fact table for everything" instead of multiple smaller fact tables for specific reporting areas
I suspect you would end up creating views to give the appearance of multiple fact tables anyway
You will multiply the number of rows in your fact table by the number of measures, resulting in a much bigger physical table
Even with a good indexing/partitioning scheme, queries that include more than one measure will have to read a lot more rows to get the data
What about measures with different data types?
Is this easily supported in your reporting tool?
I'm sure there are other issues, but those are the ones that come to mind immediately. As a rule of thumb, if someone suggests an EAV implementation in any context, you should be very wary and ask them exactly what advantages it offers and how it will be managed as the data and complexity increase. But I think you've already identified some key areas of concern.
SSAS will do this, and I know of a major vendor of insurance policy administration software that provided a M.I. solution for their system that works like this. You do get some flexibility from the approach in that you can add measures without having to deploy a build of the cube, although for 20 measures I don't think you need to worry about that.
'Measures' is essentially another dimension (and often referred to as such in the documentation). I believe SSAS uses a largely column-oriented structure behind the scenes.
However, a naive application of this approach does have some issues that could come and bite you to a greater or lesser extent.
You only have one measure, [Value], [Amount] or whatever it's called. If your tool won't let you inject calculated measures at the front-end then you can't sort the whole data set on the value of one of your attribute types. ProClarity and report builder >=2.0 will do this but Excel won't.
You can't do ratios or other calculated measures in this way. You will have to either embed them in the cube script (meaning you need to deploy a build to add them) or use a tool that lets you define them in the client.
Although it doesn't make a lot of differece to the cube it will be slow to query on the database and increase storage requirements. It's also fiddly to query on the database.

Is OLAP/MDX a good way to process data w/ unknown values at various aggregation levels

I'm new to OLAP, so perhaps I don't know the right terminology to use for this question, but bear with me here.
I work with lots of hierarchical, multidimensional data where parent/aggregated cells mostly have data, but child/leaf cells are often missing data (attribute values are unknown but non-zero). I currently use a combination of scripting and SQL to work with it, but that's getting unwieldy. It seems like OLAP cubes and MDX are better suited to the structure of the data, but not necessarily to tasks I need to do with it. For example:
OLAP seems mainly designed for read-only reporting; I do a lot of modifications to the data in batch processes
OLAP seems to like having complete leaf-level data to calculate aggregates; my data has missing values at various levels
Examples of what I want to do:
Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
Queries and calculations need to be able to handle the unknowns properly. Ideally be able to easily query how much of an aggregated cell's value is made up of estimated vs. known values, possibly compute confidence/error statistics, or check whether we can derive an exact value for an unknown when it has a known parent and all known siblings, etc.
Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
Could an OLAP server and MDX be a good tool for this type of work? Are there any other tools that would work well for manipulating hierarchical/multidimensional/gap-filled data?
That's some needs for an OLAP system, interesting and challenging :-) :
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
You can change the way cubes aggregate values in a hierarchy. Doing this in one hierarchy is fine doing this using in multiple hierarchies might start to get complicated. It's worth checking twice if there is a mathematical 'unique' solution to the problem with multiple 'special' hierarchies.
Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
Here you can use writeback (MDX function Update cube), but I think it's a bit too simple for your needs. Implementation depend on the vendors. Pay attention creating cells can kill your memory as for large cubes you can quickly have millions of cells in a subcube.
What is the sparsity of your model ? -> number of cells with data / number of total cells
Some models have sparsities of 1e-30, here it's easy to explode if you're updating all cells ;-).
Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
This is looking complicated The issue here is the complexity of the algos, a possible solution using MDX language and how they match with the OLAP engige (fast enough). You're taking the risk it explodes, but have a look at Scope function
Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
That should not be a real challenge..
To answer your question, I don't think so. We've a similar problem - on the genetical field - and we are going to solve the problem 'adding' a dedicated calculation module to our OLAP solution. It's an interesting on going project

What is a good db for high update interval and aggregated reads?

I need to find the solution for the following problem:
A system that has 10,000s of parent object, each one contains a few hundreds of sub types with valus which might be updated a few times each second.
On the other end, I need to generate aggregated data That can be calculated 4-5 times a second. There can be 10,000s of this different aggregators.
The good thing is that rows are rarely added during runtime, only updated. Also, the value types are small int and floats.
What framework should I use?
I thought about using a key-value nosql method, but that doesn't enables me to do aggregated calls. Any idea is welcome, thanks
You should consider an OLAP solution, like SQL Server Analysis services. OLAP systems pre-calculate aggregations on several levels and facilitate therefore very high performance for scenarios such as yours.

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly like it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the resultsets. I realize that an aggregate function like average need to be rewritten into a count and sum to the nodes and that the coordinator divides the sum of the sums with the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model. I believe one issue would be the count distinct function.
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing you data, you need to de-normalize it.
You mention in a comment that your data base is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several, identical SQL queries to individual "shard" databases, each containing a subset, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly, parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 node and inevitably have to summarize or sort the 1000 row result set from each node, resulting in 1M rows, you still have a long result time and large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying custom solution (grid) you introduce a lot of complexity. Maybe, it's your only solution, but first did you try partitioning the table (native solution)?
I'd seriously be looking into an OLAP solution. The trick with the Cube is once built it can be queried in lots of ways that you may not have considered. And as #HLGEM mentioned, have you addressed indexing?
Even at in millions of rows, a good search should be logarithmic not linear. If you have even one query which results in a scan then your performance will be destroyed. We might need an example of your structure to see if we can help more?
I also agree fully with #Mason, have you profiled your query and investigated the query plan to see where your bottlenecks are. Adding nodes improving speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau, will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views) that optimized for the particular queries I need to execute (which I have worked with successfully on 10million+ row tables.)
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.