When is a Static Named Set preferred over a Dynamic Named Set in SSAS?

OK, so this question is with regard to the use of Dynamic and Static Sets in Analysis Services. It is my understanding that Static Sets are evaluated when the cube is processed and ignore any conditions specified in the WHERE clause (slicer) of an MDX query, whereas Dynamic Sets are evaluated at query execution time.
I have a cube that has an extremely large number of customer Claim Numbers in a dimension. If my clients want to view any number of measures associated with these Claim Numbers, but only for a range of them, they have to filter through a good number of the Claim Numbers.
To alleviate this, I created a Dynamic Set that does nothing more than group several thousand of the numbers into a group that they may use. The problem is that I've seen a great degradation of performance using the Dynamic Sets so I am thinking that a Static Set may suffice for this type of scenario.
So will a Static Set suffice if it is just a set of Dimension members with no calculations contained in it?

At least static sets give SSAS the ability to cache results, so you might get some performance improvement, but rethinking the [Claim] dimension would be better if possible.
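For reference, the only difference in the cube's MDX script is the DYNAMIC keyword; the set body itself can stay a plain range of members. A minimal sketch, with made-up dimension and member keys:
// Static: evaluated once when the MDX script runs, so results can be cached
CREATE SET CURRENTCUBE.[Claim Range A] AS
    { [Claim].[Claim Number].&[100000] : [Claim].[Claim Number].&[104999] };
// Dynamic: re-evaluated per query, in the context of the WHERE slicer
CREATE DYNAMIC SET CURRENTCUBE.[Claim Range A Dynamic] AS
    { [Claim].[Claim Number].&[100000] : [Claim].[Claim Number].&[104999] };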


SSRS: Patterns for conditional dataset definitions

I am developing SSRS reports that require a user selection (via a parameter) to retrieve either live data or historical data.
The sources for live and historical data are separate objects in a SQL Server database (views for live data; table-valued functions accepting a date parameter for historical data), but their schemas - the columns they return - are the same, so other than the dataset definition, the rest of the report doesn't need to know what its source is.
The dataset query draws from several database objects, and it contains joins and case statements in the select.
There are several approaches I can take to surfacing data from different sources based on the parameter selection I've described (some of which I've tested), listed below.
The main goal is to ensure that performance for retrieving the live data (the primary use case) is not unduly affected by the presence of logic and plumbing to support the history use case. In addition, ease of maintenance of the solution (including database objects and rdl) is a secondary, but important, factor.
1. Use an expression in the dataset query text to conditionally return the full SQL query text, with the correct sources included, via string concatenation. Pros: can resolve to a straight query that isn't polluted by the 'other' use case for any given execution; all logic for the report is housed in the report. Cons: awful to work with, and has limitations for lengthy SQL.
2. Use a function in the report's code module to do the same as 1. Pros: as per 1, but a marginally better design-time experience. Cons: as per 1, but also adds another layer of abstraction that reduces ease of maintenance.
3. Implement multi-statement TVFs on the database that process the parameter and retrieve the correct data using logic in T-SQL. Pros: flexibility of T-SQL functionality, no string building/substitution involved; the report's dataset query can select * from the results and apply further report parameters. Cons: big performance hit compared to in-line queries; moves some logic outside the rdl.
4. Implement stored procedures to do the same as 3. Pros: as per 3, but without the ease of select *. Cons: as per 3.
5. Implement in-line TVFs that union together live and history data, using a dummy input parameter that adds something resolving to 1=0 in the where clause of the source that isn't relevant (a rough sketch follows below). Pros: clinging on to the in-line query approach; other pros as per 3. Cons: feels like a hack, adds a performance hit for a query component that is known to return 0 rows, and adds complexity to the query.
I am leaning towards option 3 or 4 at this point, but I am eager to hear what the preferred approach would be (even if not listed here) and why.
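For illustration only, option 5 might look something like the sketch below; the function, view, and column names are all made up, and it assumes the history TVF tolerates a NULL date (otherwise a sentinel date could be passed instead):
-- Hypothetical inline TVF: @AsOfDate NULL means 'live', otherwise 'historical'
CREATE FUNCTION dbo.fn_ReportData (@AsOfDate date)
RETURNS TABLE
AS
RETURN
(
    SELECT ColA, ColB
    FROM dbo.vw_LiveData                       -- live source (view)
    WHERE @AsOfDate IS NULL                    -- resolves to 1=0 for historical runs
    UNION ALL
    SELECT ColA, ColB
    FROM dbo.fn_HistoricalData(@AsOfDate)      -- historical source (date-parameterised TVF)
    WHERE @AsOfDate IS NOT NULL                -- resolves to 1=0 for live runs
);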
What's the difference between live and historical? Is "live" data data that changes, while historical data does not?
Is it not possible to replicate or push live/historical data into a Data Warehouse built specifically for reporting?

How does OLAP address dimensions that are numeric ranges?

To preface this, I'm not familiar with OLAP at all, so if the terminology is off, feel free to offer corrections.
I'm reading about OLAP and it seems to be all about trading space for speed, wherein you precalculate (or calculate on demand) and store aggregations about your data, keyed off by certain dimensions. I understand how this works for dimensions that have a discrete set of values, like { Male, Female } or { Jan, Feb, ... Dec } or { #US_STATES }. But what about dimensions that have completely arbitrary values like (0, 1.25, 3.14156, 70000.23, ...)?
Does the use of OLAP preclude the use of aggregations in queries that hit the fact tables, or is it merely used to bypass things that can be precalculated? Like, arbitrary aggregations on arbitrary values still need to be done on the fly?
Any other help regarding learning more about OLAP would be much appreciated. At first glance, both Google and SO seem to be a little dry (compared to other, more popular topics).
Edit: I was asked for examples of dimensions with arbitrary values.
VELOCITY of experiments: 1.256 m/s, -2.234 m/s, 33.78 m/s
VALUE of transactions: $120.56, $22.47, $9.47
Your velocity and value column examples are usually not the sort of columns that you would query in an OLAP way - they are the values you're trying to retrieve, and would presumably be in the result set, either as individual rows or aggregated.
However, I said usually. In our OLAP schema, we have a good example of the kind of column you're thinking of: event_time (a date-time field, with granularity to the second). In our data it is nearly unique - no two events happen during the same second - but since we have years of data in our table, that still means there are hundreds of millions of potentially distinct values, and when we run our OLAP queries we almost always want to constrain on time ranges.
The solution is to do what David Raznick has said - you create a "bucketed" version of the value. So, in our table, in addition to the event_time column, we have an event_time_bucketed column - which is merely the date of the event, with the time part being 00:00:00. This reduces the count of distinct values from hundreds of millions to a few thousand. Then, in all queries that constrain on date, we constrain on both the bucketed and the real column (since the bucketed column will not be accurate enough to give us the real value), e.g.:
WHERE event_time BETWEEN '2009.02.03 18:32:41' AND '2009.03.01 18:32:41'
AND event_time_bucketed BETWEEN '2009.02.03' AND '2009.03.01'
In these cases, the end user never sees the event_time_bucketed column - it's just there for query optimization.
For floating point values like you mention, the bucketing strategy may require a bit more thought, since you want to choose a method that will result in a relatively even distribution of the values and that preserves contiguity. For example, if you have a classic bell distribution (with tails that could be very long) you'd want to define the range where the bulk of the population lives (say, 1 or 2 standard deviations from mean), divide it into uniform buckets, and create two more buckets for "everything smaller" and "everything bigger".
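As a rough illustration (the table and column names here are invented, and the float bucketing assumes a mean of 0 and a standard deviation of 1 just to keep the numbers simple), the bucketed values might be derived like this:
-- Date bucket: strip the time portion (SQL Server syntax)
ALTER TABLE events ADD event_time_bucketed AS CAST(event_time AS date) PERSISTED;
-- Float bucket: uniform 0.5-wide buckets within +/- 2 standard deviations, plus two tail buckets
SELECT velocity,
       CASE
           WHEN velocity < -2.0 THEN 'below -2 sd'
           WHEN velocity >  2.0 THEN 'above +2 sd'
           ELSE CONCAT('bucket ', CAST(FLOOR((velocity + 2.0) / 0.5) AS int))
       END AS velocity_bucket
FROM experiments;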
I have found this link to be handy http://www.ssas-info.com/
Check out the webcasts section, where they walk you through different aspects, starting from what BI and warehousing are, through designing a cube, dimensions, calculations, aggregations, KPIs, perspectives, etc.
In OLAP, aggregations help reduce query response time by providing pre-calculated values that queries can use. The flip side is an increase in storage space, since the aggregations are stored in addition to the base data.
SQL Server Analysis Services has a Usage Based Optimization Wizard which helps with aggregation design by analyzing the queries that have been submitted by clients (reporting clients like SQL Server Reporting Services, Excel, or any other) and refining the aggregation design accordingly.
I hope this helps.
cheers

To aggregate or not to aggregate, that is the database schema design question

If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?
This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
I have done both and am torn. On one hand, aggregation tables have given me significantly faster queries, but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping back to the raw data table entirely or combining more fine-grained aggregations. I have found that keeping track in the application code of which aggregation table to query, and when, is more work than you'd think, and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
On the other hand, querying from the raw data can be punishingly slow but lets me be very flexible about the data ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise the application code requires fewer updates. I suspect that if I was smarter about my indexing (i.e. always having good covering indexes), I would be able to reduce the penalty of selecting from the raw data but that's by no means a panacea.
Is there any way I can have the best of both worlds?
We had that same problem and ran into the same issues you ran into. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis services itself, but it's been great. Some of the benefits we have found are:
You have a lot of flexibility for querying any way you want. Before, we had to build specific aggregates, but now one cube answers all our questions.
Storage in a cube is far smaller than the detailed data.
Building and processing the cubes takes less time and produces less load on the database servers than the aggregates did.
Some cons:
There is a learning curve around building cubes and learning MDX.
We had to create some tools to automate working with the cubes.
UPDATE:
Since you're using MySQL, you could take a look at Pentaho Mondrian, which is an open-source OLAP solution that supports MySQL. I've never used it, though, so I don't know whether it will work for you. I would be interested to know if it does.
It helps to pick a good primary key (i.e. [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range condition on used_date.
But as the table grows, you can reduce your table size by aggregating into a table like [user_id, used_date]; for every range where the time of day doesn't matter, you can then use that table (a sketch follows below). Another way to reduce the table size is to archive old data that you no longer allow to be queried.
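A minimal sketch of that kind of daily rollup, assuming a hypothetical raw table called purchases with a used_at timestamp and an amount column:
-- Daily rollup keyed on (user_id, used_date)
CREATE TABLE purchases_daily (
    user_id       int           NOT NULL,
    used_date     date          NOT NULL,
    total_amount  decimal(12,2) NOT NULL,
    purchase_cnt  int           NOT NULL,
    PRIMARY KEY (user_id, used_date)
);
INSERT INTO purchases_daily (user_id, used_date, total_amount, purchase_cnt)
SELECT user_id, CAST(used_at AS date), SUM(amount), COUNT(*)
FROM purchases
GROUP BY user_id, CAST(used_at AS date);
-- Range queries that don't need the time of day hit the much smaller table
SELECT SUM(total_amount)
FROM purchases_daily
WHERE user_id = 42
  AND used_date BETWEEN '2009-01-01' AND '2009-03-31';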
I always lean towards raw data. Once aggregated, you can't go back.
It's nothing to do with deletion: unless you have the simplest of aggregated data sets, you can't accurately revert/transpose the data back to raw.
Ideally, I'd use a materialized view (assuming that the data can fit within the constraints) because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with the computed columns, or a trigger to update an actual table.
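To make the trigger idea concrete, here is a rough MySQL sketch, reusing the hypothetical purchases/purchases_daily tables from the earlier sketch (an assumption, not an existing schema):
-- Keep a daily summary table in sync on every insert into the raw table
CREATE TRIGGER trg_purchases_rollup
AFTER INSERT ON purchases
FOR EACH ROW
INSERT INTO purchases_daily (user_id, used_date, total_amount, purchase_cnt)
VALUES (NEW.user_id, DATE(NEW.used_at), NEW.amount, 1)
ON DUPLICATE KEY UPDATE
    total_amount = total_amount + NEW.amount,
    purchase_cnt = purchase_cnt + 1;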
This is an old question, but I recently found this useful; it was answered by a MicroStrategy engineer.
By the way, there are already existing solutions (such as cube.dev or Dremio), so you don't have to build this yourself.

Sorting based on calculation with nhibernate - best practice

I need to do paging with the sort order based on a calculation. The calculation is similar to something like Reddit's hotness algorithm in that it's dependent on time - time since post creation.
I'm wondering what the best practice for this would be: whether to have this sort as a SQL function, or to run an update once an hour to recalculate the whole table.
The table has hundreds of thousands of rows, and I'm using NHibernate, so a scheduled full recalculation could cause problems.
Any advice?
It most likely will depend a lot on the load on your server. A few assumptions for my answer:
Your calculation is most likely not simple, but will take into account a variety of factors, including time elapsed since post
You are expecting at least reasonable growth in your site, meaning new data will be added to your table.
I would suggest your best bet would be to calculate and store your ranking value and, as Nuno G mentioned, retrieve it using an ORDER BY clause. As you note, there are likely to be some implications, two of which would be:
Scheduling Updates
Ensuring access to the table
As far as scheduling goes, you may be able to look at some ways of intelligently recalculating your value. For example, you may be able to identify when a calculation is likely to change (for example, if a dependent record is updated you might fire a trigger, adding the relevant ID from your table to a queue for recalculation). You may also do the update in ranges, rather than over the full table.
You will also want to minimise any locking of your table while you are recalculating. There are a number of ways to do this, including setting your isolation levels (using MS SQL terminology). If you are really worried, you could even perform your calculation externally (e.g. in a temp table) and then simply run an update of the values into your main table.
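A rough sketch of that "calculate externally, then apply" idea; the table, columns, and the decay formula are all made up for illustration:
-- Recalculate scores into a temp table, then apply them in one set-based update
SELECT p.post_id,
       p.base_score / POWER(2, DATEDIFF(HOUR, p.created_at, GETDATE()) / 12.0) AS hotness
INTO #new_scores
FROM posts AS p;
UPDATE p
SET p.hotness = s.hotness
FROM posts AS p
JOIN #new_scores AS s ON s.post_id = p.post_id;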
As a final note I would recommend looking into the paging options available to you - if you are talking about thousands of records make sure that your mechanism determines the page you need on the SQL server so that you are not returning the thousands of rows to your application, as this will slow things down for you.
If you can perform the calculation in SQL, try using NHibernate to load the sorted collection by executing a SQLQuery whose query includes an ORDER BY expression.
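For example, the SQL handed to CreateSQLQuery might look roughly like this; the decay formula, names, and page size are placeholders, and OFFSET/FETCH needs SQL Server 2012 or later:
-- Page 3 (rows 41-60), ordered by a time-decayed score computed on the fly
SELECT post_id, title, created_at
FROM posts
ORDER BY base_score / POWER(2, DATEDIFF(HOUR, created_at, GETDATE()) / 12.0) DESC
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY;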

Setting up Dim and Fact tables for a Data Warehouse

I'm tasked with creating a data warehouse for a client. The tables involved don't really follow the traditional examples out there (products/orders), so I need some help getting started. The client is essentially a processing center for cases (similar to a legal case). Each day, new cases are entered into the DB under the "cases" table. Each column contains some bit of info related to the case. As the case is being processed, additional one-to-many tables are populated with events related to the case. There are quite a few of these event tables; example tables might be: case-open, case-dept1, case-dept2, case-dept3, etc. Each of these tables has a caseid which maps back to the "cases" table. There are also a few lookup tables involved as well.
Currently, the reporting needs relate to exposing bottlenecks in the various stages and the granularity is at the hour level for certain areas of the process.
I may be asking too much here, but I'm looking for some direction as to how I should setup my Dim and Fact tables or any other suggestions you might have.
The fact table is the case event and it is 'factless' in that it has no numerical value. The dimensions would be time, event type, case and maybe some others depending on what other data is in the system.
You need to consolidate the event tables into a single fact table, labelled with an 'event type' dimension. The throughput/bottleneck reports are calculating differences between event times for specific combinations of event types on a given case.
The reports should calculate the event-to-event times and possibly bin them into a histogram. You could also label certain types of event combinations and apply the label to the events of interest. These events could then have the time recorded against them, which would allow slice-and-dice operations on the times with an OLAP tool (see the sketch below).
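As a rough, made-up example of deriving one such event-to-event time from a single consolidated event fact table (table and column names are assumptions):
-- Hours from 'case-open' to 'case-dept1' for each case
SELECT f1.case_id,
       DATEDIFF(HOUR, f1.event_time, f2.event_time) AS hours_open_to_dept1
FROM fact_case_event AS f1
JOIN fact_case_event AS f2
  ON f2.case_id = f1.case_id
WHERE f1.event_type = 'case-open'
  AND f2.event_type = 'case-dept1';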
If you want to benchmark certain stages in the life-cycle progression, you could have a table that holds case type, event type 1, event type 2, and benchmark time.
With a bit of massaging, you might be able to use a data mining toolkit or even a simple regression analysis to spot correlations between case attributes and event-event times (YMMV).
I suggest you check out Kimball's books, particularly this one, which should have some examples to get you thinking about applications to your problem domain.
In any case, you need to decide whether a dimensional model is even appropriate. It is quite possible to treat a 3NF database as an 'enterprise data warehouse', with different indexes or summaries, or whatever.
Without seeing your current schema, it's REALLY hard to say. Sounds like you will end up with several star models with some conformed dimensions tying them together. So you might have a case dimension as one of your conformed dimensions. The facts from each other table would be in fact tables which link both to the conformed dimension and any other dimensions appropriate to the facts, so for instance, if there is an employee id in case-open, that would link to an employee conformed dimension, from the case-open-fact table. This conformed dimension might be linked several times from several of your subsidiary fact tables.
Kimball's modeling method is fairly straightforward, and can be followed like a recipe. You need to start by identifying all your facts, grouping them into fact tables, identifying individual dimensions on each fact table and then grouping them as appropriate into dimension tables, and identifying the type of each dimension.
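As a concrete illustration of that recipe applied to this case-processing domain (every name here is invented; your actual facts and dimensions will differ):
-- One possible layout: a consolidated event fact table plus conformed dimensions
CREATE TABLE dim_case       (case_key int PRIMARY KEY, case_id int, region varchar(50), case_type varchar(50));
CREATE TABLE dim_event_type (event_type_key int PRIMARY KEY, event_type_desc varchar(100));
CREATE TABLE dim_date       (date_key int PRIMARY KEY, calendar_date date);
CREATE TABLE fact_case_event (
    case_key       int NOT NULL REFERENCES dim_case(case_key),
    event_type_key int NOT NULL REFERENCES dim_event_type(event_type_key),
    date_key       int NOT NULL REFERENCES dim_date(date_key),
    event_time     datetime NOT NULL  -- kept at full precision for event-to-event timing
);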
Like any other facet of development, you must approach the problem from the end requirements ("user stories" if you will) backwards. The most conservative approach for a warehouse is to simply represent a copy of the transaction database. From there, guided by the requirements, certain optimizations can be made to enhance the performance of certain data access patterns. I believe it is important, however, to see these as optimizations and not to assume that a data warehouse automatically must be a complex explosion of every possible dimension over every fact. My experience is that for most purposes, a straight representation is adequate or even ideal for 90+% of analytical queries.
For the remainder, first consider indexes, indexed views, additional statistics, or other optimizations that can be made without affecting the structures. Then, if aggregation or other redundant structures are needed to improve performance, consider separating these into a "data mart" (at least conceptually), which provides a separation between primitive facts and redundancies thereof. Finally, if the requirements are too fluid and the aggregation demands too heavy to function efficiently this way, then you might consider wholesale explosions of data, i.e. star schema. Even then, limit this to the smallest cross-section of the data possible.
Here's what I came up with, essentially. Thanks, NXC.
Fact Events
    EventID
    TimeKey
    CaseID
Dim Events
    EventID
    EventDesc
Dim Time
    TimeKey
Dim Regions
    RegionID
    RegionDesc
Cases
    CaseID
    RegionID
This may be a case of choosing a solution before you've considered the problem. Not all data warehouses fit the star schema model. I don't see that you are aggregating any data here. So far we have a factless fact table and at least one rapidly changing dimension (cases).
Looking at what I see so far, I think the central entity in this database should be the case. Trying to stick the event in the middle doesn't seem right. Try looking at it a different way: perhaps cases, events, and case events to start.