I have read about the Aggregation Design tab in SSAS. It says we can precalculate aggregations and that this will improve response time. But when I tried that option, I didn't go through any step that shows what kind of aggregation (e.g. sum, avg, etc.) we have to do. In one of the steps, there is an option for specifying a count of rows. So does aggregation mean only the count of rows in a table?
Other than this, we know there is a calculated measure option with which we can specify sum, avg, etc. This is also a kind of aggregation, right?
So, can anyone please clarify the difference between these two?
Thank you in advance :)
I'm not sure if I understand your question, but the method of aggregation is defined for each measure. See the properties of your measures to know what kind of aggregation method is used.
The estimated row count (of each partition) is used by SSAS to determine the necessary storage space of your aggregation design.
I'm new to MDX, coming from the world of SQL, and I am trying to understand the concept of 'ALL'. My understanding is that 'ALL' is a single member, creating a cube with low granularity as far as that dimension is concerned. Is this correct?
What are some examples of SQL that can help me reason about this concept? Of course, SQL uses tables, not cubes, but I'm sure there are some similarities that can help me make the connection. Let me try to create an example.
Let's say I have a table with such a schema representing a cube with 3 dimensions:
myTable (dim1_attribute1,dim1_attribute2, dim2_attribute1,dim2_attribute2,dim2_attribute3,
dim3_attribute1,dim3_attribute2,dim3_attribute3)
What kind of SQL would give me an aggregate with 'ALL' granularity on dim3?
'ALL' means that you are taking into account all members of that dimension; you are not slicing the cube by some particular member of it. The SQL equivalent would be to omit dim3 from the WHERE clause altogether, so that you are not filtering the resulting aggregation by some particular value of dim3 but are taking all rows into account.
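To make that concrete, here is a minimal sketch in SQLite. The table and column names are invented (a simplified version of the schema in the question, with one measure column added):

```python
import sqlite3

# Hypothetical fact table: three dimension columns and one measure.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myTable (dim1 TEXT, dim2 TEXT, dim3 TEXT, amount REAL)")
con.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [("a", "x", "p", 10.0), ("a", "x", "q", 5.0), ("b", "y", "p", 7.0)],
)

# 'ALL' granularity on dim3: dim3 appears in neither WHERE nor GROUP BY,
# so every dim3 member is rolled up into each (dim1, dim2) cell.
rows = con.execute(
    "SELECT dim1, dim2, SUM(amount) FROM myTable GROUP BY dim1, dim2 ORDER BY dim1"
).fetchall()
print(rows)  # ("a", "x") combines both dim3 members p and q: 10.0 + 5.0
```

The key point is simply that dim3 is mentioned nowhere in the query, so all of its members contribute to each aggregated row.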
I am trying to learn aggregation in SSAS.
I believe that when we create a cube, it generates dimensions and facts; doesn't it already generate all the possible combinations then?
What exactly happens when we create an aggregation in SSAS?
Please help.
Generally not all aggregations are created; instead, the "SSAS engine creates the aggregations based on an internal cost-vs.-benefit algorithm." Think of the aggregations as GROUP BYs in SQL that are saved on disk, as defined by your aggregation design.
If all possible aggregations were saved, we could get database explosion, with the aggregations becoming larger than the source data. This can actually make query performance worse.
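The "saved GROUP BY" analogy can be sketched in plain SQL. This is only an illustration of the idea, not SSAS internals; the table and column names are invented:

```python
import sqlite3

# A toy fact table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (year INTEGER, product TEXT, amount REAL)")
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(2008, "A", 10.0), (2008, "B", 20.0), (2009, "A", 5.0)])

# An "aggregation" is conceptually a GROUP BY whose result is saved to disk.
# The aggregation design decides which of these summary tables get materialized.
con.execute("""CREATE TABLE agg_sales_by_year AS
               SELECT year, SUM(amount) AS amount
               FROM fact_sales GROUP BY year""")

# A query at year granularity can now be answered from the small summary
# table instead of scanning the fact table.
fast = con.execute("SELECT amount FROM agg_sales_by_year WHERE year = 2008").fetchone()
print(fast)  # same answer as summing over the fact table
```

Materializing every possible GROUP BY combination is what causes the "database explosion" mentioned above, which is why the engine picks only a beneficial subset.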
I have a question for an assignment that I'm confused about.
The question is:
Calculate the lowest cost of executing the query:
SELECT max(profit) FROM Salesman;
What would the formula be for working this out? Would I need to use a SELECT cost formula, such as the linear search cost, or an aggregate search formula? Or the two combined?
Note: The profit field is not indexed in a B-tree.
I'm just having trouble deciding which formula to use for this query.
I'm not sure what metrics you're using to calculate the cost. But, the question requests the "lowest" cost. So, imagine the situation that takes the least work from the system, then calculate the cost using whatever your instructor or course indicates you should use.
If your data is predetermined, couldn't you just use the chosen database system itself to describe the costs?
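As one hedged illustration, using the textbook block-access cost model (all numbers below are invented; use whatever model and figures your course defines): with no index on profit, MAX must examine every record, so the cheapest plan is a single linear scan of the relation's blocks.

```python
import math

# Invented figures for the Salesman relation.
n_records = 10_000     # rows in Salesman
block_size = 4096      # bytes per disk block
record_size = 100      # bytes per Salesman record

blocking_factor = block_size // record_size    # records that fit in one block
b_r = math.ceil(n_records / blocking_factor)   # blocks holding the relation

# With no B-tree on profit, MAX(profit) must touch every record once,
# so the lowest cost is one full linear scan: b_r block transfers.
linear_scan_cost = b_r
print(linear_scan_cost)  # 250 block transfers
```

The aggregate itself (tracking the running maximum) costs nothing extra in this model, since it is computed in memory during the scan.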
To preface this, I'm not familiar with OLAP at all, so if the terminology is off, feel free to offer corrections.
I'm reading about OLAP and it seems to be all about trading space for speed, wherein you precalculate (or calculate on demand) and store aggregations about your data, keyed off by certain dimensions. I understand how this works for dimensions that have a discrete set of values, like { Male, Female } or { Jan, Feb, ... Dec } or { #US_STATES }. But what about dimensions that have completely arbitrary values like (0, 1.25, 3.14156, 70000.23, ...)?
Does the use of OLAP preclude the use of aggregations in queries that hit the fact tables, or is it merely used to bypass things that can be precalculated? Like, arbitrary aggregations on arbitrary values still need to be done on the fly?
Any other help regarding learning more about OLAP would be much appreciated. At first glance, both Google and SO seem to be a little dry (compared to other, more popular topics).
Edit: I was asked for examples of dimensions with arbitrary values.
VELOCITY of experiments: 1.256 m/s, -2.234 m/s, 33.78 m/s
VALUE of transactions: $120.56, $22.47, $9.47
Your velocity and value column examples are usually not the sort of columns that you would query in an OLAP way - they are the values you're trying to retrieve, and would presumably be in the result set, either as individual rows or aggregated.
However, I said usually. In our OLAP schema, we have a good example of the kind of column you're thinking of: event_time (a date-time field, with granularity to the second). In our data it is nearly unique: no two events happen during the same second. But since we have years of data in the table, that still means there are hundreds of millions of potentially distinct values, and when we run our OLAP queries, we almost always want to constrain based on time ranges.
The solution is to do what David Raznick has said - you create a "bucketed" version of the value. So, in our table, in addition to the event_time column, we have an event_time_bucketed column - which is merely the date of the event, with the time part being 00:00:00. This reduces the count of distinct values from hundreds of millions to a few thousand. Then, in all queries that constrain on date, we constrain on both the bucketed and the real column (since the bucketed column will not be accurate enough to give us the real value), e.g.:
WHERE event_time BETWEEN '2009.02.03 18:32:41' AND '2009.03.01 18:32:41'
AND event_time_bucketed BETWEEN '2009.02.03' AND '2009.03.01'
In these cases, the end user never sees the event_time_bucketed column - it's just there for query optimization.
For floating point values like you mention, the bucketing strategy may require a bit more thought, since you want to choose a method that will result in a relatively even distribution of the values and that preserves contiguity. For example, if you have a classic bell distribution (with tails that could be very long) you'd want to define the range where the bulk of the population lives (say, 1 or 2 standard deviations from mean), divide it into uniform buckets, and create two more buckets for "everything smaller" and "everything bigger".
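A sketch of that bucketing strategy for floating-point values (the bucket count, the two-standard-deviation cutoff, and the helper name are all arbitrary choices for illustration):

```python
import statistics

def make_bucketer(values, n_buckets=10, n_stdevs=2.0):
    """Uniform buckets across mean +/- n_stdevs standard deviations,
    plus underflow/overflow buckets for the long tails."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    lo, hi = mean - n_stdevs * stdev, mean + n_stdevs * stdev
    width = (hi - lo) / n_buckets

    def bucket(x):
        if x < lo:
            return 0                       # "everything smaller"
        if x >= hi:
            return n_buckets + 1           # "everything bigger"
        return 1 + int((x - lo) // width)  # uniform interior buckets
    return bucket

# Arbitrary measure values collapse into a small, contiguous set of bucket
# ids, which is what makes them usable as a constraint column.
sample = [1.256, -2.234, 33.78, 120.56, 22.47, 9.47]
bucket = make_bucketer(sample)
print(sorted({bucket(v) for v in sample}))
```

As with event_time_bucketed above, queries would constrain on both the bucket id and the real column, with the bucket column existing purely for optimization.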
I have found this link to be handy http://www.ssas-info.com/
Check out the webcasts section, where they walk you through different aspects, starting from what BI and warehousing are, through designing a cube, dimensions, calculations, aggregations, KPIs, perspectives, etc.
In OLAP, aggregations help reduce query response time by providing pre-calculated values for queries to use. The flip side is increased storage space, as more space is needed to store the aggregations on top of the base data.
SQL Server Analysis Services has Usage Based Optimization Wizard which helps in aggregation design by analyzing queries that have been submitted by clients (reporting clients like SQL Server Reporting Services, Excel or any other) and refining the aggregation design accordingly.
I hope this helps.
cheers
If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?
This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
I have done both and am torn. On one hand, aggregation tables have given me significantly faster queries, but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping back entirely to the raw data table or combining more fine-grained aggregations. I have found that keeping track in the application code of which aggregation table to query, and when, is more work than you'd think, and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
On the other hand, querying from the raw data can be punishingly slow but lets me be very flexible about the data ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise the application code requires fewer updates. I suspect that if I was smarter about my indexing (i.e. always having good covering indexes), I would be able to reduce the penalty of selecting from the raw data but that's by no means a panacea.
Is there any way I can have the best of both worlds?
We had that same problem and ran into the same issues you ran into. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis services itself, but it's been great. Some of the benefits we have found are:
- You have a lot of flexibility for querying any way you want. Before, we had to build specific aggregates, but now one cube answers all our questions.
- Storage in a cube is far smaller than the detailed data.
- Building and processing the cubes takes less time and produces less load on the database servers than the aggregates did.
Some CONS:
- There is a learning curve around building cubes and learning MDX.
- We had to create some tools to automate working with the cubes.
UPDATE:
Since you're using MySql, you could take a look at Pentaho Mondrian, which is an open source OLAP solution that supports MySql. I've never used it though, so I don't know if it will work for you or not. Would be interested in knowing if it works for you though.
It helps to pick a good primary key (i.e. [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range condition on used_date.
But as the table grows, you can reduce your table size by aggregating into a table like [user_id, used_date]. For every range where the time of day doesn't matter, you can then use that table. Another way to reduce the table size is to archive old data that you no longer allow to be queried.
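The roll-up described above might look like the following (a sketch only; the schema, table names, and the choice of SUM as the aggregate are invented for illustration):

```python
import sqlite3

# Raw table keyed by [user_id, used_date, used_time].
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE raw_usage
               (user_id INTEGER, used_date TEXT, used_time TEXT, amount REAL)""")
con.executemany("INSERT INTO raw_usage VALUES (?, ?, ?, ?)", [
    (1, "2009-02-03", "09:15:00", 10.0),
    (1, "2009-02-03", "17:40:00", 2.5),
    (1, "2009-02-04", "08:05:00", 4.0),
])

# Collapse the time of day: one row per (user_id, used_date).
con.execute("""CREATE TABLE daily_usage AS
               SELECT user_id, used_date, SUM(amount) AS amount
               FROM raw_usage GROUP BY user_id, used_date""")

# Range queries that don't care about time of day can hit the smaller table.
total = con.execute("""SELECT SUM(amount) FROM daily_usage
                       WHERE user_id = 1
                         AND used_date BETWEEN '2009-02-03' AND '2009-02-04'"""
                    ).fetchone()[0]
print(total)  # 12.5 for the first day plus 4.0 for the second
```
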
I always lean towards raw data. Once aggregated, you can't go back.
It's nothing to do with deletion: unless it's the simplest of aggregated data sets, you can't accurately revert/transpose the data back to raw.
Ideally, I'd use a materialized view (assuming that the data can fit within the constraints) because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with the computed columns, or a trigger to update an actual table.
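The trigger approach can be sketched like this: an ordinary summary table kept current on every insert into the raw table. Names are invented, and the trigger syntax shown is SQLite's for the sake of a runnable example; MySQL's trigger syntax differs in the details:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE purchases (user_id INTEGER, amount REAL);
CREATE TABLE user_totals (user_id INTEGER PRIMARY KEY, total REAL);

-- Keep the summary table current on every insert into the raw table:
-- seed a zero row if the user is new, then add the purchase amount.
CREATE TRIGGER purchases_ai AFTER INSERT ON purchases
BEGIN
    INSERT OR IGNORE INTO user_totals (user_id, total) VALUES (NEW.user_id, 0);
    UPDATE user_totals SET total = total + NEW.amount
    WHERE user_id = NEW.user_id;
END;
""")

con.executemany("INSERT INTO purchases VALUES (?, ?)", [(1, 10.0), (1, 5.5), (2, 3.0)])
total = con.execute("SELECT total FROM user_totals WHERE user_id = 1").fetchone()[0]
print(total)  # 15.5
```

This behaves like a poor man's materialized view: reads are as cheap as an aggregation table, at the cost of a little extra work on every write.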
This is an old question, but I found this useful; it was answered by a MicroStrategy engineer.
BTW, there are already existing solutions (like cube.dev or Dremio), so you don't have to do it yourself.