We have a very large dimension in our SSAS cube. During the incremental run we are using ProcessAdd to process the dimension. This dimension processing is taking 95% of the total cube processing time.
This dimension is based on a single table. The named query for the dimension in the DSV is:
SELECT ABC, XYZ, DEF, PQR, PLADKey, LEFT(ABC, 3) AS DNL1, LEFT(ABC, 7) AS DNL2,
LEFT(ABC, 9) AS DNL3
FROM dbo.PLAD AS ad
The table has more than 33,000,000 rows and grows daily. Is it possible that ProcessAdd is slow because of the high row count? Does it automatically pick up only the new rows, or do we have to specify filter criteria to identify the new rows (like adding a WHERE condition to select only the data with a key greater than the last processed key value)?
We are using AMO to generate the XMLA script for processing. If we need to add filters, how do we do that in AMO?
We are working on SQL Server 2008 R2.
Any suggestions that could improve the performance of this dimension processing would be helpful.
If I understood your current state correctly, you ran a ProcessAdd on that dimension but didn't customize the query to read just the new rows? First, it is important to only do ProcessAdd on dimensions which are insert-only (no updates or deletes) in your ETL. If that's your case, then I blogged about ProcessAdd here. See the "ProcessAdd Dimension 2008.xmla" example. It shows how to provide a SQL query that only returns the new rows.
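For reference, that query is usually just your existing named query with a watermark filter added. A rough sketch, assuming PLADKey is an ever-increasing surrogate key and using a hypothetical dbo.PLADProcessLog control table to remember the last key processed (update it after each successful ProcessAdd run):

-- Hypothetical control table storing the highest PLADKey already processed:
-- CREATE TABLE dbo.PLADProcessLog (LastPLADKey bigint NOT NULL);

-- Return only the rows added since the last ProcessAdd run.
SELECT ABC, XYZ, DEF, PQR, PLADKey,
       LEFT(ABC, 3) AS DNL1, LEFT(ABC, 7) AS DNL2, LEFT(ABC, 9) AS DNL3
FROM dbo.PLAD AS ad
WHERE ad.PLADKey > (SELECT MAX(LastPLADKey) FROM dbo.PLADProcessLog);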
Related
I have a dimension table in my current warehouse (Netezza) which has 10 million records and is updated on a daily basis.
We are planning to migrate to BigQuery; should we keep this dimension table as it is?
How can we redesign this large dimension in BigQuery?
Because BigQuery is not intended for updates, it's not that easy to implement a dimension table. The proper answer depends on your use case.
But here are some alternatives:
Have an append-only dimension table with an "UpdatedAt" field. Then use a window function to get the latest version (you can even create a view that exposes only the latest version; see the sketch below)
Truncate the dimension table daily with the latest version of your data.
Create an external table based on GCS / Big Table / Cloud SQL, and have the dimensions updated there.
Save your dimension table in a separate database, and use Cloud Dataflow to perform the join
Save the dimension data together with the fact table (Yes, there will be a lot of duplications, but sometimes it's worth the cost)
Simply update the dimension table whenever there is a change (note that BigQuery limits how often you can do that)
All of these approaches have drawbacks. The solution can even be a mix of more than one approach.
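As an illustration of the first alternative (append-only table plus a window function), a view could look roughly like this; the dataset, table, and column names (my_dataset.dim_customer, customer_id, UpdatedAt) are assumptions:

-- Keep only the most recent version of each dimension row,
-- assuming an append-only table with an UpdatedAt timestamp.
CREATE OR REPLACE VIEW my_dataset.dim_customer_current AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    d.*,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY UpdatedAt DESC) AS rn
  FROM my_dataset.dim_customer AS d
)
WHERE rn = 1;

Queries and downstream joins then read my_dataset.dim_customer_current instead of the raw table.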
I have a scenario where an SSAS cube's data needs to be refreshed. We want to avoid a full refresh that takes an hour, and do a 'delta' refresh instead. The delta refresh should:
1) Update fact records that have changed
2) Insert fact records that are new
3) Delete fact records that no longer exist
Consider a fact table with three dimensions: Company, Security, FiscalYear
and two measures: Qty, Amount
Scenario: In the fact table, a record with Company A, Security A, FiscalYear A has the measure Qty changed from 2 to 20. Previously the cube correctly showed the Qty as 2. After the update:
If we do a Full refresh, it correctly shows 20. But in order to get this, we had to suffer a full hour of cube processing.
We tried adding a timestamp column to the fact table, splitting the cube into Current and Old partitions, fully processing the Current partition and merging it into the Old partition, as seems to be the popular suggestion. When we browse the cube, it shows 22 (the old 2 plus the new 20), which is incorrect.
We tried an Incremental refresh of the cube, same issue. It shows 22, also incorrect.
So what I am trying to ascertain here is whether there is any way to process a cube so that it only takes the changes (and by that I mean updates, inserts AND deletes, not just inserts!) and applies them to the data inside an SSAS cube.
Any help would be greatly appreciated!
Thanks!
No, there is no way to do this. The only control you have over processing is the granularity of what you process. For instance, if you know that data over a certain age will never change, you can put data over that age in a partition, and not include it in your processing.
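For example, the relational source queries behind two such partitions could be split on a date boundary, roughly like the sketch below (the table and column names are placeholders based on the fact table described above, and the boundary value is just an example):

-- "Old" partition: closed periods that never change, processed rarely.
SELECT CompanyKey, SecurityKey, FiscalYearKey, Qty, Amount
FROM dbo.FactHoldings
WHERE FiscalYearKey < 2015;

-- "Current" partition: open periods, fully reprocessed on each run.
SELECT CompanyKey, SecurityKey, FiscalYearKey, Qty, Amount
FROM dbo.FactHoldings
WHERE FiscalYearKey >= 2015;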
I have one fact table that holds all information about how much a company buys and sells. In order to create some calculations, for example margin, I need to use the rows for a purchase together with the rows for sales to get the correct results.
Now, I have created a calculated measure that gives me the correct result, but the more dimensions I add to my query, the slower the query runs when using this calculated measure. It seems like a lot of time is spent resolving the tuples I am using to find the purchase rows.
I am using tuples to "store" the purchase row, but the tuple becomes quite large because I need to include all default members of the dimensions used by the sales rows in order for them to be used. Basically my tuples look like this, just with more dimension hierarchies:
(
[Dimension 1].[Hierarchy 1].&[member]
,[Dimension 1].[Hierarchy 2].&[member]
,[Dimension 2].[Hierarchy 1].&[member]
,[Dimension 3].[Hierarchy 1].&[member]
,[Dimension 4].[Hierarchy 1].&[member]
,[Measures].[Purchase Standard Cost]
)
I then multiply this tuple with a measure from the sales rows and I get my result.
Anyone have any tips on how to improve the query performance? The calculation works, and if I slice by just a couple of dimensions performance is not too bad, but the more I add the slower it gets, and the users will hit performance issues.
Since the number of dimensions used has increased, the Storage Engine has to scan additional files, which could be the reason for this performance degradation.
I have several suggestions based on their effectiveness from my point of view:
1. Implement partitioning (if it's not implemented yet) so that a smaller amount of data is scanned.
2. "Materialize" some tuples into a physical dimension (if there are no dynamic, late-binding functions etc. in the MDX):
2.1. Add corresponding keys, which represent the tuples, to your source tables.
2.2. Build appropriate dimensions on these keys.
2.3. Use calculated measures with these "ex-tuples".
Example:
You have a 100M-row table with columns SomeDate, Customer, Product, Amount and a single-partition measure group.
You need to create tuples like (2015-01-01, Customer A, Product Z, Amount).
The server has to scan all the data to get the exact values.
Once you add partitions by SomeDate year (+ slices), the server will read only the 2015 partition.
2.1. Add a column Tuple_ID int to the table and map it during ETL (a T-SQL sketch follows after this answer).
E.g. Tuple_ID = 1 where Customer = 'Customer A' and Product = 'Product Z'
2.2. Create a dimension on this new field (or on additional table with list of combinations to be able to modify logic easily).
2.3. Use ([Tuple ID].[Tuple ID].&[1],[Measures].[Amount]) in calculation.
The advantage of this technique is that the server reads only pre-calculated values, and queries speed up as a result.
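A rough T-SQL sketch of steps 2.1/2.2, assuming the fact table is called dbo.FactSales and the combinations are kept in a hypothetical dbo.TupleMap table:

-- Step 2.1: add the surrogate key column to the fact table.
ALTER TABLE dbo.FactSales ADD Tuple_ID int NULL;

-- Step 2.2 (optional): keep the list of combinations in a mapping table,
-- so the logic can be changed without touching the fact loader.
CREATE TABLE dbo.TupleMap (
    Tuple_ID int NOT NULL,
    Customer nvarchar(100) NOT NULL,
    Product  nvarchar(100) NOT NULL
);
INSERT INTO dbo.TupleMap (Tuple_ID, Customer, Product)
VALUES (1, 'Customer A', 'Product Z');

-- Stamp the fact rows with the matching Tuple_ID during ETL;
-- rows outside any materialized tuple keep NULL.
UPDATE f
SET f.Tuple_ID = m.Tuple_ID
FROM dbo.FactSales AS f
JOIN dbo.TupleMap AS m
  ON m.Customer = f.Customer
 AND m.Product  = f.Product;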
I am partitioning my cube by the most recent 13 months, plus a legacy partition to hold older months.
I have successfully created dynamic partitions, but now I need to add a dynamic slice to each partition.
I thought I could use this in the Partition Slice Expression:
[Dim Date].[Month].&[" + CStr(Month(Now())) + "].lag(8)
but it's failing. Does anyone have any ideas?
I tried all day, but ultimately concluded that partition slice expressions don't like anything that is not a dimension member value.
To be clear, my goal was to create dynamic partitioning using the 14 partitions described above. Best practice advises to also use slices on the partitions, per Mosha's article, but since my partitions are dynamic, my slices needed to be dynamic too.
I finally added a member to my Date dimension that mimics the dynamic labeling of the 14 partitions I wanted to create. Next I referenced the new date dimension member values in each of the corresponding partition slices, basically moving the "dynamic" slices into the cube structure.
It works great, and gives me another useful dimension member. I have also partitioned the fact table in the data warehouse with the same 14 partitions using a partitioning scheme, filegroups, etc. As an added bonus, since everything is dynamic, my SSIS package is much less complex and does not require DDL tasks to move partitions around.
Where are you doing this?
You should partition the data warehouse behind your cube using T-SQL queries, not DMX queries.
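For the relational side, a rough T-SQL sketch of the same 14-way split (13 monthly ranges plus a legacy range; the object names and boundary dates are just placeholders):

-- One boundary per month start; everything before the first boundary
-- falls into the "legacy" partition, giving 14 ranges in total.
CREATE PARTITION FUNCTION pfFactByMonth (date)
AS RANGE RIGHT FOR VALUES (
    '2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01',
    '2013-06-01','2013-07-01','2013-08-01','2013-09-01','2013-10-01',
    '2013-11-01','2013-12-01','2014-01-01'
);

-- Map every range to the same filegroup for simplicity;
-- in practice you may spread them across several filegroups.
CREATE PARTITION SCHEME psFactByMonth
AS PARTITION pfFactByMonth
ALL TO ([PRIMARY]);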
I have an SSAS cube in which one of my dimensions has 5 million records. When I try to view data for the dimension, the report or Excel pivot becomes lengthy and the performance is poor. I can't categorize that particular dimension's data. The only way I can think of to restrict the data is to select the top 10K rows from the dimension that have metric values. Apart from restricting it to the top 10K dimension records, can anyone please suggest other possibilities?
Have you set up aggregations? I would venture to guess that the majority of the time being spent getting your data to a viewing point has to do with your measures. If I were you I would try adding aggregations or upping the aggregation percentage in order to alleviate some of the pressure at query time by passing this workload to the processing time of the dimension/cube.
Generally, people set their aggregation levels at about 30% to start.
If you have done this already, I would think about upgrading the hardware on the server that your cube sits on (depending on what you already have).
These are just suggestions as it could also be an issue in your cube design that is causing a lengthy runtime.
I would suggest you create a hierarchy for showing the 5 million records. Group by a substring at Level 1 (and, if required, some more characters at Level 2), then the data falling under that group. For example:
Level 1    Value
A          Apple
A          Ant
This way you won't be showing all 5 million records at once, and it also becomes much more effective to use aggregations.
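One way to feed such a hierarchy is to derive the grouping columns in the DSV named query (or a view); a minimal sketch, where dbo.BigDimension, DimKey and Name are placeholder names:

-- Derive grouping levels from the member name so the browser drills
-- down through small groups instead of listing 5 million rows at once.
SELECT
    DimKey,
    Name,
    LEFT(Name, 1) AS Level1,   -- e.g. 'A'
    LEFT(Name, 3) AS Level2    -- optional finer grouping, e.g. 'App'
FROM dbo.BigDimension;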