I have an SSAS cube with rigid attribute relationships. Each day I receive data from the source for the last two months only, but my cube holds data from 2010 onwards.
I am planning to partition the cube and then process it. My questions are:
I know that with rigid relationships I have to go with Process Full. Does that mean I have to run Process Full on every partition, or can I run it on selected partitions only?
How should I design my partitioning strategy? If I use two-month partitions I end up with six partitions per year, and the count will keep growing. I thought of going with six-month partitions instead, but in the first or seventh month of the year I would have to process two partitions (the current one plus the previous six months). Is that good enough?
Marking attribute relationships as Rigid when they actually do change (meaning when the rollups change, such as Product A moving from the Cereal category to the Oatmeal category) is a bad idea; just mark them as Flexible. Rigid vs. Flexible doesn't impact query performance, only processing performance. And if Rigid forces a ProcessFull on your dimensions, you will have to reprocess all your measure group partitions. So change the relationships to Flexible unless you are 100% sure your ETL never runs an UPDATE statement against the dimension table.
I would partition by month. Then you can just process the most recent two months every day. To be more explicit (a sketch in XMLA follows the list):
ProcessUpdate your dimensions.
ProcessData the two most recent monthly partitions.
ProcessIndexes on your cube (which rebuilds indexes and flexible aggregations on the older partitions).
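As a minimal sketch, the three steps could be one nightly XMLA batch. All object IDs here (SalesDB, Dim Product, Sales, Fact Sales, Fact Sales 2015 01) are hypothetical placeholders for your own database's IDs:

    <Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <!-- Step 1: pick up dimension member changes without invalidating partitions. -->
      <Process>
        <Object>
          <DatabaseID>SalesDB</DatabaseID>
          <DimensionID>Dim Product</DimensionID>
        </Object>
        <Type>ProcessUpdate</Type>
      </Process>
      <!-- Step 2: reload fact data for the most recent monthly partition
           (repeat this Process element for the second month). -->
      <Process>
        <Object>
          <DatabaseID>SalesDB</DatabaseID>
          <CubeID>Sales</CubeID>
          <MeasureGroupID>Fact Sales</MeasureGroupID>
          <PartitionID>Fact Sales 2015 01</PartitionID>
        </Object>
        <Type>ProcessData</Type>
      </Process>
      <!-- Step 3: rebuild indexes and flexible aggregations across the cube. -->
      <Process>
        <Object>
          <DatabaseID>SalesDB</DatabaseID>
          <CubeID>Sales</CubeID>
        </Object>
        <Type>ProcessIndexes</Type>
      </Process>
    </Batch>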
Related
We have a multidimensional cube with a stock-on-hand measure group, partitioned by year. The underlying table has 1.5 billion rows in it, which works out to around 275 million rows per partition.
Every night we do a Process Full on the entire SSAS database, and of course all the history partitions (SOH 2011, SOH 2012, etc.) are processed all over again, even though their data never changes.
I would like to know if there is a way of still executing a Process Full of the SSAS database, but preserving the data in those history partitions so that they do not get reprocessed.
Update: in reply to one of the comments about just processing the latest measure group partitions: of course that is an option, and what it implies is that you are going to create a set of customised jobs/steps to process the dimensions and then certain measure group partitions (see the sketch below). This is more challenging to maintain, and you also have to be as smart as the SSAS engine about parallel processing options, etc.
My ideal solution would be to somehow mark those partitions as not needing processing, or to restore the already-processed partitions from a previous process.
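For reference, the customised alternative mentioned in the update boils down to an XMLA batch that touches only the current partition and leaves the history partitions alone. A minimal sketch, assuming hypothetical object IDs (StockDB, Stock, Stock On Hand, SOH 2016); the Parallel element is where you take over the engine's parallelism decisions:

    <Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Parallel>
        <!-- Only the current-year partition is processed; SOH 2011, SOH 2012, etc.
             keep their existing data. Add more Process elements here to run
             dimensions or other partitions in the same parallel pool. -->
        <Process>
          <Object>
            <DatabaseID>StockDB</DatabaseID>
            <CubeID>Stock</CubeID>
            <MeasureGroupID>Stock On Hand</MeasureGroupID>
            <PartitionID>SOH 2016</PartitionID>
          </Object>
          <Type>ProcessFull</Type>
        </Process>
      </Parallel>
    </Batch>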
So I'm looking into data warehousing and partitioning and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis, to know what items are becoming less popular over time.
On a monthly basis, to see what items get popular at a certain time of year (ice in summer).
On a weekly basis, to see how well my individual stores are doing.
On a daily basis, to observe theft trends or something.
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know whether it would be best to partition on year, month, week, day, etc. for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale I can use to understand where my partitions would be most efficient?
Edit: I, personally, am using SQL Server 2012, but I'm curious as to how others view this question in relation to the core concept rather than the implementation (unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? This really matters: the strategies differ for Oracle vs. SQL Server vs. IBM, etc.
Sample queries and run times. Partition usage depends on the conditions in your WHERE clause: what are you filtering on?
Does it make sense to create/use aggregate tables? It seems like a monthly aggregate would save you some time (see the sketch after this list).
There are lots of options based on the hardware and storage available to you; more details are needed to make a more specific recommendation.
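On the aggregate-table point, a minimal T-SQL sketch; the table and column names (FactSales, SaleDate, ItemId, SalesMonthly) are hypothetical:

    -- Hypothetical monthly rollup of a FactSales table; adjust names to your schema.
    CREATE TABLE dbo.SalesMonthly (
        SaleMonth  date           NOT NULL,  -- first day of the month
        ItemId     int            NOT NULL,
        SaleCount  int            NOT NULL,
        SaleAmount decimal(18, 2) NOT NULL,
        CONSTRAINT PK_SalesMonthly PRIMARY KEY (SaleMonth, ItemId)
    );

    INSERT INTO dbo.SalesMonthly (SaleMonth, ItemId, SaleCount, SaleAmount)
    SELECT DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1),
           ItemId,
           COUNT(*),
           SUM(SaleAmount)
    FROM dbo.FactSales
    GROUP BY DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1), ItemId;

Monthly and yearly trend queries can then scan a few thousand summary rows instead of the raw fact table.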
Here is an MS SQL Server 2012 database with 7 million records a day and an ambition to grow to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of 320 partitions, comfortably below the maximum of 1,000 partitions per partition scheme. The maximum size of one partition in one table is now approx. 10 GB, which makes it much easier to handle than the 3 TB size of the whole table.
A new file in the partition scheme is used for each new year, and the resulting 500 GB data files are convenient units for backup and deletion.
When calculating data for one month, the four processors work in parallel, each handling one partition.
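A rough T-SQL sketch of such a setup; all object names are hypothetical, and a real boundary list would enumerate one YearWeek value per week:

    -- Partition function on the integer YearWeek key (201453 is followed by 201501).
    CREATE PARTITION FUNCTION pfYearWeek (int)
    AS RANGE RIGHT FOR VALUES (201501, 201502, 201503 /* ... one boundary per week ... */);

    -- Simplified: everything on PRIMARY here; the setup described above would
    -- instead map each year's partitions to that year's own filegroup/file.
    CREATE PARTITION SCHEME psYearWeek
    AS PARTITION pfYearWeek
    ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FactTransactions (
        YearWeek int            NOT NULL,
        TxnId    bigint         NOT NULL,
        Amount   decimal(18, 2) NOT NULL
    ) ON psYearWeek (YearWeek);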
I've got a cube with around 6 facts and 40 dimensions. Right now my cube takes an hour and a half to process, and most of that time is taken by 2 facts. But now the users are asking for data that is no more than an hour old, so I'm thinking of storing the partitions of those 2 fact tables in ROLAP mode (right now they are MOLAP). Would that help improve the cube processing time, or should I look for another approach? Also, is it going to make much difference to query performance?
You are on the right track. Partitions are the key to shrinking processing time. You have not specified how much fact data you have (time-wise), but let's say you have a month's worth. Then one partition would hold the month minus the latest day, and a second partition would contain just that latest day. You would reprocess only this small partition every hour, and process the full cube once a day when there are no users online.
Partitions do help out when it comes to processing time.
If the partitions are small, and the logic to access the ROLAP partition is fast (i.e., no complex views, etc.), the increase in query time should not be dramatic. But really, you should just test, as there are many factors influencing performance.
In a financial system, each year's transactions are stored in a separate table, so there are Transactions2007, Transactions2008, ..., Transactions2012 tables in the system. They all have the same table design. The data in previous years' tables never changes, but the current year's data is updated daily.
I want to build a cube on the union of all the yearly tables. The question is how to prevent SSAS from reprocessing previous years.
When processing the cube, you can set the process option to Process Incremental and then, in the configuration dialog, specify a query that selects data only from the recent tables.
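Under the hood, Process Incremental with a query amounts to a ProcessAdd command with an out-of-line query binding. A hedged XMLA sketch; the object IDs, data source ID, and the query itself are placeholders:

    <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <Object>
        <DatabaseID>FinanceDB</DatabaseID>
        <CubeID>Finance</CubeID>
        <MeasureGroupID>Transactions</MeasureGroupID>
        <PartitionID>Transactions Current</PartitionID>
      </Object>
      <Type>ProcessAdd</Type>
      <Bindings>
        <Binding>
          <DatabaseID>FinanceDB</DatabaseID>
          <CubeID>Finance</CubeID>
          <MeasureGroupID>Transactions</MeasureGroupID>
          <PartitionID>Transactions Current</PartitionID>
          <Source xsi:type="QueryBinding">
            <DataSourceID>FinanceDW</DataSourceID>
            <QueryDefinition>SELECT * FROM Transactions2012 WHERE LoadDate = '20121231'</QueryDefinition>
          </Source>
        </Binding>
      </Bindings>
    </Process>

Note that ProcessAdd appends rows, so the query must return only rows not already in the partition, or values will be double-counted.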
I handled it by partitioning the cube (by time dimension) and processing only the most recent partition.
I've got Dim tables, Fact tables, ETL and a cube. I'm now looking to make sure my cube holds only the previous 2 months' worth of data. Should this be done by forcing my fact table to hold only 2 months of data and doing a "full process", or is there a way to trim outdated data from my cube?
Your data is already dimensionalized through ETL and you have a cube built on top of it?
And you want to retain the data in the Fact table, but don't necessarily need it in the cube for more than the last 2 months?
If you don't even want to retain the data, I would simply purge the fact table by date, because you're probably going to want that space reclaimed anyway.
But there are also settings in the cube build. Or build your cube off dynamic views that expose only the last two months; then the cube (re)build can be done before you've even purged the underlying fact tables (see the sketch below).
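A minimal sketch of such a dynamic view, assuming hypothetical table and column names (FactSales, SaleDate); the cube's data source view or partition query would then reference the view instead of the base table:

    -- Exposes only the trailing two months of the fact table to the cube.
    CREATE VIEW dbo.vFactSalesLast2Months
    AS
    SELECT *
    FROM dbo.FactSales
    WHERE SaleDate >= DATEADD(MONTH, -2, CAST(GETDATE() AS date));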
You can also look into partitioning by date:
http://www.mssqltips.com/tip.asp?tip=1549
http://www.sqlmag.com/Articles/ArticleID/100645/100645.html?Ad=1