SSAS cube partition strategy for large historical data

I have a big cube that receives 2.5 million new rows per day, about 19 million a week. The data is historical: no updates, no deletes, no changes. So what's the best partition strategy for this kind of data? As you can see, even a single week holds a lot of data. Should I create a new partition every day to process the new data, and merge it into a static large partition at night?

I think the best solution is to use different ranges:
(Date) -> (Partition)
This week -> Daily (this avoids reprocessing the whole week and avoids tricks with ProcessAdd and only-new-data source queries)
This year -> Weekly (53 partitions is fine)
Previous years -> Yearly
At the end of each week you can merge the daily partitions. 19 million rows per partition is fine, but keeping a weekly grain for older years would add query and processing time.
So you'll have fewer than 100 partitions for the entire measure group for at least the next 40 years (7 daily + 53 weekly + 40 yearly).
And don't forget to set a slice on every partition you create.
Removing unnecessary indexes (e.g. for high-selectivity attributes used only as member properties) can also speed up processing and reduce disk space usage.
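As a rough sketch of that scheme, here is a small Python helper that maps a fact date to the grain described above and lists last week's daily partitions that are due to be merged. The Sales_* partition names and the Monday-based week are assumptions for illustration, not anything from the question:

```python
from datetime import date, timedelta

def partition_name(row_date: date, today: date) -> str:
    """Map a fact row's date to the grain described above:
    daily for the current week, weekly for the current year, yearly before that.
    Partition names are illustrative."""
    week_start = today - timedelta(days=today.weekday())  # Monday of the current week
    if row_date >= week_start:
        return f"Sales_Daily_{row_date:%Y%m%d}"            # this week -> daily
    if row_date.year == today.year:
        iso_year, iso_week, _ = row_date.isocalendar()
        return f"Sales_Weekly_{iso_year}W{iso_week:02d}"   # this year -> weekly
    return f"Sales_Yearly_{row_date.year}"                 # previous years -> yearly

def daily_partitions_to_merge(today: date) -> list:
    """Daily partitions of the week that just ended, to be merged into one weekly partition."""
    last_monday = today - timedelta(days=today.weekday() + 7)
    days = [last_monday + timedelta(days=i) for i in range(7)]
    return [partition_name(d, today=d) for d in days]

if __name__ == "__main__":
    print(partition_name(date(2015, 6, 3), today=date(2015, 6, 5)))   # a daily partition
    print(daily_partitions_to_merge(today=date(2015, 6, 8)))          # last week's 7 daily partitions
```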

Related

Databricks datamart update optimization

I have a performance issue I'd like your input on.
This is based entirely on Databricks on Azure, with storage in Azure Data Lake Storage. The tech stack is no more than two years old and everything is on the most recent release.
Say I have a datamart Delta table: 100 columns, 30,000,000 rows, growing by 225,000 rows every calendar quarter.
There is no data warehouse in this architecture, so the newest 225,000 rows are simply appended to the datamart, which is now 30,000,000+ rows and growing every quarter.
Two of the columns are a dimension key Dim1_cd and its matching Dim1_desc.
There are 36 other dimension key-value pairs in the datamart, much like the Dim1 pair.
The datamart is a list of transactions and has a Period column, e.g. "2021Q3"; Period is the first and only partition column of the datamart.
The partitioning currently divides the Delta table into 15 partition folders, each holding roughly 150 parquet files of about 100 MB.
A calendar quarter later, a new set of files is delivered to be appended to the datamart. One of them is a Dim1_lookup.txt file, which is first read into a Dim1_deltaTable; it has only two columns, Dim1_cd and Dim1_desc, and each row is distinct (third normal form). On disk, Dim1_lookup.txt is only 55 KB.
Applying the newest version of this dimension sometimes takes only 3-4 minutes, when there are no Dim1_desc values that need updating. Other times there are 20,000 to 100,000 updates to be written across hundreds to thousands of parquet files, and that can take an unpleasantly long time.
Of course, writing the Delta table update that applies the Dim1_deltaTable is no big challenge.
But what can you suggest to optimize the updates?
Ideally you would have a data warehouse backing the datamart, but that is not the case in this architecture.
You might want to partition on Dim1_desc to take advantage of Delta's data skipping, but there are 36 other _desc fields with the same update concern.
What do you consider possible for optimizing the update and minimizing update processing time?
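A minimal PySpark sketch of the kind of MERGE that keeps the rewrite small, assuming the datamart is registered as a table named datamart and the lookup file is pipe-delimited (both assumptions). The idea is to pre-filter the lookup down to the keys whose description actually changed before merging:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# This quarter's lookup file (column names from the question; the delimiter is an assumption).
dim1_new = (spark.read.option("header", True).option("sep", "|")
            .csv("/mnt/landing/Dim1_lookup.txt")
            .select("Dim1_cd", "Dim1_desc"))

datamart = DeltaTable.forName(spark, "datamart")  # assumed table name

# Keep only the keys whose description actually changed, so unchanged keys
# never reach the MERGE and their parquet files are never marked as touched.
changed = (dim1_new.alias("n")
           .join(datamart.toDF().select("Dim1_cd", "Dim1_desc").distinct().alias("o"),
                 "Dim1_cd")
           .where(F.col("n.Dim1_desc") != F.col("o.Dim1_desc"))
           .select("Dim1_cd", F.col("n.Dim1_desc").alias("Dim1_desc")))

(datamart.alias("t")
 .merge(changed.alias("s"), "t.Dim1_cd = s.Dim1_cd")
 .whenMatchedUpdate(condition="t.Dim1_desc <> s.Dim1_desc",
                    set={"Dim1_desc": "s.Dim1_desc"})
 .execute())
```

The pre-filter is the main design choice: keys that did not change never enter the merge source, so in the "nothing changed" case the merge is effectively a no-op. Z-ORDERing the datamart on Dim1_cd can further limit how many files a matched key touches, although with 36 other key/desc pairs you cannot give every dimension that treatment.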

Performance improvement in SQL Server by adding partitions

I have a SQL Server table to which I keep adding data on a daily basis. Most queries have a date filter. I would like to add a range partition based on date to improve query performance.
My questions are:
Do I need to keep adding partitions on a daily basis as new dates become available?
Is there a limit to the number of partitions one can add, and beyond a certain number of partitions does the performance improvement disappear?
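For reference on the first question, a common approach is to pre-create future daily boundaries with ALTER PARTITION FUNCTION ... SPLIT RANGE, since splitting a boundary that is still empty is cheap. A rough Python sketch that just generates that T-SQL; the partition function, scheme, and filegroup names are made up:

```python
from datetime import date, timedelta

def split_statements(start: date, days: int,
                     pf: str = "pfDailyDate",    # hypothetical partition function name
                     ps: str = "psDailyDate",    # hypothetical partition scheme name
                     filegroup: str = "PRIMARY") -> list:
    """Generate T-SQL that pre-creates daily range boundaries ahead of time,
    so the nightly job only ever splits an empty partition at the right-hand end."""
    stmts = []
    for i in range(days):
        boundary = start + timedelta(days=i)
        stmts.append(f"ALTER PARTITION SCHEME {ps} NEXT USED [{filegroup}];")
        stmts.append(f"ALTER PARTITION FUNCTION {pf}() "
                     f"SPLIT RANGE ('{boundary:%Y-%m-%d}');")
    return stmts

if __name__ == "__main__":
    print("\n".join(split_statements(date(2024, 1, 1), days=7)))
```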

Partition and process SSAS cube for huge data

I've got an SSAS cube with rigid attribute relationships. Every day I get data from the source for the last 2 months only. My cube has data from 2010 onwards.
I am planning to partition that cube and then process it. My questions are:
I know that with rigid relationships I have to go with Process Full. Does that mean I have to process every partition with Process Full, or can I run Process Full on selected partitions only?
How should I design my partition strategy? If I use 2-month partitions I end up with 6 partitions per year, and the count keeps growing. I thought of going with 6-month partitions, but if I am in the 7th month or the 1st month then I have to process two partitions (i.e. the current one plus the previous 6-month partition). Is that good enough?
Marking attribute relationships as Rigid when they actually do change (meaning when the rollups change, such as Product A moving from the Cereal category to the Oatmeal category) is a bad idea. Just mark them as Flexible. Rigid vs. Flexible doesn't impact query performance, only processing performance. And if Rigid forces a ProcessFull on your dimensions, that means you have to reprocess all your measure group partitions. So change the relationships to Flexible unless you are 100% sure you never run an UPDATE statement on your dimension table in your ETL.
I would partition by month. Then you can just process the most recent two months every day. To be more explicit:
ProcessUpdate your dimensions
ProcessData the most recent two months of partitions.
ProcessIndexes on your cube (which rebuilds indexes and flexible aggs on older partitions)
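A minimal Python sketch of what that nightly sequence could look like as an XMLA batch; the database, cube, dimension, and partition IDs are placeholders, and in practice you would script the equivalent from SSMS or run it from a SQL Agent / SSIS job:

```python
from datetime import date

# Placeholder object IDs; use the real IDs from your SSAS project.
DATABASE, CUBE, MEASURE_GROUP = "SalesDB", "SalesCube", "FactSales"
DIMENSIONS = ["DimProduct", "DimCustomer", "DimDate"]
ENGINE_NS = "http://schemas.microsoft.com/analysisservices/2003/engine"

def monthly_partition_ids(today: date, months: int = 2) -> list:
    """IDs of the monthly partitions covering the most recent `months` months."""
    ids, year, month = [], today.year, today.month
    for _ in range(months):
        ids.append(f"{MEASURE_GROUP}_{year}{month:02d}")
        year, month = (year - 1, 12) if month == 1 else (year, month - 1)
    return ids

def process_cmd(object_lines: str, process_type: str) -> str:
    """One <Process> element for the XMLA batch."""
    return (f"  <Process>\n    <Object>\n{object_lines}"
            f"    </Object>\n    <Type>{process_type}</Type>\n  </Process>\n")

def nightly_batch(today: date) -> str:
    body = ""
    # 1. ProcessUpdate the dimensions
    for dim in DIMENSIONS:
        body += process_cmd(f"      <DatabaseID>{DATABASE}</DatabaseID>\n"
                            f"      <DimensionID>{dim}</DimensionID>\n", "ProcessUpdate")
    # 2. ProcessData the most recent two monthly partitions
    for pid in monthly_partition_ids(today):
        body += process_cmd(f"      <DatabaseID>{DATABASE}</DatabaseID>\n"
                            f"      <CubeID>{CUBE}</CubeID>\n"
                            f"      <MeasureGroupID>{MEASURE_GROUP}</MeasureGroupID>\n"
                            f"      <PartitionID>{pid}</PartitionID>\n", "ProcessData")
    # 3. ProcessIndexes on the cube to rebuild indexes and flexible aggregations
    body += process_cmd(f"      <DatabaseID>{DATABASE}</DatabaseID>\n"
                        f"      <CubeID>{CUBE}</CubeID>\n", "ProcessIndexes")
    return f'<Batch xmlns="{ENGINE_NS}">\n{body}</Batch>'

if __name__ == "__main__":
    print(nightly_batch(date.today()))
```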

BigQuery table partitioning performance

I've got a question about BigQuery performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am on the last day of a partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In that case I optimize my expenses, as I never query more data than I have to. The question is: will I suffer a performance penalty in getting results back to the client, because I am now querying potentially 30, 90, or 365 tables in parallel (via UNION)?
To summarize:
More tables = less data scanned
Fewer tables = (?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
A lot depends on how you write your queries and how much development costs, but that amount of data doesn't seem like a barrier, so you may be trying to optimize too early.
When you JOIN tables larger than 8 MB, you need to use the EACH modifier, and such a query is internally parallelized.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in shards; these are discrete chunks of data that can be processed in parallel. If you have a 100 GB table, it might be stored in 5000 shards, which allows it to be processed by up to 5000 workers in parallel. You shouldn't make any assumptions about the size or number of shards in a table. BigQuery will repartition data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day. One recommendation is to write your create/patch script so that, when it runs, it creates tables far into the future, e.g. I create the next 12 months of daily tables now. This is better than having a script that creates one table each day. And make it part of your deploy/provisioning script.
To read more, check out Chapter 11, "Managing Data Stored in BigQuery", from the book.
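To make the "create tables far into the future" recommendation concrete, here is a minimal sketch using the google-cloud-bigquery client; the project, dataset, table prefix, and schema are placeholder assumptions:

```python
from datetime import date, timedelta
from google.cloud import bigquery

def precreate_daily_tables(project: str, dataset: str, days: int = 365) -> None:
    """Pre-create date-sharded tables (events_YYYYMMDD) well into the future,
    so the daily load never has to create a table on the fly."""
    client = bigquery.Client(project=project)
    schema = [  # illustrative schema only
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
    ]
    start = date.today()
    for i in range(days):
        shard = (start + timedelta(days=i)).strftime("%Y%m%d")
        table = bigquery.Table(f"{project}.{dataset}.events_{shard}", schema=schema)
        client.create_table(table, exists_ok=True)  # no-op if the table already exists

# precreate_daily_tables("my-project", "my_dataset")  # placeholder names
```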

Improve cube processing time

I've got a cube with around 6 facts and 40 dimensions. Right now my cube takes an hour and a half to process, and most of that time is taken by 2 facts. But now the users are asking for data that is no more than an hour old. I'm thinking of storing the partitions of those 2 fact tables in ROLAP mode (right now they are MOLAP). Would that help improve the cube processing time, or should I look for another approach? Also, would it make much difference to query performance?
You are on the right track. Partitions are the key to shrinking processing time. You have not specified how much fact data you have (time-wise), but let's say you have a month's worth. One partition would hold the month minus the latest day, and a second partition would contain just that latest day. You would only reprocess this small partition every hour, and do a full process of the cube once a day when no users are online.
Partitions do help out when it comes to processing time.
If the partitions are small and the logic to access the ROLAP partition is fast (i.e. no complex views, etc.), the increase in query time should not be dramatic. But really, you should just test, as many factors influence performance.
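As a small illustration of that split, here is a Python sketch that derives the WHERE clauses for the two partitions' source queries from today's date; the fact table and column names are made up:

```python
from datetime import date

def partition_predicates(today: date) -> dict:
    """WHERE clauses for the two-partition split described above: a large, stable
    partition (the month minus the latest day, processed once a day) and a small
    'hot' partition for the latest day (reprocessed every hour).
    Fact table and column names are illustrative."""
    month_start = today.replace(day=1)
    return {
        "FactSales_History": (f"TransactionDate >= '{month_start:%Y-%m-%d}' "
                              f"AND TransactionDate < '{today:%Y-%m-%d}'"),
        "FactSales_Today": f"TransactionDate >= '{today:%Y-%m-%d}'",
    }

if __name__ == "__main__":
    for name, where in partition_predicates(date(2014, 3, 15)).items():
        print(f"{name}: {where}")
```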