I've got Dim Tables, Fact Tables, ETL and a cube. I'm now looking to make sure my cube only holds the previous 2 months' worth of data. Should this be done by forcing my fact table to hold only 2 months of data and doing a "full process", or is there a way to trim outdated data from my cube?
Your data is already dimensionalized through ETL and you have a cube built on top of it?
And you want to retain the data in the Fact table, but don't necessarily need it in the cube for more than the last 2 months?
If you don't even want to retain the data, I would simply purge the fact table by date, since you're probably going to want that space reclaimed anyway.
But there are also settings on the cube build side - or build your cube off dynamic views that only expose the last two months - in which case the cube (re-)build can be done before you've even purged the underlying fact tables.
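For the dynamic-view option, a minimal sketch of what that could look like (the table, column, and view names here are made up for illustration, and the two-month cut-off can be expressed however suits your calendar):

```sql
-- Hypothetical rolling view: exposes only rows from the first day of the month
-- two months back onward. Binding the cube's measure group to this view means a
-- full (re)process only ever loads roughly two months of data.
CREATE VIEW dbo.vw_FactSales_Last2Months
AS
SELECT f.*
FROM dbo.FactSales AS f
WHERE f.TransactionDate >= DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 2, 0);
```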
You can also look into partitioning by date:
http://www.mssqltips.com/tip.asp?tip=1549
http://www.sqlmag.com/Articles/ArticleID/100645/100645.html?Ad=1
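Whether you partition the relational fact table, the measure group in the cube, or both, the SQL side of date-based partitioning looks roughly like this (a sketch only - names, boundary values, and filegroups are illustrative):

```sql
-- Hypothetical monthly partition function/scheme; old months can later be
-- switched out or dropped cheaply instead of deleted row by row.
CREATE PARTITION FUNCTION pfMonthly (date)
AS RANGE RIGHT FOR VALUES ('2019-01-01', '2019-02-01', '2019-03-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales
(
    SalesKey        bigint NOT NULL,
    TransactionDate date   NOT NULL,
    Amount          money  NOT NULL
) ON psMonthly (TransactionDate);
```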
I've inherited an SSAS 2014 multidimensional cube at work. I've been doing SQL Server database work (queries, tables, stored procs, etc) for many years now. But I'm a complete SSAS newbie. And, even in my ignorance, I can tell that this cube I've inherited is a mess!
I've been able to keep the thing updated with new data each month, but now our company has rolled out a new product and I'm having to add five new fields to the fact table / view for keys related to that product, along with the related dimension views. I've taken a couple of shots at it, but wind up hitting numerous errors when I process the fact table partitions.
BTW, heading off the natural question, there's no way I can roll the "five new fields" data into fields that already exist unless I completely rebuild the cube from scratch, which is out of the question right now.
So, I'll try to boil down what I THINK is the problem here. Hoping someone can answer my question.
The fact data is located in four different data warehouse databases (names changed to protect company data) -
DB_Current
DB_2018
DB_2017
DB_2016
There is a fact view within each of those databases to stage the fact data. That view is called "vw_fact" and is identical across all databases. When that view gets pulled into the cube, it gets partitioned into four different partitions (per month-year) due to data size.
The new product was just rolled out this year, so I added the five new fields to "vw_fact" only in "DB_Current". I didn't change the prior years' views in their respective databases. My shot-in-the-dark guess was that the prior years' views would automatically join on the matching field names to the current year's view without needing the new fields.
When I then tried processing the four years' worth of partitions, I ran into numerous "field doesn't exist" errors.
So, my questions are these:
Do I have to add the five new fields to ALL FOUR views? That is, the individual views in each of the four years' databases?
If I have to do #1 above, do I then need to run a "Process Full" on all partitions for all four years? Or do I need to run one of the other process options?
Thank you so much in advance for any advice you can offer here!
Joel
You need matching result sets for all of the partitions' source queries. That doesn't mean you necessarily have to add the new fields to all of the views, though: you can edit the source query for each partition in Visual Studio. If for some reason you don't want to edit the 4 views (which is what I would probably do), you could hard-code the surrogate key of the unknown member, or something similar, in the queries of the partitions where the new fields are not relevant (assuming these are dimension foreign keys; alternatively hard-code 0 or similar if they're measures). If you have new dimensions, I would go for a Process Full.
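As a rough illustration of the hard-coding option (the five new column names are invented, and "-1" stands in for whatever surrogate key your unknown member actually uses; a constant like 0 would be the equivalent for a measure):

```sql
-- Hypothetical source query for a prior-year partition (e.g. the one bound to DB_2017),
-- where vw_fact does not contain the five new columns. The missing dimension keys are
-- hard-coded to the unknown member's surrogate key so the result set matches the
-- current-year partition's columns.
SELECT
    f.*,
    -1 AS NewProductKey,         -- placeholder names for the five new key columns
    -1 AS NewProductTypeKey,
    -1 AS NewProductRegionKey,
    -1 AS NewProductChannelKey,
    -1 AS NewProductStatusKey
FROM DB_2017.dbo.vw_fact AS f;
```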
We have a multidimensional cube with a stock-on-hand measure group, partitioned by year. The underlying table has 1.5 billion rows in it, which works out to around 275 million rows per partition.
Every night we do a process full on the entire SSAS database, and of course all these history partitions (like SOH 2011, SOH 2012 etc) process all over again, even though the data never changes.
I would like to know if there is a way to still execute a Process Full of the SSAS database, but preserve the data in those history partitions so that they do not get reprocessed.
Update: in reply to one of the comments about just processing the latest measure group partitions - of course that is an option, but it implies creating a set of customised jobs/steps to process dimensions and then certain measure group partitions. That is more challenging to maintain, and you also have to be as smart as the SSAS engine about parallel processing options, etc.
My ideal solution would be to somehow mark those partitions as not needing processing, or restore the processed partitions from a previous process.
I have an SSAS cube with rigid attribute relationships. Daily, I get data from the source for the last 2 months only. My cube has data from 2010 onwards.
I am planning to partition that cube and then process it. My questions are:
I know that with rigid relationships I have to go with Process Full. Does that mean I have to run Process Full on all partitions, or can I run Process Full on selected partitions only?
How should I design my partition strategy? If I create 2-month partitions, I will end up with 6 partitions per year, and later they may increase. I thought of going with 6-month partitions, but then in the 7th month or the 1st month I would have to process two partitions (i.e. the current one plus the previous 6 months). Is that good enough?
Marking attribute relationships as Rigid when they actually do change (meaning when the rollups change, such as Product A rolling up to the Cereal vs. the Oatmeal category) is a bad idea. Just mark them as Flexible relationships. Rigid vs. Flexible doesn't impact query performance, just processing performance. And if Rigid causes you to do a ProcessFull on dimensions, that is going to mean you have to reprocess all your measure group partitions. So change the relationships to Flexible unless you are 100% sure you never run an UPDATE statement on your dimension table in your ETL.
I would partition by month. Then you can just process the most recent two months every day. To be more explicit:
ProcessUpdate your dimensions
ProcessData the most recent two months of partitions.
ProcessIndexes on your cube (which rebuilds indexes and flexible aggs on older partitions)
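For context, a monthly SSAS partition is typically just the same fact query with different date boundaries, which is what makes reprocessing only the latest two months cheap. A sketch (the table and key names are placeholders):

```sql
-- Hypothetical source query for the May 2019 partition of the measure group;
-- every month gets the same query with its own boundaries, so the daily job
-- only needs to ProcessData the two most recent partitions.
SELECT f.*
FROM dbo.FactSales AS f
WHERE f.DateKey >= 20190501
  AND f.DateKey <  20190601;
```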
In a financial system, the transactions for every year are stored in a separate table. So there are Transactions2007, Transactions2008, ..., Transactions2012 tables in the system. They all have the same table design. The data in the tables for previous years never changes, but the current year's data is updated daily.
I want to build a cube on the union of tables of all years. The question is how to prevent SSAS from reprocessing previous years.
When processing the cube, you can set the process option to Process Incremental and then, in the configuration dialog, specify a query that selects data only from the recent tables.
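A sketch of what that query could look like (Transactions2012 is from the question; the date column and the one-day window are assumptions - with Process Incremental the query has to return only rows that are not already in the partition, so in practice you would track the last load point):

```sql
-- Hypothetical incremental query: feed the cube only the rows added since the last load.
SELECT t.*
FROM dbo.Transactions2012 AS t
WHERE t.TransactionDate >= DATEADD(DAY, -1, CONVERT(date, GETDATE()));
```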
I handled it by partitioning the cube (by time dimension) and processing only the most recent partition.
I have a database table with 700 million plus rows (growing exponentially) of time-based data.
Fields:
PK.ID,
PK.TimeStamp,
Value
I also have 3 other tables grouping this data into days, months, and years, which contain the sum of the value for each ID in that time period. These tables are updated nightly by a SQL job. The situation has arisen whereby the tables will need to be updated on the fly when the data in the base table is updated; this can be up to 2.5 million rows at a time (not very often - typically around 200-500k rows, up to every 5 minutes). Is this possible without causing massive performance hits, or what would be the best method for achieving this?
N.B.
The daily, monthly, and yearly tables can be changed if needed; they are used to speed up queries such as "Get the monthly totals for these 5 IDs for the last 5 years". In the raw data this is about 13 million rows; from the monthly table it's 300 rows.
I do have SSIS available to me.
I can't afford to lock any tables during the process.
700M records in 5 months means 8.4B in 5 years (assuming data inflow doesn't grow).
Welcome to the world of big data. It's exciting here and we welcome more and more new residents every day :)
I'll describe three incremental steps that you can take. The first two are just temporary - at some point you'll have too much data and will have to move on. However, each one takes more work and/or more money so it makes sense to take it a step at a time.
Step 1: Better Hardware - Scale up
Faster disks, RAID, and much more RAM will take you some of the way. Scaling up, as this is called, breaks down eventually, but if your data is growing linearly and not exponentially, then it'll keep you afloat for a while.
You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading transaction logs and sending them to your replica. Then you can run the scripts that create your aggregate (daily, monthly, annual) tables on a secondary server that won't kill the performance of your primary one.
Step 2: OLAP
Since you have SSIS at your disposal, start looking into multidimensional data. With good design, OLAP cubes will take you a long way. They may even be enough to manage billions of records, and you'll be able to stop there for several years (been there, done that, and it carried us for two years or so).
Step 3: Scale Out
Handle more data by distributing the data and its processing over multiple machines. When done right, this allows you to scale almost linearly: when you have more data, add more machines to keep processing time constant.
If you have the $$$, use solutions from Vertica or Greenplum (there may be other options, these are the ones that I'm familiar with).
If you prefer open source / byo, use Hadoop, log event data to files, use MapReduce to process them, store results to HBase or Hypertable. There are many different configurations and solutions here - the whole field is still in its infancy.
Indexed views.
Indexed views will allow you to store and index aggregated data. One of the most useful aspects of them is that you don't even need to directly reference the view in any of your queries. If someone queries an aggregate that's in the view, the query engine will pull data from the view instead of checking the underlying table.
You will pay some overhead to update the view as data changes, but from your scenario it sounds like this would be acceptable.
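A minimal sketch of an indexed view over the table described in the question (the column names follow the question, the base table name is a placeholder, and note that indexed views require SCHEMABINDING, COUNT_BIG(*) with GROUP BY, and a unique clustered index; automatic matching of queries to the view is an Enterprise Edition feature):

```sql
-- Hypothetical monthly rollup as an indexed view over the base table
-- (assumes [Value] is NOT NULL, since indexed views can't SUM a nullable expression).
CREATE VIEW dbo.vw_MonthlyTotals
WITH SCHEMABINDING
AS
SELECT
    ID,
    YEAR([TimeStamp])  AS Yr,
    MONTH([TimeStamp]) AS Mth,
    SUM([Value])       AS TotalValue,
    COUNT_BIG(*)       AS RowCnt      -- required when the view uses GROUP BY
FROM dbo.BaseValues                   -- placeholder name for the 700M-row table
GROUP BY ID, YEAR([TimeStamp]), MONTH([TimeStamp]);
GO

-- The unique clustered index is what materializes ("indexes") the view.
CREATE UNIQUE CLUSTERED INDEX IX_vw_MonthlyTotals
    ON dbo.vw_MonthlyTotals (ID, Yr, Mth);
```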
Why don't you create monthly tables, just to store the info you need for those months? It'd be like simulating multidimensional tables. Or, if you have access to multidimensional systems (Oracle, DB2, or similar), just work with multidimensionality. That works fine for time-period problems like yours. At the moment I don't have enough info to give you more, but you can learn a lot about it just by googling.
Just as an idea.
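If you do go the summary-table route, one possible sketch of keeping a monthly table current incrementally (all object names here are invented; the idea is to fold only the freshly loaded batch - e.g. the 200-500k rows - into the monthly totals instead of rebuilding them):

```sql
-- Hypothetical incremental rollup: #NewRows holds just the batch that was loaded.
MERGE dbo.MonthlyTotals AS target
USING (
    SELECT ID,
           YEAR([TimeStamp])  AS Yr,
           MONTH([TimeStamp]) AS Mth,
           SUM([Value])       AS DeltaValue
    FROM #NewRows
    GROUP BY ID, YEAR([TimeStamp]), MONTH([TimeStamp])
) AS src
    ON  target.ID  = src.ID
    AND target.Yr  = src.Yr
    AND target.Mth = src.Mth
WHEN MATCHED THEN
    UPDATE SET target.TotalValue = target.TotalValue + src.DeltaValue
WHEN NOT MATCHED THEN
    INSERT (ID, Yr, Mth, TotalValue)
    VALUES (src.ID, src.Yr, src.Mth, src.DeltaValue);
```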