I have a scenario where an SSAS cube's data needs to be refreshed. We want to avoid a full refresh that takes an hour and instead do a 'delta' refresh. The delta refresh should:
1) Update fact records that have changed
2) Insert fact records that are new
3) Delete fact records that no longer exist
Consider a fact table with three dimensions: Company, Security, FiscalYear
and two measures: Qty, Amount
Scenario: in the fact table, a record with Company A, Security A, FiscalYear A has the measure Qty changed from 2 to 20. Previously the cube correctly showed the Qty as 2.
After the update, a full refresh correctly shows 20, but to get that we had to suffer a full hour of cube processing.
We tried the popular suggestion: add a timestamp column to the fact table, split the cube into Current and Old partitions, fully process the Current partition and merge it into the Old partition. When we browse the cube, it shows 22 (seemingly the old 2 plus the new 20), which is incorrect.
We also tried an incremental refresh of the cube: same issue, it shows 22, which is also incorrect.
So what I am trying to ascertain is whether there is any way to process a cube so that it takes only the changes (and by that I mean updates, inserts AND deletes, not just inserts!) and applies them to the data inside an SSAS cube.
Any help would be greatly appreciated!
Thanks!
No, there is no way to do this. The only control you have over processing is the granularity of what you process. For instance, if you know that data over a certain age will never change, you can put data over that age in a partition, and not include it in your processing.
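For example, the partition split is only safe when the two partitions' source queries can never cover the same row. A minimal sketch, assuming a hypothetical fact table named FactFinancials with the columns from the question, and a cutoff on FiscalYear rather than on a last-modified timestamp:
-- "Old" partition source query: closed fiscal years that will never change
SELECT Company, Security, FiscalYear, Qty, Amount
FROM FactFinancials
WHERE FiscalYear < 2012

-- "Current" partition source query: fiscal years that still receive updates,
-- inserts and deletes; only this partition is fully reprocessed each run
SELECT Company, Security, FiscalYear, Qty, Amount
FROM FactFinancials
WHERE FiscalYear >= 2012
The double count in the question (2 + 20 = 22) is the classic symptom of splitting on a modification timestamp: the old copy of the changed row stays aggregated in the Old partition while the new copy is loaded into the Current one. Splitting on a stable attribute keeps the ranges disjoint, so reprocessing the Current partition replaces changed rows instead of adding to them.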
What I am trying to accomplish is a SQL table that contains several different totals based off of 5 other tables. This would be so that when my application needs those totals, it would not need to perform the sum since it is a rather large query.
I would like to know if there is a recommended method for having a totals table that constantly updates based on changes made in other tables. I have thought of replacing it with an indexed view (a sketch of that option is below), or of putting triggers on each of the tables being summed, but it seems inefficient to rerun the sum query every time a field is updated. One other idea was an update trigger that, every time the data changes, simply adds or subtracts the difference from the stored total. My end goal is to have totals that are always up to date.
The table is showing totals per product. (e.g. total qty from table1 + total qty from table2)
If this is too general, I can give more specifics about table structure.
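For reference, a minimal sketch of the indexed-view option I mentioned, assuming SQL Server and a hypothetical source table dbo.table1(ProductID, Qty). Indexed views cannot contain UNION, so summing across the five tables would need one such view per table, added together at query time:
CREATE VIEW dbo.vTable1QtyTotals
WITH SCHEMABINDING            -- required for an indexed view
AS
SELECT ProductID,
       SUM(Qty)     AS TotalQty,
       COUNT_BIG(*) AS RowCnt -- COUNT_BIG(*) is required when using GROUP BY
FROM dbo.table1
GROUP BY ProductID;
GO
-- The unique clustered index is what makes SQL Server maintain the totals
-- automatically as table1 changes.
CREATE UNIQUE CLUSTERED INDEX IX_vTable1QtyTotals
    ON dbo.vTable1QtyTotals (ProductID);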
Add a trigger to the tables in question and have it check whether the relevant value actually changed, rather than re-running the sum every time a field that is irrelevant to the computed total is modified.
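Something along these lines (an untested sketch, SQL Server assumed; table1, ProductID, Qty and ProductTotals are placeholder names): the trigger bails out if Qty was never referenced, and otherwise applies only the net difference rather than re-running the whole sum.
CREATE TRIGGER trg_table1_QtyTotal ON table1
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- If the UPDATE statement never referenced Qty, the total cannot change.
    IF NOT UPDATE(Qty)
        RETURN;

    -- Apply the net per-product difference (new values minus old values);
    -- rows where Qty was set to the same value simply net out to zero.
    UPDATE pt
    SET pt.TotalQty = pt.TotalQty + d.Delta
    FROM ProductTotals pt
    JOIN (SELECT ProductID, SUM(Qty) AS Delta
          FROM (SELECT ProductID, Qty  FROM inserted
                UNION ALL
                SELECT ProductID, -Qty FROM deleted) x
          GROUP BY ProductID) d
      ON d.ProductID = pt.ProductID;
END;
Inserts and deletes would need the same delta logic in companion triggers (or this trigger extended to AFTER INSERT, UPDATE, DELETE).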
I ended up putting these in a queue when the underlying data was changed, and using a scheduled task to update the totals at a regular interval. We decided the tradeoff in data freshness was worth not having to recalculate the total with every transaction.
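A rough outline of that queue-and-schedule approach, with all names hypothetical: the triggers on the underlying tables only enqueue the affected ProductID, and a SQL Agent job recalculates just those products on a schedule.
-- Queue table; triggers on the source tables just INSERT the affected ProductID
CREATE TABLE TotalsRecalcQueue
(
    ProductID int      NOT NULL,
    QueuedAt  datetime NOT NULL DEFAULT GETDATE()
);

-- Body of the scheduled job: recompute queued products only, then clear the queue
BEGIN TRANSACTION;

UPDATE pt
SET pt.TotalQty = src.TotalQty
FROM ProductTotals pt
JOIN (SELECT ProductID, SUM(Qty) AS TotalQty
      FROM table1              -- in practice, a UNION ALL over the five source tables
      WHERE ProductID IN (SELECT ProductID FROM TotalsRecalcQueue)
      GROUP BY ProductID) src
  ON src.ProductID = pt.ProductID;

DELETE FROM TotalsRecalcQueue;

COMMIT TRANSACTION;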
I have a dimension of metrics found in a table called DimMetrics.
The columns are as follows:
MetricSK - surrogate key (unique per row)
MetricAK - alternate (business) key
Source
Status = Current, Expired
LastUpdate = date of the startDate of my range.
I am pulling data from one data source into this dimension, retrieving and storing the MetricAK and Source on a monthly basis.
The records can change from month to month: a source could be deleted or added.
What is the best method to achieve this? I tried using the Slowly Changing Dimension component provided, but I only managed to get it to work by adding records, creating new MetricSKs.
What I would like is that when I do a monthly import, SSIS checks the current records, sets records that are not part of the new batch to Expired, and then adds any new records as Current with the first day of the date range I choose.
I hope this makes sense, as I am stuck finding a viable solution.
Thanks,
Pete
OK, this is kind of a hack, since the Microsoft SCD component does not have a built-in flow for deletes. Before you start the data flow task that does the SCD on your table, use an Execute SQL task that sets the status of all rows to Expired. Then, in the data flow task, add a path for the SCD component's unchanged rows and follow it with an OLE DB Command that updates the status back to Current. This approach would be a performance bottleneck if you are talking about several million rows, but for a medium-sized table it should be fine.
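In terms of plain SQL, the two statements this approach implies look roughly like this, using the column names from the question (the ? is the OLE DB Command parameter mapped to MetricSK):
-- Execute SQL task, run before the data flow: pessimistically expire everything
UPDATE DimMetrics SET Status = 'Expired';

-- OLE DB Command attached to the SCD outputs for rows that did arrive in this
-- month's batch: flip them back to Current, one row at a time
UPDATE DimMetrics SET Status = 'Current' WHERE MetricSK = ?;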
In SQL Server 2008+, we'd like to enable tracking of historical changes to a "Customers" table in an operational database.
It's a new table and our app controls all writing to the database, so we don't need evil hacks like triggers. Instead we will build the change tracking into our business object layer, but we need to figure out the right database schema to use.
The number of rows will be under 100,000 and number of changes per record will average 1.5 per year.
There are at least two ways we've been looking at modelling this:
As a Type 2 Slowly Changing Dimension table called CustomersHistory, with columns for EffectiveStartDate, EffectiveEndDate (set to NULL for the current version of the customer), and auditing columns like ChangeReason and ChangedByUsername. Then we'd build a Customers view over that table which is filtered to EffectiveEndDate IS NULL. Most parts of our app would query using that view, and only parts that need to be history-aware would query the underlying table. For performance, we could materialize the view and/or add a filtered index on EffectiveEndDate IS NULL (see the sketch after this list).
With a separate audit table. Every change to a Customer record writes once to the Customer table and again to a CustomerHistory audit table.
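For concreteness, here is a minimal sketch of what option #1 would look like in T-SQL (CustomerID and Name are placeholder columns; the effective-date and audit columns are the ones named above):
-- One row per version of a customer
CREATE TABLE CustomersHistory
(
    CustomerHistoryID  int IDENTITY(1,1) PRIMARY KEY,
    CustomerID         int           NOT NULL,  -- stable business key
    Name               nvarchar(200) NOT NULL,
    EffectiveStartDate datetime2     NOT NULL,
    EffectiveEndDate   datetime2     NULL,      -- NULL = current version
    ChangeReason       nvarchar(200) NULL,
    ChangedByUsername  nvarchar(100) NULL
);
GO
-- Filtered index (SQL Server 2008+) so current-version lookups stay cheap
CREATE INDEX IX_CustomersHistory_Current
    ON CustomersHistory (CustomerID)
    WHERE EffectiveEndDate IS NULL;
GO
-- The view most of the app queries
CREATE VIEW Customers AS
SELECT CustomerHistoryID, CustomerID, Name,
       EffectiveStartDate, ChangeReason, ChangedByUsername
FROM CustomersHistory
WHERE EffectiveEndDate IS NULL;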
From a quick review of StackOverflow questions, #2 seems to be much more popular. But is this because most DB apps have to deal with legacy and rogue writers?
Given that we're starting from a blank slate, what are pros and cons of either approach? Which would you recommend?
In general, the issue with SCD Type II is that if the average number of changes to the attribute values is very high, you end up with a very fat dimension table. This growing dimension table, joined with a huge fact table, gradually slows down query performance. It's like slow poisoning: initially you don't see the impact, and when you realize it, it's too late!
Now, I understand that you will create a separate materialized view filtered to EffectiveEndDate = NULL and that it will be used in most of your joins. Additionally, in your case the data volume is comparatively low (100,000 rows). With an average of only 1.5 changes per year, I don't think data volume or query performance are going to be a problem for you in the near future.
In other words, your table is truly a slowly changing dimension (as opposed to a rapidly changing dimension, where your option #2 is a better fit). In your case, I would prefer option #1.
I have a table (A) that lists all bundles created off a machine in a day; it has an ID column, a date column, and a weight column. I also have a table (B) that holds the details for that machine for the day. In table (B), I want a column that holds the sum of the weights from table (A) for the matching date. So if the machine runs 30 bundles in a day, I'll have 30 rows in table (A), all dated the same day, and in table (B) I'll have one row detailing other information about the machine for that day plus the column that holds the total bundle weight created for the day.
Is there a way to make the total column in table (B) automatically adjust itself whenever a row is added to table (A)? Is this possible to do in the table schema itself rather than in an SQL statement each time a bundle is added? If it's not, what sort of SQL statement do I need?
Wes
It would be a mistake to do so unless you have performance problems that require it.
A better approach is to define a view in the database that will aggregate the daily bundles by machine:
CREATE VIEW MachineDailyTotals
(MachineID, RunDate, BundleCount, TotalWeight)
AS SELECT MachineID, RunDate, COUNT(*), SUM(WeightCol)
FROM BundleListTable
GROUP BY MachineID, RunDate
This will allow you to always see the correct, updated total weight per machine per day without imposing any load on the database until you actually look at the data. You can perform a simple OUTER JOIN with the machine table to get information about the machine, including the daily total info, without having to actually store the totals anywhere.
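For example (the machine-table and column names here are assumed):
SELECT m.MachineID, m.RunDate, m.OperatorName,   -- whatever table (B) holds per day
       ISNULL(t.BundleCount, 0) AS BundleCount,
       ISNULL(t.TotalWeight, 0) AS TotalWeight
FROM MachineDayTable m
LEFT OUTER JOIN MachineDailyTotals t
    ON t.MachineID = m.MachineID
   AND t.RunDate   = m.RunDate;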
If you need the sum (or other aggregate) in real time, add a trigger on table A for INSERT, UPDATE, DELETE which calculates the sum to be stored in B.
Otherwise, add a daily job which calculates the sums.
Please specify which database you are using.
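A sketch of the real-time option, assuming SQL Server (the syntax will differ elsewhere) and hypothetical names BundleLog for table (A) and MachineDay for table (B):
CREATE TRIGGER trg_BundleLog_DailyTotal ON BundleLog
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Recompute the stored total for every date touched by this statement
    UPDATE b
    SET b.TotalWeight = ISNULL((SELECT SUM(a.Weight)
                                FROM BundleLog a
                                WHERE a.RunDate = b.RunDate), 0)
    FROM MachineDay b
    WHERE b.RunDate IN (SELECT RunDate FROM inserted
                        UNION
                        SELECT RunDate FROM deleted);
END;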
Are you sure that you don't want to pull this information dynamically rather than storing it in a separate table? This seems like an indirect violation of Normalization rules in that you'll be storing the same information in two different places. With a dynamic query, you'll always be sure that the derived information will be correct without having to worry about the coding and maintenance of triggers.
Of course, if you are dealing with large amounts of data and query times are becoming an issue, you may want the shortcut of a summary table. But, in general, I'd advise against it.
This can be accomplished via triggers, which are little bits of code that execute whenever a certain action (insert/update/delete) happens on a table. The syntax varies by vendor (MySQL vs. Oracle), but the language is typically the same one you would write a stored procedure in.
If you mention the DB type, I can help with the actual syntax.
I've got Dim Tables, Fact Tables, ETL and a cube. I'm now looking to make sure my cube only holds the previous 2 months worth of data. Should this be done by forcing my fact table to hold only 2 months of data and doing a "full process", or is there a way to trim outdated data from my cube?
Your data is already dimensionalized through ETL and you have a cube built on top of it?
And you want to retain the data in the Fact table, but not necessarily need it in the cube for more than the last 2 months?
If you don't even want to retain the data, I would simply purge the fact table by date, because you're probably going to want that space reclaimed anyway.
But there are also settings in the cube build. Alternatively, build your cube off dynamic views that only expose the last two months; then the cube (re-)build can happen before you've even purged the underlying fact tables.
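The dynamic-view route can be as simple as this (fact table, view and column names are assumed); point the measure group's partition at the view instead of the base table, and a full process only ever loads two months of data:
CREATE VIEW vFactLast2Months AS
SELECT *                                  -- or an explicit column list
FROM FactTable
WHERE FactDate >= DATEADD(MONTH, -2, CAST(GETDATE() AS date));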
You can also look into partitioning by date:
http://www.mssqltips.com/tip.asp?tip=1549
http://www.sqlmag.com/Articles/ArticleID/100645/100645.html?Ad=1