SSAS: is a periodic snapshot the right choice?

I'm a newbie to SSAS. I have a database with an agreement table in which the status of each agreement changes over time; the history is stored in an agreement log, and the status can go through any combination of values over an extended period. One set of questions I will need to answer is how many agreements are in a given status, and I also need to show trends in status over time. I'm reading Kimball, and a periodic snapshot seems to be the best fit, but I'm at a loss as to how to design the fact table. Do I pre-aggregate the data into periods broken down by status? And then how do I handle it in SSAS, and how do aggregations work, given that it behaves more like a bank balance? I sort of get some of the concepts but I'm still pretty confused.

Agreed, this is a good case for Periodic Snapshot.
In this case, you need a status dimension,
and a fact with a period indicator. Your reports will also need to filter on the period.
ETL is a bit more complicated: during the current period you clear down and reload the current period's data, while periods before the current one stay fixed. Obviously you lose visibility of statuses that change multiple times within a period, so the period should be chosen based on how quickly the data changes as well as how often it is reported. This is also why periodic snapshots are often used in conjunction with transaction fact tables.
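A minimal sketch of what that could look like in T-SQL, assuming a monthly grain and an AgreementLog source table with AgreementId, Status and ChangedAt columns; all names here are illustrative, not a prescribed design:

-- Status dimension: one row per distinct agreement status.
CREATE TABLE DimStatus (
    StatusKey  int IDENTITY(1,1) PRIMARY KEY,
    StatusCode varchar(20) NOT NULL
);

-- Periodic snapshot fact: one row per agreement per period,
-- recording the status the agreement held at the end of that period.
CREATE TABLE FactAgreementSnapshot (
    PeriodKey      int NOT NULL,      -- e.g. 202401 for January 2024; joins to a date/period dimension
    AgreementId    int NOT NULL,      -- business key (or a surrogate from an agreement dimension)
    StatusKey      int NOT NULL,      -- joins to DimStatus
    AgreementCount tinyint NOT NULL DEFAULT 1
);

-- ETL for the open period: clear down and reload, leaving closed periods untouched.
DECLARE @CurrentPeriodKey int = 202401;

DELETE FROM FactAgreementSnapshot
WHERE PeriodKey = @CurrentPeriodKey;

INSERT INTO FactAgreementSnapshot (PeriodKey, AgreementId, StatusKey, AgreementCount)
SELECT @CurrentPeriodKey, latest.AgreementId, ds.StatusKey, 1
FROM (
    -- latest entry per agreement in the agreement log
    SELECT AgreementId, Status,
           ROW_NUMBER() OVER (PARTITION BY AgreementId ORDER BY ChangedAt DESC) AS rn
    FROM AgreementLog
) AS latest
JOIN DimStatus ds ON ds.StatusCode = latest.Status
WHERE latest.rn = 1;

Counting agreements in a given status is then just a SUM of AgreementCount filtered to one period; in SSAS the measure behaves like a bank balance, i.e. it is semi-additive over time, so an aggregation such as LastNonEmpty across the date dimension is the usual fit.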


Can we track Microsoft delta changes on Users, Groups, ChatMessages or Planner objects based on time? "/me/planner/all/delta"

Basically, does anyone know how to ask for delta changes that happened after a certain time? I am saving all the changes that users make to Planner objects to the database, but I know that eventually the delta changes for hundreds of plans will grow insanely huge. GET /me/planner/all/delta GET /users/{id}/planner/all/delta. Does anyone know how to filter the delta response by a given time? My plan is to query the delta after a certain time.
It could be any object that delta works with. Right now I can bring back all the delta changes, but I do not see how I can ask for changes that happened after a certain time.
Delta only works with the tokens presented in the links; it is not time-based (we do not store it based on time internally). It is also best-effort, which means that at a certain point the delta changes will be cleared and clients will be forced to read the objects again to get back in sync. So even if there were a time-based query, there would be no guarantee that you could access older data.
What is your scenario? Some kind of history tracking or auditing?
As far as I know, nope. I have to cycle through all Planner plans and the tasks in them to get the details. I am currently saving the Planner task details to SharePoint, and instead of updating them I am just deleting all the old records and recreating them.
That makes sense. I was saving the deltas so that in the future I could tell which user modified which Planner objects, since Microsoft has not implemented an audit trail for Planner objects yet. Storing the deltaLink was just for my possible future rollback processes.
I realized the deltaLink does not expire; it just uses the delta token to find future changes from the time the delta was queried. Basically, I am requesting that Microsoft Teams add some kind of audit trail for Planner object changes (at least who changed what, and at what time), so we can query those activities and hold the specific individuals responsible for unwanted changes they made, for instance changing the due date of Planner tasks.

Changing Opening Hours without affecting historic data

I've been tasked to create a data visualisation dashboard that relies on me drilling into the existing database.
One report is 'revenue per available covers' - part of the calculation determining how many hours were booked against how many hours were available.
The problem is the 'hours available': currently this is stored in a schedule table that has a 1-1 link with the venue, and if admins want to update it there is a simple CRUD panel with the pre-linked field ready to complete.
I have realised that if I rely on this table, then at any point in the future when the schedule changes, the calculations change for any historic data.
Any ideas on how to keep a 'historic' schedule with as little impact as possible on the database?
What you have is a (potentially) slowly-changing dimension. Basically, there are two approaches:
For each transactional record, include the hours that you are interested in.
Store the schedule with time frames, which capture the schedule at a particular point in time.
In SQL Server, I would normally go for the second option, using effDate and endDate columns to capture the period when the schedule is active.
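A minimal sketch of that second option in SQL Server, assuming an existing Booking table with VenueId, BookingDate and HoursBooked columns; all names are illustrative:

-- Versioned schedule: insert a new row each time the opening hours change,
-- and close off the previous row by setting its endDate.
CREATE TABLE VenueSchedule (
    VenueScheduleId int IDENTITY(1,1) PRIMARY KEY,
    VenueId         int NOT NULL,
    HoursAvailable  decimal(6,2) NOT NULL,              -- available hours/covers per day
    effDate         date NOT NULL,                      -- first day this version applies
    endDate         date NOT NULL DEFAULT '9999-12-31'  -- open-ended for the current version
);

-- Historic reporting joins each booking to the schedule version
-- that was in effect on the booking date.
SELECT b.VenueId, b.BookingDate,
       SUM(b.HoursBooked)     AS HoursBooked,
       MAX(vs.HoursAvailable) AS HoursAvailable
FROM Booking b
JOIN VenueSchedule vs
  ON  vs.VenueId = b.VenueId
  AND b.BookingDate >= vs.effDate
  AND b.BookingDate <  vs.endDate
GROUP BY b.VenueId, b.BookingDate;

Because historic rows keep their own effDate/endDate window, changing today's opening hours only adds a new version and never rewrites past calculations.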
I would suggest that you start with a deeper explanation of the concept, such as the Wikipedia page.

Reporting tables in SQL

Our organization has a reporting application that queries a real-time transaction table to pull data for reports. As the query runs against a transaction table that is continuously updated, report performance is dismal. We are trying to come up with a new DB design to improve performance.
My idea is to have three different tables, one for each year (e.g. reports_2014, reports_2015, reports_2016), as we need to report on only the last three years of data; these would be created at the end of each year from the real-time DB. The current year's table (reports_2016) on the reporting DB will be updated at midnight with new records for the previous day. My reporting query will use a view that is a UNION ALL of these three tables plus the data from the real-time table for records from midnight up to this point in time.
Initially, I felt this to be a good design, provided I am going to have good indexes on these history tables.
However, I have a catch here arising from the inherent application design that updates these real time tables.
The status column of a transaction record can change to 'cancelled' when I cancel a transaction, along with a new transaction cancellation record being inserted.
I could capture this by having an AFTER INSERT trigger and capturing the updates made correctly.
Now the issue is that when a cancel record is posted while my ETL that copies the last day's data to the history table is running, I miss the update.
How do I capture this? Is there a way to delay the trigger until my ETL is complete? Or is there a better approach to this problem?
My apologies if this is not the right place to post this question.
Thanks,
Roopesh
Multiple parallel tables with the same structure are almost never a good idea for a database design. Databases offer two important methods for handling performance:
Indexes
Partitioning
as well as other methods, such as rewriting queries, spatial indexes, full text indexes, and so on.
In your case, instead of multiple tables, consider table partitions.
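For illustration, a partitioned version of the reporting table might be declared roughly like this in SQL Server (boundary values, filegroups and column names are assumptions, not your actual schema):

-- Partition the reporting table by year instead of splitting it into reports_2014/2015/2016.
CREATE PARTITION FUNCTION pfReportYear (date)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2016-01-01', '2017-01-01');

CREATE PARTITION SCHEME psReportYear
AS PARTITION pfReportYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.ReportTransactions (
    TransactionId   bigint        NOT NULL,
    TransactionDate date          NOT NULL,
    Status          varchar(20)   NOT NULL,
    Amount          decimal(18,2) NOT NULL,
    ModifiedDate    datetime2     NOT NULL
) ON psReportYear (TransactionDate);

Years that drop outside the three-year reporting window can then be switched or truncated out of the same table, and the UNION ALL view over three physical tables is no longer needed.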
As for your process, you should be using the creation/modification date of records. I would envision a job running early in the morning, say at 1:00 a.m., and this job would gather the previous day's records. Any changes after midnight simply do not apply. They will be included the following day.
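A sketch of that nightly job, assuming the transaction table carries a ModifiedDate column and using the illustrative reporting table above; a MERGE also catches rows that were copied earlier and later flipped to 'cancelled':

-- Runs at 01:00 and picks up everything created or modified during the previous calendar day.
DECLARE @DayStart datetime2 = CAST(DATEADD(DAY, -1, CAST(GETDATE() AS date)) AS datetime2);
DECLARE @DayEnd   datetime2 = CAST(CAST(GETDATE() AS date) AS datetime2);

MERGE dbo.ReportTransactions AS tgt
USING (
    SELECT TransactionId, TransactionDate, Status, Amount, ModifiedDate
    FROM dbo.Transactions
    WHERE ModifiedDate >= @DayStart
      AND ModifiedDate <  @DayEnd
) AS src
ON tgt.TransactionId = src.TransactionId
WHEN MATCHED THEN
    UPDATE SET Status = src.Status, Amount = src.Amount, ModifiedDate = src.ModifiedDate
WHEN NOT MATCHED THEN
    INSERT (TransactionId, TransactionDate, Status, Amount, ModifiedDate)
    VALUES (src.TransactionId, src.TransactionDate, src.Status, src.Amount, src.ModifiedDate);

A cancellation that arrives after midnight bumps ModifiedDate on the original row, so the MATCHED branch picks up the status change on the following night's run.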
If the reporting needs to be real-time as well, then you should consider building the reporting into the application itself. Some methods are:
Following the same approach as above, but doing the reporting runs more frequently (say once per hour rather than once per day).
Modifying the existing triggers to handle updates to the reporting tables as well as the base tables (a sketch follows this list).
Wrapping all DML transactions in stored procedures that handle both the transactional tables and the reporting tables.
Re-architecting the system to use queues with multiple readers to handle the disparate processing needs.
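As a sketch of the trigger option above (same illustrative tables as before), the update path could push status changes straight into the reporting copy:

-- Illustrative only: propagate status changes to the reporting table as they happen,
-- so a cancellation posted mid-ETL is not lost.
CREATE TRIGGER dbo.trg_Transactions_SyncReporting
ON dbo.Transactions
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE rt
    SET rt.Status       = i.Status,
        rt.Amount       = i.Amount,
        rt.ModifiedDate = i.ModifiedDate
    FROM dbo.ReportTransactions rt
    JOIN inserted i ON i.TransactionId = rt.TransactionId;
END;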
Thank you, Gordon, for your inputs. At this point ours is a real-time reporting system. The database is a mirrored instance of the production transactional database. Whenever a new transaction is entered in the production database, the same record flows instantly to the reporting database, which has exactly the same schema. We do have indexes on the columns that are queried frequently; however, as there are many inserts every hour, index performance degrades quite fast. We rebuild them once every two weeks and it takes around 8 hours. That is where I thought that having indexes on this huge transaction table with many inserts every hour may not be a good idea. Please correct me if I am wrong...
I am actually reading through partitioning to see if it is a viable option for me. I had a discussion about it with our DBA and got the following comment from him: 'The reporting database is a mirrored instance of the real-time production database. You have to implement partitioning on the production transactional database. If you are using partitioning on a mirrored instance, that would not work, as your actual source DB is not partitioned.' I am not sure how far this is true. Do you know if there is such a dependency between partitioning and mirroring?

Data warehouse, data update strategy with Bigquery

We have an MIS that stores all the information about Customers, Accounts, Transactions, etc. We are building a data warehouse with BigQuery.
I am pretty new to this topic. Should we
1. every day extract ALL of the customers' latest information and append it to a BigQuery table with a timestamp, or
2. only extract the customers' information that was updated on that day?
The first solution uses a lot of storage, takes time to upload the data, and creates lots of duplicates, but it is very clear for me to run queries against. For the 2nd solution, given a specific date, how can I get the latest record for that day?
It is similar for Account data; here is an example of a simplified Account table, with only 4 fields:
AccountId, CustomerId, AccountBalance, Date
If I need to build a report or graphic of a group of customers' AccountBalance every day, I need to know the balance of each account on every specific date. So should I extract each account record every day, even if it is the same as the previous day, or should I only extract the account when the balance changes?
What is the best solution, or what would you suggest? I prefer the 2nd one because there are no duplicates, but how can I construct the query in BigQuery, and will performance be an issue?
What else should I consider? Any recommendation for me to read?
When designing a DWH you need to start from business questions and translate them to KPIs, measures, dimensions, etc.
When you have those in place...
you choose technology based on some of the following questions (and many more):
Who are your users? At what frequency and resolution do they consume the data? What are your data sources? Are they structured? What are the data volumes? What is your data quality? How often does your data structure change? Etc.
When choosing technology you need to think of the following: ETL, DB, scheduling, backup, UI, permissions management, etc.
After you have all of those defined... data schema design is pretty straightforward and is derived from "the purpose of the DWH" and your technology limits.
You have pointed out some of the points to consider, but the answer is based on your needs... and is not related to a specific DB technology.
I am afraid your question is too general to be answered without deep understanding of your needs.
Referring to your comment below:
How reliable is your source data? Are you interested in analyzing trends or just snapshots? Does your source system allow "select all" operations? What are the data volumes? What resources does your source allow for extraction (locks, bandwidth, etc.)?
If you just need a daily snapshot of the current balance, and there are no limits imposed by your source system,
it would be much simpler to run a daily snapshot.
This way you don't need to manage "increments", handle data integrity issues and system discrepancies, etc. However, this approach might have an undesired impact on your source system and your network costs...
If you do have resource limits and you choose the incremental ETL approach, you can either create a "changes log" table and query it, using row_number() to find the latest record per account (see the sketch below), or construct a copy of the source accounts table, merging the changes into the existing table every day.
Each approach has its own trade-offs in simplicity, cost, and resource consumption...
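For the changes-log option, a query along these lines (BigQuery Standard SQL; the project, dataset and table names are illustrative) returns the latest balance per account as of a chosen date:

-- account_changes holds one row per account per day on which the balance changed.
SELECT AccountId, CustomerId, AccountBalance, Date
FROM (
  SELECT
    AccountId, CustomerId, AccountBalance, Date,
    ROW_NUMBER() OVER (PARTITION BY AccountId ORDER BY Date DESC) AS rn
  FROM `my_project.my_dataset.account_changes`
  WHERE Date <= DATE '2016-06-30'   -- the "as of" reporting date
) AS ranked
WHERE ranked.rn = 1;

Partitioning the changes table on the Date column would keep the amount of data scanned, and therefore the cost, under control as the table grows.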
Hope this helps

Guidelines for determining the maximum frequency to process an SSAS cube?

We have an Analysis Services cube that needs to be as real-time as possible. It's a relatively small cube that currently takes a couple of seconds to process.
Are there any guidelines for this? I'm curious what other folks are doing.
Also, what would be the impact of processing the cube too frequently? Would the main concern be the load on the SSAS server and the source DB? In our case it would be fairly nominal. How would SSAS clients be affected? Current SSAS consumers are Excel, PerformancePoint, and SharePoint/Excel Services.
I would say the first issue you'd have to consider is how much this cube is going to grow over time. If it is constantly updated and processed, that couple of seconds could quickly turn into 20 minutes.
For example, we currently have a cube with 20 million rows (probably more now, hehe) of financial data related to hospital billing and charges that takes about 20 minutes to process, and we do it once a day in the morning. Depending on the time of year we sometimes process again during the day, but there have been no complaints as long as we notify people that we are doing this.
Have you considered a real-time (ROLAP) partition to store the current day's data? This way, you get the performance of MOLAP for all your data prior to the current day, which you can process nightly, but have ROLAP's low latency for the data collected since the last cube process.
If your cube is small enough, you could even stretch that to be the current week's data, or more.
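One way to express that split is in the source queries the two partitions are bound to; a rough sketch against an illustrative dbo.FactSales table (in practice the boundary is usually a fixed date that the nightly process moves forward rather than GETDATE()):

-- MOLAP partition: everything before today, processed in the nightly window.
SELECT *
FROM dbo.FactSales
WHERE SaleDate < CAST(GETDATE() AS date);

-- ROLAP partition: today's rows only, queried live from the relational source.
SELECT *
FROM dbo.FactSales
WHERE SaleDate >= CAST(GETDATE() AS date);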
As far as the disadvantages of processing frequently, check out the below article, which says: "If the processing job succeeds, an exclusive lock is put on the object when changes are being committed, which means the object is temporarily unavailable for query or processing. During the commit phase of the transaction, queries can still be sent to the object, but they will be queued until the commit is completed."
http://technet.microsoft.com/en-us/library/ms174860.aspx
So your users will see an impact in query performance.
It may be that you have to 'put it out there' and track how it performs.
Once you can see how people are using the cube, you can determine if constant reprocessing is really necessary and if it is, you may have to optimise how this occurs.
Specifically, using "usage based optimisation" as described here:
http://www.databasejournal.com/features/mssql/article.php/3575751/Usage-Based-Optimization-in-Analysis-Services-2005.htm