Changing Opening Hours without affecting historic data - sql

I've been tasked to create a data visualisation dashboard that relies on me drilling into the existing database.
One report is 'revenue per available covers' - part of the calculation determining how many hours were booked against how many hours were available.
The problem is the 'hours available', currently this is stored in a schedule table that has a 1-1 link with the venue - and if admin want to update this there is a simple CRUD panel with the pre-linked field ready to complete.
I have realised that if I rely on this table at any point in the future when the schedule changes the calculations change for any historic data.
Any ideas on how to keep a 'historic' schedule with as minimum impact as possible to the database?

What you have is a (potentially) slowly-changing dimension. Basically, there are two approaches:
For each transactional record, include the hours that you are interested in.
Store the schedule with time frames, which capture the schedule at a particular point in time.
In SQL Server, I would normally go for the second option, using effDate and endDate columns to capture the period when the schedule is active.
I would suggest that you start with a deeper explanation of the concept, such as the Wikipedia page.

Related

Storing vast amounts of "uptime" data for a website monitoring service

this is more of a general discussion rather than a code question.
I have a website monitoring platform whereby users of the system can input their website URL and we'll check it every X minutes based on the customer's interval, at each interval, an entry is stored as a UptimeCheck model in the Laravel 8 project with the status being down or up.
If a customer has 20 monitors, and each checks every minute, then over a 30 day period for the one customer they'd accumulate over 1 million rows.
My query, is really do I need to keep this number of rows?
The reason this number of rows is kept is so that we can present a graph showing the average website uptime.
My thinking is that if I created some kind of SVG programatically for each day and store this in the table then I wouldn't need to store as many entries, but my concern here is how would I merge SVG models into one to present a daily graph?
What kind of libraries could I use and how else might I approach this?
Unlike performance, the trick for storing uptime data is simple. You don't store it. ;)
You need to store DOWNTIME data instead. Register only unavailability events and extrapolate uptime when displaying reports.

Reporting tables in SQL

Our organization has a reporting application, that queries a real time transaction table to pull data for reports. As the query is against transaction table that is continuously updated the report performance is dismal. We are trying to come up with a new DB design to improve the performance.
My idea is to have three different tables for each year (eg; reports_2014,reports_2015,reports_2016) ( as we need to report only last three years of data) which will be created at the end of the year from the real time DB. The current year table (reports_2016) on the reporting DB will be updated with new records for the previous day at midnight. My reporting query will use a view that will be a union all of these three tables + the data from real time table for records from midnight to till this point in time.
Initially, I felt this to be a good design, provided I am going to have good indexes on these history tables.
However, I have a catch here arising from the inherent application design that updates these real time tables.
The status column of a transaction record can change to cancelled if I am cancelling a transaction , along with a new transaction cancellation record.
I could capture this by having a AFTER insert trigger and capturing the updates made correctly.
Now the issue is when there is a cancel record that is posted during the time my ETL to copy last days data to history table runs, I miss the update.
How do I capture this? Is there a way to delay the trigger untill my ETL is complete? Or is there a better approach to this problem?
My apologies if this is not the right place to post this question.
Thanks,
Roopesh
Multiple parallel tables with the same structure is almost never a good idea for a database design. Databases offer two important methods for handling performance:
Indexes
Partitioning
as well as other methods, such as rewriting queries, spatial indexes, full text indexes, and so on.
In your case, instead of multiple tables, consider table partitions.
As for your process, you should be using the creation/modification date of records. I would envision a job running early in the morning, say at 1:00 a.m., and this job would gather the previous day's records. Any changes after midnight simply do not apply. They will be included the following day.
If the reporting needs to be real-time as well, then you should consider building the reporting into the application itself. Some methods are:
Following the same approach as above, but doing the reporting runs more frequently (say once per hour rather once per each day).
Modifying the existing triggers to handle updates to reporting tables as well as the base tables.
Wrapping all DML transactions in stored procedures that handle both the transactional tables and the reporting tables.
Re-architecting the system to use queues with multiple readers to handle the disparate processing needs.
Thank You Gordon for your inputs. At this point ours is a real time reporting system. The database is a mirrored instance of production transactional database. Whenever a new transaction is entered to production database the same record flows to reporting database, which has the exactly similar schema, instantly. We do have indexes on columns those are queried frequently, however as there are many inserts in every hour the index performance is degraded quite fast. We rebuild them once in two weeks and it takes around 8 hours. That is where I thought having indexes on this huge transaction table with many inserts every hour may not be a good idea.. Please correct me if I am wrong...
I am actually reading through partitioning to see if it is a viable option for me. I had a discussion on the same with our DBA and I got following comment from him 'The reporting database is a mirrored instance of real time production database. You have to implement partitioning on the production transactional database. If you are using partitioning on a mirrored instance that would not work as your actual source DB is not partitioned' I am not sure how far this is true. Do you know if there is such a dependency between partitioning and mirroring??

Data warehouse, data update strategy with Bigquery

We have a MIS where stores all the information about Customers, Accounts, Transactions and etc. We are building a data warehouse with BigQuery.
I am pretty new on this topic, Should we
1. everyday extract ALL the customer's latest information and append them to a BigQuery table with timestamp,
2. or we only extract the updated customer's information on that day?
First solution uses a lot of storage and takes time to upload data, and got lots of duplicates. But it's very clear for me to run query. For 2nd solution, given a specific date how can I get the latest record for that day?
Similar for Account data, an example of simplified Account table, only 4 fields here.
AccountId, CustomerId, AccountBalance, Date
If I need to build a report or graphic of a group of customers' AccountBalance everyday, I need to know the balance of each account on every specific date. So should I extract each account record everyday, even it's the same as last day, or I can only extract the account when the balance changed?
What is the best solution or your suggestion? I prefer the 2nd one because there are no duplicates, but how can I construct the query in BigQuery, will performance be an issue?
What else should I consider? Any recommendation for me to read?
When designing DWH you need to start from business questions, translate them to KPIs, measures, dimensions, etc.
When you have those in place...
you chose technology based on some of the following questions (and many more):
who are your users? in what frequency and what resolutions they consume the data? what are your data sources? are they structured? what are the data volumes? what is your data quality? how often your data structure changes? etc.
when choosing technology you need to think of the following: ETL, DB, Scheduling, Backup, UI, Permissions management, etc.
after you have all those defined... data schema design is pretty straight forward and is derived from "The purpose of the DWH" and your technology limits.
You have pointed out some of the points to consider, but the answer is based of your needs... and is not related to specific DB technology.
I am afraid your question is too general to be answered without deep understanding of your needs.
Referring to your comment bellow:
How reliable is your source data? Are you interested in the analyzing trends or just snapshots? Does your source system allow "Select all" operations? what are the data volumes? What resources does your source allow for extraction (locks, bandwidth, etc.)?
If you just need a daily snapshot of the current balance, and there are no limits by your source system,
it would be much simpler to run a daily snapshot.
this way you don't need to manage "increments", handle data integrity issues and systems discrepancies etc. however, this approach might have undesired impact on your source system, and your network costs...
If you do have resources limits, and you chose the incremental ETL approach, you can either
create a "Changes log" table and query it, you can use row_number()
in order to find latest record per account.
or yo can construct a copy of the source accounts table, merging
changes everyday to an existing table.
each approach has its own aspect of simplicity, costs, and resource consumption...
Hope this helps

How to store complex records for referencing historical revisions?

I have a table on my database that outlines complex processes in a work breakdown structure (similar to what's used to create Gantt charts). There are multiple rows for a particular process, each row outlining a hierarchical step of a particular process.
I then have a table with some product types, each being linked to a particular process. When an order for a particular product is placed - it is to be manufactured with the associated process.
In my situation, the processes can be dynamic (steps added or removed, for example).
I'm curious as to what the best way to capture current and historical revisions of each process is, such that even though a process may have evolved over time - I can historically go back to a particular order and determine what the process looked like at that time.
I'm sure there are multiple ways to go about this, using logging or triggers with a new history table - but I've had no experience doing something like this and I'd like to know what worked well for others.

SSAS is a periodic snapshot the right choice

I'm a newbie to SSAS. I have a database which has an agreement table in which the status of the agreements changes over time. This is stored in the agreement log. the status can be any combination over an extended period of time. One set of questions I will need to answet are how many agreements are of a given status and also to show trends in the status over time. I'm reading Kimball and periodic snapshot seems to be the best fit but I'm at a loss how to design the fact table. Do I preaggregate the data into periods broken down by status? And then how do I manipulate it in SSAS and how do aggregations work as it's more like a bank balance. I sort of get some of the concepts but I'm still pretty confused.
Agreed, this is a good case for Periodic Snapshot.
In this case, you need a status dimension,
and a fact with a period indicator. Your reports will also need to filter on the period.
ETL is a bit more complicated, as during the current period, you clear down and reload the current period data. Previous periods to the current one are fixed. Obviously, you lose visibility on statuses that change multiple times within a period, so the period should be chosen based on how quickly the data changes as well as how often its reported. This is also why Periodic Snapshots are often used in conjunction with transaction fact tables