Creating a SQL totals table

What I am trying to accomplish is a SQL table that contains several different totals based on 5 other tables, so that when my application needs those totals it does not have to perform the sum, which is a rather large query.
I would like to know if there is a recommended method to keep a totals table constantly updated based on changes made in the other tables. I have thought of replacing it with an indexed view, or of putting triggers on each of the tables being summed, but it seems inefficient to rerun the sum query every time a field is updated. Another thought was an update trigger that, every time the data changes, adds or removes just the difference from the stored total. My end goal is to have totals that are always up to date.
The table shows totals per product (e.g. total qty from table1 + total qty from table2).
If this is too general, I can give more specifics about the table structure.

Add a trigger to the tables in question and check whether the relevant value actually changed, rather than rerunning the sum every time a field that is irrelevant to the total is modified.
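For illustration, a minimal T-SQL sketch of that delta approach, assuming hypothetical tables dbo.Table1(ProductId, Qty) and dbo.ProductTotals(ProductId, TotalQty):
CREATE TRIGGER dbo.trg_Table1_TotalQty
ON dbo.Table1
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- apply only the net difference instead of rerunning the full SUM
    WITH delta AS (
        SELECT ProductId, SUM(Qty) AS d FROM inserted GROUP BY ProductId
        UNION ALL
        SELECT ProductId, -SUM(Qty) FROM deleted GROUP BY ProductId
    )
    UPDATE pt
    SET pt.TotalQty = pt.TotalQty + x.d
    FROM dbo.ProductTotals AS pt
    JOIN (SELECT ProductId, SUM(d) AS d FROM delta GROUP BY ProductId) AS x
        ON x.ProductId = pt.ProductId
    WHERE x.d <> 0;  -- skip no-op changes that don't affect the total
END;
Note this assumes a row for the product already exists in the totals table; brand-new products would need a corresponding insert, which this sketch omits.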

I ended up putting these updates in a queue when the underlying data changed, and using a scheduled task to recompute the totals at a regular interval. We decided the tradeoff in data freshness was worth not having to recalculate the totals with every transaction.
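A rough sketch of that pattern with hypothetical names (the triggers on each source table would only insert the affected ProductId into the queue):
CREATE TABLE dbo.TotalsQueue
(
    ProductId int NOT NULL,
    QueuedAt  datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- scheduled job: recompute only the queued products, then drain the queue
-- (a real job would snapshot the queue first so rows enqueued mid-run aren't lost)
BEGIN TRANSACTION;

UPDATE pt
SET pt.TotalQty = src.TotalQty
FROM dbo.ProductTotals AS pt
JOIN (
    SELECT ProductId, SUM(Qty) AS TotalQty
    FROM dbo.Table1          -- UNION ALL the other summed tables here
    GROUP BY ProductId
) AS src ON src.ProductId = pt.ProductId
WHERE pt.ProductId IN (SELECT ProductId FROM dbo.TotalsQueue);

DELETE FROM dbo.TotalsQueue;

COMMIT;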

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save an intermediate state of the table using the time travel feature, or to store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within the last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
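That makes it possible to compute a diff without any extra columns, as long as the comparison stays inside the 7-day window. A sketch ("table" is a placeholder as in the examples above, and this assumes whole-row comparison is acceptable):
-- rows that are new or changed during the last hour
SELECT * FROM table
EXCEPT DISTINCT
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
Note this still scans both versions of the table, so it helps with detection, not with the billing concern.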
An approach would be to have 3 tables:
a base table in "append only" mode: only inserts happen, with updates written as full rows, so this table holds every version of a record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete via a special column kept in the first table)
a live table where you hold the current data (in this table you would run your MERGE statements, most probably sourced from the first, base table).
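For illustration, a sketch of the merge from the base table into the live table (the id, payload, inserted_at and is_deleted columns are hypothetical, with deletes handled as soft deletes in the base table):
MERGE mydataset.livetable AS L
USING (
  SELECT * EXCEPT(rn)
  FROM (
    SELECT b.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY inserted_at DESC) AS rn
    FROM mydataset.basetable AS b
  )
  WHERE rn = 1            -- keep only the latest version of each record
) AS B
ON L.id = B.id
WHEN MATCHED AND B.is_deleted THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET payload = B.payload
WHEN NOT MATCHED AND NOT B.is_deleted THEN
  INSERT (id, payload) VALUES (B.id, B.payload);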
If you choose partitioning and clustering, you can leverage the discounted long-term storage price and scan less data.
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases; for example, the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable, because the BQ engine will scan only one daily partition.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from this that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones), but that you can modify the table UPDATE statements so that this column is set on newly modified rows.
Therefore I assumed you can modify the updates further to also set an additional last_updated_date column, which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old the row is, it will acquire two new columns, last_updated and last_updated_date - unless both columns were already added by a previous update, in which case the two columns will be updated rather than added. If there are several updates to the same row between the update checks, the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked; where this piece of data is held (table, durable config) is implementation dependent.
discover whether the current check is the first check of today. If so, additionally search for last_updated_date=yesterday AND last_updated>last_checked.
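Conceptually, with @last_checked and @is_first_check_today supplied as query parameters (names hypothetical, table name a placeholder), the check might look like:
SELECT *
FROM mydataset.mytable
WHERE last_updated > @last_checked
  AND (last_updated_date = CURRENT_DATE()
       OR (@is_first_check_today
           AND last_updated_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)));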
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a table scan. And subject to the ‘modest’ assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that were updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition together with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made, it is impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, or clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of avoiding table scans and reducing data usage/costs, clustering alone (with fake partitioning) can be as potent as partitioning.
Note 5
In the comments you mentioned there is already some partitioning in place. I’d suggest examining whether the existing partitioning is indispensable, or whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have a structure identical to MyData, except that MyDataStaging has an additional TIMESTAMP field, "batch_timestamp". This timestamp lets me determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a "batch_timestamp" value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have changed). Then the process upserts/merges from MyDataUpdate into MyData, and the data is also exported & downloaded to be loaded into PostgreSQL. Then the staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.

Calculating running balance from join table [SQL Database Design]

Let's say I have three tables
TRANSACTIONS
amount
date
RECORDS
amount
date
CUSTOM_RECORDS
amount
date
(Let's just say there are many other fields to justify splitting of these tables)
To calculate running balance I have two methods
-------------METHOD 1 -------------
Heavy on READ, Light on WRITE
Whenever we read, just join the table, sort by date and calculate the running balance.
PRO
Write is easy, just write into each table
CON
Reading is very heavy, the calculation needs to be done on each read.
It is very strange that querying (for, let's say, a span of 1 week) requires the calculation to be done for ALL the records. If I query for 10 records, the calculation still needs to run over 1 million records just to know the balance of those 10.
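For reference, under Method 1 the read-time calculation can be expressed with a window function (a sketch assuming the three tables share the amount and date columns, and that the engine supports window functions):
SELECT date, amount,
       SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS running_balance
FROM (
    SELECT amount, date FROM TRANSACTIONS
    UNION ALL
    SELECT amount, date FROM RECORDS
    UNION ALL
    SELECT amount, date FROM CUSTOM_RECORDS
) AS combined
ORDER BY date;
This still scans every row up to the queried point, which is exactly the Method 1 downside described above.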
-------------METHOD 2 -------------
Heavy on WRITE, Light on READ
I have another table
FINAL_TABLE
date
amount
running balance
Whenever I write, I refresh this table and recalculate all the running balances.
PRO
Read is easy, running balance already computed.
Querying between time period is as easy as extracting the date between the time span from the FINAL_TABLE
CON
Write is really slow; each write to any of the three tables means refreshing the whole FINAL_TABLE!
Why didn't I just reuse the latest running balance? That would work if entries were guaranteed to arrive in chronological order; however, sometimes an entry is added late.
Currently I am using Method 2, and every time a client saves/updates a row in any of the three tables, the server freezes as it tries to refresh and recompute the FINAL_TABLE. Obviously, this is not very scalable.
Method 1 is also not very scalable in terms of querying. I would have to calculate the running balance from the beginning of time in order to know the running balance of last week.
Neither method is very scalable. What is a good design to ensure scalability and relatively fast performance on READ and WRITE? What method do banks use to keep track of running balances?
It depends.
Suppose you have a report, like a transaction report, where accounts' running balances are shown. If you want to show real-time data then method 1 will always be preferable. And I suggest using a Quirky Update for this rather than cursors, loops, sub-queries or recursion.
On the other hand, if you don't need a real-time running total then you could use method 2 with a little customization. I would not recommend updating the final table as part of each transaction; rather, I suggest updating it on a schedule. Depending on your traffic or load, you may refresh the running total at an interval.
And for real time I would discourage using method 2, as it will make your transactions costly.
To make your method 1 faster, here are some links:
Calculating Running Total
Quirky Update
Quirky Update Performance
Halloween Protection
Create Table AccBalance
(
    AccountNO int Primary Key,
    Balance decimal(18, 2) Not Null
);

Create Table AccDateWiseCumBalance
(
    AccountNO int,
    SystemDate date,
    CumulativeBalance decimal(18, 2) Not Null,
    Primary Key (AccountNO, SystemDate)
);
The first table will be updated by each transaction and will keep the real-time balance, but no history.
The second table keeps the account- and date-wise cumulative balance, and is updated at each day's end.
So if you need the cumulative balance up to a previous date, you retrieve data from the second table.
And if you need the cumulative balance up to the current date, you retrieve data from the second table up to the day before the current date, and retrieve the current date's data from the first table.
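A sketch of that last case in T-SQL (the @AccountNO parameter is hypothetical; columns as defined above):
-- cumulative balance up to the current date for one account
SELECT c.SystemDate, c.CumulativeBalance
FROM AccDateWiseCumBalance AS c
WHERE c.AccountNO = @AccountNO
  AND c.SystemDate < CAST(GETDATE() AS date)
UNION ALL
SELECT CAST(GETDATE() AS date) AS SystemDate, b.Balance   -- today's figure from the real-time table
FROM AccBalance AS b
WHERE b.AccountNO = @AccountNO
ORDER BY SystemDate;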

Select sum or updating a field for Total Balance?

I'm using Entity Framework and Azure Sql.
I have users, and they have records in a balance table. Some users may have 1 million records. I need the total balance of the user before every HTTP request.
I have two approaches for getting total balance of user:
First:
Insert the balance row and update the totalbalance field (from the application) in a transaction:
-- table/column names are illustrative
BEGIN TRANSACTION;
INSERT INTO Balance (CustomerId, Amount) VALUES (@CustomerId, @Amount);
UPDATE CustomerSummary SET TotalBalance = TotalBalance + @Amount WHERE CustomerId = @CustomerId;
COMMIT;
If I need total balance I'll just select this from CustomerSummary table.
Second: insert the balance directly, without using any transaction.
If I need the total balance I have to compute the sum with a query.
Is the first approach reliable for the total balance?
Can the sum in the second approach be as fast as the first approach?
The second approach is guaranteed to be accurate -- if you want the sum of a particular column, there is nothing more accurate than a query that calculates the sum.
The reason for maintaining a summary table is performance. Typically, such a table is maintained in one of two ways:
Triggers
Stored procedures that wrap all data modification operations
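For illustration, a minimal T-SQL sketch of the second option, a stored procedure that wraps the modification (table and column names are hypothetical):
CREATE PROCEDURE dbo.AddBalance
    @CustomerId int,
    @Amount     decimal(18, 2)
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- the detail row and the summary change together or not at all
    INSERT INTO dbo.Balance (CustomerId, Amount)
    VALUES (@CustomerId, @Amount);

    UPDATE dbo.CustomerSummary
    SET TotalBalance = TotalBalance + @Amount
    WHERE CustomerId = @CustomerId;

    COMMIT;
END;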
Your example with the insert is an "application-side" solution. The danger is that someone might come along, decide that a balance is incorrect, and change the value directly in the database - and the total doesn't get changed.
To make this work correctly, you need the right controls over access to the database to ensure that whenever an amount changes, all of its dependencies change too. Note: this is not an issue if you calculate the balance when you need it.

SQL Server : Update reporting table in real time

For one of our applications we have huge data in multiple tables, and every time a user does something a new record is inserted into these tables. There is a reporting screen where we have to do calculations from these tables and show the totals.
For example: Assume two parent tables Employee and Attendance
The Employee table has 100,000 records, and the Attendance table gets a record each day whenever an employee enters or leaves their building. The Attendance table holds more than 2 million records for one year. I need to calculate the total attendance for each employee and display it on screen for all 100,000 records, paginated by employee name. The calculation takes too much time and spikes the DB CPU.
To avoid the runtime calculation of the total, I'm planning to have a separate table with the calculated total for each employee, and just query that table whenever needed. But the problem is that the data for previous years is not going to change, while for the current year data is generated whenever an employee records attendance day to day. What is the best option to keep this table updated in real time with the total for every employee whenever new attendance is recorded for the current year?
I thought of using triggers, but triggers are synchronous, so they would affect the performance of my reporting application whenever I query, or affect the performance of inserts into the parent tables.
Please let me know if there are better ways to update my totals table in real time without impacting the performance of inserts or updates to the parent tables.
This is a perfect case for indexed views. Certainly, the core of your query is a group by such as:
select EmployeeID, count_big(*) as AttendanceCount
from dbo.AttendanceRecords
group by EmployeeID
Index that view. Its contents will then be available cheaply and updated in real time. There is zero potential for out-of-sync data.
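A sketch of the indexed view itself; note that SQL Server requires SCHEMABINDING and COUNT_BIG(*) for an indexed view over an aggregate (view and table names assumed):
CREATE VIEW dbo.vEmployeeAttendance
WITH SCHEMABINDING
AS
SELECT EmployeeID, COUNT_BIG(*) AS AttendanceCount
FROM dbo.AttendanceRecords
GROUP BY EmployeeID;
GO

-- the unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vEmployeeAttendance
ON dbo.vEmployeeAttendance (EmployeeID);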
One option would be to use SQL Change Tracking:
https://msdn.microsoft.com/en-us/bb933875.aspx
This is not change data capture (which can be quite heavy) - change tracking just lets you know which keys changed so you can act on it. With that information, you could have a regular job that collects those changes and updates your summaries.
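A sketch of how that might look (the table, its AttendanceId key, and the retention settings are illustrative):
ALTER DATABASE CURRENT
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Attendance ENABLE CHANGE_TRACKING;

-- regular job: fetch the keys changed since the last sync, then update summaries
DECLARE @last_sync bigint = 0;  -- persist this value between runs
SELECT a.EmployeeID, ct.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.Attendance, @last_sync) AS ct
LEFT JOIN dbo.Attendance AS a ON a.AttendanceId = ct.AttendanceId;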
...or, if you can use SQL 2014, you could get into Updatable Column Stores and dispense with the summaries.
Would you consider exporting data from previous years and using it to create the total attendance counts for employees in earlier years?
You say you're essentially moving towards having a table act as a counter, so by ensuring your old data conforms to this model as well, it'll be much easier to write and maintain the code that interacts with it, and the server load from any individual query should be minimal.

SQL Is it possible to setup a column that will contain a value dependent on another column?

I have a table (A) that lists all bundles created off a machine in a day. It lists the date created and the weight of the bundle. I have an ID column, a date column, and a weight column. I also have a table (B) that holds the details related to that machine for the day. In that table (B), I want a column that lists a sum of weights from the other table (A) that the dates match on. So if the machine runs 30 bundles in a day, I'll have 30 rows in table (A) all dated the same day. In table (B) I'll have 1 row detailing other information about the machine for the day plus the column that holds the total bundle weight created for the day.
Is there a way to make the total column in table (B) automatically adjust itself whenever a row is added to table (A)? Is this possible to do in the table schema itself rather than in an SQL statement each time a bundle is added? If it's not, what sort of SQL statement do I need?
Wes
It would be a mistake to do so unless you have performance problems that require it.
A better approach is to define a view in the database that will aggregate the daily bundles by machine:
CREATE VIEW MachineDailyTotals
(MachineID, RunDate, BundleCount, TotalWeight)
AS SELECT MachineID, RunDate, COUNT(*), SUM(WeightCol)
FROM BundleListTable
GROUP BY MachineID, RunDate
This will allow you to always see the correct, updated total weight per machine per day without imposing any load on the database until you actually look at the data. You can perform a simple OUTER JOIN with the machine table to get information about the machine, including the daily total info, without having to actually store the totals anywhere.
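Usage might look like this (assuming a hypothetical MachineTable holding table (B)'s per-day machine details):
SELECT m.MachineID, m.RunDate, t.BundleCount, t.TotalWeight
FROM MachineTable AS m
LEFT OUTER JOIN MachineDailyTotals AS t
    ON t.MachineID = m.MachineID
   AND t.RunDate = m.RunDate;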
If you need the sum (or other aggregate) in real time, add a trigger on table A for INSERT, UPDATE, DELETE which calculates the sum to be stored in B.
Otherwise, add a daily job which calculates the sums.
Please specify which database you are using.
Are you sure that you don't want to pull this information dynamically rather than storing it in a separate table? This seems like an indirect violation of Normalization rules in that you'll be storing the same information in two different places. With a dynamic query, you'll always be sure that the derived information will be correct without having to worry about the coding and maintenance of triggers.
Of course, if you are dealing with large amounts of data and query times are becoming an issue, you may want the shortcut of a summary table. But, in general, I'd advise against it.
This can be accomplished via triggers, which are little bits of code that execute whenever a certain action (insert/update/delete) happens on a table. The syntax varies by vendor (MySQL vs. Oracle) but the language is typically the same language you would write a stored procedure in.
If you mention the DB type, I can help with the actual syntax.