SQL Is it possible to setup a column that will contain a value dependent on another column? - sql

I have a table (A) that lists all bundles created off a machine in a day. It lists the date created and the weight of the bundle. I have an ID column, a date column, and a weight column. I also have a table (B) that holds the details related to that machine for the day. In that table (B), I want a column that lists a sum of weights from the other table (A) that the dates match on. So if the machine runs 30 bundles in a day, I'll have 30 rows in table (A) all dated the same day. In table (B) I'll have 1 row detailing other information about the machine for the day plus the column that holds the total bundle weight created for the day.
Is there a way to make the total column in table (B) automatically adjust itself whenever a row is added to table (A)? Is this possible to do in the table schema itself rather than in an SQL statement each time a bundle is added? If it's not, what sort of SQL statement do I need?
Wes

It would be a mistake to do so unless you have performance problems that require it.
A better approach is to define a view in the database that will aggregate the daily bundles by machine:
CREATE VIEW MachineDailyTotals
(MachineID, RunDate, BundleCount, TotalWeight)
AS SELECT MachineID, RunDate, COUNT(*), SUM(WeightCol)
FROM BundleListTable
GROUP BY MachineID, RunDate
This will allow you to always see the correct, updated total weight per machine per day without imposing any load on the database until you actually look at the data. You can perform a simple OUTER JOIN with the machine table to get information about the machine, including the daily total info, without having to actually store the totals anywhere.

If you need the sum (or other aggregate) in real time, add a trigger on table A for INSERT, UPDATE, DELETE which calculates the sum to be stored in B.
Otherwise, add a daily job which calculates the sums.
Please specify which database you are using.

Are you sure that you don't want to pull this information dynamically rather than storing it in a separate table? This seems like an indirect violation of Normalization rules in that you'll be storing the same information in two different places. With a dynamic query, you'll always be sure that the derived information will be correct without having to worry about the coding and maintenance of triggers.
Of course, if you are dealing with large amounts of data and query times are becoming an issue, you may want the shortcut of a summary table. But, in general, I'd advise against it.

This can be accomplished via Triggers which are little bits of code that execute whenever a certain action (insert/update/delete) happens on a table. The syntax is varies by vendor (MySQL vs. Oracle) but the language is typically the same language you would write a stored procedure in.
If you mention the DB type I can help with the actual syntax

Related

How many records were counted in specific day

Is it possible to check in DB2 how many records were counted in specific table in specific day in past
I have a table with name 'XYZ' and I would like to check row count for specific day e.g. for 10.09.2020, for 05.09.2020 and for 01.09.2020
In ordinary SQL, without special provisions, no, you can´t!
Depending on your usage scenario, there are several ways to achieve this function. Here are three that I can think of:
If you table has a timestamp field or you can add one and you can guarantee there will be no rows deleted: You can just count the rows where the timestamp is smaller then your desired date. Cheap, performance wise, but deletes may make this impossible.
You could set up a procedure that runs daily and counts your rows to write them in a different table. This van also be rather cheap from a performance point of view, but you will be limited to the specific "snapshot" times you configured beforehand and you may have conditions where the count procedure did not run an therefore data is missing.
You could create an audit-table and a trigger on the table you are interested in to log every insert and delete operation on the table with a timestamp. This is the most performance heavy solution, but the only one that will give you always a full picture of the row count at any given time.

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
An approach would be to have 3 tables:
one basetable in "append only" mode, only inserts are added, and updates as full row, in this table would be every record like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if there is a special column kept in the first table)
a livetable where you hold the current data (in this table you would do your MERGE statements most probably from the first base table.
If you choose partitioning and clustering, you could end up leverage a lot for long time storage discounted price and scan less data by using partitioning and clustering.
If the table is large but the amount of data updated per day is modest then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases, like the first today's check should filter for last_updated_date being either today or yesterday.
Depending of how modest this amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable because BQ engine will scan one daily partition only.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones) but you can modify the table UPDATE statements so that this column will be added to the newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old this row is, it will acquire two new columns last_updated and last_updated_date - unless both columns have already been added by the previous update in which cases the two columns will be updated rather than added. If there are several updates to the same row between the update checks, then the latest update will still make the row to be discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked and where this piece of data is held (table, durable config) is implementation dependent.
discover if the current check is the first today's check. If so then additionally search for last_updated_date=yesterday AND last_updated>last_checked.
Note 1If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause table scan. And subject to ‘modest’ assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2The downside of this approach is that the checks for updates will not find rows that had been updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the__NULL__ partition with rows that never were updated.) But I assume until the changes to the UPDATE statements are made it will be impossible to distinguish between updated rows and non-updated ones anyway.
Note 3 This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning or clustering (with partitioning on a fake column) or both.
The detailed explanation of the initial (e.g. above P.S.) answer ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I’d suggest to examine if the existing partitioning is indispensable, can it be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have an identical structure to MyData with the exception of MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
DatFlow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging to MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have been changed). Then the process upserts/merges from MyDataUpdate into MyData as well as being exported & downloaded to be loaded into PostgreSQL. Then staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.

I need to partition a huge table and truncate the partition on daily basis

I have a table tblCallDataStore which has a flow of 2 millions records daily.Daily, I need to delete any record older than 48 hours. If I create a delete job, it runs for more than 13 hours or sometimes more than that. What is the feasible way to partition the table and truncate the partition.
How may I do that?
I have a Receivedate column and I want to do the partition on its basis.
Unfortunately, you will need to bite the one-time bullet and add partitioning to your table (hint: base it off the DATEPART(DAY, ...) of the timestamp you are keying off of so you can quickly get this as your #ParitionID variable later). After that, you can use the SWITCH PARTITION statement to move all rows of the specified partition ID to another table. You can then quickly TRUNCATE the receiving table since you don't care about the data anymore.
There are a couple of tricky setup steps in this process: Managing the partition IDs themselves, and ensuring the receiving table has an identical structure (and various other requirements, such as not being able to refer to your partitioned table with foreign keys (umm... ouch)). Once you have the partition ID you want, the code is simple:
ALTER TABLE [x] SWITCH PARTITION #PartitionID TO [x_copy]
I have been meaning to write a stored procedure to automatically maintain the receiving tables, since at work we have just been manually updating both tables in tandem whenever we make a change (e.g. adding a new column or index). To manually do it, basically just copy the table definition and append something to all the entity names (e.g. just put the number 2 on the end).
P.S. Maybe you want to use hourly partitioning instead, then you have much more control over the "48 hours" requirement, and also have the flexibility to handle time zones (ugh). You'd just have to be more clever since DATEPART(HOUR, ...) obviously won't work directly. Also note that your partition IDs must be defined as a range, so eventually you will end up cycling through them, so keep that in mind.
P.P.S. As noted in the comments by David Browne, partitioning was an enterprise-only feature until SQL 2016 SP1.

Creating a SQL totals table

What I am trying to accomplish is a SQL table that contains several different totals based off of 5 other tables. This would be so that when my application needs those totals, it would not need to perform the sum since it is a rather large query.
I would like to know if there is a recommended method to have a totals table that constantly updates based on changes made in other tables. I have thought of replacing it with an indexed view or having triggers on each of the tables that are being summed, but it seems inefficient to rerun the sum query every time a field is updated. One other thing I thought of would be to have a trigger on update and every time the data changes, I would just add or remove the difference from the stored total. My end goal is to have some totals that are constantly up to date.
The table is showing totals per product. (e.g. total qty from table1 + total qty from table2)
If this is too general, I can give more specifics about table structure.
Add a trigger to tables in question and check for the only the relevant value changing rather than run the sum each time a field that is irrelevant to the total on the computed table is modified.
I ended up putting these in a queue when the underlying data was changed, and using a scheduled task to update the totals at a regular interval. We decided the tradeoff in data freshness was worth not having to recalculate the total with every transaction.

Is it viable to have a SQL table with only one row and one column?

I'm currently working on my first application that uses a database so I'm very new to this. The database has multiple tables that are what you would expect to normally see.
However, I created one table which only has one row and one column used to keep a count of the total items processed by the program so it's available to access elsewhere. I can't just use
SELECT COUNT(*) FROM table_name
because these items that I am processing I do not want to actually keep in a table.
It seems like a waste to use a table to store one value so I am wondering if there a better way to keep track of this value.
What is your table storing? it's storing some kind of processing audit. So make it a little more useful - add a column storing the last datetime that the data was processed. Add a column for the time it took to process. Add another column which stores the username (or some identifier) of whoever ran the process. Now add a row for every table that is processed (there's only one now but there might be more in future). Try and envisage how your processing is going to grow in future