Keeping track of mutated rows in BigQuery? - google-bigquery

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I want to extract and synchronize a smaller subset of this table into my PostgreSQL instance, because the latency of querying BQ is just too high for smaller queries.
Any ideas? Thanks!

One way is to periodically save the intermediate state of the table using the time travel feature, or to store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within the last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
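If you mainly care about the diffs, something like the following compares the current state with an earlier one (a rough sketch; the dataset/table name is an assumption, it scans both versions of the table in full, and it will not surface deleted rows):
SELECT * FROM `mydataset.mytable`
EXCEPT DISTINCT
SELECT * FROM `mydataset.mytable`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);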

An approach would be to have 3 tables:
a base table in "append only" mode: only inserts are added, and updates are stored as full rows, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft-delete column kept in the first table)
a live table where you hold the current data (this is the table you would MERGE into, most probably from the base table).
If you also choose partitioning and clustering, you can leverage the long-term storage discount and scan less data.
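As a rough sketch of that MERGE from the base table into the live table (all table, key and column names here are assumptions, including an is_deleted soft-delete flag and an ingest_time column in the base table):
MERGE `mydataset.livetable` AS live
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT b.*, ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.ingest_time DESC) AS rn
    FROM `mydataset.basetable` AS b
  )
  WHERE rn = 1                      -- latest version of each record
) AS latest
ON live.id = latest.id
WHEN MATCHED AND latest.is_deleted THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET value = latest.value, last_updated = latest.ingest_time
WHEN NOT MATCHED AND NOT latest.is_deleted THEN
  INSERT (id, value, last_updated) VALUES (latest.id, latest.value, latest.ingest_time);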

If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases, e.g. the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the table all day could be affordable, because the BQ engine will scan only one daily partition.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from this that the last_updated column does not exist yet (so the check-for-updates statement cannot currently distinguish updated rows from non-updated ones), but that you can modify the table's UPDATE statements so that this column is added to newly modified rows.
Therefore I assumed you can modify the updates further to also set an additional last_updated_date column, which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old that row is, it will acquire the two new columns last_updated and last_updated_date - unless both columns were already added by a previous update, in which case the two columns will be updated rather than added. If there are several updates to the same row between the update checks, the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally; see the sketch after this list):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked; where this piece of data is held (a table, durable config) is implementation dependent.
determine whether the current check is the first check of the day. If so, additionally search for last_updated_date=yesterday AND last_updated>last_checked.
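A rough sketch of that check in BigQuery (the table name and bookmark handling are assumptions; for simplicity it always includes yesterday's partition, whereas the logic above only does so on the first check of the day):
DECLARE last_checked TIMESTAMP DEFAULT TIMESTAMP '2021-01-01 00:00:00';  -- placeholder bookmark

SELECT *
FROM `mydataset.mytable`
WHERE last_updated_date IN (CURRENT_DATE(), DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND last_updated > last_checked;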
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a table scan. And subject to the 'modest' assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that were updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition, together with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made it is impossible to distinguish updated rows from non-updated ones anyway.
Note 3: This is an explanatory concept. In a real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, or clustering (with partitioning on a fake column), or both.
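For example, one possible layout (a sketch only; the dataset, table and column names are assumptions):
CREATE TABLE `mydataset.mytable_v2`
PARTITION BY last_updated_date
CLUSTER BY last_updated
AS
SELECT t.*, DATE(t.last_updated) AS last_updated_date
FROM `mydataset.mytable` AS t;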
The detailed explanation of the initial answer (i.e. the answer above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I'd suggest examining whether the existing partitioning is indispensable, or whether it can be replaced with clustering.

Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have a structure identical to MyData, except that MyDataStaging has an additional timestamp field, "batch_timestamp". This timestamp lets me determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have changed). The process then upserts/merges from MyDataUpdate into MyData, and the same data is exported and downloaded to be loaded into PostgreSQL. Then the staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
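A rough sketch of that final merge (column names such as id, event_date and value are assumptions; whether the date filter actually reduces the bytes processed depends on how MyData is partitioned):
DECLARE touched_dates ARRAY<DATE> DEFAULT (
  SELECT ARRAY_AGG(DISTINCT event_date) FROM `mydataset.MyDataUpdate`);

MERGE `mydataset.MyData` AS d
USING `mydataset.MyDataUpdate` AS u
ON d.id = u.id
   AND d.event_date IN UNNEST(touched_dates)   -- only touch the dates present in the update set
WHEN MATCHED THEN
  UPDATE SET value = u.value
WHEN NOT MATCHED THEN
  INSERT (id, event_date, value) VALUES (u.id, u.event_date, u.value);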

Related

How to make 2 remote tables in sync along with determination of Delta records

We have 10 tables on the vendor system and the same 10 tables on our DB side, along with 10 _HISTORIC tables (one per table) in order to capture updated/new records.
We are reading the main tables from the vendor system using Informatica to truncate and load into our tables. How do we find delta records without using triggers or CDC, as those come with a cost on the vendor system?
4 of the tables have 200 columns and around 31K records each, with the expectation that 100-500 records might update daily.
We are using a Left Join in Informatica to load new records into our main and _HISTORIC tables.
But what's an efficient approach to find the updated records of the vendor table and load them into our _HISTORIC table?
For new records we use this query:
-- NEW RECORDS
INSERT INTO TABLEA_HISTORIC
SELECT A.*
FROM TABLEA A
LEFT JOIN TABLEB B
ON A.PK = B.PK
WHERE B.PK IS NULL
I believe a system-versioned temporal table is what you are looking for here. You can create a system-versioned table for any table in SQL Server 2016 or later.
For example, say I have a table Employee:
CREATE TABLE Employee
(
EmployeeId VARCHAR(20) PRIMARY KEY,
EmployeeName VARCHAR(255) NOT NULL,
EmployeeDOJ DATE,
ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL, -- automatically set by the system when the row is inserted/updated
ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,     -- auto-updated
PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)                 -- the row validity period
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory)); -- history table name is your choice
The columns ValidFrom and ValidTo determine the time period during which that particular row was active.
For more info, refer to the Microsoft article:
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
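Once versioning is enabled, querying history is straightforward. A couple of hedged examples (the timestamp literals are placeholders):
-- State of the table as it was at a point in time
SELECT * FROM Employee
FOR SYSTEM_TIME AS OF '2021-06-01T00:00:00';

-- Rows whose current version changed since the last extract
SELECT * FROM Employee
WHERE ValidFrom > '2021-06-01T00:00:00';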
Create staging tables and wipe & load them on each run. Next, use them to find the differences that need to be loaded into your target tables.
The CDC logic needs to be performed this way, but it will not affect your source system.
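For example, with identically structured staging and target tables (names here are assumptions), the new or changed rows can be found with a set difference and then joined back on the key to classify them as inserts vs updates:
-- Rows present or changed in staging but not in the target
SELECT s.* FROM stg.TABLEA AS s
EXCEPT
SELECT t.* FROM dbo.TABLEA AS t;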
Another way - not sure if it's possible in your case - is to load partial data based on some source-system date or key. This way you stage only the new data. This improves performance a lot, but makes finding deletes in the source impossible.
A. To replicate a smaller subset of records in the source without making schema changes, there are a few options.
Transactional Replication; however, this is not very flexible. For example, it would not allow any differences in the target database, and therefore is not a solution for you.
Identify a "date modified" field in the source. This obviously has to already exist, and will not allow you to identify deletes.
Use a "windowing approach" where you simply delete and reload the last month's transactions, again based on an existing date. This requires an existing date that isn't back-dated, and doesn't work for non-transactional tables (which are usually small enough to just do full copies anyway).
Turn on change tracking. Your vendor may or may not argue that this is a costly change (it isn't) or that it impacts application performance (it probably doesn't).
https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver15
Turning on change tracking will allow you to more easily identify changes to all tables.
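A rough sketch of what that looks like (assuming a table dbo.TABLEA with primary key column PK; the database name and retention period are just examples):
ALTER DATABASE VendorDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.TABLEA ENABLE CHANGE_TRACKING;

-- On each extract, pull only the rows changed since the previously stored version
DECLARE @last_sync_version bigint = 0;           -- persist this between runs
SELECT ct.PK, ct.SYS_CHANGE_OPERATION, a.*
FROM CHANGETABLE(CHANGES dbo.TABLEA, @last_sync_version) AS ct
LEFT JOIN dbo.TABLEA AS a ON a.PK = ct.PK;       -- deleted rows have no match in the base table
SELECT CHANGE_TRACKING_CURRENT_VERSION();        -- store as @last_sync_version for the next run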
You need to ask yourself: is it really an issue to copy the entire table? I have built solutions that simply copy entire large tables (far larger than 31K records) every hour, and there is never an issue.
You need to consider what complications you introduce by building an incremental solution, and whether the associated maintenance and complexity are worth being able to reduce a copy from 31K rows (the full table) to 500 rows (the changed ones). Again, a full copy of 31K records is actually pretty fast under normal circumstances (like 10 seconds or so).
B. Target table
As already recommended by many, you might want to consider a temporal table, although if you do decide to do full copies, a temporal table might not be the best option.

I need to partition a huge table and truncate the partition on daily basis

I have a table tblCallDataStore which has a flow of 2 million records daily. Each day, I need to delete any record older than 48 hours. If I create a delete job, it runs for more than 13 hours, or sometimes even longer. What is a feasible way to partition the table and truncate the partition?
How may I do that?
I have a Receivedate column and I want to partition on that basis.
Unfortunately, you will need to bite the one-time bullet and add partitioning to your table (hint: base it off DATEPART(DAY, ...) of the timestamp you are keying off of, so you can quickly derive your @PartitionID variable later). After that, you can use the SWITCH PARTITION statement to move all rows of the specified partition ID to another table. You can then quickly TRUNCATE the receiving table, since you don't care about the data anymore.
There are a couple of tricky setup steps in this process: managing the partition IDs themselves, and ensuring the receiving table has an identical structure (plus various other requirements, such as not being able to refer to your partitioned table with foreign keys (umm... ouch)). Once you have the partition ID you want, the code is simple:
ALTER TABLE [x] SWITCH PARTITION @PartitionID TO [x_copy]
I have been meaning to write a stored procedure to automatically maintain the receiving tables, since at work we have just been manually updating both tables in tandem whenever we make a change (e.g. adding a new column or index). To manually do it, basically just copy the table definition and append something to all the entity names (e.g. just put the number 2 on the end).
P.S. Maybe you want to use hourly partitioning instead; then you have much more control over the "48 hours" requirement, and also the flexibility to handle time zones (ugh). You'd just have to be more clever since DATEPART(HOUR, ...) obviously won't work directly. Also note that your partition IDs must be defined as a range, so eventually you will end up cycling through them; keep that in mind.
P.P.S. As noted in the comments by David Browne, partitioning was an enterprise-only feature until SQL 2016 SP1.
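To make those setup steps concrete, here is a rough sketch under assumed names (a persisted computed day-of-month column as the partitioning key, cycling through 31 daily partitions, and a structurally identical switch-target table):
CREATE PARTITION FUNCTION pfDayOfMonth (int)
AS RANGE RIGHT FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31);

CREATE PARTITION SCHEME psDayOfMonth
AS PARTITION pfDayOfMonth ALL TO ([PRIMARY]);

-- tblCallDataStore needs a persisted DATEPART(DAY, Receivedate) column and must be created
-- ON psDayOfMonth(that column); tblCallDataStore_Switch must have the identical structure.
DECLARE @PartitionID int = DATEPART(DAY, DATEADD(DAY, -2, GETDATE())) + 1;  -- partition holding data from 2 days ago
ALTER TABLE tblCallDataStore SWITCH PARTITION @PartitionID TO tblCallDataStore_Switch;
TRUNCATE TABLE tblCallDataStore_Switch;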

SQL server data backup/warehouse

I've been asked to do snapshots of certain tables from the database, so that in the future we can have a clear view of the situation for any given day in the past. Let's say that one of such tables looks like this:
GKEY  Time_in           Time_out          Category  Commodity
1001  2014-05-01 10:50  NULL              EXPORT    Apples
1002  2014-05-02 11:23  2014-05-20 12:05  IMPORT    Bananas
1003  2014-05-05 11:23  NULL              STORAGE   NULL
The simplest way to do a snapshot would be to create a copy of the table with another column SNAPSHOT_TAKEN (datetime) and populate it with an INSERT statement:
INSERT INTO UNITS_snapshot (SNAPSHOT_TAKEN, GKEY,Time_in, Time_out, Category, Commodity)
SELECT getdate() as SNAPSHOT_TAKEN, * FROM UNITS
OK, it works fine, but it would make the destination table quite big pretty soon, especially if I'd like to run this query often. A better solution would be to check for changes between the current live table and the latest snapshot and write only those down, omitting everything that hasn't changed.
Is there a simply way to write such query?
EDIT: Possible solution for the "Forward delta" (assuming no deletes from original table)
INSERT INTO UNITS_snapshot
SELECT getdate() as SNAP_DATE,
       r.*, -- here goes all data from the original table
       CASE WHEN b.gkey IS NULL THEN 'I' ELSE 'U' END AS change_type
FROM UNITS r
LEFT OUTER JOIN UNITS_snapshot b ON r.gkey = b.gkey
WHERE (r.time_in <> b.time_in OR r.time_out <> b.time_out
       OR r.category <> b.category OR r.commodity <> b.commodity
       OR b.gkey IS NULL)
  AND (b.snap_date = (SELECT MAX(b2.snap_date) FROM UNITS_snapshot b2)
       OR b.snap_date IS NULL)
Assumptions: no row from the original table is ever deleted. Probably every field in the WHERE should also be wrapped in COALESCE(xxx, '') to avoid comparing NULL values with set ones.
Both Dan Bracuk and ITroubs have made very good comments.
Solution 1 - Daily snapshot
The first solution you proposed is very simple. You can build the snapshot with a simple query and you can also consult it and rebuild any day's snapshot with a very simple query, by just filtering on the SNAPSHOT_TAKEN column.
If you have just some thousands of records, I'd go with this one, without worrying too much about its growing size.
Solution 2 - Daily snapshot with rolling history
This is basically the same as solution 1, but you keep only some of the snapshots over time... to avoid having the snapshot DB growing indefinitely over time.
The simplest approach is just to save the snapshots of the last N days... maybe a month or two of data. A more sophisticated approach is to keep snapshots with a density that depends on age... so, for example, you could have every day of the last month, plus every Sunday of the last 3 months, plus every end-of-month of the last year, etc...
This solution requires you to develop a procedure to handle deletion of the snapshots that are no longer required. It's not as simple as using getdate() within a query, but you obtain a good balance between space and historic information. You just need to work out a snapshot retention strategy that suits your needs.
Solution 3 - Forward row delta
Building any type of delta is a much more complex procedure.
A forward delta is built by storing the initial snapshot (as if all rows had been inserted on that day) and then, on the following snapshots, just storing information about the difference between snapshot(N) and snapshot(N-1). This is done by analyzing each row and just storing the data if the row is new or updated or deleted. If the main table does not change much over time, you can save quite a lot of space, as no info is stored for unchanged rows.
Obviously, to handle deltas, you now need 2 extra columns, not just one:
delta id (your snapshot_taken is good, if you only want 1 delta per day)
row change type (could be D=deleted, I=inserted, U=updated... or something similar)
The main complexity derives from the necessity to identify rows (usually by primary key) so as to calculate if between 2 snapshots any individual row has been inserted, updated, deleted... or none of the above.
The other complexity comes from reading the snapshot DB and building the latest (or any other) snapshot. This is necessary because, having only row differences in the table, you cannot simply select a day's snapshot by filtering on snapshot_taken.
This is not easy in SQL. For each row you must take into account just the final version... the one with the MAX snapshot_taken that is <= the date of the snapshot you want to build. If it is an insert or update, then keep the data for that row; else (if it is a delete) ignore it. (See the sketch at the end of this solution.)
To build a delta of snapshot(N), you must first build the latest snapshot (N-1) from the snapshot DB. Then you must compare the two snapshots by primary key or row identity and calculate the change type (I/U/D) and insert the changes in the snapshot DB.
Beware that you cannot delete old snapshot data without consolidating it first. That is because all snapshots are calculated from the oldest initial one and all the subsequent difference data. If you want to remove a year's of old snapshots, you'll have to consolidate the old initial snapshot and all the year's variations in a new initial snapshot.
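As a hedged sketch of that rebuild step (assuming the forward-delta table is UNITS_snapshot with the snap_date, gkey and change_type columns from the question):
DECLARE @as_of datetime = '2014-06-01';
WITH latest AS (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY s.gkey ORDER BY s.snap_date DESC) AS rn
    FROM UNITS_snapshot AS s
    WHERE s.snap_date <= @as_of
)
SELECT *                       -- the reconstructed snapshot as of @as_of
FROM latest
WHERE rn = 1
  AND change_type <> 'D';      -- drop rows whose latest variation is a delete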
Solution 4 - Backward row delta
This is very similar to solution 3, but a bit more complex.
A backward delta is built by storing the final snapshot and then, on the following snapshots, just storing information about the difference between snapshot(N-1) and snapshot(N).
The advantage is that the latest snapshot is always readily available through a simple select on the snapshot DB. You only need to merge the difference data when you want to retrieve an older snapshot. Compare this to the forward delta, where you always need to rebuild the snapshot from the difference data unless you are actually interested in the very first snapshot.
Another advantage (compared to solution 3) is that you can remove older snapshots by just deleting the difference data older than a particular snapshot. You can do this easily because snapshots are calculated from the final snapshot and not from the initial one.
The disadvantage is the more obscure logic. Difference data is calculated backwards. Values must be stored on the (U)pdate and (D)elete variations, but are unnecessary on the (I)nsert variations. Going backwards, rows must be ignored if the first variation you find is an (I)nsert. Doable, but a bit trickier.
Solution 5 - Forward and backward column delta
If the main table has many columns, or many long text or varchar columns, and only a bunch of these are updated, then it could make sense to store only column variations instead of row variations.
This is done by using a table with this structure:
delta id (your snapshot_taken is good, if you only want 1 delta per day)
change type (could be D=deleted, I=inserted, U=updated... or something similar)
column name
value
The difference can be calculated forward or backward, as per row deltas.
I've seen this done, but I really advise against it. There are just too many disadvantages and added complexity.
Value is a text or varchar, and there are typecasting issues to handle if you have numeric, boolean or date/time values... and, if you have a lot of these, it could very well be that you won't be saving as much space as you think you are.
Rebuilding any snapshot is hell. Altogether... any operation on this type of table really requires a lot of knowledge of the main table's structure.

Incremental extraction from DB2

What would be the most efficient way to select only the rows from a DB2 table that were inserted/updated since the last select (or since some specified time)? There is no field in the table that would allow us to do this easily.
We are extracting data from the table for purposes of reporting, and now we have to extract the whole table every time, which is causing big performance issues.
I found example on how to select only rows changed in last day:
SELECT * FROM ORDERS
WHERE ROW CHANGE TIMESTAMP FOR ORDERS >
CURRENT TIMESTAMP - 24 HOURS;
But, I am not sure how efficient this would be, since the table is enormous.
Is there some other way to select only the rows that have changed, one that might be more efficient than this?
I also found a solution called ParStream. This seems like something that could speed up demanding queries on the data, but I was unable to find any useful documentation about it.
I propose these options:
You can use Change Data Capture, which will automatically replay the modifications to another data source.
Normally, a select statement does not guarantee the order of the rows. That means you cannot use a select without a time reference to retrieve the most recent rows, so you have to have a time column. You can keep track of the most recent row in a global variable, and on the next run retrieve the rows with a time greater than that variable. If you want to increase performance, you can put the table in append mode, so the new rows are physically stored together. Keeping an index on this time column can be expensive to maintain, but it will speed up the extraction (no table scan) when you need to extract the rows.
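A rough sketch of that on DB2 for LUW (column and index names are assumptions, and the literal stands in for the remembered extraction time):
ALTER TABLE ORDERS
  ADD COLUMN CHANGED_TS TIMESTAMP NOT NULL
      GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP;

CREATE INDEX IX_ORDERS_CHANGED_TS ON ORDERS (CHANGED_TS);

SELECT *
FROM ORDERS
WHERE CHANGED_TS > TIMESTAMP('2021-01-01-00.00.00');  -- value remembered from the previous extract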
If your server is DB2 for i, use database journaling. You can extract after images of inserted records by time period or journal entry number from the journal receiver(s). The data entries can then be copied to your target file.

SQL Server compare table entries for update

I have a trade table with several million rows. Each row represents the version of a trade. If I'm given a possibly new trade I compare it to the latest version in the trade table. If it has changed I add a new version, otherwise I do nothing. In order to compare the 2 trades I read the version from the trade table into my application.
This doesn't work well when I'm given tens of thousands of possibly new trades. Even batching the reads to pull in 1000 trades at once and compare them, the whole process can take several minutes. All the time is spent in the DB.
I'm trying to find a way to compare the possibly new trades to the ones in the trade table without so much I/O. What I've come up with so far is adding a hash column to each row in the trade table. The hash is of all the trade fields. Then when I'm given possibly new trades I compute their hash, put the values into a temporary table, then find ones that are different. This feels very hacky. Is there a better way of doing it?
Thanks
--
Some more info
SQL Server 2008
Trade(rowid, tradeid, type, trader, volume, etc..) -- rowid is unique, tradeid will be duplicated for different versions of the same trade
The table has about 30 columns and is not normalised, so depending on the type some columns can be null. Someone posts thousands of trades to a Java servlet, which is then supposed to add a new row for any trade that has changed. Unfortunately, in order to do this the Java servlet has to read in every one of the thousands of trades and compare them.
The newest version of a particular trade is just the version with the highest rowid.
If you are using SQL Server 2008, you might want to use the MERGE statement.
Create an index on the columns that uniquely identify each trade.
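As a hedged sketch of the set-based comparison (the staging table and column names are assumptions; note this particular step uses INSERT ... LEFT JOIN rather than MERGE, because a changed trade gets a brand-new version row rather than an in-place update, and NULL-able columns would need NULL-safe comparisons):
INSERT INTO Trade (tradeid, type, trader, volume)
SELECT s.tradeid, s.type, s.trader, s.volume
FROM TradeStaging AS s
LEFT JOIN (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY t.tradeid ORDER BY t.rowid DESC) AS rn
    FROM Trade AS t
) AS latest
  ON latest.tradeid = s.tradeid AND latest.rn = 1
WHERE latest.tradeid IS NULL          -- brand-new trade
   OR latest.type   <> s.type         -- or any tracked field changed
   OR latest.trader <> s.trader
   OR latest.volume <> s.volume;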
A hash is not a bad solution. It would help if you posted some more info about the table structure.
The standard way to do it is to simply run an UPDATE statement whose WHERE clause includes joins on the key fields (WHERE table.PRODUCT_ID = NEWTRADE.PRODUCT_ID) and also checks the value fields (WHERE table.TRADE_AMOUNT <> NEWTRADE.BID_AMOUNT); if you index the table by PRODUCT_ID, it will run in milliseconds.
You may insert your tens of thousands of new trades into a table first and then run the UPDATE to join the main table with the new trades. Again, make sure you have indexed the tables properly.
Given what you have told us, it sounds like you are in part looking for a way to determine if the row changed. This is a good candidate for a rowversion column (previously known as a timestamp). This column will change whenever any value in the row changes. Thus, you could compare the last trade's rowversion with the current rowversion to determine if they were different.
It might be possible to do this in a single insert statement if you show us some additional details about the table schema and specifically how "last" is determined and how you match rows in the two tables (i.e. the matching key between the two tables).
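A minimal sketch of the rowversion idea (dbo.SyncState is an assumed bookmark table):
ALTER TABLE Trade ADD RowVer rowversion;

-- Later, read back only rows whose version changed since the last stored sync point
DECLARE @LastSyncVersion binary(8) =
    (SELECT LastRowVer FROM dbo.SyncState WHERE TableName = 'Trade');
SELECT rowid, tradeid, RowVer
FROM Trade
WHERE RowVer > @LastSyncVersion;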