Ignore Unique Constraint and still insert other rows - sql

I think I want the semantics of both UNIQUE and IGNORE_DUP_KEY.
I have an INSERT query that looks over recent data and inserts a unique key-value pair. It runs often and takes seconds at most.
I have another INSERT query that looks at all data and inserts unique key-value pairs. It takes minutes to run and probably finds nothing to do, except it will sometimes see the same data as the recent query, and will decide to insert the same pair.
I've implemented a UNIQUE constraint, so that's not a problem in itself, but I'd like other records determined by the long-running query to be inserted irrespective of the duplicates.
Both queries do explicitly have a clause similar to
WHERE NOT EXISTS (SELECT Key, Value From TargetTable TT
WHERE TT.Key = Result.Key AND TT.Value = Result.Value)

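If I understand correctly, you want something like MySQL's INSERT IGNORE. As far as I know SQL Server has no equivalent statement; the closest thing is the IGNORE_DUP_KEY option on a unique index, which discards duplicate rows with a warning instead of failing the whole statement. Your specific problem appears to be inserts into this (or another) table that occur while the long-running query is executing, introducing duplicate keys.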
One option is to put a lock on the table during this operation, not allowing any other operations. That is probably not feasible given the time frame for the lock.
Another option is to take the long running query and stash the results into a temporary table. Then, do the inserts from this table, one at a time, capturing and ignoring any violations of the unique constraint.
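The temporary-table option above could be sketched roughly as follows. This is only an illustration in Python, using the standard-library sqlite3 module as a stand-in for SQL Server; the table target_table, the columns pair_key and pair_value, and the helper insert_ignoring_duplicates are hypothetical names, and the staged list plays the role of the temporary table.

```python
import sqlite3

def insert_ignoring_duplicates(conn, rows):
    """Insert rows one at a time, skipping any that violate the unique constraint."""
    inserted, skipped = 0, 0
    for pair_key, pair_value in rows:
        try:
            with conn:  # each row commits (or rolls back) on its own
                conn.execute(
                    "INSERT INTO target_table (pair_key, pair_value) VALUES (?, ?)",
                    (pair_key, pair_value),
                )
            inserted += 1
        except sqlite3.IntegrityError:
            # Unique constraint violated: another query got there first.
            skipped += 1
    return inserted, skipped

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target_table (pair_key TEXT, pair_value TEXT, "
    "UNIQUE (pair_key, pair_value))"
)
# Row inserted by the quick query while the slow query was still running.
conn.execute("INSERT INTO target_table VALUES ('a', '1')")
conn.commit()

# Results of the long-running query, stashed before inserting.
staged = [("a", "1"), ("b", "2"), ("c", "3")]
result = insert_ignoring_duplicates(conn, staged)
```

The duplicate pair is skipped while the other two rows still go in. Row-by-row inserts are slower than a single set-based INSERT, which is acceptable here only because the slow query is expected to find few rows.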

I've decided to split the second query into two so that I now have three queries:
Quick query over the past 15 minutes; runs every 30 seconds.
Nearly as quick query looking back to midnight UTC; may have duplicate key failures, but retrying won't take long; runs every 2 minutes.
Slow query reviewing all data but ignoring the current UTC day; won't have duplicate key failures; runs twice a day.

Related

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within the last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
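To illustrate the "store only the diffs" idea, here is a small sketch in Python that compares two snapshots, such as the results of the two kinds of query above. The function diff_snapshots, the id key column and the sample rows are all hypothetical; in practice the comparison would more likely be pushed into the query itself rather than done client-side.

```python
def diff_snapshots(old_rows, new_rows, key="id"):
    """Compare two snapshots (lists of dict rows) and return only the differences."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        # Keys present only in the newer snapshot.
        "inserted": [new[k] for k in sorted(new.keys() - old.keys())],
        # Keys present only in the older snapshot.
        "deleted": [old[k] for k in sorted(old.keys() - new.keys())],
        # Keys present in both, but with different contents.
        "updated": [new[k] for k in sorted(new.keys() & old.keys()) if new[k] != old[k]],
    }

# Hypothetical snapshot from one hour ago and the current table contents.
hour_ago = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
current = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]
changes = diff_snapshots(hour_ago, current)
```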
An approach would be to have 3 tables:
a base table in "append only" mode: only inserts are added, and each update is written as a full new row, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if there is a special column kept in the first table)
a live table that holds the current data (this is where you would run your MERGE statements, most probably from the base table).
If you choose partitioning and clustering, you can benefit from the discounted long-term storage price and scan less data.
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases: for example, the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the table all day could be affordable, because the BQ engine will scan only one daily partition.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
From that I inferred that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones), but that you can modify the table UPDATE statements so that this column is added to the newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old it is, it will acquire two new columns, last_updated and last_updated_date, unless both columns were already added by a previous update, in which case the two columns will be updated rather than added. If there are several updates to the same row between update checks, the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check is stored in last_checked; where this piece of data is held (table, durable config) is implementation dependent.
discover whether the current check is the first check of the day. If so, additionally search for last_updated_date=yesterday AND last_updated>last_checked.
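The two steps above can be sketched in Python as follows. This is a conceptual illustration only: the function names partitions_to_scan and is_newly_updated are hypothetical, "first check of the day" is approximated as "the previous check happened before today", and it assumes checks run at least once a day.

```python
from datetime import date, datetime, timedelta

def partitions_to_scan(last_checked: datetime, now: datetime) -> list[date]:
    """Return the last_updated_date values the check needs to filter on."""
    today = now.date()
    if last_checked.date() < today:
        # First check of the day: updates made late yesterday, after the
        # previous check, would otherwise be missed.
        return [today - timedelta(days=1), today]
    return [today]

def is_newly_updated(last_updated: datetime, last_checked: datetime, now: datetime) -> bool:
    """Mirror of the row filter: partition date matches AND last_updated > last_checked."""
    in_scanned_partition = last_updated.date() in partitions_to_scan(last_checked, now)
    return in_scanned_partition and last_updated > last_checked

# First check after midnight scans yesterday and today.
first_check = partitions_to_scan(datetime(2024, 1, 1, 23, 55), datetime(2024, 1, 2, 0, 5))
# Later checks on the same day scan today only.
later_check = partitions_to_scan(datetime(2024, 1, 2, 0, 5), datetime(2024, 1, 2, 0, 35))
```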
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a full table scan. And subject to the ‘modest’ assumption made at the very beginning of my answer, the checks will address your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that were updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made it will be impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I’d suggest examining whether the existing partitioning is indispensable or whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have an identical structure to MyData, except that MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will then contain only a unique list of rows/values that have changed). The process then upserts/merges from MyDataUpdate into MyData, and the same rows are exported and downloaded to be loaded into PostgreSQL. Finally, the staging/update tables are emptied as appropriate.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
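A rough sketch of that staging flow, in Python with the standard-library sqlite3 module standing in for BigQuery. The table names come from the description above; the id and payload columns, the sample rows and the process_staging helper are hypothetical, and the date filter mentioned in the NOTE is not modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MyData        (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE MyDataUpdate  (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE MyDataStaging (id INTEGER, payload TEXT, batch_timestamp TEXT);

    INSERT INTO MyData VALUES (1, 'old'), (2, 'unchanged');
    -- Two versions of row 1 arrived before the scheduled process ran.
    INSERT INTO MyDataStaging VALUES
        (1, 'v1',  '2024-01-01T10:00:00'),
        (1, 'v2',  '2024-01-01T10:05:00'),
        (3, 'new', '2024-01-01T10:05:00');
""")

def process_staging(conn):
    """Merge staging -> update -> main table and return the changed rows."""
    with conn:
        # Keep only the latest version of each row, judged by batch_timestamp.
        conn.execute("""
            INSERT OR REPLACE INTO MyDataUpdate (id, payload)
            SELECT s.id, s.payload
            FROM MyDataStaging AS s
            WHERE s.batch_timestamp = (
                SELECT MAX(batch_timestamp) FROM MyDataStaging WHERE id = s.id
            )
        """)
        # These are the rows that would be exported to PostgreSQL.
        changed = conn.execute(
            "SELECT id, payload FROM MyDataUpdate ORDER BY id"
        ).fetchall()
        # Upsert the changed rows into the main table.
        conn.execute(
            "INSERT OR REPLACE INTO MyData (id, payload) "
            "SELECT id, payload FROM MyDataUpdate"
        )
        # Empty the staging and update tables for the next run.
        conn.execute("DELETE FROM MyDataStaging")
        conn.execute("DELETE FROM MyDataUpdate")
    return changed

changed_rows = process_staging(conn)
```

Only changed_rows needs to be shipped to PostgreSQL, so the large table is never scanned just to find out what changed.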

Get rows inserted since last check?

I am implementing a CQRS pattern where one or more processes are inserting records into the database and one or more processes are pulling them at a different pace.
I'd like consumer processes to poll the database for new records that were inserted since last check, but I'm not sure how to (safely) implement this.
You can assume that rows will not change once they are inserted. It seems it isn't enough for each row to have a unique id, and a timestamp indicating when it was inserted.
If I query for records with a timestamp greater than the last row I saw then I run into problems if multiple records were inserted at the same time (having the same timestamp).
If I query for records with an id greater than the last row I saw then I run into problems where concurrent transactions may commit IDs in non-increasing order (e.g. PostgreSQL sessions allocate and cache sequence IDs ahead of time to improve performance).
Ideally, I am looking for a DBMS-agnostic solution that lets me consume data as close to real time as possible. Any ideas?
Clarification: Each row should be consumed multiple times, once per consumer. Meaning, just because one consumer processes a row should not prevent other consumers from doing so. Each consumer will do something different with the same data.
Since you have a lot of data coming in and might have multiple records for the last timestamp, you need a way to keep track of the data read. Here are a few different approaches with their pros and cons:
You can wait for the data to come in for a time stamp. You would do this by not reading the MAX(timestamp) so you would get all the data from the table except the last one for which the data might still be coming in.
Pro: Simple design
Con: Not real time processing
You can store the ids you have read each time for the last timestamp. When getting the data, you can use a predicate like ((timestamp = lasttimestamp AND id NOT IN (set of ids)) OR timestamp > lasttimestamp)
Pro: Almost real time
Con: Additional storage required
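The second approach might look something like the sketch below, written in Python with the standard-library sqlite3 module. The events table, its id and ts columns and the poll function are hypothetical names; each consumer would keep its own (last_ts, seen_ids) pair so that every consumer sees every row.

```python
import sqlite3

def poll(conn, last_ts, seen_ids):
    """Return rows not yet seen, plus the updated (last_ts, seen_ids) state."""
    params = [last_ts]
    id_filter = ""
    if seen_ids:
        # Exclude ids already consumed at the last timestamp.
        id_filter = " AND id NOT IN (%s)" % ",".join("?" for _ in seen_ids)
        params.extend(sorted(seen_ids))
    params.append(last_ts)
    rows = conn.execute(
        f"SELECT id, ts FROM events WHERE (ts = ?{id_filter}) OR ts > ? ORDER BY ts, id",
        params,
    ).fetchall()
    if not rows:
        return rows, last_ts, seen_ids
    new_ts = rows[-1][1]
    new_seen = {row_id for row_id, ts in rows if ts == new_ts}
    if new_ts == last_ts:
        # Still on the same timestamp: keep remembering the earlier ids too.
        new_seen |= seen_ids
    return rows, new_ts, new_seen

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT NOT NULL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "2024-01-01T00:00:01"), (2, "2024-01-01T00:00:01")])
first, last_ts, seen = poll(conn, "", set())

# A late row with the same timestamp arrives, plus a newer one.
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(3, "2024-01-01T00:00:01"), (4, "2024-01-01T00:00:02")])
second, last_ts, seen = poll(conn, last_ts, seen)
```

This still assumes that no row ever becomes visible with a timestamp older than last_ts; a transaction that commits late with an earlier timestamp would be missed.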
If you don't use sharding or similar:
You can use optimistic locking.
For this you can create an order column with a unique index on the records table (the Log). Before each insertion, the producer queries the Log for the greatest order, increments it, and inserts the next record with this order.
If a concurrency exception occurs (i.e. Duplicate entry '12345' for key order) then you retry the entire process (query, increment, insert).
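A minimal sketch of that query, increment, insert, retry loop, in Python with the standard-library sqlite3 module. The log table, the ord column (named that way because ORDER is a reserved word), the payload column and the append_record helper with its max_retries parameter are hypothetical.

```python
import sqlite3

def append_record(conn, payload, max_retries=5):
    """Append a record with the next order value, retrying on a unique-index conflict."""
    for _ in range(max_retries):
        # Query the Log for the greatest order so far (0 if the Log is empty).
        (greatest,) = conn.execute("SELECT COALESCE(MAX(ord), 0) FROM log").fetchone()
        next_ord = greatest + 1
        try:
            with conn:
                conn.execute(
                    "INSERT INTO log (ord, payload) VALUES (?, ?)",
                    (next_ord, payload),
                )
            return next_ord
        except sqlite3.IntegrityError:
            # Another producer took this order value first: start over.
            continue
    raise RuntimeError("could not allocate an order value")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ord INTEGER NOT NULL UNIQUE, payload TEXT)")
orders = [append_record(conn, p) for p in ("a", "b", "c")]
```

Consumers can then poll with a predicate such as ord > last_seen_ord, since a record is only visible once its order value has been claimed through the unique index.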
If you use sharding or similar:
Then you will need an additional service/table that will generate a new, unique, always-increasing order integer every time it is asked to do so.
This has the disadvantage that there is another piece that must be managed, a single point of failure that must be highly-available.
P.S.
"sharding or similar" means that you can't have unique indexes on the entire table because you use sharding or you write to multiple tables.
You can't rely on timestamps or anything else that relates to physical time, because the system time may be adjusted by an automated service (NTP) or by a human operator.

SQL - When was my table last changed?

I want to find when the last INSERT, UPDATE or DELETE statement was performed on a table (for now, in the future I want to do this in multiple tables) in an Oracle database.
I created a table and then updated one of its rows. Now I have the following query:
SELECT SCN_TO_TIMESTAMP(ora_rowscn) from test_table;
This query returns the timestamps of each row, and for each of them it gives the time when they were first created.
But the row that I've updated has the same timestamp as the others. Why? Shouldn't the timestamp be updated?
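ORA_ROWSCN is not the right solution for this. It is not necessarily reliable at the row level: by default Oracle tracks the SCN per data block rather than per row, so every row in a block reports the block's most recent change unless the table was created with ROWDEPENDENCIES. Moreover, it's not going to be useful at all for deleted rows.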
If you have a real need to know when DML changes were made to a table, you should look at Oracle's auditing feature.
An alternative is to use triggers to record when changes are made to the table. Since you say you only care about the time of the most recent change, you can just create a single-column table to record the time, and write a trigger that fires on any DML statement to maintain it. If you're doing this in a production environment or even just in one where more than one session might be modifying the table, you'd want to think about how it should work when concurrent changes are made. You could force the table to have at most one row, but that would serialize every change to the table. You could allow each session to insert a separate row and take the max value when querying it, but then you probably want to think about clearing out old rows from time to time.
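The trigger idea can be sketched as below, using Python's standard-library sqlite3 module as a stand-in for Oracle (Oracle's trigger syntax differs, and SQLite triggers fire once per row rather than once per statement). This follows the "one row per change, take the max when querying" variant; the test_table_changes table, the trigger names and the helpers last_change and prune_changes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_table (id INTEGER PRIMARY KEY, val TEXT);
    -- Single-column table recording when test_table was changed.
    CREATE TABLE test_table_changes (changed_at TEXT NOT NULL);

    CREATE TRIGGER test_table_ai AFTER INSERT ON test_table
    BEGIN
        INSERT INTO test_table_changes VALUES (strftime('%Y-%m-%d %H:%M:%f', 'now'));
    END;
    CREATE TRIGGER test_table_au AFTER UPDATE ON test_table
    BEGIN
        INSERT INTO test_table_changes VALUES (strftime('%Y-%m-%d %H:%M:%f', 'now'));
    END;
    CREATE TRIGGER test_table_ad AFTER DELETE ON test_table
    BEGIN
        INSERT INTO test_table_changes VALUES (strftime('%Y-%m-%d %H:%M:%f', 'now'));
    END;
""")

def last_change(conn):
    """Time of the most recent INSERT, UPDATE or DELETE, or None if there was none."""
    return conn.execute("SELECT MAX(changed_at) FROM test_table_changes").fetchone()[0]

def prune_changes(conn):
    """Clear out old rows, keeping only those with the latest timestamp."""
    with conn:
        conn.execute(
            "DELETE FROM test_table_changes "
            "WHERE changed_at < (SELECT MAX(changed_at) FROM test_table_changes)"
        )

before = last_change(conn)
conn.execute("INSERT INTO test_table (val) VALUES ('a')")
conn.execute("UPDATE test_table SET val = 'b' WHERE id = 1")
conn.execute("DELETE FROM test_table WHERE id = 1")
conn.commit()
after = last_change(conn)
```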

Is there any real world difference in the following two SQL statements (updates that result in no net change)

Say I have a table with 1 million rows, and let's say 50% of the values in a particular column are NULL (so 500k NULL and 500k non-NULL). I want to set all the rows to NULL.
Assume no indexing to simplify the domain.
UPDATE
MyTable
SET
MyColumn = NULL
or
UPDATE
MyTable
SET
MyColumn = NULL
WHERE
MyColumn IS NOT NULL
Logic dictates that the latter is more efficient. However, won't the optimiser realise the first is the same as the second, as the WHERE condition and the SET only reference MyColumn?
The optimizer decides how the rows are located; it does not decide which rows an UPDATE touches.
It will not add a WHERE clause for you.
When you ask SQL Server to update every row, it will update EVERY row.
It will also take a lot longer to do this because you're affecting every row, which I believe means it will affect your transaction log too.
Be VERY careful NOT to do this.
You will create exclusive locks on EVERY record in the entire table when this happens.
Even if the data is not actually changing, SQL Server will still update the record.
This might cause blocking or deadlocks on that table if another process tries to use it during that time.
I speak from experience: every night our main database table would lock up for 15 minutes while a process (someone else wrote) was updating the entire table... twice.
This caused all the other queries to wait for it to complete (some would time out).
Not even a simple SELECT statement could be run against it while it was updating.
The optimizer will not realize that the first is the same as the second.
You should use the second form. The first form will log the changes to the records that are not actually changed under some circumstances (but perhaps not in this particular case). Here is a good reference on this subject.
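The difference in rows touched is easy to observe. The sketch below uses Python's standard-library sqlite3 module rather than SQL Server, so it says nothing about SQL Server's locking or logging; it only shows that an engine reports every row as updated when there is no WHERE clause, even where the value was already NULL. The helper count_updated and the four sample rows are hypothetical.

```python
import sqlite3

def count_updated(where_clause=""):
    """Run the UPDATE against a fresh 4-row table and return the affected-row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (id INTEGER PRIMARY KEY, MyColumn TEXT)")
    # Half of the rows already hold NULL, as in the question.
    conn.executemany(
        "INSERT INTO MyTable (MyColumn) VALUES (?)",
        [("a",), (None,), ("b",), (None,)],
    )
    cur = conn.execute(f"UPDATE MyTable SET MyColumn = NULL {where_clause}")
    return cur.rowcount

all_rows = count_updated()
only_non_null = count_updated("WHERE MyColumn IS NOT NULL")
```

Here all_rows counts every row in the table, while only_non_null counts just the rows that actually needed changing.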

How many times is a trigger in MySQL called?

I have a trigger on INSERT in MySQL 5.1. I want to know how many times per second it is called. How can I do this?
Your best bet is to keep inserting into a table.
INSERT INTO trigger_log(query) VALUES(?)
This table has a datetime column that will automatically be updated, then you can do various queries to determine how many times/minute or hour, what period had the highest number of calls, etc.
Otherwise just update a table that has a column for day, hour, min, counter and just increment the counter for the current day/hour/min.
I don't like the second one as much as there is so much potential information being lost, but it would do what you want also.
There is no way to directly count how many times a trigger fires on inserts. You could analyse the log files, or you could alter your trigger (as it acts on insert) to write an entry to a log table with an auto_increment id and a datetime. You can then analyze this table for any statistics.
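As a rough illustration of the log-table approach, here is a sketch in Python using the standard-library sqlite3 module in place of MySQL 5.1 (trigger and auto-increment syntax differ slightly between the two). The trigger_log table name comes from the answer above; the orders table, the called_at column and the trigger name orders_ai are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE trigger_log (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        called_at TEXT NOT NULL
    );
    -- The existing INSERT trigger, extended to log each call.
    CREATE TRIGGER orders_ai AFTER INSERT ON orders
    BEGIN
        INSERT INTO trigger_log (called_at) VALUES (datetime('now'));
    END;
""")

conn.executemany(
    "INSERT INTO orders (item) VALUES (?)",
    [("a",), ("b",), ("c",), ("d",), ("e",)],
)
conn.commit()

# datetime('now') has one-second resolution, so grouping by it gives calls per second.
calls_per_second = conn.execute(
    "SELECT called_at, COUNT(*) FROM trigger_log GROUP BY called_at ORDER BY called_at"
).fetchall()
total_calls = sum(count for _, count in calls_per_second)
```

Grouping by a coarser prefix of called_at would give calls per minute or per hour from the same table.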