DBT for BigQuery - Why is there no delete+insert by a primary key? - google-bigquery

This is more of a theoretical question:
I have a scenario where I wish to do a delete + insert from a source table to a target table in DBT. (Match by PK, delete existing records then insert).
DBT doesn't seem to support this incremental strategy for BigQuery (it does for Snowflake).
Instead it offers an insert+overwrite strategy that deletes and re-inserts a given partition, which doesn't solve my specific need.
Is there a reasoning behind this?

I think you want merge, which should accomplish the same thing as a delete and insert, but with better performance. BQ docs on merge statements.
delete+insert is only supported for Snowflake to handle an edge case where the PK of the table is not actually unique. source. On Snowflake, merge is also preferred and more performant.
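As a rough sketch of what the merge strategy looks like in a dbt model for BigQuery (the source name, the id key, and the updated_at filter below are hypothetical placeholders, not anything from the question):

-- A hypothetical incremental model: with incremental_strategy='merge',
-- dbt-bigquery compiles the run into a MERGE keyed on unique_key, so rows
-- matching on the key are updated and new rows are inserted.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
) }}

select id, payload, updated_at
from {{ source('my_source', 'source_table') }}
{% if is_incremental() %}
  -- only pick up rows newer than what the target already holds
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}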

Related

Can I prevent duplicate data in bigquery?

I'm playing with BQ: I created a table and inserted some data. I re-inserted it and it created duplicates. I'm sure I'm missing something, but is there something I can do to ignore the insert if the data already exists in the table?
My use case is that I get a stream of data from various clients, and sometimes their data will include records they have already sent previously (I have no control over what they submit).
Is there a way to prevent duplicates when certain conditions are met? The easy case is when the entire row is identical, but what about when only certain columns match?
It's difficult to answer your question without a clear idea of the table structure, but it sounds like you could be interested in the MERGE statement: ref here.
With this DML statement you can perform a mix of INSERT, UPDATE, and DELETE operations in one pass, and hence do exactly what you are describing.
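For example, here is a sketch of an insert-only MERGE that skips rows already present in the target (the dataset, table, and column names are made up for illustration):

-- Only rows whose (client_id, event_id) key is not already in the target get inserted.
MERGE mydataset.events t
USING mydataset.incoming_batch s
ON t.client_id = s.client_id AND t.event_id = s.event_id
WHEN NOT MATCHED THEN
  INSERT (client_id, event_id, payload)
  VALUES (s.client_id, s.event_id, s.payload);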

how can I update a table in one schema to match a table in a second schema

How can I update a table in one schema to match a table in a second schema, assuming the only difference is additional fields and indexes in the second? I do not want to change any of the data in the table. I'm hoping to do it without laboriously identifying the missing fields.
An elegant solution to this can be a DDL trigger that fires on an ALTER or CREATE ddl_event and applies the same changes to the first table (in one schema) as were made to the second table (in the other schema), in the same transaction.
Link --> https://docs.oracle.com/cd/E11882_01/appdev.112/e25519/triggers.htm#LNPLS2008
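A rough sketch of the capture side (the schema names, log table, and TABLE filter are assumptions for illustration; note that Oracle restricts issuing DDL directly from inside a system trigger, so a common variant of this idea is to log the captured statement here and replay it against the other schema from a scheduler job):

CREATE OR REPLACE TRIGGER schema_b.capture_table_ddl
AFTER CREATE OR ALTER ON schema_b.SCHEMA
DECLARE
  l_lines ora_name_list_t;   -- the triggering DDL arrives in chunks
  l_count PLS_INTEGER;
  l_stmt  CLOB;
BEGIN
  IF ora_dict_obj_type = 'TABLE' THEN
    l_count := ora_sql_txt(l_lines);
    FOR i IN 1 .. l_count LOOP
      l_stmt := l_stmt || l_lines(i);
    END LOOP;
    -- Log the statement; a job then replays it against the copy in the other schema.
    INSERT INTO schema_b.ddl_mirror_log (captured_at, object_name, ddl_text)
    VALUES (SYSTIMESTAMP, ora_dict_obj_name, l_stmt);
  END IF;
END;
/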
A little-known but interesting recent addition to the Oracle DBMS arsenal is DBMS_COMPARISON.
https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/d_comparison.htm
I haven't tried it myself, but according to the document it should be able to get you the information, at least, without having to do any heavy scripting.
I've been doing this sort of thing since Oracle7 and always had to resort to complex scripting.
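A sketch of how such a comparison might be set up (the schema, table, and comparison names are placeholders; with no database link given, both objects are taken from the local database):

-- Create the comparison once.
BEGIN
  DBMS_COMPARISON.create_comparison(
    comparison_name    => 'CMP_MY_TABLE',
    schema_name        => 'SCHEMA_A',
    object_name        => 'MY_TABLE',
    remote_schema_name => 'SCHEMA_B',
    remote_object_name => 'MY_TABLE');
END;
/

-- Run it and report whether the two tables are in sync.
DECLARE
  l_scan    DBMS_COMPARISON.comparison_type;
  l_in_sync BOOLEAN;
BEGIN
  l_in_sync := DBMS_COMPARISON.compare(
                 comparison_name => 'CMP_MY_TABLE',
                 scan_info       => l_scan,
                 perform_row_dif => TRUE);
  DBMS_OUTPUT.put_line(CASE WHEN l_in_sync THEN 'In sync' ELSE 'Differences found' END
                       || ', scan id ' || l_scan.scan_id);
END;
/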

How Can I Maintain a Unique Identifier Amongst Multiple Database Tables?

I have been tasked with creating history tables for an Oracle 11g database. I have proposed something very much like the record-based solution in the first answer of this post: What is the best way to keep changes history to database fields?
Then my boss suggested that, because some tables are clustered (i.e. some data from table 1 is related to table 2; think of this as the format the tables were in before they were normalised), he would like a version number that is maintained across all the tables at this cluster level. The suggested way to generate the version number is with SYS_GUID: http://docs.oracle.com/cd/B12037_01/server.101/b10759/functions153.htm.
I thought about doing this with triggers, so that when one of these tables is updated the other tables' version numbers are updated as well, but I can see some issues with this, such as the following:
How can I stop the trigger on one table from, in turn, firing the trigger on the other table? (We would end up calling triggers forever here.)
How can I avoid race conditions? (i.e. when tables 1 and 2 are updated at the same time, how do I know which is the latest version number?)
I am pretty new to Oracle database development so some suggestions about whether or not this is a good idea/if there is a better way of doing this would be great.
I think the thing you're looking for is a sequence: http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_6015.htm#SQLRF01314
The tables could take their numbers from a defined sequence independently, so no race conditions or triggers should be needed on your side.
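For example, a minimal sketch with one shared sequence (the history table and column names are hypothetical):

-- A single sequence shared by all the clustered tables' history rows.
CREATE SEQUENCE cluster_version_seq;

-- Each history insert draws its version number from the same sequence,
-- so the numbers are unique across the cluster without any triggers.
INSERT INTO table1_history (version_no, table1_id, changed_at)
VALUES (cluster_version_seq.NEXTVAL, 42, SYSTIMESTAMP);

INSERT INTO table2_history (version_no, table2_id, changed_at)
VALUES (cluster_version_seq.NEXTVAL, 7, SYSTIMESTAMP);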
The short answer to your first question is "No, you cannot." There is no way for a user to stop a trigger once it has started firing. The only workaround I can imagine is some kind of locking table, for example an intermediate table whose row you select for update from all of the clustered tables. But that is a really bad approach, for the reason you already raise in your second question: it will cause dreadful concurrency issues.
On your second question, you are quite right. Having different triggers on the different original tables update the same audit table will cause serious contention. It's also worth bearing in mind how triggers work: their changes are committed when the rest of the transaction commits. So if all the related tables update the same audit table, especially the same row, at the same time, you lose much of the benefit of the relational design. One benefit of normalisation is the performance gain from updates on different tables not contending with each other; if you synchronise the different tables' operations through one audit table, it will end up behaving like a flat file. So my suggestion would be to try your best to persuade your boss to accept your original proposal.
However, if your application always updates these clustered tables in a single transaction and writes one audit record, you could write a stored procedure that updates the entities first and writes the audit row at the end of the transaction, using a sequence to generate the audit table's id. Done that way there won't be any contention.
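A minimal sketch of that pattern, assuming hypothetical tables TABLE1, TABLE2, and AUDIT_LOG, a sequence AUDIT_SEQ, and made-up column names:

CREATE SEQUENCE audit_seq;

CREATE OR REPLACE PROCEDURE update_cluster (
  p_id       IN NUMBER,
  p_new_name IN VARCHAR2,
  p_new_qty  IN NUMBER
) AS
  l_audit_id NUMBER;
BEGIN
  -- Update the clustered tables first...
  UPDATE table1 SET name = p_new_name WHERE id = p_id;
  UPDATE table2 SET qty  = p_new_qty  WHERE table1_id = p_id;

  -- ...then write exactly one audit row for the whole change, keyed by the sequence.
  SELECT audit_seq.NEXTVAL INTO l_audit_id FROM dual;
  INSERT INTO audit_log (audit_id, cluster_id, changed_at, change_desc)
  VALUES (l_audit_id, p_id, SYSTIMESTAMP, 'name/qty updated');

  -- The caller commits once, so the updates and the audit row become visible together.
END update_cluster;
/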

Best approach for Oracle database mass update

I need to do a mass update of a table (~30,000 records) in an Oracle 10g database. The challenge is there is no "where" clause that can select the target rows. Each target row can, however, be identified via a composite key. The catch here is that the list of composite keys is from an external source (not in the database).
Currently I have a Java program that loops through the list of composite keys and spits out a PL/SQL procedure which is essentially just a bunch of repeated update statements similar to the following:
update table1 t set myfield='Updated' where t.comp_key1='12345' and t.comp_key2='98765';
Is there a better way to do this, or is this "good enough" considering we're only dealing with ~30K records?
Possibly good enough, but if the external source of the keys is a file, you could create an external table pointing at the file to expose its contents as a relational table, and then potentially do the whole thing in a single merge (update) statement.
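A sketch of that approach, reusing the table and column names from the question; the directory object, file name, and key lengths are assumptions:

-- External table exposing the key file as rows.
CREATE TABLE ext_target_keys (
  comp_key1 VARCHAR2(20),
  comp_key2 VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY key_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('keys.csv')
);

-- One set-based statement instead of ~30,000 single-row updates.
MERGE INTO table1 t
USING ext_target_keys k
ON (t.comp_key1 = k.comp_key1 AND t.comp_key2 = k.comp_key2)
WHEN MATCHED THEN
  UPDATE SET t.myfield = 'Updated';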
Good enough.
30,000 updates using the primary key, even if they are all hard-parsed, will normally take only a few seconds. You could probably speed things up by combining the updates, as @Ed Gibbs suggested, but so far this looks like a very quick process that isn't worth optimizing. Putting it all in a single PL/SQL procedure was a smart move and saved 99% of the time that a really naive, row-by-row-from-the-client solution would have needed.

Skipping primary key conflicts with SQL copy

I have a large collection of raw data (around 300 million rows) with about 10% duplicated data. I need to get the data into a database. For the sake of performance I'm trying to use SQL copy. The problem is that when I commit the data, primary key exceptions prevent any of it from being processed. Can I change the behavior of the primary key so that conflicting data is simply ignored, or replaced? I don't really care either way - I just need one unique copy of each piece of data.
I think your best bet would be to drop the constraint, load the data, then clean it up and reapply the constraint.
That's what I was considering doing, but I was worried about the performance of getting rid of 30 million randomly placed rows in a 300-million-row table. The duplicate data also has a spatial relationship, which is why I wanted to try to fix the problem while loading the data rather than after it is all loaded.
Use a select statement to select exactly the data you want to insert, without the duplicates.
Use that as the basis of a CREATE TABLE XYZ AS SELECT * FROM (query-just-non-dupes).
You might check out AskTom's ideas on how to select the non-duplicate rows.
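One common shape for that query is a ROW_NUMBER() window over the key; the sketch below keeps one row per key from a raw staging table so the primary key can then be applied to the result (the table, column, and key names are illustrative, not from the question):

CREATE TABLE readings_clean AS
SELECT id, sensor, reading, recorded_at
FROM (
  SELECT r.*,
         -- number the copies within each key; keep only the first
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY recorded_at) AS rn
  FROM readings_raw r
) t
WHERE rn = 1;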