BigQuery: concurrent MERGE with INSERT and UPDATE -> duplicate inserts

I'm contributing to a Kafka connector that loads data into BigQuery.
It uses a temporary table (my_tmp_tmp) and a destination table (destination_tbl).
Data is loaded into destination_tbl through a MERGE:
https://github.com/confluentinc/kafka-connect-bigquery/blob/d5f4eaeffa683ad8813a337cfeb66b5344e6dd91/kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/MergeQueries.java#L216
The MERGE statement dedups the incoming batch and performs both inserts and updates; a simplified sketch of its shape is shown below.
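Roughly, the generated statement has the following shape (my own simplified sketch, not the exact SQL built by MergeQueries.java; record_key, payload and partition_time are placeholder columns):
MERGE destination_tbl AS dst
USING (
  -- dedup: keep only the latest row per key from the batch in the temporary table
  SELECT record_key, payload
  FROM (
    SELECT record_key, payload,
           ROW_NUMBER() OVER (PARTITION BY record_key ORDER BY partition_time DESC) AS rn
    FROM my_tmp_tmp
  )
  WHERE rn = 1
) AS src
ON dst.record_key = src.record_key
WHEN MATCHED THEN
  UPDATE SET payload = src.payload
WHEN NOT MATCHED THEN
  INSERT (record_key, payload) VALUES (src.record_key, src.payload);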
However:
on the first load, nothing is in the table yet, so every request's MERGE only exercises its INSERT clause,
and if two merges run at the same time (many workers, plus retries),
the same records might be inserted twice. The documentation confirms this: "A MERGE DML statement does not conflict with other concurrently running DML statements as long as the statement only inserts rows and does not delete or update any existing rows. This can include MERGE statements with UPDATE or DELETE clauses, as long as those clauses aren't invoked when the query runs." (source). I also see this happen in practice, leading to duplicates.
Duplicates defeat the whole point of running a MERGE (compared to a solution that runs plain INSERTs and dedups later).
Since this is a live dataset (being queried by users), duplicates would break its integrity, so we need a solution at the sink/BigQuery level.
Is it possible to make the merge statement always conflict with others so this doesn't happen? Any other solution?
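One idea, as a sketch only (untested): seed destination_tbl with a sentinel row before any load and have every MERGE update it, so the UPDATE clause is always invoked and the statement can no longer qualify as insert-only. Whether BigQuery then reliably makes concurrent MERGEs conflict would have to be verified; the sentinel key value below is made up:
-- One-time setup: a sentinel row that every MERGE will touch (key value is made up)
INSERT INTO destination_tbl (record_key, payload) VALUES ('__merge_lock__', NULL);

-- Per-batch MERGE: same shape as the sketch above, but the source always contains
-- the sentinel, so the UPDATE clause fires on every run, not only when real keys match
MERGE destination_tbl AS dst
USING (
  SELECT record_key, payload
  FROM (
    SELECT record_key, payload,
           ROW_NUMBER() OVER (PARTITION BY record_key ORDER BY partition_time DESC) AS rn
    FROM my_tmp_tmp
  )
  WHERE rn = 1
  UNION ALL
  SELECT '__merge_lock__' AS record_key, NULL AS payload
) AS src
ON dst.record_key = src.record_key
WHEN MATCHED AND src.record_key = '__merge_lock__' THEN
  UPDATE SET payload = dst.payload   -- no-op update whose only purpose is to invoke the UPDATE clause
WHEN MATCHED THEN
  UPDATE SET payload = src.payload
WHEN NOT MATCHED THEN
  INSERT (record_key, payload) VALUES (src.record_key, src.payload);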

Related

Redshift: Support for concurrent inserts in the same table

I have Lambda code that fires several INSERT queries at the same table concurrently through the Redshift Data API.
1. Insert into Table ( select <some analytical logic> from someTable_1)
2. Insert into Table ( select <some analytical logic> from someTable_2)
3. Insert into Table ( select <some analytical logic> from someTable_n)
Considering such queries will be fired concurrently, does Redshift apply a lock to the Table for each insert? Or does it allow parallel insert queries in the same table?
I'm asking because postgres allows concurrent inserts.
https://www.postgresql.org/files/developer/concurrency.pdf
Both Redshift and Postgres use MVCC - https://en.wikipedia.org/wiki/Multiversion_concurrency_control - so they will likely behave the same. There are no write locks, just serial progression through the commit queue as the commits arrive. I've seen no functional problems with this in Redshift, so you should be good.
Functionally this is good, but Redshift is columnar and Postgres is row-based, which leads to differences on the write side. Since these INSERTs likely add only a small (for Redshift) number of rows, and the minimum write size on Redshift is 1 MB per column per slice, there is likely to be a lot of unused space in these blocks. If this is done often, the table will accumulate a lot of wasted space and need frequent vacuuming. If you can, look at this write pattern to see whether the insert data can be batched more (see the sketch below).
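For instance, a minimal sketch of that batching, keeping the question's placeholders (so not literally runnable as-is): fold the three statements into a single INSERT so new blocks are allocated once per statement rather than once per separate INSERT:
-- Combined INSERT (table names and <some analytical logic> are the question's placeholders)
INSERT INTO Table
SELECT <some analytical logic> FROM someTable_1
UNION ALL
SELECT <some analytical logic> FROM someTable_2
UNION ALL
SELECT <some analytical logic> FROM someTable_n;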
Based on the discussion in the comments, one can conclude that concurrent inserts into the same table in Redshift are blocking, unlike in Postgres.
Refer to the docs:
https://docs.aws.amazon.com/redshift/latest/dg/r_Serializable_isolation_example.html
Edit:
In case you are wondering which part of the above docs is relevant, it is pasted directly below:
Concurrent COPY operations into the same table
Transaction 1 copies rows into the LISTING table:
begin;
copy listing from ...;
end;
Transaction 2 starts concurrently in a separate session and attempts to copy more rows into the LISTING table. Transaction 2 must wait until transaction 1 releases the write lock on the LISTING table, then it can proceed.
begin;
[waits]
copy listing from ...;
end;
The same behavior would occur if one or both transactions contained an INSERT command instead of a COPY command.

How to make DELETE faster in a fast-changing (DELSERT) table in MonetDB?

I am using MonetDB (MDB) for OLAP queries. I am storing source data in PostgreSQL (PGSQL) and syncing it with MonetDB in batches written in Python.
In PGSQL there is a wide table with an ID (non-unique) and a few columns. Every few seconds a Python script takes a batch of 10k records changed in PGSQL and uploads them to MDB.
The process of uploading to MDB is as follows:
Create a staging table in MDB.
Use the COPY command to upload the 10k records into the staging table.
DELETE from the destination table all IDs that are present in the staging table.
INSERT into the destination table all rows from the staging table.
So it is basically a DELETE & INSERT (sketched below). I cannot use a MERGE statement because I do not have a PK: one ID can have multiple rows in the destination. So I need to do a delete and a full insert for all IDs currently being synced.
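A minimal sketch of one sync cycle, with assumed table and column names (staging, destination, id, val) and a made-up CSV path:
-- 1. create the staging table (schema assumed to mirror the destination)
CREATE TABLE staging (id INT, val DOUBLE);
-- 2. bulk-load the changed batch into the staging table
COPY 10000 RECORDS INTO staging FROM '/tmp/batch.csv' USING DELIMITERS ',';
-- 3. remove every destination row whose ID appears in the batch
DELETE FROM destination WHERE id IN (SELECT id FROM staging);
-- 4. re-insert the full current state of those IDs
INSERT INTO destination SELECT id, val FROM staging;
-- 5. drop the staging table until the next batch
DROP TABLE staging;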
Now to the problem: the DELETE is slow.
When I DELETE 10k records from a destination table of 25M rows, it takes about 500 ms.
However, if I first run a simple SELECT * FROM destination WHERE id = 1 and THEN do the DELETE, it takes 2 ms.
I think it has something to do with the automatic creation of auxiliary indices, but this is where my knowledge ends.
I tried to solve this by "pre-heating": doing the lookup myself first. It works, but only for the first DELETE after the pre-heat.
Once I do the DELETE and INSERT, the next DELETE is slow again. Pre-heating before each DELETE makes no sense, because the pre-heat itself takes 500 ms.
Is there any way to sync data into MDB without invalidating the auxiliary indices that were already built? Or to make the DELETE faster without the pre-heat? Or should I use a different technique to sync data into MDB without a PK (does MERGE have the same problem?).
Thanks!

BigQuery Atomicity

I am trying to do a daily full load of a table in BigQuery as part of an ETL. The target table has a dummy partition column of type integer and is clustered. I want the statement to be atomic, i.e. either it completely overwrites the old data with the new data, or it rolls back to the old data if anything fails in between, and it keeps serving user queries with the old data until the overwrite completes.
One way of doing this is delete and insert, but BigQuery does not support multi-statement transactions.
I am thinking of using the statement below. Please let me know if it is atomic.
create or replace table table_1
partition by dummy_int
cluster by dummy_column
as select col1, col2, col3 from stage_table1

Data flow insert lock

I have an issue with my data flow task locking. The task compares a couple of tables from the same server, and the result is inserted into one of the tables being compared. The table being inserted into is compared using a NOT EXISTS clause.
When performing a fast load the task freezes without errors; when doing a regular insert the task gives a deadlock error.
I have 2 other tasks that perform the same action against the same table and they work fine, but the amount of data being inserted is a lot smaller. I am not running these tasks in parallel.
I am considering using a NOLOCK hint to get around this, because this is the only task that writes to a certain table partition; however, I am only coming to this conclusion because I cannot figure out anything else, aside from using a temp table or a hashed anti join.
You probably have a so-called deadlock situation. Your Data Flow Task (DFT) has two separate connection instances to the same table. The first connection instance runs the SELECT and places a shared lock on the table; the second runs the INSERT and places a page or table lock.
A few words on the possible cause. An SSIS DFT reads table rows and processes them in batches. When the number of rows is small, the read completes within a single batch and the shared lock is released before the insert takes place. When the number of rows is substantial, SSIS splits them into several batches and processes them sequentially. This means steps following the DFT Data Source can run before the Data Source has finished reading.
The design of reading and writing the same table in the same Data Flow is not good because of the possible locking issue. Ways to work around it:
Move all DFT logic into a single INSERT statement and get rid of the DFT. Might not be possible.
Split the DFT: move the data into an intermediate table, then move it into the target table with a following DFT or SQL Command. An additional table is needed.
Enable Read Committed Snapshot Isolation (RCSI) on the DB and use Read Committed on the SELECT. Applicable to MS SQL only.
The most universal way is the second, with an additional table; the third works for MS SQL only. The first and third options are sketched below.
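A minimal sketch of those two options, assuming a SQL Server target and made-up object names (dbo.TargetTable, dbo.SourceTable, KeyCol, Payload, MyDb):
-- Option 1: one set-based INSERT replaces the whole Data Flow
INSERT INTO dbo.TargetTable (KeyCol, Payload)
SELECT s.KeyCol, s.Payload
FROM dbo.SourceTable AS s
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.TargetTable AS t
    WHERE t.KeyCol = s.KeyCol
);

-- Option 3: turn on RCSI so the SELECT side reads row versions instead of holding shared locks
ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;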

Does a MERGE INTO statement lock the entire table when merging into a partitioned table based on the partitioned columns?

After some performance issues occurred in our production environment, I asked for assistance from our database administrators. While helping, they informed me that merging locks the table and suggested I use an UPDATE statement instead.
From everything I've read, I was under the impression that MERGE INTO and UPDATE have similar incremental locking patterns. Below is an example of the sort of MERGE INTO statement that our application is using.
MERGE INTO sample_merge_table smt
USING (
    SELECT smt.*, sjt.*
    FROM sample_merge_table smt
    JOIN some_join_table sjt
        ON smt.index_value = sjt.secondary_index_value
    WHERE smt.partition_index = partitionIndex
) umt
ON (smt.partition_index = partitionIndex AND smt.index_value = umt.index_value)
WHEN MATCHED THEN
    UPDATE SET...
WHEN NOT MATCHED THEN
    INSERT VALUES...
Upon running this statement, what would the locking procedure actually be? Would each table involved in the USING select be locked? Would the sample_merge_table be completely locked, or only the partition that is being accessed? Would the UPDATE statement lock incrementally, or does the MERGE INTO itself already own the required locks?
The MERGE statement works on a row basis, but it acquires its locks beforehand, i.e. once statement execution has finished planning and the affected rows have been identified.
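One way to see what actually gets locked (assuming an Oracle database, which the MERGE INTO syntax suggests, and access to the v$ views) is to run the MERGE in one session and, while it is still executing, query the lock views from a second session; the object names below match the example above:
-- Run from a second session while the MERGE is executing
SELECT lo.session_id,
       o.object_name,
       lo.locked_mode        -- 3 = row-exclusive (SX), 6 = exclusive
FROM   v$locked_object lo
JOIN   dba_objects o
       ON o.object_id = lo.object_id
WHERE  o.object_name IN ('SAMPLE_MERGE_TABLE', 'SOME_JOIN_TABLE');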