BigQuery Atomicity - google-bigquery

I am trying to do a full load of a table in BigQuery daily, as part of ETL. The target table has a dummy partition column of type integer and is clustered. I want the statement to be atomic, i.e. either it completely overwrites the old data with the new data, or it rolls back to the old data if anything fails in between, and it keeps serving user queries with the old data until the new data is completely written.
One way of doing this is delete and insert, but BigQuery does not support multi-statement transactions.
I am thinking of using the statement below. Please let me know if this is atomic.
create or replace table table_1
partition by range_bucket(dummy_int, generate_array(0, 100, 1)) -- integer columns need range_bucket; the bounds here are placeholders
cluster by dummy_column
as select col1, col2, col3 from stage_table1

Related

Redshift: Support for concurrent inserts in the same table

I have Lambda code that fires some insert queries to the same table concurrently through the Redshift Data API.
1. Insert into Table ( select <some analytical logic> from someTable_1)
2. Insert into Table ( select <some analytical logic> from someTable_2)
3. Insert into Table ( select <some analytical logic> from someTable_n)
Considering that such queries will be fired concurrently, does Redshift apply a lock to the table for each insert? Or does it allow parallel insert queries into the same table?
I'm asking because Postgres allows concurrent inserts.
https://www.postgresql.org/files/developer/concurrency.pdf
Both Redshift and Postgres use MVCC - https://en.wikipedia.org/wiki/Multiversion_concurrency_control - so they will likely work the same. There are no write locks, just serial progression through the commit queue as the commits are seen. I've seen no functional problems with this in Redshift, so you should be good.
Functionally this is good, but Redshift is columnar and Postgres is row-based, which leads to differences on the writing side. Since these INSERTs are likely only adding a small (for Redshift) number of rows, and the minimum write size on Redshift is 1MB per column per slice, there is likely to be a lot of unused space in these blocks. If this is done often there will be a lot of wasted space in the table and a large need to vacuum. If you can, look at this write pattern and see whether more batching of the insert data can be done, as in the sketch below.
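A minimal sketch of that batching idea, assuming the analytical logic can be written as plain SELECTs; the column names are placeholders, and the table names are the ones from the question:
-- One combined insert commits once and packs rows into fewer 1MB blocks
-- than several small concurrent INSERTs ("Table" is quoted because TABLE is a reserved word)
INSERT INTO "Table"
SELECT col_a, SUM(col_b) FROM someTable_1 GROUP BY col_a
UNION ALL
SELECT col_a, SUM(col_b) FROM someTable_2 GROUP BY col_a
UNION ALL
SELECT col_a, SUM(col_b) FROM someTable_n GROUP BY col_a;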
Based on the discussion in the comments, one can conclude that concurrent inserts into the same table in Redshift are blocking in nature, as opposed to Postgres.
Refer to the docs:
https://docs.aws.amazon.com/redshift/latest/dg/r_Serializable_isolation_example.html
Edit:
FYI, if you are wondering what exactly to look for in the docs mentioned above, I am pasting it directly below:
Concurrent COPY operations into the same table
Transaction 1 copies rows into the LISTING table:
begin;
copy listing from ...;
end;
Transaction 2 starts concurrently in a separate session and attempts to copy more rows into the LISTING table. Transaction 2 must wait until transaction 1 releases the write lock on the LISTING table, then it can proceed.
begin;
[waits]
copy listing from ...;
end;
The same behavior would occur if one or both transactions contained an INSERT command instead of a COPY command.

Why is CREATE TABLE AS SELECT faster than INSERT with SELECT?

I ran a query with an INNER JOIN and the result was 12 million rows.
I would like to put this into a table.
I did some tests, and creating the table with the AS SELECT clause was much faster than creating the table first and running an INSERT with SELECT afterwards.
I don't understand why.
Can somebody explain this to me?
Thanks
If you use 'create table as select' (CTAS)
CREATE TABLE new_table AS
SELECT *
FROM old_table
you automatically do a direct-path insert of the data. If you do an
INSERT INTO new_table
SELECT *
FROM old_table
you do a conventional insert. You have to use the APPEND hint if you want a direct-path insert instead. So you have to do
INSERT /*+ APPEND */ INTO new_table
SELECT *
FROM old_table
to get a similar performance as in 'CREATE TABLE AS SELECT'.
How does the usual conventional insert work?
Oracle checks the free list of the table for an already used block of the table segment that still has free space. If the block isn't in the buffer cache it is read into the buffer cache. Eventually this block is written back to disk.
During this process undo for the block is written (only a small amount of data is necessary here) and data structures are updated, e.g. if necessary the free list, which resides in the segment header, and all these changes are written to the redo buffer, too.
How does a direct-path insert work?
The process allocates space above the high water mark of the table, that is, beyond the already used space. It writes the data directly to disk, without using the buffer cache. It is also written to the redo buffer. When the transaction is committed, the high water mark is raised beyond the newly written data and this data becomes visible to other sessions.
How can I improve CTAS and direct-path inserts?
You can create the table in NOLOGGING mode, then no redo information is written. If you do this, you should make a backup of the tablespace that contains the table after the insert, otherwise you cannot recover the table if you need to.
You can do the select in parallel
You can do the insert in parallel
If you have to maintain indexes and constraints, or even triggers, during an insert operation, this can slow down your insert drastically. So you should avoid this and create indexes after the insert, and maybe create constraints with NOVALIDATE. A sketch of these tips follows below.
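A minimal Oracle sketch combining those tips; the table, column and constraint names are placeholders, not from the question:
-- NOLOGGING CTAS with a parallel select (back up the tablespace afterwards)
CREATE TABLE new_table PARALLEL 4 NOLOGGING AS
SELECT /*+ PARALLEL(old_table, 4) */ *
FROM old_table;
-- build indexes only after the data is loaded
CREATE INDEX new_table_col1_ix ON new_table (col1);
-- add constraints without validating the rows that are already there
ALTER TABLE new_table ADD CONSTRAINT new_table_col1_nn
CHECK (col1 IS NOT NULL) ENABLE NOVALIDATE;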
With SELECT ... INTO the table you create has no primary key, no indexes, no identity ... and the columns always allow NULL.
It also does not have to be fully written to the transaction log (and therefore generates far less to roll back). It seems like a "naked table".
With INSERT ... SELECT the table must be created beforehand, so when you create the table you can define keys, indexes, identity ... and it will use the transaction log.
When applied to large amounts of data, that is very slow.
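A short T-SQL illustration of that difference, assuming the answer is describing SQL Server's SELECT ... INTO; all object names here are hypothetical:
-- SELECT ... INTO creates a bare table on the fly
SELECT col1, col2
INTO new_table
FROM old_table;
-- INSERT ... SELECT needs the table (with keys, indexes, identity) to exist first,
-- and every row goes through those structures and the transaction log
CREATE TABLE new_table2 (
    id   INT IDENTITY(1,1) PRIMARY KEY,
    col1 INT,
    col2 VARCHAR(50)
);
INSERT INTO new_table2 (col1, col2)
SELECT col1, col2 FROM old_table;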

SQL: Get real time data from different database

I want to insert data from database1 into database2 when something changes. Like a trigger but the thing is I am not allowed to create a trigger in database1.
I basically need to insert certain data from a table in one database into another database as the changes happen.
Any suggestions would be great!
Thanx
You can write a program to do that; there are many other ways, but I think this is the most straightforward and easy one.
For example, if you work with SQL Server you can write C# code that connects to the two databases, checks for new data in the first one, and then sends it to the second.
For example:
you can create a trigger on the tables of the first DB that you want to watch, then create a web service that checks whether the trigger fired; it will get the data from the first database and send it to the second. This is easier to enhance later, and you can change the code to do whatever you want without making any changes on the database side.
MS SQL Server Replication
I think Transactional Replication is suitable for you.
Transactional replication is asynchronous but close to real time.
What about the MERGE statement?
MERGE will commit changes from the source table into the target table when something changes.
As you cannot use a trigger, this cannot be a real-time process, but with a process scheduler like SQL Agent you can run it every 10 seconds, depending on your table size.
https://msdn.microsoft.com/en-us/library/bb510625.aspx
You can do it when your source comes from multiple tables by using a CTE, like this:
WITH TableSources AS
(
    SELECT id, Column1, Column2 FROM table1
    UNION
    SELECT id, Column1, Column2 FROM table2
)
MERGE INTO TargetTable
USING TableSources ON TargetTable.id = TableSources.id
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, Column1, Column2) VALUES (TableSources.id, TableSources.Column1, TableSources.Column2)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET Column1 = TableSources.Column1, Column2 = TableSources.Column2
;

Populating a table from a view in Oracle with "locked" truncate/populate

I would like to populate a table from a (potentially large) view on a scheduled basis.
My process would be:
Disable indexes on table
Truncate table
Copy data from view to table
Enable indexes on table
In SQL Server, I can wrap the process in a transaction such that when I truncate the table a schema modification lock will be held until I commit. This effectively means that no other process can insert/update/whatever until the entire process is complete.
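For reference, a minimal T-SQL sketch of that SQL Server process (object names are placeholders, not from the question):
BEGIN TRANSACTION;
    -- disable nonclustered indexes only; the clustered index has to stay enabled
    ALTER INDEX ix_target_col1 ON target_table DISABLE;
    TRUNCATE TABLE target_table;   -- the Sch-M lock is held until COMMIT
    INSERT INTO target_table (col1, col2)
    SELECT col1, col2 FROM source_view;
    ALTER INDEX ix_target_col1 ON target_table REBUILD;
COMMIT TRANSACTION;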
However I am aware that in Oracle the truncate table statement is considered DDL and will thus issue an implicit commit.
So my question is how can I mimic the behaviour of SQL Server here? I don't want any other process trying to insert/update/whatever whilst I am truncating and (re)populating the table. I would also prefer my other process to be unaware of any locks.
Thanks in advance.
Make your table a partitioned table with a single partition and local indexes only. Then whenever you need to refresh:
Copy data from view into a new temporary table
CREATE TABLE tmp AS SELECT ... FROM some_view;
Exchange the partition with the temporary table:
ALTER TABLE some_table
EXCHANGE PARTITION part WITH TABLE tmp
WITHOUT VALIDATION;
The table is only locked for the duration of the partition exchange, which, without validation and global index update, should be instant.
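An end-to-end sketch of that setup; the column names, data types and index are placeholders, and the single partition is called part as above:
-- one-off setup: a single-partition table with local indexes only
CREATE TABLE some_table (
  id  NUMBER,
  val VARCHAR2(100)
)
PARTITION BY RANGE (id)
( PARTITION part VALUES LESS THAN (MAXVALUE) );
CREATE INDEX some_table_val_ix ON some_table (val) LOCAL;
-- each refresh cycle
CREATE TABLE tmp AS SELECT id, val FROM some_view;
CREATE INDEX tmp_val_ix ON tmp (val);   -- must match the local index definition
ALTER TABLE some_table
EXCHANGE PARTITION part WITH TABLE tmp
INCLUDING INDEXES WITHOUT VALIDATION;
DROP TABLE tmp;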

How to figure out which record has been deleted in an efficient way?

I am working on an in-house ETL solution, from db1 (Oracle) to db2 (Sybase). We need to transfer data incrementally (Change Data Capture?) into db2.
I have only read access to tables, so I can't create any table or trigger in Oracle db1.
The challenge I am facing is, how to detect record deletion in Oracle?
The solution which I can think of, is by using additional standalone/embedded db (e.g. derby, h2 etc). This db contains 2 tables, namely old_data, new_data.
old_data contains the primary key field from the table of interest in Oracle.
Every time the ETL process runs, the new_data table will be populated with the primary key field from the Oracle table. After that, I will run the following SQL command to get the deleted rows:
SELECT old_data.id FROM old_data WHERE old_data.id NOT IN (SELECT new_data.id FROM new_data)
I think this will be a very expensive operation when the volume of data becomes very large. Do you have any better ideas for doing this?
Thanks.
Which edition of Oracle? If you have Enterprise Edition, look into Oracle Streams.
You can grab the deletes out of the redo log rather than the database itself.
One approach you could take is using the Oracle flashback capability (if you're using version 9i or later):
http://forums.oracle.com/forums/thread.jspa?messageID=2608773
This will allow you to select from a prior database state.
If there may not always be deleted records, you could be more efficient by:
Storing a row count with each query iteration.
Comparing that row count to the previous row count.
If they are different, you know you have a delete and you have to compare the current set with the historical data set from flashback. If not, then don't bother and you've saved a lot of cycles.
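A minimal sketch of that flashback comparison; the table name src_table, the id column and the one-day window are all placeholders:
-- ids that existed a day ago but are gone now, i.e. the deleted records
SELECT id
FROM src_table AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' DAY)
MINUS
SELECT id
FROM src_table;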
A quick note on your solution if flashback isn't an option: I don't think your select query is a big deal - it's all those inserts to populate those side tables that will really take a lot of time. Why not just run that query against the Sybase production server before doing your update?
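If you do keep the key-comparison query, a NOT EXISTS anti-join is usually safer than NOT IN when the key can be NULL, and optimizers tend to handle it better; a minimal sketch using the question's old_data/new_data tables:
SELECT o.id
FROM old_data o
WHERE NOT EXISTS (
    SELECT 1 FROM new_data n WHERE n.id = o.id
);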