Redshift: Support for concurrent inserts in the same table - sql

I have Lambda code that fires several insert queries at the same table concurrently through the Redshift Data API.
1. Insert into Table ( select <some analytical logic> from someTable_1)
2. Insert into Table ( select <some analytical logic> from someTable_2)
3. Insert into Table ( select <some analytical logic> from someTable_n)
Considering such queries will be fired concurrently, does Redshift apply a lock to the Table for each insert? Or does it allow parallel insert queries in the same table?
I'm asking because postgres allows concurrent inserts.
https://www.postgresql.org/files/developer/concurrency.pdf

Both Redshift and Postgres use MVCC - https://en.wikipedia.org/wiki/Multiversion_concurrency_control - so they will likely behave the same way. There are no write locks, just serial progression through the commit queue as the commits arrive. I've seen no functional problems with this in Redshift, so you should be good.
Functionally this is fine, but Redshift is columnar and Postgres is row-based, which leads to differences on the write side. Since these INSERTs are likely adding only a small (for Redshift) number of rows, and the minimum write size on Redshift is 1 MB per column per slice, there is likely to be a lot of unused space in these blocks. If this is done often, the table will accumulate a lot of wasted space and need frequent vacuuming. If you can, look at this write pattern to see whether more batching of the insert data can be done, as in the sketch below.
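For example, a minimal batching sketch (the someTable_1/someTable_2/someTable_n names come from the question; the selected columns are hypothetical placeholders for the analytical logic, and target_table stands in for the question's Table):
-- One INSERT covering several sources instead of n separate INSERTs,
-- so Redshift writes fewer, fuller 1 MB blocks per column per slice.
insert into target_table
select id, sum(metric) as metric from someTable_1 group by id
union all
select id, sum(metric) as metric from someTable_2 group by id
union all
select id, sum(metric) as metric from someTable_n group by id;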

Based on the discussion in the comments, one can conclude that concurrent inserts into the same table in Redshift are blocking, unlike in Postgres.
Refer to the docs:
https://docs.aws.amazon.com/redshift/latest/dg/r_Serializable_isolation_example.html
Edit:
FYI, if you are wondering what exactly to look for in the docs linked above, I am pasting the relevant part directly below:
Concurrent COPY operations into the same table
Transaction 1 copies rows into the LISTING table:
begin;
copy listing from ...;
end;
Transaction 2 starts concurrently in a separate session and attempts to copy more rows into the LISTING table. Transaction 2 must wait until transaction 1 releases the write lock on the LISTING table, then it can proceed.
begin;
[waits]
copy listing from ...;
end;
The same behavior would occur if one or both transactions contained an INSERT command instead of a COPY command.
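If you want to observe this blocking on your own cluster, Redshift exposes lock and transaction state in system views; a quick, hedged check from a third session might look like this (verify the views' columns on your Redshift version):
-- List current table locks and in-flight transactions while transaction 2 waits.
select * from stv_locks;
select * from svv_transactions;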

Related

BigQuery, concurrent MERGE with Insert and Update -> insert duplicate

I'm contributing to a Kafka Connector loading data onto BigQuery.
It has a temporary table (my_tmp_tmp) and a destination table (detionation_tbl).
The way the data is loaded into detionation_tbl is through a MERGE
https://github.com/confluentinc/kafka-connect-bigquery/blob/d5f4eaeffa683ad8813a337cfeb66b5344e6dd91/kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/MergeQueries.java#L216
The MERGE statement uses dedup and performs both inserts and updates.
However, on the first load all of the requests will contain only the INSERT branch (nothing is in the table yet), and if two merges run at the same time (many workers with retry), the same records might be inserted twice. According to this: "A MERGE DML statement does not conflict with other concurrently running DML statements as long as the statement only inserts rows and does not delete or update any existing rows. This can include MERGE statements with UPDATE or DELETE clauses, as long as those clauses aren't invoked when the query runs." (source). I also see this in practice, leading to duplicates.
Duplicates are not wanted, since they defeat the whole point of running a MERGE (compared to a solution that runs plain INSERTs and dedups later).
Since this is a live dataset being queried by users, duplicates will break the integrity of the dataset, so we need to find a solution at the Sink/BigQuery level.
Is it possible to make the merge statement always conflict with others so this doesn't happen? Any other solution?
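For context, the deduplicating MERGE the connector issues looks roughly like this, a simplified sketch with hypothetical id/payload columns (the real statement is built in MergeQueries.java linked above):
-- Hypothetical, simplified version of the connector's upsert MERGE.
MERGE detionation_tbl AS dst
USING (
  -- dedup: keep one row per key from the temporary table
  SELECT id, ANY_VALUE(payload) AS payload
  FROM my_tmp_tmp
  GROUP BY id
) AS src
ON dst.id = src.id
WHEN MATCHED THEN
  UPDATE SET payload = src.payload
WHEN NOT MATCHED THEN
  INSERT (id, payload) VALUES (src.id, src.payload);
On the first load the WHEN MATCHED branch is never invoked, which is exactly the condition under which BigQuery allows the MERGEs to run concurrently, per the quote above.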

BigQuery Atomicity

I am trying to do a full load of a table in BigQuery daily, as part of ETL. The target table has a dummy partition column of type integer and is clustered. I want the statement to be atomic, i.e. either it completely overwrites the table with the new data, or it rolls back to the old data if anything fails in between, and it should keep serving user queries with the old data until the new data is completely written.
One way of doing this is delete and insert, but BigQuery does not support multi-statement transactions.
I am thinking to use the below statement. Please let me know if this is atomic.
create or replace table table_1
partition by dummy_int
cluster by dummy_column
as select col1, col2, col3 from stage_table1;

Oracle 12c - can we lock partition while inserting data into a table rather than whole table?

There is a need to run 2 jobs in parallel from Informatica to insert data in parallel to the same table in Oracle.
One process will insert to one partition and other process will insert into another partition.
What does Oracle do by default? Does it lock the whole table, so that two parallel processes cannot insert data from Informatica at the same time, or does it lock only one partition of the table, so that another process can run in parallel and insert into another partition?
Thanks.
With plain old inserts, two statements won't ever block each other.
Oracle uses row-level locking and does not arbitrarily escalate row locks into table locks. Since INSERTs by definition create new rows, they can't possibly block each other.
There are a few weird exceptions to that rule. For example, if the table had a trigger that changed other rows, or if two INSERTs try to use the same primary key value. But both cases are extremely unlikely and would probably be an unintended bug that you wouldn't want to run anyway. Bitmap indexes aren't designed for concurrent DML and may also cause regular inserts to block.
Things get more complicated with parallel processing and direct-path writes.
If the INSERT operation uses parallelism or a direct-path write, adding even a single row will lock the entire table and prevent other sessions from changing anything. For example, if it generates statements like insert /*+ parallel(8) */ ... they will prevent any other DML from running. This is because direct-path writes directly modify the data files for optimal performance and the typical concurrency controls don't work.
But it's possible that Informatica won't use Oracle's parallel processing, and will instead create multiple concurrent threads for homegrown parallel processing. If that's the case then the two processes won't block each other.
But it's also possible that Informatica does not use Oracle parallel processing but does use direct-path writes. For example, if it generates statements like insert /*+ append */ ... they will prevent any other DML from running, even if that DML is modifying a different partition. This is probably because Oracle can't easily predict which partition will be modified ahead of time and it's simpler to just lock them all.
But if the partitions are explicitly specified, then two parallel or direct-path writes can run concurrently, as long as they modify different partitions.
Below is a quick demonstration. First, create a simple test schema with a partitioned table.
create table test1(a number)
partition by list(a)
(
partition p1 values (1),
partition p2 values (2)
) nologging;
Demonstrate that a direct-path write in one session will block any type of insert in another session:
--Session 1: Run this statement but don't commit. It should finish in less than a second.
insert /*+ append */ into test1 select 1 from dual;
--Session 2: Run this statement but it will never finish, even though it
-- inserts into a different partition.
insert into test1 select 2 from dual;
Demonstrate that a direct-path write in one session will not block another session as long as the partitions are explicitly named:
--Session 1: Run this statement but don't commit. It should run in less than a second.
insert /*+ append */ into test1 partition (p1) select 1 from dual;
--Session 2: This statement will run immediately.
insert into test1 partition (p2) select 2 from dual;
Now what?
Any of the above scenarios are possible depending on how Informatica is configured. Start by checking the Informatica settings.
Look at the generated SQL to see what Informatica is running. Use a query like select * from gv$sql where sql_fulltext like '....
Look at the explain plans for those SQL statements to see if the queries are running the way you want them to. Use a query like select * from table(dbms_xplan.display_cursor(sql_id => '... to find the plan. Look at the Operation column; LOAD AS SELECT means direct-path is being used and LOAD TABLE CONVENTIONAL means direct-path is not used. You can also check for PARALLEL or PX operations to see if Oracle parallel processing is used. See the sketch after this list for both queries.
Unfortunately there are many reasons why direct-path or parallelism may be requested but not used. See my frustratingly long list of possible reasons for a lack of parallelism here, and see this list for reasons for a lack of direct-path writes.
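A hedged sketch of those two diagnostic queries (the LIKE filter and the sql_id value are placeholders to replace with your own):
-- Find the SQL that Informatica is generating (filter text is a placeholder).
select sql_id, sql_fulltext
from gv$sql
where sql_fulltext like '%TEST1%';

-- Inspect the plan for one of those statements (sql_id is a placeholder).
select *
from table(dbms_xplan.display_cursor(sql_id => 'abcd1234efgh5', format => 'BASIC'));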
#OldProgrammer provided an excellent reference. There is no need to worry about table locks when inserting into Oracle. The only locks Oracle takes by default are on the rows being inserted, updated, or deleted, plus a lock that prevents DDL operations while the DML operation is underway.

What will happen if a hive(0.13) SELECT and INSERT OVERWRITE are running at the same time

I would like to know what will happen if a Hive SELECT and an INSERT OVERWRITE run against the same table at the same time. Please help me understand what the Hive query will return in the scenarios below.
1. Run the SELECT first; while the query is running, INSERT OVERWRITE the same table.
2. Run the INSERT OVERWRITE first; while it is overwriting, pull the data from the same table with a SELECT.
Are we going to get the old data, new data, mixed data, nothing, or unpredictable data?
I am using MapR 4.0.1, Hive 0.13.
Best regards,
Ryan
Read Hive Locking:
For a non-partitioned table, the lock modes are pretty intuitive. When the table is being read, a S lock is acquired, whereas an X lock is acquired for all other operations (insert into the table, alter table of any kind etc.)
So SELECT and INSERT acquire incompatible locks and can never run in parallel: one will acquire its lock first and the other will wait.
For partitioned tables things are a bit more complex, as the locks acquired are hierarchical (S on the table, S/X on the partition). Read the link.
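If you want to see which locks are held while this happens, Hive can show them directly, assuming the lock manager is enabled (hive.support.concurrency=true); the table name below is a placeholder:
-- From a third Hive session, list the locks currently held on the table.
SHOW LOCKS my_table EXTENDED;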

DROP TABLE or DELETE TABLE? Which is best practice?

Working on redesigning some databases in my SQL SERVER 2012 instance.
I have databases where I put my raw data (from vendors) and then I have client databases where I will (based on client name) create a view that only shows data for a specific client.
Because this data is volatile (Google AdWords & Google DFA), I typically just delete the last 6 days and insert 7 days' worth every day from the vendor databases. Doing this gives me comfort in knowing that Google has had time to solidify its data.
The question I am trying to answer is:
1. Instead of using views, would it be better to use a 'SELECT INTO' statement and DROP the table every day in the client database?
I'm afraid that automating my process with the 'DROP TABLE' method will not scale well long term. While testing it myself, performance seemed better because it does not have to scan the entire table for the date range. I've also tested this with an index on the 'date' column, and performance still seemed better with the 'DROP TABLE' method.
I am looking for best practices here.
NOTE: This is my first post. So I am not too familiar with how to format correctly. :)
Deleting rows from a table is a time-consuming process. All the deleted records get logged, and performance of the server suffers.
Instead, databases offer truncate table. This removes all the rows of the table without logging the individual row deletions, but keeps the structure intact. Triggers, indexes, constraints, stored procedures, and so on are not affected by the removal of rows.
In some databases, if you delete all rows from a table, then the operation is really truncate table. However, SQL Server is not one of those databases. In fact the documentation lists truncate as a best practice for deleting all rows:
To delete all the rows in a table, use TRUNCATE TABLE. TRUNCATE TABLE is faster than DELETE and uses fewer system and transaction log resources. TRUNCATE TABLE has restrictions, for example, the table cannot participate in replication. For more information, see TRUNCATE TABLE (Transact-SQL).
You can drop the table. But then you lose auxiliary metadata as well -- all the things listed above.
I would recommend that you truncate the table and reload the data using insert into or bulk insert.
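A minimal sketch of that truncate-and-reload pattern (database, table, and column names are hypothetical stand-ins for your client and vendor objects):
-- Remove the old rows quickly, keeping the table, indexes, and permissions in place.
TRUNCATE TABLE dbo.ClientAdwordsData;

-- Reload from the vendor database (all names are placeholders).
INSERT INTO dbo.ClientAdwordsData (ClientName, ReportDate, Impressions, Clicks)
SELECT ClientName, ReportDate, Impressions, Clicks
FROM VendorDB.dbo.AdwordsRaw
WHERE ClientName = 'SomeClient';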