What happens when a Hive insert fails halfway? - hive

Suppose an insert is expected to load 100 records into Hive, 40 records have been inserted, and then the insert fails for some reason. Will the transaction roll back completely, undoing the 40 records that were inserted?
Or will we see 40 records in the Hive table even after the insert query failed?

The operation is atomic (even for a non-ACID table): if you insert or rewrite data using HiveQL, Hive writes the data into a temporary location, and only if the command succeeds are the files moved to the table location (old files are deleted in the case of INSERT OVERWRITE). If the SQL statement fails, the data remains as it was before the statement was executed.
Note about S3 direct writes: the direct-writes-to-S3 feature should be disabled so that Hive writes to a temporary location and rewrites the target folder only if the operation succeeded:
-- Disable AWS S3 direct writes:
set hive.allow.move.on.s3=true;
Also read this documentation for more details on which ACID features are supported in concurrency mode, and their limitations: What is ACID and why should you use it?
Up until Hive 0.13, atomicity, consistency, and durability were provided at the partition level. Isolation could be provided by turning on one of the available locking mechanisms (ZooKeeper or in memory). With the addition of transactions in Hive 0.13 it is now possible to provide full ACID semantics at the row level, so that one application can add rows while another reads from the same partition without interfering with each other.
Also read this about Hive locks with ACID enabled (transactional and non-transactional tables)
Update: since December 2020, Amazon S3 is strongly consistent at no extra charge, so the part about S3 eventual consistency has been removed.

Related

Is hive `INSERT OVERWRITE` atomic? [duplicate]


What are the performance consequences of different isolation levels in PostgreSQL?

I am writing an archival script (in Python using psycopg2) that needs to pull a very large amount of data out of a PostgreSQL database (9.4), process, upload and then delete it from the database.
I start a transaction, execute a select statement to create a named cursor, fetch N rows at a time from the cursor and do processing and uploading of parts (using S3 multipart upload). Once the cursor is depleted and no errors occurred, I finalize the upload and execute a delete statement using the same conditions as I did in select. If delete succeeds, I commit the transaction.
The database is being actively written to and it is important that both the same rows get archived and deleted and that reads and writes to the database (including the table being archived) continue uninterrupted. That said, the tables being archived contain logs, so existing records are never modified, only new records are added.
So the questions I have are:
What level of isolation should I use to ensure the same rows get archived and deleted?
What impact will these operations have on database read/write ability? Does anything get write or read locked in the process I described above?
You have two good options:
Get the data with
SELECT ... FOR UPDATE
so that the rows get locked. Then they are guaranteed to still be there when you delete them.
Use
DELETE FROM ... RETURNING *
Then insert the returned rows into your archive.
The second solution is better, because you need only one statement.
Nothing bad can happen. If the transaction fails for whatever reason, no row will be deleted.
You can use the default READ COMMITTED isolation level for both solutions.
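A minimal psycopg2 sketch of the second option, assuming a hypothetical event_log table with a created_at column and a placeholder upload step; the default READ COMMITTED level is used, and the batch is deleted and archived inside one transaction:

```python
import datetime
import psycopg2


def upload_to_s3(rows):
    """Placeholder for the real multipart-upload / archival step."""
    ...


cutoff = datetime.datetime.now() - datetime.timedelta(days=30)

conn = psycopg2.connect("dbname=mydb user=archiver")  # hypothetical DSN
try:
    with conn:  # commits on success, rolls back on any exception
        with conn.cursor() as cur:
            cur.execute(
                # hypothetical table and column names
                "DELETE FROM event_log WHERE created_at < %s RETURNING *",
                (cutoff,),
            )
            rows = cur.fetchall()  # the rows being deleted; the DELETE is not
                                   # visible to other sessions until COMMIT
            upload_to_s3(rows)     # if this raises, the transaction rolls back
                                   # and no row is deleted
finally:
    conn.close()
```

Since the archival set can be very large, one would typically run this per time window so that each RETURNING batch fits in memory.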

How can I update the rows in external Hive table when ACID properties are off?

The transaction manager is non-ACID, so I obviously cannot use ACID transactional operations here. I tried using INSERT OVERWRITE, but it only works on managed tables, not on external tables.
Is there a possible way to do it from Pyspark?
PS: the Hive table gets loaded by a job in production. There are a few rows which we need to update manually, and the table is stored in AWS S3.
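One common pattern (a hedged sketch, not an answer taken from the thread): since an external table is just metastore metadata over files in S3, the rows can be corrected from PySpark by reading the table, applying the fix, staging the result, and then rewriting the files at the table's S3 location. The table name, S3 paths, column names, and the correction below are all hypothetical, and the example assumes an unpartitioned Parquet table.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("manual-row-fix")   # hypothetical job name
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("mydb.events_ext")   # hypothetical external table

# Apply the manual correction to the few affected rows (hypothetical columns/values).
fixed = df.withColumn(
    "status",
    F.when(F.col("event_id").isin(101, 102), F.lit("corrected"))
     .otherwise(F.col("status")),
)

# Spark refuses to overwrite a path it is currently reading from,
# so stage the result first, then rewrite the table's S3 location.
staging = "s3://my-bucket/tmp/events_ext_fix/"    # hypothetical staging path
target = "s3://my-bucket/warehouse/events_ext/"   # hypothetical table location

fixed.write.mode("overwrite").parquet(staging)
spark.read.parquet(staging).write.mode("overwrite").parquet(target)
```

Because the external table's metadata only points at the S3 prefix, the rewritten files become the table's contents. The rewrite is not atomic on S3, so it is usually done while the production load job is not running.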

Database Replication after Data Load

I'm trying to understand the ramifications of database replication (SQL Server or Golden Gate) for situations where the source database is completely repopulated every night. To clarify, all existing tables are dropped and then the database is reloaded with new tables using same name along with all the data.
Based on my understanding, i.e. that replication uses a transaction log, I would assume it will also repeat the process of dropping the tables instead of identifying the differences and just adding the new data. Is that correct?
You can set up replication with Oracle GoldenGate so that it does what you want:
- the TRUNCATE TABLE command can be replicated, or it can be ignored
- the populating of the source table (INSERT/bulk operations) can be replicated, or it can be ignored
- if a row already exists (meaning a row with the same PK exists) on the target and you INSERT it on the source, you can either UPDATE the target, DELETE the old row and INSERT the new one, or ignore it
Database replication is based on the redo (transaction) log. Only particular events that appear on the source database and are logged can be replicated, but the replication engine can apply additional transformations while replicating the changes.

BigQuery UPDATE or DELETE DML

Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. If it is absent, the table can be modified using UPDATE or DELETE statements.
When I try to modify my table (rows were recently inserted, the table was created a few days ago) with
delete table_dataset.table1 where true
I get the following error: "Error: UPDATE or DELETE DML statements are not supported over table with streaming buffer". However, somehow I was able to delete all these records later, maybe after some delay.
What is the streaming buffer? When exactly can I modify my table? If I use a job that creates the table, or export data from another source, can I run UPDATE/DELETE DML?
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table, but it can take up to 90 minutes to become available for copy/export and other operations. You probably have to wait up to 90 minutes so that the whole buffer is persisted on the cluster. You can check the tables.get response to see whether the streaming buffer is empty or not, like you mentioned.
If you use a load job to create the table, you won't have a streaming buffer.
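For reference, a hedged sketch of that check with the Python client (google-cloud-bigquery); the project, dataset, and table identifiers are placeholders. The tables.get response is exposed as Table.streaming_buffer, and the DELETE is attempted only when it is absent.

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.table_dataset.table1")  # hypothetical IDs

if table.streaming_buffer is None:
    # No streaming buffer reported, so DML should be accepted.
    client.query(
        "DELETE FROM `my-project.table_dataset.table1` WHERE true"
    ).result()
else:
    # Still buffered; DML would fail, so just report and retry later.
    print("Streaming buffer still present, oldest entry:",
          table.streaming_buffer.oldest_entry_time)
```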