Apache Hudi Upsert/Insert/Deletes at the same time - apache-hudi

Can we run the Upsert and Delete write operations at the same time, on the same table?
Does the Apache Hudi metadata get corrupted?
Please also suggest any other solutions for achieving this.
Thanks in advance!

With Hudi you can upsert and delete records in the same write, without corrupting the Hudi metadata. To achieve this you have two options:
develop your own hoodie.datasource.write.payload.class and implement the delete logic there, so that records are deleted based on some condition (for example when you provide a null value, or based on a column value)
add the column _hoodie_is_deleted to your source dataset, set it to true for the records you want to delete, and keep it null (or false) for the records you want to upsert, then write with mode Append and operation upsert (a PySpark sketch follows below)
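A minimal PySpark sketch of option 2, assuming a DataFrame df with a record key column id, a precombine column ts, a status column used to decide which rows to delete, and an existing Hudi table at /path/to/hudi_table (all of these names are hypothetical):

from pyspark.sql import functions as F

# Hypothetical column names and path; adjust to your table.
hudi_options = {
    "hoodie.table.name": "my_hudi_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# Flag the rows to delete; all other rows are upserted as usual.
batch = df.withColumn(
    "_hoodie_is_deleted",
    F.when(F.col("status") == "deleted", F.lit(True)).otherwise(F.lit(False)),
)

batch.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/hudi_table")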
Update:
If you want to run them as two separate queries, they are considered two concurrent writes. You can activate OCC (optimistic concurrency control), which allows concurrent writes when there is no overlap (for example a DELETE from partition X and an INSERT into partition Y); but when both queries write to the same partitions, the conflicting write will fail.
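A sketch of the OCC-related write options, to be passed to both writers on top of the normal Hudi write options. The lock-provider settings below assume a ZooKeeper-based lock and are only an example; the exact provider and its settings depend on your deployment:

occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # Example lock provider (ZooKeeper); adjust to your environment.
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "my_hudi_table",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}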

Related

Keeping BigQuery table data up-to-date

This is probably not the intended use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic", in the sense that clients reading the data should see either only the old version of the data or the complete new version. The only solution I have now is to use date partitions. The problem with this solution is that clients who just need to read up-to-date data have to know about the partitions and read only from certain ones. Every time I want to make a query, I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like a solution that is easy and transparent for the clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including single DML statements (INSERT/UPDATE/DELETE/MERGE) and single load jobs, are atomic: your readers see either the old data or the new data.
Since multi-statement transactions are not available right now, if you have updates that don't fit into a single load job, the solution is (a sketch using the Python client follows after these steps):
Load the updates into a staging table, and wait until all loads have finished
Use a single INSERT or MERGE to merge the updates from the staging table into the primary data table
The drawback: scanning the staging table is not free
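A sketch of the staging approach using the Python BigQuery client; the bucket, dataset, table and column names are all hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# 1) Load the incoming batch into a staging table (each load job is atomic).
load_job = client.load_table_from_uri(
    "gs://my-bucket/updates/*.avro",
    "my_project.my_dataset.staging_updates",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
)
load_job.result()

# 2) Apply all staged updates to the primary table in one atomic MERGE.
client.query("""
MERGE `my_project.my_dataset.data_table` t
USING `my_project.my_dataset.staging_updates` s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""").result()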
Update: since you have multiple tables to update atomically, there is a small trick which may be helpful.
Assuming each table you need to update has an ActivePartition column as its partition key, you can keep a control table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date, and your users then query with a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active

The best way to Update the database table through a pyspark job

I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in

IMHO, comparing the entire existing data set just to load the new data is not performant.
Option 1:
Instead, you can create a partitioned BigQuery table with a partition column used for loading, and while loading new data you can check whether that partition has already been loaded (a sketch follows below).
Hitting partition-level data in BigQuery (or Hive) is much more efficient than selecting the entire table and comparing it in Spark.
See Creating partitioned tables or Creating and using integer range partitioned tables.
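A rough sketch of option 1 with the Python BigQuery client, using hypothetical project/dataset/table/column names: create a date-partitioned table once, then check the target partition before loading so only one partition is scanned:

from google.cloud import bigquery

client = bigquery.Client()

# One-time DDL: a table partitioned by load_date.
client.query("""
CREATE TABLE IF NOT EXISTS `my_project.my_dataset.agg_table` (
  id INT64,
  value FLOAT64,
  load_date DATE
)
PARTITION BY load_date
""").result()

# Before loading, check whether today's partition already contains data.
rows = client.query("""
SELECT COUNT(*) AS n
FROM `my_project.my_dataset.agg_table`
WHERE load_date = CURRENT_DATE()
""").result()
already_loaded = next(iter(rows)).n > 0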
Option 2:
Another alternative: BigQuery has a MERGE statement. If your requirement is to merge the data without doing the comparison yourself, you can go ahead with MERGE; see the doc excerpt below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass; we do not need to write individual statements to apply changes to the target table. A sketch follows below.
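A sketch of such a MERGE run from the Python client; the table and column names are made up, and the delete branch assumes the source carries an is_deleted flag:

from google.cloud import bigquery

client = bigquery.Client()

# One statement performs the delete, update and insert atomically, in a single pass.
client.query("""
MERGE `my_project.my_dataset.target` t
USING `my_project.my_dataset.source` s
ON t.id = s.id
WHEN MATCHED AND s.is_deleted THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value)
""").result()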
There are many ways this problem can be solved; one of the less expensive, performant and scalable ways is to use a datastore on the file system to determine the truly new data.
As data comes in for the first time, write it to two places - the database and a file (say in S3). If data is already in the database, you need to initialize the local/S3 file with the table data.
From the second time onwards, check whether data is new based on its presence in the local/S3 file.
Mark the delta data as new or updated, and export it to the database as inserts or updates (a PySpark sketch follows below).
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate the file to keep data within that range.
You can also bucket and partition this data, or use Delta Lake to maintain it.
One downside is that whenever the database is updated, this file may need to be updated too, depending on whether the relevant data changed. You can maintain a marker on the database table to signify the sync date, index that column, read changed records based on it, and update the file/Delta Lake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them out of the critical path is better.
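A PySpark sketch of the idea, with hypothetical paths and columns: keep a key/hash state file, anti-join the incoming batch against it to find the true delta, and append the delta's keys back to the state store after exporting it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("detect-new-data").getOrCreate()

# State store of what has already been written to the database (columns: id, row_hash).
seen = spark.read.parquet("s3://my-bucket/state/seen_keys/")
incoming = spark.read.parquet("s3://my-bucket/incoming/2024-01-01/")

# Hash the non-key columns so changed rows are detected as well as brand-new keys.
payload_cols = [c for c in incoming.columns if c != "id"]
incoming = incoming.withColumn("row_hash", F.sha2(F.concat_ws("||", *payload_cols), 256))

# Rows whose (id, row_hash) pair is not in the state file are the inserts/updates to apply.
delta = incoming.join(seen, on=["id", "row_hash"], how="left_anti")

# After writing `delta` to the database as inserts/updates, record it in the state store.
delta.select("id", "row_hash").write.mode("append").parquet("s3://my-bucket/state/seen_keys/")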
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row of the table, it would solve the problem.

Delete rows not in BigQuery streaming buffer

I need to delete rows that match a given WHERE condition from a partitioned BigQuery table. The table always has a streaming buffer adding more data to it. I am not concerned with deleting from what is being streamed in, just historical data in yesterday's partition.
What is the correct strategy to remove data while a streaming buffer exists on a BigQuery table, ideally without downtime?
From this page Data Manipulation Language
"Rows that were written to a table recently via streaming (using the tabledata.insertall method) cannot be modified using UPDATE, DELETE, or MERGE statements. Recent writes are typically those that occur within the last 30 minutes. Note that all other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements."
This means you should restrict your DML with a time filter. Ideally you have a creation-date column you can use for that, or a built-in metadata column if one exists, but I am not aware of such a column.
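A sketch of such a time-restricted DELETE, assuming the table is ingestion-time partitioned and carries an insert_ts column populated at ingest time (table, column and condition are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Delete only from yesterday's partition, and only rows old enough to be
# safely outside the streaming buffer.
client.query("""
DELETE FROM `my_project.my_dataset.events`
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND insert_ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 40 MINUTE)
  AND status = 'obsolete'
""").result()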
You can use a Data Manipulation Language DELETE statement. However, keep in mind the following (from the DML docs):
"Support for using Data Manipulation Language statements to modify partitioned table data is currently in Beta."
But you can always do a SELECT that filters out the records you want to delete and write the results back to the same partition.
There will be no downtime. The cost will be the same as a full scan of that single partition.

Performance bottleneck at target in informatica mapping

I have an Informatica mapping in which the soft-delete condition is as follows:
if Pk_Src is null and Pk_Tgt is not null, then set active_flag to 'N'.
Based on this condition, the mapping determined that 400k records need to be updated. It's a simple update, but it is taking more than 3 hours using an Update Strategy transformation.
Appreciate your valuable inputs.
Dex.
How many records are in the table, and how many indexes does active_flag appear in? If active_flag is in many indexes, you should consider dropping those indexes before the session starts and recreating them after the session ends. Have you looked through the session log to see which steps are taking the time? There may be more going on besides the update query. Another strategy to try is increasing your commit interval to 500,000 (as long as your DB undo can withstand that).
In general you should expect very slow performance when doing updates/deletes from an Informatica mapping - 10 to 100 times slower than inserts. This is because there is no 'array'-based update/delete, so each update has to be handled separately and sequentially, and you end up in a situation where 90% of the time is spent sending handshakes back and forth. With inserts, the handshake is only done once per 'array-size' records (say 10,000), so the overhead is negligible.
The best solutions I have found so far are:
use pushdown optimization (the drawback is that the mapping needs to be 100% push-downable, which means no variable ports, source and target behind the same connection, and much more)
use a stage/apply approach:
In the pre-SQL of the session, drop and create a STAGE table holding all the keys you wish to delete
Override the table name for your target to point to this STAGE table
Let the mapping do inserts into the STAGE table
In the post-SQL, do a
DELETE FROM TABLE WHERE EXISTS (SELECT * FROM STAGE WHERE TABLE.ID = STAGE.ID)
I hope you can follow me.

Oracle SQL technique to avoid filling trans log

Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero.
Is this the best approach?
Thanks,
The amount written to the redo and undo logs will not be reduced at all if you break the UPDATE up into multiple runs of, say, 1000 records. On top of that, the total query time will most likely be higher than running a single large SQL statement.
There's no real way to address the UNDO/REDO log issue in UPDATEs. With INSERTs and CREATE TABLEs you can use a DIRECT aka APPEND option, but I guess this doesn't easily work for you.
It depends on the percentage of rows almost as much as on the number, and also on whether the update makes the rows longer than before (e.g. going from null to 200 bytes in every row). That could hurt your performance through row chaining.
Either way, you might want to try this.
Build a new table with the column corrected as part of the select instead of an update. You can build that new table via CTAS (Create Table as Select) which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, repoint constraints, rebuild triggers, recompile packages, etc.
You can avoid a lot of logging this way.
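A sketch of those steps driven from python-oracledb; the connection details, table names, column names and the predicate value are all made up, and the same statements can of course be run from SQL*Plus instead:

import oracledb

conn = oracledb.connect(user="app_user", password="app_pwd", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# 1) Build the corrected data with a NOLOGGING CTAS instead of a massive UPDATE.
cur.execute("""
    CREATE TABLE my_table_new NOLOGGING AS
    SELECT my_table_id,
           CASE WHEN some_col < 100 THEN NULL ELSE a_col END AS a_col,
           some_col
    FROM my_table
""")

# 2) Swap the tables, then recreate indexes, constraints, triggers, grants, etc.
cur.execute("DROP TABLE my_table")
cur.execute("ALTER TABLE my_table_new RENAME TO my_table")
cur.execute("CREATE INDEX my_table_some_col_ix ON my_table (some_col) NOLOGGING")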
Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then doing a rebuild of those indexes with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/ validated as a result of the update, getting rid of those temporarily might be helpful.