Impala KUDU table - how to bulk update

I need to perform updates on a KUDU table. Is there any option to do the update in bulk?
The flow is as follows:
1. Fetch 1000 rows
2. Process rows, calculate new value for each row
3. Update KUDU table with new values
Updating row by row, with one DB query per row, is slow. I am seeking a bulk update solution. I found only this: "You can update in bulk using the same approaches outlined in Inserting In Bulk." here https://www.cloudera.com/documentation/kudu/latest/topics/kudu_impala.html#update_bulk, but how do I do this? I need an example, if it is possible.
Thanks
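A minimal sketch of the staging-table approach the linked doc alludes to; every table and column name below (staging_new_values, my_kudu_table, id, val) is a placeholder, not something from the question:

    -- 1. Load the processed batch into a staging table with one multi-row statement.
    INSERT INTO staging_new_values (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');

    -- 2. Apply the whole batch with a single statement instead of one UPDATE per row;
    --    for keys that already exist in the Kudu table, UPSERT updates them in place.
    UPSERT INTO my_kudu_table (id, val)
    SELECT id, val FROM staging_new_values;

Impala's UPDATE statement for Kudu tables also accepts a FROM clause with joins, so a single UPDATE joined against the staging table can achieve the same effect.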

Related

How to do a bulk upsert operation in Snowflake?

I am syncing my MongoDB data to Snowflake on a daily basis using a Node.js script. If a row already exists in Snowflake, I want to replace that row with the new data; if the row doesn't exist in Snowflake, I want to insert a new row.
Also, I want to do this for a lot of data.
So is there any way to do a bulk upsert in Snowflake? If not, what would be the optimal way to achieve this?
The table may have millions of rows and possibly go to billions in the future.
This is a typical use case for a merge statement. You can see the documentation for merge here: https://docs.snowflake.com/en/sql-reference/sql/merge.html
Using a merge statement for billions of rows can lead to some high-churn tables so it isn't ideal. It could be better if you can append to the table only and figure out the latest record with a select statement.
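If you take the append-only route, the latest version of each key can be picked out with a window function; a minimal sketch, assuming hypothetical id and updated_at columns:

    -- Keep only the most recent row per id from the append-only table.
    SELECT *
    FROM my_table
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1;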
You can bulk copy your data into a staging table and then use the MERGE feature in Snowflake.
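A sketch of that staging-plus-MERGE pattern; the stage, table, and column names below are placeholders, not from the question:

    -- 1. Bulk load the daily export into a staging table.
    COPY INTO staging_users
    FROM @my_stage/users.csv
    FILE_FORMAT = (TYPE = CSV);

    -- 2. Upsert from staging into the target with one set-based statement.
    MERGE INTO users t
    USING staging_users s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.name = s.name
    WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);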

Deleting records in a table with billions of records using Spark or Scala

We have a table in Azure Data Warehouse with 17 billion records. Now we have a scenario where we have to delete records from this table based on a WHERE condition. We are writing Spark code in Scala in Azure Databricks notebooks.
We searched for different options to do this in Spark, but all of them suggest first reading the entire table, deleting the records from it, and then overwriting the entire table in the Data Warehouse. However, this approach will not work in our case due to the huge number of records in our table.
Can you please suggest how we can achieve this functionality using Spark/Scala?
1) We checked whether we can call a stored procedure through Spark/Scala code in Azure Databricks, but Spark does not support stored procedures.
2) We tried reading the entire table first in order to delete the records, but it goes into a never-ending loop.
It is possible to create a view with a SELECT clause as per your requirement, and then use the view.
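If I read that suggestion right, the idea is to sidestep the physical delete: define a view that filters out the rows you would otherwise remove, and point downstream jobs at the view. A minimal sketch; the table name and the delete condition are placeholders:

    -- Rows matching the delete condition are simply excluded from the view.
    CREATE VIEW dbo.big_table_active AS
    SELECT *
    FROM dbo.big_table
    WHERE NOT (event_date < '2015-01-01');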

Either insert or update database records via Apache NiFi flow

I am trying to transfer data between two databases with similar structure of tables using NiFi. Example of data structure:
User: {varchar name, integer id}.
There are no "Maximum-value Columns" so it is impossible to determine if there is new data or not. So each time I create "snapshot" of the full table content. The problem is that it is unclear either particular record should be inserted or updated in the target database.
I created two branches of processors: with inserts and with updates. Only insert works for new records and only update for existing. But (!) PutSQL processor works with bunch of flow files.
For example batch size is 100 and processors work once a day. Assume there was 98 records yesterday. They will be inserted. Today there are 200 records (98 from yesterday and 102 new). In this flow if NiFi tries to update first 100 records and insert them then both actions will fail: first 98 records should be updated while last 2 should be inserted.
How to solve this issue? I know it is possible to use batch size 1 but it work too slow.
I recommend solving this in your SQL statements, since NiFi will not know the prior status of the records. A MERGE statement would be ideal, if your database supports it (Oracle and SQL Server do; MySQL has INSERT ... ON DUPLICATE KEY UPDATE instead). Otherwise, you can craft both an INSERT and an UPDATE for each record in the source table, making them conditional on whether the user already exists in the table.
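A SQL Server-style sketch of the MERGE route, assuming a hypothetical users(id, name) target table; the ? placeholders would be bound by PutSQL from each flow file's attributes:

    -- Insert the user if the id is new, otherwise update the existing row.
    MERGE INTO users AS tgt
    USING (SELECT ? AS id, ? AS name) AS src
      ON tgt.id = src.id
    WHEN MATCHED THEN UPDATE SET tgt.name = src.name
    WHEN NOT MATCHED THEN INSERT (id, name) VALUES (src.id, src.name);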

Pentaho update/insert

I am trying to have a setup in Pentaho where:
My source data is in MySQL DB and target database is Amazon redshift.
I want to have incremental loads on Redshift database table, based on the last updated timestamp from MySQL DB table.
Primary key is student ID.
Can I implement this using update/insert in Pentaho?
The Insert/Update step in Pentaho Data Integration serves the purpose of inserting a row if it doesn't exist in the destination table, or updating it if it's already there. It has nothing to do with incremental loads as such, but if your loads should insert or update records based on some Change Data Capture mechanism, then this is the right step at the end of the process.
For example, you could go one of two ways:
If you have CDC, then limit the data at the Table Input step for MySQL, since you already know the last time a record was modified (the last load); see the sketch after this list.
If you don't have CDC and you are comparing entire tables, then join the data sets to produce the rows that have changed and then perform the load (the slower solution).
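A minimal sketch of the CDC-limited Table Input query; the table, the columns, and the ${LAST_LOAD} variable are assumptions, not taken from the question:

    -- Only pull rows modified since the previous load; ${LAST_LOAD} is a Pentaho
    -- variable (variable substitution must be enabled in the Table Input step).
    SELECT student_id, name, last_updated
    FROM students
    WHERE last_updated > '${LAST_LOAD}'
    ORDER BY last_updated;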

Is it possible to overwrite with an SSIS Insert or similar?

I have a .csv file that gets pivoted into 6 million rows during an SSIS package. I have a table in SQL Server 2005 of 25 million+ rows. The .csv file has data that duplicates data in the table. Is it possible for rows to be updated if they already exist, or what would be the best method to achieve this efficiently?
Comparing 6m rows against 25m rows is not going to be too efficient with a lookup or a SQL command data flow component being called for each row to do an upsert. In these cases, sometimes it is most efficient to load them quickly into a staging table and use a single set-based SQL command to do the upsert.
Even if you do decide to do the lookup, split the flow into two streams: one which inserts, and the other which inserts into a staging table for an update operation.
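A sketch of that single set-based upsert from the staging table; table and column names are placeholders, and since SQL Server 2005 has no MERGE it is written as an UPDATE followed by an INSERT:

    -- Update rows that already exist in the target.
    UPDATE t
    SET    t.value = s.value
    FROM   dbo.target_table AS t
    JOIN   dbo.staging_table AS s ON s.id = t.id;

    -- Insert rows that exist only in the staging table.
    INSERT INTO dbo.target_table (id, value)
    SELECT s.id, s.value
    FROM   dbo.staging_table AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.target_table AS t WHERE t.id = s.id);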
If you don't mind losing the old data (i.e. the latest file is all that matters, not what's in the table), you could erase all the records in the table and insert them again.
You could also load into a temporary table and determine what needs to be updated and what needs to be inserted from there.
You can use the Lookup task to identify any matching rows in the CSV and the table, then pass the output of this to another table or data flow and use a SQL task to perform the required Update.