How to do a bulk upsert operation in Snowflake? - sql

I am syncing my MongoDB data to Snowflake on a daily basis using a Node.js script. If a row already exists in Snowflake, I want to replace it with the new data; if it doesn't exist, I want to insert a new row.
I also want to do this for a lot of data.
Is there any way to do a bulk upsert in Snowflake? If not, what would be the optimal way to achieve this?
The table may have millions of rows and could grow to billions in the future.

This is a typical use case for a MERGE statement. You can see the documentation for MERGE here: https://docs.snowflake.com/en/sql-reference/sql/merge.html
Running a MERGE against billions of rows can lead to high-churn tables, so it isn't ideal. It may be better to only append to the table and work out the latest record for each key with a SELECT statement.
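The append-only idea can be sketched roughly as follows. This is only an illustration: it uses Python's built-in SQLite so that it runs locally, not Snowflake, and the table and column names (customers_log, loaded_at) are made up for the example. In Snowflake the same read could probably be written more compactly with ROW_NUMBER() and QUALIFY.

```python
import sqlite3

# SQLite stands in for Snowflake here; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_log (id INTEGER, name TEXT, loaded_at TEXT)")

# Every sync only appends; existing rows are never updated in place.
conn.executemany(
    "INSERT INTO customers_log VALUES (?, ?, ?)",
    [
        (1, "Ann", "2024-01-01"),
        (2, "Bob", "2024-01-01"),
        (1, "Ann B.", "2024-01-02"),  # newer version of id 1
    ],
)

# Pick the newest version of each id at read time.
LATEST_SQL = """
SELECT l.id, l.name, l.loaded_at
FROM customers_log AS l
JOIN (
    SELECT id, MAX(loaded_at) AS max_loaded_at
    FROM customers_log
    GROUP BY id
) AS newest
  ON newest.id = l.id AND newest.max_loaded_at = l.loaded_at
ORDER BY l.id
"""
latest = conn.execute(LATEST_SQL).fetchall()
print(latest)
```

The trade-off is that every read pays for the deduplication, so this tends to suit tables that are written far more often than they are queried.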

You can bulk-load your data into a staging table (for example with COPY INTO) and then use Snowflake's MERGE statement to upsert it from the staging table into the target table.
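As a rough sketch of the second step, the MERGE text could be generated from the list of key and value columns. The helper build_merge_sql and the table and column names below are hypothetical, and the statement follows the general MERGE form from the Snowflake documentation linked above; check it against your own schema before running it.

```python
def build_merge_sql(target, staging, key_cols, value_cols):
    """Return a MERGE statement that upserts `staging` into `target`.

    Identifiers are interpolated as-is, so they must come from trusted
    configuration, never from user input.
    """
    on_clause = " AND ".join(f"{target}.{c} = {staging}.{c}" for c in key_cols)
    set_clause = ", ".join(f"{target}.{c} = {staging}.{c}" for c in value_cols)
    all_cols = list(key_cols) + list(value_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"{staging}.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target}\n"
        f"USING {staging}\n"
        f"  ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql("customers", "customers_staging", ["id"], ["name", "email"])
print(merge_sql)
```

The generated statement would then be executed once per sync, after the staging table has been loaded, so the whole batch is upserted in a single set-based operation.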

Related

Is it possible to prefilter data when copying it from csv to sql table?

I have a large .csv table that I want to insert into a Postgres DB. I don't need all rows from the table. Is it possible to filter it using SQL before it is uploaded to the database, or is the only option to delete the rows I don't need afterward?
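One possible approach, sketched here outside the database rather than in SQL, is to stream the file through a small filter and bulk-load only the rows that pass. The function filter_csv, the column name country, and the sample data are all made up for the illustration.

```python
import csv
import io


def filter_csv(src, dst, keep_row):
    """Copy only the rows for which keep_row(row) is true; return the count kept."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:  # rows are streamed, so the file never has to fit in memory
        if keep_row(row):
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-ins for the real input and output files.
raw = io.StringIO("id,country\n1,DE\n2,US\n3,DE\n")
filtered = io.StringIO()
kept = filter_csv(raw, filtered, lambda row: row["country"] == "DE")
print(kept, filtered.getvalue().splitlines())
```

With real files, src and dst would be opened with open(..., newline=""), and the filtered file would then be loaded with whatever bulk-load tool is already in use.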

Updating/Inserting Oracle table from csv data file

I am still learning Oracle SQL, and I've been trying to find the best way to update/insert records in an Oracle table with the data from a CSV file.
So far, I've figured out how to load the csv into a temporary table using External Tables in Oracle, but I'm having difficulty finding a detailed guide on how to update/insert (UPSERT) the loaded data into an existing table.
What is the best way to do this, when I have 30+ fields in the table? For example, is it best to read the csv line by line with something like pandas and update each record one by one, or is it best to do it with a sql script using something like a merge statement? Not all records in the csv have a value for the primary key, in which case I need to insert rather than update. Thanks for the help!
That looks like a MERGE, indeed.
Data from the external table would then be used to update values in existing rows and to create new rows in the target table.
Pandas and row-by-row processing? I wouldn't do that. If you already have a powerful database, then use its capabilities. Row-by-row is usually slow-by-slow and there's rarely some benefit in doing it that way.

Deleting records in a table with billion records using spark or scala

We have a table in Azure Data Warehouse with 17 billion records. We now have a scenario where we have to delete records from this table based on a WHERE condition. We are writing Spark code in Scala in Azure Databricks notebooks.
We searched for different options to do this in Spark, but all of them suggested first reading the entire table, deleting the records from it, and then overwriting the entire table in Data Warehouse. However, this approach will not work in our case because of the huge number of records in our table.
Can you please suggest how we can achieve this using Spark/Scala?
1) We checked whether we can call a stored procedure from Spark/Scala code in Azure Databricks, but Spark does not support stored procedures.
2) We tried reading the entire table first to delete the records, but it goes into a never-ending loop.
It is possible to create a view with a SELECT clause that matches your requirement, and then use that view.

How to insert overwrite partitioned table in BigQuery UI?

We can insert data into a specific partition of a partitioned table; here we need to specify the partition value. But my requirement is to overwrite all partitions in a table in one query using the UI. Can we perform this operation?
I consulted a BigQuery team member. You can NOT write to all partitions in one query.
You can only write to one partition at a time.
As YY has pointed out, you would not be able to do this directly in BigQuery/SQL with one query (you could script something to run N queries). However, if you spun up a Cloud Dataflow pipeline and configured it to have multiple BigQueryIO sinks, with each one (sink) overwriting a partition, this could be one way I can think of doing it in one shot. It would be a straightforward pipeline to spin up and run.
At this time, BigQuery allows updating up to 2000 partitions in a single statement. If you just need to insert data into a partitioned table, you can use the INSERT DML statement to write to up to 2000 partitions in one statement. If you are updating or deleting existing partitions, you can use the UPDATE or DELETE statements respectively. If you are both updating and inserting new data, you can use the MERGE DML statement to achieve this.

Is it possible to overwrite with a SSIS Insert or similar?

I have a .csv file that gets pivoted into 6 million rows during an SSIS package. I have a table in SQL Server 2005 with more than 25 million rows. The .csv file has data that duplicates data in the table. Is it possible for rows to be updated if they already exist, or what would be the best method to achieve this efficiently?
Comparing 6m rows against 25m rows is not going to be too efficient with a lookup or a SQL command data flow component being called for each row to do an upsert. In these cases, sometimes it is most efficient to load them quickly into a staging table and use a single set-based SQL command to do the upsert.
Even if you do decide to do the lookup, split the flow into two streams: one that inserts new rows directly and another that inserts into a staging table for a later update operation.
If you don't mind losing the old data (i.e. the latest file is all that matters, not what's in the table), you could erase all the records in the table and insert them again.
You could also load into a temporary table and determine what needs to be updated and what needs to be inserted from there.
You can use the Lookup task to identify any matching rows in the CSV and the table, then pass the output of this to another table or data flow and use a SQL task to perform the required Update.
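The staging-table approach suggested in the answers above can be sketched as two set-based statements: one update for rows that already exist and one insert for rows that don't. This is only an illustration that runs against Python's built-in SQLite rather than SQL Server, with made-up table names (target, staging). It uses two statements instead of MERGE because, as far as I know, MERGE only arrived in SQL Server 2008.

```python
import sqlite3

# SQLite stands in for SQL Server here; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target  (id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE staging (id INTEGER PRIMARY KEY, amount INTEGER);
INSERT INTO target  VALUES (1, 10), (2, 20);
INSERT INTO staging VALUES (2, 25), (3, 30);
""")

with conn:  # both statements commit together, or not at all
    # 1) Update rows that exist in both tables, in one statement.
    conn.execute("""
        UPDATE target
        SET amount = (SELECT s.amount FROM staging AS s WHERE s.id = target.id)
        WHERE EXISTS (SELECT 1 FROM staging AS s WHERE s.id = target.id)
    """)
    # 2) Insert rows that exist only in staging.
    conn.execute("""
        INSERT INTO target (id, amount)
        SELECT s.id, s.amount FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM target AS t WHERE t.id = s.id)
    """)

rows = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
print(rows)
```

The update has to run before the insert; otherwise the freshly inserted rows would be matched and updated a second time for no benefit.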