Is it possible to overwrite with an SSIS Insert or similar? - sql

I have a .csv file that gets pivoted into 6 million rows during an SSIS package. I have a table in SQL Server 2005 of 25 million+ rows. The .csv file contains data that duplicates data in the table. Is it possible for rows to be updated if they already exist, or what would be the best method to achieve this efficiently?

Comparing 6 million rows against 25 million rows is not going to be efficient with a Lookup or a SQL command data flow component being called once per row to do an upsert. In cases like this it is usually most efficient to load the rows quickly into a staging table and use a single set-based SQL command to do the upsert.
Even if you do decide to use the Lookup, split the flow into two streams: one that inserts directly and one that writes to a staging table for a set-based update.
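A minimal sketch of the set-based upsert, assuming hypothetical tables dbo.Target and dbo.Staging keyed on an Id column (SQL Server 2005 has no MERGE statement, so the upsert is an UPDATE followed by an INSERT):

    -- Update rows that already exist in the target
    UPDATE t
    SET    t.Col1 = s.Col1,
           t.Col2 = s.Col2
    FROM   dbo.Target AS t
    INNER JOIN dbo.Staging AS s
            ON s.Id = t.Id;

    -- Insert rows that exist only in the staging table
    INSERT INTO dbo.Target (Id, Col1, Col2)
    SELECT s.Id, s.Col1, s.Col2
    FROM   dbo.Staging AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Target AS t WHERE t.Id = s.Id);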

If you don't mind losing the old data (i.e. the latest file is all that matters, not what's in the table), you could delete all the records in the table and insert them again.
You could also load into a temporary table and determine from there what needs to be updated and what needs to be inserted.
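If the truncate-and-reload route is acceptable, the whole thing reduces to an Execute SQL task before the data flow; a sketch, assuming a hypothetical target dbo.Target:

    -- Clear the target, then let the data flow bulk-insert the pivoted CSV rows
    TRUNCATE TABLE dbo.Target;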

You can use the Lookup component to identify matching rows between the CSV and the table, then pass its output to another table or data flow and use a SQL task to perform the required update.

Related

Updating/Inserting Oracle table from csv data file

I am still learning Oracle SQL, and I've been trying to find the best way to update/insert records in an Oracle table with the data from a CSV file.
So far I've figured out how to load the CSV into a temporary table using external tables in Oracle, but I'm having difficulty finding a detailed guide on how to update/insert (upsert) the loaded data into an existing table.
What is the best way to do this when I have 30+ fields in the table? For example, is it best to read the CSV line by line with something like pandas and update each record one by one, or is it better to do it in a SQL script using something like a MERGE statement? Not all records in the CSV have a value for the primary key, in which case I need to insert rather than update. Thanks for the help!
That looks like a MERGE, indeed.
Data from the external table would then be used to
update values in existing rows
create new rows in the target table
Pandas and row-by-row processing? I wouldn't do that. If you already have a powerful database, use its capabilities. Row-by-row is usually slow-by-slow, and there's rarely any benefit in doing it that way.
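A minimal sketch of that MERGE, assuming a hypothetical target target_table, an external table csv_ext over the CSV, a key column id, and two sample columns standing in for the 30+ fields:

    MERGE INTO target_table t
    USING csv_ext s
       ON (t.id = s.id)
    WHEN MATCHED THEN
        UPDATE SET t.col1 = s.col1,
                   t.col2 = s.col2
    WHEN NOT MATCHED THEN
        -- CSV rows with no primary key value never match, so they fall through
        -- to the insert; generate a key for them (e.g. from a sequence) if needed
        INSERT (id, col1, col2)
        VALUES (s.id, s.col1, s.col2);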

How to do a bulk upsert operation in Snowflake?

I am syncing my MongoDB data to Snowflake on a daily basis using a Node.js script. If a row already exists in Snowflake, I want to replace that row with the new data; if the row doesn't exist in Snowflake, I want to insert a new row.
Also, I want to do this for a lot of data.
So is there any way to do a bulk upsert in Snowflake? If not, what would be the optimal way to achieve this?
The table may have millions of rows and could grow to billions in the future.
This is a typical use case for a MERGE statement. You can see the documentation for MERGE here: https://docs.snowflake.com/en/sql-reference/sql/merge.html
Using a MERGE statement for billions of rows can lead to a high-churn table, so it isn't ideal. It can be better to only append to the table and determine the latest record for each key with a SELECT statement.
You can bulk copy your data into a staging table and then use the MERGE feature in Snowflake.
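A minimal sketch of that pattern, assuming hypothetical tables target_tbl and staging_tbl keyed on id, with the staging table loaded by your Node.js script or a COPY INTO:

    MERGE INTO target_tbl t
    USING staging_tbl s
       ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET t.payload    = s.payload,
                   t.updated_at = s.updated_at
    WHEN NOT MATCHED THEN
        INSERT (id, payload, updated_at)
        VALUES (s.id, s.payload, s.updated_at);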

Deleting records in a table with billions of records using Spark or Scala

We have a table in Azure Data Warehouse with 17 billion records. Now we have a scenario where we have to delete records from this table based on a WHERE condition. We are writing Spark in Scala in Azure Databricks notebooks.
We searched for different options to do this in Spark, but all of them suggest first reading the entire table, deleting records from it, and then overwriting the entire table in the Data Warehouse. However, this approach will not work in our case because of the huge number of records in our table.
Can you please suggest how we can achieve this functionality using Spark/Scala?
1) We checked whether we can call a stored procedure through Spark/Scala code in Azure Databricks, but Spark does not support stored procedures.
2) We tried reading the entire table first to delete the records, but it goes into a never-ending loop.
It is possible to create a view with a SELECT clause matching your requirement, and then use that view instead of the table.
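A minimal sketch of the view approach, assuming a hypothetical table dbo.BigTable and a hypothetical filter column EventDate (substitute your own condition for the rows you want to keep):

    -- Downstream jobs read the view, which excludes the rows that would otherwise be deleted
    CREATE VIEW dbo.BigTable_Active AS
    SELECT *
    FROM   dbo.BigTable
    WHERE  EventDate >= '2020-01-01';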

How can I create a Primary Key in Oracle when a composite key and the "generated always as identity" option won't work?

I'm working on an SSIS project that pulls data from Excel and loads it to an Oracle database every month. I plan to pull data from the Excel file and load it to an Oracle stage table. I will be using a MERGE statement because the data that gets loaded each month is a rolling 12-month list and the data can change, so I need to be able to INSERT when records don't match or UPDATE when they do. My control flow looks like this: Truncate Stage Table (to clear out the table from the last package run) ---> Data Flow from Excel to Stage Table ---> Merge to Target Table in Oracle.
My problem is that the data in the source Excel file doesn't have any unique columns from which to select a primary key or a composite key, as it is possible (although very unlikely) that a new record could have exactly the same information as an existing one. I am unable to use "generated always as identity" because my SSIS package needs to truncate the stage table at the beginning of each job; this would generate the same ID numbers in the new load and create problems in the target table.
Any suggestions as to how I can get around this problem?
Welcome to SO and ETL. Instead of using a staging table, use two sources in SSIS: the Excel file and the existing production table. Sort both inputs and then perform a merge join on the unique identifier. From there, use a Derived Column transformation to add a new column called 'Action' which marks each row as an INSERT, UPDATE, or DELETE based on which side of the join key is NULL. So:
NULL from the file means DELETE (not in file, in database)
NULL from the database means INSERT (in file, not in database)
Not NULL on both sides means UPDATE (in file, in database)
From there, use a Conditional Split to route rows to either an OLE DB Destination (INSERT) or a SQL command (UPDATE or DELETE); a sketch of the UPDATE command follows below. You can now remove the staging environment and the MERGE command from your process. This has the added benefit of removing the ETL load from the SQL Server, assuming SSIS is running on a separate server.
Note: The sort transformation has the option to remove duplicates.
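For the UPDATE branch, the SQL command component runs a parameterized statement once per routed row; a minimal sketch, assuming a hypothetical target table target_tbl and columns you map from the data flow (the ? placeholders bind to input columns in the component's column mappings):

    -- UPDATE branch of the conditional split; DELETE is the same idea with a DELETE statement
    UPDATE target_tbl
    SET    amount     = ?,
           updated_on = ?
    WHERE  business_key = ?;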

SSIS Alternatives to one-by-one update from RecordSet

I'm looking for a way to speed up the following process: I have an SSIS package that loads data from Excel files to SQL Server on a weekly basis. There are 3 fields: Brand, Date, Value.
In the data flow, I check for existing combinations of Brand + Date; new combinations go to the table directly, while the existing ones go to a Recordset destination for updates.
The next step is to update the Value of the existing combinations, one row at a time from the Recordset.
There are thousands of records to update and it takes too long, and the number of records tends to grow week by week. Please suggest a faster approach.
The fastest way will be to do this inside a stored procedure using an ELT (Extract, Load, Transform) approach.
Push all the data from Excel as-is into a table (in other words, load it into a staging table). Since you do not seem to be concerned with data validation steps, this table can be a replica of the final destination table's columns.
The next step is to call a stored procedure using an Execute SQL task. Inside this procedure you can put all your business logic, as in the sketch below. Since this step uses native, set-based data manipulation on SQL Server entities, it is the fastest alternative.
As a last step, delete all entries from the staging table.
You can add indexes on the staging table to make the stored procedure part even faster.
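A minimal sketch of such a procedure, assuming hypothetical tables dbo.BrandValues (destination) and dbo.BrandValues_Staging (the as-is Excel load), keyed on Brand + Date:

    CREATE PROCEDURE dbo.usp_UpsertBrandValues
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Update the Value of Brand + Date combinations that already exist
        UPDATE t
        SET    t.Value = s.Value
        FROM   dbo.BrandValues AS t
        INNER JOIN dbo.BrandValues_Staging AS s
                ON s.Brand = t.Brand
               AND s.[Date] = t.[Date];

        -- Insert combinations that are new this week
        INSERT INTO dbo.BrandValues (Brand, [Date], Value)
        SELECT s.Brand, s.[Date], s.Value
        FROM   dbo.BrandValues_Staging AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM dbo.BrandValues AS t
                           WHERE t.Brand = s.Brand
                             AND t.[Date] = s.[Date]);

        -- Clear the staging table for the next weekly load
        TRUNCATE TABLE dbo.BrandValues_Staging;
    END;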