I am loading CSV files into Redshift daily. To handle duplicates, I load the files into a staging table and then use update and insert scripts, based on the key columns, to load the target table. Recently I unexpectedly found duplicate data in the target table.
I double-checked my scripts and don't see any reason for the duplicates. Below are the update and insert script formats that I am using.
For Inserting:
Insert into target (key1, key2, col3, col4)
Select key1, key2, col3, col4
From stage s
where not exists (select 1 from target t
                  where s.key1 = t.key1
                    and s.key2 = t.key2);
And for update:
Update target Set
key1=s.key1, key2=s.key2, col3=s.col3, col4=s.col4
From stage s where target.key1=s.key1 and target.key2=s.key2;
Any help is appreciated.
I ran into this too. The problem was in the INSERT ... SELECT: the SELECT itself produced duplicates, because the staging table contained the same key more than once and the NOT EXISTS clause only checks against the target. One solution for us was to use a cursor (outside of Redshift) to run the select and insert one record at a time, but this proved to have performance issues. Instead, we now check for duplicates with an initial select
select key1, key2 from stage group by key1, key2 having count(*) > 1;
and stop the process if any rows are returned.
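If you would rather deduplicate automatically than stop the load, here is a minimal sketch that keeps one arbitrary row per key inside the insert itself (assuming key1/key2 are the business keys and you don't care which of the duplicate staging rows survives):
Insert into target (key1, key2, col3, col4)
Select key1, key2, col3, col4
From (
    Select key1, key2, col3, col4,
           row_number() over (partition by key1, key2 order by key1) as rn
    From stage
) s
Where s.rn = 1
  and not exists (select 1 from target t
                  where t.key1 = s.key1
                    and t.key2 = s.key2);
If the duplicate rows can legitimately differ in col3/col4, use an explicit order by (a load timestamp, for example) so the right row wins.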
Related
I have a .csv file with 600 million plus rows that I need to upload into a database. The table will have 3 columns assigned as a composite primary key.
I use pandas to read the file in chunks of 1000 lines.
At each chunk iteration I use the query
INSERT INTO db_name.dbo.table_name("col1", "col2", "col3", "col4")
VALUES (?,?,?,?)
executed with
cursor.executemany(query, df.values.tolist())
via pyodbc in Python to upload the data in chunks of 1000 lines.
Unfortunately, there are apparently some duplicate rows present. When a duplicate row is encountered, the upload stops with an error from SQL Server.
Question: how can I upload the data so that whenever a duplicate is encountered, instead of stopping, it just skips that line and uploads the rest? I found some questions and answers about inserting into a table from another table, or inserting from declared variables, but nothing about reading from a file and using an INSERT INTO table (col_names) VALUES () command.
Based on those answers one idea might be:
At each iteration of chunks:
Upload to a temp table
Do the insertion from the temp table into the final table
Delete the rows in the temp table
However, with such a large file each second counts, and I was looking for an answer with better efficiency.
I also tried to deal with the duplicates in Python; however, since the file is too large to fit into memory, I could not find a way to do that.
Question 2: if I were to use BULK INSERT, how would I skip over the duplicates?
Thank you
You can try to use a CTE and an INSERT ... SELECT ... WHERE NOT EXISTS.
WITH cte
AS
(
SELECT ? col1,
? col2,
? col3,
? col4
)
INSERT INTO db_name.dbo.table_name
(col1,
col2,
col3,
col4)
SELECT col1,
col2,
col3,
col4
FROM cte
WHERE NOT EXISTS (SELECT *
FROM db_name.dbo.table_name
WHERE table_name.col1 = cte.col1
AND table_name.col2 = cte.col2
AND table_name.col3 = cte.col3
AND table_name.col4 = cte.col4);
You can drop the table_name.col<n> = cte.col<n> comparisons for any column that isn't part of the primary key.
I would always load into a temporary load table first, one that doesn't have any unique or PK constraint on those columns. This way you can always see that the whole file has loaded, which is an invaluable check in any ETL work, and it also allows easy analysis of the source data.
After that, use an insert such as the one suggested in an earlier answer, or, if you know that the target table is empty, simply use
INSERT INTO db_name.dbo.table_name(col1,col2,col3,col4)
SELECT distinct col1,col2,col3,col4 from load_table
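If the target table is not empty, a sketch that combines both checks might look like this (assuming, per the question, that col1-col3 make up the composite key):
INSERT INTO db_name.dbo.table_name (col1, col2, col3, col4)
SELECT DISTINCT l.col1, l.col2, l.col3, l.col4
FROM load_table l
WHERE NOT EXISTS (SELECT 1
                  FROM db_name.dbo.table_name t
                  WHERE t.col1 = l.col1
                    AND t.col2 = l.col2
                    AND t.col3 = l.col3)
Keep in mind that DISTINCT only removes rows that are identical in every column; rows that repeat the key with different col4 values would still collide with the primary key.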
The best approach is to use a temporary table and execute a MERGE-INSERT statement. You can do something like this (not tested):
CREATE TABLE #MyTempTable (col1 VARCHAR(50), col2 ..., col3 ..., col4 ...);
INSERT INTO #MyTempTable (col1, col2, col3, col4)
VALUES (?,?,?,?);
CREATE CLUSTERED INDEX ix_tempCol1 ON #MyTempTable (col1);
MERGE INTO db_name.dbo.table_name AS TARGET
USING #MyTempTable AS SOURCE ON TARGET.COL1 = SOURCE.COL1 AND TARGET.COL2 = SOURCE.COL2 ...
WHEN NOT MATCHED THEN
INSERT(col1, col2, col3, col4)
VALUES(source.col1, source.col2, source.col3, source.col4);
You need to consider the best indexes for your temporary table to make the MERGE faster. The WHEN NOT MATCHED branch is what avoids the duplicates, based on the columns you compare in the ON clause.
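One caveat: if a single chunk contains the same key twice, neither row matches the target, so both would be inserted and the statement would still fail on the primary key. A sketch that first keeps one row per key in the source (assuming col1-col3 are the key columns):
MERGE INTO db_name.dbo.table_name AS TARGET
USING (SELECT col1, col2, col3, col4
       FROM (SELECT col1, col2, col3, col4,
                    ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                                       ORDER BY col1) AS rn
             FROM #MyTempTable) d
       WHERE d.rn = 1) AS SOURCE
ON TARGET.col1 = SOURCE.col1
   AND TARGET.col2 = SOURCE.col2
   AND TARGET.col3 = SOURCE.col3
WHEN NOT MATCHED THEN
    INSERT (col1, col2, col3, col4)
    VALUES (SOURCE.col1, SOURCE.col2, SOURCE.col3, SOURCE.col4);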
SQL Server Integration Services offers one method that can read data from a source (via a Data Flow task), then remove duplicates using its Sort transformation (which has a checkbox to remove duplicate rows).
https://www.mssqltips.com/sqlservertip/3036/removing-duplicates-rows-with-ssis-sort-transformation/
Of course the data has to be sorted, and sorting 600 million+ rows isn't going to be fast.
If you want to use pure SQL Server then you need a staging table (without a PK constraint). After importing your data into the staging table, you would insert into your target table while filtering on the composite PK combination. For example:
Insert into dbo.RealTable (KeyCol1, KeyCol2, KeyCol3, Col4)
Select Col1, Col2, Col3, Col4
from dbo.Staging S
where not exists (Select *
from dbo.RealTable RT
where RT.KeyCol1 = S.Col1
AND RT.KeyCol2 = S.Col2
AND RT.KeyCol3 = S.Col3
)
In theory you could also use the set operator EXCEPT since it takes the distinct values from both tables. For example:
INSERT INTO RealTable
SELECT * FROM Staging
EXCEPT
SELECT * FROM RealTable
This would insert the distinct rows from Staging that don't already exist in RealTable. Note that this method doesn't handle rows that repeat the composite PK with different values in the other columns; an insert error in that case would indicate that the csv assigns different values to the same composite key.
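A quick way to spot those conflicts up front is a grouping query like this sketch (assuming Col1-Col3 are the key columns in Staging, as in the insert above):
Select Col1, Col2, Col3
from dbo.Staging
group by Col1, Col2, Col3
having count(distinct Col4) > 1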
I am attempting to write a SQL script to bulk delete rows in a table using input from a text file. I am just getting into SQL scripting.
Backstory: someone in my previous role set up a table without a primary key, and a program was designed to insert data into the table. However, the program never checks for duplicate entries first; it just goes ahead and does the insert.
I am attempting to clean up the database.
First, I ran a query to see just how many rows are duplicates. There are roughly 7,000, so there is no way I am going to delete them one query at a time. [ID] should have been set up as a primary key.
Query to determine duplicates
SELECT [ID] FROM [testing].[dbo].[testingtable]
GROUP BY [ID]
HAVING COUNT(*) > 1
I can delete the duplicate rows by using the following query on an individual ID:
SET ROWCOUNT 1
DELETE FROM [testing].[dbo].[testingtable]
WHERE [ID] = SomeNumber
SET ROWCOUNT 0
I have a text file of all of the duplicate ID numbers. Is there a bulk delete script I can create that feeds in all of the duplicate IDs from the text file? Or is there a more efficient way? Please point me in the right direction.
I don't understand why you have (or need) a text file of all duplicate IDs.
"There are roughly 7,000, so there is no way I am going to delete them one query at a time."
Of course there is a way to delete them, here we go:
If you just want to remove duplicates from your table, use this code:
WITH CTE AS
(
    SELECT [ID],
           RN = ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [ID])
    FROM [testing].[dbo].[testingtable]
)
DELETE FROM CTE WHERE RN > 1;
If instead you want to remove a very high percentage of the rows, copy the rows you want to keep into a holding table, truncate the original, and reload it:
SELECT col1, col2, ...
INTO #Holdingtable
FROM MyTable
WHERE ..opposite condition..
TRUNCATE TABLE MyTable
INSERT MyTable (col1, col2, ...)
SELECT col1, col2, ...
FROM #Holdingtable
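Applied to the duplicate-[ID] situation in this question, a sketch of that pattern could look like the following, assuming the duplicate rows are identical in every column (col1 and col2 stand in for the real column names):
SELECT DISTINCT [ID], col1, col2
INTO #Holdingtable
FROM [testing].[dbo].[testingtable]

TRUNCATE TABLE [testing].[dbo].[testingtable]

INSERT [testing].[dbo].[testingtable] ([ID], col1, col2)
SELECT [ID], col1, col2
FROM #Holdingtable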
I want to select some data using simple SQL and insert it into another table. Both tables are the same: the data types and column names are identical, and one is simply a temporary copy of the master table. Using a single SQL statement I want to insert that data into the other table, with a WHERE condition that checks E_ID = ?. My other concern is that sometimes there may not be any matching rows in the table; will that throw a SQL exception? Another case is multiple matching rows, i.e. one E_ID may have several rows; for example, my attachment_master and attachments_temp tables have multiple rows for one single ID. How do I handle those cases? I have one more problem: my master table data can be inserted into the temp table using the following code, but I want to change only one column and keep the others the same, because I want to set the temp table's status column.
insert into dates_temp_table SELECT * FROM master_dates_table where e_id=?;
Here all the data is inserted into my dates_temp_table. But I want to insert all of the column data and change only the dates_temp_table status column to "Modified". How should I change this code?
You could try this:
insert into table1 ( col1, col2, col3, .... )
SELECT col1, col2, col3, ....
FROM table2 where (any condition you need on table2's columns)
For more info have a look here and this similar question
Hope it may help you.
Edit: If I understand your requirement properly, this may be a helpful solution for you:
insert into table1 ( col1, col2, col3, ...., coln, <your modification column name here> )
SELECT col1, col2, col3, ...., coln, 'Modified'
FROM table2 where table2.e_id = <your id value here>
As per your comment on the other answer above:
"I send my E_ID. I don't want to matching and get. I send my E_ID and
if that ID available I insert those data into my temp table and change
temp table status as 'Modified' and otherwise don't do anything."
According to your statements above: if the given e_id exists, this will copy all of the column values into your table1 and place the value 'Modified' in the status column of table1.
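Applied to the dates_temp_table example from the question, a sketch (start_date and end_date are hypothetical column names standing in for the real ones; list every column of the table and supply 'Modified' in place of the status value):
insert into dates_temp_table (e_id, start_date, end_date, status)
select e_id, start_date, end_date, 'Modified'
from master_dates_table
where e_id = ?;
If the given e_id has several rows in master_dates_table, all of them are copied; if it has none, the insert simply affects zero rows rather than raising an exception.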
For more info look here
You can use a MERGE statement if I understand your requirement correctly.
Documentation
As I do not have your table structure, the statement below is based on assumptions; see whether it caters to your requirement. I am assuming the table has a primary key (prm_key below); change it as per your table design, e.g. to e_id.
MERGE INTO dates_temp_table trgt
USING (SELECT * FROM master_dates_table WHERE e_id=100) src
ON (trgt.prm_key = src.prm_key)
WHEN NOT MATCHED
THEN
INSERT (trgt.col, trgt.col2, trgt.status)
VALUES (src.col, src.col2, 'Modified');
More information and examples here
insert into tablename (column1, column2, column3, column4)
SELECT column1, column2, column3, column4
from anothertablename
where exists (select 1 from tablename
              where tablename.ID = anothertablename.ID)
If multiple matching rows are there, all of them will be inserted; if that is not what you want, you have to narrow your search.
I have a Constraint on a table with IGNORE_DUP_KEY. This allows bulk inserts to partially work where some records are dupes and some are not (only inserting the non-dupes). However, it does not allow updates to partially work, where I only want those records updated where dupes will not be created.
Does anyone know how I can support IGNORE_DUP_KEY when applying updates?
I am using MS SQL 2005
If I understand correctly, you want to do UPDATEs without specifying the necessary WHERE logic to avoid creating duplicates?
create table #t (col1 int not null, col2 int not null, primary key (col1, col2))
insert into #t
select 1, 1 union all
select 1, 2 union all
select 2, 3
-- you want to do just this...
update #t set col2 = 1
-- ... but you really need to do this
update #t set col2 = 1
where not exists (
select * from #t t2
where #t.col1 = t2.col1 and t2.col2 = 1
)
The main options that come to mind are:
Use a complete UPDATE statement to avoid creating duplicates
Use an INSTEAD OF UPDATE trigger to 'intercept' the UPDATE and only apply the updates that won't create a duplicate (a rough sketch follows below)
Use a row-by-row processing technique such as cursors and wrap each UPDATE in TRY...CATCH... or whatever the language's equivalent is
I don't think anyone can tell you which one is best, because it depends on what you're trying to do and what environment you're working in. But because row-by-row processing could potentially produce some false positives, I would try to stick with a set-based approach.
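For the trigger option above, here is a rough, untested sketch. It assumes a permanent table (triggers can't be created on a temp table like #t) with a stable surrogate key id alongside the unique (col1, col2) pair; all names are hypothetical:
CREATE TRIGGER trg_MyTable_Update ON dbo.MyTable
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- re-apply only those updates whose new (col1, col2) would not
    -- collide with a different existing row
    UPDATE t
    SET    col1 = i.col1,
           col2 = i.col2,
           col3 = i.col3
    FROM   dbo.MyTable t
    JOIN   inserted i ON i.id = t.id
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.MyTable x
                       WHERE  x.col1 = i.col1
                         AND  x.col2 = i.col2
                         AND  x.id   <> i.id);
END
Note that this still won't catch two rows in the same UPDATE statement being changed to the same new key; the unique constraint would reject that case as before.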
I'm not sure what is really going on, but if you are inserting duplicates and updating Primary Keys as part of a bulk load process, then a staging table might be the solution for you. You create a table that you make sure is empty prior to the bulk load, then load it with the 100% raw data from the file, then process that data into your real tables (set based is best). You can do things like this to insert all rows that don't already exist:
INSERT INTO RealTable
(pk, col1, col2, col3)
SELECT
pk, col1, col2, col3
FROM StageTable s
WHERE NOT EXISTS (SELECT
1
FROM RealTable r
WHERE s.pk=r.pk
)
Preventing the duplicates in the first place is best. You could also do UPDATEs on your real table by joining in the staging table, etc. This avoids the need to "work around" the constraints; when you work around the constraints, you usually create difficult-to-find bugs.
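For the update side mentioned above, a sketch reusing the column names from the insert example (which columns you actually overwrite is up to you):
UPDATE r
SET    col1 = s.col1,
       col2 = s.col2,
       col3 = s.col3
FROM   RealTable r
JOIN   StageTable s
       ON s.pk = r.pk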
My feeling is that you should use the MERGE statement, and in its update part you should simply not update the key you want to keep unique. That also means you have to define the key as unique in your table (by setting a unique index or defining it as the primary key). Then any update or insert with a duplicate key will fail.
Edit: I think this link will help on that:
http://msdn.microsoft.com/en-us/library/bb522522.aspx
I have an MSSQL stored procedure question for you experts:
I have two tables: [table1], which receives new entries all the time, and [table2], into which I need to copy the content of [table1].
I need to check whether some of the rows already exist in [table2]; if they do, just update the update timestamp in [table2], otherwise insert them into [table2].
The tables can be rather big, about 100k entries, so which is the fastest way to do this?
It should be noted that this is a simplified picture, since some more data handling happens when copying new content from [Table1] -> [Table2].
So to sum up:
If a row exists in both [Table1] and [Table2], update the timestamp of the row in [Table2]; otherwise just insert a new record with the content into [Table2].
If you have SQL Server 2008, it has a MERGE command that can do an insert or update as appropriate.
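A sketch of what that could look like, assuming key is the matching column and timestampcolumn holds the update timestamp (GETDATE() stands in for whatever timestamp value you need):
MERGE table2 AS t
USING table1 AS s
    ON t.[key] = s.[key]
WHEN MATCHED THEN
    UPDATE SET timestampcolumn = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (col1, col2, col3, timestampcolumn)
    VALUES (s.col1, s.col2, s.col3, GETDATE());
If [table1] can contain the same key more than once per run, you would need to deduplicate the source first, as MERGE will not allow updating the same target row twice.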
This works across all versions of SQL Server. MERGE does the same in SQL Server 2008.
UPDATE
table2
SET
timestampcolumn = whatever
WHERE
EXISTS (SELECT *
FROM table1
WHERE
table1.key = table2.key)
INSERT table2 (col1, col2, col3...)
SELECT col1, col2, col3...
FROM table1
WHERE
NOT EXISTS (SELECT *
FROM table2
WHERE
table2.key = table1.key)
Given that this sounds like an ETL process, have you considered using SQL Server Integration Services?
If you are planning on exporting/loading/processing lots of data then this is the way to go in my view. You also have the added advantage of being able to run multiple threads in parallel, plus more options to tweak your data throughput, server memory utilisation, etc.