I have a constraint on a table with IGNORE_DUP_KEY. This allows bulk inserts to partially succeed when some records are duplicates and some are not (only the non-duplicates are inserted). However, it does not allow updates to partially succeed, where I only want those records updated that will not create duplicates.
Does anyone know how I can support IGNORE_DUP_KEY when applying updates?
I am using MS SQL 2005
If I understand correctly, you want to do UPDATEs without specifying the necessary WHERE logic to avoid creating duplicates?
create table #t (col1 int not null, col2 int not null, primary key (col1, col2))
insert into #t
select 1, 1 union all
select 1, 2 union all
select 2, 3
-- you want to do just this...
update #t set col2 = 1
-- ... but you really need to do this
update #t set col2 = 1
where not exists (
select * from #t t2
where #t.col1 = t2.col1 and t2.col2 = 1
)
The main options that come to mind are:
Use a complete UPDATE statement to avoid creating duplicates
Use an INSTEAD OF UPDATE trigger to 'intercept' the UPDATE and only do the UPDATEs that won't create a duplicate (a rough sketch follows this list)
Use a row-by-row processing technique such as cursors and wrap each UPDATE in TRY...CATCH... or whatever the language's equivalent is
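For option 2, here is a rough, untested sketch against a permanent copy of the #t example above (the dbo.t1 table and trigger names are made up for illustration). It hard-codes the SET col2 = 1 change from the example, because correlating inserted and deleted rows generically is awkward when the key itself is being updated:
create table dbo.t1 (col1 int not null, col2 int not null,
                     constraint pk_t1 primary key (col1, col2));
go
create trigger trg_t1_instead_upd on dbo.t1
instead of update
as
begin
    set nocount on;
    -- re-apply the intended change, but only where the new key would not collide
    update t
    set    col2 = 1
    from   dbo.t1 as t
    join   deleted as d on d.col1 = t.col1 and d.col2 = t.col2
    where  not exists (select * from dbo.t1 as x
                       where x.col1 = t.col1 and x.col2 = 1);
end
Running UPDATE dbo.t1 SET col2 = 1 then changes only the rows that can take the new key; note that two rows collapsing onto the same new key in a single statement would still fail inside the trigger.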
I don't think anyone can tell you which one is best, because it depends on what you're trying to do and what environment you're working in. But because row-by-row processing with TRY...CATCH can silently skip rows that failed for reasons other than duplicates (false positives), I would try to stick with a set-based approach.
I'm not sure what is really going on, but if you are inserting duplicates and updating Primary Keys as part of a bulk load process, then a staging table might be the solution for you. You create a table that you make sure is empty prior to the bulk load, then load it with the 100% raw data from the file, then process that data into your real tables (set based is best). You can do things like this to insert all rows that don't already exist:
INSERT INTO RealTable
(pk, col1, col2, col3)
SELECT
pk, col1, col2, col3
FROM StageTable s
WHERE NOT EXISTS (SELECT
1
FROM RealTable r
WHERE s.pk=r.pk
)
Preventing the duplicates in the first place is best. You could also do UPDATEs on your real table by joining in the staging table, etc. This avoids the need to "work around" the constraints; when you work around constraints, you usually create difficult-to-find bugs.
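For example, something along these lines (untested; it reuses the RealTable/StageTable names from above and assumes pk identifies the row):
UPDATE r
SET r.col1 = s.col1,
    r.col2 = s.col2,
    r.col3 = s.col3
FROM RealTable r
JOIN StageTable s ON s.pk = r.pk;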
I have the feeling you should use the MERGE statement, and in the update part you should simply not update the key you want to keep unique. That also means you have to declare that key as unique in your table (set a unique index or define it as the primary key); then any update or insert with a duplicate key will fail.
Edit: I think this link will help on that:
http://msdn.microsoft.com/en-us/library/bb522522.aspx
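To make the idea concrete, here is a rough, untested sketch (note that MERGE requires SQL Server 2008 or later, and the table and column names below are made up for illustration). The unique key columns appear only in the ON clause and the INSERT, never in the UPDATE SET list:
MERGE dbo.SomeTable AS tgt  -- hypothetical table with a unique key on (key1, key2)
USING (SELECT 1 AS key1, 2 AS key2, 'x' AS payload) AS src
    ON tgt.key1 = src.key1 AND tgt.key2 = src.key2
WHEN MATCHED THEN
    UPDATE SET payload = src.payload  -- only non-key columns are updated
WHEN NOT MATCHED THEN
    INSERT (key1, key2, payload) VALUES (src.key1, src.key2, src.payload);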
Related
I have a .csv file with 600 million plus rows. I need to upload this into a database. It will have 3 columns assigned as primary keys.
I use pandas to read the file in chunks of 1000 lines.
At each chunk iteration I use the following syntax with pyodbc in Python to upload the data in chunks of 1000 lines:
INSERT INTO db_name.dbo.table_name("col1", "col2", "col3", "col4")
VALUES (?,?,?,?)
cursor.executemany(query, df.values.tolist())
Unfortunately, there are apparently some duplicate rows present. When a duplicate row is encountered, the upload stops with an error from SQL Server.
Question: how can I upload the data so that whenever a duplicate is encountered it just skips that line and uploads the rest, instead of stopping? I found some questions and answers about INSERT INTO ... SELECT from another table, or inserting from declared variables, but nothing about reading from a file and using an INSERT INTO table (col_names) VALUES () command.
Based on those answers one idea might be:
At each iteration of chunks:
Upload to a temp table
Do the insertion from the temp table into the final table
Delete the rows in the temp table
However, with such a large file each second counts, and I was looking for an answer with better efficiency.
I also tried to deal with the duplicates in Python; however, since the file is too large to fit into memory, I could not find a way to do that.
Question 2: if I were to use BULK INSERT, how would I skip over the duplicates?
Thank you
You can try to use a CTE and an INSERT ... SELECT ... WHERE NOT EXISTS.
WITH cte
AS
(
SELECT ? col1,
? col2,
? col3,
? col4
)
INSERT INTO db_name.dbo.table_name
(col1,
col2,
col3,
col4)
SELECT col1,
col2,
col3,
col4
FROM cte
WHERE NOT EXISTS (SELECT *
FROM db_name.dbo.table_name
WHERE table_name.col1 = cte.col1
AND table_name.col2 = cte.col2
AND table_name.col3 = cte.col3
AND table_name.col4 = cte.col4);
You can possibly drop some of the table_name.col<n> = cte.col<n> comparisons, if the column isn't part of the primary key.
I would always load into a temporary load table first, one that doesn't have any unique or PK constraint on those columns. This way you can always see that the whole file has loaded, which is an invaluable check in any ETL work, and it also allows easy analysis of the source data.
After that, use an insert such as the one suggested in an earlier answer, or if you know that the target table is empty, then simply
INSERT INTO db_name.dbo.table_name(col1,col2,col3,col4)
SELECT distinct col1,col2,col3,col4 from load_table
The best approach is to use a temporary table and execute a MERGE-INSERT statement. You can do something like this (not tested):
CREATE TABLE #MyTempTable (col1 VARCHAR(50), col2 VARCHAR(50), col3 VARCHAR(50), col4 VARCHAR(50)); -- use the real data types of your columns here
INSERT INTO #MyTempTable(col1, col2, col3, col4)
VALUES (?,?,?,?)
CREATE CLUSTERED INDEX ix_tempCol1 ON #MyTempTable (col1);
MERGE INTO db_name.dbo.table_name AS TARGET
USING #MyTempTable AS SOURCE ON TARGET.COL1 = SOURCE.COL1 AND TARGET.COL2 = SOURCE.COL2 ...
WHEN NOT MATCHED THEN
INSERT(col1, col2, col3, col4)
VALUES(source.col1, source.col2, source.col3, source.col4);
You need to consider the best indexes for your temporary table to make the MERGE faster. With the statement WHEN NOT MATCHED you avoid duplicates depending on the ON clause.
SQL Server Integration Services offers one method that can read data from a source (via a Data Flow task), then remove duplicates using its Sort transformation (it has a checkbox to remove duplicate rows).
https://www.mssqltips.com/sqlservertip/3036/removing-duplicates-rows-with-ssis-sort-transformation/
Of course the data has to be sorted, and 600 million+ rows isn't going to be fast.
If you want to use pure SQL Server then you need a staging table (without a pk constraint). After importing your data into Staging, you would insert into your target table using filtering for the composite PK combination. For example,
Insert into dbo.RealTable (KeyCol1, KeyCol2, KeyCol3, Col4)
Select Col1, Col2, Col3, Col4
from dbo.Staging S
where not exists (Select *
from dbo.RealTable RT
where RT.KeyCol1 = S.Col1
AND RT.KeyCol2 = S.Col2
AND RT.KeyCol3 = S.Col3
)
In theory you could also use the set operator EXCEPT, since it returns the distinct rows from the first query that aren't present in the second. For example:
INSERT INTO RealTable
SELECT * FROM Staging
EXCEPT
SELECT * FROM RealTable
This would insert the distinct rows from Staging that don't already exist in RealTable. This method doesn't account for the same composite PK appearing with different non-key values on multiple rows, so an insert error here would indicate that different values are being assigned to the same composite key in the csv.
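If you want to find those conflicting rows before attempting the insert, a quick check along these lines might help (a sketch that assumes the Staging column names used above, with Col1-Col3 as the intended key and Col4 as the remaining column):
SELECT Col1, Col2, Col3
FROM dbo.Staging
GROUP BY Col1, Col2, Col3
HAVING COUNT(DISTINCT Col4) > 1;  -- same composite key arriving with more than one Col4 value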
I have a SQL Server table with three columns, where the first two columns are the primary key. I'm writing a stored procedure that updates the last two columns in bulk, and it works fine as long as there are no primary key violations. But when there is a primary key violation, it throws an error and stops executing.
How can I make it skip the offending row and continue updating the records that don't cause a primary key violation?
Is there a better way to approach this problem? I'm only doing a simple update with a WHERE clause like column2 = somevalue AND column3 = somevalue.
In SQL Server you'd use MERGE to upsert (i.e. insert or update):
MERGE mytable
USING (SELECT 1 as key1, 2 as key2, 3 as col1, 4 as col2) AS src
ON (mytable.key1 = src.key1 AND mytable.key2 = src.key2)
WHEN MATCHED THEN
UPDATE SET col1 = src.col1, col2 = src.col2
WHEN NOT MATCHED THEN
INSERT (key1, key2, col1, col2) VALUES (src.key1, src.key2, src.col1, src.col2);
There is nothing inherently wrong with your question, despite the rather loud protestations. Your question is confusing, though, especially when you refer to columns by position. That is a big no-no. A script that reproduces your problem is generally the best way to both demonstrate it and get useful suggestions.
The short answer to your question is: you can't. A statement either succeeds or fails as a whole. If you want to update each row individually and ignore certain errors, then you need to write your tsql to do that (a rough sketch appears at the end of this answer).
And despite the protests (again), there are situations where it is necessary to update columns that are part of the primary key. It is unusual - very unusual - but you should also be wary of any absolute statement about tsql. When you find yourself doing unusual things, you should review your schema (and your approach) because it is quite possible that there are better ways to accomplish your goal.
And in this case, I suggest that you SHOULD really think about what you are trying to accomplish. If you want to update a set of rows in a particular way and the statement fails, that means there is a flaw somewhere. Typically, this error implies that your update logic is not correct. Perhaps you assume something about your data that is not accurate? It is impossible to know from a distance. The error message will tell you which set of values caused the conflict, so that should give you sufficient information to investigate. As another tool, write a select statement that demonstrates your proposed update and look for the values from the error message. E.g.
set nocount on;
create table #x (a smallint not null, b smallint not null, c varchar(10) not null, constraint xx primary key(a, b));
insert #x (a, b, c) values (1, 1, 'test'), (1, 2, 'zork');
select * from #x;
-- this update fails: both rows with a = 1 would end up as (1, 2), violating the primary key
update #x set b = 2, c = 'dork';
-- a select that shows the proposed new values, so you can spot the conflict yourself
select a, b, c, cast(2 as smallint) as new_b, 'dork' as new_c
from #x
order by a, new_b;
drop table #x;
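And if you really do want the "skip whatever would violate the key" behaviour, below is a rough, untested sketch of the row-by-row idea against the same #x table (re-create it first, since the script above drops it). In real code you would loop over your own keys and set your own values.
declare @a smallint, @b smallint;
declare cur cursor local static for select a, b from #x;
open cur;
fetch next from cur into @a, @b;
while @@fetch_status = 0
begin
    begin try
        -- attempt the same change as above, one row at a time
        update #x set b = 2, c = 'dork' where a = @a and b = @b;
    end try
    begin catch
        -- a duplicate-key failure on this row is reported and skipped; the loop continues
        print 'skipped a row that would have violated the primary key';
    end catch
    fetch next from cur into @a, @b;
end
close cur;
deallocate cur;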
I have a table named T1 with the following values:
Col1 Col2 Col3
Rs1 S S2
Rs2 SX S3
Rs3 S S2
From a csv, I need to insert some values into the table, with the values Rs4, SX and S3 going to the respective columns.
I need to apply a check with the following constraint:
One S3 can belong to only one SX, but S3 and SX as a pair can belong to multiple Col1 values.
What would the Oracle query for this be? And if the above condition holds, I need to run an insertion query which is already prepared. How can this be validated?
PS: we can't create another table.
Had to do a little discovery after I was informed that I totally missed the ORACLE tag. Knowing what you do not know is very important to me. This post should be sufficiently different.
THE BASIC PROBLEM WITH ORACLE'S CHECK
A check constraint can NOT be defined on a SQL View. The check constraint defined on a table must refer only to columns in that table. It cannot refer to columns in other tables.
A check constraint can NOT include a SQL Subquery.
A check constraint can be defined in either a SQL CREATE TABLE statement or a SQL ALTER TABLE statement.
REVISITING THE PROBLEM
We know that a given (Col2, Col3) pair can occur on one or more rows: #(Col2, Col3) >= 1.
We know that Col1 values are associated with those (Col2, Col3) pairs.
However, what is the cardinality of Col1 itself? Can it be more than 1?
Clearly, the business requirements are not fully explained.
REVISITING THE SOLUTIONS
Adding Objects to the database.
While adding additional tables has been voted down, is it possible to add an ID column? Assuming Col1 is NOT unique to the subsets of (Col2, Col3), you can add a true ID column that fulfills the need for normalization while providing true indexing power in your query.
Col1 Col2 Col3 Col4
Rs1 S S2 1
Rs2 SX S3 2
Rs3 S S2 1
To be clear, Col4 would still be an ID since the values of Col2, Col3 are determined by Col4. (Col2,Col3) 1:1 Col4.
CHECKS
Multiple CHECK constraints, each with a simple condition enforcing a
single business rule, are preferable to a single CHECK constraint with
a complicated condition enforcing multiple business rules ORACLE - Constraint
A single column can have multiple CHECK constraints that reference the
column in its definition. There is no limit to the number of CHECK
constraints that you can define on a column. ORACLE - Data Integrity
If you can add a column...by the love of monkeys, please do...not only will it make your life much easier, but you can also make QUERYING the table very efficient. However, for the rest of this post, I will assume you cannot add columns:
RESTATING THE PROBLEM IN CONSTRAINTS
Col2 may not appear with a different Col3. Vice Versa.
(Col2,Col3) may have multiple Col1...what is the possible cardinality of Col1? Can it be repetitive? I read no.
WRITING OUT THE THEORY ON CHECKS
IF Col1 truly is unique in {(col2,col3)}, then the following already works:
ALTER TABLE EXAMPLE3
ADD CONSTRAINT ch_example3_3way UNIQUE (MUT, D, X) -- only works if these values never repeat
The other main constraint #(Col2,Col3) > 1 simply cannot work unless you knew what value was being entered so as to enforce a real SARG. Any Col1 = Col1 or Col1 IN Col1 is the same thing as writing 1 = 1.
ON TRIGGERS
As tempting as the idea sounds, a quick glance through ORACLE lane left me warning against the use. Some reasons from ORACLE:
ORACLE - USING TRIGGERS
Do not create recursive triggers.
For example, if you create an AFTER UPDATE statement trigger on the
employees table, and the trigger itself issues an UPDATE statement on
the employees table, the trigger fires recursively until it runs out
of memory.
Use triggers on DATABASE judiciously. They are executed for every user every time the event occurs on which the trigger is created
Other problems include: TOADWORLD - ORACLE WIKI
Not Compiled -STORED PROCs can reuse a cached plan
No SELECT Trigger Support
Complete Trigger Failure
Disabled Triggers
No Version Control
Update OF COLUMN
No Support of SYS Table Triggers
Mutating Triggers
Hidden Behavior
Still, there are advantages of TRIGGERs, and you could still enforce data integrity by using a query where the first result of
SELECT Col2, Col3 FROM T1 WHERE ROWNUM = 1
is compared to the inserted values :new.Col2, :new.Col3, but this would require the trigger to fire EVERY TIME a row was inserted... recompiled and everything... I STRONGLY URGE AVOIDANCE.
STORED PROCS
Whatever you may think of STORED PROCEDURES, I suggest you consider them again. Everything from Functions, DML, DDL, database management, RECURSIVE LOGIC, sp_executesql, and beyond can be accomplished through a PROC.
Easily managed; provides encapsulation against accidental or malicious disabling or mutilation of code.
PROCs are compiled once and can reuse query plan caches, providing improved performance.
Provides superior portability; can be embedded into TRIGGERs, ORM frameworks, applications and beyond.
Can literally automate almost any function in a database including ETL, Resource management, security, and discovery. Views are commonly run through stored Procs.
THE UNIQUE ADVANTAGE OF ORACLE
Perhaps forgotten: consider that this is ORACLE, which allows you to suspend CONSTRAINTS by declaring the CONSTRAINT DEFERRABLE. From an ETL specialist's perspective, this essentially makes a staging table out of your only table... which is pretty sweet in your predicament of having limited DDL rights.
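For illustration only (the constraint name and column list below are assumed, not taken from the question), a deferrable constraint is relaxed for the length of a transaction and only checked at COMMIT:
ALTER TABLE T1
  ADD CONSTRAINT t1_triple_uq UNIQUE (Col1, Col2, Col3) DEFERRABLE INITIALLY IMMEDIATE;
SET CONSTRAINT t1_triple_uq DEFERRED;  -- checks are postponed until commit
-- ... do the loads / clean-up here ...
COMMIT;                                -- the constraint is enforced at this point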
CONCLUDING COMMENTS
There are a few efficient methods to delete duplicates in your data.
DELETE FROM T1
WHERE rowid NOT IN
(SELECT MAX(rowid)
FROM T1
GROUP BY Col1, Col2, Col3);
NOTE: rowid is the physical location of the row, while rownum represents the logical position in the query.
Lastly, my last attempt at rowid. Unfortunately, time is running late, and the free COMPILER from ORACLE is unhelpful. But I think the idea is what is important.
CREATE TABLE Example3 (MUT VARCHAR(50), D VARCHAR(50), X VARCHAR(50) );
INSERT INTO Example3 (MUT, D, X) VALUES('MUT', 'T', 'M' );
INSERT INTO Example3 (MUT, D, X) VALUES('MUT', 'T', 'P' );
INSERT INTO Example3 (MUT, D, X) VALUES('MUT', 'X', 'LP');
INSERT INTO Example3 (MUT, D, X) VALUES('MUT', 'X', 'Z');
INSERT INTO Example3 (MUT, D, X) VALUES('MUT', 'Y', 'POP');
SELECT C.D, B.X, B.row_id
FROM EXAMPLE3 A
LEFT OUTER JOIN (
    SELECT DISTINCT X, rowid AS row_id
    FROM EXAMPLE3) B ON B.row_id = A.rowid
LEFT OUTER JOIN (
    SELECT D, MAX(rowid) AS row_id
    FROM EXAMPLE3
    GROUP BY D) C ON C.row_id = B.row_id;
Finally, I was able to resolve the question with some SELECT queries and a few IF conditions. I have done this in a stored procedure.
SELECT count(col3)
INTO V_exist_value
FROM T3
WHERE col3 = Variable_col3
AND col1 <> Variable_col1
AND col2 = Variable_col2;
IF (V_exist_value >= 1) THEN
INSERT INTO T3 (col1, col2, col3)
VALUES (Variable_col1, Variable_col2, Variable_col3);
ELSE
SELECT count(col3)
INTO V_exist_value1
FROM T3
WHERE col3 = Variable_col3;
IF (V_exist_value1 = 0) THEN
INSERT INTO T3 (col1, col2, col3)
VALUES (Variable_col1, Variable_col2, Variable_col3);
ELSE
RAISE Exception_col3_value_exists;
END IF;
END IF;
If you don't want to use a trigger then you must normalize your tables.
Create a second table - say T1_PAIRS - that will store all permitted pairs of (col2, col3).
Create a unique constraint on the col2 column in table T1_PAIRS - this constraint allows only unique values of COL2 - for example, no more than one S3 value can be used across all pairs ==> this enforces the rule: "One S3 can belong to only one SX"
Create a primary key on ( col2, col3 ) columns in this table T1_PAIRS.
Create a foreign key constraint on ( col2, col3 ) in T1 table that references the primary key of T1_PAIRS table.
In the end, create a unique constraint on the (col1, col2, col3) columns to enforce the rule ==> "S3 and SX as a pair can belong to multiple column1 values (but each column1 value only once per pair)"
An example:
CREATE TABLE T1_PAIRS (
Col2 varchar2(10), Col3 varchar2(10),
CONSTRAINT T1_PAIRS_PK PRIMARY KEY( col2, col3 ),
CONSTRAINT T1_col2_UQ UNIQUE( col2 )
);
INSERT ALL
INTO T1_PAIRS( col2, col3 ) VALUES( 'S', 'S2' )
INTO T1_PAIRS( col2, col3 ) VALUES( 'SX', 'S3' )
SELECT 1 FROM dual;
ALTER TABLE T1
ADD CONSTRAINT col2_col3_pair_fk
FOREIGN KEY ( col2, col3 ) REFERENCES T1_pairs( col2, col3 );
ALTER TABLE T1
ADD CONSTRAINT pair_can_belong_to_multi_col1 UNIQUE( col1, col2, col3 );
I want to set up a table with a constraint on it, but when I insert records, I don't want to get any constraint violation errors. I would like SQL to quietly drop any records that aren't unique, but carry on inserting those that can be inserted.
for example....
create table table1
(value1 int,
value2 int,
constraint uc_tab1 Unique (value1,value2)
)
create table table2
(value1 int,
value2 int
)
insert into table2 (value1,value2)
select 1,1
union all
select 2,1
union all
select 3,1
union all
select 1,1
insert into table1
select value1,value2 from table2
At the moment, this will fall over on a constraint violation. I want to suppress that error, so that table1 contains...
1,1
2,1
3,1
(in this example, I could just do a group by on table2, but in my actual application that isn't really viable)
I vaguely remember reading something about this years ago, but I might have imagined it. Is this possible?
Many thanks in advance
Please don't do this; you will lose data very easily.
Instead, try to change your application so it only inserts valid data instead of dropping incorrect data.
You can use the IGNORE_DUP_KEY index option, although personally I think it is better to find another way of solving your problem.
You can set it to ON to only generate warnings for inserted rows that violate the unique constraint instead of generating errors.
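For the index-option form (as opposed to the constraint form shown further down), a sketch against the table1 example would be:
CREATE UNIQUE INDEX ux_table1 ON table1 (value1, value2)
    WITH (IGNORE_DUP_KEY = ON);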
Look into the MERGE statement. It's complex, but can be made to do what you are describing.
(There is or was something that could cause an INSERT statement to continue to insert data even if some rows could not be inserted, but for the life of me I can't find it in BOL or recall what it was called. I'm pretty sure it raised errors anyway, and it always sounded like a horrible idea to me.)
Specifying Ignore_Dup_Key when I created my constraint did the trick. In the above example, I changed the table1 definition to....
create table table1
(value1 int,
value2 int,
constraint uc_tab1 Unique (value1,value2) WITH (IGNORE_DUP_KEY = ON)
)
And it worked perfectly
In a stored procedure I want to delete some rows from a table and, after some other code runs, insert the deleted rows back into the same table.
How can I do it?
Thanks all.
Update:
I have a Table:
SampleTable(Col1, Col2, Col3, Col4)
I want to do that:
DELETE FROM SampleTable
WHERE Col1 = "foo"
-- SOME CODE...
INSERT INTO SampleTable
[DELETED VALUES...]
UPDATE:
Sorry but now I can't see the DB.
The problem is that in the SOME CODE... part, written by others, there is a delete that gives me an error, but after the delete there is an insert with the SP input that replaces the deleted row with the same key.
I know that an UPDATE would apparently solve my problem, but there is a lot of logic and I don't want to change the SOME CODE... part, so I'm looking for a workaround; essentially I want to temporarily ignore the foreign key.
select * into #ttable FROM SampleTable
WHERE Col1 = 'foo'
DELETE FROM SampleTable
WHERE Col1 = 'foo'
-- SOME CODE...
INSERT INTO SampleTable
select * from #ttable
Deleting and re-inserting the rows can introduce all sorts of problems. For instance, identity() values will change (as well as automatically assigned creation times). In addition, you might have constraints. In theory, anything could happen to the database between the deletion and re-insertion, so constraints that once worked might fail.
How about creating a view?
create view v_SampleTable as
select *
from SampleTable
where col1 <> 'foo' or col1 is null;
Then change the code to use v_SampleTable instead of SampleTable. This is an updatable view, so it will even permit modifications to the data inside the table.
You could go even one step further and rename the table first and then create a view with the same name.
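A minimal sketch of that last idea (untested, and reusing the SampleTable/col1 names from above):
EXEC sp_rename 'dbo.SampleTable', 'SampleTable_base';
GO
CREATE VIEW dbo.SampleTable AS
    SELECT *
    FROM dbo.SampleTable_base
    WHERE col1 <> 'foo' OR col1 IS NULL;
GO
The existing code keeps referring to SampleTable, but it now reads and writes through the view.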