SQL Import skip duplicates - sql

I am trying to do a bulk upload into a SQL server DB. The source file has duplicates which I want to remove, so I was hoping that the operation would automatically upload the first one, then discard the rest. (I've set a unique key constraint). Problem is, the moment a duplicate upload is attempted the whole thing fails and gets rolled back. Is there any way I can just tell SQL to keep going?

Try to bulk insert the data to the temporary table and then SELECT DISTINCT as #madcolor suggested or
INSERT INTO yourTable
SELECT * FROM #tempTable tt
WHERE NOT EXISTS (SELECT 1 FROM youTable yt WHERE yt.id = tt.id)
or other field in WHERE clause.

If you're doing this through some SQL tool like SQL Plus or DBVis or Toad, then I suspect not. If you're doing this programatically in a language, then you need to divide and conquer. Presumably executing an update line by line and catching each exception would be too lengthy a process, so instead you could do a batch operation first on the whole SQL block, and if it fails, do it on the first half, and if that fails, do it on the first half of the first half. Iterate this way until you have a block that succeeds. Discard the block and do the same procedure on the rest of the SQL. Anything that violates a constraint will eventually end up as a sole SQL statement which you know to log and discard. This should import with as much bulk processing as is possible while still throwing out the invalid lines.

Use SSIS for this. You can tell it to skip the duplicates. But first make sure they are true duplicates. What if the data in some of the columns is different, how do you know which is the better record to keep?

Related

delete first X row Ingres ANSI

I have 730000+ records which I need to delete in Ingres db which work with ANSI92 and I need to delete then without overload db, simple delete where search condition, doesn't work, DB just use all memory and trowing error. thinking to run it in loop, and delete by portions 10-20K of records .
i tried to use top and it didn't work
delete top (10)from TABLE where web_id <0 ;
, also was trying to use Limit also didnt work
DELETE FROM from TABLE where web_id <0 LIMIT 10;
any ideas how to do it ? Thank you !
You could use a session temporary table to hold the first 10 tids (tuple id's) and then delete based on those:
declare global temporary table session.tenrows as
select first 10 tid the_tid from "table" where web_id<0
on commit preserve rows with norecovery;
delete from "table" where tid in (select the_tid from session.tenrows);
When you say "without overload db", do you mean avoiding hitting the force-abort limit of the transaction log file? If so what might work for you is:
set session with on_logfull=notify;
delete from table where web_id<0;
This would automatically commit your transaction at points where force-abort is reached then carry on, rather than rolling back and reporting an error.
A downside of using this setting is that it can be tricky to unpick what has/hasn't been done if any other error should occur (your work will likely be partially committed), but since this appears to be a straight delete from a table it should be quite obvious which rows remain and which don't.
The "set session" statement must be run at the start of a transaction.
I would advise not running concurrent sessions with "on_logfull=notify" (there have been bugs in this area, whether they're fixed in your installation depends on your version/patch level).

sql server, composite keys - ignoring duplicate

Is there a way to prevent sql from throwing an error when I try to save a record that already exists. I've got a composite key table for a many-to-many relationship that has only the two values, when I update a model from my application, it tries to save all records, the records that already exist throw an error Cannot insert duplicate key is there a way of having the database ignore these, or do I have to handle it in the application?
you are calling an INSERT and trying to add duplicated keys. This error is by design, and essential. The DB is throwing an exception for an exceptional and erroneous condition.
If you are, instead, trying to perform an "upsert" you may need to use a stored procedure or use the MERGE syntax.
If, instead, you don't want to UPDATE but to just ignore rows already in the table, then you need to simply add an exception to your INSERT statement... such as
....
WHERE
table.Key <> interting.key
Try something like this with your insert statement.
insert into foo (x,y)
select #x,#y
except
select x,y from foo
This will add a record to foo, ONLY if it is not already in the table.
You could try creating your index with the IGNORE_DUP_KEY option so that you only get a warning when you have duplicate keys rather than a true error.
The other option and possibly the better one is to use the MERGE statement rather than insert. The MERGE statement let's you do Inserts, Updates and Deletes all in one statement and sounds like it should work out well for what you are trying to do.
Last but not least, as you said fix it in your app and only insert the rows that need to be added.

Debugging sub-queries in TSQL Stored Procedure

How do I debug a complex query with multiple nested sub-queries in SQL Server 2005?
I'm debugging a stored procedure and trigger in Visual Studio 2005. I'd like to be able to see what the results of these sub-queries are, as I feel that this is where the bug is coming from. An example query (slightly redacted) is below:
UPDATE
foo
SET
DateUpdated = ( SELECT TOP 1 inserted.DateUpdated FROM inserted )
...
FROM
tblEP ep
JOIN tblED ed ON ep.EnrollmentID = ed.EnrollmentID
WHERE
ProgramPhaseID = ( SELECT ...)
Visual Studio doesn't seem to offer a way for me to Watch the result of the sub query. Also, if I use a temporary table to store the results (temporary tables are used elsewhere also) I can't view the values stored in that table.
Is there anyway that I can add a watch or in some other way view these sub-queries? I would love it if there was some way to "Step Into" the query itself, but I imagine that wouldn't be possible.
Ok first I would be leary of using subqueries in a trigger. Triggers should be as fast as possible, so get rid of any correlated subqueries which might run row by row instead of in a set-based fashion. Rewrite to joins. If you only want to update records based on what was in the inserted table, then join to it. Also join to the table you are updating. Exactly what are you trying to accomplish with this trigger? It might be easier to give advice if we understood the business rule you are trying to implement.
To debug a trigger this is what I do.
I write a script to:
Do the actual insert to the table
without the trigger on on it
Create a temp table named #inserted
(and/or one named #deleted)
Populate the table as I would expect
the inserted table in the trigger to
be populated from the insert you do.
Add the trigger code (minus the
create or alter trigger parts)
substituting #inserted every time I
reference inserted. (if you plan to
run multiple times until you are
ready to use it in a trigger throw
it in an explicit transaction and
rollback after checking your
results.
Add a query to check the table(s)
you are changing with the trigger for
the values you wanted to change.
Now if you need to add debug
statements to see what is happening
between steps, you can do so.
Run making changes until you get the
results you want.
Once you have the query working as
you expect it to, it is easy to take
the # signs off inserted and use it
to create the body of the trigger.
This is what I usually do in this type of scenerio:
Print out the exact sqls getting generated by each subquery
Then run each of then in the Management Studio as suggested above.
You should check if different parts are giving you the right data you expect.

SQL continue executing queries after duplicate key violation

I have a situation where I want to insert a row if it doesn't exist, and to not insert it if it already does. I tried creating sql queries that prevented this from happening (see here), but I was told a solution is to create constraints and catch the exception when they're violated.
I have constraints in place already. My question is - how can I catch the exception and continue executing more queries? If my code looks like this:
cur = transaction.cursor()
#execute some queries that succeed
try:
cur.execute(fooquery, bardata) #this query might fail, but that's OK
except psycopg2.IntegrityError:
pass
cur.execute(fooquery2, bardata2)
Then I get an error on the second execute:
psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block
How can I tell the computer that I want it to keep executing queries? I don't want to transaction.commit(), because I might want to roll back the entire transaction (the queries that succeeded before).
I think what you could do is use a SAVEPOINT before trying to execute the statement which could cause the violation. If the violation happens, then you could rollback to the SAVEPOINT, but keep your original transaction.
Here's another thread which may be helpful:
Continuing a transaction after primary key violation error
I gave an up-vote to the SAVEPOINT answer--especially since it links to a question where my answer was accepted. ;)
However, given your statement in the comments section that you expect errors "more often than not," may I suggest another alternative?
This solution actually harkens back to your other question. The difference here is how to load the data very quickly into the right place and format in order to move data around a single SELECT -and- is generic for any table you want to populate (so the same code could be used for multiple different tables). Here's a rough layout of how I would do it in pure PostgreSQL, assuming I had a CSV file in the same format of the table to be inserted into:
CREATE TEMP TABLE input_file (LIKE target_table);
COPY input_file FROM '/path/to/file.csv' WITH CSV;
INSERT INTO target_table
SELECT * FROM input_file
WHERE (<unique key field list>) NOT IN (
SELECT <unique key field list>
FROM target_table
);
Okay, this is a idealized example and I'm also glossing over several things (like reporting back the duplicates, pushing the data into the table via Python in-memory data, COPY from STDIN rather than via a file, etc.), but hopefully the basic idea is there and it's going to avoid much of the overhead if you expect more records to be rejected than accepted.

How can remove lock from table in SQL Server 2005?

I am using the Function in stored procedure , procedure contain transaction and update the table and insert values in the same table , while the function is call in procedure is also fetch data from same table.
i get the procedure is hang with function.
Can have any solution for the same?
If I'm hearing you right, you're talking about an insert BLOCKING ITSELF, not two separate queries blocking each other.
We had a similar problem, an SSIS package was trying to insert a bunch of data into a table, but was trying to make sure those rows didn't already exist. The existing code was something like (vastly simplified):
INSERT INTO bigtable
SELECT customerid, productid, ...
FROM rawtable
WHERE NOT EXISTS (SELECT CustomerID, ProductID From bigtable)
AND ... (other conditions)
This ended up blocking itself because the select on the WHERE NOT EXISTS was preventing the INSERT from occurring.
We considered a few different options, I'll let you decide which approach works for you:
Change the transaction isolation level (see this MSDN article). Our SSIS package was defaulted to SERIALIZABLE, which is the most restrictive. (note, be aware of issues with READ UNCOMMITTED or NOLOCK before you choose this option)
Create a UNIQUE index with IGNORE_DUP_KEY = ON. This means we can insert ALL rows (and remove the "WHERE NOT IN" clause altogether). Duplicates will be rejected, but the batch won't fail completely, and all other valid rows will still insert.
Change your query logic to do something like put all candidate rows into a temp table, then delete all rows that are already in the destination, then insert the rest.
In our case, we already had the data in a temp table, so we simply deleted the rows we didn't want inserted, and did a simple insert on the rest.
This can be difficult to diagnose. Microsoft has provided some information here:
INF: Understanding and resolving SQL Server blocking problems
A brute force way to kill the connection(s) causing the lock is documented here:
http://shujaatsiddiqi.blogspot.com/2009/01/killing-sql-server-process-with-x-lock.html
Some more Microsoft info here: http://support.microsoft.com/kb/323630
How big is the table? Do you have problem if you call the procedure from separate windows? Maybe the problem is related to the amount of data the procedure is working with and lack of indexes.