postgresql: \copy method - enter valid entries and discard exceptions

When entering the following command:
\copy mmcompany from '<path>/mmcompany.txt' delimiter ',' csv;
I get the following error:
ERROR: duplicate key value violates unique constraint "mmcompany_phonenumber_key"
I understand why it's happening, but how do I execute the command in a way that valid entries will be inserted and ones that create an error will be discarded?

The reason PostgreSQL doesn't do this is related to how it implements constraints and validation. When a constraint fails it causes a transaction abort. The transaction is in an unclean state and cannot be resumed.
It is possible to create a new subtransaction for each row, but this is very slow and defeats the purpose of using COPY in the first place, so it isn't supported by PostgreSQL's COPY at this time. You can do it yourself in PL/pgSQL with a BEGIN ... EXCEPTION block inside a LOOP over a SELECT from the data COPYed into a temporary table, as sketched below. This works fairly well but can be slow.
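A minimal sketch of that PL/pgSQL approach, assuming the file has first been \copy'd into a staging table named stagingtable and that mmcompany has hypothetical columns companyname and phonenumber (adjust the column list to the real table):
-- Per-row insert; each BEGIN ... EXCEPTION block is its own subtransaction.
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT companyname, phonenumber FROM stagingtable LOOP
        BEGIN
            INSERT INTO mmcompany (companyname, phonenumber)
            VALUES (r.companyname, r.phonenumber);
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- discard rows that violate the unique constraint
        END;
    END LOOP;
END;
$$;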
It's better, if possible, to use SQL to check the constraints before doing any insert that violates them. That way you can just:
CREATE TEMPORARY TABLE stagingtable(...);
\copy stagingtable FROM 'somefile.csv'
INSERT INTO realtable
SELECT * FROM stagingtable
WHERE check_constraints_here;
Do keep concurrency issues in mind, though. If you're trying to do a merge/upsert via COPY you must LOCK TABLE realtable; at the start of your transaction, or you will still have the potential for errors. It looks like that's what you're trying to do - a copy-if-not-exists. If so, skipping errors is absolutely the wrong approach (a rough sketch of the safe pattern follows the links below). See:
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
Insert, on duplicate update in PostgreSQL?
Postgresql - Clean way to insert records if they don't exist, update if they do
Can COPY be used with a function?
Postgresql csv importation that skips rows
... this is a much-discussed issue.
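Tying that together for the mmcompany load above, a rough copy-if-not-exists sketch might look like this (run from psql; it treats phonenumber as the only conflicting key, per the error message in the question, and the staging table name is illustrative):
BEGIN;
LOCK TABLE mmcompany;   -- defaults to ACCESS EXCLUSIVE, so no concurrent session can race the existence check

CREATE TEMPORARY TABLE mmcompany_staging (LIKE mmcompany) ON COMMIT DROP;
\copy mmcompany_staging FROM '<path>/mmcompany.txt' delimiter ',' csv

INSERT INTO mmcompany
SELECT s.*
FROM mmcompany_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM mmcompany m WHERE m.phonenumber = s.phonenumber
);
COMMIT;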

One way to handle the constraint violations is to define triggers on the target table to handle the errors. This is not ideal as there can still be race conditions (if concurrently loading), and triggers have pretty high overhead.
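For completeness, a sketch of such a trigger for the table in the original question (the function name and the phonenumber check are illustrative, and the race-condition caveat above still applies):
-- Silently drop inserts whose phonenumber already exists.
CREATE OR REPLACE FUNCTION skip_duplicate_company() RETURNS trigger AS $$
BEGIN
    IF EXISTS (SELECT 1 FROM mmcompany WHERE phonenumber = NEW.phonenumber) THEN
        RETURN NULL;  -- returning NULL from a BEFORE trigger discards the row
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mmcompany_skip_duplicates
BEFORE INSERT ON mmcompany
FOR EACH ROW EXECUTE PROCEDURE skip_duplicate_company();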
Another method: COPY into a staging table and load the data into the target table using SQL with some handling to skip existing entries.
Finally, another useful method is to use pgloader.

Related

Get all constraint errors when inserting data from another table

I have a staging table without any constraints in my Azure SQL database (Azure SQL database 12.0.2000.8). I want to insert the data from the Staging table into the "real" table on which multiple constraints are set. When inserting the data, I use a statement of the kind
INSERT INTO <someTable> SELECT <columns> FROM StagingTable;
Now I only get the first error when violating some constraints. However, for my use case, it is important to get all violations, so they can be resolved altogether.
I have tried using TRY...CATCH mechanisms; however, this throws an error on the first violation, runs the CATCH clause, and does not continue with the rest of the data. Note that the correct data that has no violations should not be inserted, so the whole INSERT statement can be rolled back on one error; however, I want to see all violations so I can correct them all without having to run the INSERT statement multiple times to collect all the errors.
EDIT:
The types of constraints that need to be checked are foreign key constraints, NOT NULL constraints, duplicate keys. No casting is done, so no need to check for conversions.
There are a couple of options:
If you want to catch row-level information, you have to go for a cursor or WHILE loop, try to insert each row inside a TRY ... CATCH block, see whether you get an error, and log it (a rough sketch follows the DBCC example below).
Alternatively, create another table similar to the main table (say, MainCheckTable) with all the constraints, then disable all of those constraints and load the data into it.
Now you can leverage DBCC CHECKCONSTRAINTS to see all the constraint violations:
USE DBName;
DBCC CHECKCONSTRAINTS(MainCheckTable) WITH ALL_CONSTRAINTS;
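A rough sketch of the row-by-row option mentioned above: probe each staging row in its own TRY ... CATCH and collect every failure, so you see all violations rather than just the first. Table and column names (StagingTable, SomeTable, Id, Col1, Col2) are placeholders for the real schema.
DECLARE @errors TABLE (StagingId INT, ErrorMessage NVARCHAR(4000));
DECLARE @id INT;

DECLARE row_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT Id FROM StagingTable;

OPEN row_cursor;
FETCH NEXT FROM row_cursor INTO @id;

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        INSERT INTO SomeTable (Col1, Col2)
        SELECT Col1, Col2 FROM StagingTable WHERE Id = @id;
        ROLLBACK TRAN;  -- we only want the error report, so undo the probe insert
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRAN;
        INSERT INTO @errors (StagingId, ErrorMessage) VALUES (@id, ERROR_MESSAGE());
    END CATCH;

    FETCH NEXT FROM row_cursor INTO @id;
END;

CLOSE row_cursor;
DEALLOCATE row_cursor;

SELECT * FROM @errors;  -- every violation in one result set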
First, don't look at your primary table(s); look at the related tables, e.g. lookups. Populate these first. Once you have populated the related tables (i.e. satisfied all the related constraints), then add the data.
You need to work backwards from the least constrained tables to the most constrained, if that makes sense.
You should check that your related tables already contain the reference values/fields that you intend to insert. This is easy to do, since you already have a staging table; for example:
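A hedged example of that pre-check (the lookup table and CountryCode column are hypothetical names):
-- Staging rows whose CountryCode has no matching row in the lookup table yet.
SELECT s.*
FROM StagingTable AS s
LEFT JOIN CountryLookup AS c
       ON c.CountryCode = s.CountryCode
WHERE c.CountryCode IS NULL;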

The "proper" way to atomically replace all contents in a PostgreSQL table?

In the project I have been recently working on, many (PostgreSQL) database tables are just used as big lookup arrays. We have several background worker services, which periodically pull the latest data from a server, then replace all contents of a table with the latest data. The replacing has to be atomic because we don't want a partially completed table to be seen by lookup-ers.
I thought the simplest way to do the replacing is something like this:
BEGIN;
DELETE FROM some_table;
COPY some_table FROM 'source file';
COMMIT;
But I found a lot of production code use this method instead:
BEGIN;
CREATE TABLE some_table_tmp (LIKE some_table);
COPY some_table_tmp FROM 'source file';
DROP TABLE some_table;
ALTER TABLE some_table_tmp RENAME TO some_table;
COMMIT;
(I omit some logic, such as changing the owner of a sequence, etc.)
I just can't see any advantage of this method, especially after some discoveries and experiments: SQL statements like ALTER TABLE and DROP TABLE acquire an ACCESS EXCLUSIVE lock, which even blocks a SELECT.
Can anyone explain what problem the latter SQL pattern is trying to solve? Or is it wrong, and should we avoid using it?

sql server, composite keys - ignoring duplicate

Is there a way to prevent SQL Server from throwing an error when I try to save a record that already exists? I've got a composite-key table for a many-to-many relationship that has only the two values. When I update a model from my application, it tries to save all records; the records that already exist throw a Cannot insert duplicate key error. Is there a way of having the database ignore these, or do I have to handle it in the application?
You are calling an INSERT and trying to add duplicate keys. This error is by design, and essential: the DB is throwing an exception for an exceptional and erroneous condition.
If you are, instead, trying to perform an "upsert" you may need to use a stored procedure or use the MERGE syntax.
If, instead, you don't want to UPDATE but just want to ignore rows that are already in the table, then you simply need to add an exclusion to your INSERT statement, such as:
....
WHERE
table.Key <> inserting.Key
Try something like this with your insert statement.
insert into foo (x,y)
select @x,@y
except
select x,y from foo
This will add a record to foo, ONLY if it is not already in the table.
You could try creating your index with the IGNORE_DUP_KEY option so that you only get a warning when you have duplicate keys rather than a true error.
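For reference, IGNORE_DUP_KEY is set when the key or unique index is created; a rough example with hypothetical names:
-- With IGNORE_DUP_KEY = ON, duplicate composite keys are discarded with a
-- "Duplicate key was ignored." warning instead of failing the statement.
ALTER TABLE dbo.ModelTag
ADD CONSTRAINT PK_ModelTag PRIMARY KEY (ModelId, TagId)
    WITH (IGNORE_DUP_KEY = ON);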
The other option, and possibly the better one, is to use the MERGE statement rather than INSERT. The MERGE statement lets you do inserts, updates and deletes all in one statement, and it sounds like it should work out well for what you are trying to do.
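A hedged sketch of what that MERGE could look like for a two-column link table (the table, columns, and variables are illustrative):
DECLARE @ModelId INT = 1, @TagId INT = 2;  -- example values

-- Insert the pair only when it is not already present; existing rows are left alone.
MERGE dbo.ModelTag AS target
USING (SELECT @ModelId AS ModelId, @TagId AS TagId) AS source
   ON target.ModelId = source.ModelId
  AND target.TagId   = source.TagId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ModelId, TagId) VALUES (source.ModelId, source.TagId);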
Last but not least, as you said, fix it in your app and only insert the rows that need to be added.

sql error - unique constraint

I have a data migration script like this:
Data_migration.sql
Its contents are:
insert into table1 select * from old_schema.table1;
commit;
insert into table2 select * from old_schema.table2;
commit;
And table1 has the pk_productname constraint. When I execute the script
SQL> # "data_migration.sql"
I get a unique constraint (pk_productname) violation, but when I execute the individual SQL statements I don't get any error. What is the reason behind this, and how can I resolve it?
The failure of the unique constraint means you are attempting to insert one or more records whose primary key columns collide.
If it happens when you run a script but not when you run the individual statements then there must be a bug in your script. Without seeing the script it is impossible for us to be sure what that bug is, but the most likely thing is you are somehow running the same statement twice.
Another possible cause is that the constraint is deferred. This means it is not enforced until the end of the transaction. So the INSERT statement would appear to succeed if you run it without issuing the subsequent COMMIT.
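For example, you can check from the data dictionary whether the constraint is deferrable, and with an INITIALLY DEFERRED constraint the failure only shows up at COMMIT:
-- Is the primary key deferrable, and is it currently deferred?
SELECT constraint_name, deferrable, deferred
FROM   user_constraints
WHERE  table_name = 'TABLE1';

-- If it is INITIALLY DEFERRED, the check is postponed:
INSERT INTO table1 SELECT * FROM old_schema.table1;  -- appears to succeed
COMMIT;                                              -- the unique violation is reported here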
It is common to run a data migration with constraints disabled and to re-enable them afterwards using an EXCEPTIONS table; this makes it easier to investigate problems.
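A rough outline of that approach in Oracle, assuming the standard EXCEPTIONS table created by the utlexcpt.sql script that ships with the database:
-- Create the standard EXCEPTIONS table once.
@?/rdbms/admin/utlexcpt.sql

ALTER TABLE table1 DISABLE CONSTRAINT pk_productname;

INSERT INTO table1 SELECT * FROM old_schema.table1;
COMMIT;

-- Re-enabling still fails while duplicates exist, but every offending ROWID
-- is written into EXCEPTIONS so the bad rows can be investigated.
ALTER TABLE table1 ENABLE VALIDATE CONSTRAINT pk_productname
    EXCEPTIONS INTO exceptions;

SELECT t.*
FROM   table1 t
WHERE  t.rowid IN (SELECT row_id FROM exceptions);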

SQL continue executing queries after duplicate key violation

I have a situation where I want to insert a row if it doesn't exist, and to not insert it if it already does. I tried creating sql queries that prevented this from happening (see here), but I was told a solution is to create constraints and catch the exception when they're violated.
I have constraints in place already. My question is - how can I catch the exception and continue executing more queries? If my code looks like this:
cur = transaction.cursor()
# execute some queries that succeed
try:
    cur.execute(fooquery, bardata)  # this query might fail, but that's OK
except psycopg2.IntegrityError:
    pass
cur.execute(fooquery2, bardata2)
Then I get an error on the second execute:
psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block
How can I tell the computer that I want it to keep executing queries? I don't want to transaction.commit(), because I might want to roll back the entire transaction (the queries that succeeded before).
I think what you could do is use a SAVEPOINT before trying to execute the statement which could cause the violation. If the violation happens, then you could rollback to the SAVEPOINT, but keep your original transaction.
Here's another thread which may be helpful:
Continuing a transaction after primary key violation error
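At the SQL level, the SAVEPOINT pattern looks roughly like this (with psycopg2 you would send each statement through cur.execute(), issuing the ROLLBACK TO SAVEPOINT inside the except block instead of pass; the foo table and its columns are illustrative):
BEGIN;
INSERT INTO foo (id, val) VALUES (1, 'first');       -- earlier work that must survive

SAVEPOINT before_risky_insert;
INSERT INTO foo (id, val) VALUES (1, 'duplicate');   -- raises a unique violation
ROLLBACK TO SAVEPOINT before_risky_insert;           -- undo only the failed statement

INSERT INTO foo (id, val) VALUES (2, 'second');      -- the transaction is usable again
COMMIT;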
I gave an up-vote to the SAVEPOINT answer--especially since it links to a question where my answer was accepted. ;)
However, given your statement in the comments section that you expect errors "more often than not," may I suggest another alternative?
This solution actually harkens back to your other question. The difference here is how to load the data very quickly into the right place and format in order to move it with a single INSERT ... SELECT - and it is generic for any table you want to populate (so the same code could be used for multiple different tables). Here's a rough layout of how I would do it in pure PostgreSQL, assuming I had a CSV file in the same format as the table to be inserted into:
CREATE TEMP TABLE input_file (LIKE target_table);
COPY input_file FROM '/path/to/file.csv' WITH CSV;
INSERT INTO target_table
SELECT * FROM input_file
WHERE (<unique key field list>) NOT IN (
SELECT <unique key field list>
FROM target_table
);
Okay, this is an idealized example and I'm also glossing over several things (like reporting back the duplicates, pushing the data into the table via Python in-memory data, COPY from STDIN rather than via a file, etc.), but hopefully the basic idea is there, and it's going to avoid much of the overhead if you expect more records to be rejected than accepted.
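As for the "reporting back the duplicates" part glossed over above, one option - assuming, purely for illustration, that the unique key is a single column named id - is to join the staging data back against the target before the INSERT:
-- Input rows that already exist in the target and would therefore be skipped.
SELECT i.*
FROM input_file i
JOIN target_table t USING (id);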