When using LOAD DATA INFILE, is there a way to either flag a duplicate row, or dump any/all duplicates into a separate table?
From the LOAD DATE INFILE documentation:
The REPLACE and IGNORE keywords control handling of input rows that duplicate existing rows on unique key values:
If you specify REPLACE, input rows replace existing rows. In other words, rows that have the same value for a primary key or unique index as an existing row. See Section 12.2.7, “REPLACE Syntax”.
If you specify IGNORE, input rows that duplicate an existing row on a unique key value are skipped. If you do not specify either option, the behavior depends on whether the LOCAL keyword is specified. Without LOCAL, an error occurs when a duplicate key value is found, and the rest of the text file is ignored. With LOCAL, the default behavior is the same as if IGNORE is specified; this is because the server has no way to stop transmission of the file in the middle of the operation.
Effectively, there's no way to redirect the duplicate records to a different table. You'd have to load them all in, and then create another table to hold the non-duplicated records.
It looks as if there actually is something you can do when it comes to duplicate rows for LOAD DATA calls. However, the approach that I've found isn't perfect: it acts more as a log for all deletes on a table, instead of just for LOAD DATA calls. Here's my approach:
Table test:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
text VARCHAR(255) DEFAULT NULL
);
Table test_log:
CREATE TABLE test_log (
id INTEGER, -- not primary key, we want to accept duplicate rows
text VARCHAR(255) DEFAULT NULL,
time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Trigger del_chk:
delimiter //
drop trigger if exists del_chk;
CREATE TRIGGER del_chk AFTER DELETE ON test
FOR EACH ROW
BEGIN
INSERT INTO test_log(id,text) values(OLD.id,OLD.text);
END;//
delimiter ;
Test import (/home/user/test.csv):
1,asdf
2,jkl
3,qwer
1,tyui
1,zxcv
2,bnm
Query:
LOAD DATA INFILE '/home/ken/test.csv'
REPLACE INTO TABLE test
FIELDS
TERMINATED BY ','
LINES
TERMINATED BY '\n' (id,text);
Running the above query will result in 1,asdf, 1,tyui, and 2,jkl being added to the log table. Based on a timestamp, it could be possible to associate the rows with a particular LOAD DATA statement.
Related
I searched but only could found partial answer to this question
The goal would be here to create a new ID column on an existing table.
This new column would be the primary key for the table and I simply want it to be filled with integer values from 1 to number of rows.
What would be the query for that?
I know I have to first alter table to create the new column :
ALTER TABLE <MYTABLE> ADD (ID INTEGER);
Then I could use the series generator :
INSERT INTO <MYTABLE.ID> SELECT SERIES_GENERATE_INTEGER(1,1,(number of rows));
Once the column is filled I could use this line:
ALTER TABLE <MYTABLE> ADD PRIMARY KEY ("ID");
I am sure there is an easier way to do this
You wrote that you want to add a "new ID column to an existing table" and fill it with unique values.
That's not a "standard" operation in any DBMS, as the usual assumption is that records are created with a primary key and not retro fitted.
Thus, "ease" of operation for this is relative to what else you want to do.
For example, if you want to continue using this ID as a primary key for further operations, then using a once-off generator function like the SERIES_GENERATE_INTEGER or a query won't be very helpful since you have to avoid duplicates of already existing values.
Two, relatively easy, options come to mind:
Using a sequence:
create sequence myid;
update <table> set ID = myid.nextval;
And for succeeding inserts:
insert into <table> (id, ..., ...) VALUES (myid.nextval, ..., ...) ;
Note that this generates a value for every existing record and not a predefined set of size X.
Using a GUID
By using a GUID you generate a unique value every time you call the 'SYSUUID' function in SAP HANA. check docu here
Something like
update <table> set ID = SYSUUID;
should do the trick here.
Subsequent inserts would simply call the function for values of ID.
I have a 30 GB tab separated text file which has more than 100 million rows, when I want to import this text file to a PostgreSQL table using \copy command, some rows cause error. how can I ignore those rows and also take a record of the ignored rows while importing to postgresql?
I connect to my machine by SSH so I can not use pgadmin!
it's very hard to edit the text file before importing because so many different rows have different problems. if there exists a way to check the rows one by one before importing and then run the \copy command for individual rows, it would be helpful.
Below is the code which generates the table:
CREATE TABLE Papers(
Paper_ID CHARACTER(8) PRIMARY KEY,
Original_paper_title TEXT,
Normalized_paper_title TEXT,
Paper_publish_year INTEGER,
Paper_publish_date DATE,
Paper_Document_Object_Identifier TEXT,
Original_venue_name TEXT,
Normalized_venue_name TEXT,
Journal_ID_mapped_to_venue_name CHARACTER(8),
Conference_ID_mapped_to_venue_name CHARACTER(8),
Paper_rank BIGINT,
FOREIGN KEY(Journal_ID_mapped_to_venue_name) REFERENCES Journals(Journal_ID),
FOREIGN KEY(Conference_ID_mapped_to_venue_name) REFERENCES Conferences(Conference_ID));
Don't load directly to your destination table but to a single column staging table.
create table Papers_stg (rec text);
Once you have all the data loaded you can the do verifications on the data using SQL.
Find records with wrong number of fields:
select rec
from Papers_stg
where cardinality(string_to_array(rec,' ')) <> 11
Create a table with all text fields
create table Papers_fields_text
as
select fields[1] as Paper_ID
,fields[2] as Original_paper_title
,fields[3] as Normalized_paper_title
,fields[4] as Paper_publish_year
,fields[5] as Paper_publish_date
,fields[6] as Paper_Document_Object_Identifier
,fields[7] as Original_venue_name
,fields[8] as Normalized_venue_name
,fields[9] as Journal_ID_mapped_to_venue_name
,fields[10] as Conference_ID_mapped_to_venue_name
,fields[11] as Paper_rank
from (select string_to_array(rec,' ') as fields
from Papers_stg
) t
where cardinality(fields) = 11
For fields conversion checks you might want to use the concept described here
Your only option is to use row-by-row processing. Write shell script (for example) that will loop thru input file and send each row to "copy" then check execution result, then write failed rows to some "err_input.txt".
More complicated logic can increase processing speed. Using "portions" instead of row-by-row and use row-by-row logic on failed segments.
Consider using pgloader
Check BATCHES AND RETRY BEHAVIOUR
You could use an BEFORE INSERT - trigger and check your criteria. If the record fails the check, write a log (or an entry into a separate table) and return null. You could even correct some values, if possible and feasible.
Of course, if checking criteria requires other queries (like finding duplicate keys etc.), you might get a performance issue. But I'm not sure which kind of "different problems in different rows" you mean...
Confer also an answer on StackExchange Database Administrators, and the following example taken from Bartosz Dmytrak at PostgreSQL forum:
CREATE OR REPLACE FUNCTION "myschema"."checkTriggerFunction" ()
RETURNS TRIGGER
AS
$BODY$
BEGIN
IF EXISTS (SELECT 1 FROM "myschema".mytable WHERE "MyKey" = NEW."MyKey")
THEN
RETURN NULL;
ELSE
RETURN NEW;
END IF;
END;
$BODY$
LANGUAGE plpgsql;
and trigger:
CREATE TRIGGER "checkTrigger"
BEFORE INSERT
ON "myschema".mytable
FOR EACH ROW
EXECUTE PROCEDURE "myschema"."checkTriggerFunction"();
Application has different versions. Each version has it's own set of values in each table. I need to provide functionality to copy data from one version to another. Problem :
By inserting data I am trying to insert Ids which has already been in use in this table. So, I need to change ids of components which I want to insert but I must save relationship between those components. How cat I do that?
Create a master table which has a surrogate key as your primary key. A numeric value of type NUMBER(9) works well. You can create a sequence and trigger to automatically insert this.
The rest of the table is the column of your current table plus a column to indicate which version the row is for.
For simplicity you may wish to create views on top of the table along the lines of
select * from master_table where version_id = ####;
To copy the data from one version to another this will work:
Insert into master_table seq_master_table.nextval, new version_id,.....
from master_table
where version_id = ####;
In pgsql, is there a way to have a table of several values, and choose one of them (say, other_id), find out what its highest value is and make every new entry that is put in the table increment from that value.
I suppose this was just too easy to have had a chance of working..
ALTER TABLE address ALTER COLUMN new_id TYPE SERIAL
____________________________________
ERROR: type "serial" does not exist
Thanks much for any insight!
Look into postgresql documentation of datatype serial. Serial is only short hand.
CREATE TABLE tablename (
colname SERIAL
);
is equivalent to specifying:
CREATE SEQUENCE tablename_colname_seq;
CREATE TABLE tablename (
colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
);
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
This happened because you may use the serial data type only when you are creating a new table or adding a new column to a table. If you'll try to ALTER an existing table using this data type you'll get an error. Because serial is not a true data type, but merely an abbreviation or alias for a longer query.
In case you would like to achieve the same effect, as you are expecting from using serial data type when you are altering existing table you may do this:
CREATE SEQUENCE my_serial AS integer START 1 OWNED BY address.new_id;
ALTER TABLE address ALTER COLUMN new_id SET DEFAULT nextval('my_serial');
The first line of the query creates your own sequence called my_serial. The
OWNED BY statement connects the newly created sequence with the exact column of your table. In your case the table is address and the column is new_id.
The START statement defines what value this sequence should start from.
The second line alters your table with the new default value, which will be determined by the previously created sequence.
It will give you the same result as you were expecting from using serial.
A quick glance at the docs tells you that
The data types smallserial, serial and bigserial are not true types
but merely a notational convenience for creating unique identifier columns
If you want to make an existing (integer) column to work as a "serial", just create the sequence by hand (the name is arbitrary), set its current value to the maximum (or bigger) of your current address.new_id value, at set it as default value for your address.new_id column.
To set the value of your sequence see here.
SELECT setval('address_new_id_seq', 10000);
This is just an example, use your own sequence name (arbitrary, you create it), and a number greater than the maximum current value of your column.
Update: as pointed out by Lucas' answer (which should be the acccepted one) you should also specify to which column the sequence "belongs to" by using CREATE/ALTER SEQUENCE ... OWNED BY ...
I'm inserting bulk records using COPY statement in PostgreSQL. What I realize is, the sequence IDs are not getting updated and when I try to insert a record later, it throws duplicate sequence ID. Should I manually update the sequence number to get the number of records after performing COPY? Isn't there a solution while performing COPY, just increment the sequence variable, that is, the primary key field of the table? Please clarify me on this. Thanks in advance!
For instance, if I insert 200 records, COPY does good and my table shows all the records. When I manually insert a record later, it says duplicate sequence ID error. It very well implies that it didn’t increment the sequence ids during COPYing as work fine during normal INSERTing. Instead of instructing the sequence id to set the max number of records, won’t there be any mechanism to educate the COPY command to increment the sequence IDs during its bulk COPYing option?
You ask:
Should I manually update the sequence number to get the number of records after performing COPY?
Yes, you should, as documented here:
Update the sequence value after a COPY FROM:
| BEGIN;
| COPY distributors FROM 'input_file';
| SELECT setval('serial', max(id)) FROM distributors;
| END;
You write:
it didn’t increment the sequence ids during COPYing as work fine during normal INSERTing
But that is not so! :) When you perform a normal INSERT, typically you do not specify an explicit value for the SEQUENCE-backed primary key. If you did, you would run in to the same problems as you are having now:
postgres=> create table uh_oh (id serial not null primary key, data char(1));
NOTICE: CREATE TABLE will create implicit sequence "uh_oh_id_seq" for serial column "uh_oh.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "uh_oh_pkey" for table "uh_oh"
CREATE TABLE
postgres=> insert into uh_oh (id, data) values (1, 'x');
INSERT 0 1
postgres=> insert into uh_oh (data) values ('a');
ERROR: duplicate key value violates unique constraint "uh_oh_pkey"
DETAIL: Key (id)=(1) already exists.
Your COPY command, of course, is supplying an explicit id value, just like the example INSERT above.
I realize that this is a bit old but maybe someone might still be looking for the answer.
As other said COPY works in a similar way as INSERT, so for inserting into a table that has a sequence, you simply don't mention the sequence field at all and it is taken care of for you. For COPY it works in the same exact way. But doesn't it COPY require ALL fields in the table to be present in the text file? The correct answer is NO, it doesn't, but it is the default behavior.
To COPY and leave the sequence out do the following:
COPY $YOURSCHEMA.$YOURTABLE(col1,col2,col3,col4) FROM '$your_input_file' DELIMITER ',' CSV HEADER;
No need to manually update the schema afterwards, it works as intended and in my testing is just about as fast.
You could copy to a sister table, then insert into mytable select * from sister - that would increment the sequence.
If your loaded data has the id field, don't select it for the insert: insert into mytable (col1, col2, col3) select col1, col2, col3 from sister