Replace value by fkey after moving data to related table - sql

A numeric column should be extended to hold multiple values, i.e. reference some different entity. SQL only (Postgres specifically if no standard solution available).
Schema now:
Table X with columns ID, VAL, STUFF
Table Y with columns ID, VAL1, VAL2
What I want to achieve:
Table X with columns ID, YID, STUFF
Table Y won't be altered (and its existing data won't be touched)
Table Y gets inserts for all rows of table X where X.VAL should be inserted as Y.VAL1. Y.ID auto-incremented, Y.VAL2 may remain NULL. Table X should then be updated to hold Y's ID as foreign key X.YID instead of the actual value X.VAL that is now stored in Y.VAL1.
Somehow I think it has to be possible to achieve that with a clean SQL-only solution. What I've found so far:
write a PL/pgSQL script: loop over table X, insert each row into table Y while returning the generated ID, and update table X with it
plain SQL: get the number of entries in table Y, INSERT INTO Y with SELECT FROM X ... ORDER BY ID, INSERT INTO X with SELECT FROM Y ... skipping the number of entries that have been there before so the order should remain stable. I really don't like that solution. Sounds dirty to me.
Any suggestions? Or is it better to go with PL/pgSQL?
TIA

There is a third option: a single SQL statement. Postgres allows DML within a CTE. So create a CTE that performs the insert and returns the generated id. Use the returned id in the main query which updates the original table. This then does what you are looking for in a single SQL statement.
with cte as
( insert into y(val1)
select val
from x
returning y.id, y.val1
)
update x
set val = cte.id
from cte
where x.val = cte.val1;
Then assuming you want to maintain referential integrity:
alter table x
add constraint x2y_fk
foreign key (val)
references y(id) ;
See Demo: Note: The demo copies both val and stuff from table x into table y. This was strictly for demonstration purposes and is not necessary.
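
The statement above pairs rows through x.val = cte.val1, which is only unambiguous when x.val is unique. As a minimal sketch of the row-by-row alternative mentioned in the question, the following Python uses the standard-library sqlite3 module as a stand-in for Postgres (an assumption, made only so the example runs anywhere). The yid column and the migrate_values helper are illustrative names, not part of the original schema.

```python
import sqlite3

def migrate_values(conn):
    """Copy every x.val into y.val1 and store the new y.id in x.yid."""
    rows = conn.execute("SELECT id, val FROM x ORDER BY id").fetchall()
    for x_id, val in rows:
        cur = conn.execute("INSERT INTO y (val1) VALUES (?)", (val,))
        # lastrowid is the id generated for the row just inserted
        conn.execute("UPDATE x SET yid = ? WHERE id = ?", (cur.lastrowid, x_id))
    conn.commit()

# Hypothetical sample data; y already holds one row that must stay untouched.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE y (id INTEGER PRIMARY KEY AUTOINCREMENT, val1 INTEGER, val2 INTEGER);
    CREATE TABLE x (id INTEGER PRIMARY KEY, val INTEGER, stuff TEXT,
                    yid INTEGER REFERENCES y(id));
    INSERT INTO y (val1, val2) VALUES (99, 1);
    INSERT INTO x (id, val, stuff) VALUES (1, 10, 'a'), (2, 10, 'b'), (3, 30, 'c');
""")
migrate_values(conn)
mapping = conn.execute(
    "SELECT x.id, y.val1 FROM x JOIN y ON y.id = x.yid ORDER BY x.id").fetchall()
```

Because each update is keyed by x.id rather than by the value, two rows of x with the same val each get their own row in y.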

Related

Can I keep old keys linked to new keys when making a copy in SQL?

I am trying to copy a record in a table and change a few values with a stored procedure in SQL Server 2005. This is simple, but I also need to copy relationships in other tables with the new primary keys. As this proc is being used to batch copy records, I've found it difficult to store some relationship between old keys and new keys.
Right now, I am grabbing new keys from the batch insert using OUTPUT INTO.
ex:
INSERT INTO table
(column1, column2,...)
OUTPUT INSERTED.PrimaryKey INTO @TableVariable
SELECT column1, column2,...
Is there a way like this to easily get the old keys inserted at the same time I am inserting new keys (to ensure I have paired up the proper corresponding keys)?
I know cursors are an option, but I have never used them and have only heard them referenced in a horror story fashion. I'd much prefer to use OUTPUT INTO, or something like it.
If you need to track both old and new keys in your temp table, you need to cheat and use MERGE:
Data setup:
create table T (
ID int IDENTITY(5,7) not null,
Col1 varchar(10) not null
);
go
insert into T (Col1) values ('abc'),('def');
And the replacement for your INSERT statement:
declare @TV table (
Old_ID int not null,
New_ID int not null
);
merge into T t1
using (select ID,Col1 from T) t2
on 1 = 0
when not matched then insert (Col1) values (t2.Col1)
output t2.ID,inserted.ID into @TV;
And (actually needs to be in the same batch so that you can access the table variable):
select * from T;
select * from @TV;
Produces:
ID Col1
5 abc
12 def
19 abc
26 def
Old_ID New_ID
5 19
12 26
The reason you have to do this is because of an irritating limitation on the OUTPUT clause when used with INSERT - you can only access the inserted table, not any of the tables that might be part of a SELECT.
Related - More explanation of the MERGE abuse
INSERT statements loading data into tables with an IDENTITY column are guaranteed to generate the values in the same order as the ORDER BY clause in the SELECT.
If you want the IDENTITY values to be assigned in a sequential fashion
that follows the ordering in the ORDER BY clause, create a table that
contains a column with the IDENTITY property and then run an INSERT ..
SELECT … ORDER BY query to populate this table.
From: The behavior of the IDENTITY function when used with SELECT INTO or INSERT .. SELECT queries that contain an ORDER BY clause
You can use this fact to match your old with your new identity values. First collect the list of primary keys that you intend to copy into a temporary table. You can also include your modified column values as well if needed:
select
PrimaryKey,
Col1
--Col2... etc
into #NewRecords
from Table
--where whatever...
Then do your INSERT with the OUTPUT clause to capture your new ids into the table variable:
declare @NewIds table (
New_ID int not null
);
INSERT INTO Table
(Col1 /*,Col2... etc.*/)
OUTPUT INSERTED.PrimaryKey INTO @NewIds
SELECT Col1 /*,Col2... etc.*/
from #NewRecords
order by PrimaryKey
Because of the ORDER BY PrimaryKey clause, the New_ID values are generated in the same order as the PrimaryKey values of the copied records. Now you can match them up by row numbers ordered by the ID values. The following query would give you the pairings:
select PrimaryKey, New_ID
from
(select PrimaryKey,
ROW_NUMBER() over (order by PrimaryKey) OldRow
from #NewRecords
) PrimaryKeys
join
(
select New_ID,
ROW_NUMBER() over (order by New_ID) NewRow
from @NewIds
) New_IDs
on OldRow = NewRow
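
The pairing step can also be sketched outside the database. This minimal Python example assumes, as the answer does, that the new identity values were assigned in the same order as the old primary keys; pair_keys is a hypothetical helper name, and the sample keys are the ones from the MERGE example earlier in the thread.

```python
def pair_keys(old_ids, new_ids):
    """Pair the n-th smallest old key with the n-th smallest new key."""
    if len(old_ids) != len(new_ids):
        raise ValueError("expected exactly one new key per copied row")
    # Sorting both sides plays the role of the two ROW_NUMBER() windows.
    return list(zip(sorted(old_ids), sorted(new_ids)))

pairs = pair_keys([12, 5], [26, 19])
```

If the ordering assumption does not hold, the pairs will silently be wrong, so this is only as safe as the ORDER BY guarantee it relies on.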

INSERT new row if value does not exist and get id either way

I would like to insert a record into a table and if the record is already present get its id, otherwise run the insert and get the new record's id.
I will be inserting millions of records and have no idea how to do this in an efficient manner. What I am doing now is to run a select to check if the record is already present, and if not, insert it and get the inserted record's id. As the table is growing I imagine that SELECT is going to kill me.
What I am doing now in python with psycopg2 looks like this:
select = ("SELECT id FROM ... WHERE ...", [...])
cur.execute(*select)
if not cur.rowcount:
    insert = ("INSERT INTO ... VALUES ... RETURNING id", [...])
    cur.execute(*insert)
rid = cur.fetchone()[0]
Is it maybe possible to do something in a stored procedure like this:
BEGIN
EXECUTE sql_insert;
RETURN id;
EXCEPTION WHEN unique_violation THEN
-- return id of already existing record
-- from the exception info ?
END;
Any ideas of how optimize a case like this?
First off, this is obviously not an UPSERT as UPDATE was never mentioned. Similar concurrency issues apply, though.
There will always be a race condition for this kind of task, but you can minimize it to an extremely tiny time slot, while at the same time querying for the ID only once with a data-modifying CTE (introduced with PostgreSQL 9.1):
Given a table tbl:
CREATE TABLE tbl(tbl_id serial PRIMARY KEY, some_col text UNIQUE);
Use this query:
WITH x AS (SELECT 'baz'::text AS some_col) -- enter value(s) once
, y AS (
SELECT x.some_col
, (SELECT t.tbl_id FROM tbl t WHERE t.some_col = x.some_col) AS tbl_id
FROM x
)
, z AS (
INSERT INTO tbl(some_col)
SELECT y.some_col
FROM y
WHERE y.tbl_id IS NULL
RETURNING tbl_id
)
SELECT COALESCE(
(SELECT tbl_id FROM z)
,(SELECT tbl_id FROM y)
);
CTE x is only for convenience: enter values once.
CTE y retrieves tbl_id - if it already exists.
CTE z inserts the new row - if it doesn't.
The final SELECT avoids running another query on the table with the COALESCE construct.
Now, this can still fail if a concurrent transaction commits a new row with some_col = 'baz' exactly between CTE y and z, but that's extremely unlikely. If it happens you get a duplicate key violation and have to retry. Nothing lost. If you don't face concurrent writes, you can just forget about this.
You can put this into a plpgsql function and rerun the query on duplicate key error automatically.
Goes without saying that you need two indexes in this setup (like displayed in my CREATE TABLE statement above):
a UNIQUE or PRIMARY KEY constraint on tbl_id (which is of serial type!)
another UNIQUE or PRIMARY KEY constraint on some_col
Both implement an index automatically.
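
The retry-on-duplicate idea can be sketched client-side as well. The example below is a simplified variant, not the CTE query above: it selects first, inserts if nothing was found, and on a unique violation reads the row that the other writer created. It uses the standard-library sqlite3 module as a stand-in for Postgres (an assumption for portability), and get_or_create is a hypothetical helper name.

```python
import sqlite3

def get_or_create(conn, value):
    """Return tbl_id for value, inserting the row if it is missing."""
    select_sql = "SELECT tbl_id FROM tbl WHERE some_col = ?"
    row = conn.execute(select_sql, (value,)).fetchone()
    if row is not None:
        return row[0]
    try:
        cur = conn.execute("INSERT INTO tbl (some_col) VALUES (?)", (value,))
        conn.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Another writer inserted the same value first: read its id instead.
        conn.rollback()
        return conn.execute(select_sql, (value,)).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (tbl_id INTEGER PRIMARY KEY AUTOINCREMENT, some_col TEXT UNIQUE)")
first = get_or_create(conn, "baz")
second = get_or_create(conn, "baz")   # already present, no second insert
other = get_or_create(conn, "foo")
```

The UNIQUE constraint on some_col is what makes the fallback branch safe: without it, two writers could both insert the same value.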

In PostgreSQL, how to INSERT INTO and use destination table's autoincrement column

Suppose you have two tables in PostgreSQL. Table A has field x, which is of type character varying and has a lot of duplicates. Table B has fields y, z, and w. y is a serial column, z has the same type as x, and w is an integer.
If I issue this query:
INSERT INTO B
SELECT DISTINCT ______, A.x, COUNT(A.x)
FROM A
WHERE x IS NOT NULL
GROUP BY x;
I get an error regardless of what I have in ______. I've even gotten as exotic as CAST(NULL as INTEGER), but that just gives me this error:
a null value in column "id" violates not-null constraint
Is there a simple solution?
You are allowed and even encouraged to specify your columns when using INSERT (and you really should always specify the columns):
insert into b (z, w)
select x, count(x)
from a
where x is not null
group by x
And I don't see the point of distinct when you're already grouping by x so I dropped that; I also dropped the column prefixes since they aren't needed and just add noise to the SQL.
If you don't specify a column when using INSERT, you get the default value and that will give you the sequence value that you're looking for.
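
A small runnable sketch of the same statement, using Python's standard-library sqlite3 module instead of PostgreSQL (an assumption so the example is self-contained; AUTOINCREMENT stands in for the serial column, and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x TEXT);
    CREATE TABLE b (y INTEGER PRIMARY KEY AUTOINCREMENT, z TEXT, w INTEGER);
    INSERT INTO a (x) VALUES ('p'), ('q'), ('p'), (NULL), ('p');
    -- y is not listed, so it receives its generated default
    INSERT INTO b (z, w)
    SELECT x, COUNT(x) FROM a WHERE x IS NOT NULL GROUP BY x;
""")
rows = conn.execute("SELECT y, z, w FROM b ORDER BY z").fetchall()
generated_ids = sorted(r[0] for r in rows)
counts = [(r[1], r[2]) for r in rows]
```

Every row in b ends up with a generated y even though the INSERT never mentions that column.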

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records in a new table to analyze them and I'm thinking in creating a new primary key with the values from all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3*31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as PK, for what I need to do is to compare them with data from other DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more of background is needed. I'm sorry for not providing it before.
I have a database ( dm ) that is being updated everyday from another db ( original source ) . It has records from the past two years.
Last month ( july ) the update process got broken and for a month there was no data being updated into the dm.
I manually create a table with the same structure in my Oracle XE, and I copy the records from the original source into my db ( myxe ) I copied only records from July to create a report needed by the end of the month.
Finally on aug 8 the update process got fixed and the records which have been waiting to be migrated by this automatic process got copied into the database ( from originalsource to dm ).
This process does clean up from the original source the data once it is copied ( into dm ).
Everything looked fine, but we have just realized that a number of the records got lost ( about 25% of july )
So, what I want to do is to use my backup ( myxe ) and insert into the database ( dm ) all those records missing.
The problem here are:
They don't have a well defined PK.
They are in separate databases.
So I thought that If I could create a unique pk from both tables which gave the same number I could tell which were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a , the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
Which returns all the rows that are present in both databases ( the .. union? ) I've got 2,000 records.
What I'm going to do next is delete them all from the target db and then just insert them all from my db into the target table
I hope I don't get into something worse : - S : -S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
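
To make the collision risk concrete, here is a minimal Python sketch of the formula from the question. The function name and the sample rows are illustrative only.

```python
def naive_hash(col1, col2, col3):
    # The formula from the question: every column gets the same weight,
    # so the result depends only on the sum of the three values.
    return col1 * 31 + col2 * 31 + col3 * 31

row_a = (1, 2, 3)
row_b = (3, 2, 1)   # a different row with the same column sum
collision = naive_hash(*row_a) == naive_hash(*row_b)
```

Two distinct rows produce the same value here, so this hash could not serve as a primary key.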
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT;
UPDATE mytable
SET pk_col = rownum;
ALTER TABLE mytable MODIFY pk_col INT NOT NULL;
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col);
or this:
ALTER TABLE mytable ADD pk_col RAW(16);
UPDATE mytable
SET pk_col = SYS_GUID();
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL;
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col);
The latter uses GUIDs, which are unique across databases but consume more space and are much slower to generate (your INSERTs will be slow)
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO mytable v
USING (
SELECT rowid AS rid,
ROW_NUMBER() OVER (ORDER BY col1, col2, col3) AS rn
FROM mytable
)
ON (v.rowid = rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = rn
Note that tables should be identical up to a single row (i. e. have same number of rows with same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in mytable@dm
Note that it will shrink all duplicates if any.
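
The set-difference behaviour, including the collapsing of duplicates, can be tried locally. This sketch uses Python's standard-library sqlite3 module, where the standard SQL operator EXCEPT plays the role of Oracle's MINUS; the table names backup_rows and target_rows and the sample rows are hypothetical, while the column names are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE backup_rows (idle INTEGER, activity INTEGER, finishdate TEXT);
    CREATE TABLE target_rows (idle INTEGER, activity INTEGER, finishdate TEXT);
    INSERT INTO backup_rows VALUES (1, 10, 'day-1'), (2, 20, 'day-2'),
                                   (2, 20, 'day-2'), (3, 30, 'day-3');
    INSERT INTO target_rows VALUES (1, 10, 'day-1');
""")
# Rows present in the backup but absent from the target.
missing = conn.execute("""
    SELECT * FROM backup_rows
    EXCEPT
    SELECT * FROM target_rows
    ORDER BY 1
""").fetchall()
```

The row (2, 20, 'day-2') exists twice in the backup but appears only once in the result, which is the duplicate shrinking the answer warns about.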
Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns as are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data, could you use that?

Row number in Sybase tables

Sybase db tables do not have a concept of self-updating row numbers. However, for one of the modules, I require a row number for each row in the database such that max(Column) would always tell me the number of rows in the table.
I thought I'll introduce an int column and keep updating this column to keep track of the row number. However I'm having problems in updating this column in case of deletes. What sql should I use in delete trigger to update this column?
You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence. (These are called "identity gaps", the best discussion on them is here). Also deletes will cause gaps in the sequence as you've identified.
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks, and many performance issues. Imagine you have 1 million rows in your table, and you delete row 1, that's 999999 rows you now have to update to subtract 1 from the id.
Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
update myTable
set id = id - (select count(*) from deleted d where d.id < t.id)
from myTable t
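
The arithmetic the trigger performs can be checked with a few lines of Python. This is only a model of the update expression (each surviving id drops by the number of deleted ids below it); renumber_after_delete is a hypothetical name and the ids are sample data.

```python
def renumber_after_delete(ids, deleted_ids):
    """Shift each surviving id down by the number of deleted ids below it."""
    deleted = set(deleted_ids)
    survivors = [i for i in ids if i not in deleted]
    # Mirrors: id - (select count(*) from deleted d where d.id < t.id)
    return [i - sum(1 for d in deleted if d < i) for i in survivors]

new_ids = renumber_after_delete([1, 2, 3, 4, 5, 6], [2, 5])
```

Every surviving row above the lowest deleted id gets rewritten, which is the cost the earlier answer warns about for large tables.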
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates by only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
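
As a rough sketch of the gap-filling idea, assuming row numbers are unique positive integers and do not need to reflect insertion order, the following Python computes which rows to renumber. fill_gaps is a hypothetical helper; a real job would turn each returned pair into one UPDATE.

```python
def fill_gaps(rownums):
    """Return {old_rownum: new_rownum} moves that make the numbering dense."""
    present = set(rownums)
    n = len(present)
    # Numbers in 1..n that no row currently uses.
    gaps = sorted(g for g in range(1, n + 1) if g not in present)
    # Rows numbered above n are exactly the ones that must move.
    movers = sorted((r for r in present if r > n), reverse=True)
    return dict(zip(movers, gaps))

moves = fill_gaps([1, 2, 4, 5, 7, 8])
```

Here only two rows are renumbered, whereas shifting everything down after each gap would touch four.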
I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an Identity column like below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic id (of sorts)