best temp table strategy for update/insert operation - sql

I'm given a list of transaction records from a remote server, some of which already exist in our database and some of which are new. My task is to update the ones that already exist and insert the ones that don't. Assume the transactions have remote IDs that aren't dependent on my local database. The size of the list can be anywhere from 1 to ~500.
Database is postgresql.
My initial thought was something like this:
BEGIN
CREATE TEMP TABLE temp_transactions (LIKE transactions) ON COMMIT DROP;
INSERT INTO temp_transactions(...) VALUES (...);
WITH updated_transactions AS (...update statement...)
DELETE FROM temp_transactions USING updated_transactions
WHERE temp_transactions.external_id = updated_transactions.external_id;
INSERT INTO transactions SELECT ... FROM temp_transactions;
COMMIT;
In other words:
Create a temp table that exists only for the life of the transaction.
Dump all my records into the temp table.
Do all the updates in a single statement that also deletes the updated records from the temp table.
Insert anything remaining in the temp table into the permanent table, because it wasn't an update; concretely, something like the sketch below.
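A minimal sketch of steps 3 and 4, using made-up column names (external_id, amount, status) in place of the lists elided above:

WITH updated_transactions AS (
    UPDATE transactions t
    SET    amount = tt.amount,
           status = tt.status
    FROM   temp_transactions tt
    WHERE  t.external_id = tt.external_id
    RETURNING t.external_id
)
DELETE FROM temp_transactions
USING updated_transactions u
WHERE temp_transactions.external_id = u.external_id;

INSERT INTO transactions (external_id, amount, status)
SELECT external_id, amount, status
FROM   temp_transactions;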
But then I began to wonder whether it might be more efficient to use a per-session temp table and not wrap all the operations in a single transaction. My database sessions are only ever going to be used by a single thread, so this should be possible:
CREATE TEMP TABLE IF NOT EXISTS temp_transactions (LIKE transactions);
INSERT INTO temp_transactions(...) VALUES (...);
WITH updated_transactions AS (...update statement...)
DELETE FROM temp_transactions USING updated_transactions
WHERE temp_transactions.external_id = updated_transactions.external_id;
INSERT INTO transactions SELECT ... FROM temp_transactions;
TRUNCATE temp_transactions;
My thinking:
This avoids having to create the temp table each time a new batch of records is received. Instead, if a batch has already been processed using this database session (which is likely) the table will already exist.
This saves rollback space since I'm not stringing together multiple operations within a single transaction. It isn't a requirement that the entire update/insert operation be atomic; the only reason I was using a transaction is so the temp table would be automatically dropped upon commit.
Is the latter method likely to be superior to the former? Does either method have any special "gotchas" I should be aware of?

What you're describing is commonly known as upsert. Even the official documentation mentions it, here: http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE
The biggest problem with upserts is concurrency, as described here: http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/ and here: http://johtopg.blogspot.com.br/2014/04/upsertisms-in-postgres.html
I think your approach is good, although I wouldn't use a temporary table at all; instead, put the new rows into a VALUES CTE and make the whole thing a single statement.
Like this:
CREATE TABLE test (id int, data int);
CREATE TABLE
WITH new_data (id, data) AS (
    VALUES (1, 2), (2, 6), (3, 10)
),
updated AS (
    UPDATE test t
    SET    data = v.data
    FROM   new_data v
    WHERE  v.id = t.id
    RETURNING t.id
)
INSERT INTO test
SELECT *
FROM   new_data v
WHERE  NOT EXISTS (
    SELECT 1
    FROM   updated u
    WHERE  u.id = v.id
);
INSERT 0 3
SELECT * FROM test;
 id | data
----+------
  1 |    2
  2 |    6
  3 |   10
(3 rows)
WITH new_data (id, data) AS (
    VALUES (1, 20), (2, 60), (4, 111)
),
updated AS (
    UPDATE test t
    SET    data = v.data
    FROM   new_data v
    WHERE  v.id = t.id
    RETURNING t.id
)
INSERT INTO test
SELECT *
FROM   new_data v
WHERE  NOT EXISTS (
    SELECT 1
    FROM   updated u
    WHERE  u.id = v.id
);
INSERT 0 1
SELECT * FROM test;
 id | data
----+------
  3 |   10
  1 |   20
  2 |   60
  4 |  111
(4 rows)
PG 9.5+ will support concurrent upserts out of the box, with the INSERT ... ON CONFLICT DO NOTHING/UPDATE syntax.
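For example, a sketch of that syntax against the toy table above (the conflict target needs a unique constraint on id, which the test table as created here does not have; the constraint name is illustrative):

-- ON CONFLICT needs a unique constraint or index to detect conflicts on id.
ALTER TABLE test ADD CONSTRAINT test_id_key UNIQUE (id);

INSERT INTO test (id, data)
VALUES (1, 20), (2, 60), (4, 111)
ON CONFLICT (id) DO UPDATE
SET data = EXCLUDED.data;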

Related

Insert new row of data in SQL table if the 2 column values do not exist

I have a PostgreSQL table interactions with columns
AAId, IDId, S, BasicInfo, DetailedInfo, BN
AAId and IDId are foreign keys referencing other tables.
There are around 1540 rows in the ID table and around 12 in the AA table.
Currently the interactions table has only around 40 rows for AAId = 12. I want to insert a row for each of the missing IDId values.
I have searched but can't find an answer to inserting rows like this. I am not overly confident with SQL; I can do the basics, but this is a little beyond me.
To clarify, I want to perform a kind of loop where,
for each IDId from 1-1540,
if (the row with AAId = 12 and IDId = current IDId in the loop does not exist)
insert a new row with,
AAId = 12,
IDId = current IDId in the loop,
S = 1,
BasicInfo = Unlikely
DetailedInfo = Unlikely
Is there a way to do this in SQL?
Yes, this is possible. You can use data from different tables when inserting data into a table in Postgres. In your particular example, the following insert should work, as long as you have the correct primary/unique key in interactions, which is a combination of AAId and IDId:
INSERT INTO interactions (AAId, IDId, S, BasicInfo, DetailedInfo)
SELECT 12, ID.ID, 1, 'Unlikely', 'Unlikely'
FROM ID
ON CONFLICT DO NOTHING;
ON CONFLICT DO NOTHING guarantees that the query will not fail when it tries to insert rows that already exist, based on the combination of AAId and IDId.
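If that key does not exist yet, one option is to add it and name the conflict target explicitly (a sketch only; the constraint name is made up):

ALTER TABLE interactions
    ADD CONSTRAINT interactions_aaid_idid_key UNIQUE (AAId, IDId);

INSERT INTO interactions (AAId, IDId, S, BasicInfo, DetailedInfo)
SELECT 12, ID.ID, 1, 'Unlikely', 'Unlikely'
FROM ID
ON CONFLICT (AAId, IDId) DO NOTHING;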
If you don't have the correct primary/unique key in interactions, you have to filter what IDs to insert manually:
INSERT INTO interactions (AAId, IDId, S, BasicInfo, DetailedInfo)
SELECT 12, ID.ID, 1, 'Unlikely', 'Unlikely'
FROM ID
WHERE NOT EXISTS (
SELECT * FROM interactions AS i
WHERE i.AAId = 12 AND i.IDId = ID.ID
);

Keep track of item's versions after new insert

I'm currently working on creating a Log table that will hold all the data from another table and will also record, as versions, the changes in the prices of items in the main table.
I would like to know how to save these versions, that is, how to increment the version by 1 each time the same item is inserted into the Log table.
The Log table is loaded via a MERGE of data coming from the user API, in a Python script using pyodbc:
MERGE LogTable AS t
USING (VALUES (?, ?, ?)) AS s(ID, ItemPrice, ItemName)
ON t.ID = s.ID AND t.ItemPrice = s.ItemPrice
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, ItemPrice, ItemName, Date)
    VALUES (s.ID, s.ItemPrice, s.ItemName, GETDATE())
Table example:
Id | ItemPrice | ItemName | Version | Date
---+-----------+----------+---------+------
 1 |        50 | Foo      |       1 | Today
 2 |        30 | bar      |       1 | Today
And after inserting the Item with ID = 1 again with a different price, the table should look like this:
Id | ItemPrice | ItemName | Version | Date
---+-----------+----------+---------+------
 1 |        50 | Foo      |       1 | Today
 2 |        30 | bar      |       1 | Today
 1 |        45 | Foo      |       2 | Today
I saw some similar questions mentioning triggers, but in those other cases a MERGE was not used to insert the data into the Log table.
The following may help you; modify your insert statement like this:
Insert Into tbl_name
Values (1, 45, 'Foo',
COALESCE((Select MAX(D.Version) From tbl_name D Where D.Id = 1), 0) + 1, GETDATE())
Update, following the enhancements proposed by @GarethD:
First: using ISNULL instead of COALESCE will be more performant. Performance matters mainly when the fallback value is not a constant but a query of some sort.
Second: prevent the race condition that may occur when multiple threads try to read the MAX value. The query then becomes the following:
Insert Into tbl_name WITH (HOLDLOCK)
Values (1, 45, 'Foo',
ISNULL((Select MAX(D.Version) From tbl_name D Where D.Id = 1), 0) + 1, GETDATE())
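If you would rather keep everything inside the existing MERGE, the same MAX(Version) lookup can be moved into the USING source. An untested sketch based on the statement from the question, with the same pyodbc placeholders:

MERGE LogTable WITH (HOLDLOCK) AS t
USING (
    -- Compute the next version per ID alongside the incoming values.
    SELECT v.ID, v.ItemPrice, v.ItemName,
           ISNULL((SELECT MAX(d.Version)
                   FROM LogTable d
                   WHERE d.ID = v.ID), 0) + 1 AS NextVersion
    FROM (VALUES (?, ?, ?)) AS v(ID, ItemPrice, ItemName)
) AS s
ON t.ID = s.ID AND t.ItemPrice = s.ItemPrice
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, ItemPrice, ItemName, Version, Date)
    VALUES (s.ID, s.ItemPrice, s.ItemName, s.NextVersion, GETDATE());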

Will order by preserve?

create table source_table (id number);
insert into source_table values(3);
insert into source_table values(1);
insert into source_table values(2);
create table target_table (id number, seq_val number);
create sequence example_sequence;
insert into target_table
select id, example_sequence.nextval
from (select id from source_table order by id);
Is it officially guaranteed that the rows with lower id values in source_table will also get lower sequence values when inserted into target_table? In other words, is it guaranteed that the ordering imposed by the ORDER BY clause is preserved when inserting?
EDIT
The question is not: 'Are rows ordered in a table as such?' but rather 'Can we rely on the order by clause used in the subquery when inserting?'.
To illustrate this more closely: the contents of the target table in the above example, after running select * from target_table order by id, would be:
ID | SEQ_VAL
---+--------
 1 |       1
 2 |       2
 3 |       3
Moreover, if I specified descending ordering when inserting, like this:
insert into target_table
select id, example_sequence.nextval
from (select id from source_table order by id DESC);
The output of the same query from above would be:
ID | SEQ_VAL
---+--------
 1 |       3
 2 |       2
 3 |       1
Of that I'm sure; I have tested it multiple times. My question is: can I always rely on this ordering?
Tables in a relational database are not ordered. Any apparent ordering in the result set of a cursor that lacks an ORDER BY is an artifact of data storage, is not guaranteed, and later actions on the table may cause this apparent ordering to change. If you want the results of a cursor to be ordered in a particular manner, you MUST use an ORDER BY.

SQL Multiple Row Insert w/ multiple selects from different tables

I am trying to do a multiple-row insert based on values that I am pulling from another table. Basically, I need to give all existing users access to a service that previously had access to a different one. Table1 will take the data and run a job to do this.
INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT
    (SELECT Max(id) + 1 FROM Table1),
    '33',          -- The new service id
    clnt_alias_id,
    'PI'           -- The code to let the job know to grant access
FROM TABLE2
WHERE serv_id = '11' -- The old service id
I am getting a Primary key constraint error on id.
Please help.
Thanks,
Colin
This query cannot work as written: the max(id) sub-select is evaluated only ONCE and returns the same value for all rows in the parent query:
MariaDB [test]> create table foo (x int);
MariaDB [test]> insert into foo values (1), (2), (3);
MariaDB [test]> select *, (select max(x)+1 from foo) from foo;
+------+----------------------------+
| x    | (select max(x)+1 from foo) |
+------+----------------------------+
|    1 |                          4 |
|    2 |                          4 |
|    3 |                          4 |
+------+----------------------------+
3 rows in set (0.04 sec)
You will have to run your query multiple times, once for each record you're trying to copy. That way the max(id) will get the ID from the previous query.
Is there a requirement that Table1.id be incremental ints? If not, just add the clnt_alias_id to Max(id). This is a nasty workaround though, and you should really try to get that column's type changed to auto_increment, like Marc B suggested.
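A sketch of that workaround, assuming clnt_alias_id values are distinct integers and that your DBMS allows selecting from the target table in a subquery of the same INSERT (MySQL, for one, restricts this):

INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT (SELECT MAX(id) FROM Table1) + clnt_alias_id, -- distinct only if clnt_alias_id is distinct
       '33',
       clnt_alias_id,
       'PI'
FROM Table2
WHERE serv_id = '11';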

Is it possible to use a PG sequence on a per record label?

Does PostgreSQL 9.2+ provide any functionality to make it possible to generate a sequence that is namespaced to a particular value? For example:
.. | user_id | seq_id | body | ...
----------------------------------
- | 4 | 1 | "abc...."
- | 4 | 2 | "def...."
- | 5 | 1 | "ghi...."
- | 5 | 2 | "xyz...."
- | 5 | 3 | "123...."
This would be useful to generate custom urls for the user:
domain.me/username_4/posts/1
domain.me/username_4/posts/2
domain.me/username_5/posts/1
domain.me/username_5/posts/2
domain.me/username_5/posts/3
I did not find anything in the PG docs (regarding sequence and sequence functions) to do this. Are sub-queries in the INSERT statement or with custom PG functions the only other options?
You can use a subquery in the INSERT statement as @Clodoaldo demonstrates. However, this defeats the nature of a sequence as being safe to use in concurrent transactions; it will result in race conditions and eventually in duplicate key violations.
You should rather rethink your approach. Just one plain sequence for your table and combine it with user_id to get the sort order you want.
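For example, a table definition along these lines (names are illustrative) is enough; the per-user numbering is then derived at query time:

CREATE TABLE tbl (
    seq_id  bigserial PRIMARY KEY,  -- one plain, global sequence for the whole table
    user_id integer NOT NULL,
    body    text
);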
You can always generate the custom URLs with the desired numbers using row_number() with a simple query like:
SELECT format('domain.me/username_%s/posts/%s'
, user_id
, row_number() OVER (PARTITION BY user_id ORDER BY seq_id)
)
FROM tbl;
Maybe this answer is a little off-piste, but I would consider partitioning the data and giving each user their own partitioned table for posts.
There's a bit of setup overhead, as you will need triggers to manage the DDL statements for the partitions, but it effectively gives each user their own table of posts, along with their own sequence, with the benefit of still being able to treat all posts as one big table.
General gist of the concept...
psql# CREATE TABLE posts (user_id integer, seq_id integer);
CREATE TABLE
psql# CREATE TABLE posts_001 (seq_id serial) INHERITS (posts);
CREATE TABLE
psql# CREATE TABLE posts_002 (seq_id serial) INHERITS (posts);
CREATE TABLE
psql# INSERT INTO posts_001 VALUES (1);
INSERT 0 1
psql# INSERT INTO posts_001 VALUES (1);
INSERT 0 1
psql# INSERT INTO posts_002 VALUES (2);
INSERT 0 1
psql# INSERT INTO posts_002 VALUES (2);
INSERT 0 1
psql# select * from posts;
user_id | seq_id
---------+--------
1 | 1
1 | 2
2 | 1
2 | 2
(4 rows)
I left out some rather important CHECK constraints in the above setup; make sure you read the docs for how these kinds of setups are used.
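For illustration only, each child table would normally carry a constraint along these lines (a sketch; see the inheritance-partitioning docs for the full pattern):

CREATE TABLE posts_001 (
    seq_id serial,
    CHECK (user_id = 1)   -- each child table only accepts its own user's rows
) INHERITS (posts);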
insert into t (user_id, seq_id) values
(4, (select coalesce(max(seq_id), 0) + 1 from t where user_id = 4));
Check for a duplicate primary key error in the front end and retry if needed.
Update
Although @Erwin's advice is sensible, that is, a single sequence with the ordering done in the select query, it can be expensive.
If you don't use a sequence, there is no sequence behavior to defeat. Nor will it result in duplicate key violations. To demonstrate this I created a table and wrote a Python script that inserts into it. I launched 3 parallel instances of the script, inserting as fast as possible, and it just works.
The table must have a primary key on those columns:
create table t (
user_id int,
seq_id int,
primary key (user_id, seq_id)
);
The python script:
#!/usr/bin/env python
import psycopg2, psycopg2.extensions

# Insert the next seq_id for user 4: max(seq_id) + 1, computed in the statement itself.
query = """
begin;
insert into t (user_id, seq_id) values
    (4, (select coalesce(max(seq_id), 0) + 1 from t where user_id = 4));
commit;
"""

conn = psycopg2.connect('dbname=cpn user=cpn')
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE)
cursor = conn.cursor()

for i in range(0, 1000):
    while True:
        try:
            cursor.execute(query)
            break
        except psycopg2.IntegrityError, e:
            # A concurrent insert won the race: report, roll back and retry.
            print e.pgerror
            cursor.execute("rollback;")

cursor.close()
conn.close()
After the parallel run:
select count(*), max(seq_id) from t;
count | max
-------+------
3000 | 3000
Just as expected. I have developed at least two applications using that logic, and one of them is more than 13 years old and has never failed. I concede that if you are Facebook or some other giant then you could have a problem.
Yes:
CREATE TABLE your_table
(
column_name type DEFAULT NEXTVAL('sequence_name'),
...
);
More details here:
http://www.postgresql.org/docs/9.2/static/ddl-default.html