PostgreSQL Update and return - sql

Let's say I have a table called t in Postgres:
id | group_name | state
-----------------------------
1 | group1 | 0
2 | group1 | 0
3 | group1 | 0
I need to update the state of a row by ID, while also returning some things:
The old state
The remaining number of rows in the same group that have state = 0
I've got a query to do this as follows:
UPDATE t AS updated SET state = 1
FROM t as original
WHERE
updated.id = original.id AND
updated.id = :some_id
RETURNING
updated.state AS new_state,
original.state AS old_state,
(
SELECT COUNT(*) FROM t
WHERE
group_name = updated.group_name AND
state = 0
) as remaining_count;
However, it seems like the subquery within RETURNING is executed before the update has completed, leaving me with a remaining_count that is off by 1.
Additionally, I'm not sure how this behaves when concurrent queries are run. If we update two of these rows at the same time, is it possible that they would both return the same remaining_count?
Is there a more elegant solution to this? Perhaps some sort of window/aggregate function?

The subquery is indeed run without seeing the change from the UPDATE: every part of a statement, including a subquery in RETURNING, reads from the snapshot taken when the statement started, so the statement's own changes are not visible to it. Nevertheless, it's an easy fix; just add a condition to the subquery's WHERE clause to filter out the ID you just updated, making your query something like this:
UPDATE t AS updated SET state = 1
FROM t as original
WHERE
updated.id = original.id AND
updated.id = :some_id
RETURNING
updated.state AS new_state,
original.state AS old_state,
(
SELECT COUNT(*) FROM t
WHERE
group_name = updated.group_name AND
state = 0 AND
t.id <> :some_id /* this is what I changed */
) as remaining_count;
Concurrency-wise, I'm not sure what the behavior would be, TBH; best I can do is point you at the relevant docs.

You could try (non-recursive) WITH queries, aka Common Table Expressions (CTEs). Their general structure is as follows:
WITH auxiliary_query_name AS (
auxiliary_query_expression
)
[, another_auxiliary_query_name AS ( ... ) ...]
primary_query_expression;
Normally, auxiliary_query_expression and primary_query_expression run concurrently, and if data-modifying statements among them touch the same underlying tables, the order in which the changes actually happen is unpredictable. However, you can refer to auxiliary_query_name from within primary_query_expression, and from other auxiliary queries, thus enforcing a run sequence, where the referring query has to wait for the referred one to complete. Some finer points may apply, but that's the gist of it. A CTE that is referenced several times also comes with the advantage of being computed only once.
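To make the shared-snapshot behaviour concrete, here is a minimal sketch against the sample table t from the question (three rows, all with state = 0). The CTE name upd and the column aliases are made up for illustration. Under those assumptions, reading the base table inside the same statement should still show the old state, while reading from the CTE shows the value produced by RETURNING:

```sql
-- Sketch only: assumes the sample table t from the question.
WITH upd AS (
    UPDATE t
    SET state = 1
    WHERE id = 1
    RETURNING id, state
)
SELECT
    -- reads t through the statement's snapshot: the pre-update value
    (SELECT state FROM t WHERE id = 1) AS state_in_base_table,
    -- reads the RETURNING output: the post-update value
    (SELECT state FROM upd)            AS state_from_returning;
```

This is why the only reliable way to pass changed values between parts of one statement is RETURNING, not a second read of the base table.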
Regarding your query specifically, assuming that what you want in the end is the ID of the updated item, its old state, new state, the group it belongs to, and how many other items of that group are left to update, I believe the following would achieve this. I slightly modified the original query to update multiple items at once, to show how this approach shines (besides the clear sequence, its performance advantages are moot if you update only a single item at a time).
WITH updated_t AS (
UPDATE t AS updated SET state = 1
FROM t as original
WHERE
updated.id = original.id AND
updated.id in :array_of_IDs -- I changed this
RETURNING
updated.id,
original.state AS old_state,
updated.state AS new_state,
updated.group_name
),
remaining AS (
SELECT t.group_name, count(*) as remaining_count
-- we need to JOIN then filter out the updated rows because
-- all WITH in a statement share the same snapshot, thus have
-- the same starting "view" of base tables.
FROM t LEFT JOIN updated_t
ON t.id = updated_t.id
WHERE updated_t.id is NULL
AND t.group_name in (SELECT DISTINCT group_name from updated_t)
AND t.state = 0
GROUP BY t.group_name
)
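SELECT
updated_t.id,
updated_t.group_name,
updated_t.old_state,
updated_t.new_state,
-- a group with no rows left in state 0 has no row in "remaining",
-- so use a LEFT JOIN and report 0 instead of dropping the updated row
COALESCE(remaining.remaining_count, 0) AS remaining_count
FROM updated_t
LEFT JOIN remaining
ON updated_t.group_name = remaining.group_name;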

Related

update assignment applied to set vs row

I was trying to randomly update stateid in [address], so I started with this query
update [address]
set stateid = (select top 1 id
from lookupvalue
where lookuptypeid = 3 and code = 1
order by newid()),
countryid = 1
select *
from [address]
but as it appears, all the rows get the same value. When I tried referencing the [address] table from the inner select query, the update was run per row (and I got the desired effect).
update [address]
set stateid = (select top 1 id
from lookupvalue
where lookuptypeid = 3 and [address].id = [address].id
and code = 1
order by newid()),
countryid = 1
select *
from [address]
Can someone elaborate on the above behavior, is it related to query plan, do I have to make a dummy reference in the inner select to force the update assignment to be evaluated per row?
Yes, it is related to the query plan. SQL Server (and other databases too) sees that the subquery does not reference the outer table and decides that it can be evaluated once, with the result reused for every row. I consider this an error because newid() is a volatile function, so the subquery should not be optimized this way. But there are arguments on the other side as well.
Putting in the outer reference fixes the problem, so you know how to get around this "optimization".
To evaluate the result per row, you can also use CROSS APPLY.

How to update a PostgreSQL table with a count of duplicate items

I found two bugs in a program that created a lot of duplicate values:
an 'index' was created instead of a 'unique index'
a duplication check wasn't integrated in one of 4 twisted routines
So I need to go in and clean up my database.
Step one is to decorate the table with a count of all the duplicate values (next I'll look into finding the first value, and then migrating everything over).
The code below works, I just recall doing a similar "update from select count" on the same table years ago, and I did it in half as much code.
Is there a better way to write this?
UPDATE
shared_link
SET
is_duplicate_of_count = subquery.is_duplicate_of_count
FROM
(
SELECT
count(url) AS is_duplicate_of_count
, url
FROM
shared_link
WHERE
shared_link.url = url
GROUP BY
url
) AS subquery
WHERE
shared_link.url = subquery.url
;
Your query is fine, generally, except for the pointless (but also harmless) WHERE clause in the subquery:
UPDATE shared_link
SET is_duplicate_of_count = subquery.is_duplicate_of_count
FROM (
SELECT url
, count(url) AS is_duplicate_of_count
FROM shared_link
-- WHERE shared_link.url = url
GROUP BY url
) AS subquery
WHERE shared_link.url = subquery.url;
The commented clause is the same as
WHERE shared_link.url = shared_link.url
and therefore only eliminates rows where url is NULL (because NULL = NULL is not TRUE), which is most probably neither intended nor needed in your setup.
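A quick way to see the effect of such a self-comparison is a throwaway query over inline values. This is only a sketch: the VALUES list and the alias v are invented for the demonstration, and FILTER requires PostgreSQL 9.4 or later.

```sql
-- Three sample rows, one of them with a NULL url.
SELECT
    count(*)                          AS all_rows,            -- 3
    count(*) FILTER (WHERE url = url) AS rows_passing_filter  -- 2: the NULL row is dropped
FROM (VALUES ('a'), ('b'), (NULL)) AS v(url);
```

For the NULL row, url = url evaluates to NULL rather than TRUE, so the row does not pass the condition.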
Other than that you can only shorten your code further with aliases and shorter names:
UPDATE shared_link s
SET ct = u.ct
FROM (
SELECT url, count(url) AS ct
FROM shared_link
GROUP BY 1
) AS u
WHERE s.url = u.url;
In PostgreSQL 9.1 or later you might be able to do the whole operation (identify dupes, consolidate data, remove dupes) in one SQL statement with aggregate and window functions and data-modifying CTEs - thereby eliminating the need for an additional column to begin with.
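As a rough sketch of what such a single statement could look like, the following removes all but the first row per url using a window function and a data-modifying CTE. It assumes shared_link has a unique id column (not shown in the question) and that "first" means lowest id. It only deletes duplicates; consolidating related data into the surviving row would need additional steps.

```sql
-- Sketch only: the id column and the "keep the lowest id" rule are assumptions.
WITH ranked AS (
    SELECT id,
           first_value(id) OVER w AS keep_id,  -- the row that survives
           row_number()    OVER w AS rn        -- 1 for the survivor, >1 for dupes
    FROM shared_link
    WHERE url IS NOT NULL                      -- don't treat NULL urls as duplicates
    WINDOW w AS (PARTITION BY url ORDER BY id)
),
removed AS (
    DELETE FROM shared_link s
    USING ranked r
    WHERE s.id = r.id
      AND r.rn > 1
    RETURNING r.keep_id
)
-- Report how many duplicates were removed for each surviving row.
SELECT keep_id, count(*) AS duplicates_removed
FROM removed
GROUP BY keep_id;
```

Run it inside a transaction first, so the result can be inspected and rolled back if it is not what you expect.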

Guidance on using the WITH clause in SQL

I understand how to use the WITH clause for recursive queries (!!), but I'm having problems understanding its general use / power.
For example the following query updates one record whose id is determined by using a subquery returning the id of the first record by timestamp:
update global.prospect psp
set status=status||'*'
where psp.psp_id=(
select p2.psp_id
from global.prospect p2
where p2.status='new' or p2.status='reset'
order by p2.request_ts
limit 1 )
returning psp.*;
Would this be a good candidate for using a WITH wrapper instead of the relatively ugly sub-query? If so, why?
If there can be concurrent write access to involved tables, there are race conditions in the following queries. Consider:
Postgres UPDATE … LIMIT 1
Your example can use a CTE (common table expression), but it will give you nothing a subquery couldn't do:
WITH x AS (
SELECT psp_id
FROM global.prospect
WHERE status IN ('new', 'reset')
ORDER BY request_ts
LIMIT 1
)
UPDATE global.prospect psp
SET status = status || '*'
FROM x
WHERE psp.psp_id = x.psp_id
RETURNING psp.*;
The returned row will be the updated version.
If you want to insert the returned row into another table, that's where a WITH clause becomes essential:
WITH x AS (
SELECT psp_id
FROM global.prospect
WHERE status IN ('new', 'reset')
ORDER BY request_ts
LIMIT 1
)
, y AS (
UPDATE global.prospect psp
SET status = status || '*'
FROM x
WHERE psp.psp_id = x.psp_id
RETURNING psp.*
)
INSERT INTO z
SELECT *
FROM y;
Data-modifying queries using CTEs were added with PostgreSQL 9.1.
The manual about WITH queries (CTEs).
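Regarding the race condition under concurrent writes mentioned at the top of this answer, one common way to handle it is to lock the chosen row inside the CTE. The sketch below assumes PostgreSQL 9.5 or later, where SKIP LOCKED is available; whether skipping locked rows is acceptable depends on your use case.

```sql
-- Sketch only: concurrent sessions each pick a different row,
-- because rows already locked by another session are skipped.
WITH x AS (
    SELECT psp_id
    FROM global.prospect
    WHERE status IN ('new', 'reset')
    ORDER BY request_ts
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE global.prospect psp
SET status = status || '*'
FROM x
WHERE psp.psp_id = x.psp_id
RETURNING psp.*;
```

If no unlocked row qualifies, the statement updates nothing and returns an empty result, so the caller should be prepared for that.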
WITH lets you define "temporary tables" for use in a SELECT query. For example, I recently wrote a query like this, to calculate changes between two sets:
-- Let o be the set of old things, and n be the set of new things.
WITH o AS (SELECT * FROM things(OLD)),
n AS (SELECT * FROM things(NEW))
-- Select both the set of things whose value changed,
-- and the set of things in the old set but not in the new set.
SELECT o.key, n.value
FROM o
LEFT JOIN n ON o.key = n.key
WHERE o.value IS DISTINCT FROM n.value
UNION ALL
-- Select the set of things in the new set but not in the old set.
SELECT n.key, n.value
FROM o
RIGHT JOIN n ON o.key = n.key
WHERE o.key IS NULL;
By defining the "tables" o and n at the top, I was able to avoid repeating the expressions things(OLD) and things(NEW).
Sure, we could probably eliminate the UNION ALL using a FULL JOIN, but I wasn't able to do that in my particular case.
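For reference, a FULL JOIN variant might look like the sketch below. The inline VALUES lists stand in for things(OLD) and things(NEW) and are invented for the demonstration; it also assumes key is unique and never NULL in both sets.

```sql
WITH o(key, value) AS (VALUES (1, 'a'), (2, 'b'), (3, 'c')),
     n(key, value) AS (VALUES (2, 'b'), (3, 'x'), (4, 'd'))
SELECT COALESCE(o.key, n.key) AS key, n.value
FROM o
FULL JOIN n ON o.key = n.key
WHERE o.key IS NULL                          -- only in the new set
   OR o.value IS DISTINCT FROM n.value;      -- changed, or only in the old set
```

With this sample data, key 2 is filtered out because its value is unchanged, while keys 1, 3 and 4 are returned.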
If I understand your query correctly, it does this:
Find the oldest row in global.prospect whose status is 'new' or 'reset'.
Mark it by adding an asterisk to its status.
Return the row (including our tweak to status).
I don't think WITH will simplify anything in your case. It may be slightly more elegant to use a FROM clause, though:
update global.prospect psp
set status = status || '*'
from ( select psp_id
from global.prospect
where status = 'new' or status = 'reset'
order by request_ts
limit 1
) p2
where psp.psp_id = p2.psp_id
returning psp.*;
Untested. Let me know if it works.
It's pretty much exactly what you have already, except:
This can be easily extended to update multiple rows. In your version, which uses a subquery expression, the query would fail if the subquery were changed to yield multiple rows.
I did not alias global.prospect in the subquery, so it's a bit easier to read. Since this uses a FROM clause, you'll get an error if you accidentally reference the table being updated.
In your version, the subquery expression is encountered for every single item. Although PostgreSQL should optimize this and only evaluate the expression once, this optimization will go away if you accidentally reference a column in psp or add a volatile expression.

MS SQL update table with multiple conditions

Been reading this site for answers for quite a while and now asking my first question!
I'm using SQL Server
I have two tables, ABC and ABC_Temp.
The contents are inserted into the ABC_Temp first before making its way to ABC.
Table ABC and ABC_Temp have the same columns, except that ABC_Temp has an extra column called LastUpdatedDate, which contains the date of the last update. Because ABC_Temp can have more than 1 of the same record, it has a composite key of the item number and the last updated date.
The columns are: ItemNo | Price | Qty and ABC_Temp has an extra column: LastUpdatedDate
I want to create a statement that follows the following conditions:
Check if each of the attributes of ABC differ from the value of ABC_Temp for records with the same key, if so then do the update (Even if only one attribute is different, all other attributes can be updated as well)
Only update those that need changes, if the record is the same, then it would not update.
Since an item can have more than one record in ABC_Temp I only want the latest updated one to be updated to ABC
I am currently using 2005 (I think, not at work at the moment).
This will be in a stored procedure and is called inside the VBScript scheduled task. So I believe it is a one-time thing. Also I'm not trying to sync the two tables, as the contents of ABC_Temp would only contain new records bulk inserted from a text file through BCP. For the sake of context, this will be used in conjunction with an insert stored proc that checks if records exist.
UPDATE
ABC
SET
price = T1.price,
qty = T1.qty
FROM
ABC
INNER JOIN ABC_Temp T1 ON
T1.item_no = ABC.item_no
LEFT OUTER JOIN ABC_Temp T2 ON
T2.item_no = T1.item_no AND
T2.last_updated_date > T1.last_updated_date
WHERE
T2.item_no IS NULL AND
(
T1.price <> ABC.price OR
T1.qty <> ABC.qty
)
If NULL values are possible in the price or qty columns then you will need to account for that. In this case I would probably change the inequality statements to look like this:
COALESCE(T1.price, -1) <> COALESCE(ABC.price, -1)
This assumes that -1 is not a valid value in the data, so you don't have to worry about it actually appearing there.
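If no safe sentinel value exists, another option is to compare the column sets with EXCEPT, which treats two NULLs as equal. This is a sketch of the same update with that comparison swapped in; it uses the column names from the query above and should work on SQL Server 2005 and later, where EXCEPT is available.

```sql
UPDATE ABC
SET price = T1.price,
    qty   = T1.qty
FROM ABC
INNER JOIN ABC_Temp T1
    ON T1.item_no = ABC.item_no
LEFT OUTER JOIN ABC_Temp T2
    ON T2.item_no = T1.item_no
   AND T2.last_updated_date > T1.last_updated_date
WHERE T2.item_no IS NULL          -- T1 is the latest row for this item
  AND EXISTS (
        -- returns a row only when the two value sets differ,
        -- with NULLs compared as equal to each other
        SELECT T1.price, T1.qty
        EXCEPT
        SELECT ABC.price, ABC.qty
      );
```

Adding more columns then only means extending the two SELECT lists, rather than writing a NULL check per column.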
Also, is ABC_Temp really a temporary table that's just loaded long enough to get the values into ABC? If not then you are storing duplicate data in multiple places, which is a bad idea. The first problem is that now you need these kinds of update scenarios. There are other issues that you might run into, such as inconsistencies in the data, etc.
You could use cross apply to seek the last row in ABC_Temp with the same key. Use a where clause to filter out rows with no differences:
update abc
set col1 = latest.col1
, col2 = latest.col2
, col3 = latest.col3
from ABC abc
cross apply
(
select top 1 *
from ABC_Temp tmp
where abc.[key] = tmp.[key]
order by
tmp.LastUpdatedDate desc
) latest
where abc.col1 <> latest.col1
or (abc.col2 <> latest.col2
or (abc.col2 is null and latest.col2 is not null)
or (abc.col2 is not null and latest.col2 is null))
or abc.col3 <> latest.col3
In the example, only col2 is nullable. Since null <> 1 is not true, you have to check differences involving null using the is null syntax.

SQL Server 2008 - Select disjunct rows

I have two concurrent processes and I have two queries, eg.:
select top 10 * into #tmp_member
from member
where status = 0
order by member_id
and then
update member
set process_status = 1
from member inner join #tmp_member m
on member.member_id=m.member_id
I'd like each process to select different rows, so if a row was already selected by the first process, then do not use that one in the second process' result list.
Do I have to play around with locks? UPDLOCK, ROWLOCK, READPAST hints maybe? Or is there a more straightforward solution?
Any help is appreciated,
cheers,
b
You need hints.
See my answer here: SQL Server Process Queue Race Condition
However, you can shorten your query above into a single statement with the OUTPUT clause. Otherwise you'll need a transaction too (assuming each process executes the 2 statements above one after the other).
update m
set process_status = 1
OUTPUT Inserted.member_id
from
(
SELECT top 10
process_status, member_id
from member WITH (ROWLOCK, READPAST, UPDLOCK)
where status = 0
order by member_id
) m
Summary: if you want multiple processes to
select 10 rows where status = 0
set process_status = 1
return a resultset in a safe, concurrent fashion
...then use this code.
Well the problem is that your select/update is not atomic - the second process might select the first 10 items in between the first process having selected and before updating.
There's the OUTPUT clause you can use on the UPDATE statement to make it atomic. See the documentation for details, but basically you can write something like:
DECLARE @MyTableVar table(member_id INT);
UPDATE TOP (10) member
SET process_status = 1
-- OUTPUT goes between SET and WHERE
OUTPUT inserted.member_id
INTO @MyTableVar
WHERE status = 0;
After that @MyTableVar should contain all the updated member IDs. Note that UPDATE TOP (10) picks an arbitrary 10 matching rows, not the first 10 by member_id.
To meet your goal of having multiple processes work on the member table you will not need to "play around with locks". You will need to change from the #tmp_member table to a global temp table or a permanent table. The table will also need a column to track which process is managing the member row.
You will need a method to provide some kind of ID to each process which will be using the table. The first query will then be modified to exclude any entries in the table by other processes. The second query will be modified to include only those entries by this process.