Return rows of a table that actually changed in an UPDATE

Using Postgres, I can perform an update statement and return the rows affected by the command.
UPDATE accounts
SET status = merge_accounts.status,
    field1 = merge_accounts.field1,
    field2 = merge_accounts.field2,
    etc.
FROM merge_accounts
WHERE merge_accounts.uid = accounts.uid
RETURNING accounts.*
This will give me a list of all records that matched the WHERE clause; however, it will not tell me which rows actually had any values changed by the operation.
In this simplified use-case it would of course be trivial to simply add another guard, AND status != 'Closed'; however, my real-world use-case involves updating potentially dozens of fields from a merge table with 10,000+ rows, and I want to be able to detect which rows were actually changed and which are identical to their previous version. (The expectation is that very few rows will actually have changed.)
The best I've got so far is
UPDATE accounts
SET x = ..., y = ...
FROM accounts old, merge_accounts
WHERE old.uid = accounts.uid
  AND merge_accounts.uid = accounts.uid
RETURNING accounts, old
Which will return a tuple of old and new rows that can then be diffed inside my Java codebase itself; however, this requires significant additional network traffic and is potentially error-prone.
The ideal scenario is to be able to have Postgres return just the rows that actually had any values changed - is this possible?
Here on GitHub is a more real-world example of what I'm doing, incorporating some of the suggestions so far.
Using Postgres 9.1, but can use 9.4 if required. The requirements are effectively
Be able to perform an upsert of new data
Where we may only know the specific key/value pair to update on any given row
Get back a result containing just the rows that were actually changed by the upsert
Bonus - get a copy of the old records as well.
Since this question was opened I've gotten most of this working now, although I'm unsure if my approach is a good idea or not - it's a bit hacked together.
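For reference, a rough sketch of one way to combine the pieces with data-modifying CTEs (my own sketch, not the actual linked code; column names are illustrative, and it is not safe under concurrent writers, since 9.1/9.4 have no INSERT ... ON CONFLICT):
WITH upd AS (
   UPDATE accounts a
   SET    (status, field1) = (m.status, m.field1)
   FROM   merge_accounts m
   WHERE  m.uid = a.uid
   AND    (a.status, a.field1) IS DISTINCT FROM (m.status, m.field1)  -- only changed rows
   RETURNING a.*
), ins AS (
   INSERT INTO accounts (uid, status, field1)
   SELECT m.uid, m.status, m.field1
   FROM   merge_accounts m
   WHERE  NOT EXISTS (SELECT 1 FROM accounts a WHERE a.uid = m.uid)   -- missing keys
   RETURNING accounts.*
)
SELECT 'updated' AS op, * FROM upd
UNION ALL
SELECT 'inserted' AS op, * FROM ins;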

Only update rows that actually change
That saves expensive updates and expensive checks after the UPDATE.
To update every column with the new value provided (if anything changes):
UPDATE accounts a
SET    (status, field1, field2)       -- short syntax for ..
     = (m.status, m.field1, m.field2) -- .. updating multiple columns
FROM   merge_accounts m
WHERE  m.uid = a.uid
AND   (a.status IS DISTINCT FROM m.status OR
       a.field1 IS DISTINCT FROM m.field1 OR
       a.field2 IS DISTINCT FROM m.field2)
RETURNING a.*;
Due to PostgreSQL's MVCC model any change to a row writes a new row version. Updating a single column is almost as expensive as updating every column in the row at once. Rewriting the rest of the row comes at practically no cost, as soon as you have to update anything.
Details:
How do I (or can I) SELECT DISTINCT on multiple columns?
UPDATE a whole row in PL/pgSQL
Shorthand for whole rows
If the row types of accounts and merge_accounts are identical and you want to adopt everything from merge_accounts into accounts, there is a shortcut comparing the whole row type:
UPDATE accounts a
SET (status, field1, field2)
= (m.status, m.field1, m.field2)
FROM merge_accounts m
WHERE a.uid = m.uid
AND m IS DISTINCT FROM a
RETURNING a.*;
This even works for NULL values. Details in the manual.
But it's not going to work for your home-grown solution where (quoting your comment):
merge_accounts is identical, save that all non-pk columns are array types
It requires compatible row types, i.e. each column shares the same data type or there is at least an implicit cast between the two types.
For your special case:
UPDATE accounts a
SET    (status, field1, field2)
     = (COALESCE(m.status[1], a.status) -- default to original ..
      , COALESCE(m.field1[1], a.field1) -- .. if m.column[1] IS NULL
      , COALESCE(m.field2[1], a.field2))
FROM   merge_accounts m
WHERE  m.uid = a.uid
AND   (m.status[1] IS NOT NULL AND a.status IS DISTINCT FROM m.status[1]
    OR m.field1[1] IS NOT NULL AND a.field1 IS DISTINCT FROM m.field1[1]
    OR m.field2[1] IS NOT NULL AND a.field2 IS DISTINCT FROM m.field2[1])
RETURNING a.*;
m.status IS NOT NULL works if columns that shouldn't be updated are NULL in merge_accounts.
m.status <> '{}' if you operate with empty arrays.
m.status[1] IS NOT NULL covers both options.
Related:
Return pre-UPDATE column values using SQL only

If you aren't relying on side-effects of the update, only update the records that need to change:
UPDATE accounts
SET status = merge_accounts.status,
field1 = merge_accounts.field1,
field2 = merge_accounts.field2,
etc.
FROM merge_accounts WHERE merge_accounts.uid = accounts.uid
AND NOT (status IS NOT DISTINCT FROM merge_accounts.status
AND field1 IS NOT DISTINCT FROM merge_accounts.field1
AND field2 IS NOT DISTINCT FROM merge_accounts.field2
)
RETURNING accounts.*

I would recommend using the information_schema.columns table to introspect the columns dynamically, and then use those within a plpgsql function to dynamically generate the UPDATE statement.
i.e. this DDL:
create table foo
(
id serial,
val integer,
name text
);
insert into foo (val, name) VALUES (10, 'foo'), (20, 'bar'), (30, 'baz');
And this query:
select column_name
from information_schema.columns
where table_name = 'foo'
order by ordinal_position;
will yield the columns for the table in the order that they were defined in the table DDL.
Essentially you would use the above SELECT within the function, iterating over its results to dynamically build up both the SET and WHERE clauses of the UPDATE statement.
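A rough, untested sketch of such a function (Postgres 9.1+; it builds the column lists with string_agg rather than an explicit FOR LOOP, the function name is mine, and a single-column key is assumed):
CREATE OR REPLACE FUNCTION update_changed(_tbl text, _src text, _key text)
RETURNS SETOF record AS
$$
DECLARE
   _set  text;
   _diff text;
BEGIN
   -- Build "col = s.col, ..." and "t.col IS DISTINCT FROM s.col OR ..." lists
   SELECT string_agg(format('%1$I = s.%1$I', column_name), ', ' ORDER BY ordinal_position)
        , string_agg(format('t.%1$I IS DISTINCT FROM s.%1$I', column_name), ' OR ' ORDER BY ordinal_position)
   INTO   _set, _diff
   FROM   information_schema.columns
   WHERE  table_name = _tbl
   AND    column_name <> _key;

   -- Only rows where at least one column differs are updated and returned
   RETURN QUERY EXECUTE format(
      'UPDATE %I t SET %s FROM %I s WHERE s.%I = t.%I AND (%s) RETURNING t.*',
      _tbl, _set, _src, _key, _key, _diff);
END
$$ LANGUAGE plpgsql;
Since the function returns SETOF record, the caller has to spell out the row type, e.g. SELECT * FROM update_changed('accounts', 'merge_accounts', 'uid') AS t(uid int, status text, field1 text, field2 text);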

Some variation of this?
SELECT * FROM old;
id | val
----+-----
1 | 1
2 | 2
4 | 5
5 | 1
6 | 2
SELECT * FROM new;
id | val
----+-----
1 | 2
2 | 2
3 | 2
5 | 1
6 | 1
SELECT * FROM old JOIN new ON old.id = new.id;
id | val | id | val
----+-----+----+-----
1 | 1 | 1 | 2
2 | 2 | 2 | 2
5 | 1 | 5 | 1
6 | 2 | 6 | 1
(4 rows)
WITH sel AS (
    SELECT o.id, o.val FROM old o JOIN new n ON o.id = n.id),
upd AS (
    UPDATE old SET val = new.val FROM new WHERE new.id = old.id RETURNING old.*)
SELECT * FROM sel, upd WHERE sel.id = upd.id AND sel.val <> upd.val;
id | val | id | val
----+-----+----+-----
1 | 1 | 1 | 2
6 | 2 | 6 | 1
(2 rows)
Refer to this SO answer and read the entire discussion.

If you are updating a single table and want to know if the row is actually changed you can use this query:
with rows_affected as (
update mytable set (field1, field2, field3)=('value1', 'value2', 3) where id=1 returning *
)
select count(*)>0 as is_modified from rows_affected
join mytable on mytable.id=rows_affected.id
where rows_affected is distinct from mytable;
And you can wrap your existing queries into this one without the need to modify the actual update statements.

Related

Create a procedure to merge data and avoid duplicates

I am trying to create a SQL Server merge procedure that would allow me to merge new entries into the data set and nullify duplicates in the table. Both tables are of the same type. The Id and Email will always be a one-to-one relation. However, the source table will sometimes send the same email with two different Ids. We want to keep only one record per person and nullify the email for the invalid record. My initial thought is to join the source table with the target table on email and check which emails have two occurrences and nullify, but how could I put this in one procedure?
Table 1 and Table 2:
Id | Email | First | Last | Building | Date |....
Example of duplicate:
1 | tst@tst.com | ...
2 | tst@tst.com | ...
Needed output:
1 | tst@tst.com
2 | null
Procedure:
CREATE PROCEDURE mergingTwo @TableType
AS
BEGIN
MERGE [target]
USING [source] ON [target].Id = [source].Id OR [target].Email = [source].Email
WHEN MATCHED THEN
UPDATE
SET
WHEN NOT MATCHED BY TARGET THEN
INSERT
You can do the MERGE first, then nullify the email in a second UPDATE, like:
WITH cte AS (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id ASC) AS n_row
    FROM table_foo)
UPDATE table_foo
SET email = NULL
FROM table_foo
INNER JOIN cte
    ON cte.id = table_foo.id
    AND cte.n_row > 1
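To answer the "one procedure" part, the two steps can simply be sequenced inside the procedure body. A rough, untested sketch (the table type name and column list are placeholders since the full definitions aren't shown, and the MERGE matches on Id only, because MERGE raises an error if one source row matches several target rows):
CREATE PROCEDURE mergingTwo @TableType dbo.MyTableType READONLY  -- placeholder TVP type
AS
BEGIN
    -- Step 1: merge new entries into the target
    MERGE [target] AS t
    USING @TableType AS s ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET t.Email = s.Email, t.[First] = s.[First], t.[Last] = s.[Last]
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Email, [First], [Last])
        VALUES (s.Id, s.Email, s.[First], s.[Last]);

    -- Step 2: nullify the email on all but the lowest Id per email
    WITH cte AS (
        SELECT Id, ROW_NUMBER() OVER (PARTITION BY Email ORDER BY Id ASC) AS n_row
        FROM [target]
        WHERE Email IS NOT NULL
    )
    UPDATE t
    SET Email = NULL
    FROM [target] AS t
    INNER JOIN cte ON cte.Id = t.Id
    WHERE cte.n_row > 1;
END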
Sounds like a job for a union (unless you really want those NULL entries).
SELECT Email FROM Table1
UNION
SELECT Email FROM Table2
;

How can I remove duplicate rows from a table but keeping the summation of values of a column

Suppose there is a table which has several identical rows. I can copy the distinct values by
SELECT DISTINCT * INTO DESTINATIONTABLE FROM SOURCETABLE
but if the table has a column named value, and for the sake of simplicity its value is 1 for one particular item in that table, and that row has another 9 duplicates, then the summation of the value column for that particular item is 10. Now I want to remove the 9 duplicates (or copy the distinct values, as I mentioned) and for that item the value should now show 10 and not 1. How can this be achieved?
item| value
----+----------------
A | 1
A | 1
A | 1
A | 1
B | 1
B | 1
I want to show this as below
item| value
----+----------------
A | 4
B | 2
Thanks in advance
You can try to use SUM and group by
SELECT item, SUM(value) AS value
FROM T
GROUP BY item
SQLFiddle: http://sqlfiddle.com/#!18/fac26/1
[Results]:
| item | value |
|------|-------|
| A | 4 |
| B | 2 |
Broadly speaking, you can just use a SUM and a GROUP BY clause.
Something like:
SELECT column1, SUM(column2) AS Count
FROM SOURCETABLE
GROUP BY column1
Here it is in action: Sum + Group By
Since your table probably isn't just two columns of data, here is a slightly more complex example showing how to do this to a larger table: SQL Fiddle
Note that I've selected my rows individually so that I can access the necessary data, rather than using
SELECT *
And I have achieved this result without the need for selecting data into another table.
EDIT 2:
Further to your comments, it sounds like you want to alter the actual data in your table rather than just querying it. There may be a more elegant way to do this, but a simple way is to use the above query to populate a temporary table, delete the contents of the existing table, then move all the data back. To do this in my existing example:
WITH MyQuery AS (
SELECT name, type, colour, price, SUM(number) AS number
FROM MyTable
GROUP BY name, type, colour, price
)
SELECT * INTO MyTable2 FROM MyQuery;
DELETE FROM MyTable;
INSERT INTO MyTable(name, type, colour, price, number)
SELECT * FROM MyTable2;
DROP TABLE MyTable2;
WARNING: If you're going to try this, please use a development environment first (i.e. one you don't mind breaking!) to ensure it does exactly what you want it to do. It's imperative that your initial query captures ALL the data you want.
Here is the SQL Fiddle of this example in action: SQL Fiddle

SQL Multiple Row Insert w/ multiple selects from different tables

I am trying to do a multiple-row insert based on values that I am pulling from another table. Basically I need to give all existing users of an old service access to a new one. Table1 will take the data and run a job to do this.
INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT
    (SELECT Max(id) + 1
     FROM Table1),
    '33',           -- The new service id
    clnt_alias_id,
    'PI'            -- The code to let the job know to grant access
FROM TABLE2
WHERE serv_id = '11' -- The old service id
I am getting a Primary key constraint error on id.
Please help.
Thanks,
Colin
This query is impossible. The max(id) sub-select will evaluate only ONCE and return the same value for all rows in the parent query:
MariaDB [test]> create table foo (x int);
MariaDB [test]> insert into foo values (1), (2), (3);
MariaDB [test]> select *, (select max(x)+1 from foo) from foo;
+------+----------------------------+
| x | (select max(x)+1 from foo) |
+------+----------------------------+
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
+------+----------------------------+
3 rows in set (0.04 sec)
You will have to run your query multiple times, once for each record you're trying to copy. That way the max(id) will get the ID from the previous query.
Is there a requirement that Table1.id be incremental ints? If not, just add the clnt_alias_id to Max(id). This is a nasty workaround though, and you should really try to get that column's type changed to auto_increment, like Marc B suggested.
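For illustration, a sketch of that kind of workaround in a single statement using ROW_NUMBER() instead (assuming the DBMS supports window functions); each inserted row gets a distinct id above the current maximum:
INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT mx.max_id + ROW_NUMBER() OVER (ORDER BY t2.clnt_alias_id),
       '33',              -- the new service id
       t2.clnt_alias_id,
       'PI'               -- the code to let the job know to grant access
FROM   TABLE2 t2
CROSS JOIN (SELECT MAX(id) AS max_id FROM Table1) mx
WHERE  t2.serv_id = '11'  -- the old service id
Note this still isn't safe under concurrent inserts; an auto-increment/identity column remains the proper fix.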

Selecting most recent and specific version in each group of records, for multiple groups

The problem:
I have a table that records data rows in foo. Each time the row is updated, a new row is inserted along with a revision number. The table looks like:
id rev field
1 1 test1
2 1 fsdfs
3 1 jfds
1 2 test2
Note: the last record is a newer version of the first row.
Is there an efficient way to query for the latest version of a record and for a specific version of a record?
For instance, a query for rev=2 would return the 2nd, 3rd and 4th rows (not the replaced 1st row), while a query for rev=1 yields those rows with rev <= 1; in case of duplicated ids, the one with the higher revision number is chosen (records 1, 2, 3).
I would prefer not to return the result in an iterative way.
To get only latest revisions:
SELECT * FROM foo t1
WHERE t1.rev =
    (SELECT max(rev) FROM foo t2 WHERE t2.id = t1.id)
To get a specific revision, in this case 1 (and if an item doesn't have the revision yet the next smallest revision):
SELECT * from foo t1
WHERE t1.rev =
(SELECT max(rev)
FROM foo t2
WHERE t2.id = t1.id
AND t2.rev <= 1)
It might not be the most efficient way to do this, but right now I cannot figure a better way to do this.
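If it turns out to be too slow, a composite index should let the correlated max(rev) subquery be answered with a quick index lookup per row (my suggestion, untested):
CREATE INDEX foo_id_rev_idx ON foo (id, rev DESC);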
Here's an alternative solution that incurs an update cost but is much more efficient for reading the latest data rows as it avoids computing MAX(rev). It also works when you're doing bulk updates of subsets of the table. I needed this pattern to ensure I could efficiently switch to a new data set that was updated via a long running batch update without any windows of time where we had partially updated data visible.
Aging
Replace the rev column with an age column
Create a view of the current latest data with filter: age = 0
To create a new version of your data (a sketch of the full cycle follows this list) ...
INSERT: new rows with age = -1 - This was my slow long running batch process.
UPDATE: UPDATE table-name SET age = age + 1 for all rows in the subset. This switches the view to the new latest data (age = 0) and also ages older data in a single transaction.
DELETE: rows having age > N in the subset - Optionally purge old data
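A minimal sketch of the cycle in generic SQL (table and column names are mine):
-- One-time setup: age = 0 marks the current version
ALTER TABLE my_data ADD COLUMN age int NOT NULL DEFAULT 0;
CREATE VIEW my_data_current AS
SELECT id, payload FROM my_data WHERE age = 0;

-- Publish a new version of a subset
INSERT INTO my_data (id, payload, age)
SELECT id, payload, -1 FROM staging;      -- slow batch load, invisible at age = -1

UPDATE my_data SET age = age + 1          -- atomic switch: -1 becomes 0 (live),
WHERE id IN (SELECT id FROM staging);     -- 0 becomes 1 (previous), and so on

DELETE FROM my_data WHERE age > 3;        -- optional purge of old versions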
Indexing
Create a composite index with age and then id so the view will be nice and fast and can also be used to look up by id. Although this key is effectively unique, it's temporarily non-unique while you're ageing the rows (during UPDATE SET age = age + 1), so you'll need to make it non-unique and ideally the clustered index. If you need to find all versions of a given id ordered by age, you may need an additional non-unique index on id then age.
Rollback
Finally... let's say you're having a bad day and the batch processing breaks. You can quickly revert to a previous data set version by running:
UPDATE table-name SET age = age - 1   -- Roll back a version
DELETE FROM table-name WHERE age < 0  -- Clean up bad stuff
Existing Table
Suppose you have an existing table that now needs to support aging. You can use this pattern by first renaming the existing table, then adding the age column and indexing, and then creating the view that includes the age = 0 condition with the same name as the original table.
This strategy may or may not work depending on the nature of the technology layers that depended on the original table, but in many cases swapping a view in for a table should drop in just fine.
Notes
I recommend naming the age column RowAge in order to indicate this pattern is being used, since it's clearer that it's a database-related value and it complements SQL Server's RowVersion naming convention. It also won't conflict with a column or view that needs to return a person's age.
Unlike other solutions, this pattern works for non SQL Server databases.
If the subsets you're updating are very large then this might not be a good solution, as your final transaction will update not just the current records but all past versions of the records in this subset (which could even be the entire table!), so you may end up locking the table.
This is how I would do it. ROW_NUMBER() requires SQL Server 2005 or later.
Sample data:
DECLARE @foo TABLE (
id int,
rev int,
field nvarchar(10)
)
INSERT @foo VALUES
( 1, 1, 'test1' ),
( 2, 1, 'fdsfs' ),
( 3, 1, 'jfds' ),
( 1, 2, 'test2' )
The query:
DECLARE @desiredRev int
SET @desiredRev = 2
SELECT * FROM (
SELECT
id,
rev,
field,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rn
FROM @foo WHERE rev <= @desiredRev
) numbered
WHERE rn = 1
The inner SELECT returns all relevant records, and within each id group (that's the PARTITION BY), computes the row number when ordered by descending rev.
The outer SELECT just selects the first member (so, the one with highest rev) from each id group.
Output when @desiredRev = 2:
id rev field rn
----------- ----------- ---------- --------------------
1 2 test2 1
2 1 fdsfs 1
3 1 jfds 1
Output when @desiredRev = 1:
id rev field rn
----------- ----------- ---------- --------------------
1 1 test1 1
2 1 fdsfs 1
3 1 jfds 1
If you want the latest revision of each id, you can use
SELECT C.rev, C.field FROM (
    SELECT MAX(A.rev) AS rev, A.id
    FROM yourtable A
    GROUP BY A.id) AS B
INNER JOIN yourtable C
ON B.id = C.id AND B.rev = C.rev
In the case of your example, that would return
rev field
1 fsdfs
1 jfds
2 test2
SELECT
    MaxRevs.id,
    revision.field
FROM
    (SELECT id, MAX(rev) AS MaxRev
     FROM revision
     GROUP BY id) MaxRevs
INNER JOIN revision
ON MaxRevs.id = revision.id AND MaxRevs.MaxRev = revision.rev
SELECT foo.* from foo
left join foo as later
on foo.id=later.id and later.rev>foo.rev
where later.id is null;
How about this?
select id, max(rev), field from foo group by id
For querying specific revision e.g. revision 1,
select id, max(rev), field from foo where rev <= 1 group by id

I DISTINCTly hate MySQL (help building a query)

This is straightforward, I believe:
I have a table with 30,000 rows. When I SELECT DISTINCT `location` FROM myTable it returns 21,000 rows, about what I'd expect, but it only returns that one column.
What I want is to move those to a new table, but the whole row for each match.
My best guess is something like SELECT * FROM (SELECT DISTINCT `location` FROM myTable), but it says I have a vague syntax error.
Is there a good way to grab the rest of each DISTINCT row and move it to a new table all in one go?
SELECT * FROM myTable GROUP BY `location`
or if you want to move to another table
CREATE TABLE foo AS SELECT * FROM myTable GROUP BY `location`
DISTINCT applies to the entire row returned. So you can simply use
SELECT DISTINCT * FROM myTable GROUP BY `location`
Using Distinct on a single column doesn't make a lot of sense. Let's say I have the following simple set
-id- -location-
1 store
2 store
3 home
if there were some sort of query that returned all columns, but just distinct on location, which row would be returned? 1 or 2? Should it just pick one at random? Because of this, DISTINCT works for all columns in the result set returned.
Well, first you need to decide what you really want returned.
The problem is that, presumably, for some of the location values in your table there are different values in the other columns even when the location value is the same:
Location OtherCol StillOtherCol
Place1 1 Fred
Place1 89 Fred
Place1 1 Joe
In that case, which of the three rows do you want to select? When you talk about a DISTINCT Location, you're condensing those three rows of different data into a single row; there's no meaning to moving the original rows from the original table into a new table, since those original rows no longer exist in your DISTINCT result set. (If all the other columns are always the same for a given Location, your problem is easier: just SELECT DISTINCT * FROM YourTable.)
If you don't care which values come from the other columns you can use a (bad, IMHO) MySQL extension to SQL and do:
SELECT * FROM YourTable GROUP BY Location
which will give a result set with one row per location and values for the other columns derived from the original data in an undefined fashion.
Multiple rows with identical values in all columns don't make any sense. OK, the question might be about a way to correct exactly that situation.
Considering this table, with id being the PK:
kram=# select * from foba;
id | no | name
----+----+---------------
2 | 1 | a
3 | 1 | b
4 | 2 | c
5 | 2 | a,b,c,d,e,f,g
you may extract a sample for every single no (:=location) by grouping over that column, and selecting the row with minimum PK (for example):
SELECT * FROM foba WHERE id IN (SELECT min(id) FROM foba GROUP BY no);
id | no | name
----+----+------
2 | 1 | a
4 | 2 | c