How to update a PostgreSQL table with a count of duplicate items - sql

I found two bugs in a program that created a lot of duplicate values:
an 'index' was created instead of a 'unique index'
a duplication check wasn't integrated into one of 4 twisted routines
So I need to go in and clean up my database.
Step one is to decorate the table with a count of all the duplicate values (next I'll look into finding the first value, and then migrating everything over).
The code below works; I just recall doing a similar "update from select count" on the same table years ago, and I did it in half as much code.
Is there a better way to write this?
UPDATE
shared_link
SET
is_duplicate_of_count = subquery.is_duplicate_of_count
FROM
(
SELECT
count(url) AS is_duplicate_of_count
, url
FROM
shared_link
WHERE
shared_link.url = url
GROUP BY
url
) AS subquery
WHERE
shared_link.url = subquery.url
;

Your query is fine, generally, except for the pointless (but also harmless) WHERE clause in the subquery:
UPDATE shared_link
SET is_duplicate_of_count = subquery.is_duplicate_of_count
FROM (
SELECT url
, count(url) AS is_duplicate_of_count
FROM shared_link
-- WHERE shared_link.url = url
GROUP BY url
) AS subquery
WHERE shared_link.url = subquery.url;
The commented clause is the same as
WHERE shared_link.url = shared_link.url
and therefore only eliminates NULL values (because NULL = NULL is not TRUE), which is most probably neither intended nor needed in your setup.
Other than that you can only shorten your code further with aliases and shorter names:
UPDATE shared_link s
SET ct = u.ct
FROM (
SELECT url, count(url) AS ct
FROM shared_link
GROUP BY 1
) AS u
WHERE s.url = u.url;
In PostgreSQL 9.1 or later you might be able to do the whole operation (identify dupes, consolidate data, remove dupes) in one SQL statement with aggregate and window functions and data-modifying CTEs - thereby eliminating the need for an additional column to begin with.
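For illustration, a minimal sketch of that kind of single statement, assuming a surrogate primary key id and a "keep the lowest id per url" rule (both assumptions, not taken from the question; the consolidation of data into the surviving row is left out):
-- sketch only: requires PostgreSQL 9.1+, assumes shared_link.id exists
WITH keep AS (
   SELECT url, min(id) AS keep_id
   FROM   shared_link
   GROUP  BY url
   HAVING count(*) > 1
   )
DELETE FROM shared_link s
USING  keep k
WHERE  s.url = k.url
AND    s.id <> k.keep_id
RETURNING s.*;  -- the removed duplicates, for inspection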

RETURNING causes error: missing FROM-clause entry for table

I am getting the user's data by UUID, WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'.
Since I don't want to make an additional query to fetch additional user data, I'm trying to sneak it through the INSERT.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
)
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
RETURNING _u.empl_user_pvt_uuid, _u.email, _u.name_first;
However I get:
[42P01] ERROR: missing FROM-clause entry for table "_u"
Position: 994
What am I doing wrong?
It's true, as has been noted, that the RETURNING clause of an INSERT only sees the inserted row. More specifically, quoting the manual here:
The optional RETURNING clause causes INSERT to compute and return
value(s) based on each row actually inserted (or updated, if an ON CONFLICT DO UPDATE clause was used). This is primarily useful for
obtaining values that were supplied by defaults, such as a serial
sequence number. However, any expression using the table's columns
is allowed. The syntax of the RETURNING list is identical to that
of the output list of SELECT. Only rows that were successfully
inserted or updated will be returned. [...]
Bold emphasis mine.
So nothing keeps you from adding a correlated subquery to the RETURNING list:
INSERT INTO employees.password_resets AS ep
(empl_pwd_reset_uuid , empl_user_pvt_uuid , t_valid , for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', eu.empl_user_pvt_uuid , '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM employees.users eu
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
RETURNING for_empl_user_pvt_uuid AS empl_user_pvt_uuid -- alias to match your original query
, (SELECT email
FROM employees.emails
WHERE empl_user_pvt_uuid = ep.empl_user_pvt_uuid
ORDER BY t DESC -- NULLS LAST ?
LIMIT 1
) AS email
, (SELECT name_first
FROM employees.profiles
WHERE empl_user_pvt_uuid = ep.empl_user_pvt_uuid
-- ORDER BY ???
LIMIT 1
) AS name_first;
This is also much more efficient than the query you had (or what was proposed) for multiple reasons.
We don't run the subqueries ee and ep over all rows of the tables employees.emails and employees.profiles. That would be efficient if we needed major parts of those tables, but we only fetch a single row of interest from each. With appropriate indexes, a correlated subquery is much more efficient for this. See:
Efficient query to get greatest value per group from big table
Two SQL LEFT JOINS produce incorrect result
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest row per user
We don't add the overhead of one or more CTEs.
We only fetch additional data after a successful INSERT, so no time is wasted if the insert didn't go through for any reason. (See quote at the top!)
Plus, possibly most important, this is correct. We use data from the row that has actually been inserted - after inserting it. (See quote at the top!) After possible default values, triggers or rules have been applied. We can be certain that what we see is what's actually in the database (currently).
You have no ORDER BY for profiles.name_first. That's not right. Either there is only one qualifying row, in which case we need neither DISTINCT nor LIMIT 1; or there can be multiple rows, in which case we also need a deterministic ORDER BY to get a deterministic result.
And if emails.t can be NULL, you'll want to add NULLS LAST in the ORDER BY clause. See:
Sort by column ASC, but NULL values first?
Indexes
Ideally, you have these multicolumn indexes (with columns in this order):
users (empl_user_pub_uuid, empl_user_pvt_uuid)
emails (empl_user_pvt_uuid, email)
profiles (empl_user_pvt_uuid, name_first)
Then, if the tables are vacuumed enough, you get three index-only scans and the whole operation is lightning fast.
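For illustration, the corresponding index definitions might look like this (index names are made up):
CREATE INDEX users_pub_pvt_idx    ON employees.users    (empl_user_pub_uuid, empl_user_pvt_uuid);
CREATE INDEX emails_pvt_email_idx ON employees.emails   (empl_user_pvt_uuid, email);
-- if the ORDER BY t DESC should also be covered, (empl_user_pvt_uuid, t DESC, email) may serve better
CREATE INDEX profiles_pvt_name_idx ON employees.profiles (empl_user_pvt_uuid, name_first);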
Get pre-INSERT values?
If you really want that (which I don't think you do), consider:
Return pre-UPDATE column values using SQL only
According to the Postgres docs, 6.4. Returning Data From Modified Rows:
In an INSERT, the data available to RETURNING is the row as it was
inserted.
But here you are trying to return columns from the source table instead of the destination. RETURNING cannot return columns from the _u table, only from the row inserted into employees.password_resets. But you can write a nested CTE for the insertion and select data from the source table as well. Please try the approach below.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
), i AS (
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
)
SELECT _u.empl_user_pvt_uuid, _u.email, _u.name_first FROM _u;

Merging deltas with duplicate keys

I'm trying to perform a merge into a target table in our Snowflake instance, where the source data contains change data with a field denoting the at-source DML operation, i.e. I=Insert, U=Update, D=Delete.
The problem is dealing with the fact the log (deltas) source might contain multiple updates for the same record. The merge I've constructed bombs out complaining about duplicate keys.
I'm struggling to think of a solution short of resorting to the likes of GROUP BY and MAX on the updates. I've done a similar setup with Oracle, and there the AND clause on the MATCHED was enough.
MERGE INTO "DB"."SCHEMA"."TABLE" t
USING (
SELECT * FROM "DB"."SCHEMA"."TABLE_LOG"
ORDER BY RECORD_TIMESTAMP ASC
) s ON t.RECORD_KEY = s.RECORD_KEY
WHEN MATCHED AND s.RECORD_OPERATION = 'D' THEN DELETE
WHEN MATCHED AND s.RECORD_OPERATION = 'U' THEN UPDATE
SET t.ID=COALESCE(s.ID,t.ID),
t.CREATED_AT=COALESCE(s.CREATED_AT,t.CREATED_AT),
t.PRODUCT=COALESCE(s.PRODUCT,t.PRODUCT),
t.SHOP_ID=COALESCE(s.SHOP_ID,t.SHOP_ID),
t.UPDATED_AT=COALESCE(s.UPDATED_AT,t.UPDATED_AT)
WHEN NOT MATCHED AND s.RECORD_OPERATION = 'I' THEN
INSERT (RECORD_KEY, ID, CREATED_AT, PRODUCT,
SHOP_ID, UPDATED_AT)
VALUES (s.RECORD_KEY, s.ID, s.CREATED_AT, s.PRODUCT,
s.SHOP_ID, s.UPDATED_AT);
Is there a way to rewrite the above merge so that it works as is?
The Snowflake docs show support for an AND predicate on the MATCHED clause; it sounds like you tried this and it's not working because of the duplicates, right?
https://docs.snowflake.net/manuals/sql-reference/sql/merge.html#matchedclause-for-updates-or-deletes
There is even an example there using the AND predicate:
merge into t1 using t2 on t1.t1key = t2.t2key
when matched and t2.marked = 1 then delete
when matched and t2.isnewstatus = 1 then update set val = t2.newval, status = t2.newstatus
when matched then update set val = t2.newval
when not matched then insert (val, status) values (t2.newval, t2.newstatus);
I think you are going to have to get the "last record" per key and use that as your update, or process these serially, which will be pretty slow...
Another thing to look at would be whether you can apply the last_value() function to each column, ordering by your timestamp and partitioning by your key. If you do that in your inline view, that might work.
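For what it's worth, a sketch of the "last record per key" idea applied to the USING clause, using QUALIFY/ROW_NUMBER rather than last_value (this assumes RECORD_TIMESTAMP is a usable tie-breaker):
MERGE INTO "DB"."SCHEMA"."TABLE" t
USING (
    SELECT *
    FROM "DB"."SCHEMA"."TABLE_LOG"
    -- keep only the latest log row per key
    QUALIFY ROW_NUMBER() OVER (PARTITION BY RECORD_KEY ORDER BY RECORD_TIMESTAMP DESC) = 1
) s ON t.RECORD_KEY = s.RECORD_KEY
WHEN MATCHED AND s.RECORD_OPERATION = 'D' THEN DELETE
WHEN MATCHED AND s.RECORD_OPERATION = 'U' THEN UPDATE
SET t.ID=COALESCE(s.ID,t.ID),
t.CREATED_AT=COALESCE(s.CREATED_AT,t.CREATED_AT),
t.PRODUCT=COALESCE(s.PRODUCT,t.PRODUCT),
t.SHOP_ID=COALESCE(s.SHOP_ID,t.SHOP_ID),
t.UPDATED_AT=COALESCE(s.UPDATED_AT,t.UPDATED_AT)
WHEN NOT MATCHED AND s.RECORD_OPERATION = 'I' THEN
INSERT (RECORD_KEY, ID, CREATED_AT, PRODUCT, SHOP_ID, UPDATED_AT)
VALUES (s.RECORD_KEY, s.ID, s.CREATED_AT, s.PRODUCT, s.SHOP_ID, s.UPDATED_AT);
Note that collapsing to the latest row per key discards intermediate operations (an 'I' followed by a 'U' for the same key leaves only the 'U', which the NOT MATCHED branch won't insert), so the WHEN conditions may need adjusting for your change-data semantics.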
I hope this helps, I have a feeling it won't help much...Rich
UPDATE:
I found the following: https://docs.snowflake.net/manuals/sql-reference/parameters.html#error-on-nondeterministic-merge
If you run the following command before your merge, I think you'll be OK (testing required of course):
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_MERGE=false;

Writing a single UPDATE statement that prevents duplicates

I've been trying for a few hours (probably more than I needed to) to figure out the best way to write an UPDATE SQL query that will disallow duplicates on the column I am updating.
Meaning, if TableA.ColA already has a value 'TEST1', then when I'm changing another record, I simply can't set ColA to 'TEST1'.
It's pretty easy to simply separate the query into a SELECT and use server-layer code for the conditional logic:
SELECT ID, NAME FROM TABLEA WHERE NAME = 'TEST1'
IF TableA.recordcount = 0 THEN
UPDATE SET NAME = 'TEST1' WHERE ID = 1234
END IF
But I'm more interested to see if these two queries can be combined into a single query.
I am using Oracle to figure things out, but I'd love to see a SQL Server query as well. I figured a MERGE statement can work, but for obvious reasons you can't have the clause:
..etc.. WHEN NOT MATCHED UPDATE SET ..etc.. WHERE ID = 1234
AND you can't update a column if it's mentioned in the join (an Oracle limitation, but not limited to SQL Server)
ALSO, I know you can put a constraint on a column that prevents duplicate values, but I'd be interested to see if there is a query that can do this without using a constraint.
Here is an example start-up attempt on my end just to see what I can come up with (explanations of why it failed are not necessary):
ERROR: ORA-01732: data manipulation operation not legal on this view
UPDATE (
SELECT d.NAME, ch.NAME FROM (
SELECT 'test1' AS NAME, '2722' AS ID
FROM DUAL
) d
LEFT JOIN TABLEA a
ON UPPER(a.name) = UPPER(d.name)
)
SET a.name = 'test2'
WHERE a.name is null and a.id = d.id
I have tried merge, but just gave up thinking it's not possible. I've also considered not exists (but I'd have to be careful, since I might accidentally update every other record that doesn't match the criteria).
It should be straightforward:
update personnel
set personnel_number = 'xyz'
where person_id = 1001
and not exists (select * from personnel where personnel_number = 'xyz');
If I understand correctly, you want to conditionally update a field, assuming the value is not found. The following query does this. It should work in both SQL Server and Oracle:
update table1
set name = 'Test1'
where (select count(*) from table1 where name = 'Test1') = 0 and
id = 1234

Guidance on using the WITH clause in SQL

I understand how to use the WITH clause for recursive queries (!!), but I'm having problems understanding its general use / power.
For example the following query updates one record whose id is determined by using a subquery returning the id of the first record by timestamp:
update global.prospect psp
set status=status||'*'
where psp.psp_id=(
select p2.psp_id
from global.prospect p2
where p2.status='new' or p2.status='reset'
order by p2.request_ts
limit 1 )
returning psp.*;
Would this be a good candidate for using a WITH wrapper instead of the relatively ugly sub-query? If so, why?
If there can be concurrent write access to the involved tables, there are race conditions in the following queries. Consider:
Postgres UPDATE … LIMIT 1
Your example can use a CTE (common table expression), but it will give you nothing a subquery couldn't do:
WITH x AS (
SELECT psp_id
FROM global.prospect
WHERE status IN ('new', 'reset')
ORDER BY request_ts
LIMIT 1
)
UPDATE global.prospect psp
SET status = status || '*'
FROM x
WHERE psp.psp_id = x.psp_id
RETURNING psp.*;
The returned row will be the updated version.
If you want to insert the returned row into another table, that's where a WITH clause becomes essential:
WITH x AS (
SELECT psp_id
FROM global.prospect
WHERE status IN ('new', 'reset')
ORDER BY request_ts
LIMIT 1
)
, y AS (
UPDATE global.prospect psp
SET status = status || '*'
FROM x
WHERE psp.psp_id = x.psp_id
RETURNING psp.*
)
INSERT INTO z
SELECT *
FROM y;
Data-modifying queries using CTEs were added with PostgreSQL 9.1.
The manual about WITH queries (CTEs).
WITH lets you define "temporary tables" for use in a SELECT query. For example, I recently wrote a query like this, to calculate changes between two sets:
-- Let o be the set of old things, and n be the set of new things.
WITH o AS (SELECT * FROM things(OLD)),
n AS (SELECT * FROM things(NEW))
-- Select both the set of things whose value changed,
-- and the set of things in the old set but not in the new set.
SELECT o.key, n.value
FROM o
LEFT JOIN n ON o.key = n.key
WHERE o.value IS DISTINCT FROM n.value
UNION ALL
-- Select the set of things in the new set but not in the old set.
SELECT n.key, n.value
FROM o
RIGHT JOIN n ON o.key = n.key
WHERE o.key IS NULL;
By defining the "tables" o and n at the top, I was able to avoid repeating the expressions things(OLD) and things(NEW).
Sure, we could probably eliminate the UNION ALL using a FULL JOIN, but I wasn't able to do that in my particular case.
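For completeness, a sketch of what the FULL JOIN variant might look like (untested; behaviour differs slightly for new keys whose value is NULL, which may be why it didn't fit the original case):
WITH o AS (SELECT * FROM things(OLD)),
     n AS (SELECT * FROM things(NEW))
SELECT COALESCE(o.key, n.key) AS key, n.value
FROM o
FULL JOIN n ON o.key = n.key
WHERE o.value IS DISTINCT FROM n.value;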
If I understand your query correctly, it does this:
Find the oldest row in global.prospect whose status is 'new' or 'reset'.
Mark it by adding an asterisk to its status
Return the row (including our tweak to status).
I don't think WITH will simplify anything in your case. It may be slightly more elegant to use a FROM clause, though:
update global.prospect psp
set status = status || '*'
from ( select psp_id
from global.prospect
where status = 'new' or status = 'reset'
order by request_ts
limit 1
) p2
where psp.psp_id = p2.psp_id
returning psp.*;
Untested. Let me know if it works.
It's pretty much exactly what you have already, except:
This can be easily extended to update multiple rows (see the sketch below). In your version, which uses a subquery expression, the query would fail if the subquery were changed to yield multiple rows.
I did not alias global.prospect in the subquery, so it's a bit easier to read. Since this uses a FROM clause, you'll get an error if you accidentally reference the table being updated.
In your version, the subquery expression is encountered for every single item. Although PostgreSQL should optimize this and only evaluate the expression once, this optimization will go away if you accidentally reference a column in psp or add a volatile expression.
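A sketch of that multi-row extension (LIMIT 5 is an arbitrary example value):
update global.prospect psp
set status = status || '*'
from ( select psp_id
       from global.prospect
       where status in ('new', 'reset')
       order by request_ts
       limit 5
     ) p2
where psp.psp_id = p2.psp_id
returning psp.*;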

sql server-query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id from Profile
where
(@a is null or a = @a) and
(@b is null or b = @b) and
...(over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
SQL Server by default uses the clustered index to execute the query, but total execution time is over 300. We tested other solutions, such as a multi-column index on all the columns in the WHERE clause, but total execution time was then over 400.
Do you have any solution to make the total execution time lower than 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided. Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
One other option that I just remembered implementing in a system once: create a vertical table that includes all of the data you are searching on, and build up a query against it. This is easiest to do with dynamic SQL, but could be done using table-valued parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
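A sketch of the vertical table's DDL (names and types are illustrative only):
CREATE TABLE ProfileAttributes (
    ProfileID      INT          NOT NULL,
    AttributeName  VARCHAR(50)  NOT NULL,
    AttributeValue VARCHAR(400) NOT NULL
);

-- unique to make the search work properly; the index makes it perform well
CREATE UNIQUE INDEX IX_ProfileAttributes ON ProfileAttributes (ProfileID, AttributeName);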
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
OR (AttributeName = 'state' AND AttributeValue = 'MI') -- OR, not AND: each row holds a single attribute; HAVING COUNT(*) = 2 ensures both matched
GROUP BY ProfileID
HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
JOIN PassedInAttributeTable ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
GROUP BY ProfileID
HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
SELECT
[a].*
FROM
(SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
INNER JOIN
(SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
ON ([a].id = [b].id) OR ([b].id IS NULL)
INNER JOIN
(SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
ON ([a].id = [c].id) OR ([c].id IS NULL)
.
.
.
INNER JOIN
(SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
SELECT
ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
[filter].*
FROM
[filter]
)
SELECT
*
FROM
TempResult
WHERE
(RowNum >= @FirstRow)
AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
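For example (column names taken from the question's table; index names are made up):
CREATE NONCLUSTERED INDEX IX_Profile_city   ON Profile (city);
CREATE NONCLUSTERED INDEX IX_Profile_state  ON Profile (state);
CREATE NONCLUSTERED INDEX IX_Profile_degree ON Profile (degree);
-- ...and so on, one index per column that can appear in the WHERE clause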
You've got several issues, imho. One is that you're going to end up with a sequential scan no matter what you do.
But I think your more pressing issue here is that you have an unnecessary join:
SELECT profile.* FROM TempResult
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approaches of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of perf tuning and query optimisation. The approach I've found best is to generate dynamic SQL inside a stored proc. Most times you also need to add WITH RECOMPILE on the statement. The stored proc helps reduce the potential for SQL injection attacks. The recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
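A minimal sketch of that pattern (column and parameter names follow the question, only two filters shown, and the proc name is made up), building the WHERE clause only from supplied values and executing via sp_executesql so it stays parameterised:
CREATE PROCEDURE dbo.SearchProfiles
    @a INT = NULL,
    @b NVARCHAR(50) = NULL,
    @FirstRow INT,
    @LastRow INT
WITH RECOMPILE
AS
BEGIN
    -- build only the predicates for parameters that were actually supplied
    DECLARE @sql NVARCHAR(MAX) = N'
        WITH TempResult AS (
            SELECT ROW_NUMBER() OVER (ORDER BY Id DESC) AS RowNum, Id  -- sort column fixed here for brevity
            FROM Profile
            WHERE 1 = 1';

    IF @a IS NOT NULL SET @sql += N' AND a = @a';
    IF @b IS NOT NULL SET @sql += N' AND b = @b';

    SET @sql += N'
        )
        SELECT p.*
        FROM TempResult t
        JOIN Profile p ON p.Id = t.Id
        WHERE t.RowNum BETWEEN @FirstRow AND @LastRow;';

    EXEC sp_executesql @sql,
         N'@a INT, @b NVARCHAR(50), @FirstRow INT, @LastRow INT',
         @a = @a, @b = @b, @FirstRow = @FirstRow, @LastRow = @LastRow;
END;
Each distinct combination of supplied filters produces a different SQL text, so each gets its own plan and can use the indexes that actually match the supplied columns.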
I agree you should also look at the points mentioned above, like:
If you commonly only refer to a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) will work best if they are the lead column in the index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype". But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each & can combine them to use an equality or range query.
While it is often good to find row IDs with a small query & then join to get all the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire: if the 1st part of the query does a clustered index scan, then often it is faster to get the other columns you need in the select list & save the 2nd table scan.
So it's always good to try it both ways & see what works best.
Remember to run SET STATISTICS IO ON & SET STATISTICS TIME ON before running your tests. Then you can see where the IO is & it may help you with index selection for the most frequent combinations of parameters.
I hope this makes sense without long code samples. (it is on my other machine)