I am attempting to update multiple columns on a table with values from another row in the same table:
CREATE TEMP TABLE person (
pid INT
, name VARCHAR(40)
, dob DATE
, younger_sibling_name VARCHAR(40)
, younger_sibling_dob DATE
);
INSERT INTO person VALUES (pid, name, dob)
(1, 'John' , '1980-01-05')
, (2, 'Jimmy', '1975-04-25')
, (3, 'Sarah', '2004-02-10')
, (4, 'Frank', '1934-12-12')
;
The task is to populate younger_sibling_name and younger_sibling_dob with the name and birthday of the person that is closest to them in age, but not older or the same age.
I can set the younger sibling dob easily because this is the value that determines the record to use with a correlated subquery (I think this is an example of that?):
UPDATE person SET younger_sibling_dob = (
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);
I just can't see any way to get the name?
The real query of this will run over about 1M rows in groups of 100-500 for each MAX selection so performance is a concern.
Edit
After trying many different approaches, I've decided on this one which I think is a good balance of being able to verify the data with the intermediate result, shows the intention of what the logic is, and performs adequately:
WITH sibling AS (
SELECT person.pid, sibling.dob, sibling.name,
row_number() OVER (PARTITION BY person.pid
ORDER BY sibling.dob DESC) AS age_closeness
FROM person
JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
SET younger_sibling_name = sibling.name
,younger_sibling_dob = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
AND sibling.age_closeness = 1;
SELECT * FROM person ORDER BY dob;
Rewrite 2022
I expect your added solution to perform poorly, as it's doing a of of unnecessary work. The following should be much faster.
The question and the added solution do not define which row to pick when there are multiple with the same dob. Typically you'll want a deterministic pick. This query picks the alphabetically first name from each group of peers with the same dob. Adapt to your needs.
UPDATE person p
SET younger_sibling_name = y.name
, younger_sibling_dob = y.dob
FROM (
SELECT dob, name, lead(dob) OVER (ORDER BY dob) AS next_dob
FROM (
SELECT DISTINCT ON (dob)
dob, name
FROM person p
ORDER BY dob, name -- ①
) sub
) y
WHERE p.dob = y.next_dob;
db<>fiddle here - with extended test case
Works since at least Postgres 8.4.
Needs an index on dob to be fast, ideally a multicolumn index on (dob, name).
Subquery sub passes over the whole table once and distills distinct rows per dob.
① I added name to ORDER BY as tiebreaker to pick the row with the alphabetically first name. Adapt to our needs.
In the outer SELECT add the next later dob (next_dob) to each row with lead() - simple now with distinct dob. Then join to that next_dob and the rest is simple.
If no younger person exists, no UPDATE happens and the columns stay NULL.
About DISTINCT ON and possibly faster query techniques for many duplicates:
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest row per user
Taking dob and name from the same row guarantees we stay in sync. Multiple correlated subqueries would not offer this guarantee, and would be more expensive anyway.
Original answer
Still valid.
Old query 1
WITH cte AS (
SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
FROM person
)
UPDATE person p
SET younger_sibling_name = y.name
, younger_sibling_dob = y.dob
FROM cte x
JOIN (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk - 1
WHERE x.pid = p.pid;
Old sqlfiddle
In the CTE cte use the window function dense_rank() to get a rank without gaps according to the dop for every person.
Join cte to itself, but remove duplicates on dob from the second instance. Thereby everybody gets exactly one UPDATE. If more than one person share the same dop, the same one is selected as younger sibling for all persons on the next dob. I do this with:
(SELECT DISTINCT ON (rnk) * FROM cte)
Add ORDER BY rnk, ... to this subquery to pick a particular person for every dob.
Old query 2
WITH cte AS (
SELECT dob, min(name) AS name
, row_number() OVER (ORDER BY dob) rn
FROM person p
GROUP BY dob
)
UPDATE person p
SET younger_sibling_name = y.name
, younger_sibling_dob = y.dob
FROM cte x
JOIN cte y ON y.rn = x.rn - 1
WHERE x.dob = p.dob;
Old sqlfiddle
This works, because aggregate functions are applied before window functions. And it should be very fast since both operations agree on the sort order.
Obviates the need for a later DISTINCT like in query 1.
Result is the same as query 1, exactly.
Again, you can add more columns to ORDER BY to pick a particular person for every dob.
1) Finding the MAX() can alway be rewritten in terms of NOT EXISTS (...)
UPDATE person dst
SET younger_sibling_name = src.name
,younger_sibling_dob = src.dob
FROM person src
WHERE src.dob < dst.dob
OR src.dob = dst.dob AND src.pid < dst.pid
AND NOT EXISTS (
SELECT * FROM person nx
WHERE nx.dob < dst.dob
OR nx.dob = dst.dob AND nx.pid < dst.pid
AND nx.dob > src.dob
OR nx.dob = src.dob AND nx.pid > src.pid
);
2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:
UPDATE person dst
SET younger_sibling_name = src.name
,younger_sibling_dob = src.dob
FROM (
SELECT pid
, LAG(name) OVER win AS name
, LAG(dob) OVER win AS dob
FROM person
WINDOW win AS (ORDER BY dob, pid)
) src
WHERE src.pid = dst.pid
;
Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.
To get the dob and name, you can do:
update person
set younger_sibling_dob = (select dob
from person p2
where s.dob < person.dob
order by dob desc
limit 1),
younger_sibling_name = (select name
from person p2
where s.dob < person.dob
order by dob desc
limit 1)
If you have an index on dob, then the query will run faster.
Related
I have a postgresql table contains a list of email addresses. The table has three columns, Email, EmailServer (e.g., gmail.com, outlook.com, msn.com, and yahoo.com.ca etc.), and Valid (boolean).
Now, I want to group those emails by EmailServer and then update the first 3 records of each large group (count >=6) as Valid = true while leaving the rest of each group as Valid = false.
I failed to get the wanted output by below query:
UPDATE public."EmailContacts"
SET "Valid"=true
WHERE "EmailServer" IN (
SELECT "EmailServer"
FROM public."EmailContacts"
GROUP by "EmailServer"
HAVING count(*) >=6
LIMIT 5)
Please help to modify so as to get the expected results. Would be greatly appreciated for any kind of your help!
WITH major_servers AS (
SELECT email_server
FROM email_address
GROUP by email_server
HAVING count(*) >=6
),
enumerated_emails AS (
SELECT email,
email_server,
row_number() OVER (PARTITION BY email_server ORDER BY email) AS row_number --TODO:: ORDER BY email - attention
FROM email_address
WHERE email_server IN (SELECT email_server FROM major_servers)
)
UPDATE email_address
SET valid = true
WHERE email IN (SELECT email
FROM enumerated_emails ee
WHERE ee.row_number <= 3);
The first query major_servers finds major groups where more than 5 email servers exist.
The second query enumerated_emails enumerates emails by their natural order (see a TODO comment, I think you should choose another ORDER BY criteria) which belong to major groups using window function row_number().
The last query updates the first 3 rows in each major server group.
Find the sql-fiddle here.
You need to get the servers, then order the mails from which one and then perform the update. Something like this:
WITH DataSourceServers AS
(
SELECT "EmailServer"
FROM public."EmailContacts"
GROUP by "EmailServer"
HAVING count(*) >=6
),DataSourceEmails AS
(
SELECT "Email", row_number() OVER (PARTITION BY "EmailServer" ORDER BY "Email") AS rn
FROM public."EmailContacts"
WHERE "EmailServer" IN (SELECT "EmailServer" FROM DataSourceServers)
)
UPDATE public."EmailContacts"
SET "Valid" = true
FROM public."EmailContacts" E
INNER JOIN DataSourceEmails SE
WHERE E."EmailServer" = SE."EmailServer"
AND E."Email" = SE."Email"
AND SE.rn <= 3;
The following query returns the results that I need but I have to add the ID of the row to then update it. If I add the ID directly in the select statement it will return me more results then I need because each ID is unique so the DISTINCT statement see the line as unique.
SELECT DISTINCT ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
So basically I need to add ucpse.ID in the Select statement while keeping DISTINCT values for MemberID,ProductID and UserID.
Any Ideas ?
Thank you
According to you comment:
If the data has been duplicated 67 times for a given employee with a given product and a given client, I need to keep only one of thoses records. It's not important which one, so this is why I use DISTINC to obtain unique combinaison of given employee with a given product and a given client.
You can use MIN() or MAX() and GROUP BY instead of DISTINCT
SELECT MAX(ucpse.ID) AS ID, ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
UPDATE:
From you comments I think the below query is what you need
DELETE FROM UserCustomerProductSalaryExceptions
WHERE ID NOT IN ( SELECT MAX(ucpse.ID) AS ID
FROM #UserCustomerProductSalaryExceptions
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
HAVING COUNT(ucpse.ID) >= 2
)
If all you want is to delete the duplicates, this will do it:
WITH X AS
(SELECT ID,
ROW_NUMBER() OVER (PARTITION BY MemberID, ProductID, UserID ORDER BY ID) AS DupRowNum<br
FROM UserCustomerProductSalaryExceptions
)
DELETE X WHERE DupRowNum > 1
ID's not necessary - try:
UPDATE uu SET
<your settings here>
FROM UserCustomerProductSalaryExceptions uu
JOIN ( <paste your entire query above here>
) uc ON uc.MemberID=uu.MemberId AND uc.ProductID=uu.ProductId AND uc.UserID=uu.UserId
From the sound of your data structure (which I would STRONGLY advise normalizing as soon as possible), it sounds like you should be updating all the records. It sounds as if each duplicate is important because it contains some information about an employee's relation to a customer or product.
I would probably update all the records. Try this:
UPDATE UCPSE
SET
--Do your updates here
FROM UserCustomerProductSalaryExceptions as ucpse
JOIN
(
SELECT UserID, MemberID, ProductID
FROM UserCustomerProductSalaryExceptions
GROUP BY UserID, MemberID, ProductID
HAVING COUNT(UserID) >= 2
) T
ON ucpse.UserID = T.UserID AND ucpse.MemberID = T.MemberID AND ucpse.ProductID = T.ProductID
I am looking for an optimized query
let me show you a small example.
Lets suppose I have a table having three field studentId, teacherId and subject as
Now I want those data in which a physics teacher is teaching to only one student, i.e
teacher 300 is only teaching student 3 and so on.
What I have tried till now
select sid,tid from tabletesting with(nolock)
where tid in (select tid from tabletesting with(nolock)
where subject='physics' group by tid having count(tid) = 1)
and subject='physics'
The above query is working fine. But I want different solution in which I don't have to scan the same table twice.
I also tried using Rank() and Row_Number() but no result.
FYI :
I have showed you an example, this is not the actual table i am playing with, my table contain huge number of rows and columns and where clause is also very complex(i.e date comparison etc.), so I don't want to give the same where clause in subquery and outquery.
You can do this with window functions. Assuming that there are no duplicate students for a given teacher (as in your sample data):
select tt.sid, tt.tid
from (select tt.*, count(*) over (partition by teacher) as scnt
from TableTesting tt
) tt
where scnt = 1;
Another way to approach this, which might be more efficient, is to use an exists clause:
select tt.sid, tt.tid
from TableTesting tt
where not exists (select 1 from TableTesting tt1 where tt1.tid = tt.tid and tt1.sid <> tt.sid)
Another option is to use an analytic function:
select sid, tid, subject from
(
select sid, tid, subject, count(sid) over (partition by subject, tid) cnt
from tabletesting
) X
where cnt = 1
I have been trying to find some info on how to select a non-aggregate column that is not contained in the Group By statement in SQL, but nothing I've found so far seems to answer my question. I have a table with three columns that I want from it. One is a create date, one is a ID that groups the records by a particular Claim ID, and the final is the PK. I want to find the record that has the max creation date in each group of claim IDs. I am selecting the MAX(creation date), and Claim ID (cpe.fmgcms_cpeclaimid), and grouping by the Claim ID. But I need the PK from these records (cpe.fmgcms_claimid), and if I try to add it to my select clause, I get an error. And I can't add it to my group by clause because then it will throw off my intended grouping. Does anyone know any workarounds for this? Here is a sample of my code:
Select MAX(cpe.createdon) As MaxDate, cpe.fmgcms_cpeclaimid
from Filteredfmgcms_claimpaymentestimate cpe
where cpe.createdon < 'reportstartdate'
group by cpe.fmgcms_cpeclaimid
This is the result I'd like to get:
Select MAX(cpe.createdon) As MaxDate, cpe.fmgcms_cpeclaimid, cpe.fmgcms_claimid
from Filteredfmgcms_claimpaymentestimate cpe
where cpe.createdon < 'reportstartdate'
group by cpe.fmgcms_cpeclaimid
The columns in the result set of a select query with group by clause must be:
an expression used as one of the group by criteria , or ...
an aggregate function , or ...
a literal value
So, you can't do what you want to do in a single, simple query. The first thing to do is state your problem statement in a clear way, something like:
I want to find the individual claim row bearing the most recent
creation date within each group in my claims table
Given
create table dbo.some_claims_table
(
claim_id int not null ,
group_id int not null ,
date_created datetime not null ,
constraint some_table_PK primary key ( claim_id ) ,
constraint some_table_AK01 unique ( group_id , claim_id ) ,
constraint some_Table_AK02 unique ( group_id , date_created ) ,
)
The first thing to do is identify the most recent creation date for each group:
select group_id ,
date_created = max( date_created )
from dbo.claims_table
group by group_id
That gives you the selection criteria you need (1 row per group, with 2 columns: group_id and the highwater created date) to fullfill the 1st part of the requirement (selecting the individual row from each group. That needs to be a virtual table in your final select query:
select *
from dbo.claims_table t
join ( select group_id ,
date_created = max( date_created )
from dbo.claims_table
group by group_id
) x on x.group_id = t.group_id
and x.date_created = t.date_created
If the table is not unique by date_created within group_id (AK02), you you can get duplicate rows for a given group.
You can do this with PARTITION and RANK:
select * from
(
select MyPK, fmgcms_cpeclaimid, createdon,
Rank() over (Partition BY fmgcms_cpeclaimid order by createdon DESC) as Rank
from Filteredfmgcms_claimpaymentestimate
where createdon < 'reportstartdate'
) tmp
where Rank = 1
The direct answer is that you can't. You must select either an aggregate or something that you are grouping by.
So, you need an alternative approach.
1). Take you current query and join the base data back on it
SELECT
cpe.*
FROM
Filteredfmgcms_claimpaymentestimate cpe
INNER JOIN
(yourQuery) AS lookup
ON lookup.MaxData = cpe.createdOn
AND lookup.fmgcms_cpeclaimid = cpe.fmgcms_cpeclaimid
2). Use a CTE to do it all in one go...
WITH
sequenced_data AS
(
SELECT
*,
ROW_NUMBER() OVER (PARITION BY fmgcms_cpeclaimid ORDER BY CreatedOn DESC) AS sequence_id
FROM
Filteredfmgcms_claimpaymentestimate
WHERE
createdon < 'reportstartdate'
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
NOTE: Using ROW_NUMBER() will ensure just one record per fmgcms_cpeclaimid. Even if multiple records are tied with the exact same createdon value. If you can have ties, and want all records with the same createdon value, use RANK() instead.
You can join the table on itself to get the PK:
Select cpe1.PK, cpe2.MaxDate, cpe1.fmgcms_cpeclaimid
from Filteredfmgcms_claimpaymentestimate cpe1
INNER JOIN
(
select MAX(createdon) As MaxDate, fmgcms_cpeclaimid
from Filteredfmgcms_claimpaymentestimate
group by fmgcms_cpeclaimid
) cpe2
on cpe1.fmgcms_cpeclaimid = cpe2.fmgcms_cpeclaimid
and cpe1.createdon = cpe2.MaxDate
where cpe1.createdon < 'reportstartdate'
Thing I like to do is to wrap addition columns in aggregate function, like max().
It works very good when you don't expect duplicate values.
Select MAX(cpe.createdon) As MaxDate, cpe.fmgcms_cpeclaimid, MAX(cpe.fmgcms_claimid) As fmgcms_claimid
from Filteredfmgcms_claimpaymentestimate cpe
where cpe.createdon < 'reportstartdate'
group by cpe.fmgcms_cpeclaimid
What you are asking, Sir, is as the answer of RedFilter.
This answer as well helps in understanding why group by is somehow a simpler version or partition over:
SQL Server: Difference between PARTITION BY and GROUP BY
since it changes the way the returned value is calculated and therefore you could (somehow) return columns group by can not return.
You can use as below,
Select X.a, X.b, Y.c from (
Select X.a as a, sum (b) as sum_b from name_table X
group by X.a)X
left join from name_table Y on Y.a = X.a
Example;
CREATE TABLE #products (
product_name VARCHAR(MAX),
code varchar(3),
list_price [numeric](8, 2) NOT NULL
);
INSERT INTO #products VALUES ('paku', 'ACE', 2000)
INSERT INTO #products VALUES ('paku', 'ACE', 2000)
INSERT INTO #products VALUES ('Dinding', 'ADE', 2000)
INSERT INTO #products VALUES ('Kaca', 'AKB', 2000)
INSERT INTO #products VALUES ('paku', 'ACE', 2000)
--SELECT * FROM #products
SELECT distinct x.code, x.SUM_PRICE, product_name FROM (SELECT code, SUM(list_price) as SUM_PRICE From #products
group by code)x
left join #products y on y.code=x.code
DROP TABLE #products
CREATE TABLE doctor( patient CHAR(13), docname CHAR(30) );
Say I had a table like this, then how would I display the names of the doctors that have the most patients? Like if the most was three and two doctors had three patients then I would display both of their names.
This would get the max patients:
SELECT MAX(count)
FROM (SELECT COUNT(docname) FROM doctor GROUP BY docname) a;
This is all the doctors and how many patients they have:
SELECT docname, COUNT(docname) FROM doctor GROUP BY name;
Now I can't figure out how to combine them to list only the names of doctors who have the max patients.
Thanks.
This should do it.
SELECT docname, COUNT(*) FROM doctor GROUP BY name HAVING COUNT(*) =
(SELECT MAX(c) FROM
(SELECT COUNT(patient) AS c
FROM doctor
GROUP BY docname))
On the other hand if you require only the first entry, then
SELECT docname, COUNT(docname) FROM doctor
GROUP BY name
ORDER BY COUNT(docname) DESC LIMIT 1;
This should do it for you:
SELECT docname
FROM doctor
GROUP BY docname
HAVING COUNT(patient)=
(SELECT MAX(patientcount) FROM
(SELECT docname,COUNT(patient) AS patientcount
FROM doctor
GROUP BY docname) t1)
Here's another alternative that only has one subquery instead of two:
SELECT docname
FROM author
GROUP BY name
HAVING COUNT(*) = (
SELECT COUNT(*) AS c
FROM author
GROUP BY name
ORDER BY c DESC
LIMIT 1
)
Allowing for any feature in any ISO SQL specification since you did not specify a database product or version, and assuming that the table of patients is called "patients" and has a column called "docname", the following might give you what you wanted:
With PatientCounts As
(
Select docname
, Count(*) As PatientCount
From patient
Group By docname
)
, RankedCounts As
(
Select docname, PatientCount
, Rank() Over( Order By PatientCount ) As PatientCountRank
From PatientCounts
)
Select docname, PatientCount, PatientCountRank
From RankedCounts
Where PatientCountRank = 1
While using ... HAVING COUNT(*) = ( ...MAX().. ) works:
Within the query, it needs almost the same sub-query twice.
For most databases, it needs a 2nd level sub-query as MAX( COUNT(*) )
is not supported.
While using TOP / LIMIT / RANK etc works:
It uses SQL extensions for a specific database.
Also, using TOP / LIMIT of 1 will only give one row - what if there are two or more doctors with the same maximum number of patients?
I would break the problem into steps:
Get target field(s) and associated count
SELECT docName, COUNT( patient ) AS countX
FROM doctor
GROUP BY docName
Using the above as a 'statement scoped view', use this to get the max count row(s)
WITH x AS
(
SELECT docName, COUNT( patient ) AS countX
FROM doctor
GROUP BY docName
)
SELECT x.docName, x.countX
FROM x
WHERE x.countX = ( SELECT MAX( countX ) FROM x )
The WITH clause, which defines a 'statement scoped view', effectively gives named sub-queries that can be re-used within the same query.
While this solution, using statement scoped views, is longer, it is:
Easier to test
Self documenting
Extendable
It is easier to test as parts of the query can be run standalone.
It is self documenting as the query directly reflects the requirement
ie the statement scoped view lists the target field(s) and associated count.
It is extendable as if other conditions or fields are required, this can be easily added to the statement scoped view.
eg in this case, the table stucture should be changed to include a doctor-id as a primary key field and this should be part of the results.
Take both queries and join them together to get the max:
SELECT
docName,
m.MaxCount
FROM
author
INNER JOIN
(
SELECT
MAX(count) as MaxCount,
docName
FROM
(SELECT
COUNT(docname)
FROM
doctor
GROUP BY
docname
)
) m ON m.DocName = author.DocName
Another alternative using CTE:
with cte_DocPatients
as
(
select docname, count(*) as patientCount
from doctor
group by docname
)
select docname, patientCount from
cte_DocPatients where
patientCount = (select max(patientCount) from cte_DocPatients)
if you do not need to care about performance I think just sorting and pick first element. Something like this:
SELECT docname, COUNT(docname) as CNT
FROM doctor
WHERE docname = docname
GROUP BY docname
ORDER BY CNT DESC
LIMIT 1
This will give you each doctor name and respective count of treating patients
SELECT docname, COUNT(docname) as TreatingPatients FROM doctor
WHERE docname = docname
GROUP BY docname