Grouping while maintaining next record - sql

I have a table (NerdsTable) with some of this data:
id | name   | school
---+--------+-------
 1 | Joe    | ODU
 2 | Mike   | VCU
 3 | Ane    | ODU
 4 | Trevor | VT
 5 | Cools  | VCU
When I run the following query
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable where school = 'ODU';
I get these results:
[id=1,name=Joe,nextid=3]
[id=3,name=Ane,nextid=NULL]
I want to write a query that does not need the static check
where school = 'ODU'
but gives back the same results as above. In other words, I want to select all rows in the table and have them grouped correctly, as if I went through individually and ran queries for:
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'ODU';
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'VCU';
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'VT';
Here is the output I am hoping to see:
[id=1,name=Joe,nextid=3]
[id=3,name=Ane,nextid=NULL]
[id=2,name=Mike,nextid=5]
[id=5,name=Cools,nextid=NULL]
[id=4,name=Trevor,nextid=NULL]
Here is what I have tried, but am failing miserably:
SELECT id, name,
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
ORDER BY school;
-- Problem, as this does not sort by the id. I need the lowest id first for the group
SELECT id, name,
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
ORDER BY id, school;
-- Sorts by id, but the grouping is not correct, thus next_id is wrong
I then looked on the Microsoft doc site for aggregate functions, but do not see how I can use any of them to group my results correctly. I tried to use GROUPING_ID, as follows:
SELECT id, GROUPING_ID(name),
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
group by school;
But I get an error:
is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
Any idea as to what I am missing here?

From your desired output it looks like you are just trying to order the records by school. You can do that like this:
SELECT id, name
FROM dbo.NerdsTable
ORDER BY school ASC, id ASC
I don't know what next_id is supposed to mean.

create table schools (id int, name varchar(50), school varchar(3));

insert into schools values
    (1, 'Joe', 'ODU'), (2, 'Mike', 'VCU'), (3, 'Ane', 'ODU'),
    (4, 'Trevor', 'VT'), (5, 'Cools', 'VCU'), (6, 'Sarah', 'VCU');

select n.id, n.name, min(g.id) as nextid
from schools n
left join
(
    select id, school
    from schools
) g on g.school = n.school and g.id > n.id
group by n.id, n.name;

drop table schools;
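If the goal is simply to restart LEAD within each school, the window itself can be partitioned instead of using a self-join. A minimal sketch, assuming SQL Server and the dbo.NerdsTable from the question:

SELECT id, name,
       LEAD(id) OVER (PARTITION BY school ORDER BY id) AS next_id
FROM dbo.NerdsTable
ORDER BY school, id;   -- keeps each school's rows together, lowest id first

For the sample data this should return the same rows as the three per-school queries in the question.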


SQL Apply Distinct One Column [duplicate]

This question already has answers here:
Get top 1 row of each group
(19 answers)
Closed 9 months ago.
I have the SQL query below. I want to apply DISTINCT to the Name column in this query. Can you help me?
SELECT Id,
ConflictCheckRequestIndividualId,
Name,
Surname,
it.IndividualType AS IndividualType,
JobTitle,
RegistrationNumber,
Title,
District,
Status,
CreatedBy,
Created,
ModifiedBy,
Modified
FROM ConflictCheckItoIndividual
LEFT JOIN #IndividualTypes it
ON it.IndividualId = ConflictCheckRequestIndividualId
WHERE ConflictCheckRequestIndividualId IN
(SELECT Id
FROM ConflictCheckRequestIndividual
WHERE ConflictCheckRequestId = #ConflictId
AND SubStatus = 2)
Two ways with subtly different results. "GROUP BY X" is another way of saying "Give me one row per X". You will have to apply an aggregation function to every other column so it knows how to squash the rows into one:
SELECT MAX(Id),
MAX(ConflictCheckRequestIndividualId),
Name,
MAX(Surname),
MAX(it.IndividualType) AS IndividualType,
MAX(JobTitle),
MAX(RegistrationNumber),
MAX(Title),
MAX(District),
MAX(Status),
MAX(CreatedBy),
MAX(Created),
MAX(ModifiedBy),
MAX(Modified)
FROM ConflictCheckItoIndividual
LEFT JOIN #IndividualTypes it
ON it.IndividualId = ConflictCheckRequestIndividualId
WHERE ConflictCheckRequestIndividualId IN
(SELECT Id
FROM ConflictCheckRequestIndividual
WHERE ConflictCheckRequestId = #ConflictId
AND SubStatus = 2)
GROUP BY Name
This might end up with data for each field coming from different rows. If you wanted all data to come from one row, you could do this
;WITH cte AS
(
SELECT Id,
ConflictCheckRequestIndividualId,
Name,
Surname,
it.IndividualType AS IndividualType,
JobTitle,
RegistrationNumber,
Title,
District,
Status,
CreatedBy,
Created,
ModifiedBy,
Modified,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Modified DESC) as rownum
FROM ConflictCheckItoIndividual
LEFT JOIN #IndividualTypes it
ON it.IndividualId = ConflictCheckRequestIndividualId
WHERE ConflictCheckRequestIndividualId IN
(SELECT Id
FROM ConflictCheckRequestIndividual
WHERE ConflictCheckRequestId = #ConflictId
AND SubStatus = 2)
)
SELECT * FROM cte WHERE rownum = 1
This is "Partitioning" the data into one bucket per Name. Within each bucket, its ordering the rows by Modified in descending order. We then only pick out one row from each bucket - the most recently Modified one.
As an aside, given that there is a Name and a Surname field, I would expect to group by that as well
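A minimal sketch of that variant, assuming the same CTE and columns as above; only the window definition changes:

-- one row per (Name, Surname) pair instead of one row per Name
ROW_NUMBER() OVER (PARTITION BY Name, Surname ORDER BY Modified DESC) as rownum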
Just realize what you're asking for means you get a somewhat arbitrary row for each particular Name, but if that's good enough for you at this moment, this code should do:
select * from (
SELECT
ROW_NUMBER() over (partition by Name order by Id) [row],
Id,
ConflictCheckRequestIndividualId,
Name,
Surname,
it.IndividualType AS IndividualType,
JobTitle,
RegistrationNumber,
Title,
District,
Status,
CreatedBy,
Created,
ModifiedBy,
Modified
FROM ConflictCheckItoIndividual
LEFT JOIN #IndividualTypes it
ON it.IndividualId = ConflictCheckRequestIndividualId
WHERE ConflictCheckRequestIndividualId IN
(SELECT Id
FROM ConflictCheckRequestIndividual
WHERE ConflictCheckRequestId = #ConflictId
AND SubStatus = 2)) data
where [row] = 1
Will you let me know if this works for you?

Display duplicate row indicator and get only one row when duplicate

I built the schema at http://sqlfiddle.com/#!18/7e9e3
CREATE TABLE BoatOwners
(
BoatID INT,
OwnerDOB DATETIME,
Name VARCHAR(200)
);
INSERT INTO BoatOwners (BoatID, OwnerDOB,Name)
VALUES (1, '2021-04-06', 'Bob1'),
(1, '2020-04-06', 'Bob2'),
(1, '2019-04-06', 'Bob3'),
(2, '2012-04-06', 'Tom'),
(3, '2009-04-06', 'David'),
(4, '2006-04-06', 'Dale1'),
(4, '2009-04-06', 'Dale2'),
(4, '2013-04-06', 'Dale3');
I would like to write a query that would produce a result with the following characteristics:
Returns only one owner per boat
When multiple owners on a single boat, return the youngest owner.
Display a column to indicate if a boat has multiple owners.
Applying that query to the data set above should produce one row per boat (the youngest owner), plus a column indicating whether the boat has multiple owners.
I tried
ROW_NUMBER() OVER (PARTITION BY ....
but haven't had much luck so far.
with data as (
select BoatID, OwnerDOB, Name,
row_number() over (partition by BoatID order by OwnerDOB desc) as rn,
count(*) over (partition by BoatID) as cnt
from BoatOwners
)
select BoatID, OwnerDOB, Name,
case when cnt > 1 then 'Yes' else 'No' end as MultipleOwner
from data
where rn = 1
This is just a case of numbering the rows for each BoatId group and also counting the rows in each group, then filtering accordingly:
select BoatId, OwnerDob, Name, Iif(qty = 1, 'No', 'Yes') as MultipleOwner
from (
    select *,
           Row_Number() over (partition by BoatID order by OwnerDOB desc) as rn,
           Count(*) over (partition by BoatID) as qty
    from BoatOwners
) b
where rn = 1
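For the sample data in the question, both queries should produce the following (worked out by hand, keeping the latest date of birth per boat):

BoatID | OwnerDOB   | Name  | MultipleOwner
1      | 2021-04-06 | Bob1  | Yes
2      | 2012-04-06 | Tom   | No
3      | 2009-04-06 | David | No
4      | 2013-04-06 | Dale3 | Yes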

Aggregating consecutive rows in SQL

Given the sql table (I'm using SQLite3):
CREATE TABLE person(name text, number integer);
And filling with the values:
insert into person values
('Leandro', 2),
('Leandro', 4),
('Maria', 8),
('Maria', 16),
('Jose', 32),
('Leandro', 64);
What I want is to get the sum of the number column, but only for consecutive rows, so that I get a result that maintains the original insertion order:
Leandro|6
Maria|24
Jose|32
Leandro|64
The "closest" I got so far is:
select name, sum(number) over(partition by name) from person order by rowid;
But it clearly shows I'm far from understanding SQL, as the most important features (grouping and summation of consecutive rows) are missing, but at least the order is there :-):
Leandro|70
Leandro|70
Maria|24
Maria|24
Jose|32
Leandro|70
Preferably the answer should not require the creation of temporary tables, as the output is expected to always follow the order in which the data was inserted.
This is a type of gaps-and-islands problem. You can use the difference of row numbers for this purpose:
select name, sum(number)
from (select p.*,
row_number() over (order by number) as seqnum,
row_number() over (partition by name order by number) as seqnum_1
from person p
) p
group by name, (seqnum - seqnum_1)
order by min(number);
Why this works is a little tricky to explain. However, it becomes pretty obvious when you look at the results of the subquery. The difference of row numbers is constant on adjacent rows when the name does not change.
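To make that concrete, here are the subquery's row numbers worked out by hand for the sample data:

name    | number | seqnum | seqnum_1 | seqnum - seqnum_1
Leandro |      2 |      1 |        1 |                 0
Leandro |      4 |      2 |        2 |                 0
Maria   |      8 |      3 |        1 |                 2
Maria   |     16 |      4 |        2 |                 2
Jose    |     32 |      5 |        1 |                 4
Leandro |     64 |      6 |        3 |                 3

Grouping by (name, seqnum - seqnum_1) then yields the sums 6, 24, 32 and 64, and ordering by min(number) restores the original order for this sample data.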
You can do it with window functions:
LAG() to check if the previous name is the same as the current one
SUM() to create groups for consecutive same names
and then group by the groups and aggregate:
select name, sum(number) total
from (
select *, sum(flag) over (order by rowid) grp
from (
select *, rowid, name <> lag(name, 1, '') over (order by rowid) flag
from person
)
)
group by grp
Results:
name    | total
--------+------
Leandro |     6
Maria   |    24
Jose    |    32
Leandro |    64
I would change the create table statement to the following:
CREATE TABLE person(id integer, firstname nvarchar(255), number integer);
You need a third column to determine the insert order.
I would rename the column name to something like firstname, because name is a keyword in some DBMSs. The same applies to the column named number. Moreover, I would change the text type of name to nvarchar, because it is sortable in the GROUP BY clause.
Then you can insert your data:
insert into person values
(1, 'Leandro', 2),
(2, 'Leandro', 4),
(3, 'Maria', 8),
(4, 'Maria', 16),
(5, 'Jose', 32),
(6, 'Leandro', 64);
After that you can query the data in the following way:
SELECT firstname, value FROM (
SELECT p.id, p.firstname, p.number, LAG(p.firstname) over (ORDER BY p.id) as prevname,
CASE
WHEN firstname LIKE LEAD(p.firstname) over (ORDER BY p.id) THEN number + LEAD(p.number) over(ORDER BY p.id)
ELSE number
END as value
FROM Person p
) AS temp
WHERE temp.firstname <> temp.prevname OR
temp.prevname IS NULL
First you compute the value in the CASE expression.
Then you filter the data and keep only the entries whose previous name is not the same as the current name.
To understand the query better, you can run the subquery on its own:
SELECT p.id, p.firstname, p.number, LEAD(p.firstname) over (ORDER BY p.id) as nextname, LAG(p.firstname) over (ORDER BY p.id) as prevname,
CASE
WHEN firstname LIKE LEAD(p.firstname) over (ORDER BY p.id) THEN number + LEAD(p.number) over(ORDER BY p.id)
ELSE number
END as value
FROM Person p
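For the sample rows (ids 1 to 6), that subquery should give, worked out by hand:

id | firstname | number | nextname | prevname | value
1  | Leandro   |      2 | Leandro  | NULL     |     6
2  | Leandro   |      4 | Maria    | Leandro  |     4
3  | Maria     |      8 | Maria    | Leandro  |    24
4  | Maria     |     16 | Jose     | Maria    |    16
5  | Jose      |     32 | Leandro  | Maria    |    32
6  | Leandro   |     64 | NULL     | Jose     |    64

The outer WHERE then keeps only the rows where prevname is NULL or differs from firstname, leaving 6, 24, 32 and 64.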
Based on Gordon Linoff's answer (https://stackoverflow.com/a/64727401/1721672), I extracted the inner select as a CTE, and the following query works pretty well:
with p(name, number, seqnum, seqnum_1) as
(select name, number,
row_number() over (order by number) as seqnum,
row_number() over (partition by name order by number) as seqnum_1
from person)
select
name, sum(number)
from
p
group by
name, (seqnum - seqnum_1)
order by
min(number);
Producing the expected result:
Leandro|6
Maria|24
Jose|32
Leandro|64

Group by two columns, take sum, then max

There are three columns: Id (char), Name (char), and Score (int).
First, we group by Id and Name and add Score for each group. Let us call the added score total_score.
Then, we group by Name and take only the maximum of total_score and its corresponding Id and Name. I've got everything else but I'm having a hard time figuring out how to get the Id. The error I get is
Column 'Id' is invalid in the select list because
it is not contained in either an aggregate function or the GROUP BY
clause.
WITH Tmp AS
(SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id,
Name)
SELECT Name, -- Id,
MAX(total_score) AS max_score
FROM Tmp
GROUP BY Name
ORDER BY max_score DESC
Just add a row_number(), partitioned by Name, to your query and take the first row (ordered by total_score descending):
select *
from
(
-- your existing `total_score` query
SELECT Id, Name,
SUM(Score) AS total_score,
r = row_number() over (partition by Name order by SUM(Score) desc)
FROM Mytable
GROUP BY Id, Name
) d
where r = 1
WITH Tmp AS
(SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id,
Name)
SELECT Name, Id,
MAX(total_score) AS max_score
FROM Tmp
GROUP BY Name,id
ORDER BY max_score DESC
Try this. Hope this will help.
WITH Tmp AS
(
SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id,
NAME
)
SELECT Name, Id,
MAX(total_score) AS max_score
FROM Tmp
GROUP BY Name,id
ORDER BY max_score DESC
Note: if we use an aggregate function, every other selected column has to appear in the GROUP BY clause.
In your case you are using SUM(Score) as the aggregate function, so the other selected columns have to be grouped by as well.
I am not sure about the performance of the query below, but we can use window functions to get the maximum value from each data partition.
SELECT
Id,
Name,
SUM(Score) AS total_score,
MAX(SUM(Score)) OVER(Partition by Name) AS max_score
FROM Mytable
GROUP BY Id, Name;
Tested -
declare #Mytable table (id int, name varchar(10), score int);
insert into #Mytable values
(1,'abc', 100),
(2,'abc', 200),
(3,'def', 300),
(3,'def', 400),
(4,'pqr', 500);
Output -
Id Name total_score max_score
1 abc 100 200
2 abc 200 200
3 def 700 700
4 pqr 500 500
You can compute DENSE_RANK() over the total_score column and then select the records with Rank = 1. This also works when there are multiple rows with the same total_score.
WITH Tmp AS
(SELECT Id,
Name,
SUM(Score) AS total_score
FROM Mytable
GROUP BY Id, Name)
SELECT Id,
Name,
total_score AS max_score
FROM (SELECT Id,
Name,
total_score,
DENSE_RANK() OVER (PARTITION BY Name ORDER BY total_score DESC) AS Rank
FROM Tmp) AS Tmp2
WHERE Rank = 1
You can try this as well:
select id, name, max(total_score) over (partition by name) as max_score
from (
    select id, name, sum(score) as total_score
    from YOURTABLE
    group by id, name
) t

How to find duplicate records in PostgreSQL

I have a PostgreSQL database table called "user_links" which currently allows the following duplicate fields:
year, user_id, sid, cid
The unique constraint is currently on the first field, called "id". However, I now want to add a constraint to make sure the combination of year, user_id, sid and cid is unique, but I cannot apply the constraint because duplicate values already exist that violate it.
Is there a way to find all duplicates?
The basic idea is to use a nested query with count aggregation:
select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1
You can adjust the where clause in the inner query to narrow the search.
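For the four columns in the question, that inner where clause would compare all of them. A sketch, assuming the user_links table from the question (note that plain = comparisons will not match rows where these columns are NULL):

select * from user_links ou
where (select count(*) from user_links inr
       where inr.year = ou.year
         and inr.user_id = ou.user_id
         and inr.sid = ou.sid
         and inr.cid = ou.cid) > 1;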
There is another good solution for that mentioned in the comments (but not everyone reads them):
select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1
Or shorter:
SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
From "Find duplicate rows with PostgreSQL" here's smart solution:
select * from (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
FROM tbl
) dups
where
dups.Row > 1
In order to make it easier I assume that you wish to apply a unique constraint only for column year and the primary key is a column named id.
In order to find duplicate values you should run,
SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);
Using the SQL statement above, you get a result set that contains all the duplicate years in your table. To delete all the duplicates except for the latest duplicate entry, you should use the following SQL statement:
DELETE
FROM YOUR_TABLE A USING YOUR_TABLE B
WHERE A.year = B.year AND A.id < B.id;
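Once the duplicates are gone, the constraint from the question can be added. A minimal sketch, assuming the user_links table; the constraint name here is just an example:

ALTER TABLE user_links
    ADD CONSTRAINT user_links_year_user_sid_cid_key
    UNIQUE (year, user_id, sid, cid);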
You can join the table to itself on the fields that would be duplicated and then anti-join on the id field. Select the id field from the first table alias (tn1) and then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, group the results by the tn1.id field. This will produce a result set that contains the id of a record and an array of all the ids that fit the join conditions.
select tn1.id,
array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id;
Obviously, ids that appear in the duplicate_entries array for one id will also have their own entries in the result set. You will have to use this result set to decide which id you want to become the source of 'truth', i.e. the one record that shouldn't get deleted. Maybe you could do something like this:
with dupe_set as (
select tn1.id,
array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id
order by tn1.id asc)
select ds.id from dupe_set ds where not exists
(select de from unnest(ds.duplicate_entries) as de where de < ds.id)
This selects the lowest-numbered ids that have duplicates (assuming the id is an increasing integer primary key). These would be the ids that you keep around.
Inspired by Sandro Wiggers, I did something similar to this:
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id
FROM ordered
WHERE rnk > 1
)
DELETE
FROM user_links
USING to_delete
WHERE user_links.id = to_delete.id;
If you want to test it, change it slightly:
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id,year,user_id,sid, cid
FROM ordered
WHERE rnk > 1
)
SELECT * FROM to_delete;
This will give an overview of what is going to be deleted (it is fine to keep year, user_id, sid and cid in the to_delete query when running the deletion, but then they are not needed).
In your case, because of the constraint you need to delete the duplicated records.
Find the duplicated rows
Organize them by created_at date - in this case I'm keeping the oldest
Delete the records with USING to filter the right rows
WITH duplicated AS (
SELECT id,
count(*)
FROM products
GROUP BY id
HAVING count(*) > 1),
ordered AS (
SELECT p.id,
created_at,
rank() OVER (partition BY p.id ORDER BY p.created_at) AS rnk
FROM products p
JOIN duplicated d ON d.id = p.id ),
products_to_delete AS (
SELECT id,
created_at
FROM ordered
WHERE rnk > 1
)
DELETE
FROM products
USING products_to_delete
WHERE products.id = products_to_delete.id
AND products.created_at = products_to_delete.created_at;
The following SQL syntax provides better performance when checking for duplicate rows:
SELECT id, count(id)
FROM table1
GROUP BY id
HAVING count(id) > 1
begin;
create table user_links(id serial,year bigint, user_id bigint, sid bigint, cid bigint);
insert into user_links(year, user_id, sid, cid) values (null,null,null,null),
(null,null,null,null), (null,null,null,null),
(1,2,3,4), (1,2,3,4),
(1,2,3,4),(1,1,3,8),
(1,1,3,9),
(1,null,null,null),(1,null,null,null);
commit;
A set operation with DISTINCT ON and EXCEPT:
(select id, year, user_id, sid, cid from user_links order by 1)
except
select distinct on (year, user_id, sid, cid) id, year, user_id, sid, cid
from user_links order by 1;
EXCEPT ALL also works, since the serial id makes all rows unique.
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1;
So far this works for both NULL and non-NULL values.
To delete:
with a as(
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1)
delete from user_links using a where user_links.id = a.id returning *;