Aggregating consecutive rows in SQL

Given the sql table (I'm using SQLite3):
CREATE TABLE person(name text, number integer);
And filling with the values:
insert into person values
('Leandro', 2),
('Leandro', 4),
('Maria', 8),
('Maria', 16),
('Jose', 32),
('Leandro', 64);
What I want is to get the sum of the number column, but only for consecutive rows, so that I get the following result, which maintains the original insertion order:
Leandro|6
Maria|24
Jose|32
Leandro|64
The "closest" I got so far is:
select name, sum(number) over(partition by name) from person order by rowid;
But it clearly shows I'm far from understanding SQL, as the most important features (grouping and summation of consecutive rows) are missing, but at least the order is there :-):
Leandro|70
Leandro|70
Maria|24
Maria|24
Jose|32
Leandro|70
Preferably the answer should not require creation of temporary tables, as the output is expected to always have the same order of how the data was inserted.

This is a type of gaps-and-islands problem. You can use the difference of row numbers for this purpose:
select name, sum(number)
from (select p.*,
row_number() over (order by number) as seqnum,
row_number() over (partition by name order by number) as seqnum_1
from person p
) p
group by name, (seqnum - seqnum_1)
order by min(number);
Why this works is a little tricky to explain. However, it becomes pretty obvious when you look at the results of the subquery. The difference of row numbers is constant on adjacent rows when the name does not change.
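To make the row-number difference visible, here is a minimal sketch that runs the inner subquery on the question's data through Python's sqlite3 module (this assumes a SQLite build with window functions, 3.25 or later). Note that ordering by number only reproduces the insertion order because the sample values happen to increase; that is a property of this data, not of the technique.

```python
import sqlite3

# In-memory SQLite database holding the question's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(name text, number integer)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Leandro", 2), ("Leandro", 4), ("Maria", 8),
     ("Maria", 16), ("Jose", 32), ("Leandro", 64)],
)

# The inner subquery from the answer, returned row by row.
islands = conn.execute("""
    select name, number,
           row_number() over (order by number) as seqnum,
           row_number() over (partition by name order by number) as seqnum_1
    from person
    order by number
""").fetchall()

# Print each row with the difference that the outer query groups by.
for name, number, seqnum, seqnum_1 in islands:
    print(name, number, seqnum, seqnum_1, seqnum - seqnum_1)
```

The difference stays constant within each run of the same name and changes when a run ends, so (name, difference) identifies one island.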
Here is a db<>fiddle.

You can do it with window functions:
LAG() to check if the previous name is the same as the current one
SUM() to create groups for consecutive same names
and then group by the groups and aggregate:
select name, sum(number) total
from (
select *, sum(flag) over (order by rowid) grp
from (
select *, rowid, name <> lag(name, 1, '') over (order by rowid) flag
from person
)
)
group by grp
order by grp;
See the demo.
Results:
> name | total
> :------ | ----:
> Leandro | 6
> Maria | 24
> Jose | 32
> Leandro | 64
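The two intermediate columns can be inspected with a short sketch like the one below, run through Python's sqlite3 module (assuming SQLite 3.25 or later). The alias rid is only an illustrative name for rowid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(name text, number integer)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Leandro", 2), ("Leandro", 4), ("Maria", 8),
     ("Maria", 16), ("Jose", 32), ("Leandro", 64)],
)

# flag is 1 when the name differs from the previous row's name, else 0.
# The running sum of flag numbers the groups of consecutive equal names.
steps = conn.execute("""
    select name, number, flag,
           sum(flag) over (order by rid) as grp
    from (
        select name, number, rowid as rid,
               name <> lag(name, 1, '') over (order by rowid) as flag
        from person
    )
    order by rid
""").fetchall()

for row in steps:
    print(row)
```

Every row where the name changes starts a new grp value, so the last Leandro row gets its own group instead of joining the first two.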

I would change the create table statement to the following:
CREATE TABLE person(id integer, firstname nvarchar(255), number integer);
You need a third column to determine the insert order.
I would rename the column name to something like firstname, because name is a keyword in some DBMSs. The same applies to the column named number. Moreover, I would change the type of name from text to nvarchar, because it is sortable in the GROUP BY clause.
Then you can insert your data:
insert into person values
(1, 'Leandro', 2),
(2, 'Leandro', 4),
(3, 'Maria', 8),
(4, 'Maria', 16),
(5, 'Jose', 32),
(6, 'Leandro', 64);
After that you can query the data in the following way:
SELECT firstname, value FROM (
SELECT p.id, p.firstname, p.number, LAG(p.firstname) over (ORDER BY p.id) as prevname,
CASE
WHEN firstname LIKE LEAD(p.firstname) over (ORDER BY p.id) THEN number + LEAD(p.number) over(ORDER BY p.id)
ELSE number
END as value
FROM Person p
) AS temp
WHERE temp.firstname <> temp.prevname OR
temp.prevname IS NULL
First you select the value in the case statement
Then you filter the data, keeping only the entries whose previous name differs from the current name.
To understand the query better, you can run the subquery on its own:
SELECT p.id, p.firstname, p.number, LEAD(p.firstname) over (ORDER BY p.id) as nextname, LAG(p.firstname) over (ORDER BY p.id) as prevname,
CASE
WHEN firstname LIKE LEAD(p.firstname) over (ORDER BY p.id) THEN number + LEAD(p.number) over(ORDER BY p.id)
ELSE number
END as value
FROM Person p
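A quick check of how this query behaves when a name repeats more than twice in a row, which the sample data never does. This is a sketch using Python's sqlite3 module (SQLite 3.25 or later); the third Leandro row is made up for illustration, the type is text instead of nvarchar, and the alias t and the final ORDER BY are additions to keep the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(id integer, firstname text, number integer)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    # Hypothetical data: a run of three 'Leandro' rows, then one 'Maria'.
    [(1, "Leandro", 2), (2, "Leandro", 4), (3, "Leandro", 5), (4, "Maria", 8)],
)

pairwise = conn.execute("""
    SELECT firstname, value FROM (
        SELECT p.id, p.firstname,
               LAG(p.firstname) OVER (ORDER BY p.id) AS prevname,
               CASE
                   WHEN p.firstname LIKE LEAD(p.firstname) OVER (ORDER BY p.id)
                       THEN p.number + LEAD(p.number) OVER (ORDER BY p.id)
                   ELSE p.number
               END AS value
        FROM person p
    ) AS t
    WHERE t.firstname <> t.prevname OR t.prevname IS NULL
    ORDER BY t.id
""").fetchall()

print(pairwise)
```

On this input the first Leandro row reports 6 (2 + 4) rather than 11, because LEAD only looks one row ahead. The approach therefore seems to fit only when a run has at most two rows.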

Based on Gordon Linoff's answer (https://stackoverflow.com/a/64727401/1721672), I extracted the inner select as CTE and the following query works pretty well:
with p(name, number, seqnum, seqnum_1) as
(select name, number,
row_number() over (order by number) as seqnum,
row_number() over (partition by name order by number) as seqnum_1
from person)
select
name, sum(number)
from
p
group by
name, (seqnum - seqnum_1)
order by
min(number);
Producing the expected result:
Leandro|6
Maria|24
Jose|32
Leandro|64

Related

Pick a record based on the value of one column being the greatest in Snowflake

Let's say I have a table structured like this
Name | Score
Mike | 40
Mike | 79
Mike | 49
And I wanted to return just the row that says Mike with the score of 79 and nothing else.
The code I have been playing around with looks like this:
SELECT Name, COUNT(Name), greatest(Score) FROM
table GROUP BY Name, Score
I tried a few different variations like using Rank and the greatest function, but haven't had too much luck. Any help would be much appreciated, thanks.
Using QUALIFY and RANK/ROW_NUMBER:
SELECT *
FROM tab
QUALIFY RANK() OVER(PARTITION BY Name ORDER BY Score DESC) = 1
The long form explanation:
If you add a ROW_NUMBER, and a RANK to the altered data:
WITH data(name, score, extra) as (
select * from values
('Mike', 40, 'a'),
('Mike', 79, 'b'),
('Mike', 79, 'c')
)
select *
,row_number() over (partition by name order by score desc) as rn
,rank() over (partition by name order by score desc) as rank
from data;
NAME | SCORE | EXTRA | RN | RANK
Mike | 79 | b | 1 | 1
Mike | 79 | c | 2 | 1
Mike | 40 | a | 3 | 3
You can see that ROW_NUMBER assigns the value 1 to exactly one row, whereas RANK assigns 1 to every row tied for that spot. Because RANK is sparse, ties leave gaps, which is why Mike,40 is ranked third. So the choice between RANK and ROW_NUMBER depends on how you want to handle ties and whether you are joining to this data, etc.
Then you can do a filter in the classic ANSI form:
WITH data(name, score, extra) as (
select * from values
('Mike', 40, 'a'),
('Mike', 79, 'b'),
('Mike', 79, 'c')
)
select name, score, extra
from (
select *
,row_number() over (partition by name order by score desc) as rn
from data
)
where rn = 1;
Note this is an unstable sort, as Mike,79,b OR Mike,79,c can be returned by the database, but with ROW_NUMBER you will only get one.
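If a stable pick matters, one option is to add a tiebreaker to the window's ORDER BY. The sketch below uses Python's sqlite3 module as a stand-in for Snowflake (an assumption made so the example can run anywhere, SQLite 3.25 or later); choosing extra as the tiebreaker is also just an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data(name text, score integer, extra text)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [("Mike", 40, "a"), ("Mike", 79, "b"), ("Mike", 79, "c")],
)

# Ordering by (score desc, extra) makes the choice between the two
# tied rows deterministic: the smaller 'extra' value wins.
top = conn.execute("""
    select name, score, extra
    from (
        select *,
               row_number() over (
                   partition by name
                   order by score desc, extra  -- tiebreaker (assumption)
               ) as rn
        from data
    )
    where rn = 1
""").fetchall()

print(top)
```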
Snowflake has the QUALIFY clause, which filters on the results of window functions after they are computed (much as HAVING filters after grouping), so the sub-select can be dropped.
So you can write:
select *
,row_number() over (partition by name order by score desc) as rn
from data
QUALIFY rn = 1;
NAME | SCORE | EXTRA | RN
Mike | 79 | b | 1
But if you do not want to see the ROW_NUMBER value, the window function can be moved into the QUALIFY clause itself. The result is the same, but the query is tidier:
WITH data(name, score, extra) as (
select * from values
('Mike', 40, 'a'),
('Mike', 79, 'b'),
('Mike', 79, 'c')
)
select *
from data
QUALIFY row_number() over (partition by name order by score desc) = 1;
NAME | SCORE | EXTRA
Mike | 79 | b
Not sure why other answers are complicating things, you just want to be using the max function, like so:
WITH data(name, score) as (
select * from values
('Mike', 40),
('Mike', 79),
('Mike', 79)
)
select name, max(score) as score
from data
where name ='Mike' group by name;
With your query:
SELECT Name, COUNT(*), max(Score) FROM
table GROUP BY Name
greatest is "similar" to max from a functionality perspective, but as you mentioned it did not work. That is because of its signature: it is not meant to receive a single expression as input. I recommend you read about the differences between max and greatest to make sure you understand them fully.
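To illustrate the distinction, here is a small sketch using Python's sqlite3 module as a stand-in, not Snowflake. SQLite has no GREATEST; its multi-argument max(a, b) plays the same row-wise role, while single-argument max(x) is the aggregate. The scores table and its round1/round2 columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: two score columns per row.
conn.execute("CREATE TABLE scores(name text, round1 integer, round2 integer)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Mike", 40, 55), ("Mike", 79, 10), ("Mike", 49, 60)],
)

# Row-wise comparison (the role GREATEST plays): one result per input row,
# picking the larger of the two columns in that row.
per_row = conn.execute(
    "select name, max(round1, round2) from scores order by rowid"
).fetchall()

# Aggregate MAX: one result per group, the largest round1 across all rows.
per_group = conn.execute(
    "select name, max(round1) from scores group by name"
).fetchall()

print(per_row)
print(per_group)
```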

Display duplicate row indicator and get only one row when duplicate

I built the schema at http://sqlfiddle.com/#!18/7e9e3
CREATE TABLE BoatOwners
(
BoatID INT,
OwnerDOB DATETIME,
Name VARCHAR(200)
);
INSERT INTO BoatOwners (BoatID, OwnerDOB,Name)
VALUES (1, '2021-04-06', 'Bob1'),
(1, '2020-04-06', 'Bob2'),
(1, '2019-04-06', 'Bob3'),
(2, '2012-04-06', 'Tom'),
(3, '2009-04-06', 'David'),
(4, '2006-04-06', 'Dale1'),
(4, '2009-04-06', 'Dale2'),
(4, '2013-04-06', 'Dale3');
I would like to write a query that would produce the following result characteristics :
Returns only one owner per boat
When multiple owners on a single boat, return the youngest owner.
Display a column to indicate if a boat has multiple owners.
So the following data set when apply that query would produce
I tried
ROW_NUMBER() OVER (PARTITION BY ....
but haven't had much luck so far.
with data as (
select BoatID, OwnerDOB, Name,
row_number() over (partition by BoatID order by OwnerDOB desc) as rn,
count(*) over (partition by BoatID) as cnt
from BoatOwners
)
select BoatID, OwnerDOB, Name,
case when cnt > 1 then 'Yes' else 'No' end as MultipleOwner
from data
where rn = 1
This is just a case of numbering the rows for each BoatId group and also counting the rows in each group, then filtering accordingly:
select BoatId, OwnerDob, Name, Iif(qty=1,'No','Yes') MultipleOwner
from (
select *, Row_Number() over(partition by boatid order by OwnerDOB desc)rn, Count(*) over(partition by boatid) qty
from BoatOwners
)b where rn=1
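The numbering-plus-counting approach can be tried on the question's data with a sketch like this one. It uses Python's sqlite3 module instead of SQL Server (an assumption made for portability, SQLite 3.25 or later), stores the dates as ISO text so they sort correctly, and adds an ORDER BY to fix the output order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BoatOwners (BoatID INT, OwnerDOB TEXT, Name VARCHAR(200))")
conn.executemany(
    "INSERT INTO BoatOwners (BoatID, OwnerDOB, Name) VALUES (?, ?, ?)",
    [(1, "2021-04-06", "Bob1"), (1, "2020-04-06", "Bob2"),
     (1, "2019-04-06", "Bob3"), (2, "2012-04-06", "Tom"),
     (3, "2009-04-06", "David"), (4, "2006-04-06", "Dale1"),
     (4, "2009-04-06", "Dale2"), (4, "2013-04-06", "Dale3")],
)

# rn = 1 marks the youngest owner (latest date of birth) per boat;
# cnt is the number of owners on that boat.
youngest = conn.execute("""
    select BoatID, OwnerDOB, Name,
           case when cnt > 1 then 'Yes' else 'No' end as MultipleOwner
    from (
        select BoatID, OwnerDOB, Name,
               row_number() over (partition by BoatID order by OwnerDOB desc) as rn,
               count(*) over (partition by BoatID) as cnt
        from BoatOwners
    )
    where rn = 1
    order by BoatID
""").fetchall()

for row in youngest:
    print(row)
```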

How do I create a new SQL table with custom column names and populate these columns

So I currently have an SQL statement that generates a table with the most frequently occurring value as well as the least frequently occurring value in a table. However, this table has 2 rows with the row values as well as the fields. I need to create a custom table with 2 columns, min and max, and one row with one value for each. The value for these columns needs to be from the same row.
(SELECT name, COUNT(name) AS frequency
FROM firefighter_certifications
GROUP BY name
ORDER BY frequency DESC limit 1)
UNION
(SELECT name, COUNT(name) AS frequency
FROM firefighter_certifications
GROUP BY name
ORDER BY frequency ASC limit 1);
So for the above query I would need the names of the min and max values in one row. I need to be able to define the name of new columns for the generated SQL query as well.
Min_Name | Max_Name
Certif_1 | Certif_2
I think this query should give you the results you want. It ranks each name according to the number of times it appears in the table, then uses conditional aggregation to select the min and max frequency names in one row:
with cte as (
select name,
row_number() over (order by count(*) desc) as maxr,
row_number() over (order by count(*)) as minr
from firefighter_certifications
group by name
)
select max(case when minr = 1 then name end) as Min_Name,
max(case when maxr = 1 then name end) as Max_Name
from cte
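Here is a runnable variation of the ranking-plus-conditional-aggregation idea, sketched with Python's sqlite3 module rather than Postgres (an assumption, SQLite 3.25 or later). The certification names are made-up sample data, the counts are computed in a separate CTE first, and name is added as a tiebreaker so that equal frequencies are resolved deterministically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE firefighter_certifications(name text)")
conn.executemany(
    "INSERT INTO firefighter_certifications VALUES (?)",
    # Hypothetical data: EMT appears 3 times, Rescue 2, Hazmat 1.
    [("EMT",), ("EMT",), ("EMT",), ("Rescue",), ("Rescue",), ("Hazmat",)],
)

result = conn.execute("""
    with counts as (
        select name, count(*) as cnt
        from firefighter_certifications
        group by name
    ),
    ranked as (
        select name,
               row_number() over (order by cnt desc, name) as maxr,
               row_number() over (order by cnt asc, name) as minr
        from counts
    )
    -- Each CASE yields a name on exactly one row and NULL elsewhere,
    -- so max() collapses the rows into a single one.
    select max(case when minr = 1 then name end) as Min_Name,
           max(case when maxr = 1 then name end) as Max_Name
    from ranked
""").fetchone()

print(result)
```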
Postgres doesn't offer "first" and "last" aggregation functions. But there are other, similar methods:
select distinct first_value(name) over (order by cnt desc, name) as name_at_max,
first_value(name) over (order by cnt asc, name) as name_at_min
from (select name, count(*) as cnt
from firefighter_certifications
group by name
) n;
Or without any subquery at all:
select first_value(name) over (order by count(*) desc, name) as name_at_max,
first_value(name) over (order by count(*) asc, name) as name_at_min
from firefighter_certifications
group by name
limit 1;
Here is a db<>fiddle

Grouping while maintaining next record

I have a table (NerdsTable) with some of this data:
-------------+-----------+----------------
id name school
-------------+-----------+----------------
1 Joe ODU
2 Mike VCU
3 Ane ODU
4 Trevor VT
5 Cools VCU
When I run the following query
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable where school = 'ODU';
I get these results:
[id=1,name=Joe,nextid=3]
[id=3,name=Ane,nextid=NULL]
I want to write a query that does not need the static check for
where school = 'odu'
but gives back the same results as above. In other words, I want to select all results in the database, and have them grouped correctly as if I went through individually and ran queries for:
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'ODU';
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'VCU';
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'VT';
Here is the output I am hoping to see:
[id=1,name=Joe,nextid=3]
[id=3,name=Ane,nextid=NULL]
[id=2,name=Mike,nextid=5]
[id=5,name=Cools,nextid=NULL]
[id=4,name=Trevor,nextid=NULL]
Here is what I have tried, but am failing miserably:
SELECT id, name,
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
ORDER BY school;
-- Problem, as this does not sort by the id. I need the lowest id first for the group
SELECT id, name,
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
ORDER BY id, school;
-- Sorts by id, but the grouping is not correct, thus next_id is wrong
I then looked on the Microsoft doc site for aggregate functions, but do not see how I can use any to group my results correctly. I tried to use GROUPING_ID, as follows:
SELECT id, GROUPING_ID(name),
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
group by school;
But I get an error:
is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
Any idea as to what I am missing here?
From your desired output it looks like you are just trying to order the records by school. You can do that like this:
SELECT id, name
FROM dbo.NerdsTable
ORDER BY school ASC, id ASC
I don't know what next ID is supposed to mean.
create table schools (id int, name varchar(50), school varchar(3))
insert into schools values (1, 'Joe', 'ODU'), (2, 'Mike', 'VCU'), (3, 'Ane', 'ODU'),
(4, 'Trevor', 'VT'), (5, 'Cools', 'VCU'), (6, 'Sarah', 'VCU')
select n.id, n.name, min(g.id) nextid
from schools n
left join
(
select id, school
from schools
) g on g.school = n.school and g.id > n.id
group by n.id, n.name
drop table schools
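A common way to get a per-school next_id is to add PARTITION BY school to the LEAD window, so each school is treated as if it had been queried on its own. The sketch below runs that idea on the question's five rows using Python's sqlite3 module instead of SQL Server (an assumption, SQLite 3.25 or later). The first_id column is a helper introduced here only to order the schools by their lowest id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NerdsTable(id integer, name text, school text)")
conn.executemany(
    "INSERT INTO NerdsTable VALUES (?, ?, ?)",
    [(1, "Joe", "ODU"), (2, "Mike", "VCU"), (3, "Ane", "ODU"),
     (4, "Trevor", "VT"), (5, "Cools", "VCU")],
)

# LEAD is evaluated separately inside each school partition.
# first_id (hypothetical helper) is the lowest id in the school and
# is used only to sort the groups.
result = conn.execute("""
    select id, name, next_id
    from (
        select id, name,
               lead(id) over (partition by school order by id) as next_id,
               min(id) over (partition by school) as first_id
        from NerdsTable
    )
    order by first_id, id
""").fetchall()

for row in result:
    print(row)
```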

Oracle - Calculating time differences

Let's say I have following data:
Create Table Pm_Test (
Ticket_id Number,
Department_From varchar2(100),
Department_To varchar2(100),
Routing_Date Date
);
Insert Into Pm_Test Values (1,'A','B',To_Date('20140101120005','yyyymmddhh24miss'));
Insert Into Pm_Test Values (1,'B','C',To_Date('20140101130004','yyyymmddhh24miss'));
Insert Into Pm_Test Values (1,'C','D',To_Date('20140101130004','yyyymmddhh24miss'));
Insert Into Pm_Test Values (1,'D','E',To_Date('20140201150004','yyyymmddhh24miss'));
Insert Into Pm_Test Values (2,'A','B',To_Date('20140102120005','yyyymmddhh24miss'));
Insert Into Pm_Test Values (3,'D','B',To_Date('20140102120005','yyyymmddhh24miss'));
Insert Into Pm_Test Values (3,'B','A',To_Date('20140102170005','yyyymmddhh24miss'));
For the following requirements I already added two virtual columns, I think they might be necessary:
Select t.*,
Count(Ticket_id) Over (Partition By Ticket_id Order By Ticket_id) Cnt_Id,
Row_Number() Over (Partition By Ticket_id Order By Ticket_id ) row_number
From Pm_Test t;
1) I want to measure how long each ticket stayed in a department (routing_date of successor_department - routing_date of predecessor department) by adding the column PROCESSING_TIME:
2) I want to measure the total processing time by adding the column TOTAL_PROCESSING_TIME:
What SQL statements would be necessary to do so?
Thank you very much in advance!
The following SQL should solve your problem as you described it. One thing to keep in mind: this data model doesn't seem the most efficient way to capture processing times, if that's its true intent, as the time spent in the first department to get the ticket isn't measured.
select dept.ticket_id, department_from, department_to, routing_date, dept_processing_time, total_ticket_processing_time
from
(select ticket_id, max(routing_date) - min(routing_date) total_ticket_processing_time
from pm_test
group by ticket_id) total
join
(select ticket_id, department_from, department_to, routing_date,
coalesce(routing_date - lag(routing_date) over (partition by ticket_id order by routing_date), 0) dept_processing_time
from pm_test) dept
on (total.ticket_id = dept.ticket_id);
This query produces the desired output. The analytic functions max(), min() and lag() are used for the calculations.
Results are in hours, like in your question.
SQLFiddle
select t.ticket_id, t.department_from, t.department_to,
to_char(t.routing_date, 'mm.dd.yy hh24:mi:ss') rd,
count(ticket_id) over (partition by ticket_id) cnt_id,
row_number() over (partition by ticket_id order by t.routing_date ) rn,
round(24 * (t.routing_date-
nvl(lag(t.routing_date) over (partition by ticket_id
order by t.routing_date), routing_date) ) , 8) dept_time,
round(24 * (max(t.routing_date) over (partition by ticket_id)
- min(t.routing_date) over (partition by ticket_id)), 8) total_time
from pm_test t
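For readers without an Oracle instance, the same lag/max/min logic can be approximated with Python's sqlite3 module (SQLite 3.25 or later). This is only a sketch under stated assumptions: dates are stored as ISO text, differences are computed from epoch seconds via strftime('%s', ...) instead of Oracle date subtraction, and department_from is added as a tiebreaker because two rows of ticket 1 share the same routing date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pm_test(
    ticket_id integer, department_from text,
    department_to text, routing_date text)""")
conn.executemany(
    "INSERT INTO pm_test VALUES (?, ?, ?, ?)",
    [(1, "A", "B", "2014-01-01 12:00:05"), (1, "B", "C", "2014-01-01 13:00:04"),
     (1, "C", "D", "2014-01-01 13:00:04"), (1, "D", "E", "2014-02-01 15:00:04"),
     (2, "A", "B", "2014-01-02 12:00:05"), (3, "D", "B", "2014-01-02 12:00:05"),
     (3, "B", "A", "2014-01-02 17:00:05")],
)

# dept_hours: hours since the previous routing of the same ticket (0 for the first).
# total_hours: hours between the first and last routing of the ticket.
times = conn.execute("""
    select ticket_id, department_from, department_to,
           round((secs - coalesce(
               lag(secs) over (partition by ticket_id
                               order by secs, department_from), secs)) / 3600.0, 4)
               as dept_hours,
           round((max(secs) over (partition by ticket_id)
                - min(secs) over (partition by ticket_id)) / 3600.0, 4)
               as total_hours
    from (
        select *, cast(strftime('%s', routing_date) as integer) as secs
        from pm_test
    )
    order by ticket_id, secs, department_from
""").fetchall()

for row in times:
    print(row)
```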