Display duplicate row indicator and get only one row when duplicate - sql

I built the schema at http://sqlfiddle.com/#!18/7e9e3
CREATE TABLE BoatOwners
(
BoatID INT,
OwnerDOB DATETIME,
Name VARCHAR(200)
);
INSERT INTO BoatOwners (BoatID, OwnerDOB,Name)
VALUES (1, '2021-04-06', 'Bob1'),
(1, '2020-04-06', 'Bob2'),
(1, '2019-04-06', 'Bob3'),
(2, '2012-04-06', 'Tom'),
(3, '2009-04-06', 'David'),
(4, '2006-04-06', 'Dale1'),
(4, '2009-04-06', 'Dale2'),
(4, '2013-04-06', 'Dale3');
I would like to write a query that would produce the following result characteristics :
Returns only one owner per boat
When multiple owners on a single boat, return the youngest owner.
Display a column to indicate if a boat has multiple owners.
Applied to the data set above, the query should return one row per boat (the youngest owner) plus the multiple-owner indicator.
I tried
ROW_NUMBER() OVER (PARTITION BY ....
but haven't had much luck so far.

with data as (
select BoatID, OwnerDOB, Name,
row_number() over (partition by BoatID order by OwnerDOB desc) as rn,
count(*) over (partition by BoatID) as cnt
from BoatOwners
)
select BoatID, OwnerDOB, Name,
case when cnt > 1 then 'Yes' else 'No' end as MultipleOwner
from data
where rn = 1

This is just a case of numbering the rows for each BoatId group and also counting the rows in each group, then filtering accordingly:
select BoatId, OwnerDob, Name, Iif(qty=1,'No','Yes') MultipleOwner
from (
select *, Row_Number() over(partition by boatid order by OwnerDOB desc)rn, Count(*) over(partition by boatid) qty
from BoatOwners
)b where rn=1
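The numbering-and-counting approach described above can be tried outside SQL Server. This is only a minimal sketch, assuming SQLite (3.25 or later, for window functions) through Python's sqlite3 module, with the date of birth stored as ISO text instead of DATETIME; the subquery alias numbered is just an illustrative name.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server schema from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BoatOwners (BoatID INT, OwnerDOB TEXT, Name TEXT)")
conn.executemany(
    "INSERT INTO BoatOwners VALUES (?, ?, ?)",
    [
        (1, "2021-04-06", "Bob1"), (1, "2020-04-06", "Bob2"), (1, "2019-04-06", "Bob3"),
        (2, "2012-04-06", "Tom"), (3, "2009-04-06", "David"),
        (4, "2006-04-06", "Dale1"), (4, "2009-04-06", "Dale2"), (4, "2013-04-06", "Dale3"),
    ],
)

query = """
SELECT BoatID, OwnerDOB, Name,
       CASE WHEN cnt > 1 THEN 'Yes' ELSE 'No' END AS MultipleOwner
FROM (
    SELECT BoatID, OwnerDOB, Name,
           -- rn = 1 marks the latest date of birth, i.e. the youngest owner
           ROW_NUMBER() OVER (PARTITION BY BoatID ORDER BY OwnerDOB DESC) AS rn,
           -- cnt is the number of owners recorded for the boat
           COUNT(*) OVER (PARTITION BY BoatID) AS cnt
    FROM BoatOwners
) AS numbered
WHERE rn = 1
ORDER BY BoatID
"""
boats = conn.execute(query).fetchall()
for row in boats:
    print(row)
```

Boats 1 and 4 come back flagged 'Yes' with Bob1 and Dale3 respectively; boats 2 and 3 come back flagged 'No'.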

Related

Aggregating consecutive rows in SQL

Given the sql table (I'm using SQLite3):
CREATE TABLE person(name text, number integer);
And filling with the values:
insert into person values
('Leandro', 2),
('Leandro', 4),
('Maria', 8),
('Maria', 16),
('Jose', 32),
('Leandro', 64);
What I want is the sum of the number column, but only over consecutive rows with the same name, so that I get the result below, which maintains the original insertion order:
Leandro|6
Maria|24
Jose|32
Leandro|64
The "closest" I got so far is:
select name, sum(number) over(partition by name) from person order by rowid;
But it clearly shows I'm far from understanding SQL, as the most important features (grouping and summation of consecutive rows) are missing, but at least the order is there :-):
Leandro|70
Leandro|70
Maria|24
Maria|24
Jose|32
Leandro|70
Preferably the answer should not require creation of temporary tables, as the output is expected to always have the same order of how the data was inserted.
This is a type of gaps-and-islands problem. You can use the difference of row numbers for this purpose:
select name, sum(number)
from (select p.*,
row_number() over (order by number) as seqnum,
row_number() over (partition by name order by number) as seqnum_1
from person p
) p
group by name, (seqnum - seqnum_1)
order by min(number);
Why this works is a little tricky to explain. However, it becomes pretty obvious when you look at the results of the subquery. The difference of row numbers is constant on adjacent rows when the name does not change.
Here is a db<>fiddle.
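To look at the results of the subquery as suggested above, here is a small sketch, assuming Python's sqlite3 module with an SQLite build that supports window functions. The column name diff is only an illustrative label for seqnum - seqnum_1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(name text, number integer)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Leandro", 2), ("Leandro", 4), ("Maria", 8),
     ("Maria", 16), ("Jose", 32), ("Leandro", 64)],
)

# Inner query of the answer, with the row-number difference made visible.
inner = conn.execute("""
    SELECT name, number,
           ROW_NUMBER() OVER (ORDER BY number) AS seqnum,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY number) AS seqnum_1
    FROM person
    ORDER BY number
""").fetchall()

# diff stays constant while the name does not change between adjacent rows.
diffs = [(name, number, seqnum - seqnum_1) for name, number, seqnum, seqnum_1 in inner]
for row in diffs:
    print(row)
# ('Leandro', 2, 0)
# ('Leandro', 4, 0)
# ('Maria', 8, 2)
# ('Maria', 16, 2)
# ('Jose', 32, 4)
# ('Leandro', 64, 3)
```

The two Leandro runs get different differences (0 and 3), which is why grouping by name plus the difference keeps them apart.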
You can do it with window functions:
LAG() to check if the previous name is the same as the current one
SUM() to create groups for consecutive same names
and then group by the groups and aggregate:
select name, sum(number) total
from (
select *, sum(flag) over (order by rowid) grp
from (
select *, rowid, name <> lag(name, 1, '') over (order by rowid) flag
from person
)
)
group by grp
See the demo.
Results:
> name | total
> :------ | ----:
> Leandro | 6
> Maria | 24
> Jose | 32
> Leandro | 64
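The intermediate flag and grp columns of the LAG()/SUM() approach can be inspected before the final aggregation. A minimal sketch, assuming Python's sqlite3 module; the alias rid for rowid is introduced here only to keep the column name unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(name text, number integer)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Leandro", 2), ("Leandro", 4), ("Maria", 8),
     ("Maria", 16), ("Jose", 32), ("Leandro", 64)],
)

steps = conn.execute("""
    SELECT name, number, flag,
           -- running total of the flags: one group number per run of equal names
           SUM(flag) OVER (ORDER BY rid) AS grp
    FROM (
        SELECT rowid AS rid, name, number,
               -- 1 when the name differs from the previous row, else 0
               name <> LAG(name, 1, '') OVER (ORDER BY rowid) AS flag
        FROM person
    ) AS flagged
    ORDER BY rid
""").fetchall()
for row in steps:
    print(row)
# ('Leandro', 2, 1, 1)
# ('Leandro', 4, 0, 1)
# ('Maria', 8, 1, 2)
# ('Maria', 16, 0, 2)
# ('Jose', 32, 1, 3)
# ('Leandro', 64, 1, 4)
```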
I would change the create table statement to the following:
CREATE TABLE person(id integer, firstname nvarchar(255), number integer);
You need a third column to determine the insert order.
I would rename the column name to something like firstname, because name is a keyword in some DBMSs. The same applies to the column named number. I would also change the type of name from text to nvarchar, because it is sortable in the GROUP BY clause.
Then you can insert your data:
insert into person values
(1, 'Leandro', 2),
(2, 'Leandro', 4),
(3, 'Maria', 8),
(4, 'Maria', 16),
(5, 'Jose', 32),
(6, 'Leandro', 64);
After that you can query the data in the following way:
SELECT firstname, value FROM (
SELECT p.id, p.firstname, p.number, LAG(p.firstname) over (ORDER BY p.id) as prevname,
CASE
WHEN firstname LIKE LEAD(p.firstname) over (ORDER BY p.id) THEN number + LEAD(p.number) over(ORDER BY p.id)
ELSE number
END as value
FROM Person p
) AS temp
WHERE temp.firstname <> temp.prevname OR
temp.prevname IS NULL
First you select the value in the case statement
Then you filter the data, keeping only the rows whose previous name differs from the current name (or that have no previous row).
To understand the query better, you can run the subquery on its own:
SELECT p.id, p.firstname, p.number, LEAD(p.firstname) over (ORDER BY p.id) as nextname, LAG(p.firstname) over (ORDER BY p.id) as prevname,
CASE
WHEN firstname LIKE LEAD(p.firstname) over (ORDER BY p.id) THEN number + LEAD(p.number) over(ORDER BY p.id)
ELSE number
END as value
FROM Person p
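For anyone who wants to run this LEAD()/LAG() variant, here is a rough sketch, assuming SQLite via Python's sqlite3 module, a text column instead of nvarchar, and the alias t in place of temp. An ORDER BY on id is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(id integer, firstname text, number integer)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Leandro", 2), (2, "Leandro", 4), (3, "Maria", 8),
     (4, "Maria", 16), (5, "Jose", 32), (6, "Leandro", 64)],
)

totals = conn.execute("""
    SELECT firstname, value
    FROM (
        SELECT p.id, p.firstname,
               LAG(p.firstname) OVER (ORDER BY p.id) AS prevname,
               -- add the next row's number when the next row has the same name
               CASE
                   WHEN p.firstname = LEAD(p.firstname) OVER (ORDER BY p.id)
                   THEN p.number + LEAD(p.number) OVER (ORDER BY p.id)
                   ELSE p.number
               END AS value
        FROM person p
    ) AS t
    -- keep only the first row of each run
    WHERE t.firstname <> t.prevname OR t.prevname IS NULL
    ORDER BY t.id
""").fetchall()
print(totals)
# [('Leandro', 6), ('Maria', 24), ('Jose', 32), ('Leandro', 64)]
```

Because LEAD() looks only one row ahead, this adds at most two rows per run, which is enough for the sample data but not for runs of three or more rows.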
Based on Gordon Linoff's answer (https://stackoverflow.com/a/64727401/1721672), I extracted the inner select as CTE and the following query works pretty well:
with p(name, number, seqnum, seqnum_1) as
(select name, number,
row_number() over (order by number) as seqnum,
row_number() over (partition by name order by number) as seqnum_1
from person)
select
name, sum(number)
from
p
group by
name, (seqnum - seqnum_1)
order by
min(number);
Producing the expected result:
Leandro|6
Maria|24
Jose|32
Leandro|64

Removing All but the first and last values by group when the group is repeated in MS SQL Server (contiguous)

We have a chat system that generates multiple event logs per second sometimes for every event during a chat. The issue is that these consume a massive amount of data storage (which is very expensive on that platform) and we'd like to streamline what we actually store and delete things that really aren't necessary.
To that end, there's an event type for what position in the queue the chat is. We don't care about each position as long as they are not intervening events for that chat. So we want to keep only the first and last in each distinct group where there were no other event types to just get "total time in queue" for that period.
To complicate this, a customer can go in and out of queue as they get transferred by department, so the SAME CHAT can have multiple blocks of these queue position records. I've tried using FIRST_VALUE and LAST_VALUE and it gets me most of the way there, but fails when we have the case of two distinct blocks of these events.
Here's the script to generate the test data:
CREATE TABLE #testdata (
id varchar(18),
name varchar(8),
[type] varchar(20),
livechattranscriptid varchar(18),
groupid varchar(40))
INSERT INTO #testdata (id,name,[type],livechattranscriptid,groupid) VALUES
('0DZ14000003I2pOGAS','34128314','ChatRequest','57014000000ltfIAAQ','57014000000ltfIAAQChatRequest'),
('0DZ14000003IGmQGAW','34181980','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IHbqGAG','34185171','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003ILuHGAW','34201743','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IQ6cGAG','34217778','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IR7JGAW','34221794','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
('0DZ14000003IiDnGAK','34287448','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IiDoGAK','34287545','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
('0DZ14000003Iut5GAC','34336044','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003Iv7HGAS','34336906','Accept','57014000000ltfIAAQ','57014000000ltfIAAQAccept')
And here is the attempt to identify anything that was the first and last id for its group, ordered by the name field and grouped by the transcriptid:
select *,FIRST_VALUE(id) OVER(Partition BY groupid order by livechattranscriptid,name asc) as firstinstancegroup,
LAST_VALUE(id) OVER(Partition BY groupid order by livechattranscriptid,name asc RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as lastinstancegroup from #testdata order by livechattranscriptid,name
The issue is, it gives me the same first and last id for ALL of them by that entire group rather than treating each group of Enqueue records as a distinct group. How would I treat each distinct grouping instance of Enqueue as a unique group?
Here's a similar solution: Grouping contiguous table data.
It's not pretty, but the logic follows the OP's requirement: contiguous data over the same column.
create table #mytable (
id varchar(18),
name varchar(8),
[type] varchar(20),
livechattranscriptid varchar(18),
groupid varchar(100))
INSERT INTO #mytable (id,name,[type],livechattranscriptid,groupid) VALUES
('0DZ14000003I2pOGAS','34128314','ChatRequest','57014000000ltfIAAQ','57014000000ltfIAAQChatRequest'),
('0DZ14000003IGmQGAW','34181980','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IHbqGAG','34185171','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003ILuHGAW','34201743','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IQ6cGAG','34217778','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IR7JGAW','34221794','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
('0DZ14000003IiDnGAK','34287448','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IiDoGAK','34287545','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
('0DZ14000003Iut5GAC','34336044','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003Iv7HGAS','34336906','Accept','57014000000ltfIAAQ','57014000000ltfIAAQAccept')
;with myend as ( --- get all ends
select
*
from
(select
iif(groupid <> lead(groupid,1,groupid) over (order by name),
id,
'x') [newid],name
from #mytable
)x
where newid <> 'x'
)
, mystart as -- get all starts
(
select
*
from
(select
iif(groupid <> lag(groupid,1,groupid) over (order by name),
id,
'x') [newid], name,type,livechattranscriptid
from #mytable
)x
where newid <> 'x'
) ,
finalstart as ( --- get all starts including the first row
select id,
name,type,livechattranscriptid,
row_number() over (order by name) rn
from (
select id,name,type,livechattranscriptid
from (
select top 1 id, name,type,livechattranscriptid
from #mytable
order by name) x
union all
select newid,name,type,livechattranscriptid from mystart
) y
),
finalend as -- get all ends and add the last row
(
select id,
row_number() over (order by name) rn
from (
select id,name from (
select top 1 id,name
from #mytable
order by name desc) x
union all
select newid,name from myend
) y
)
select
s.id [startid]
,s.name
,s.type
,s.livechattranscriptid
,e.id [lastid]
from
finalend e
inner join finalstart s
on e.rn = s.rn --- bind the two results over the positions or row number
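The same start/end logic can also be written more compactly by flagging each change of type and taking a running sum of the flags. This is a different formulation from the CTEs above, offered only as a sketch: it assumes SQLite through Python's sqlite3 module, a single transcript as in the test data, and the hypothetical names testdata, is_start and island.

```python
import sqlite3

events = [
    ("0DZ14000003I2pOGAS", "34128314", "ChatRequest"),
    ("0DZ14000003IGmQGAW", "34181980", "Enqueue"),
    ("0DZ14000003IHbqGAG", "34185171", "Enqueue"),
    ("0DZ14000003ILuHGAW", "34201743", "Enqueue"),
    ("0DZ14000003IQ6cGAG", "34217778", "Enqueue"),
    ("0DZ14000003IR7JGAW", "34221794", "PushAssignment"),
    ("0DZ14000003IiDnGAK", "34287448", "Enqueue"),
    ("0DZ14000003IiDoGAK", "34287545", "PushAssignment"),
    ("0DZ14000003Iut5GAC", "34336044", "Enqueue"),
    ("0DZ14000003Iv7HGAS", "34336906", "Accept"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (id TEXT, name TEXT, type TEXT)")
conn.executemany("INSERT INTO testdata VALUES (?, ?, ?)", events)

blocks = conn.execute("""
    WITH flagged AS (
        SELECT id, name, type,
               -- 1 on the first row of every contiguous block of the same type
               CASE WHEN type = LAG(type) OVER (ORDER BY name) THEN 0 ELSE 1 END AS is_start
        FROM testdata
    ),
    islands AS (
        SELECT id, name, type,
               SUM(is_start) OVER (ORDER BY name) AS island
        FROM flagged
    )
    SELECT DISTINCT island, type,
           FIRST_VALUE(id) OVER (PARTITION BY island ORDER BY name
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_id,
           LAST_VALUE(id) OVER (PARTITION BY island ORDER BY name
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_id
    FROM islands
    ORDER BY island
""").fetchall()
for row in blocks:
    print(row)
```

With the test data this yields seven blocks; the first Enqueue block runs from 0DZ14000003IGmQGAW to 0DZ14000003IQ6cGAG, and the two later Enqueue rows each form their own block.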

Grouping while maintaining next record

I have a table (NerdsTable) with some of this data:
-------------+-----------+----------------
id name school
-------------+-----------+----------------
1 Joe ODU
2 Mike VCU
3 Ane ODU
4 Trevor VT
5 Cools VCU
When I run the following query
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable where school = 'ODU';
I get these results:
[id=1,name=Joe,nextid=3]
[id=3,name=Ane,nextid=NULL]
I want to write a query that does not need the static check for
where school = 'odu'
but gives back the same results as above. In other words, I want to select all results in the database and have them grouped correctly, as if I went through individually and ran queries for:
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'ODU';
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'VCU';
SELECT id, name, LEAD(id) OVER (ORDER BY id) as next_id FROM dbo.NerdsTable where school = 'VT';
Here is the output I am hoping to see:
[id=1,name=Joe,nextid=3]
[id=3,name=Ane,nextid=NULL]
[id=2,name=Mike,nextid=5]
[id=5,name=Cools,nextid=NULL]
[id=4,name=Trevor,nextid=NULL]
Here is what I have tried, but am failing miserably:
SELECT id, name,
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
ORDER BY school;
-- Problem, as this does not sort by the id. I need the lowest id first for the group
SELECT id, name,
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
ORDER BY id, school;
-- Sorts by id, but the grouping is not correct, thus next_id is wrong
I then looked on the Microsoft doc site for aggregate functions, but do not see how I can use any of them to group my results correctly. I tried to use GROUPING_ID, as follows:
SELECT id, GROUPING_ID(name),
LEAD(id) OVER (ORDER BY id) as next_id
FROM dbo.NerdsTable
group by school;
But I get an error:
is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
Any idea as to what I am missing here?
From your desired output it looks like you are just trying to order the records by school. You can do that like this:
SELECT id, name
FROM dbo.NerdsTable
ORDER BY school ASC, id ASC
I don't know what next ID is supposed to mean.
create table schools (id int, name varchar(50), school varchar(3))
insert into schools values (1, 'Joe', 'ODU'), (2, 'Mike', 'VCU'), (3, 'Ane',
'ODU'), (4, 'Trevor', 'VT'), (5, 'Cools', 'VCU'), (6, 'Sarah', 'VCU')
select n.id, n.name, min(g.id) nextid
from schools n
left join
(
select id, school
from schools
) g on g.school = n.school and g.id > n.id
group by n.id, n.name
drop table schools
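The self-join above finds the next id within the same school. The same result can probably be had by adding PARTITION BY school to the original LEAD() call. A minimal sketch, assuming SQLite via Python's sqlite3 module and the five sample rows from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NerdsTable (id INT, name TEXT, school TEXT)")
conn.executemany(
    "INSERT INTO NerdsTable VALUES (?, ?, ?)",
    [(1, "Joe", "ODU"), (2, "Mike", "VCU"), (3, "Ane", "ODU"),
     (4, "Trevor", "VT"), (5, "Cools", "VCU")],
)

nerds = conn.execute("""
    SELECT id, name,
           -- LEAD() restarts for every school, so next_id never crosses schools
           LEAD(id) OVER (PARTITION BY school ORDER BY id) AS next_id
    FROM NerdsTable
    ORDER BY school, id
""").fetchall()
for row in nerds:
    print(row)
# (1, 'Joe', 3)
# (3, 'Ane', None)
# (2, 'Mike', 5)
# (5, 'Cools', None)
# (4, 'Trevor', None)
```

The outer ORDER BY school, id only controls presentation; the grouping itself comes from the PARTITION BY inside the window.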

Adding a rank to first row of each group

This is returning what I want but is there a simpler, more elegant, approach?
IF OBJECT_ID('TEMPDB..#test') IS NOT NULL DROP TABLE #test;
CREATE TABLE #test
(
userAcc VARCHAR(100),
game VARCHAR(100),
amount INT
);
INSERT INTO #test
values
('jas', 'x', 10),
('jas', 'y', 100),
('jas', 'z', 20),
('sam', 'j', 10),
('sam', 'q', 5);
--initial table sample
SELECT userAcc,
game,
amount
FROM #test;
WITH
X AS
(
SELECT rn = ROW_NUMBER() OVER (PARTITION BY userAcc ORDER BY game),
userAcc,
game,
amount,
rk = RANK() OVER (PARTITION BY userAcc ORDER BY amount DESC)
FROM #test
),
Y AS
(
SELECT RK,userAcc,
game,
targ = rn
FROM X
WHERE rk = 1
)
SELECT X.userAcc,
X.game,
X.amount,
ISNULL(Y.targ,0)
FROM X
LEFT OUTER JOIN Y
ON
X.userAcc = Y.userAcc AND
X.rn = Y.rk
ORDER BY X.userAcc,X.rn;
It returns this:
Here is the initial table:
What the script is doing is this:
1. Add a new column to original table
2. In new column add the rank of the game for each userAcc with the highest amount.
3. The rank is the alphabetical position of the game with the highest amount amongst the user's games. So for jas his highest game is y and that is positioned 2nd amongst his games.
4. The rank found in step 3 should only go against the first alphabetical game of the respective user.
You don't need a join for this. You can use accumulation.
If I understand correctly:
select userAcc, game, amount,
isnull( (case when rn = 1
then max(case when rk = 1 then rn end) over (partition by userAcc)
end),0) as newcol
from (select t.*,
ROW_NUMBER() OVER (PARTITION BY userAcc ORDER BY game) as rn,
RANK() OVER (PARTITION BY userAcc ORDER BY amount DESC) as rk
from #test t
) t
order by userAcc, rn;
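To check the join-free version against the sample rows, here is a sketch assuming SQLite via Python's sqlite3 module, with COALESCE standing in for ISNULL and a plain table named test instead of the temp table #test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (userAcc TEXT, game TEXT, amount INT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("jas", "x", 10), ("jas", "y", 100), ("jas", "z", 20),
     ("sam", "j", 10), ("sam", "q", 5)],
)

ranked = conn.execute("""
    SELECT userAcc, game, amount,
           COALESCE(CASE WHEN rn = 1
                         -- alphabetical position of the user's highest-amount game
                         THEN MAX(CASE WHEN rk = 1 THEN rn END) OVER (PARTITION BY userAcc)
                    END, 0) AS newcol
    FROM (
        SELECT userAcc, game, amount,
               ROW_NUMBER() OVER (PARTITION BY userAcc ORDER BY game) AS rn,
               RANK() OVER (PARTITION BY userAcc ORDER BY amount DESC) AS rk
        FROM test
    ) AS t
    ORDER BY userAcc, rn
""").fetchall()
for row in ranked:
    print(row)
# ('jas', 'x', 10, 2)
# ('jas', 'y', 100, 0)
# ('jas', 'z', 20, 0)
# ('sam', 'j', 10, 1)
# ('sam', 'q', 5, 0)
```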

Get most commonly occurring value for each user id

I have a table with userIds and product categories prod. I want to get a table of unique userIds and their most frequently occurring product categories prod. In other words, I want to know which item categories each customer is buying the most. How can I achieve this in PL/SQL or Oracle SQL?
|userId|prod|
|------|----|
|123544|cars|
|123544|cars|
|123544|dogs|
|123544|cats|
|987689|bats|
|987689|cats|
I have already seen SO questions for getting the most common value of a column, but how do I get the most common value for each unique userId?
You should use just SQL to solve this. If you really need it in PL/SQL, just embed this query within PL/SQL.
(setup)
drop table yourtable;
create table yourtable (
userID number,
prod varchar2(10)
)
/
insert into yourtable values ( 123544, 'cars' );
insert into yourtable values ( 123544, 'cars' );
insert into yourtable values ( 123544, 'dogs' );
insert into yourtable values ( 123544, 'cats' );
insert into yourtable values ( 987689, 'bats' );
insert into yourtable values ( 987689, 'cats' );
commit;
-- assuming ties are not broken, this logic returns both ties
with w_grp as (
select userID, prod, count(*) over ( partition by userID, prod ) rgrp
from yourtable
),
w_rnk as (
select userID, prod, rgrp,
rank() over (partition by userID order by rgrp desc) rnk
from w_grp
)
select distinct userID, prod
from w_rnk
where rnk = 1
/
USERID PROD
---------- ----------
987689 bats
987689 cats
123544 cars
-- assuming you just want 1: this will return 1 random one if they are tied (i.e. this time it pulled 987689 bats, next time it might pull 987689 cats). It will always return 123544 cars, however, since there is no tie for that one.
with w_grp as (
select userID, prod, count(*) over ( partition by userID, prod ) rgrp
from yourtable
),
w_rnk as (
select userID, prod, rgrp,
row_number() over (partition by userID order by rgrp desc) rnum
from w_grp
)
select userID, prod, rnum
from w_rnk
where rnum = 1
/
USERID PROD RNUM
---------- ---------- ----------
123544 cars 1
987689 bats 1
[edit] Cleaned up unused rank/row_number from functions to avoid confusion [/edit]
SELECT user_id, prod, prod_cnt FROM (
SELECT user_id, prod, prod_cnt
, RANK() OVER ( PARTITION BY user_id ORDER BY prod_cnt DESC ) AS rn
FROM (
SELECT user_id, prod, COUNT(*) AS prod_cnt
FROM mytable
GROUP BY user_id, prod
)
) WHERE rn = 1;
In the innermost subquery I am getting the COUNT of each product by user. Then I rank them using the analytic (window) function RANK(). Then I simply select all of those where the RANK is equal to 1. Using RANK() instead of ROW_NUMBER() ensures that ties will be returned.
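The difference between RANK() and ROW_NUMBER() on ties can be seen side by side. This is only a sketch under assumptions: SQLite via Python's sqlite3 module rather than Oracle, a table named purchases, and prod added as a tie-breaker so the single-row variant is deterministic instead of random.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (userId INT, prod TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(123544, "cars"), (123544, "cars"), (123544, "dogs"),
     (123544, "cats"), (987689, "bats"), (987689, "cats")],
)

# Count per user and product, then rank the counts within each user.
ranked_sql = """
    SELECT userId, prod, prod_cnt,
           RANK() OVER (PARTITION BY userId ORDER BY prod_cnt DESC) AS rnk,
           ROW_NUMBER() OVER (PARTITION BY userId ORDER BY prod_cnt DESC, prod) AS rnum
    FROM (
        SELECT userId, prod, COUNT(*) AS prod_cnt
        FROM purchases
        GROUP BY userId, prod
    ) AS counted
"""

# RANK() keeps every product tied for first place.
with_ties = conn.execute(
    f"SELECT userId, prod, prod_cnt FROM ({ranked_sql}) AS r WHERE rnk = 1 ORDER BY userId, prod"
).fetchall()

# ROW_NUMBER() keeps exactly one product per user.
one_each = conn.execute(
    f"SELECT userId, prod, prod_cnt FROM ({ranked_sql}) AS r WHERE rnum = 1 ORDER BY userId"
).fetchall()

print(with_ties)  # [(123544, 'cars', 2), (987689, 'bats', 1), (987689, 'cats', 1)]
print(one_each)   # [(123544, 'cars', 2), (987689, 'bats', 1)]
```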