SQL: detecting consecutive blocks of sequential rows with same key - sql

My problem boils down to the following. I have a table with some natural sequencing, and in it I have a key value which may repeat over time. I want to find the blocks where the key is the same, then changes, and then comes back to being the same. Example:
A
A
B
B
B
C
C
A
A
C
C
Here I want the result to be
A, 1-2
B, 3-5
C, 6-7
A, 8-9
C, 10-11
So I can't simply group by the key value (A, B, C), because the same key can appear in several separate blocks. I just want to collapse each uninterrupted run of the same key into one row.
Needless to say, I want the simplest SQL one can come up with. It would use OLAP window functions.
I am usually pretty good with complicated SQL, but with sequences I am not so good. I will work on this a little bit myself, of course, and annex some ideas below this question in a subsequent edit.
Let's begin by defining the table for our discussion:
CREATE TABLE Seq (
num integer,
key char
);
UPDATE 1: doing some research I find a similar question here: How to find consecutive rows based on the value of a column? but both the question and the answers are wrapped up into a lot of extra stuff and confusing.
UPDATE 2: I already got one answer, thanks. Inspecting it now. Here is my test I am typing into PostgreSQL even as we speak:
CREATE TABLE Seq ( num int, key char );
INSERT INTO Seq VALUES
(1, 'A'), (2, 'A'),
(2, 'B'), (3, 'B'), (5, 'B'),
(6, 'C'), (7, 'C'),
(8, 'A'), (9, 'A'),
(10, 'C'), (11, 'C');
UPDATE 3: First contender of a solution is this
SELECT key, min(num), max(num)
FROM (
SELECT seq.*,
row_number() over (partition by key order by num) as seqnum
FROM Seq
) s
GROUP BY key, (num - seqnum)
ORDER BY min;
yields:
key | min | max
-----+-----+-----
A | 1 | 2
B | 2 | 3
B | 5 | 5
C | 6 | 7
A | 8 | 9
C | 10 | 11
(6 rows)
B appears twice because I made a "mistake" in my test data: I skipped sequence num 4 and went straight from 3 to 5.
This mistake is fortunate, because it allows me to point out that while in this example the sequence number is discrete, I am intending the sequence to arise from some continuous domain (e.g., time).
There is another "mistake" I made, in that I have num 2 repeated. Is that allowable? Probably not. So cleaning up the example, removing duplicate but leaving the gap:
DROP TABLE Seq;
CREATE TABLE Seq ( num int, key char );
INSERT INTO Seq VALUES
(1, 'A'), (2, 'A'),
(3, 'B'), (4, 'B'), (6, 'B'),
(7, 'C'), (8, 'C'),
(9, 'A'), (10, 'A'),
(11, 'C'), (12, 'C');
this still leaves us with the duplicate B block:
key | min | max
-----+-----+-----
A | 1 | 2
B | 3 | 4
B | 6 | 6
C | 7 | 8
A | 9 | 10
C | 11 | 12
(6 rows)
Now going with that first intuition by Gordon Linoff and trying to understand it and add to it:
SELECT s.*, num - seqnum AS diff
FROM (
SELECT seq.*,
row_number() over (partition by key order by num) as seqnum
FROM Seq
) s
ORDER BY num;
here is the num - seqnum trick before grouping:
num | key | seqnum | diff
-----+-----+--------+------
1 | A | 1 | 0
2 | A | 2 | 0
3 | B | 1 | 2
4 | B | 2 | 2
6 | B | 3 | 3
7 | C | 1 | 6
8 | C | 2 | 6
9 | A | 3 | 6
10 | A | 4 | 6
11 | C | 3 | 8
12 | C | 4 | 8
(11 rows)
I doubt that this is the answer quite yet.

Because of the gaps, you can't subtract from num directly as Gordon's solution does. Replace num with a second row_number() computed over the whole table:
select key, min(num), max(num)
from (select seq.*,
row_number() over (order by num) as rn,
row_number() over (partition by key order by num) as seqnum
from seq
) s
group by key, (rn - seqnum)
order by min(num);
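To check the two-row_number() query against the cleaned-up test data, here is a minimal sketch that runs it in SQLite from Python. SQLite stands in for PostgreSQL here (an assumption; it needs SQLite 3.25 or later for window functions), and the names find_blocks, first_num and last_num are made up for the illustration.

```python
import sqlite3

# Cleaned-up sample data from the question (num 5 is missing on purpose).
ROWS = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (6, "B"),
        (7, "C"), (8, "C"), (9, "A"), (10, "A"), (11, "C"), (12, "C")]

# rn counts rows over the whole table, seqnum counts rows within one key.
# Their difference stays constant for as long as the key does not change.
QUERY = """
SELECT "key", MIN(num) AS first_num, MAX(num) AS last_num
FROM (
    SELECT num, "key",
           ROW_NUMBER() OVER (ORDER BY num) AS rn,
           ROW_NUMBER() OVER (PARTITION BY "key" ORDER BY num) AS seqnum
    FROM seq
) AS s
GROUP BY "key", rn - seqnum
ORDER BY first_num
"""


def find_blocks(rows):
    """Return (key, first_num, last_num) for each uninterrupted run of a key."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute('CREATE TABLE seq (num INTEGER, "key" TEXT)')
        conn.executemany("INSERT INTO seq VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


blocks = find_blocks(ROWS)
for key, first_num, last_num in blocks:
    print(f"{key}, {first_num}-{last_num}")
```

With this data the B rows come back as a single block spanning 3-6, even though num 5 is missing.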

This answers the original problem.
You can enumerate the rows for each key and subtract that from num. Voila! This number is constant when the key is constant on adjacent rows:
select key, min(num), max(num)
from (select seq.*,
row_number() over (partition by key order by num) as seqnum
from seq
) s
group by key, (num - seqnum);
Here is a db<>fiddle showing that it works.

Related

Selecting only top parent table row with all of its children table rows

So I have two tables:
#ProjectHealthReports
Id | From | SubmittedOn
1 | 2020-01-01 |
2 | 2020-02-01 | 2020-10-23
3 | 2020-03-01 |
4 | 2020-04-01 | 2020-10-23
5 | 2020-05-01 | 2020-10-23
#ProjectHealthReportItems
Id | Note | ProjectHealthReportId
1 | First for 2020-01-01 | 1
2 | Second for 2020-01-01 | 1
3 | First for 2020-02-01 | 2
4 | Second for 2020-02-01 | 2
5 | First for 2020-03-01 | 3
6 | Second for 2020-03-01 | 3
7 | First for 2020-04-01 | 4
8 | Second for 2020-04-01 | 4
9 | (We want this one) First for 2020-05-01 | 5
10 | (We want this one) Second for 2020-05-01 | 5
How can I get all #ProjectHealthReportItems and #ProjectHealthReport details for the last From date which has value for SubmittedOn (so in this case it would be ProjectHealthReport 5 and ProjectHealthReportItems 9, 10).
Basically, I need something like the query below, but obviously without top 1, since that returns only one row and in this case I need two rows :)
select top 1 phr.Id, phr.[From], phr.SubmittedOn, phri.Note from #ProjectHealthReports phr
inner join #ProjectHealthReportItems phri on phr.Id = phri.ProjectHealthReportId
where phr.SubmittedOn is not null
order by phr.[From] desc
Here is the SQL for creating and seeding the tables
create table #ProjectHealthReports(
Id int primary key,
[From] date not null ,
SubmittedOn date null
)
go
create table #ProjectHealthReportItems(
Id int primary key,
Note nvarchar(max),
ProjectHealthReportId int constraint FK_PHR references #ProjectHealthReports
)
go
insert into #ProjectHealthReports(Id, [From], SubmittedOn)
values (1, '2020-01-01', null),
(2, '2020-02-01', getutcdate()),
(3, '2020-03-01', null),
(4, '2020-04-01', getutcdate()),
(5, '2020-05-01', getutcdate())
go
insert into #ProjectHealthReportItems(Id, Note, ProjectHealthReportId)
values (1, 'First for 2020-01-01', 1),
(2, 'Second for 2020-01-01', 1),
(3, 'First for 2020-02-01', 2),
(4, 'Second for 2020-02-01', 2),
(5, 'First for 2020-03-01', 3),
(6, 'Second for 2020-03-01', 3),
(7, 'First for 2020-04-01', 4),
(8, 'Second for 2020-04-01', 4),
(9, '(We want this one) First for 2020-05-01', 5),
(10, '(We want this one) Second for 2020-05-01', 5)
go
First select the top row, then join:
select t.*, phri.Note
from (select top(1) phr.Id phrid, phr.[From], phr.SubmittedOn
from #ProjectHealthReports phr
where phr.SubmittedOn is not null
order by phr.[From] desc) t
inner join #ProjectHealthReportItems phri on t.phrId = phri.ProjectHealthReportId
I would suggest window functions:
select phr.*, phri.*
from #ProjectHealthReports phr left join
(select phri.*,
row_number() over (partition by ProjectHealthReportId order by id desc) as seqnum
from #ProjectHealthReportItems phri
) phri
on phr.Id = phri.ProjectHealthReportId and seqnum = 1
order by phr.[From] desc;
You can also do this by filtering in the where clause, for example with a correlated subquery:
select phr.*, phri.*
from #ProjectHealthReports phr join
#ProjectHealthReportItems phri
on phr.Id = phri.ProjectHealthReportId
where phri.id = (select max(phri2.id)
from #ProjectHealthReportItems phri2
where phri2.ProjectHealthReportId = phri.ProjectHealthReportId
)
order by phr.[From] desc;
An efficient way to do this without a LEFT JOIN would be to assign a row number to the submitted rows of the #ProjectHealthReports table, using the ROW_NUMBER() window function. Something like this:
with lv_cte as (
select *, row_number() over (order by [From] desc) rn
from #ProjectHealthReports
where SubmittedOn is not null)
select l.*, phri.*
from lv_cte l
join #ProjectHealthReportItems phri on l.id=phri.ProjectHealthReportId
where l.rn=1;
Output
Id | From | SubmittedOn | rn | Id | Note | ProjectHealthReportId
5 | 2020-05-01 | 2020-10-23 | 1 | 9 | (We want this one) First for 2020-05-01 | 5
5 | 2020-05-01 | 2020-10-23 | 1 | 10 | (We want this one) Second for 2020-05-01 | 5
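The row-number idea can be tried outside SQL Server as well. The sketch below runs it in SQLite from Python, so the tables are ordinary tables rather than # temp tables, and the SubmittedOn filter from the question is applied inside the CTE. Report 6 and its two items are a hypothetical addition (not in the question) that shows why the filter matters: it has the latest From date but was never submitted.

```python
import sqlite3

# Reports from the question, plus a hypothetical unsubmitted report 6.
REPORTS = [
    (1, "2020-01-01", None),
    (2, "2020-02-01", "2020-10-23"),
    (3, "2020-03-01", None),
    (4, "2020-04-01", "2020-10-23"),
    (5, "2020-05-01", "2020-10-23"),
    (6, "2020-06-01", None),  # hypothetical: latest From, never submitted
]

# Two items per report, numbered 1..12 in report order.
ITEMS = []
for report_id, from_date, _ in REPORTS:
    for ordinal in ("First", "Second"):
        ITEMS.append((len(ITEMS) + 1, f"{ordinal} for {from_date}", report_id))

QUERY = """
WITH latest AS (
    SELECT Id, "From", SubmittedOn,
           ROW_NUMBER() OVER (ORDER BY "From" DESC) AS rn
    FROM ProjectHealthReports
    WHERE SubmittedOn IS NOT NULL
)
SELECT l.Id, l."From", l.SubmittedOn, i.Id, i.Note
FROM latest AS l
JOIN ProjectHealthReportItems AS i ON i.ProjectHealthReportId = l.Id
WHERE l.rn = 1
ORDER BY i.Id
"""


def latest_submitted_report_items():
    """Return the items of the most recent report that has a SubmittedOn value."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute('CREATE TABLE ProjectHealthReports '
                     '(Id INTEGER PRIMARY KEY, "From" TEXT NOT NULL, SubmittedOn TEXT)')
        conn.execute('CREATE TABLE ProjectHealthReportItems '
                     '(Id INTEGER PRIMARY KEY, Note TEXT, ProjectHealthReportId INTEGER)')
        conn.executemany("INSERT INTO ProjectHealthReports VALUES (?, ?, ?)", REPORTS)
        conn.executemany("INSERT INTO ProjectHealthReportItems VALUES (?, ?, ?)", ITEMS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = latest_submitted_report_items()
for row in result:
    print(row)
```

Dates are stored as ISO-8601 text here, which sorts correctly as plain strings in SQLite.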

Roll up multiple rows into one when joining in SQL Server

I have a table, Foo
ID | Name
-----------
1 | ONE
2 | TWO
3 | THREE
And another, Bar:
ID | FooID | Value
------------------
1 | 1 | Alpha
2 | 1 | Alpha
3 | 1 | Alpha
4 | 2 | Beta
5 | 2 | Gamma
6 | 2 | Beta
7 | 3 | Delta
8 | 3 | Delta
9 | 3 | Delta
I would like a query that joins these tables, returning one row for each row in Foo, rolling up the 'value' column from Bar. I can get back the first Bar.Value for each FooID:
SELECT * FROM Foo f OUTER APPLY
(
SELECT TOP 1 Value FROM Bar WHERE FooId = f.ID
) AS b
Giving:
ID | Name | Value
---------------------
1 | ONE | Alpha
2 | TWO | Beta
3 | THREE | Delta
But that's not what I want, and I haven't been able to find a variant that will bring back a rolled up value, that is the single Bar.Value if it is the same for each corresponding Foo, or a static string something like '(multiple)' if not:
ID | Name | Value
---------------------
1 | ONE | Alpha
2 | TWO | (multiple)
3 | THREE | Delta
I have found some solutions that would bring back concatenated values (albeit not very elegantly), such as 'Alpha, Alpha, Alpha' or 'Beta, Gamma, Beta', but that's not what I want either.
One method, using a CASE expression and assuming that [Value] cannot have a value of NULL:
WITH Foo AS
(SELECT *
FROM (VALUES (1, 'ONE'),
(2, 'TWO'),
(3, 'THREE')) V (ID, [Name])),
Bar AS
(SELECT *
FROM (VALUES (1, 1, 'Alpha'),
(2, 1, 'Alpha'),
(3, 1, 'Alpha'),
(4, 2, 'Beta'),
(5, 2, 'Gamma'),
(6, 2, 'Beta'),
(7, 3, 'Delta'),
(8, 3, 'Delta'),
(9, 3, 'Delta')) V (ID, FooID, [Value]))
SELECT F.ID,
F.[Name],
CASE COUNT(DISTINCT B.[Value]) WHEN 1 THEN MAX(B.Value) ELSE '(Multiple)' END AS [Value]
FROM Foo F
JOIN Bar B ON F.ID = B.FooID
GROUP BY F.ID,
F.[Name];
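A variant of the same idea, sketched in SQLite from Python: instead of COUNT(DISTINCT ...), compare MIN and MAX of the group, which are equal exactly when all non-NULL values in the group are the same. This is a swapped-in formulation, not the query above, and it keeps the same assumptions (no NULL in Value, and Foo rows without any Bar rows are dropped by the inner join). The function name roll_up is made up for the example.

```python
import sqlite3

FOO = [(1, "ONE"), (2, "TWO"), (3, "THREE")]
BAR = [(1, 1, "Alpha"), (2, 1, "Alpha"), (3, 1, "Alpha"),
       (4, 2, "Beta"), (5, 2, "Gamma"), (6, 2, "Beta"),
       (7, 3, "Delta"), (8, 3, "Delta"), (9, 3, "Delta")]

# MIN = MAX holds only when every Value in the group is identical.
QUERY = """
SELECT f.ID, f.Name,
       CASE WHEN MIN(b.Value) = MAX(b.Value) THEN MIN(b.Value)
            ELSE '(multiple)'
       END AS Value
FROM Foo AS f
JOIN Bar AS b ON b.FooID = f.ID
GROUP BY f.ID, f.Name
ORDER BY f.ID
"""


def roll_up():
    """Return one row per Foo with the rolled-up Bar value."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Foo (ID INTEGER, Name TEXT)")
        conn.execute("CREATE TABLE Bar (ID INTEGER, FooID INTEGER, Value TEXT)")
        conn.executemany("INSERT INTO Foo VALUES (?, ?)", FOO)
        conn.executemany("INSERT INTO Bar VALUES (?, ?, ?)", BAR)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


rolled = roll_up()
for row in rolled:
    print(row)
```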
You can also try below:
SELECT F.ID, F.Name, (case when B.Value like '%,%' then '(Multiple)' else B.Value end) as Value
FROM Foo F
outer apply
(
select SUBSTRING((
SELECT distinct ', '+ isnull(Value,',') FROM Bar WHERE FooId = F.ID
FOR XML PATH('')
), 3 , 9999) as Value
) as B

Creating column for every group in group by

Suppose I have a table T which has entries as follows:
id | type | value |
-------------------------
1 | A | 7
1 | B | 8
2 | A | 9
2 | B | 10
3 | A | 11
3 | B | 12
1 | C | 13
2 | C | 14
For each type, I want a different column. Since the number of types is exhaustive, I would like all different types to be enumerated and a corresponding column for each. I wanted to make id a primary key for the table.
So, the desired output is something like:
id | A's value | B's value | C's value
------------------------------------------
1 | 7 | 8 | 13
2 | 9 | 10 | 14
3 | 11 | 12 | NULL
Please note that this is a simplified version. The actual table T is derived from a much bigger table using group by. And for each group, I would like a separate column. Is that even possible?
Use conditional aggregation:
select id,
max(case when type = 'A' then value end) as a_value,
max(case when type = 'B' then value end) as b_value,
max(case when type = 'C' then value end) as c_value
from t
group by id;
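Here is a small sketch that runs this conditional aggregation against the sample table, using SQLite from Python as a stand-in for whatever database the question targets (an assumption). Each CASE expression yields NULL for rows of other types, and MAX ignores NULLs, so an id with no row for a type ends up with NULL in that column.

```python
import sqlite3

# Sample rows from the question: (id, type, value).
T_ROWS = [(1, "A", 7), (1, "B", 8), (2, "A", 9), (2, "B", 10),
          (3, "A", 11), (3, "B", 12), (1, "C", 13), (2, "C", 14)]

QUERY = """
SELECT id,
       MAX(CASE WHEN type = 'A' THEN value END) AS a_value,
       MAX(CASE WHEN type = 'B' THEN value END) AS b_value,
       MAX(CASE WHEN type = 'C' THEN value END) AS c_value
FROM t
GROUP BY id
ORDER BY id
"""


def pivot_types():
    """Return one row per id with a column for each of the types A, B and C."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (id INTEGER, type TEXT, value INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?, ?, ?)", T_ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


pivoted = pivot_types()
for row in pivoted:
    print(row)
```

Python shows the missing C value for id 3 as None, which corresponds to the NULL in the desired output.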
I'd recommend looking into the PIVOT function:
https://docs.snowflake.com/en/sql-reference/constructs/pivot.html
The main blocker with this function, though, is that the list of values for the pivot_column needs to be predetermined. To build that list, I normally use the LISTAGG function:
https://docs.snowflake.com/en/sql-reference/functions/listagg.html
I've included a query below to show you how to build that string. Doing this together in a script (in Python, say, or even a stored procedure) should be fairly straightforward: build the pivot_column list, build the aggregate/pivot command, then execute it.
I hope this helps...Rich
CREATE OR REPLACE TABLE monthly_sales(
empid INT,
amount INT,
month TEXT)
AS SELECT * FROM VALUES
(1, 10000, 'JAN'),
(1, 400, 'JAN'),
(2, 4500, 'JAN'),
(2, 35000, 'JAN'),
(1, 5000, 'FEB'),
(1, 3000, 'FEB'),
(2, 200, 'FEB'),
(2, 90500, 'FEB'),
(1, 6000, 'MAR'),
(1, 5000, 'MAR'),
(2, 2500, 'MAR'),
(2, 9500, 'MAR'),
(1, 8000, 'APR'),
(1, 10000, 'APR'),
(2, 800, 'APR'),
(2, 4500, 'APR');
SELECT *
FROM monthly_sales
PIVOT(SUM(amount)
FOR month IN ('JAN', 'FEB', 'MAR', 'APR'))
AS p
ORDER BY empid;
SELECT LISTAGG( DISTINCT ''''||month||'''', ', ' )
FROM monthly_sales;
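The build-then-execute steps might look roughly like this in Python. This is only a sketch under stated assumptions: it runs against SQLite rather than Snowflake, and since SQLite has no PIVOT it generates conditional-aggregation columns instead of a PIVOT clause. The helper names quote_identifier and pivot_monthly_sales are made up for the illustration.

```python
import sqlite3

SALES = [
    (1, 10000, "JAN"), (1, 400, "JAN"), (2, 4500, "JAN"), (2, 35000, "JAN"),
    (1, 5000, "FEB"), (1, 3000, "FEB"), (2, 200, "FEB"), (2, 90500, "FEB"),
    (1, 6000, "MAR"), (1, 5000, "MAR"), (2, 2500, "MAR"), (2, 9500, "MAR"),
    (1, 8000, "APR"), (1, 10000, "APR"), (2, 800, "APR"), (2, 4500, "APR"),
]


def quote_identifier(name):
    """Quote a value for use as a column alias (doubles embedded quotes)."""
    return '"' + name.replace('"', '""') + '"'


def pivot_monthly_sales(conn):
    # Step 1: discover the pivot values.
    months = [row[0] for row in conn.execute(
        "SELECT DISTINCT month FROM monthly_sales ORDER BY month")]
    # Step 2: build one aggregate column per value; the value itself is
    # passed as a bound parameter, only the alias is spliced into the SQL.
    columns = ", ".join(
        f"SUM(CASE WHEN month = ? THEN amount END) AS {quote_identifier(m)}"
        for m in months)
    query = (f"SELECT empid, {columns} FROM monthly_sales "
             "GROUP BY empid ORDER BY empid")
    # Step 3: execute the generated statement.
    cursor = conn.execute(query, months)
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (empid INTEGER, amount INTEGER, month TEXT)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", SALES)
pivot_rows = pivot_monthly_sales(conn)
conn.close()
for row in pivot_rows:
    print(row)
```

Returning dictionaries keyed by column name keeps the caller independent of the order in which the month columns were generated.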

SELECT check the column of the max row

Here are the rows returned by my first select:
SELECT
user.id, analytic_youtube_demographic.age,
analytic_youtube_demographic.percent
FROM
`user`
INNER JOIN
analytic ON analytic.user_id = user.id
INNER JOIN
analytic_youtube_demographic ON analytic_youtube_demographic.analytic_id = analytic.id
Result:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |13-17| 19.6 |
| 1 |18-24| 38.4 |
| 1 |25-34| 22.5 |
| 1 |35-44| 11.5 |
| 1 |45-54| 5.3 |
| 1 |55-64| 1.6 |
| 1 |65+ | 1.2 |
| 2 |13-17| 10 |
| 2 |18-24| 10 |
| 2 |25-34| 25 |
| 2 |35-44| 5 |
| 2 |45-54| 25 |
| 2 |55-64| 5 |
| 2 |65+ | 20 |
---------------------------
The max value by user_id:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |18-24| 38.4 |
| 2 |45-54| 25 |
| 2 |25-34| 25 |
---------------------------
I then need to keep only the rows whose Age is in ['25-34', '65+'].
So at the end I must have:
-----------
| id |
|----------
| 2 |
-----------
Thanks a lot for your help.
I have tried to use MAX(analytic_youtube_demographic.percent), but I don't know how to filter on the age as well.
You can use the rank() function to identify the largest percentage values within each user's data set, and then a simple WHERE clause to get those entries that are both of the highest rank and belong to one of the specific demographics you're interested in. Since you can't use windowed functions like rank() in a WHERE clause, this is a two-step process with a subquery or a CTE. Something like this ought to do it:
-- Sample data from the question:
create table [user] (id bigint);
insert [user] values
(1), (2);
create table analytic (id bigint, [user_id] bigint);
insert analytic values
(1, 1), (2, 2);
create table analytic_youtube_demographic (analytic_id bigint, age varchar(32), [percent] decimal(5, 2));
insert analytic_youtube_demographic values
(1, '13-17', 19.6),
(1, '18-24', 38.4),
(1, '25-34', 22.5),
(1, '35-44', 11.5),
(1, '45-54', 5.3),
(1, '55-64', 1.6),
(1, '65+', 1.2),
(2, '13-17', 10),
(2, '18-24', 10),
(2, '25-34', 25),
(2, '35-44', 5),
(2, '45-54', 25),
(2, '55-64', 5),
(2, '65+', 20);
-- First, within the set of records for each user.id, use the rank() function to
-- identify the demographics with the highest percentage.
with RankedDataCTE as
(
select
[user].id,
youtube.age,
youtube.[percent],
[rank] = rank() over (partition by [user].id order by youtube.[percent] desc)
from
[user]
inner join analytic on analytic.[user_id] = [user].id
inner join analytic_youtube_demographic youtube on youtube.analytic_id = analytic.id
)
-- Now select only those records that are (a) of the highest rank within their
-- user.id and (b) either the '25-34' or the '65+' age group.
select
id,
age,
[percent]
from
RankedDataCTE
where
[rank] = 1 and
age in ('25-34', '65+');
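For a quick check outside SQL Server, the same two-step logic can be run in SQLite from Python. This is a sketch, not the T-SQL above: the bracketed identifiers become plain or double-quoted names, the alias rnk is made up, and the final select returns only the distinct user ids, which is what the question asks for at the end.

```python
import sqlite3

# (analytic_id, age, percent), with analytic 1 -> user 1 and analytic 2 -> user 2.
DEMOGRAPHICS = [
    (1, "13-17", 19.6), (1, "18-24", 38.4), (1, "25-34", 22.5), (1, "35-44", 11.5),
    (1, "45-54", 5.3), (1, "55-64", 1.6), (1, "65+", 1.2),
    (2, "13-17", 10), (2, "18-24", 10), (2, "25-34", 25), (2, "35-44", 5),
    (2, "45-54", 25), (2, "55-64", 5), (2, "65+", 20),
]

# RANK() gives every tied top percentage the rank 1, so user 2 keeps
# both '25-34' and '45-54' before the age filter is applied.
QUERY = """
WITH ranked AS (
    SELECT u.id, d.age, d.percent,
           RANK() OVER (PARTITION BY u.id ORDER BY d.percent DESC) AS rnk
    FROM "user" AS u
    JOIN analytic AS a ON a.user_id = u.id
    JOIN analytic_youtube_demographic AS d ON d.analytic_id = a.id
)
SELECT DISTINCT id
FROM ranked
WHERE rnk = 1 AND age IN ('25-34', '65+')
ORDER BY id
"""


def users_with_top_age_in_filter():
    """Return ids of users whose top-ranked age group is '25-34' or '65+'."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute('CREATE TABLE "user" (id INTEGER)')
        conn.execute("CREATE TABLE analytic (id INTEGER, user_id INTEGER)")
        conn.execute("CREATE TABLE analytic_youtube_demographic "
                     "(analytic_id INTEGER, age TEXT, percent REAL)")
        conn.executemany('INSERT INTO "user" VALUES (?)', [(1,), (2,)])
        conn.executemany("INSERT INTO analytic VALUES (?, ?)", [(1, 1), (2, 2)])
        conn.executemany("INSERT INTO analytic_youtube_demographic VALUES (?, ?, ?)",
                         DEMOGRAPHICS)
        return [row[0] for row in conn.execute(QUERY)]
    finally:
        conn.close()


user_ids = users_with_top_age_in_filter()
print(user_ids)
```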

Count Based on Columns in SQL Server

I have 3 tables:
SELECT id, letter
FROM As
+--------+--------+
| id | letter |
+--------+--------+
| 1 | A |
| 2 | B |
+--------+--------+
SELECT id, letter
FROM Xs
+--------+------------+
| id | letter |
+--------+------------+
| 1 | X |
| 2 | Y |
| 3 | Z |
+--------+------------+
SELECT id, As_id, Xs_id
FROM A_X
+--------+-------+-------+
| id | As_id | Xs_id |
+--------+-------+-------+
| 9 | 1 | 1 |
| 10 | 1 | 2 |
| 11 | 2 | 3 |
| 12 | 1 | 2 |
| 13 | 2 | 3 |
| 14 | 1 | 1 |
+--------+-------+-------+
I can count all As and Bs with group by. But I want to count As and Bs based on X,Y and Z. What I want to get is below:
+-------+
| X,Y,Z |
+-------+
| 2,2,0 |
| 0,0,2 |
+-------+
X,Y,Z
A 2,2,0
B 0,0,2
What is the best way to do this in MSSQL? Is using a foreach, for example, an efficient way?
edit: It is not a duplicate, because I wanted to know the efficient way, not just any way.
Without knowing what is inefficient about your current code (none was provided), a pivot is the best fit for what you're trying to do. There are plenty of resources online and here in the Stack Overflow Q/A forums to find what you need. This is probably the simplest explanation of a pivot, which I often reread to remind myself of its complicated syntax.
To specifically answer your question, this is the code that shows how the link above applies to your question
First, the tables need to be created:
DECLARE @AS AS TABLE (ID INT, LETTER VARCHAR(1))
DECLARE @XS AS TABLE (ID INT, LETTER VARCHAR(1))
DECLARE @XA AS TABLE (ID INT, AsID INT, XsID INT)
Then values are added to the tables:
INSERT INTO @AS (ID, Letter)
SELECT 1,'A'
UNION
SELECT 2,'B'
INSERT INTO @XS (ID, Letter)
SELECT 1,'X'
UNION
SELECT 2,'Y'
UNION
SELECT 3,'Z'
INSERT INTO @XA (ID, ASID, XSID)
SELECT 9,1,1
UNION
SELECT 10,1,2
UNION
SELECT 11,2,3
UNION
SELECT 12,1,2
UNION
SELECT 13,2,3
UNION
SELECT 14,1,1
Then the query which does the pivot is constructed:
SELECT LetterA, [X],[Y],[Z]
FROM (SELECT A.LETTER AS LetterA
,B.LETTER AS LetterX
,C.ID
FROM @XA C
JOIN @AS A
ON A.ID = C.ASID
JOIN @XS B
ON B.ID = C.XSID
) Src
PIVOT (COUNT(ID)
FOR LetterX IN ([X],[Y],[Z])
) AS PVT
When executed, your results are as follows:
LetterA X Y Z
A 2 2 0
B 0 0 2
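For comparison, the same counts can be produced without PIVOT by summing a 0/1 flag per target letter. The sketch below does that in SQLite from Python; it is a swapped-in technique rather than the PIVOT query, and the table names AAs, XXs and A_X are stand-ins chosen because As is a reserved word.

```python
import sqlite3

A_ROWS = [(1, "A"), (2, "B")]
X_ROWS = [(1, "X"), (2, "Y"), (3, "Z")]
LINKS = [(9, 1, 1), (10, 1, 2), (11, 2, 3), (12, 1, 2), (13, 2, 3), (14, 1, 1)]

# Each SUM adds 1 for every link row whose X-side letter matches the column.
QUERY = """
SELECT a.letter AS LetterA,
       SUM(CASE WHEN x.letter = 'X' THEN 1 ELSE 0 END) AS X,
       SUM(CASE WHEN x.letter = 'Y' THEN 1 ELSE 0 END) AS Y,
       SUM(CASE WHEN x.letter = 'Z' THEN 1 ELSE 0 END) AS Z
FROM A_X AS ax
JOIN AAs AS a ON a.id = ax.As_id
JOIN XXs AS x ON x.id = ax.Xs_id
GROUP BY a.letter
ORDER BY a.letter
"""


def count_links():
    """Return (LetterA, X, Y, Z) counts, one row per letter in AAs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE AAs (id INTEGER, letter TEXT)")
        conn.execute("CREATE TABLE XXs (id INTEGER, letter TEXT)")
        conn.execute("CREATE TABLE A_X (id INTEGER, As_id INTEGER, Xs_id INTEGER)")
        conn.executemany("INSERT INTO AAs VALUES (?, ?)", A_ROWS)
        conn.executemany("INSERT INTO XXs VALUES (?, ?)", X_ROWS)
        conn.executemany("INSERT INTO A_X VALUES (?, ?, ?)", LINKS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


counts = count_links()
for row in counts:
    print(row)
```

Because the query starts from A_X with inner joins, a letter in AAs that has no link rows at all would not appear in the result.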
As I said in the comments: just join and do a simple pivot.
if object_id('tempdb..#AAs') is not null drop table #AAs
create table #AAs(id int, letter nvarchar(5))
if object_id('tempdb..#XXs') is not null drop table #XXs
create table #XXs(id int, letter nvarchar(5))
if object_id('tempdb..#A_X') is not null drop table #A_X
create table #A_X(id int, AAs int, XXs int)
insert into #AAs (id, letter) values (1, 'A'), (2, 'B')
insert into #XXs (id, letter) values (1, 'X'), (2, 'Y'), (3, 'Z')
insert into #A_X (id, AAs, XXs)
values (9, 1, 1),
(10, 1, 2),
(11, 2, 3),
(12, 1, 2),
(13, 2, 3),
(14, 1, 1)
select LetterA,
ISNULL([X], 0) [X],
ISNULL([Y], 0) [Y],
ISNULL([Z], 0) [Z]
from (
select distinct a.letter [LetterA], x.letter [LetterX],
count(*) over (partition by a.letter, x.letter order by a.letter) [Counted]
from #A_X ax
join #AAs A on ax.AAs = A.ID
join #XXs X on ax.XXs = X.ID
)src
PIVOT
(
MAX ([Counted]) for LetterX in ([X], [Y], [Z])
) piv
You get the result you asked for:
LetterA X Y Z
A 2 2 0
B 0 0 2