SQL Collapse Data

I am trying to collapse data that is in a sequence sorted by date, while grouping on the person and the type.
The data is stored in SQL Server and looks like the following -
seq person date type
--- ------ ------------------- ----
1 1 2018-02-10 08:00:00 1
2 1 2018-02-11 08:00:00 1
3 1 2018-02-12 08:00:00 1
4 1 2018-02-14 16:00:00 1
5 1 2018-02-15 16:00:00 1
6 1 2018-02-16 16:00:00 1
7 1 2018-02-20 08:00:00 2
8 1 2018-02-21 08:00:00 2
9 1 2018-02-22 08:00:00 2
10 1 2018-02-23 08:00:00 1
11 1 2018-02-24 08:00:00 1
12 1 2018-02-25 08:00:00 2
13 2 2018-02-10 08:00:00 1
14 2 2018-02-11 08:00:00 1
15 2 2018-02-12 08:00:00 1
16 2 2018-02-14 16:00:00 3
17 2 2018-02-15 16:00:00 3
18 2 2018-02-16 16:00:00 3
This data set contains about 1.2 million records that resemble the above.
The result that I would like to get from this would be -
person start type
------ ------------------- ----
1 2018-02-10 08:00:00 1
1 2018-02-20 08:00:00 2
1 2018-02-23 08:00:00 1
1 2018-02-25 08:00:00 2
2 2018-02-10 08:00:00 1
2 2018-02-14 16:00:00 3
I have the data in the first format by running the following query -
select
    ROW_NUMBER() OVER (ORDER BY date) AS seq,
    person,
    date,
    type
from table
group by person, date, type
I am just not sure how to keep the minimum date with the other distinct values from person and type.

This is a gaps-and-islands problem, so you can use the difference of two row_number() values and include it in the grouping:
select person, min(date) as start, type
from (select *,
             row_number() over (partition by person order by seq) seq1,
             row_number() over (partition by person, type order by seq) seq2
      from table
     ) t
group by person, type, (seq1 - seq2)
order by person, start;

The correct solution using the difference of row numbers is:
select person, type, min(date) as start
from (select t.*,
             row_number() over (partition by person order by seq) as seqnum_p,
             row_number() over (partition by person, type order by seq) as seqnum_pt
      from t
     ) t
group by person, type, (seqnum_p - seqnum_pt)
order by person, start;
type needs to be included in the GROUP BY.
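To see why grouping on the difference works, here are the two row numbers written out by hand for person 1 from the sample data above (a sketch for illustration, not the output of an actual run):
seq type seqnum_p seqnum_pt diff
--- ---- -------- --------- ----
1   1    1        1         0
2   1    2        2         0
3   1    3        3         0
4   1    4        4         0
5   1    5        5         0
6   1    6        6         0
7   2    7        1         6
8   2    8        2         6
9   2    9        3         6
10  1    10       7         3
11  1    11       8         3
12  2    12       4         8
The difference stays constant within each run of consecutive rows of the same type and changes whenever the type switches, so grouping on (person, type, diff) yields exactly the four islands for person 1 in the expected output, and min(date) per group gives the start.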

Related

How to count consecutive days in a table where days are duplicated "PostgreSQL"

Hello, I would like to know the highest count of consecutive days a user has trained for.
My logs table that stores the records looks like this:
id  user_id  day  ground_id  created_at
--  -------  ---  ---------  -------------------
1   1        1    1          2023-01-24 10:00:00
2   1        2    1          2023-01-25 10:00:00
3   1        3    1          2023-01-26 10:00:00
4   1        4    1          2023-01-27 10:00:00
5   1        5    1          2023-01-28 10:00:00
The closest I could get is with this query, which works only if the user has trained on one ground per day.
SELECT COUNT(*) AS days_in_row
FROM (SELECT row_number() OVER (ORDER BY day) - day AS grp
FROM logs
WHERE created_at >= '2023-01-24 00:00:00'
AND user_id = 1) x
GROUP BY grp
With that data, this query returns a count of 5 consecutive days, which is correct.
However, my query doesn't work once a user trains multiple times on different training grounds in one day:
logs table:
id  user_id  day  ground_id  created_at
--  -------  ---  ---------  -------------------
1   1        1    1          2023-01-24 10:00:00
2   1        2    1          2023-01-25 10:00:00
3   1        3    1          2023-01-26 10:00:00
4   1        3    2          2023-01-26 10:00:00
5   1        4    1          2023-01-27 10:00:00
Then the query above returns a count of 2 consecutive days, which is not what I expect; I would expect 4, because the user has trained on four days in a row (days 1, 2, 3, 4).
Thank you for reading.
Select only the distinct data of interest first:
SELECT min(created_at) AS start, COUNT(*) AS days_in_row
FROM (SELECT created_at, row_number() OVER (ORDER BY day) - day AS grp
      FROM (select distinct day, created_at
            from logs
            where created_at >= '2023-01-24 00:00:00'
              AND user_id = 1) t
     ) x
GROUP BY grp
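If you only need the single highest count rather than one row per streak, one option (a sketch along the same lines, assuming the same logs table and filters) is to take the maximum over those per-group counts:
SELECT max(days_in_row) AS longest_streak
FROM (SELECT COUNT(*) AS days_in_row
      FROM (SELECT row_number() OVER (ORDER BY day) - day AS grp
            FROM (select distinct day
                  from logs
                  where created_at >= '2023-01-24 00:00:00'
                    AND user_id = 1) t
           ) x
      GROUP BY grp
     ) s;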

Sum column values over a window based on variable date range (impala)

Given a table as follows :
client_id date connections
---------------------------------------
121438297 2018-01-03 0
121438297 2018-01-08 1
121438297 2018-01-10 3
121438297 2018-01-12 1
121438297 2018-01-19 7
363863811 2018-01-18 0
363863811 2018-01-30 5
363863811 2018-02-01 4
363863811 2018-02-10 0
I am looking for an efficient way to sum the number of connections that occur within x number of days following the current row (the current row being included in the sum), partitioned by client_id.
If x=6, then the output table would be:
client_id date connections connections_within_6_days
---------------------------------------------------------------------
121438297 2018-01-03 0 1
121438297 2018-01-08 1 5
121438297 2018-01-10 3 4
121438297 2018-01-12 1 1
121438297 2018-01-19 7 7
363863811 2018-01-18 0 0
363863811 2018-01-30 5 9
363863811 2018-02-01 4 4
363863811 2018-02-10 0 0
Concerns:
I do not want to add all missing dates and then perform a sliding window counting the x following rows, because my table is already extremely large.
I am using Impala, and range between interval 'x' days following and current row is not supported.
The generic solution is a bit troublesome for multiple periods, but you can use multiple CTEs to support that. The idea is to "unpivot" the counts based on when they go in and out and then use a cumulative sum.
So:
with conn as (
      select client_id, date, connections
      from t
      union all
      select client_id, date + interval 7 day, -connections
      from t
     ),
     conn1 as (
      select client_id, date,
             sum(sum(connections)) over (partition by client_id order by date) as connections_within_6_days
      from conn
      group by client_id, date
     )
select t.*, conn1.connections_within_6_days
from t join
     conn1
     on conn1.client_id = t.client_id and
        conn1.date = t.date;
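As written, the cumulative sum counts the connections in the 6 days up to and including the current row. To count the 6 days following the current row, as in the expected output, one variant (a sketch, not tested on Impala; it assumes date arithmetic with interval and a descending cumulative sum are available) is to subtract each row's connections 7 days earlier and accumulate in descending date order:
with conn as (
      select client_id, date, connections
      from t
      union all
      select client_id, date - interval 7 days, -connections
      from t
     ),
     conn1 as (
      select client_id, date,
             sum(sum(connections)) over (partition by client_id order by date desc) as connections_within_6_days
      from conn
      group by client_id, date
     )
select t.*, conn1.connections_within_6_days
from t join
     conn1
     on conn1.client_id = t.client_id and
        conn1.date = t.date;
At each original date d this accumulates the rows with dates in [d, d+6], which reproduces the connections_within_6_days column shown above.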

Following Start and End Date Columns

I have start and end date columns, and there are rows where the start date equals the end date of the previous row with no gap. I'm trying to start from the row whose END_DTE is NULL and "zig-zag" upward, chaining rows for as long as each row's START_DTE matches the previous row's END_DTE, until they no longer match.
I've tried CTEs, and ROW_NUMBER() OVER().
START_DTE END_DTE
2018-01-17 2018-01-19
2018-01-26 2018-02-22
2018-02-22 2018-08-24
2018-08-24 2018-09-24
2018-09-24 NULL
Expected:
START_DTE END_DTE
2018-01-26 2018-09-24
EDIT
Using a proposed solution with an added CTE to ensure dates don't have times with them.
WITH
CTE_TABLE_NAME AS
(
SELECT
ID_NUM,
CONVERT(DATE,START_DTE) START_DTE,
CONVERT(DATE,END_DTE) END_DTE
FROM
TABLE_NAME
WHERE ID_NUM = 123
)
select min(start_dte) as start_dte, max(end_dte) as end_dte, grp
from (select t.*,
sum(case when prev_end_dte = end_dte then 0 else 1 end) over (order by start_dte) as grp
from (select t.*,
lag(end_dte) over (order by start_dte) as prev_end_dte
from CTE_TABLE_NAME t
) t
) t
group by grp;
That query provides these results:
start_dte end_dte grp
2014-08-24 2014-12-19 1
2014-08-31 2014-09-02 2
2014-09-02 2014-09-18 3
2014-09-18 2014-11-03 4
2014-11-18 2014-12-09 5
2014-12-09 2015-01-16 6
2015-01-30 2015-02-02 7
2015-02-02 2015-05-15 8
2015-05-15 2015-07-08 9
2015-07-08 2015-07-09 10
2015-07-09 2015-08-25 11
2015-08-31 2015-09-01 12
2015-10-06 2015-10-29 13
2015-11-10 2015-12-11 14
2015-12-11 2015-12-15 15
2015-12-15 2016-01-20 16
2016-01-29 2016-02-01 17
2016-02-01 2016-03-03 18
2016-03-30 2016-08-29 19
2016-08-30 2016-12-06 20
2017-01-27 2017-02-20 21
2017-02-20 2017-08-15 22
2017-08-15 2017-08-29 23
2017-08-29 2018-01-17 24
2018-01-17 2018-01-19 25
2018-01-26 2018-02-22 26
2018-02-22 2018-08-24 27
2018-08-24 2018-09-24 28
2018-09-24 NULL 29
I tried using having count(*) > 1 as suggested, but it returned no results.
Expected example
START_DTE END_DTE
2017-01-27 2018-01-17
2018-01-26 2018-09-24
You can identify where groups of connected rows start by looking for where adjacent rows are not connected. A cumulative sum of these starts then gives you the groups.
select min(start_dte) as start_dte, max(end_dte) as end_dte
from (select t.*,
             sum(case when prev_end_dte = start_dte then 0 else 1 end) over (order by start_dte) as grp
      from (select t.*,
                   lag(end_dte) over (order by start_dte) as prev_end_dte
            from t
           ) t
     ) t
group by grp;
If you want only multiply connected rows (as implied by your question), then add having count(*) > 1 to the outer query.
Here is a db<>fiddle.
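One thing worth noting about the edited query in the question: its flag compares prev_end_dte to end_dte rather than to start_dte, so on that data the flag comes out as 1 on every row and every row lands in its own group. That would explain both the one-row-per-group results listed above and why adding having count(*) > 1 returned nothing. Presumably the intended condition is the one from this answer:
sum(case when prev_end_dte = start_dte then 0 else 1 end) over (order by start_dte) as grp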

PostgreSQL group by with interval but without window functions

This is follow-up of my previous question:
PostgreSQL group by with interval
There was a very good answer, but unfortunately it does not work with PostgreSQL 8.0, and some clients still use this old version.
So I need to find another solution that does not use window functions.
Here is what I have as a table:
id quantity price1 price2 date
1 100 1 0 2018-01-01 10:00:00
2 200 1 0 2018-01-02 10:00:00
3 50 5 0 2018-01-02 11:00:00
4 100 1 1 2018-01-03 10:00:00
5 100 1 1 2018-01-03 11:00:00
6 300 1 0 2018-01-03 12:00:00
I need to sum "quantity" grouped by "price1" and "price2", but only over consecutive rows, so a new group starts whenever they change.
So the end result should look like this:
quantity price1 price2 dateStart dateEnd
300 1 0 2018-01-01 10:00:00 2018-01-02 10:00:00
50 5 0 2018-01-02 11:00:00 2018-01-02 11:00:00
200 1 1 2018-01-03 10:00:00 2018-01-03 11:00:00
300 1 0 2018-01-03 12:00:00 2018-01-03 12:00:00
It is not efficient, but you can implement the same logic with subqueries:
select sum(quantity), price1, price2,
       min(date) as dateStart, max(date) as dateEnd
from (select d.*,
             (select count(*)
              from data d2
              where d2.date <= d.date
             ) as seqnum,
             (select count(*)
              from data d2
              where d2.price1 = d.price1 and d2.price2 = d.price2 and d2.date <= d.date
             ) as seqnum_pp
      from data d
     ) t
group by price1, price2, (seqnum - seqnum_pp)
order by dateStart
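Walking the sample data through this by hand (for illustration; not an actual run), the two correlated counts come out as:
id price1 price2 seqnum seqnum_pp diff
-- ------ ------ ------ --------- ----
1  1      0      1      1         0
2  1      0      2      2         0
3  5      0      3      1         2
4  1      1      4      1         3
5  1      1      5      2         3
6  1      0      6      3         3
Grouping on (price1, price2, seqnum - seqnum_pp) then produces the four rows of the expected output. Note that this emulation of row_number() assumes the date values are unique; duplicate timestamps would make the counts tie and could merge runs that should stay separate.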

PostgreSQL - rank over rows listed in blocks of 0 and 1

I have a table that looks like:
id code date1 date2 block
--------------------------------------------------
20 1234 2017-07-01 2017-07-31 1
15 1234 2017-06-01 2017-06-30 1
13 1234 2017-05-01 2017-05-31 0
11 1234 2017-03-01 2017-03-31 0
9 1234 2017-02-01 2017-02-28 1
8 1234 2017-01-01 2017-01-31 0
7 1234 2016-11-01 2016-11-30 0
6 1234 2016-10-01 2016-10-31 1
2 1234 2016-09-01 2016-09-30 1
I need to rank the rows according to the blocks of 0's and 1's, like:
id code date1 date2 block desired_rank
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 1
13 1234 2017-05-01 2017-05-31 0 2
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 4
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 5
2 1234 2016-09-01 2016-09-30 1 5
I've tried to use rank() and dense_rank(), but the result I end up with is:
id code date1 date2 block dense_rank()
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 2
13 1234 2017-05-01 2017-05-31 0 1
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 3
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 4
2 1234 2016-09-01 2016-09-30 1 5
In the last table, the rank ignores the block boundaries; it treats all the 1's as one set and all the 0's as another, and assigns an ascending count within each, starting at the first 1 and the first 0.
My query goes like this:
CREATE TEMP TABLE data (id integer,code text, date1 date, date2 date, block integer);
INSERT INTO data VALUES
(20,'1234', '2017-07-01','2017-07-31',1),
(15,'1234', '2017-06-01','2017-06-30',1),
(13,'1234', '2017-05-01','2017-05-31',0),
(11,'1234', '2017-03-01','2017-03-31',0),
(9, '1234', '2017-02-01','2017-02-28',1),
(8, '1234', '2017-01-01','2017-01-31',0),
(7, '1234', '2016-11-01','2016-11-30',0),
(6, '1234', '2016-10-01','2016-10-31',1),
(2, '1234', '2016-09-01','2016-09-30',1);
SELECT *,dense_rank() OVER (PARTITION BY code,block ORDER BY date2 DESC)
FROM data
ORDER BY date2 DESC;
By the way, the database is PostgreSQL.
I hope there's a workaround... Thanks :)
Edit: Note that the blocks of 0's and 1's aren't of equal length.
There's no way to get this result using a single window function, but you can nest two:
SELECT *,
Sum(flag) -- now sum the 0/1 to create the "rank"
Over (PARTITION BY code
ORDER BY date2 DESC)
FROM
(
SELECT *,
CASE
WHEN Lag(block) -- check if this is the 1st row of a new block
Over (PARTITION BY code
ORDER BY date2 DESC) = block
THEN 0
ELSE 1
END AS flag
FROM DATA
) AS dt
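Checked by hand against the sample rows (ordered by date2 descending), the inner CASE produces the flag sequence 1, 0, 1, 0, 1, 1, 0, 1, 0, and the running sum of those flags is 1, 1, 2, 2, 3, 4, 4, 5, 5, which matches the desired_rank column above. If you also want the rows back in the question's order with a named rank column, a small variant (not part of the answer as posted) is:
SELECT dt.*,
       Sum(flag) Over (PARTITION BY code ORDER BY date2 DESC) AS desired_rank
FROM (
      SELECT *,
             CASE WHEN Lag(block) Over (PARTITION BY code ORDER BY date2 DESC) = block
                  THEN 0
                  ELSE 1
             END AS flag
      FROM data
     ) AS dt
ORDER BY date2 DESC;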