Get users who took rides on 3 or more consecutive dates - sql

I have the table below, which shows user_id and ride_date.
+---------+------------+
| user_id | ride_date |
+---------+------------+
| 1 | 2019-11-01 |
| 1 | 2019-11-03 |
| 1 | 2019-11-05 |
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-05 |
| 4 | 2019-11-07 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
| 5 | 2019-11-11 |
| 5 | 2019-11-13 |
+---------+------------+
I want the user_id of each user who took rides on 3 or more consecutive days, along with the dates of those consecutive rides.
The desired result is as below:
+---------+-----------------------+
| user_id | consecutive_ride_date |
+---------+-----------------------+
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-07 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
+---------+-----------------------+

With LAG() and LEAD() window functions:
with cte as (
select *,
datediff(
day,
lag([ride_date]) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev1,
datediff(
day,
lag([ride_date], 2) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev2,
datediff(
day,
[ride_date],
lead([ride_date]) over (partition by [user_id] order by [ride_date])
) next1,
datediff(
day,
[ride_date],
lead([ride_date], 2) over (partition by [user_id] order by [ride_date])
) next2
from Table1
)
select [user_id], [ride_date]
from cte
where
(prev1 = 1 and prev2 = 2) or
(prev1 = 1 and next1 = 1) or
(next1 = 1 and next2 = 2)
Results:
> user_id | ride_date
> ------: | :---------
> 2 | 03/11/2019
> 2 | 04/11/2019
> 2 | 05/11/2019
> 2 | 06/11/2019
> 3 | 03/11/2019
> 3 | 04/11/2019
> 3 | 05/11/2019
> 3 | 06/11/2019
> 4 | 07/11/2019
> 4 | 08/11/2019
> 4 | 09/11/2019

Here is one way to address this gaps-and-islands problem:
first, assign a rank to each user ride with row_number(), and recover the previous ride_date (aliased lag_ride_date)
then, compare the date of the previous ride to the current one in a conditional sum that increases when the dates are successive; subtracting this sum from the rank of the user ride gives groups (aliased grp) that represent consecutive rides with a 1-day spacing
do a window count of how many records belong to each group (aliased cnt)
filter on records whose window count is at least 3
Query:
select user_id, ride_date
from (
select
t.*,
count(*) over(partition by user_id, grp) cnt
from (
select
t.*,
rn1
- sum(case when ride_date = dateadd(day, 1, lag_ride_date) then 1 else 0 end)
over(partition by user_id order by ride_date) grp
from (
select
t.*,
row_number() over(partition by user_id order by ride_date) rn1,
lag(ride_date) over(partition by user_id order by ride_date) lag_ride_date
from Table1 t
) t
) t
) t
where cnt >= 3
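To make the grouping step concrete, here is a minimal Python sketch of the same idea, using the sample rows from the question. The function name `consecutive_rides` and the `min_len` parameter are only illustrative, and the sketch assumes there is at most one row per user per day.

```python
from collections import Counter
from datetime import date, timedelta


def d(day):
    """Helper for the sample data: a date in November 2019."""
    return date(2019, 11, day)


rides = [
    (1, d(1)), (1, d(3)), (1, d(5)),
    (2, d(3)), (2, d(4)), (2, d(5)), (2, d(6)),
    (4, d(5)), (4, d(7)), (4, d(8)), (4, d(9)),
    (5, d(11)), (5, d(13)),
]


def consecutive_rides(rows, min_len=3):
    """Return rows that belong to a run of at least min_len consecutive days."""
    labelled = []
    prev_user, prev_date, rn, steps = None, None, 0, 0
    for user_id, ride_date in sorted(rows):
        if user_id != prev_user:               # new partition: reset counters
            rn, steps, prev_date = 0, 0, None
        rn += 1                                # row_number()
        if prev_date is not None and ride_date == prev_date + timedelta(days=1):
            steps += 1                         # running conditional sum
        labelled.append((user_id, ride_date, rn - steps))  # grp is constant per island
        prev_user, prev_date = user_id, ride_date
    # count(*) over (partition by user_id, grp)
    cnt = Counter((user_id, grp) for user_id, _, grp in labelled)
    return [(u, day) for u, day, grp in labelled if cnt[(u, grp)] >= min_len]


result = consecutive_rides(rides)
for user_id, ride_date in result:
    print(user_id, ride_date)
```

The row number grows by one on every row, while the running sum grows only when the previous ride was the day before, so their difference changes exactly when a gap occurs.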

This is a typical gaps-and-islands problem.
We can solve it as follows:
with data
as (
select user_id
,ride_date
,dateadd(day
,-row_number() over(partition by user_id order by ride_date asc)
,ride_date) as grp_field
from Table1
)
,consecutive_days
as(
select user_id
,ride_date
,count(*) over(partition by user_id,grp_field) as cnt
from data
)
select *
from consecutive_days
where cnt>=3
order by user_id,ride_date
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=7bb851d9a12966b54afb4d8b144f3d46
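As a rough illustration of why subtracting the row number from the date works, here is a small Python sketch. The name `islands` is made up for this example, and duplicate rows are dropped first because the trick assumes one row per user per day.

```python
from collections import defaultdict
from datetime import date, timedelta

rides = [
    (1, date(2019, 11, 1)), (1, date(2019, 11, 3)), (1, date(2019, 11, 5)),
    (2, date(2019, 11, 3)), (2, date(2019, 11, 4)),
    (2, date(2019, 11, 5)), (2, date(2019, 11, 6)),
    (4, date(2019, 11, 5)), (4, date(2019, 11, 7)),
    (4, date(2019, 11, 8)), (4, date(2019, 11, 9)),
]


def islands(rows):
    """Group each user's dates by (user_id, ride_date - row_number days)."""
    row_number = defaultdict(int)
    groups = defaultdict(list)
    for user_id, ride_date in sorted(set(rows)):
        row_number[user_id] += 1
        # Consecutive dates shift by one day per row, and so does the row
        # number, so the difference stays the same within one island.
        grp_field = ride_date - timedelta(days=row_number[user_id])
        groups[(user_id, grp_field)].append(ride_date)
    return dict(groups)


long_runs = {key: days for key, days in islands(rides).items() if len(days) >= 3}
for (user_id, grp_field), days in long_runs.items():
    print(user_id, grp_field, [str(day) for day in days])
```

The value of `grp_field` has no meaning by itself; it only serves as a label shared by all rows of one island.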

There is no need to apply gaps-and-islands methodologies to this problem. The problem is much simpler to solve.
You can return the users and first date just by using LEAD():
SELECT t1.*
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1
WHERE ride_date_2 = DATEADD(day, 2, ride_date);
If you want the actual dates, you can unpivot the results:
SELECT DISTINCT t1.user_id, v.ride_date
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1 CROSS APPLY
(VALUES (t1.ride_date),
(DATEADD(day, 1, t1.ride_date)),
(DATEADD(day, 2, t1.ride_date))
) v(ride_date)
WHERE t1.ride_date_2 = DATEADD(day, 2, t1.ride_date)
ORDER BY t1.user_id, v.ride_date;
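The same look-ahead logic can be sketched in Python, as a rough equivalent rather than a translation of the query. It assumes dates are unique per user (duplicates are dropped first), since otherwise "two rows ahead" does not mean "two days ahead". The function name is hypothetical.

```python
from datetime import date, timedelta
from itertools import groupby

rides = [
    (1, date(2019, 11, 1)), (1, date(2019, 11, 3)), (1, date(2019, 11, 5)),
    (2, date(2019, 11, 3)), (2, date(2019, 11, 4)),
    (2, date(2019, 11, 5)), (2, date(2019, 11, 6)),
    (4, date(2019, 11, 5)), (4, date(2019, 11, 7)),
    (4, date(2019, 11, 8)), (4, date(2019, 11, 9)),
]


def dates_in_runs_of_three(rows):
    """Return (user_id, date) pairs that are part of 3 consecutive days."""
    found = set()
    for user_id, user_rows in groupby(sorted(set(rows)), key=lambda r: r[0]):
        days = [day for _, day in user_rows]
        for i in range(len(days) - 2):
            # Equivalent of LEAD(ride_date, 2) = ride_date + 2 days.
            if days[i + 2] == days[i] + timedelta(days=2):
                # "Unpivot": emit the start date and the two following days.
                found.update((user_id, days[i] + timedelta(days=k)) for k in range(3))
    return sorted(found)


result = dates_in_runs_of_three(rides)
for user_id, ride_date in result:
    print(user_id, ride_date)
```

Using a set plays the role of DISTINCT, because overlapping windows (for example days 3-5 and 4-6) produce the same dates more than once.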

Related

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, I need a record for each user for each day: if a day is missing for a user, the last recorded score should be carried forward. The result would look like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to keep the same score across the following empty dates, but I really don't know how to add new rows for missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1500.
My end goal is to show something like the average score per day. For background, because of our logic, if a score wasn't recorded on a specific day, the user still has the last recorded score, which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success.
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
When applied to the sample data from your question, the output matches the expected result.
One option uses generate_date_array() to create the series of dates for each user, then brings in the table with a left join.
-- assumes mytable.date is a DATE column
select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
select u.user_id, day as date
from (
select user_id, min(date) as min_date, max(date) as max_date
from mytable
group by user_id
) u
cross join unnest(generate_date_array(u.min_date, u.max_date, interval 1 day)) as day
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
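For readers who want to see the "date series plus left join plus last non-null value" idea outside SQL, here is a minimal Python sketch with the sample data from the question. The function name `fill_missing_days` is an assumption for illustration, and dates are kept as `YYYYMMDD` strings as in the question.

```python
from datetime import datetime, timedelta

scores = [
    ("20201120", 1, 26), ("20201121", 1, 14), ("20201125", 1, 0),
    ("20201114", 2, 32), ("20201116", 2, 0), ("20201120", 2, 23),
]


def fill_missing_days(rows):
    """Return one (date, user_id, score) row per user per day, carrying scores forward."""
    by_user = {}
    for day, user_id, score in rows:
        parsed = datetime.strptime(day, "%Y%m%d").date()
        by_user.setdefault(user_id, {})[parsed] = score

    filled = []
    for user_id in sorted(by_user):
        known = by_user[user_id]
        day, last_day = min(known), max(known)   # per-user date range
        score = None
        while day <= last_day:                   # the generated date series
            # Like a left join followed by last_value(... ignore nulls):
            # use the recorded score if there is one, else keep the previous.
            score = known.get(day, score)
            filled.append((day.strftime("%Y%m%d"), user_id, score))
            day += timedelta(days=1)
    return filled


filled = fill_missing_days(scores)
for row in filled:
    print(row)
```

Note that a recorded score of 0 is a real value here, which is why the sketch uses `known.get(day, score)` rather than an `or` fallback.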
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join
unnest(generate_date_array(date,
coalesce(next_date, date),
interval 1 day
)
) dte;
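The idea in this last query is that each existing row is repeated for every day up to the day before the user's next row, so no join back to the table is needed. A small Python sketch of that expansion, with a hypothetical function name and dates already parsed into `date` objects, could look like this:

```python
from datetime import date, timedelta

rows = [
    (1, date(2020, 11, 20), 26),
    (1, date(2020, 11, 21), 14),
    (1, date(2020, 11, 25), 0),
    (2, date(2020, 11, 14), 32),
    (2, date(2020, 11, 16), 0),
    (2, date(2020, 11, 20), 23),
]


def expand_until_next(rows):
    """Repeat each (user_id, date, score) row until the day before the user's next row."""
    ordered = sorted(rows)
    expanded = []
    for i, (user_id, day, score) in enumerate(ordered):
        following = ordered[i + 1] if i + 1 < len(ordered) else None
        if following is not None and following[0] == user_id:
            last_day = following[1] - timedelta(days=1)   # lead(date) - 1 day
        else:
            last_day = day                                # coalesce(next_date, date)
        while day <= last_day:
            expanded.append((user_id, day, score))
            day += timedelta(days=1)
    return expanded


expanded = expand_until_next(rows)
for row in expanded:
    print(row)
```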

Grouping consecutive sequences of rows

I'm trying to group consecutive rows where a boolean value is true on SQL Server. For example, here's what some source data looks like:
AccountID | ID | IsTrue | Date
-------------------------------
1 | 1 | 1 | 1/1/2013
1 | 2 | 1 | 1/2/2013
1 | 3 | 1 | 1/3/2013
1 | 4 | 0 | 1/4/2013
1 | 5 | 1 | 1/5/2013
1 | 6 | 0 | 1/6/2013
1 | 7 | 1 | 1/7/2013
1 | 8 | 1 | 1/8/2013
1 | 9 | 1 | 1/9/2013
And here's what I'd like as the output
AccountID | Start | End
-------------------------------
1 | 1/1/2013 | 1/3/2013
1 | 1/7/2013 | 1/9/2013
I have a hunch that there's some trick with grouping by partitions that will make this work but I've been unable to figure it out. I've made some progress using LAG but haven't been able to put it all together.
Thanks for the help!
This is an example of a gaps-and-islands problem. For this version, you just need a sequential number for each isTrue value. Subtracting this number of days from each date yields a constant for adjacent rows that have the same value:
select accountId, isTrue, min(date), max(date)
from (select t.*,
row_number() over (partition by accountId, isTrue order by date) as seqnum
from t
) t
group by accountId, isTrue, dateadd(day, -seqnum, date);
This defines all groups. If I assume that you just want values of "1" that are more than 1 day long, then:
select accountId, isTrue, min(date), max(date)
from (select t.*,
row_number() over (partition by accountId, isTrue order by date) as seqnum
from t
where isTrue = 1
) t
group by accountId, isTrue, dateadd(day, -seqnum, date)
having count(*) > 1;
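Here is a minimal Python sketch of the second query, using the sample data from the question. The names `true_runs` and `min_len` are illustrative only, and the sketch assumes one row per account per day.

```python
from collections import defaultdict
from datetime import date, timedelta

flags = [1, 1, 1, 0, 1, 0, 1, 1, 1]
# (AccountID, Date, IsTrue) for 2013-01-01 through 2013-01-09.
rows = [(1, date(2013, 1, 1) + timedelta(days=i), flag) for i, flag in enumerate(flags)]


def true_runs(rows, min_len=2):
    """Return (account_id, start, end) for runs of true rows on consecutive days."""
    seqnum = defaultdict(int)
    groups = defaultdict(list)
    for account_id, day, is_true in sorted(rows):
        if not is_true:                      # where isTrue = 1
            continue
        seqnum[account_id] += 1              # row_number() per account
        anchor = day - timedelta(days=seqnum[account_id])
        groups[(account_id, anchor)].append(day)
    return [
        (account_id, min(days), max(days))
        for (account_id, _), days in sorted(groups.items())
        if len(days) >= min_len              # having count(*) > 1
    ]


runs = true_runs(rows)
for account_id, start, end in runs:
    print(account_id, start, end)
```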
You can try the following. I am assuming that id will always have consecutive values.
with cte as
(
select
*,
count(*) over (partition by IsTrue, rnk) as total
from
(
select
*,
id - row_number() over (partition by IsTrue order by id, date) as rnk
from myTable
) val
)
select
accountId,
min(date) as start,
max(date) as end
from cte
where total > 1 and IsTrue = 1 -- keep only runs of true rows
group by
accountId,
rnk
Output:
| accountid | start | end |
| --------- | ---------- | -----------|
| 1 | 2013-01-01 | 2013-01-03 |
| 1 | 2013-01-07 | 2013-01-09 |

Group by on column that repeats

I'm having trouble putting this issue into words, which is probably why I can't find an example, so here is what I'd like to do.
I have a table like such
| counter| timestamp |
| 1 | 2018-01-01T11:11:01 |
| 1 | 2018-01-01T11:11:02 |
| 1 | 2018-01-01T11:11:03 |
| 2 | 2018-01-01T11:11:04 |
| 2 | 2018-01-01T11:11:05 |
| 3 | 2018-01-01T11:11:06 |
| 3 | 2018-01-01T11:11:07 |
| 1 | 2018-01-01T11:11:08 |
| 1 | 2018-01-01T11:11:09 |
| 1 | 2018-01-01T11:11:10 |
What I'd like to do is group by each run of counters, so that if I do a query like
SELECT counter, MIN(timestamp) as st, MAX(timestamp) as et
FROM table
GROUP BY counter;
the result would be
| counter | st | et |
| 1 | 2018-01-01T11:11:01 | 2018-01-01T11:11:03 |
| 2 | 2018-01-01T11:11:04 | 2018-01-01T11:11:05 |
| 3 | 2018-01-01T11:11:06 | 2018-01-01T11:11:07 |
| 1 | 2018-01-01T11:11:08 | 2018-01-01T11:11:10 |
instead of what actually happens which is
| counter | st | et |
| 1 | 2018-01-01T11:11:01 | 2018-01-01T11:11:10 |
| 2 | 2018-01-01T11:11:04 | 2018-01-01T11:11:05 |
| 3 | 2018-01-01T11:11:06 | 2018-01-01T11:11:07 |
So I'd like some way to combine group by and partition, ideally without nested queries.
You have to designate groups with the same repeating values of counter. This can be done using two window functions lag() and cumulative sum():
select counter, min(timestamp) as st, max(timestamp) as et
from (
select counter, timestamp, sum(grp) over w as grp
from (
select *, (lag(counter, 1, 0) over w <> counter)::int as grp
from my_table
window w as (order by timestamp)
) s
window w as (order by timestamp)
) s
group by counter, grp
order by st
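The "flag a change, then take a running sum" idea can be sketched in plain Python as follows. This is only an illustration with a made-up function name; timestamps are kept as ISO strings, which sort correctly as text.

```python
counters = [1, 1, 1, 2, 2, 3, 3, 1, 1, 1]
rows = [(c, f"2018-01-01T11:11:{i:02d}") for i, c in enumerate(counters, start=1)]


def runs_by_change(rows):
    """Return (counter, st, et) for each run of equal counter values."""
    groups = {}          # (counter, grp) -> [st, et]; dicts keep insertion order
    grp, prev = 0, None
    for counter, ts in sorted(rows, key=lambda r: r[1]):
        # lag(counter) <> counter gives 1 on a change; grp is its running sum.
        grp += int(counter != prev)
        prev = counter
        span = groups.setdefault((counter, grp), [ts, ts])
        span[1] = ts     # rows arrive in timestamp order, so this is the max
    return [(counter, st, et) for (counter, _), (st, et) in groups.items()]


runs = runs_by_change(rows)
for run in runs:
    print(run)
```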
You should calculate new groups:
create table tbl(counter int, ts timestamp);
insert into tbl values
(1, '2018-01-01T11:11:01'),
(1, '2018-01-01T11:11:02'),
(1, '2018-01-01T11:11:03'),
(2, '2018-01-01T11:11:04'),
(2, '2018-01-01T11:11:05'),
(3, '2018-01-01T11:11:06'),
(3, '2018-01-01T11:11:07'),
(1, '2018-01-01T11:11:08'),
(1, '2018-01-01T11:11:09'),
(1, '2018-01-01T11:11:10');
select min(counter) as counter, min(ts) as st, max(ts) as et
from
(
select counter, ts, sum(rst) over (order by ts) as grp
from
(
select counter, ts,
case when coalesce(lag(counter) over (order by ts), -1) <> counter then 1 end rst
from tbl
) t1
) t2
group by grp
counter | st | et
------: | :------------------ | :------------------
3 | 2018-01-01 11:11:06 | 2018-01-01 11:11:07
1 | 2018-01-01 11:11:08 | 2018-01-01 11:11:10
2 | 2018-01-01 11:11:04 | 2018-01-01 11:11:05
1 | 2018-01-01 11:11:01 | 2018-01-01 11:11:03
You can use ranking functions:
select counter, min(timestamp) st, max(timestamp) et
from (select *,
row_number() over (order by timestamp) Seq1,
row_number() over (partition by counter order by timestamp) Seq2
from table
) t
group by counter, (Seq1-Seq2);
This would use the differences of two ranking functions (Seq1-Seq2) and use them in GROUP BY clause.
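To see why the difference of the two row numbers identifies each run, here is a small Python sketch on the sample data. The function name is hypothetical; `seq1` and `seq2` mirror the aliases in the query.

```python
from collections import defaultdict

counters = [1, 1, 1, 2, 2, 3, 3, 1, 1, 1]
rows = [(c, f"2018-01-01T11:11:{i:02d}") for i, c in enumerate(counters, start=1)]


def runs_by_rownum_diff(rows):
    """Group rows by (counter, Seq1 - Seq2) and return (counter, st, et) per group."""
    seq2 = defaultdict(int)   # row_number() partitioned by counter
    groups = {}
    ordered = sorted(rows, key=lambda r: r[1])
    for seq1, (counter, ts) in enumerate(ordered, start=1):   # overall row_number()
        seq2[counter] += 1
        # Inside a run both numbers grow by one per row, so the difference
        # is constant; it changes once other counters appear in between.
        key = (counter, seq1 - seq2[counter])
        st, et = groups.get(key, (ts, ts))
        groups[key] = (min(st, ts), max(et, ts))
    return sorted(((c, st, et) for (c, _), (st, et) in groups.items()), key=lambda r: r[1])


runs = runs_by_rownum_diff(rows)
for run in runs:
    print(run)
```

The difference alone is not unique across different counter values, which is why the grouping key includes counter as well.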

Redshift count with variable

Imagine I have a table on Redshift with a structure similar to this. Product_Bill_ID is the Primary Key of this table.
| Store_ID | Product_Bill_ID | Payment_Date
| 1 | 1 | 01/10/2016 11:49:33
| 1 | 2 | 01/10/2016 12:38:56
| 1 | 3 | 01/10/2016 12:55:02
| 2 | 4 | 01/10/2016 16:25:05
| 2 | 5 | 02/10/2016 08:02:28
| 3 | 6 | 03/10/2016 02:32:09
If I want to query the number of Product_Bill_ID that a store sold in the first hour after it sold its first Product_Bill_ID, how could I do this?
For this example, the expected output is:
| Store_ID | First_Payment_Date | Sold_First_Hour
| 1 | 01/10/2016 11:49:33 | 2
| 2 | 01/10/2016 16:25:05 | 1
| 3 | 03/10/2016 02:32:09 | 1
You need to get the first payment date for each store. That is easy enough using window functions:
select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
Then, you need to do the date filtering and aggregation:
select store_id, count(*)
from (select s.*,
min(payment_date) over (partition by store_id) as first_payment_date
from sales s
) s
where payment_date <= first_payment_date + interval '1 hour'
group by store_id;
SELECT
store_id,
first_payment_date,
SUM(
CASE WHEN payment_date < DATEADD(hour, 1, first_payment_date) THEN 1 END
) AS sold_first_hour
FROM
(
SELECT
*,
MIN(payment_date) OVER (PARTITION BY store_id) AS first_payment_date
FROM
yourtable
)
parsed_table
GROUP BY
store_id,
first_payment_date
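As a cross-check of the logic, here is a minimal Python sketch on the sample rows. It assumes the dates are in day/month/year order (the question does not say) and uses a strict "less than one hour" comparison like the second query; the function name is made up for the example.

```python
from datetime import datetime, timedelta

sales = [
    (1, 1, "01/10/2016 11:49:33"),
    (1, 2, "01/10/2016 12:38:56"),
    (1, 3, "01/10/2016 12:55:02"),
    (2, 4, "01/10/2016 16:25:05"),
    (2, 5, "02/10/2016 08:02:28"),
    (3, 6, "03/10/2016 02:32:09"),
]


def sold_first_hour(rows):
    """Return (store_id, first_payment_date, bills sold within the first hour)."""
    parsed = [
        (store_id, datetime.strptime(paid, "%d/%m/%Y %H:%M:%S"))
        for store_id, _bill_id, paid in rows
    ]
    # min(payment_date) over (partition by store_id)
    first = {}
    for store_id, paid in parsed:
        first[store_id] = min(first.get(store_id, paid), paid)
    # Count the payments that fall inside the first hour of each store.
    counts = {store_id: 0 for store_id in first}
    for store_id, paid in parsed:
        if paid < first[store_id] + timedelta(hours=1):
            counts[store_id] += 1
    return [(store_id, first[store_id], counts[store_id]) for store_id in sorted(first)]


summary = sold_first_hour(sales)
for store_id, first_payment, sold in summary:
    print(store_id, first_payment, sold)
```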

Aggregate/Windowed Function To Find Min and Max of Sequential Rows

I've got a SQL table where I want to find the first and last dates of a group of records, providing they're sequential.
Patient | TestType | Result | Date
------------------------------------------
1 | 1 | A | 2012-03-04
1 | 1 | A | 2012-08-19
1 | 1 | B | 2013-05-27
1 | 1 | A | 2013-06-20
1 | 2 | X | 2012-08-19
1 | 2 | X | 2013-06-20
2 | 1 | B | 2014-09-09
2 | 1 | B | 2015-04-19
Should be returned as
Patient | TestType | Result | StartDate | EndDate
--------------------------------------------------------
1 | 1 | A | 2012-03-04 | 2012-08-19
1 | 1 | B | 2013-05-27 | 2013-05-27
1 | 1 | A | 2013-06-20 | 2013-06-20
1 | 2 | X | 2012-08-19 | 2013-06-20
2 | 1 | B | 2014-09-09 | 2015-04-19
The problem is that if I just group by Patient, TestType, and Result,
then the first and third rows in the example above would become a single row.
Patient | TestType | Result | StartDate | EndDate
--------------------------------------------------------
1 | 1 | A | 2012-03-04 | 2013-06-20
1 | 1 | B | 2013-05-27 | 2013-05-27
1 | 2 | X | 2012-08-19 | 2013-06-20
2 | 1 | B | 2014-09-09 | 2015-04-19
I feel like there's got to be something clever I can do with a partition, but I can't quite figure out what it is.
There are several ways to approach this. I like identifying the groups using the difference of row number values:
select patient, testtype, result,
min(date) as startdate, max(date) as enddate
from (select t.*,
(row_number() over (partition by patient, testtype order by date) -
row_number() over (partition by patient, testtype, result order by date)
) as grp
from table t
) t
group by patient, testtype, result, grp
order by patient, startdate;
select patient, testtype, result, date as startdate,
isnull(lead(date) over(partition by patient, testtype, result order by date), date) as enddate
from tablename;
You can use lead function to get the value of date (as enddate) from the next row in each group.
See if this gives you what you need.
with T1 as (
select
*,
case when lag(Patient,1)
over (order by Patient, TestType, Result) = Patient
and lag(TestType,1)
over (order by Patient, TestType, Result) = TestType
and lag(Result,1)
over (order by Patient, TestType, Result) = Result
then null else 1 end as Changes
from t
), T2 as (
select
Patient,
TestType,
Result,
dt,
sum(Changes) over (
order by Patient, TestType, Result, dt
) as seq
from T1
)
select
Patient,
TestType,
Result,
min(dt) as dtFrom,
max(dt) as dtTo
from T2
group by Patient, TestType, Result, seq
order by Patient, TestType, Result