Category Entry and Exit Dates per ID AND Category - sql

I have the following table, where ID is the unique identifier. An ID can move from category to category, both up and down. My table records each day an ID stays in a given category. I am trying to identify the start date and the end date of an ID in a given category. The problem is that an ID can move up a category and then move back down to its original category after a certain number of days. Here is my table as an example with only 1 ID:
ID Category Date
1 1 2021-01-01
1 1 2021-01-02
...
1 1 2021-01-24
1 2 2021-01-25
...
1 2 2021-02-15
1 1 2021-02-16
...
1 1 2021-04-20
1 2 2021-04-21
When I try to get the MIN(DATE) and MAX(DATE) and group by the category and ID, it shows me that the ID was in Category 1 from 2021-01-01 to 2021-04-20, and in Category 2 from 2021-01-25 to 2021-04-21. I am trying to track the movements of the ID through each category step by step, meaning that in my ideal result the movements will be tracked as:
ID Category StartDate EndDate
1 1 2021-01-01 2021-01-24
1 2 2021-01-25 2021-02-15
1 1 2021-02-16 2021-04-20
1 2 2021-04-21 NULL (or GETDATE())
How can I achieve this result? Any help would be appreciated. I tried using the RANK() function but because the table records every single day, it seems useless.

This is a type of gaps-and-islands problem that is most easily solved using the difference of row numbers:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id order by date) as seqnum,
row_number() over (partition by id, category order by date) as seqnum_2
from t
) t
group by id, category, (seqnum - seqnum_2);
Actually, the difference of row numbers is the simplest option only because you have not specified the database. Since the table has one row per day, you can instead subtract a row number (as a number of days) from the date to get a constant that identifies each group. That looks like:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id, category order by date) as seqnum
from t
) t
group by id, category, date - seqnum * interval '1 day';
However, the date arithmetic varies by database.
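As a rough, runnable illustration of the difference-of-row-numbers query above, here is a minimal sketch that assumes SQLite (through Python's sqlite3 module) and a shortened version of the sample data; the table name t and the shortened dates are assumptions made for the example. Note that the last, still-open island gets its last recorded date as the end date rather than NULL.

```python
import sqlite3

# Shortened sample: category 1, then 2, then back to 1, then 2 again.
rows = [
    (1, 1, "2021-01-01"), (1, 1, "2021-01-02"), (1, 1, "2021-01-03"),
    (1, 2, "2021-01-04"), (1, 2, "2021-01-05"),
    (1, 1, "2021-01-06"), (1, 1, "2021-01-07"),
    (1, 2, "2021-01-08"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, category INTEGER, date TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# seqnum - seqnum_2 is constant within each unbroken run of one category.
islands = conn.execute("""
    SELECT id, category, MIN(date) AS start_date, MAX(date) AS end_date
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS seqnum,
                 ROW_NUMBER() OVER (PARTITION BY id, category ORDER BY date) AS seqnum_2
          FROM t) AS s
    GROUP BY id, category, seqnum - seqnum_2
    ORDER BY id, start_date
""").fetchall()

for row in islands:
    print(row)
```

Each return to a category shows up as its own row, which is what a plain GROUP BY id, category cannot do.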

Related

Calculate the streaks of visit of users limited to 7

I am trying to calculate the consecutive visits a user makes on an app. I used the rank function to determine the streaks maintained by each user. However, my requirement is that a streak should not exceed 7.
For instance, if a user visits the app for 9 consecutive days, they will have 2 different streaks: one with count 7 and the other with count 2.
I am using MaxCompute, which is similar to MySQL.
I have the following table named visitors_data:
user_id visit_date
murtaza 01-01-2021
john 01-01-2021
murtaza 02-01-2021
murtaza 03-01-2021
murtaza 04-01-2021
john 01-01-2021
murtaza 05-01-2021
murtaza 06-01-2021
john 02-01-2021
john 03-01-2021
murtaza 07-01-2021
murtaza 08-01-2021
murtaza 09-01-2021
john 20-01-2021
john 21-01-2021
Output should look like this:
user_id streak
murtaza 7
murtaza 2
john 3
john 2
I was able to get the streaks by the following query, but I could not limit the streaks to 7.
WITH groups AS (
SELECT user_id,
RANK() OVER (ORDER BY user_id, visit_date) AS RANK,
visit_date,
DATEADD(visit_date, -RANK() OVER (ORDER BY user_id, visit_date), 'dd') AS date_group
FROM visitors_data
ORDER BY user_id, visit_date)
SELECT
user_id,
COUNT(*) AS streak
FROM groups
GROUP BY
user_id,
date_group
HAVING COUNT(*)>1
ORDER BY COUNT(*);
My thinking ran along similar lines to forpas':
SELECT user_id, COUNT(*) streak
FROM
(
SELECT
user_id, streak,
FLOOR((ROW_NUMBER() OVER (PARTITION BY user_id, streak ORDER BY visit_date)-1)/7) substreak
FROM
(
SELECT
user_id, visit_date,
SUM(runtot) OVER (PARTITION BY user_id ORDER BY visit_date) streak
FROM (
SELECT
user_id, visit_date,
CASE WHEN DATE_ADD(visit_date, INTERVAL -1 DAY) = LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date) THEN 0 ELSE 1 END as runtot
FROM visitors_data
GROUP BY user_id, visit_date
) x
) y
) z
GROUP BY user_id, streak, substreak
To explain how this works: a usual trick for counting runs of successive records is to use LAG to examine the previous record; if it is exactly one day earlier, put a 0, otherwise put a 1. The first record of a consecutive run is therefore 1 and the rest are 0, so the column ends up looking like 1,0,0,0,1,0... SUM OVER ORDER BY sums this in a "running total" fashion. This forms a counter that ticks up every time the start of a run is encountered, so a run of 4 days followed by a gap and then a run of 3 days looks like 1,1,1,1,2,2,2, which serves as a "streak ID number".
If this is then fed into a row numbering that partitions by the streak ID number, it establishes an incrementing counter that restarts every time the streak ID changes. If we subtract 1 so that it runs from 0 instead of 1, we can divide it by 7 and take the floor to get a "sub streak ID", which for our 9-long streak is 0,0,0,0,0,0,0,1,1 (and so on: a streak of 25 would have 7 zeroes, 7 ones, 7 twos, and 4 threes).
All that remains then is to group by the user, the streak ID, and the sub streak ID, and count the result.
Before the final group and count the data looks like:
Which should give some idea of how it all works
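The intermediate rows described above can be reproduced with a small sketch. It assumes SQLite (through Python's sqlite3 module) instead of MaxCompute, ISO-formatted dates instead of the question's dd-mm-yyyy strings, and integer division in place of FLOOR; the CTE names d, x, y and z are only for illustration.

```python
import sqlite3
from collections import Counter

# murtaza visits 9 days in a row; john has a 3-day and a 2-day run.
visits = [("murtaza", f"2021-01-{d:02d}") for d in range(1, 10)]
visits += [("john", d) for d in ("2021-01-01", "2021-01-01",  # duplicate row, as in the question
                                  "2021-01-02", "2021-01-03",
                                  "2021-01-20", "2021-01-21")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors_data (user_id TEXT, visit_date TEXT)")
conn.executemany("INSERT INTO visitors_data VALUES (?, ?)", visits)

intermediate = conn.execute("""
    WITH d AS (SELECT DISTINCT user_id, visit_date FROM visitors_data),
    x AS (SELECT user_id, visit_date,
                 -- 1 marks the first day of a run, 0 a continuation
                 CASE WHEN date(visit_date, '-1 day') =
                           LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date)
                      THEN 0 ELSE 1 END AS runtot
          FROM d),
    y AS (SELECT x.*,
                 -- running total of run starts = streak ID
                 SUM(runtot) OVER (PARTITION BY user_id ORDER BY visit_date) AS streak
          FROM x),
    z AS (SELECT y.*,
                 -- 0-based position inside the streak, cut into blocks of 7
                 (ROW_NUMBER() OVER (PARTITION BY user_id, streak ORDER BY visit_date) - 1) / 7 AS substreak
          FROM y)
    SELECT user_id, visit_date, runtot, streak, substreak
    FROM z
    ORDER BY user_id, visit_date
""").fetchall()

for row in intermediate:
    print(row)

# The final GROUP BY user_id, streak, substreak is just a count per key.
counts = Counter((user_id, streak, substreak)
                 for user_id, _, _, streak, substreak in intermediate)
```

Counting the rows per (user_id, streak, substreak) key gives 7 and 2 for murtaza and 3 and 2 for john, matching the output requested in the question.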
With a mix of window functions and aggregation:
SELECT user_id, COALESCE(NULLIF(MAX(counter) % 7, 0), 7) streak
FROM (
SELECT *, COUNT(*) OVER (PARTITION BY user_id, grp ORDER BY visit_date) counter
FROM (
SELECT *, SUM(flag) OVER (PARTITION BY user_id ORDER BY visit_date) grp
FROM (
SELECT *, COALESCE(DATE_ADD(visit_date, INTERVAL -1 DAY) <>
LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date), 1) flag
FROM (SELECT DISTINCT * FROM visitors_data) t
) t
) t
) t
GROUP BY user_id, grp, FLOOR((counter - 1) / 7)
You could break them up after the fact. Each streak is joined to one row per block of up to 7 days, where n is the number of full blocks that come before it. For instance, if you never have more than 21:
SELECT user_id, LEAST(streak - n * 7, 7)
FROM (SELECT user_id, COUNT(*) AS streak
FROM groups
GROUP BY user_id, date_group
HAVING COUNT(*) > 1
) gu JOIN
(SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2
) n
ON streak > n * 7
ORDER BY LEAST(streak - n * 7, 7);
If you have an indeterminate range for the longest streak, you can do something similar with a recursive CTE.
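One possible shape for the recursive-CTE variant is sketched below. It assumes SQLite (through Python's sqlite3 module), where the two-argument MIN plays the role of LEAST, and it starts from a hypothetical streaks table that already holds one row per uncapped streak; the streak_id column, the blocks CTE and the user amy are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-computed streak lengths (before capping at 7).
conn.execute("CREATE TABLE streaks (user_id TEXT, streak_id INTEGER, streak INTEGER)")
conn.executemany("INSERT INTO streaks VALUES (?, ?, ?)",
                 [("murtaza", 1, 9), ("john", 1, 3), ("john", 2, 2), ("amy", 1, 25)])

capped = conn.execute("""
    WITH RECURSIVE blocks(n, longest) AS (
        -- n = number of full 7-day blocks already consumed
        SELECT 0, (SELECT MAX(streak) FROM streaks)
        UNION ALL
        SELECT n + 1, longest FROM blocks WHERE (n + 1) * 7 < longest
    )
    SELECT s.user_id, MIN(s.streak - b.n * 7, 7) AS streak
    FROM streaks AS s
    JOIN blocks AS b ON s.streak > b.n * 7
    ORDER BY s.user_id, s.streak_id, b.n
""").fetchall()

for row in capped:
    print(row)
```

The recursion only generates as many block numbers as the longest streak needs, so no upper bound has to be hard-coded.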

Count new entries day by day

I would like to count the new IDs on each day. By new, I mean new relative to the day before.
Assume we have a table:
Date Id
2021-01-01 1
2021-01-02 4
2021-01-02 5
2021-01-02 6
2021-01-03 1
2021-01-03 5
2021-01-03 7
My desired output, would look like this:
Date Count(NewId)
2021-01-01 1
2021-01-02 3
2021-01-03 2
You can use two levels of aggregation:
select date, count(*)
from (select id, min(date) as date
from t
group by id
) i
group by date
order by date;
If by "relative to the day before" you mean that you want to count someone as new whenever they have no record on the previous day, then use lag() . . . carefully:
select date,
sum(case when prev_date = date - interval '1' day then 0 else 1 end)
from (select t.*,
lag(date) over (partition by id order by date) as prev_date
from t
) t
group by date
order by date;
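To see how the two readings of "new" differ on the question's data, here is a small sketch that assumes SQLite (through Python's sqlite3 module), so the date arithmetic is written as date(date, '-1 day') rather than with an interval; the variable names are only for illustration.

```python
import sqlite3

rows = [("2021-01-01", 1),
        ("2021-01-02", 4), ("2021-01-02", 5), ("2021-01-02", 6),
        ("2021-01-03", 1), ("2021-01-03", 5), ("2021-01-03", 7)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Reading 1: an id is new only on the first day it ever appears.
first_seen = conn.execute("""
    SELECT date, COUNT(*)
    FROM (SELECT id, MIN(date) AS date FROM t GROUP BY id) AS i
    GROUP BY date
    ORDER BY date
""").fetchall()

# Reading 2: an id is new whenever it has no row on the previous day.
not_seen_yesterday = conn.execute("""
    SELECT date,
           SUM(CASE WHEN prev_date = date(date, '-1 day') THEN 0 ELSE 1 END)
    FROM (SELECT t.*,
                 LAG(date) OVER (PARTITION BY id ORDER BY date) AS prev_date
          FROM t) AS s
    GROUP BY date
    ORDER BY date
""").fetchall()

print(first_seen)
print(not_seen_yesterday)
```

On this data only the second reading reproduces the desired counts of 1, 3 and 2, because id 1 reappears on 2021-01-03 after being absent on 2021-01-02.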
Here is another way, probably the simplest:
select t1.date, count(*) from your_table t1
where id not in (select id from your_table t2 where t2.date = t1.date - interval '1 day')
group by t1.date
This other option could also do the job, but to be honest I would prefer @GordonLinoff's answer:
select date, count(*)
from your_table t
where not exists (
select 1
from your_table tt
where tt.Id=t.id
and tt.date = date_sub(t.date,1)
)
group by date

Add a column with customers orders count at the time they passed the order

I have the following table
order_id created_at customer_id
1 2020-01-02 11
2 2020-02-03 12
3 2020-02-03 11
I would like to add a column "customer_orders_count" that assigns to each transaction the number of orders the customer had placed up to that point, i.e. obtain this table:
order_id created_at customer_id customer_orders_count
1 2020-01-02 11 1
2 2020-02-03 12 1
3 2020-02-03 11 2
My problem is that I can't find how to calculate a local "customer_orders_count" that depends on each order. I only managed to add a column with the global "customer_orders_count", so for the first row (order_id=1), for example, I get customer_orders_count=2 whereas I would like it to be 1.
Does anyone have an idea?
Use cumulative count:
with mytable as (
select 1 as order_id, date '2020-01-02' as created_at, 11 as customer_id union all
select 2, '2020-02-03', 12 union all
select 3 , '2020-02-03', 11
)
select *, count(*) over (partition by customer_id order by created_at) as customer_orders_count
from mytable
order by order_id
Use row_number():
select t.*,
row_number() over (partition by customer_id order by created_at) as customer_order_count
from t;
This is subtly different from using a cumulative count(). This version guarantees that the numbers for a given customer are never duplicated, even when the dates are the same. A cumulative count has no such guarantee.
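The difference only shows up when a customer has several orders with the same created_at. The sketch below assumes SQLite (through Python's sqlite3 module), a table named orders, and one extra, hypothetical order 4 placed by customer 11 on the same day as order 3; order_id is added to the ROW_NUMBER ordering as a tie-breaker so the result is deterministic.

```python
import sqlite3

orders = [(1, "2020-01-02", 11),
          (2, "2020-02-03", 12),
          (3, "2020-02-03", 11),
          (4, "2020-02-03", 11)]  # hypothetical second order on the same day

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, created_at TEXT, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

result = conn.execute("""
    SELECT order_id,
           -- default frame includes all rows tied on created_at
           COUNT(*) OVER (PARTITION BY customer_id ORDER BY created_at) AS cumulative_count,
           -- always 1, 2, 3, ... within a customer
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY created_at, order_id) AS row_num
    FROM orders
    ORDER BY order_id
""").fetchall()

for row in result:
    print(row)
```

Orders 3 and 4 both get a cumulative count of 3, while ROW_NUMBER assigns them 2 and 3.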

Joining client records based on overlapping date ranges in oracle SQL

I have a dataset that looks like this:
Client id stayId start_date end_date type
1 101 1-1-2010 20-7-2010 A
1 105 1-7-2010 30-12-2010 A
2 108. 8-10-2012 10-12-2012 B
2 108. 8-10-2012 10-12-2012 B
I want to merge rows with overlapping date ranges and take the highest stayId, but only if the client id and type match. How should I do this in Oracle SQL?
The result would look like this:
Client id stayId start_date end_date type
1 105 1-1-2010 30-12-2010 A
2 108. 8-10-2012 10-12-2012 B
2 108. 01-01-2013 13-10-2013 B
This is a type of gaps-and-islands problem. It looks tricky, because there can be arbitrary overlaps -- I suspect that the overlap might even be an earlier record, as in:
|------| |-------|
|------------------|
For this version, I recommend a cumulative max to identify the rows with no overlap. These rows start the "islands". Then, a cumulative sum identifies the islands (the sum of rows where there is no overlap). The final step is aggregation:
select clientid, type, max(stayid),
min(start_date), max(end_date)
from (select t.*,
sum(case when prev_end_date >= start_date then 0 else 1 end) over
(partition by clientid, type
order by start_date
) as grp
from (select t.*,
max(end_date) over (partition by clientid, type
order by start_date
range between unbounded preceding and interval '1' day preceding
) as prev_end_date
from t
) t
) t
group by clientid, type, grp;
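A runnable approximation of this approach is sketched below, with some stated assumptions: it uses SQLite (through Python's sqlite3 module) rather than Oracle, ISO-formatted date strings, a table named stays, and a ROWS frame with stayid as a tie-breaker in place of the RANGE frame above. Stay 109 and client 3 are hypothetical rows added to show a separate island and the "contained range" case.

```python
import sqlite3

stays = [(1, 101, "2010-01-01", "2010-07-20", "A"),
         (1, 105, "2010-07-01", "2010-12-30", "A"),
         (2, 108, "2012-10-08", "2012-12-10", "B"),
         (2, 109, "2013-01-01", "2013-10-13", "B"),   # hypothetical, no overlap
         (3, 201, "2011-01-01", "2011-12-31", "C"),   # hypothetical long stay
         (3, 202, "2011-02-01", "2011-03-01", "C"),   # contained in 201
         (3, 203, "2011-06-01", "2011-07-01", "C")]   # overlaps 201 but not 202

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stays
                (clientid INTEGER, stayid INTEGER, start_date TEXT, end_date TEXT, type TEXT)""")
conn.executemany("INSERT INTO stays VALUES (?, ?, ?, ?, ?)", stays)

merged = conn.execute("""
    SELECT clientid, type, MAX(stayid), MIN(start_date), MAX(end_date)
    FROM (SELECT s.*,
                 -- a row with no earlier overlapping stay starts a new island
                 SUM(CASE WHEN prev_end_date >= start_date THEN 0 ELSE 1 END)
                     OVER (PARTITION BY clientid, type
                           ORDER BY start_date, stayid) AS grp
          FROM (SELECT stays.*,
                       -- latest end date among all earlier stays
                       MAX(end_date) OVER (PARTITION BY clientid, type
                                           ORDER BY start_date, stayid
                                           ROWS BETWEEN UNBOUNDED PRECEDING
                                                    AND 1 PRECEDING) AS prev_end_date
                FROM stays) AS s) AS g
    GROUP BY clientid, type, grp
    ORDER BY clientid, MIN(start_date)
""").fetchall()

for row in merged:
    print(row)
```

Client 3 shows why a cumulative max is used instead of a simple LAG: stay 203 does not overlap the stay just before it, but it does overlap the earlier, longer stay 201.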

How to use SQL to get column count for a previous date?

I have the following table,
id status price date
2 complete 10 2020-01-01 10:10:10
2 complete 20 2020-02-02 10:10:10
2 complete 10 2020-03-03 10:10:10
3 complete 10 2020-04-04 10:10:10
4 complete 10 2020-05-05 10:10:10
Required output,
id status_count price ratio
2 0 0 0
2 1 10 0
2 2 30 0.33
For each row I am looking to add up the prices of the previous rows. Row 1 is 0 because it has no previous row.
I also want the ratio, i.e. 10/30 = 0.33.
You can use analytical function ROW_NUMBER and SUM as follows:
SELECT
id,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
COALESCE(SUM(price) OVER (PARTITION BY id ORDER BY date), 0) - price as price
FROM yourTable;
I think you want something like this:
SELECT
id,
COUNT(*) OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
COALESCE(SUM(price) OVER (PARTITION BY id
ORDER BY date ROWS BETWEEN
UNBOUNDED PRECEDING AND 1 PRECEDING), 0) price
FROM yourTable;
Please also check another method:
with cte
as (select *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
SUM(price) OVER (PARTITION BY id ORDER BY date) ss from yourTable)
select id, status_count, isnull(ss, 0) - price price
from cte
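As a quick check that the two ways of getting the "previous rows" total agree, here is a minimal sketch that assumes SQLite (through Python's sqlite3 module) and the question's sample rows; the column aliases prev_total and prev_total_alt are made up for the example, and the ratio column is left out.

```python
import sqlite3

rows = [(2, "complete", 10, "2020-01-01 10:10:10"),
        (2, "complete", 20, "2020-02-02 10:10:10"),
        (2, "complete", 10, "2020-03-03 10:10:10"),
        (3, "complete", 10, "2020-04-04 10:10:10"),
        (4, "complete", 10, "2020-05-05 10:10:10")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, status TEXT, price INTEGER, date TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?, ?)", rows)

result = conn.execute("""
    SELECT id,
           COUNT(*) OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
           -- sum over strictly earlier rows; NULL (no earlier row) becomes 0
           COALESCE(SUM(price) OVER (PARTITION BY id ORDER BY date
                                     ROWS BETWEEN UNBOUNDED PRECEDING
                                              AND 1 PRECEDING), 0) AS prev_total,
           -- running total including the current row, minus the current price
           SUM(price) OVER (PARTITION BY id ORDER BY date) - price AS prev_total_alt
    FROM yourTable
    ORDER BY id, date
""").fetchall()

for row in result:
    print(row)
```

Both columns agree here because the dates are distinct within each id; with tied dates the default frame of the second form would also include the tied rows.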