How to create a cumulative count distinct with partition by in SQL?

I have a table with user data and want to create a cumulative count distinct, but this type of window function does not exist. This is my table:
date | user-id | purchase-id
2020-01-01 | 1 | 244
2020-01-03 | 1 | 244
2020-02-01 | 1 | 524
2020-03-01 | 2 | 443
Now, I want a cumulative distinct count of purchase id, like this:
date | user-id | purchase-id | cum_purchase
2020-01-01 | 1 | 244 | 1
2020-01-03 | 1 | 244 | 1
2020-02-01 | 1 | 524 | 2
2020-03-01 | 2 | 443 | 1
I tried
Select
dt,
user_id,
count(distinct purchase_id) over (partition by user_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cum_ct
from table
I get an error saying that I cannot use count distinct with an ORDER BY clause. What can I do?

Something like this
Select
dt as [date],
user_id,
purchase_id,
SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) over (partition by user_id ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cum_ct
from (
SELECT
dt,
user_id,
purchase_id,
ROW_NUMBER() OVER (PARTITION BY user_id, purchase_id ORDER BY dt) as RN
FROM sometable
) sub
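To see why this works, here is a minimal sketch that runs the same idea against the question's four rows. It assumes SQLite (3.25 or newer, for window functions) through Python's sqlite3 module rather than the asker's database, and it uses underscores in the column names. ROW_NUMBER() marks the first time a user has each purchase_id, and the running SUM of those marks is the cumulative distinct count.

```python
import sqlite3

# In-memory SQLite database standing in for the asker's table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (dt TEXT, user_id INTEGER, purchase_id INTEGER)")
conn.executemany(
    "INSERT INTO sometable VALUES (?, ?, ?)",
    [
        ("2020-01-01", 1, 244),
        ("2020-01-03", 1, 244),
        ("2020-02-01", 1, 524),
        ("2020-03-01", 2, 443),
    ],
)

query = """
SELECT dt, user_id, purchase_id,
       -- rn = 1 only on the first row of each (user_id, purchase_id) pair,
       -- so the running sum counts distinct purchase ids seen so far.
       SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (
           PARTITION BY user_id ORDER BY dt
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS cum_ct
FROM (
    SELECT dt, user_id, purchase_id,
           ROW_NUMBER() OVER (PARTITION BY user_id, purchase_id ORDER BY dt) AS rn
    FROM sometable
) AS sub
ORDER BY user_id, dt
"""
cum_rows = conn.execute(query).fetchall()
for row in cum_rows:
    print(row)
```

The last column comes out as 1, 1, 2, 1, which matches the cum_purchase column the question asks for.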

Related

Counting number of rows leading to some number

The following is a snippet of my table...
My table has a lot of more users and higher order_rank
I'm trying to get the number of visits leading up to that order_rank in postgres.
So the result I'm trying to generate looks like...
I would address this as a gaps-and-islands problem, where each island ends with an order (a row with a non-null order_rank). You want the end of each island, along with the count of preceding records in the same island.
You can define the group with a window count of non-null values that starts from the end of the table. Then, just use that information to count how many records belong to each group:
select *
from (
select t.*,
count(*) over(partition by customer_id, grp) - 1 as number_of_visits
from (
select t.*,
count(order_rank) over(partition by customer_id order by visit_time desc) grp
from mytable t
) t
) t
where order_rank is not null
Demo on DB Fiddle:
customer_id | visit_time | txn_flag | order_rank | grp | number_of_visits
----------: | :--------- | -------: | ---------: | --: | ---------------:
123 | 2020-01-04 | 1 | 1 | 3 | 3
123 | 2020-01-06 | 1 | 2 | 2 | 1
123 | 2020-01-11 | 1 | 3 | 1 | 4
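The question's sample table did not survive, so the following sketch uses invented rows for customer 123, chosen only so that the result matches the three rows shown above. It assumes SQLite (3.25+) via Python's sqlite3 and keeps just the columns the query needs. Counting non-null order_rank values from the latest visit backwards gives every row the same grp as the order that follows it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (customer_id INTEGER, visit_time TEXT, order_rank INTEGER)"
)
# Invented data: order_rank is NULL for a plain visit, set when an order happens.
visits = [
    ("2020-01-01", None), ("2020-01-02", None), ("2020-01-03", None),
    ("2020-01-04", 1),
    ("2020-01-05", None),
    ("2020-01-06", 2),
    ("2020-01-07", None), ("2020-01-08", None), ("2020-01-09", None),
    ("2020-01-10", None),
    ("2020-01-11", 3),
]
conn.executemany("INSERT INTO mytable VALUES (123, ?, ?)", visits)

query = """
SELECT customer_id, visit_time, order_rank, grp, number_of_visits
FROM (
    SELECT g.*,
           -- island size minus the order row itself
           COUNT(*) OVER (PARTITION BY customer_id, grp) - 1 AS number_of_visits
    FROM (
        SELECT t.*,
               -- counts orders at or after this row, walking back from the end
               COUNT(order_rank) OVER (
                   PARTITION BY customer_id ORDER BY visit_time DESC
               ) AS grp
        FROM mytable AS t
    ) AS g
) AS r
WHERE order_rank IS NOT NULL
ORDER BY visit_time
"""
island_rows = conn.execute(query).fetchall()
for row in island_rows:
    print(row)
```

With these rows, the orders on 2020-01-04, 2020-01-06 and 2020-01-11 get 3, 1 and 4 preceding visits respectively.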

How to calculate average of values without including the last value (sql)?

I have a table. I partition it by the id and want to calculate average of the values previous to the current, without including the current value. Here is a sample table:
+----+-------+------------+
| id | Value | Date |
+----+-------+------------+
| 1 | 51 | 2020-11-26 |
| 1 | 45 | 2020-11-25 |
| 1 | 47 | 2020-11-24 |
| 2 | 32 | 2020-11-26 |
| 2 | 51 | 2020-11-25 |
| 2 | 45 | 2020-11-24 |
| 3 | 47 | 2020-11-26 |
| 3 | 32 | 2020-11-25 |
| 3 | 35 | 2020-11-24 |
+----+-------+------------+
In this case, it means calculating the average of values for dates BEFORE 2020-11-26. This is the expected result
+----+-------+
| id | Value |
+----+-------+
| 1 | 46 |
| 2 | 48 |
| 3 | 33.5 |
+----+-------+
I have calculated it using ROWS N PRECEDING but it appears that this way I average N preceding + last row, and I want to exclude the last row (which is the most recent date in my case).
Here is my query:
SELECT ID,
(avg(Value) OVER(
PARTITION BY ID
ORDER BY Date
ROWS 9 PRECEDING )) as avg9
FROM t1
Then define your window in full, giving both the start and the end of the frame with BETWEEN:
SELECT ID,
(AVG(Value) OVER (PARTITION BY ID ORDER BY Date ROWS BETWEEN 9 PRECEDING AND 1 PRECEDING)) AS avg9
FROM t1;
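As a quick check of the frame, here is a sketch that runs it on the question's sample rows. It assumes SQLite (3.25+) via Python's sqlite3, and the alias avg_prev is made up for this example. Ending the frame at 1 PRECEDING leaves the current row out, so the first row of each id has an empty frame and gets NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, value REAL, date TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        (1, 51, "2020-11-26"), (1, 45, "2020-11-25"), (1, 47, "2020-11-24"),
        (2, 32, "2020-11-26"), (2, 51, "2020-11-25"), (2, 45, "2020-11-24"),
        (3, 47, "2020-11-26"), (3, 32, "2020-11-25"), (3, 35, "2020-11-24"),
    ],
)

query = """
SELECT id, date,
       AVG(value) OVER (
           PARTITION BY id ORDER BY date
           ROWS BETWEEN 9 PRECEDING AND 1 PRECEDING  -- excludes the current row
       ) AS avg_prev
FROM t1
ORDER BY id, date
"""
frame_rows = conn.execute(query).fetchall()

# Rows arrive ordered by date within each id, so the last assignment per id
# is the value computed on that id's most recent date.
latest_avg = {row[0]: row[2] for row in frame_rows}
print(latest_avg)  # {1: 46.0, 2: 48.0, 3: 33.5}
```

Note that this returns one value per row, not one per id, so you still need to pick the row with the latest date if you only want the final averages.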
Why not just filter:
select id, avg(value)
from t1
where date < '2020-11-26'
group by id;
If you want the date to be flexible, say the most recent date for each id, then:
select id, avg(value)
from (select t1.*,
max(date) over (partition by id) as max_date
from t1
) t1
where date < max_date
group by id;
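The flexible version can be tried out on the sample rows like this. It is a sketch assuming SQLite (3.25+) through Python's sqlite3; the alias avg_prev and the subquery alias x are only for this example. Each row is compared against the latest date of its own id, so the most recent row is dropped before averaging.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, value REAL, date TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        (1, 51, "2020-11-26"), (1, 45, "2020-11-25"), (1, 47, "2020-11-24"),
        (2, 32, "2020-11-26"), (2, 51, "2020-11-25"), (2, 45, "2020-11-24"),
        (3, 47, "2020-11-26"), (3, 32, "2020-11-25"), (3, 35, "2020-11-24"),
    ],
)

query = """
SELECT id, AVG(value) AS avg_prev
FROM (
    SELECT t1.*,
           MAX(date) OVER (PARTITION BY id) AS max_date  -- latest date per id
    FROM t1
) AS x
WHERE date < max_date   -- drop each id's most recent row
GROUP BY id
ORDER BY id
"""
filtered_avg = conn.execute(query).fetchall()
print(filtered_avg)  # [(1, 46.0), (2, 48.0), (3, 33.5)]
```

ISO-formatted date strings compare correctly as text, which is why the plain `<` works here.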
Do a row_number() over (partition by id order by [Date] desc). This gives RN = 1 to the row with the latest date. Wrap it in a CTE and then calculate the average for each id where RN > 1.
;with a as
(
select id, value, Date, row_number() over (partition by id order by Date
desc) as RN
from t1
)
select id, avg(Value) from a where RN > 1 group by id

Grouping consecutive sequences of rows

I'm trying to group consecutive rows where a boolean value is true on SQL Server. For example, here's what some source data looks like:
AccountID | ID | IsTrue | Date
-------------------------------
1 | 1 | 1 | 1/1/2013
1 | 2 | 1 | 1/2/2013
1 | 3 | 1 | 1/3/2013
1 | 4 | 0 | 1/4/2013
1 | 5 | 1 | 1/5/2013
1 | 6 | 0 | 1/6/2013
1 | 7 | 1 | 1/7/2013
1 | 8 | 1 | 1/8/2013
1 | 9 | 1 | 1/9/2013
And here's what I'd like as the output
AccountID | Start | End
-------------------------------
1 | 1/1/2013 | 1/3/2013
1 | 1/7/2013 | 1/9/2013
I have a hunch that there's some trick with grouping by partitions that will make this work but I've been unable to figure it out. I've made some progress using LAG but haven't been able to put it all together.
Thanks for the help!
This is an example of a gaps-and-islands problem. For this version, you just need a sequential number for each isTrue value. Subtracting this number, as days, from each date gives a constant for adjacent rows that have the same value:
select accountId, isTrue, min(date), max(date)
from (select t.*,
row_number() over (partition by accountId, isTrue order by date) as seqnum
from t
) t
group by accountId, isTrue, dateadd(day, -seqnum, date);
This defines all groups. If I assume that you just want values of "1" that are more than 1 day long, then:
select accountId, isTrue, min(date), max(date)
from (select t.*,
row_number() over (partition by accountId, isTrue order by date) as seqnum
from t
where isTrue = 1
) t
group by accountId, isTrue, dateadd(day, -seqnum, date)
having count(*) > 1;
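Here is a sketch of that second query on the sample data. It assumes SQLite (3.25+) via Python's sqlite3, so dateadd(day, -seqnum, date) is swapped for julianday(date) - seqnum, and the dates are stored as ISO strings. Both choices are assumptions of this example, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (accountId INTEGER, id INTEGER, isTrue INTEGER, date TEXT)")
flags = [1, 1, 1, 0, 1, 0, 1, 1, 1]  # IsTrue for 2013-01-01 .. 2013-01-09
conn.executemany(
    "INSERT INTO t VALUES (1, ?, ?, ?)",
    [(i + 1, flag, f"2013-01-{i + 1:02d}") for i, flag in enumerate(flags)],
)

query = """
SELECT accountId, MIN(date) AS start_date, MAX(date) AS end_date
FROM (
    SELECT accountId, isTrue, date,
           ROW_NUMBER() OVER (PARTITION BY accountId, isTrue ORDER BY date) AS seqnum
    FROM t
    WHERE isTrue = 1
) AS s
-- day number minus sequence number stays constant within a run of adjacent days
GROUP BY accountId, isTrue, julianday(date) - seqnum
HAVING COUNT(*) > 1
ORDER BY start_date
"""
islands = conn.execute(query).fetchall()
print(islands)
# [(1, '2013-01-01', '2013-01-03'), (1, '2013-01-07', '2013-01-09')]
```

The single true row on 2013-01-05 forms its own group and is removed by the HAVING clause.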
You can try the following. It assumes that id always has consecutive values.
with cte as
(
select
*,
count(*) over (partition by IsTrue, rnk) as total
from
(
select
*,
id - row_number() over (partition by IsTrue order by id, date) as rnk
from myTable
) val
)
select
accountId,
min(date) as start,
max(date) as end
from cte
where total > 1
group by
accountId,
rnk
Output:
| accountid | start | end |
| --------- | ---------- | -----------|
| 1 | 2013-01-01 | 2013-01-03 |
| 1 | 2013-01-07 | 2013-01-09 |

Get users who took ride for 3 or more consecutive dates

I have below table, it shows user_id and ride_date.
+---------+------------+
| user_id | ride_date |
+---------+------------+
| 1 | 2019-11-01 |
| 1 | 2019-11-03 |
| 1 | 2019-11-05 |
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-05 |
| 4 | 2019-11-07 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
| 5 | 2019-11-11 |
| 5 | 2019-11-13 |
+---------+------------+
I want the user_id of users who took rides on 3 or more consecutive days, along with the days on which they took those consecutive rides.
The desired result is as below:
+---------+-----------------------+
| user_id | consecutive_ride_date |
+---------+-----------------------+
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-07 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
+---------+-----------------------+
With LAG() and LEAD() window functions:
with cte as (
select *,
datediff(
day,
lag([ride_date]) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev1,
datediff(
day,
lag([ride_date], 2) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev2,
datediff(
day,
[ride_date],
lead([ride_date]) over (partition by [user_id] order by [ride_date])
) next1,
datediff(
day,
[ride_date],
lead([ride_date], 2) over (partition by [user_id] order by [ride_date])
) next2
from Table1
)
select [user_id], [ride_date]
from cte
where
(prev1 = 1 and prev2 = 2) or
(prev1 = 1 and next1 = 1) or
(next1 = 1 and next2 = 2)
Results:
user_id | ride_date
------: | :---------
2 | 03/11/2019
2 | 04/11/2019
2 | 05/11/2019
2 | 06/11/2019
3 | 03/11/2019
3 | 04/11/2019
3 | 05/11/2019
3 | 06/11/2019
4 | 07/11/2019
4 | 08/11/2019
4 | 09/11/2019
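The LAG()/LEAD() check can be reproduced with the sketch below. It assumes SQLite (3.25+) through Python's sqlite3, so datediff(day, a, b) is replaced by julianday(b) - julianday(a), and a named WINDOW is used to avoid repeating the OVER clause. A row qualifies if it is the last, middle, or first day of some run of three consecutive days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (user_id INTEGER, ride_date TEXT)")
rides = {
    1: ["2019-11-01", "2019-11-03", "2019-11-05"],
    2: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    3: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    4: ["2019-11-05", "2019-11-07", "2019-11-08", "2019-11-09"],
    5: ["2019-11-11", "2019-11-13"],
}
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(user, day) for user, days in rides.items() for day in days],
)

query = """
WITH cte AS (
    SELECT user_id, ride_date,
           julianday(ride_date) - julianday(LAG(ride_date) OVER w)     AS prev1,
           julianday(ride_date) - julianday(LAG(ride_date, 2) OVER w)  AS prev2,
           julianday(LEAD(ride_date) OVER w) - julianday(ride_date)    AS next1,
           julianday(LEAD(ride_date, 2) OVER w) - julianday(ride_date) AS next2
    FROM Table1
    WINDOW w AS (PARTITION BY user_id ORDER BY ride_date)
)
SELECT user_id, ride_date
FROM cte
WHERE (prev1 = 1 AND prev2 = 2)   -- last day of a 3-day run
   OR (prev1 = 1 AND next1 = 1)   -- middle day
   OR (next1 = 1 AND next2 = 2)   -- first day
ORDER BY user_id, ride_date
"""
lag_lead_rows = conn.execute(query).fetchall()
for row in lag_lead_rows:
    print(row)
```

This returns all four dates for users 2 and 3, and 2019-11-07 through 2019-11-09 for user 4.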
Here is one way to address this gaps-and-islands problem:
First, assign a rank to each user ride with row_number() (aliased rn1), and recover the previous ride_date with lag() (aliased lag_ride_date).
Then, compare the date of the previous ride to the current one in a conditional window sum that increases when the dates are successive; subtracting this sum from the rank of the ride gives groups (aliased grp) that represent consecutive rides with a 1-day spacing.
Next, do a window count of how many records belong to each group (aliased cnt).
Finally, filter on records whose window count is at least 3.
Query:
select user_id, ride_date
from (
select
t.*,
count(*) over(partition by user_id, grp) cnt
from (
select
t.*,
rn1
- sum(case when ride_date = dateadd(day, 1, lag_ride_date) then 1 else 0 end)
over(partition by user_id order by ride_date) grp
from (
select
t.*,
row_number() over(partition by user_id order by ride_date) rn1,
lag(ride_date) over(partition by user_id order by ride_date) lag_ride_date
from Table1 t
) t
) t
) t
where cnt >= 3
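The steps above can be traced with this sketch, which assumes SQLite (3.25+) via Python's sqlite3 and therefore writes dateadd(day, 1, lag_ride_date) as date(lag_ride_date, '+1 days'). The subquery aliases a, b and c are only for this example. For user 4, the ranks are 1, 2, 3, 4 and the running sums are 0, 0, 1, 2, so grp is 1, 2, 2, 2 and the last three rides share a group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (user_id INTEGER, ride_date TEXT)")
rides = {
    1: ["2019-11-01", "2019-11-03", "2019-11-05"],
    2: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    3: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    4: ["2019-11-05", "2019-11-07", "2019-11-08", "2019-11-09"],
    5: ["2019-11-11", "2019-11-13"],
}
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(user, day) for user, days in rides.items() for day in days],
)

query = """
SELECT user_id, ride_date
FROM (
    SELECT user_id, ride_date,
           COUNT(*) OVER (PARTITION BY user_id, grp) AS cnt
    FROM (
        SELECT user_id, ride_date,
               -- rank minus "number of consecutive steps so far" is constant
               -- inside a run of consecutive days
               rn1 - SUM(CASE WHEN ride_date = date(lag_ride_date, '+1 days')
                              THEN 1 ELSE 0 END)
                     OVER (PARTITION BY user_id ORDER BY ride_date) AS grp
        FROM (
            SELECT user_id, ride_date,
                   ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ride_date) AS rn1,
                   LAG(ride_date) OVER (PARTITION BY user_id ORDER BY ride_date) AS lag_ride_date
            FROM Table1
        ) AS a
    ) AS b
) AS c
WHERE cnt >= 3
ORDER BY user_id, ride_date
"""
grouped_rows = conn.execute(query).fetchall()
for row in grouped_rows:
    print(row)
```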
This is a typical gaps-and-islands problem.
We can solve it as follows:
with data
as (
select user_id
,ride_date
,dateadd(day
,-row_number() over(partition by user_id order by ride_date asc)
,ride_date) as grp_field
from Table1
)
,consecutive_days
as(
select user_id
,ride_date
,count(*) over(partition by user_id,grp_field) as cnt
from data
)
select *
from consecutive_days
where cnt>=3
order by user_id,ride_date
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=7bb851d9a12966b54afb4d8b144f3d46
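For readers without SQL Server at hand, here is a sketch of the same two-CTE query. It assumes SQLite (3.25+) via Python's sqlite3, so the dateadd call becomes julianday(ride_date) minus the row number. That difference is the same for all rides in a run of consecutive days, and it assumes at most one ride per user per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (user_id INTEGER, ride_date TEXT)")
rides = {
    1: ["2019-11-01", "2019-11-03", "2019-11-05"],
    2: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    3: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    4: ["2019-11-05", "2019-11-07", "2019-11-08", "2019-11-09"],
    5: ["2019-11-11", "2019-11-13"],
}
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(user, day) for user, days in rides.items() for day in days],
)

query = """
WITH data AS (
    SELECT user_id, ride_date,
           -- day number minus row number: constant within consecutive days
           julianday(ride_date)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ride_date) AS grp_field
    FROM Table1
),
consecutive_days AS (
    SELECT user_id, ride_date,
           COUNT(*) OVER (PARTITION BY user_id, grp_field) AS cnt
    FROM data
)
SELECT user_id, ride_date
FROM consecutive_days
WHERE cnt >= 3
ORDER BY user_id, ride_date
"""
streak_rows = conn.execute(query).fetchall()
for row in streak_rows:
    print(row)
```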
There is no need to apply gaps-and-islands methodologies to this problem. The problem is much simpler to solve.
You can return the users and first date just by using LEAD():
SELECT t1.*
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1
WHERE ride_date_2 = DATEADD(day, 2, ride_date);
If you want the actual dates, you can unpivot the results:
SELECT DISTINCT t1.user_id, v.ride_date
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1 CROSS APPLY
(VALUES (t1.ride_date),
(DATEADD(day, 1, t1.ride_date)),
(DATEADD(day, 2, t1.ride_date))
) v(ride_date)
WHERE t1.ride_date_2 = DATEADD(day, 2, t1.ride_date)
ORDER BY t1.user_id, v.ride_date;
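A sketch of the LEAD() approach follows, assuming SQLite (3.25+) via Python's sqlite3. SQLite has no CROSS APPLY, so the unpivot is done here by joining each qualifying start date to a small offsets table (0, 1, 2); the names starts, offsets and consecutive_ride_date are made up for this example. Like the query above, it relies on each user having at most one row per date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (user_id INTEGER, ride_date TEXT)")
rides = {
    1: ["2019-11-01", "2019-11-03", "2019-11-05"],
    2: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    3: ["2019-11-03", "2019-11-04", "2019-11-05", "2019-11-06"],
    4: ["2019-11-05", "2019-11-07", "2019-11-08", "2019-11-09"],
    5: ["2019-11-11", "2019-11-13"],
}
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(user, day) for user, days in rides.items() for day in days],
)

query = """
WITH starts AS (
    -- rows whose second-next ride is exactly two days later start a 3-day run
    SELECT user_id, ride_date
    FROM (
        SELECT user_id, ride_date,
               LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) AS ride_date_2
        FROM Table1
    ) AS l
    WHERE ride_date_2 = date(ride_date, '+2 days')
),
offsets(n) AS (VALUES (0), (1), (2))
SELECT DISTINCT s.user_id,
       date(s.ride_date, '+' || o.n || ' days') AS consecutive_ride_date
FROM starts AS s
CROSS JOIN offsets AS o
ORDER BY 1, 2
"""
lead_rows = conn.execute(query).fetchall()
for row in lead_rows:
    print(row)
```

DISTINCT is needed because overlapping runs (for example the ones starting on 2019-11-03 and 2019-11-04 for user 2) produce the same date more than once.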

SQL statement with HAVING, MIN, MAX

I have a table:
ID INTEGER NOT NULL, -- AUTOMATIC RECORD'S ID
CUSTOMER_ID INTEGER NOT NULL,
BILING_PERIOD DATE NOT NULL,
DOCUMENT_ID INTEGER NOT NULL,
DATE_CREATED DATE NOT NULL -- WHEN THE DOCUMENT WAS CREATED
I want to select the number of documents per customer in each billing period,
the id of the document that was created first in the billing period for that customer,
and the id of the document that was created last in the billing period for that customer.
Everything should be sorted by customer and billing period.
I only want billing periods that have more than 1 document for the customer.
For example, given this data:
ID CUSTOMER_ID BILING_PERIOD DOCUMENT_ID DATE_CREATED
1 5 2020-01-01 123 2020-02-01
2 5 2020-01-01 22 2019-02-01
3 5 2020-01-01 3 2010-02-01
4 99 2020-01-01 458 2021-02-01
5 99 2020-01-01 64 2010-02-01
6 100 2020-01-01 120 2020-02-01
7 99 2019-06-01 452 2019-06-01
8 99 2019-06-01 546 2019-12-01
I want my results to look like this:
CUSTOMER_ID BILING_PERIOD NR_OF_DOC FIRST_DOC_ID LAST_DOC_ID
5 2020-01-01 3 3 123
99 2019-06-01 2 452 546
99 2020-01-01 2 64 458
So far I can only count the number of documents per customer and period:
SELECT customer_id, biling_period, count(*) as nr_of_doc
FROM T1
GROUP BY customer_id, biling_period
HAVING COUNT(*) > 1;
CUSTOMER_ID BILING_PERIOD NR_OF_DOC
5 2020-01-01 3
99 2019-06-01 2
99 2020-01-01 2
I do not know how to get the document_id of the newest and oldest documents.
You can use row_number() and aggregation:
select
customer_id,
billing_period,
count(*),
max(case when rn_asc = 1 then document_id end) first_doc_id,
max(case when rn_desc = 1 then document_id end) last_doc_id
from (
select
t.*,
row_number() over(
partition by customer_id, billing_period order by date_created
) rn_asc,
row_number() over(
partition by customer_id, billing_period order by date_created desc
) rn_desc
from t1 t
) t
group by customer_id, billing_period
having count(*) > 1
order by customer_id, billing_period
This will work properly even if the document ids are not in sequence.
Demo on DB Fiddle:
customer_id | billing_period | count | first_doc_id | last_doc_id
----------: | :------------- | ----: | ----------: | ----------:
5 | 2020-01-01 | 3 | 3 | 123
99 | 2019-06-01 | 2 | 452 | 546
99 | 2020-01-01 | 2 | 64 | 458
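The row_number() plus conditional aggregation query can be checked against the question's rows with the sketch below. It assumes SQLite (3.25+) via Python's sqlite3 and uses the billing_period spelling from this answer rather than the BILING_PERIOD column in the question's schema. Within each group exactly one row has rn_asc = 1 and exactly one has rn_desc = 1, so MAX() over the CASE expressions just picks those two document ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (id INTEGER, customer_id INTEGER, billing_period TEXT, "
    "document_id INTEGER, date_created TEXT)"
)
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?, ?, ?)",
    [
        (1, 5, "2020-01-01", 123, "2020-02-01"),
        (2, 5, "2020-01-01", 22, "2019-02-01"),
        (3, 5, "2020-01-01", 3, "2010-02-01"),
        (4, 99, "2020-01-01", 458, "2021-02-01"),
        (5, 99, "2020-01-01", 64, "2010-02-01"),
        (6, 100, "2020-01-01", 120, "2020-02-01"),
        (7, 99, "2019-06-01", 452, "2019-06-01"),
        (8, 99, "2019-06-01", 546, "2019-12-01"),
    ],
)

query = """
SELECT customer_id, billing_period, COUNT(*) AS nr_of_doc,
       MAX(CASE WHEN rn_asc = 1 THEN document_id END)  AS first_doc_id,
       MAX(CASE WHEN rn_desc = 1 THEN document_id END) AS last_doc_id
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id, billing_period
                              ORDER BY date_created) AS rn_asc,
           ROW_NUMBER() OVER (PARTITION BY customer_id, billing_period
                              ORDER BY date_created DESC) AS rn_desc
    FROM t1 AS t
) AS r
GROUP BY customer_id, billing_period
HAVING COUNT(*) > 1
ORDER BY customer_id, billing_period
"""
doc_rows = conn.execute(query).fetchall()
for row in doc_rows:
    print(row)
```

Customer 100 has a single document in its period, so the HAVING clause removes it.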
In your sample data, the document ids seem to be assigned in order. If that is the case, you can just use aggregation:
SELECT customer_id, billing_period, count(*) as nr_of_doc,
MIN(document_id), MAX(document_id)
FROM T1
GROUP BY customer_id, billing_period
HAVING COUNT(*) > 1;