Rolling Daily Distinct Counts Partition by Month - sql

I have the following table:
CREATE TABLE tbl (
id int NOT NULL
, date date NOT NULL
, cid int NOT NULL
);
INSERT INTO tbl VALUES
(1 , '2022-01-01', 1)
, (2 , '2022-01-01', 1)
, (3 , '2022-01-01', 2)
, (4 , '2022-01-01', 3)
, (5 , '2022-01-02', 1)
, (6 , '2022-01-02', 4)
, (7 , '2022-01-03', 5)
, (8 , '2022-01-03', 6)
, (9 , '2022-02-01', 1)
, (10, '2022-02-01', 5)
, (11, '2022-02-02', 5)
, (12, '2022-02-02', 3)
;
I'm trying to count distinct users (= cid) for each day, but the count is rolling within the month. E.g., for 2022-01-01, only distinct users with date = 2022-01-01 are counted. For 2022-01-02, distinct users with date between 2022-01-01 and 2022-01-02 are counted, and so on. The count should restart each month.
My desired output:
date        distinct_cids
2022-01-01  3
2022-01-02  4
2022-01-03  6
2022-02-01  2
2022-02-02  3

I don't have access to Snowflake, so I can't guarantee that this will work, but from the sound of it:
select date, count(distinct cid) over (partition by month(date) order by date)
from tbl
order by date;
If you have several years' worth of data, you can partition by year and month:
select date, count(distinct cid) over (partition by year(date), month(date) order by date)
from tbl
order by date;
date is a reserved word, so you may consider renaming your column.
EDIT: Since distinct is disallowed in the window function, you can try a vanilla SQL variant. It is likely to be slow for a large table:
select dt, count(cid)
from (
    select distinct dt.dt, x.cid
    from tbl x
    join (
        select distinct date as dt from tbl
    ) dt (dt)
        on x.date <= dt.dt
        and month(x.date) = month(dt.dt)
) t
group by dt
order by dt;
The idea is that we create a new relation (t) holding the distinct users whose date is less than or equal to the current date within the current month. Then we can just count those users for each date; see the multi-year sketch below.
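If the data spans multiple years, the month() comparison alone would mix months from different years. A hedged tweak of the same query (untested), following the year/month note above:
select dt, count(cid)
from (
    select distinct dt.dt, x.cid
    from tbl x
    join (
        select distinct date as dt from tbl
    ) dt (dt)
        on x.date <= dt.dt
        and month(x.date) = month(dt.dt)
        and year(x.date) = year(dt.dt) -- restart the rolling count each year as well
) t
group by dt
order by dt;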

Related

SQL Find the daily maximum units from a table which stores transactions

I have an SQL table which stores the units (inventory) of items at any given timestamp. Any transaction (add/delete) on an item basically updates this table with the new quantity and the timestamp of occurrence.
update_timestamp  item_id  units
1637993217        item1    3
1637993227        item2    1
1637993117        item1    2
1637993237        item1    5
I need to fetch the daily maximum units for every item from this table.
The query I am using is similar to this:
SELECT date_format(from_unixtime((CAST(update_timestamp AS BIGINT))/1000), '%Y-%m-%d') AS day,
       item_id,
       MAX(units) AS max_units
FROM Table
GROUP BY item_id, day;
which gives an output like:
day         item_id  max_units
2021-11-23  item1    5
2021-11-24  item1    6
2021-11-23  item2    3
....
....
However, when generating the output, I also need to account for the units carried forward from the balance of the transaction preceding my current day.
Example: For item1, there were a few transactions on day 2021-11-24 and the quantity at the end of that day was 6. Now if the next transaction(s) on this item occurred only on 2021-11-26, and say were in the following sequence for that date: [4, 2, 3], then 6 should continue to be the maximum units of the item for the days 2021-11-25 and 2021-11-26 as well.
I am stuck here and unable to get this working in SQL. My current approach is to fetch the last transaction for every day separately, then use Python scripts to forward-fill this data for the following days, which is not clean or scalable in my case.
I am running queries on Presto SQL engine.
You can use the lag window function to get the previous value and select the maximum of it and the current one:
WITH dataset (update_timestamp, item_id, units) AS (
    VALUES (timestamp '2021-11-21 00:00:01', 'item1', 10),
           (timestamp '2021-11-23 00:00:02', 'item1', 6),
           (timestamp '2021-11-23 00:00:03', 'item2', 1),
           (timestamp '2021-11-24 00:00:01', 'item1', 2),
           (timestamp '2021-11-24 00:00:04', 'item1', 5)
)
SELECT item_id,
       day,
       coalesce( -- greatest returns NULL if one of the arguments is NULL, so fall back to the current value
           greatest(
               max_units,
               lag(max_units) over (
                   partition by item_id
                   order by day
               )
           ),
           max_units
       ) as max_units
FROM (
    SELECT item_id,
           date_trunc('day', update_timestamp) day,
           max(units) as max_units
    FROM dataset
    GROUP BY item_id,
             date_trunc('day', update_timestamp)
)
Output:
item_id  day                      max_units
item2    2021-11-23 00:00:00.000  1
item1    2021-11-21 00:00:00.000  10
item1    2021-11-23 00:00:00.000  10
item1    2021-11-24 00:00:00.000  6
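To point this at the real table rather than the inline VALUES, the inner aggregate might look like the following hedged sketch. It reuses the asker's own epoch-milliseconds cast, and Table stands in for the real table name:
SELECT item_id,
       date_trunc('day', from_unixtime(CAST(update_timestamp AS bigint) / 1000)) AS day,
       max(units) AS max_units
FROM Table
GROUP BY item_id,
         date_trunc('day', from_unixtime(CAST(update_timestamp AS bigint) / 1000))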
I think my answer is really close to Guru's. I made the assumption that you might need to fill in missing dates, so I created a calendar table; replace it with whatever you want.
This was written in BigQuery, so I am not sure whether it will compile/execute in Presto, but I think they are syntactically close.
with transactions as (
select cast('2021-11-17' as date) as update_timestamp, 'item1' as item_id, 3 as units union all
select cast('2021-11-18' as date), 'item2', 1 union all
select cast('2021-11-18' as date), 'item2', 5 union all
select cast('2021-11-20' as date), 'item1', 2 union all
select cast('2021-11-20' as date), 'item2', 3 union all
select cast('2021-11-20' as date), 'item2', 2 union all
select cast('2021-11-20' as date), 'item1', 10 union all
select cast('2021-11-24' as date), 'item1', 8 union all
select cast('2021-11-24' as date), 'item1', 5
),
some_calendar_table AS (
SELECT cast(d as date) as cal_date
FROM UNNEST(GENERATE_DATE_ARRAY('2021-11-15', '2021-11-30', INTERVAL 1 DAY)) AS d
),
daily_transaction_max as (
SELECT update_timestamp AS transaction_date,
item_id,
MAX(units) as max_value
from transactions
group by item_id, transaction_date
)
select cal.cal_date
     , t.item_id
     , mt.max_value as max_inventory_from_this_dates_transactions
     , greatest(coalesce(mt.max_value, 0),
                coalesce(last_value(mt.max_value ignore nulls) over(
                             partition by t.item_id
                             order by cal.cal_date
                             rows between unbounded preceding and 1 preceding),
                         0)) as max_daily_inventory
from some_calendar_table cal
cross join (select distinct item_id from daily_transaction_max) t
left join daily_transaction_max mt
    on mt.transaction_date = cal.cal_date
    and mt.item_id = t.item_id
order by t.item_id, cal.cal_date

SQL query to find the number of customers who shopped for 3 consecutive days in month of January 2020

I have the below table called orders, which has a customer id and order date (note: there can be multiple orders from the same customer on a single day):
create table orders (Id char, order_dt date)
insert into orders values
('A','1/1/2020'),
('B','1/1/2020'),
('C','1/1/2020'),
('D','1/1/2020'),
('A','1/1/2020'),
('B','1/1/2020'),
('A','2/1/2020'),
('B','2/1/2020'),
('C','2/1/2020'),
('B','2/1/2020'),
('A','3/1/2020'),
('B','3/1/2020')
I'm trying to write an SQL query to find the number of customers who shopped on 3 consecutive days in the month of January 2020.
Based on the above order values, the output should be: 2
I referred to other similar questions but still wasn't able to come to an exact solution.
Here is my solution, which works fine even when there are many orders from one customer in one day.
Some scripts to build a test environment:
create table orders (Id varchar2(1), order_dt date);
insert into orders values('A',to_date('01/01/2020','dd/mm/yyyy'));
insert into orders values('B',to_date('01/01/2020','dd/mm/yyyy'));
insert into orders values('C',to_date('01/01/2020','dd/mm/yyyy'));
insert into orders values('D',to_date('01/01/2020','dd/mm/yyyy'));
insert into orders values('A',to_date('01/01/2020','dd/mm/yyyy'));
insert into orders values('B',to_date('01/01/2020','dd/mm/yyyy'));
insert into orders values('A',to_date('02/01/2020','dd/mm/yyyy'));
insert into orders values('B',to_date('02/01/2020','dd/mm/yyyy'));
insert into orders values('C',to_date('02/01/2020','dd/mm/yyyy'));
insert into orders values('B',to_date('02/01/2020','dd/mm/yyyy'));
insert into orders values('A',to_date('03/01/2020','dd/mm/yyyy'));
insert into orders values('B',to_date('03/01/2020','dd/mm/yyyy'));
select distinct id, count_days from (
    select id,
           order_dt,
           count(*) over(partition by id order by order_dt range between 1 preceding and 1 following) count_days
    from orders
    group by id, order_dt
)
where count_days = 3;
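Since the question ultimately asks for a single number (2 for the sample data), a minimal sketch that wraps the query above in a count:
select count(distinct id) as cnt
from (
    select id,
           count(*) over(partition by id order by order_dt range between 1 preceding and 1 following) count_days
    from orders
    group by id, order_dt
)
where count_days = 3;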
-- Insert to test more than 3 consecutive days
insert into orders values('A',to_date('04/01/2020','dd/mm/yyyy'));
You can use two window functions: one to calculate the difference between consecutive dates, and a sliding window with a RANGE offset to count the preceding consecutive days. Example here:
with gen as (
select 1 as cust_id, (date '2020-01-10') + 1 as q from dual union all
select 1, (date '2020-01-10') + 2 as q from dual union all
select 1, (date '2020-01-10') + 3 as q from dual union all
select 1, (date '2020-01-10') + 3 as q from dual union all
select 1, (date '2020-01-10') + 5 as q from dual union all
select 1, (date '2020-01-10') + 7 as q from dual union all
select 1, (date '2020-01-10') + 8 as q from dual union all
select 1, (date '2020-01-10') + 9 as q from dual
)
, diff as (
select gen.*
, q - lag(q) over(partition by cust_id, trunc(q, 'mm') order by q asc) as datediff
from gen
)
, window as (
select diff.*
, sum(decode(datediff, 1, 1, 0)) over(partition by cust_id, trunc(q, 'mm') order by q asc range between 2 preceding and current row) as cnt
from diff
)
select sum(count(distinct q)) as cnt
from window
where cnt = 2
group by cust_id
Why not join twice based on the same two following days? As long as you have an index on the customer's ID and date, the join should be optimized. Because the joins require a match relative to the same starting date, a row either finds its matches or it doesn't; if not, it is left out of the result set.
select distinct o1.id
from orders o1
join orders o2
    on o1.id = o2.id
    and o1.order_dt = o2.order_dt - interval '1' day
join orders o3
    on o1.id = o3.id
    and o1.order_dt = o3.order_dt - interval '2' day
Hmmmm . . . one method is to use lead()/lag(). Assuming that you don't have duplicates on a single day, then:
select distinct id
from (select o.*,
lag(order_dt) over (partition by id order by order_dt) as prev_order_dt,
lag(order_dt, 2) over (partition by id order by order_dt) as prev_order_dt2
from orders o
where order_dt >= date '2020-01-01' and
order_dt < date '2020-02-01'
) o
where prev_order_dt = order_dt - interval '1' day and
prev_order_dt2 = order_dt - interval '2' day;
EDIT:
If the table has duplicate records, the above is easily tweaked:
select distinct id
from (select o.*,
lag(order_dt) over (partition by id order by order_dt) as prev_order_dt,
lag(order_dt, 2) over (partition by id order by order_dt) as prev_order_dt2
from (select distinct o.id, trunc(order_dt) as order_dt
from orders o
where order_dt >= date '2020-01-01' and
order_dt < date '2020-02-01'
) o
) o
where prev_order_dt = order_dt - interval '1' day and
prev_order_dt2 = order_dt - interval '2' day;

Calculate inactive customers from single table

I have a table with the fields Customer No., Posting Date, and Order_ID. I want to find the total number of inactive customers for the last 12 months on a monthly basis, meaning customers who placed an order more than 12 months back and then went inactive. I want to calculate this for every month to understand how the number of inactive customers is growing month by month.
If I run the query in July, it should go back 365 days from the previous month end and give the total number of inactive customers. I want to do this month by month.
I am still learning, so please help. Thanks in advance for your time.
To get the customers:
SELECT DISTINCT a.CustomerNo
FROM YourTable a
WHERE NOT EXISTS
    (SELECT 0 FROM YourTable b
     WHERE a.CustomerNo = b.CustomerNo
       and b.PostingDate > dateadd(day, -365 - datepart(day, getdate()), getdate())
    )
To get a count:
SELECT count(DISTINCT a.CustomerNo) as InactiveCount
FROM YourTable a
WHERE NOT EXISTS
    (SELECT 0 FROM YourTable b
     WHERE a.CustomerNo = b.CustomerNo
       and b.PostingDate > dateadd(day, -365 - datepart(day, getdate()), getdate())
    )
Generate a 'months' table with a CTE, then look for inactive customers in those months:
;WITH month_gen as (
    SELECT dateadd(day, -0 - datepart(day, getdate()), getdate()) eom, 1 as x
    UNION ALL
    SELECT dateadd(day, -datepart(day, eom), eom) eom, x + 1 x
    FROM month_gen
    WHERE x < 12
)
SELECT CONVERT(varchar(7), month_gen.eom, 102), count(DISTINCT a.CustomerNo) inactiveCount
FROM YourTable a
CROSS JOIN month_gen
WHERE NOT EXISTS (SELECT 0 FROM YourTable b
                  WHERE a.CustomerNo = b.CustomerNo
                    and YEAR(b.PostingDate) = YEAR(eom)
                    and MONTH(b.PostingDate) = MONTH(eom))
GROUP BY CONVERT(varchar(7), month_gen.eom, 102)
If that gets you anywhere, maybe a final step is to filter out anything getting 'counted' before it was ever active, i.e., don't count 'new' customers before they became active; see the sketch below.
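A hedged, untested sketch of that final step, extending the WHERE clause of the query above:
WHERE NOT EXISTS (SELECT 0 FROM YourTable b
                  WHERE a.CustomerNo = b.CustomerNo
                    and YEAR(b.PostingDate) = YEAR(eom)
                    and MONTH(b.PostingDate) = MONTH(eom))
  -- assumption: a customer only counts as inactive if they were active at least once on or before the month end
  AND EXISTS (SELECT 0 FROM YourTable c
              WHERE a.CustomerNo = c.CustomerNo
                AND c.PostingDate <= month_gen.eom)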
Try the query below. To achieve your goal you need a calendar table (which I defined with a CTE). The query counts inactivity as of the first day of each month:
declare @tbl table (custNumber int, postDate date, orderId int);
insert into @tbl values
(1, '2017-01-01', 123),
(2, '2017-02-01', 124),
(3, '2017-02-01', 125),
(1, '2018-02-02', 126),
(2, '2018-05-01', 127),
(3, '2018-06-01', 128)
;with cte as (
select cast('2018-01-01' as date) dt
union all
select dateadd(month, 1, dt) from cte
where dt < '2018-12-01'
)
select dt, sum(case when t2.custNumber is null then 1 else 0 end) as inactive_count
from cte c
left join @tbl t1 on dateadd(year, -1, c.dt) >= t1.postDate
left join @tbl t2 on t2.postDate > dateadd(year, -1, c.dt) and t2.postDate <= c.dt and t1.custNumber = t2.custNumber
group by dt

Get average of last 7 days

I'm attacking a problem where I have a value for a range of dates. I would like to consolidate the rows in my table by averaging them and reassigning the date column to be relative to the last 7 days. My SQL experience is lacking and I could use some help. Thanks for giving this a look!
E.g.
7 rows with dates and values.
UniqueId  Date        Value
........  ....        .....
a         2014-03-20  2
a         2014-03-21  2
a         2014-03-22  3
a         2014-03-23  5
a         2014-03-24  1
a         2014-03-25  0
a         2014-03-26  1
Resulting row
UniqueId  Date        AvgValue
........  ....        ........
a         2014-03-26  2
First off, I am not even sure this is possible. I am trying to attack a problem with the data at hand. I thought maybe a framing window with a partition could roll the dates into one date with the averaged result, but I am not exactly sure how to say that in SQL.
I am taking the following as a sample:
CREATE TABLE some_data1 (unique_id text, date date, value integer);
INSERT INTO some_data1 (unique_id, date, value) VALUES
( 'a', '2014-03-20', 2),
( 'a', '2014-03-21', 2),
( 'a', '2014-03-22', 3),
( 'a', '2014-03-23', 5),
( 'a', '2014-03-24', 1),
( 'a', '2014-03-25', 0),
( 'a', '2014-03-26', 1),
( 'b', '2014-03-01', 1),
( 'b', '2014-03-02', 1),
( 'b', '2014-03-03', 1),
( 'b', '2014-03-04', 1),
( 'b', '2014-03-05', 1),
( 'b', '2014-03-06', 1),
( 'b', '2014-03-07', 1)
OPTION A: using a WITH clause (CTE), which is not available in older MySQL:
with cte as (
select unique_id
,max(date) date
from some_data1
group by unique_id
)
select max(sd.unique_id),max(sd.date),avg(sd.value)
from some_data1 sd inner join cte using(unique_id)
where sd.date <=cte.date
group by cte.unique_id
limit 7
OPTION B: works in both PostgreSQL and MySQL:
select max(sd.unique_id)
,max(sd.date)
,avg(sd.value)
from (
select unique_id
,max(date) date
from some_data1
group by unique_id
) cte inner join some_data1 sd using(unique_id)
where sd.date <=cte.date
group by cte.unique_id
limit 7
Maybe something along the lines of:
SELECT AVG(Value) AS AvgValue
FROM tableName
WHERE Date BETWEEN dateStart AND dateEnd
That will get you the average between those dates, and since you already have dateEnd you could use that result to create the row you're looking for.
For PostgreSQL a window function might be what you want:
DROP TABLE IF EXISTS some_data;
CREATE TABLE some_data (unique_id text, date date, value integer);
INSERT INTO some_data (unique_id, date, value) VALUES
( 'a', '2014-03-20', 2),
( 'a', '2014-03-21', 2),
( 'a', '2014-03-22', 3),
( 'a', '2014-03-23', 5),
( 'a', '2014-03-24', 1),
( 'a', '2014-03-25', 0),
( 'a', '2014-03-26', 1),
( 'a', '2014-03-27', 3);
WITH avgs AS (
SELECT unique_id, date,
avg(value) OVER w AS week_avg,
count(value) OVER w AS num_days
FROM some_data
WINDOW w AS (
PARTITION BY unique_id
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW))
SELECT unique_id, date, week_avg
FROM avgs
WHERE num_days=7
Result:
unique_id | date | week_avg
-----------+------------+--------------------
a | 2014-03-26 | 2.0000000000000000
a | 2014-03-27 | 2.1428571428571429
Questions include:
What happens if a day from the preceding six days is missing? Do we want to add it and count it as zero?
What happens if you add a day? Is the result of the code above what you want (a rolling 7-day average)?
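On the first point: one hedged option in PostgreSQL 11 or later (which allows RANGE frames with a date offset; an assumption worth verifying on your version) is to frame the window by calendar days rather than rows, so missing days simply drop out of the average instead of stretching the frame:
SELECT unique_id, date,
       avg(value) OVER (PARTITION BY unique_id
                        ORDER BY date
                        RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW) AS week_avg
FROM some_data;
This treats absent days as absent; they neither count as zero nor widen the window.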
For SQL Server, you can follow the approach below.
1. For the weekly average of values:
SET DATEFIRST 4
;WITH CTE AS
(
    SELECT *,
           DATEPART(WEEK, [DATE]) WK,
           -- Find the last day in that week
           ROW_NUMBER() OVER(PARTITION BY UNIQUEID, DATEPART(WEEK, [DATE]) ORDER BY [DATE] DESC) RNO,
           -- Find the average value of that week
           AVG(VALUE) OVER(PARTITION BY UNIQUEID, DATEPART(WEEK, [DATE])) AVGVALUE
    FROM DATETAB
)
SELECT UNIQUEID, [DATE], AVGVALUE
FROM CTE
WHERE RNO = 1
2. For the average of the last 7 days' values:
DECLARE @DATE DATE = '2014-03-26'
;WITH CTE AS
(
    SELECT UNIQUEID, [DATE], VALUE, @DATE CURRENTDATE
    FROM DATETAB
    WHERE [DATE] BETWEEN DATEADD(DAY, -7, @DATE) AND @DATE
)
SELECT UNIQUEID, CURRENTDATE [DATE], AVG(VALUE) AVGVALUE
FROM CTE
GROUP BY UNIQUEID, CURRENTDATE

SQL return consecutive records

A simple table:
ForumPost
--------------
ID (int PK)
UserID (int FK)
Date (datetime)
What I'm looking to return is how many times a particular user has made at least 1 post a day for n consecutive days.
Example:
User 15844 has posted at least 1 post a day for 30 consecutive days 10 times
I've tagged this question with linq/lambda as well, as a solution there would also be great. I know I can solve this by iterating over all the users' records, but this is slow.
There is a handy trick using ROW_NUMBER() to find consecutive entries. Imagine the following set of dates, with their row_number (starting at 0):
Date      RowNumber
20130401  0
20130402  1
20130403  2
20130404  3
20130406  4
20130407  5
For consecutive entries, if you subtract the row_number from the date you get the same result, e.g.:
Date      RowNumber  date - row_number
20130401  0          20130401
20130402  1          20130401
20130403  2          20130401
20130404  3          20130401
20130406  4          20130402
20130407  5          20130402
You can then group by date - row_number to get the sets of consecutive days (i.e. the first 4 records, and the last 2 records).
To apply this to your example you would use:
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, ConsecutiveDates = MAX(Days)
FROM Posts2
GROUP BY UserID;
EDIT
I don't think the above quite answered the question; this will give the number of times a user has posted on n or more consecutive days:
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
FirstDate = MIN(Date),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, [Times Over N Days] = COUNT(*)
FROM Posts2
WHERE Days >= 30
GROUP BY UserID;
Your particular application makes this pretty simple, I think. If you have 'n' distinct dates in an 'n'-day interval, those 'n' distinct dates must be consecutive.
Scroll to the bottom for a general solution that requires only common table expressions and changing to PostgreSQL. (Kidding. I implemented in PostgreSQL, because I'm short of time.)
create table ForumPost (
ID integer primary key,
UserID integer not null,
post_date date not null
);
insert into forumpost values
(1, 1, '2013-01-15'),
(2, 1, '2013-01-16'),
(3, 1, '2013-01-17'),
(4, 1, '2013-01-18'),
(5, 1, '2013-01-19'),
(6, 1, '2013-01-20'),
(7, 1, '2013-01-21'),
(11, 2, '2013-01-15'),
(12, 2, '2013-01-16'),
(13, 2, '2013-01-17'),
(16, 2, '2013-01-17'),
(14, 2, '2013-01-18'),
(15, 2, '2013-01-19'),
(21, 3, '2013-01-17'),
(22, 3, '2013-01-17'),
(23, 3, '2013-01-17'),
(24, 3, '2013-01-17'),
(25, 3, '2013-01-17'),
(26, 3, '2013-01-17'),
(27, 3, '2013-01-17');
Now, let's look at the output of this query. For brevity, I'm looking at 5-day intervals, not 30-day intervals.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid;
USERID DISTINCT_DATES
1 5
2 5
3 1
For users that fit the criteria, the number of distinct dates in that 5-day interval will have to be 5, right? So we just need to add that logic to a HAVING clause.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid
having count(distinct post_date) = 5;
USERID DISTINCT_DATES
1 5
2 5
A more general solution
It doesn't really make sense to say that, if you post every day from 2013-01-01 to 2013-01-31, you've posted 30 consecutive days 2 times. Instead, I'd expect the clock to start over on 2013-01-31. My apologies for implementing in PostgreSQL; I'll try to implement in T-SQL later.
with first_posts as (
select userid, min(post_date) first_post_date
from forumpost
group by userid
),
period_intervals as (
select userid, first_post_date period_start,
(first_post_date + interval '4' day)::date period_end
from first_posts
), user_specific_intervals as (
select
userid,
(period_start + (n || ' days')::interval)::date as period_start,
(period_end + (n || ' days')::interval)::date as period_end
from period_intervals, generate_series(0, 30, 5) n
)
select userid, period_start, period_end,
(select count(distinct post_date)
from forumpost
where forumpost.post_date between period_start and period_end
and userid = forumpost.userid) distinct_dates
from user_specific_intervals
order by userid, period_start;
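The T-SQL port promised above might look like the following hedged, untested sketch; a VALUES list stands in for generate_series(0, 30, 5):
WITH first_posts AS (
    SELECT UserID, MIN(post_date) AS first_post_date
    FROM ForumPost
    GROUP BY UserID
),
user_specific_intervals AS (
    SELECT fp.UserID,
           DATEADD(DAY, n.n, fp.first_post_date) AS period_start,
           DATEADD(DAY, n.n + 4, fp.first_post_date) AS period_end
    FROM first_posts fp
    CROSS JOIN (VALUES (0), (5), (10), (15), (20), (25), (30)) AS n(n)
)
SELECT i.UserID, i.period_start, i.period_end,
       (SELECT COUNT(DISTINCT p.post_date)
        FROM ForumPost p
        WHERE p.post_date BETWEEN i.period_start AND i.period_end
          AND p.UserID = i.UserID) AS distinct_dates
FROM user_specific_intervals i
ORDER BY i.UserID, i.period_start;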