SQL return consecutive records - sql

A simple table:
ForumPost
--------------
ID (int PK)
UserID (int FK)
Date (datetime)
What I'm looking to return is how many times a particular user has made at least 1 post a day for n consecutive days.
Example:
User 15844 has posted at least 1 post a day for 30 consecutive days 10 times
I've tagged this question with linq/lambda as well, since a solution there would also be great. I know I can solve this by iterating over all of the user's records, but this is slow.

There is a handy trick using ROW_NUMBER() to find consecutive entries. Imagine the following set of dates, with their row numbers (starting at 0):
Date RowNumber
20130401 0
20130402 1
20130403 2
20130404 3
20130406 4
20130407 5
For consecutive entries, subtracting the row number (as a number of days) from the date gives the same result, e.g.
Date RowNumber date - row_number
20130401 0 20130401
20130402 1 20130401
20130403 2 20130401
20130404 3 20130401
20130406 4 20130402
20130407 5 20130402
You can then group by date - row_number to get the sets of consecutive days (i.e. the first 4 records, and the last 2 records).
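As a quick check of the grouping idea above, here is a minimal sketch that runs the same subtraction on the six sample dates. It uses Python's built-in sqlite3 module rather than SQL Server, so julianday() stands in for DATEADD, and the table name sample_dates is hypothetical. SQLite 3.25 or later is assumed for window functions.

```python
import sqlite3

# Hypothetical in-memory table holding the six sample dates from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_dates (d TEXT)")
conn.executemany(
    "INSERT INTO sample_dates VALUES (?)",
    [("2013-04-01",), ("2013-04-02",), ("2013-04-03",),
     ("2013-04-04",), ("2013-04-06",), ("2013-04-07",)],
)

# julianday(d) minus the row number is constant within a run of
# consecutive days, so grouping by it yields one row per run.
runs = conn.execute(
    """
    WITH numbered AS (
        SELECT d,
               julianday(d) - ROW_NUMBER() OVER (ORDER BY d) AS grp
        FROM sample_dates
    )
    SELECT MIN(d) AS first_day, MAX(d) AS last_day, COUNT(*) AS days
    FROM numbered
    GROUP BY grp
    ORDER BY first_day
    """
).fetchall()

for first_day, last_day, days in runs:
    print(first_day, last_day, days)
```

The two rows returned correspond to the run of 4 days and the run of 2 days described above.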
To apply this to your example you would use:
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, ConsecutiveDates = MAX(Days)
FROM Posts2
GROUP BY UserID;
EDIT
I don't think the above quite answered the question. This will give the number of times a user has posted on n or more consecutive days (here n = 30):
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
FirstDate = MIN(Date),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, [Times Over N Days] = COUNT(*)
FROM Posts2
WHERE Days >= 30
GROUP BY UserID;

Your particular application makes this pretty simple, I think. If you have 'n' distinct dates in an 'n'-day interval, those 'n' distinct dates must be consecutive.
Scroll to the bottom for a general solution that requires only common table expressions and changing to PostgreSQL. (Kidding. I implemented in PostgreSQL, because I'm short of time.)
create table ForumPost (
ID integer primary key,
UserID integer not null,
post_date date not null
);
insert into forumpost values
(1, 1, '2013-01-15'),
(2, 1, '2013-01-16'),
(3, 1, '2013-01-17'),
(4, 1, '2013-01-18'),
(5, 1, '2013-01-19'),
(6, 1, '2013-01-20'),
(7, 1, '2013-01-21'),
(11, 2, '2013-01-15'),
(12, 2, '2013-01-16'),
(13, 2, '2013-01-17'),
(16, 2, '2013-01-17'),
(14, 2, '2013-01-18'),
(15, 2, '2013-01-19'),
(21, 3, '2013-01-17'),
(22, 3, '2013-01-17'),
(23, 3, '2013-01-17'),
(24, 3, '2013-01-17'),
(25, 3, '2013-01-17'),
(26, 3, '2013-01-17'),
(27, 3, '2013-01-17');
Now, let's look at the output of this query. For brevity, I'm looking at 5-day intervals, not 30-day intervals.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid;
USERID DISTINCT_DATES
1 5
2 5
3 1
For users that fit the criteria, the number of distinct dates in that 5-day interval will have to be 5, right? So we just need to add that logic to a HAVING clause.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid
having count(distinct post_date) = 5;
USERID DISTINCT_DATES
1 5
2 5
A more general solution
It doesn't really make sense to say that, if you post every day from 2013-01-01 to 2013-01-31, you've posted 30 consecutive days 2 times. Instead, I'd expect the clock to start over on 2013-01-31. My apologies for implementing in PostgreSQL; I'll try to implement in T-SQL later.
with first_posts as (
select userid, min(post_date) first_post_date
from forumpost
group by userid
),
period_intervals as (
select userid, first_post_date period_start,
(first_post_date + interval '4' day)::date period_end
from first_posts
), user_specific_intervals as (
select
userid,
(period_start + (n || ' days')::interval)::date as period_start,
(period_end + (n || ' days')::interval)::date as period_end
from period_intervals, generate_series(0, 30, 5) n
)
select userid, period_start, period_end,
(select count(distinct post_date)
from forumpost
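where forumpost.post_date between period_start and period_end
-- qualify both sides: an unqualified userid would resolve to forumpost.userid
and forumpost.userid = user_specific_intervals.userid) distinct_dates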
from user_specific_intervals
order by userid, period_start;
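The "clock starts over" idea can also be sketched without fixed windows. This is not the query above: it swaps in the ROW_NUMBER() run grouping from the earlier answer and assumes that a run of d consecutive days contains d / n (integer division) non-overlapping n-day streaks. It is written against Python's built-in sqlite3 module (SQLite 3.25 or later assumed), the table and column names are hypothetical, and n is set to 3 to suit the sample data.

```python
import sqlite3

N = 3  # streak length in days; small on purpose to suit the sample data

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forum_post (user_id INTEGER, post_date TEXT)")

# Same posting pattern as the sample data above (IDs omitted).
rows = (
    [(1, f"2013-01-{day}") for day in range(15, 22)]
    + [(2, f"2013-01-{day}") for day in (15, 16, 17, 17, 18, 19)]
    + [(3, "2013-01-17")] * 7
)
conn.executemany("INSERT INTO forum_post VALUES (?, ?)", rows)

streaks = conn.execute(
    """
    WITH days AS (
        SELECT DISTINCT user_id, post_date FROM forum_post
    ),
    numbered AS (
        SELECT user_id,
               julianday(post_date)
                 - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY post_date) AS grp
        FROM days
    ),
    runs AS (
        SELECT user_id, COUNT(*) AS run_len
        FROM numbered
        GROUP BY user_id, grp
    )
    -- integer division: a 7-day run holds two non-overlapping 3-day streaks
    SELECT user_id, SUM(run_len / :n) AS streaks
    FROM runs
    GROUP BY user_id
    HAVING SUM(run_len / :n) > 0
    ORDER BY user_id
    """,
    {"n": N},
).fetchall()

print(streaks)
```

User 3 posts on a single day only, so no row is returned for that user.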

Related

How to get a date interval with condition

How to get a continuous date interval from rows fulfilling specific condition?
I have a table of employees states with 2 types of user_position.
An interval is continuous if, for the same user_id, the next higher date_position is the following day and user_position has not changed. A user cannot have different user positions on the same day.
I have a feeling it requires several CASE expressions, window functions and tsrange, but I can't quite get the right result.
I would be really grateful if you could help me.
Fiddle:
http://sqlfiddle.com/#!17/ba641/1/0
The result should look like this:
user_id  user_position  position_start  position_end
1        1              01.01.2019      02.01.2019
1        2              03.01.2019      04.01.2019
1        1              05.01.2019      06.01.2019
2        1              01.01.2019      03.01.2019
2        2              04.01.2019      05.01.2019
2        2              08.01.2019      08.01.2019
2        2              10.01.2019      10.01.2019
Create/insert query for the source data:
CREATE TABLE IF NOT EXISTS users_position
( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id integer,
user_position integer,
date_position date);
INSERT INTO users_position (user_id,
user_position,
date_position)
VALUES
(1, 1, '2019-01-01'),
(1, 1, '2019-01-02'),
(1, 2, '2019-01-03'),
(1, 2, '2019-01-04'),
(1, 1, '2019-01-05'),
(1, 1, '2019-01-06'),
(2, 1, '2019-01-01'),
(2, 1, '2019-01-02'),
(2, 1, '2019-01-03'),
(2, 2, '2019-01-04'),
(2, 2, '2019-01-05'),
(2, 2, '2019-01-08'),
(2, 2, '2019-01-10');
SELECT user_id, user_position
, min(date_position) AS position_start
, max(date_position) AS position_end
FROM (
SELECT user_id, user_position,date_position
, count(*) FILTER (WHERE (date_position = last_date + 1
AND user_position = last_pos) IS NOT TRUE)
OVER (PARTITION BY user_id ORDER BY date_position) AS interval
FROM (
SELECT user_id, user_position, date_position
, lag(date_position) OVER w AS last_date
, lag(user_position) OVER w AS last_pos
FROM users_position
WINDOW w AS (PARTITION BY user_id ORDER BY date_position)
) sub1
) sub2
GROUP BY user_id, user_position, interval
ORDER BY user_id, interval;
Basically, this forms intervals by counting the number of disruptions in continuity. Whenever the "next" row per user_id is not what's expected, a new interval starts.
The WINDOW clause lets you specify a window definition once and use it repeatedly; it has no effect on performance.
last_date + 1 works while last_date is type date. See:
Is there a way to do date arithmetic on values of type DATE without result being of type TIMESTAMP?
Related:
Get start and end date time based on based on sequence of rows
Select longest continuous sequence
About the aggregate FILTER:
Aggregate columns with additional (distinct) filters
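For readers without PostgreSQL at hand, the disruption-counting idea can be sketched in a more portable form. This is a variant, not the query above: it replaces the aggregate FILTER with a running SUM over a CASE flag and inlines the window definition. It runs on Python's built-in sqlite3 module (SQLite 3.25 or later assumed), where julianday() is used for the date arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_position "
    "(user_id INTEGER, user_position INTEGER, date_position TEXT)"
)
conn.executemany(
    "INSERT INTO users_position VALUES (?, ?, ?)",
    [
        (1, 1, "2019-01-01"), (1, 1, "2019-01-02"), (1, 2, "2019-01-03"),
        (1, 2, "2019-01-04"), (1, 1, "2019-01-05"), (1, 1, "2019-01-06"),
        (2, 1, "2019-01-01"), (2, 1, "2019-01-02"), (2, 1, "2019-01-03"),
        (2, 2, "2019-01-04"), (2, 2, "2019-01-05"), (2, 2, "2019-01-08"),
        (2, 2, "2019-01-10"),
    ],
)

intervals = conn.execute(
    """
    WITH flagged AS (
        -- is_start = 1 whenever the row does not continue the previous one
        SELECT user_id, user_position, date_position,
               CASE WHEN julianday(date_position)
                         - julianday(LAG(date_position) OVER
                               (PARTITION BY user_id ORDER BY date_position)) = 1
                     AND user_position = LAG(user_position) OVER
                               (PARTITION BY user_id ORDER BY date_position)
                    THEN 0 ELSE 1 END AS is_start
        FROM users_position
    ),
    numbered AS (
        -- the running total of starts numbers the intervals per user
        SELECT *,
               SUM(is_start) OVER (PARTITION BY user_id ORDER BY date_position) AS island
        FROM flagged
    )
    SELECT user_id, user_position,
           MIN(date_position) AS position_start,
           MAX(date_position) AS position_end
    FROM numbered
    GROUP BY user_id, user_position, island
    ORDER BY user_id, island
    """
).fetchall()

for row in intervals:
    print(row)
```

The first row of each user has no predecessor, so the comparison is NULL and the CASE falls through to 1, which opens the first interval.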

Rolling Daily Distinct Counts Partition by Month

I have the following table:
CREATE TABLE tbl (
id int NOT NULL
, date date NOT NULL
, cid int NOT NULL
);
INSERT INTO tbl VALUES
(1 , '2022-01-01', 1)
, (2 , '2022-01-01', 1)
, (3 , '2022-01-01', 2)
, (4 , '2022-01-01', 3)
, (5 , '2022-01-02', 1)
, (6 , '2022-01-02', 4)
, (7 , '2022-01-03', 5)
, (8 , '2022-01-03', 6)
, (9 , '2022-02-01', 1)
, (10, '2022-02-01', 5)
, (11, '2022-02-02', 5)
, (12, '2022-02-02', 3)
;
I'm trying to count distinct users (= cid) each day, but the result is rolling during the month. E.g., for 2022-01-01, only distinct users with date = 2022-01-01 are counted. For 2022-01-02, distinct users with date between 2022-01-01 and 2022-01-02 are counted, and so on. The count should restart each month.
My desired output:
date distinct_cids
2022-01-01 3
2022-01-02 4
2022-01-03 6
2022-02-01 2
2022-02-02 3
I don't have access to Snowflake, so I can't guarantee that this will work, but from the sound of it:
select date, count(distinct cid) over (partition by month(date) order by date)
from tbl
order by date;
If you have several years worth of data, you can partition by year, month:
select date, count(distinct cid) over (partition by year(date), month(date) order by date)
from tbl
order by date;
Date is a reserved word, so you may want to consider renaming your column.
EDIT: Since DISTINCT is not allowed in the window function, you can try a vanilla SQL variant. It is likely to be slow for a large table:
select dt, count(cid)
from (
select distinct dt.dt, x.cid
from tbl x
join (
select distinct date as dt from tbl
) dt (dt)
on x.date <= dt.dt
and month(x.date) = month(dt.dt)
) t
group by dt
order by dt
;
The idea is that we create a new relation (t) with distinct users with a date less than or equal to the current date in the current month. Then we can just count those users for each date.
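The self-join idea can be checked against the sample data with a small sketch. It is a simplified variant rather than the exact query above: it counts DISTINCT cid directly after the join and matches on year and month together. It runs on Python's built-in sqlite3 module instead of Snowflake, and the column name event_date is a hypothetical rename of the date column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, event_date TEXT, cid INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, "2022-01-01", 1), (2, "2022-01-01", 1), (3, "2022-01-01", 2),
        (4, "2022-01-01", 3), (5, "2022-01-02", 1), (6, "2022-01-02", 4),
        (7, "2022-01-03", 5), (8, "2022-01-03", 6), (9, "2022-02-01", 1),
        (10, "2022-02-01", 5), (11, "2022-02-02", 5), (12, "2022-02-02", 3),
    ],
)

rolling = conn.execute(
    """
    SELECT d.dt, COUNT(DISTINCT x.cid) AS distinct_cids
    FROM (SELECT DISTINCT event_date AS dt FROM tbl) AS d
    JOIN tbl AS x
      ON x.event_date <= d.dt                          -- everything up to this day
     AND strftime('%Y-%m', x.event_date)
         = strftime('%Y-%m', d.dt)                     -- within the same year and month
    GROUP BY d.dt
    ORDER BY d.dt
    """
).fetchall()

for dt, distinct_cids in rolling:
    print(dt, distinct_cids)
```

The counts restart in February because rows from January no longer satisfy the year and month condition.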

How to date add if end date is 1 day prior to the next rows start date

I'm trying to total the days of both rows only if the end date of the first row is the day before the start date of the next row. If the end date of the first row is not one day before the second row's start date, then I would like to exclude both of those rows from the query. So with the example below I should arrive at a sum of 365. My live table has thousands of rows with different names and orderIDs, and I need to perform this task while keeping the integrity of the orderID for each individual.
name       orderID  Start date  end date
Joe Smith  1        2020-01-01  2020-09-30
Joe Smith  2        2020-10-01  2020-12-30
If you want groups of more than one row that meet your condition, then this is a type of gaps-and-islands problem.
What you want to do is assign an "island" number to the rows. You can do so by peeking at the previous row to see if it meets your condition. If it does not, then an island starts. A cumulative sum of the island starts assigns an island number to the groups.
The rest is aggregation:
select name, min(start_date), max(end_date),
datediff(day, min(start_date), max(end_date)) as num_days
from (select t.*,
-- an island starts (1) unless the previous end date is the day before this start date
sum(case when prev_end_date = dateadd(day, -1, start_date) then 0 else 1 end) over
(partition by name order by start_date) as island
from (select t.*,
lag(end_date) over (partition by name order by start_date) as prev_end_date
from t
) t
) t
group by name, island
having count(*) > 1;
I based this on @Gordon-Linoff's answer and his clue about gaps and islands, but I was getting incorrect sums with my test data, as also mentioned in a comment. I used this post as well:
https://bertwagner.com/posts/gaps-and-islands/
-- test data
DECLARE @t TABLE (name varchar(50), orderID int, StartDate DateTime, EndDate DateTime);
INSERT INTO @t
SELECT 'Joe Smith', 1, '2020-01-01', '2020-09-30' UNION
SELECT 'Joe Smith', 2, '2020-10-01', '2020-12-30' UNION
SELECT 'Joe Smith', 3, '2021-01-01', '2021-09-30' UNION
SELECT 'Joe Smith', 4, '2021-10-01', '2021-12-31' UNION
SELECT 'Joe Smith', 5, '2022-01-01', '2022-09-30' UNION
SELECT 'Jane Doe', 6, '2020-01-01', '2020-09-30' UNION
SELECT 'Jane Doe', 7, '2020-11-01', '2020-12-30';
-- calculate the difference; add 1 because EndDate is inclusive (the period ends at the start of the next day)
SELECT t.*, d.IslandStartDate, d.IslandEndDate, DATEDIFF(DAY, IslandStartDate, IslandEndDate) + 1 AS Days FROM (
-- return the minimum and maximum start and end dates
SELECT
name,
MIN(StartDate) AS IslandStartDate,
MAX(EndDate) AS IslandEndDate
FROM (
SELECT
*,
-- indicates when a new island begins by looking if the current row's StartDate occurs after the previous row's EndDate
CASE WHEN Groups.PreviousEndDate >= DATEADD(DAY, -1, StartDate) THEN 0 ELSE 1 END AS IslandStartInd,
-- indicates which island number the current row belongs to
SUM(CASE WHEN Groups.PreviousEndDate >= DATEADD(DAY, -1, StartDate) THEN 0 ELSE 1 END) OVER (PARTITION BY name ORDER BY Groups.RN) AS IslandId
FROM
(
-- create a row number column based on the sequence of start and end dates, as well as bring the previous row's EndDate to the current row
SELECT
name,
orderID,
ROW_NUMBER() OVER(PARTITION BY name ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (PARTITION BY name ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
@t
) Groups
) Islands
GROUP BY
name, IslandId
) d
-- join to get the orderID back
INNER JOIN @t t ON d.name = t.name AND t.StartDate >= d.IslandStartDate AND t.EndDate <= d.IslandEndDate
ORDER BY IslandStartDate, name
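As a cross-check of the island logic discussed in this thread, here is a minimal sketch that runs it on the test data above. It is an illustration under assumptions, not either answer's exact query: it uses Python's built-in sqlite3 module (SQLite 3.25 or later assumed) with julianday() for date arithmetic, the table name orders and the snake_case column names are hypothetical, and it counts days inclusively (difference plus 1) so that the first island matches the expected 365.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (name TEXT, order_id INTEGER, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("Joe Smith", 1, "2020-01-01", "2020-09-30"),
        ("Joe Smith", 2, "2020-10-01", "2020-12-30"),
        ("Joe Smith", 3, "2021-01-01", "2021-09-30"),
        ("Joe Smith", 4, "2021-10-01", "2021-12-31"),
        ("Joe Smith", 5, "2022-01-01", "2022-09-30"),
        ("Jane Doe", 6, "2020-01-01", "2020-09-30"),
        ("Jane Doe", 7, "2020-11-01", "2020-12-30"),
    ],
)

islands = conn.execute(
    """
    WITH flagged AS (
        -- is_start = 0 only when this row begins the day after the previous row ended
        SELECT name, start_date, end_date,
               CASE WHEN julianday(start_date)
                         - julianday(LAG(end_date) OVER
                               (PARTITION BY name ORDER BY start_date)) = 1
                    THEN 0 ELSE 1 END AS is_start
        FROM orders
    ),
    numbered AS (
        SELECT *,
               SUM(is_start) OVER (PARTITION BY name ORDER BY start_date) AS island
        FROM flagged
    )
    SELECT name,
           MIN(start_date) AS island_start,
           MAX(end_date)   AS island_end,
           CAST(julianday(MAX(end_date)) - julianday(MIN(start_date)) AS INTEGER) + 1
               AS num_days
    FROM numbered
    GROUP BY name, island
    HAVING COUNT(*) > 1        -- keep only islands made of several rows
    ORDER BY name, island
    """
).fetchall()

for row in islands:
    print(row)
```

Jane Doe's two orders are more than a day apart, so each forms a single-row island and is dropped by the HAVING clause.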

Calculate inactive customers from single table

I have a table with the fields Customer No., Posting date and Order_ID. I want to find the total number of inactive customers for the last 12 months on a monthly basis, meaning customers who placed an order more than 12 months ago and then became inactive. I want to calculate this for every month to understand how the number of inactive customers is growing month by month.
If I run the query in July, it should go back 365 days from the previous month end and give the total number of inactive customers. I want to do this month by month.
I am still learning, so please help.
Thanks in advance for your time.
to get the customers
SELECT DISTINCT a.CustomerNo
FROM YourTable a
WHERE NOT EXISTS
(SELECT 0 FROM YourTable b WHere a.CustomerNo = b.CustomerNo
and b.PostingDate >
dateadd(day,-365 -datepart(day,getdate()),getdate())
)
to get a count
SELECT count(DISTINCT a.CustomerNo) as InnactiveCount
FROM YourTable a
WHERE NOT EXISTS
(SELECT 0 FROM YourTable b WHere a.CustomerNo = b.CustomerNo
and b.PostingDate >
dateadd(day,-365 -datepart(day,getdate()),getdate())
)
generate a 'months' table by CTE, then look for inactive in those months
;WITH month_gen as (SELECT dateadd(day,-0 -datepart(day,getdate()),getdate()) eom, 1 as x
UNION ALL
SELECT dateadd(day,-datepart(day,eom),eom) eom, x + 1 x FROM month_gen where x < 12
)
SELECT DISTINCT CONVERT(varchar(7), month_gen.eom, 102), count(0) innactiveCount FROM YourTable a
cross join month_gen
WHERE NOT EXISTS(SELECT 0 FROM YourTable b WHere a.CustomerNo = b.CustomerNo and
YEAR(b.PostingDate) = YEAR(eom) and
MONTH(b.PostingDate) = MONTH(eom)
)
GROUP BY CONVERT(varchar(7), month_gen.eom, 102)
if that gets you anywhere, maybe a final step is to filter out anything getting 'counted' before it was ever active i.e. don't count 'new' customers before they became active
Try the query below. To achieve your goal you need a calendar table (which I define with a CTE). The query counts inactive customers as of the first day of each month:
declare @tbl table (custNumber int, postDate date, orderId int);
insert into @tbl values
(1, '2017-01-01', 123),
(2, '2017-02-01', 124),
(3, '2017-02-01', 125),
(1, '2018-02-02', 126),
(2, '2018-05-01', 127),
(3, '2018-06-01', 128)
;with cte as (
select cast('2018-01-01' as date) dt
union all
select dateadd(month, 1, dt) from cte
where dt < '2018-12-01'
)
select dt, sum(case when t2.custNumber is null then 1 else 0 end)
from cte c
left join @tbl t1 on dateadd(year, -1, c.dt) >= t1.postDate
left join @tbl t2 on t2.postDate > dateadd(year, -1, c.dt) and t2.postDate <= c.dt and t1.custNumber = t2.custNumber
group by dt
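The calendar-table approach can be tried on the sample rows with a minimal sketch. It is a variant under assumptions rather than the query above: it counts distinct customers with a correlated NOT EXISTS instead of two left joins, treats a customer as inactive on a month start when they ordered at least a year earlier and have no order in the year leading up to that date, and runs on Python's built-in sqlite3 module. The table name orders and its column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (cust_number INTEGER, post_date TEXT, order_id INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2017-01-01", 123), (2, "2017-02-01", 124), (3, "2017-02-01", 125),
        (1, "2018-02-02", 126), (2, "2018-05-01", 127), (3, "2018-06-01", 128),
    ],
)

inactive_by_month = conn.execute(
    """
    WITH RECURSIVE months(dt) AS (
        -- calendar of month starts for 2018
        SELECT '2018-01-01'
        UNION ALL
        SELECT date(dt, '+1 month') FROM months WHERE dt < '2018-12-01'
    )
    SELECT m.dt,
           (SELECT COUNT(DISTINCT o.cust_number)
            FROM orders o
            WHERE o.post_date <= date(m.dt, '-1 year')      -- ordered a year or more ago
              AND NOT EXISTS (                              -- and nothing in the last year
                    SELECT 1
                    FROM orders r
                    WHERE r.cust_number = o.cust_number
                      AND r.post_date > date(m.dt, '-1 year')
                      AND r.post_date <= m.dt)
           ) AS inactive
    FROM months m
    ORDER BY m.dt
    """
).fetchall()

for dt, inactive in inactive_by_month:
    print(dt, inactive)
```

Customers who have not yet placed any order by the cutoff are never counted, which addresses the concern about counting new customers before they became active.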

SQL server query to find values grouped by one column but different in at least one of other columns

Please pardon the title of my question -
I have a table
TRXN (ID,ACCT_NUM,TRAN_MEMO,AMOUNT,DATE,LRN)
I want to write a query to pull records which have the same LRN but where at least one of the other columns has a different value. Is it possible?
In my answer I assume that ID is unique, so I exclude it from the comparison.
Table created:
CREATE TABLE #TRXN (ID INT IDENTITY(1, 1)
,ACCT_NUM INT
,TRAN_MEMO INT
,AMOUNT INT
,[DATE] DATE
,LRN INT
)
Sample data inserted
INSERT INTO #TRXN VALUES (1, 2, 2, '1 jan 2000', 2)
,(2, 2, 2, '2 jan 2000', 2)
,(1, 2, 2, '1 jan 2000', 2)
,(1, 2, 2, '1 jan 2000', 3)
Rows that have the same LRN but a different value in at least one of the other columns:
;WITH C AS(
SELECT ROW_NUMBER() OVER (PARTITION BY ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN ORDER BY ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN) AS Rn
,ID, ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN
FROM #TRXN WHERE LRN IN(
SELECT LRN FROM #TRXN GROUP BY LRN HAVING COUNT(ID) > 1)
)
SELECT ID, ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN
FROM C WHERE Rn = 1
Output:
ID ACCT_NUM TRAN_MEMO AMOUNT DATE LRN
---------------------------------------------
1 1 2 2 2000-01-01 2
2 2 2 2 2000-01-02 2
Why not simply use GROUP BY:
SELECT COUNT(1) AS numberOfGroupedRows,ID,ACCT_NUM,TRAN_MEMO,AMOUNT,DATE,LRN
FROM TRXN GROUP BY ID,ACCT_NUM,TRAN_MEMO,AMOUNT,DATE,LRN
since GROUP BY collapses all identical rows into one row.
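Another way to read the question is "return every row whose LRN appears with more than one distinct combination of the other columns". The sketch below illustrates that reading on the sample rows from this thread. It is an alternative, not either answer's query, and it makes several assumptions: ID is ignored in the comparison, the DATE column is renamed to the hypothetical trxn_date, and it runs on Python's built-in sqlite3 module rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trxn ("
    "id INTEGER PRIMARY KEY, acct_num INTEGER, tran_memo INTEGER, "
    "amount INTEGER, trxn_date TEXT, lrn INTEGER)"
)
# Same four sample rows as above; id is assigned automatically (1 to 4).
conn.executemany(
    "INSERT INTO trxn (acct_num, tran_memo, amount, trxn_date, lrn) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 2, 2, "2000-01-01", 2),
        (2, 2, 2, "2000-01-02", 2),
        (1, 2, 2, "2000-01-01", 2),
        (1, 2, 2, "2000-01-01", 3),
    ],
)

ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM trxn
        WHERE lrn IN (
            -- LRNs that occur with at least two different column combinations
            SELECT lrn
            FROM (SELECT DISTINCT acct_num, tran_memo, amount, trxn_date, lrn
                  FROM trxn)
            GROUP BY lrn
            HAVING COUNT(*) > 1
        )
        ORDER BY id
        """
    )
]

print(ids)
```

Unlike the ROW_NUMBER() answer, this keeps exact duplicates (rows 1 and 3 here) instead of collapsing them, so pick whichever behaviour matches the requirement.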