How to get a date interval with condition - sql

How to get a continuous date interval from rows fulfilling specific condition?
I have a table of employee states with 2 types of user_position.
An interval is continuous if the row with the next higher date_position for the same user_id falls on the following day and has an unchanged user_position. A user cannot have different user positions on one day.
I have a feeling it requires several CASE expressions, window functions and tsrange, but I can't quite get the right result.
I would be really grateful if you could help me.
Fiddle:
http://sqlfiddle.com/#!17/ba641/1/0
The result should look like this:
user_id  user_position  position_start  position_end
----------------------------------------------------
1        1              01.01.2019      02.01.2019
1        2              03.01.2019      04.01.2019
1        1              05.01.2019      06.01.2019
2        1              01.01.2019      03.01.2019
2        2              04.01.2019      05.01.2019
2        2              08.01.2019      08.01.2019
2        2              10.01.2019      10.01.2019
Create/insert query for the source data:
CREATE TABLE IF NOT EXISTS users_position
( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id integer,
user_position integer,
date_position date);
INSERT INTO users_position (user_id,
user_position,
date_position)
VALUES
(1, 1, '2019-01-01'),
(1, 1, '2019-01-02'),
(1, 2, '2019-01-03'),
(1, 2, '2019-01-04'),
(1, 1, '2019-01-05'),
(1, 1, '2019-01-06'),
(2, 1, '2019-01-01'),
(2, 1, '2019-01-02'),
(2, 1, '2019-01-03'),
(2, 2, '2019-01-04'),
(2, 2, '2019-01-05'),
(2, 2, '2019-01-08'),
(2, 2, '2019-01-10');

SELECT user_id, user_position
, min(date_position) AS position_start
, max(date_position) AS position_end
FROM (
SELECT user_id, user_position,date_position
, count(*) FILTER (WHERE (date_position = last_date + 1
AND user_position = last_pos) IS NOT TRUE)
OVER (PARTITION BY user_id ORDER BY date_position) AS interval
FROM (
SELECT user_id, user_position, date_position
, lag(date_position) OVER w AS last_date
, lag(user_position) OVER w AS last_pos
FROM users_position
WINDOW w AS (PARTITION BY user_id ORDER BY date_position)
) sub1
) sub2
GROUP BY user_id, user_position, interval
ORDER BY user_id, interval;
Basically, this forms intervals by counting the number of disruptions in continuity. Whenever the "next" row per user_id is not what's expected, a new interval starts.
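To make the disruption-counting idea concrete, here is a minimal Python sketch of the same logic outside the database. It is not part of the original answer; the function name continuous_intervals and the tuple layout are assumptions chosen for illustration. A new interval starts whenever the previous row for the same user is not "same position, exactly one day earlier".

```python
from datetime import date, timedelta

def continuous_intervals(rows):
    """rows: iterable of (user_id, user_position, date_position) tuples."""
    intervals = []
    prev = None
    # Same ordering as the window: per user, by date.
    for user_id, position, day in sorted(rows, key=lambda r: (r[0], r[2])):
        continues = (
            prev is not None
            and prev[0] == user_id
            and prev[1] == position
            and day == prev[2] + timedelta(days=1)
        )
        if continues:
            intervals[-1][3] = day          # extend the current interval
        else:
            intervals.append([user_id, position, day, day])  # disruption: new interval
        prev = (user_id, position, day)
    return [tuple(i) for i in intervals]

rows = [
    (1, 1, date(2019, 1, 1)), (1, 1, date(2019, 1, 2)),
    (1, 2, date(2019, 1, 3)), (1, 2, date(2019, 1, 4)),
    (1, 1, date(2019, 1, 5)), (1, 1, date(2019, 1, 6)),
    (2, 1, date(2019, 1, 1)), (2, 1, date(2019, 1, 2)),
    (2, 1, date(2019, 1, 3)), (2, 2, date(2019, 1, 4)),
    (2, 2, date(2019, 1, 5)), (2, 2, date(2019, 1, 8)),
    (2, 2, date(2019, 1, 10)),
]
result = continuous_intervals(rows)
for interval in result:
    print(interval)
```

With the sample rows from the question this yields the seven intervals listed in the expected result.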
The WINDOW clause lets you specify a window definition once and use it repeatedly; it has no effect on performance.
last_date + 1 works as long as last_date is of type date. See:
Is there a way to do date arithmetic on values of type DATE without result being of type TIMESTAMP?
Related:
Get start and end date time based on based on sequence of rows
Select longest continuous sequence
About the aggregate FILTER:
Aggregate columns with additional (distinct) filters

Related

Sum and Running Sum, Distinct and Running Distinct

I want to calculate sum, running sum, distinct, running distinct - preferably all in one query.
http://sqlfiddle.com/#!18/65eff/1
create table test (store int, day varchar(10), food varchar(10), quantity int)
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 2
insert into test select 101, '2021-01-01', 'fruit', 2
insert into test select 101, '2021-01-01', 'water', 3
insert into test select 101, '2021-01-01', 'fruit', 1
insert into test select 101, '2021-01-01', 'salt', 2
insert into test select 101, '2021-01-02', 'rice', 1
insert into test select 101, '2021-01-02', 'rice', 2
insert into test select 101, '2021-01-02', 'fruit', 1
insert into test select 101, '2021-01-02', 'pepper', 4
Uniques (distinct) & Total (sum) are simple:
select store, day, count(distinct food) as uniques, sum(quantity) as total
from test
group by store, day
But I want output to be :
store  day         uniques  run_uniques  total  run_total
---------------------------------------------------------
101    2021-01-01  4        4            12     12
101    2021-01-02  3        5            10     22
I tried a self-join with t.day >= prev.day to get cumulative/running data, but it's causing double-counting.
First off: always store data in the correct data type, day should be a date column.
Calculating a running sum of the sum(quantity) aggregate is quite simple: you just nest it inside a window function, SUM(SUM(...)) OVER (...).
Calculating the running number of unique food per store is more complicated, because you want the rolling number of unique items before grouping, and there is no COUNT(DISTINCT ...) window function in SQL Server (which is what I'm using).
So I've gone with calculating a row_number() for each store and food across all days, then we just sum up the number of times we get 1, i.e. the first time we've seen this food.
SELECT
t.store,
t.day,
uniques = COUNT(DISTINCT t.food),
run_uniques = SUM(SUM(CASE WHEN t.rn = 1 THEN 1 ELSE 0 END))
OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING),
total = SUM(t.quantity),
run_total = SUM(SUM(t.quantity))
OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY store, food ORDER BY day) rn
FROM test
) t
GROUP BY t.store, t.day;
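The rn = 1 trick amounts to counting each food only the first time it appears for a store. As a rough illustration (not from the original answer; the function name daily_stats and the per-store set standing in for ROW_NUMBER() are assumptions), the same four figures can be computed in Python like this:

```python
from itertools import groupby

def daily_stats(rows):
    """rows: (store, day, food, quantity); day is an ISO date string."""
    seen = {}           # store -> set of foods seen so far (replaces rn = 1)
    running_total = {}  # store -> cumulative quantity
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for (store, day), group in groupby(ordered, key=lambda r: (r[0], r[1])):
        group = list(group)
        foods = {food for _, _, food, _ in group}
        total = sum(qty for _, _, _, qty in group)
        seen.setdefault(store, set()).update(foods)
        running_total[store] = running_total.get(store, 0) + total
        out.append((store, day, len(foods), len(seen[store]),
                    total, running_total[store]))
    return out

rows = [
    (101, '2021-01-01', 'rice', 1), (101, '2021-01-01', 'rice', 1),
    (101, '2021-01-01', 'rice', 2), (101, '2021-01-01', 'fruit', 2),
    (101, '2021-01-01', 'water', 3), (101, '2021-01-01', 'fruit', 1),
    (101, '2021-01-01', 'salt', 2), (101, '2021-01-02', 'rice', 1),
    (101, '2021-01-02', 'rice', 2), (101, '2021-01-02', 'fruit', 1),
    (101, '2021-01-02', 'pepper', 4),
]
stats = daily_stats(rows)
for row in stats:
    print(row)
```

Note that with the insert statements exactly as listed in the question, the second day's quantities add up to 8 (running total 20).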

Retrieve IDs with a minimum time gap between consecutive rows

I have the following event table in Postgres 9.3:
CREATE TABLE event (
event_id integer PRIMARY KEY,
user_id integer,
event_type varchar,
event_time timestamptz
);
My goal is to retrieve all user_id's with a gap of at least 30 days between any of their events (or between their last event and the current time). An additional complication is that I only want the users who have one of these gaps occur at a later time than them performing a certain event_type 'convert'. How can this be done easily?
Some example data in the event table might look like:
INSERT INTO event (event_id, user_id, event_type, event_time)
VALUES
(10, 1, 'signIn', '2015-05-05 00:11'),
(11, 1, 'browse', '2015-05-05 00:12'), -- no 'convert' event
(20, 2, 'signIn', '2015-06-07 02:35'),
(21, 2, 'browse', '2015-06-07 02:35'),
(22, 2, 'convert', '2015-06-07 02:36'), -- only 'convert' event
(23, 2, 'signIn', '2015-08-10 11:00'), -- gap of >= 30 days
(24, 2, 'signIn', '2015-08-11 11:00'),
(30, 3, 'convert', '2015-08-07 02:36'), -- starting with 1st 'convert' event
(31, 3, 'signIn', '2015-08-07 02:36'),
(32, 3, 'convert', '2015-08-08 02:36'),
(33, 3, 'signIn', '2015-08-12 11:00'), -- all gaps below 30 days
(34, 3, 'browse', '2015-08-12 11:00'), -- gap until today (2015-08-20) too small
(40, 4, 'convert', '2015-05-07 02:36'),
(41, 4, 'signIn', '2015-05-12 11:00'); -- gap until today (2015-08-20) >= 30 days
Expected result:
user_id
--------
2
4
One way to do it:
SELECT user_id
FROM (
SELECT user_id
, lead(e.event_time, 1, now()) OVER (PARTITION BY e.user_id ORDER BY e.event_time)
- event_time AS gap
FROM ( -- only users with 'convert' event
SELECT user_id, min(event_time) AS first_time
FROM event
WHERE event_type = 'convert'
GROUP BY 1
) e1
JOIN event e USING (user_id)
WHERE e.event_time >= e1.first_time
) sub
WHERE gap >= interval '30 days'
GROUP BY 1;
The window function lead() accepts a default value for when there is no "next row", which is convenient to cover your additional requirement "or between their last event and the current time".
Indexes
You should at least have an index on (user_id, event_time) if your table is big:
CREATE INDEX event_user_time_idx ON event(user_id, event_time);
If you do that often and the event_type 'convert' is rare, add another partial index:
CREATE INDEX event_user_time_convert_idx ON event(user_id, event_time)
WHERE event_type = 'convert';
For many events per user
And only if gaps of 30 days are common (not a rare case).
Indexes become even more important.
Try this recursive CTE for better performance:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT DISTINCT ON (user_id)
user_id, event_time, interval '0 days' AS gap
FROM event
WHERE event_type = 'convert'
ORDER BY user_id, event_time
)
UNION ALL
SELECT c.user_id, e.event_time, COALESCE(e.event_time, now()) - c.event_time
FROM cte c
LEFT JOIN LATERAL (
SELECT e.event_time
FROM event e
WHERE e.user_id = c.user_id
AND e.event_time > c.event_time
ORDER BY e.event_time
LIMIT 1 -- the next later event
) e ON true -- add 1 row after last to consider gap till "now"
WHERE c.event_time IS NOT NULL
AND c.gap < interval '30 days'
)
SELECT * FROM cte
WHERE gap >= interval '30 days';
It has considerably more overhead, but can stop, per user, at the first gap that's big enough. If that is the gap between the last event and now, then event_time in the result is NULL.
Detailed explanation in these related answers:
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
This is another way, probably not as neat as @Erwin's, but all the steps are separated so it is easy to adapt.
include_today: adds a dummy event to indicate the current date.
event_convert: calculates the first time the 'convert' event appears for each user_id (in this case only user_id = 2222).
event_row: assigns a unique consecutive id to each event, starting from 1 for each user_id.
The last part joins everything together using rnum = rnum + 1 so the date difference can be calculated.
The result also shows both events involved in the 30-day range, so you can see whether that is the result you want.
WITH include_today as (
(SELECT 'xxxx' event_id, user_id, 'today' event_type, current_date as event_time
FROM users)
UNION
(SELECT *
FROM event)
),
event_convert as (
SELECT user_id, MIN(event_time) min_time
FROM event
WHERE event_type = 'convert'
GROUP BY user_id
),
event_row as (
SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY event_time desc) as rnum
FROM
include_today
)
SELECT
A.user_id,
A.event_id eventA,
A.event_type typeA,
A.event_time timeA,
B.event_id eventB,
B.event_type typeB,
B.event_time timeB,
(B.event_time - A.event_time) days
FROM
event_convert e
Inner Join event_row A
ON e.user_id = A.user_id AND e.min_time <= A.event_time
Inner Join event_row B
ON A.rnum = B.rnum + 1
AND A.user_id = B.user_id
WHERE
(B.event_time - A.event_time) > interval '30 days'
ORDER BY 1,4

SQL server query to find values grouped by one column but different in at least one of other columns

Please pardon the title of my question -
I have a table
TRXN (ID,ACCT_NUM,TRAN_MEMO,AMOUNT,DATE,LRN)
I want to write a query to pull records which have the same LRN but where at least one of the other columns has a different value. Is it possible?
In my answer I assume ID is unique and exclude it from the comparison.
Table created:
CREATE TABLE #TRXN (ID INT IDENTITY(1, 1)
,ACCT_NUM INT
,TRAN_MEMO INT
,AMOUNT INT
,[DATE] DATE
,LRN INT
)
Sample data inserted
INSERT INTO #TRXN VALUES (1, 2, 2, '1 jan 2000', 2)
,(2, 2, 2, '2 jan 2000', 2)
,(1, 2, 2, '1 jan 2000', 2)
,(1, 2, 2, '1 jan 2000', 3)
Rows that have the same LRN but differ in at least one of the other columns:
;WITH C AS(
SELECT ROW_NUMBER() OVER (PARTITION BY ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN ORDER BY ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN) AS Rn
,ID, ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN
FROM #TRXN WHERE LRN IN(
SELECT LRN FROM #TRXN GROUP BY LRN HAVING COUNT(ID) > 1)
)
SELECT ID, ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN
FROM C WHERE Rn = 1
Output:
ID ACCT_NUM TRAN_MEMO AMOUNT DATE LRN
---------------------------------------------
1 1 2 2 2000-01-01 2
2 2 2 2 2000-01-02 2
Why not simply use GROUP BY:
SELECT COUNT(1) AS numberOfGroupedRows,ID,ACCT_NUM,TRAN_MEMO,AMOUNT,DATE,LRN
FROM TRXN GROUP BY ID,ACCT_NUM,TRAN_MEMO,AMOUNT,DATE,LRN
GROUP BY collapses all identical rows into a single row.

Get average of last 7 days

I'm attacking a problem where I have a value for a range of dates. I would like to consolidate the rows in my table by averaging them and reassigning the date column to be relative to the last 7 days. My SQL experience is lacking and I could use some help. Thanks for giving this a look!!
E.g.
7 rows with dates and values.
UniqueId Date Value
........ .... .....
a 2014-03-20 2
a 2014-03-21 2
a 2014-03-22 3
a 2014-03-23 5
a 2014-03-24 1
a 2014-03-25 0
a 2014-03-26 1
Resulting row
UniqueId Date AvgValue
........ .... ........
a 2014-03-26 2
First off, I am not even sure this is possible. I am trying to attack a problem with the data at hand. I thought maybe of using a window frame with a partition to roll the dates into one date with the averaged result, but I am not exactly sure how to say that in SQL.
I am using the following as sample data:
CREATE TABLE some_data1 (unique_id text, date date, value integer);
INSERT INTO some_data1 (unique_id, date, value) VALUES
( 'a', '2014-03-20', 2),
( 'a', '2014-03-21', 2),
( 'a', '2014-03-22', 3),
( 'a', '2014-03-23', 5),
( 'a', '2014-03-24', 1),
( 'a', '2014-03-25', 0),
( 'a', '2014-03-26', 1),
( 'b', '2014-03-01', 1),
( 'b', '2014-03-02', 1),
( 'b', '2014-03-03', 1),
( 'b', '2014-03-04', 1),
( 'b', '2014-03-05', 1),
( 'b', '2014-03-06', 1),
( 'b', '2014-03-07', 1)
OPTION A: using a WITH clause (common table expression) in PostgreSQL
with cte as (
select unique_id
,max(date) date
from some_data1
group by unique_id
)
select max(sd.unique_id),max(sd.date),avg(sd.value)
from some_data1 sd inner join cte using(unique_id)
where sd.date <=cte.date
group by cte.unique_id
limit 7
OPTION B : - To work in PostgreSQL and MySQL
select max(sd.unique_id)
,max(sd.date)
,avg(sd.value)
from (
select unique_id
,max(date) date
from some_data1
group by unique_id
) cte inner join some_data1 sd using(unique_id)
where sd.date <=cte.date
group by cte.unique_id
limit 7
Maybe something along the lines of SELECT AVG(Value) AS 'AvgValue' FROM tableName WHERE Date BETWEEN dateStart AND dateEnd. That will get you the average between those dates, and since you already have dateEnd, you could use that result to create the row you're looking for.
For PostgreSQL a window function might be what you want:
DROP TABLE IF EXISTS some_data;
CREATE TABLE some_data (unique_id text, date date, value integer);
INSERT INTO some_data (unique_id, date, value) VALUES
( 'a', '2014-03-20', 2),
( 'a', '2014-03-21', 2),
( 'a', '2014-03-22', 3),
( 'a', '2014-03-23', 5),
( 'a', '2014-03-24', 1),
( 'a', '2014-03-25', 0),
( 'a', '2014-03-26', 1),
( 'a', '2014-03-27', 3);
WITH avgs AS (
SELECT unique_id, date,
avg(value) OVER w AS week_avg,
count(value) OVER w AS num_days
FROM some_data
WINDOW w AS (
PARTITION BY unique_id
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW))
SELECT unique_id, date, week_avg
FROM avgs
WHERE num_days=7
Result:
unique_id | date | week_avg
-----------+------------+--------------------
a | 2014-03-26 | 2.0000000000000000
a | 2014-03-27 | 2.1428571428571429
Questions include:
What happens if a day from the preceding six days is missing? Do we want to add it and count it as zero?
What happens if you add a day? Is the result of the code above what you want (a rolling 7-day average)?
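As a way to experiment with those questions, here is a small Python sketch of the same rolling computation. It is an illustration only (the function name rolling_average is an assumption). Like ROWS BETWEEN 6 PRECEDING AND CURRENT ROW, it counts rows rather than calendar days, so a missing day silently widens the window.

```python
from collections import deque

def rolling_average(values, window=7):
    """values: (day, value) pairs already ordered by day."""
    buf = deque(maxlen=window)   # keeps only the last `window` rows
    out = []
    for day, value in values:
        buf.append(value)
        if len(buf) == window:   # same filter as num_days = 7
            out.append((day, sum(buf) / window))
    return out

data = [
    ('2014-03-20', 2), ('2014-03-21', 2), ('2014-03-22', 3),
    ('2014-03-23', 5), ('2014-03-24', 1), ('2014-03-25', 0),
    ('2014-03-26', 1), ('2014-03-27', 3),
]
averages = rolling_average(data)
for day, avg in averages:
    print(day, avg)
```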
For SQL Server, you can follow the approach below.
1. For weekly value's average
SET DATEFIRST 4
;WITH CTE AS
(
SELECT *,
DATEPART(WEEK,[DATE])WK,
--Find last day in that week
ROW_NUMBER() OVER(PARTITION BY UNIQUEID,DATEPART(WEEK,[DATE]) ORDER BY [DATE] DESC) RNO,
-- Find average value of that week
AVG(VALUE) OVER(PARTITION BY UNIQUEID,DATEPART(WEEK,[DATE])) AVGVALUE
FROM DATETAB
)
SELECT UNIQUEID,[DATE],AVGVALUE
FROM CTE
WHERE RNO=1
2. For last 7 days value's average
DECLARE @DATE DATE = '2014-03-26'
;WITH CTE AS
(
SELECT UNIQUEID,[DATE],VALUE,@DATE CURRENTDATE
FROM DATETAB
-- BETWEEN is inclusive on both ends, so -6 covers exactly 7 days
WHERE [DATE] BETWEEN DATEADD(DAY,-6,@DATE) AND @DATE
)
SELECT UNIQUEID,CURRENTDATE [DATE],AVG(VALUE) AVGVALUE
FROM CTE
GROUP BY UNIQUEID,CURRENTDATE

SQL return consecutive records

A simple table:
ForumPost
--------------
ID (int PK)
UserID (int FK)
Date (datetime)
What I'm looking to return is how many times a particular user has made at least 1 post a day for n consecutive days.
Example:
User 15844 has posted at least 1 post a day for 30 consecutive days 10 times
I've tagged this question with linq/lambda as well, as a solution there would also be great. I know I can solve this by iterating over all of a user's records, but this is slow.
There is a handy trick using ROW_NUMBER() to find consecutive entries. Imagine the following set of dates, with their row_number (starting at 0):
Date RowNumber
20130401 0
20130402 1
20130403 2
20130404 3
20130406 4
20130407 5
For consecutive entries, subtracting the row_number from the value gives the same result, e.g.
Date RowNumber date - row_number
20130401 0 20130401
20130402 1 20130401
20130403 2 20130401
20130404 3 20130401
20130406 4 20130402
20130407 5 20130402
You can then group by date - row_number to get the sets of consecutive days (i.e. the first 4 records, and the last 2 records).
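The same trick is easy to try out in Python. This is a minimal sketch, not part of the original answer; the function name consecutive_runs is an assumption. Each distinct date is paired with its position in sorted order, and date minus position serves as the grouping key.

```python
from datetime import date, timedelta
from itertools import groupby

def consecutive_runs(days):
    """Return (first_day, last_day, length) for each run of consecutive dates."""
    ordered = sorted(set(days))  # one entry per day, like SELECT DISTINCT
    # date - row_number is constant within a run of consecutive days
    keyed = [(day - timedelta(days=i), day) for i, day in enumerate(ordered)]
    runs = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [day for _, day in group]
        runs.append((members[0], members[-1], len(members)))
    return runs

days = [
    date(2013, 4, 1), date(2013, 4, 2), date(2013, 4, 3),
    date(2013, 4, 4), date(2013, 4, 6), date(2013, 4, 7),
]
runs = consecutive_runs(days)
for run in runs:
    print(run)
```

For the six dates in the table above this produces one run of 4 days and one run of 2 days.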
To apply this to your example you would use:
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, ConsecutiveDates = MAX(Days)
FROM Posts2
GROUP BY UserID;
EDIT
I don't think the above quite answered the question. This will give the number of times a user has posted on n or more consecutive days:
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
FirstDate = MIN(Date),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, [Times Over N Days] = COUNT(*)
FROM Posts2
WHERE Days >= 30
GROUP BY UserID;
Your particular application makes this pretty simple, I think. If you have 'n' distinct dates in an 'n'-day interval, those 'n' distinct dates must be consecutive.
Scroll to the bottom for a general solution that requires only common table expressions and changing to PostgreSQL. (Kidding. I implemented in PostgreSQL, because I'm short of time.)
create table ForumPost (
ID integer primary key,
UserID integer not null,
post_date date not null
);
insert into forumpost values
(1, 1, '2013-01-15'),
(2, 1, '2013-01-16'),
(3, 1, '2013-01-17'),
(4, 1, '2013-01-18'),
(5, 1, '2013-01-19'),
(6, 1, '2013-01-20'),
(7, 1, '2013-01-21'),
(11, 2, '2013-01-15'),
(12, 2, '2013-01-16'),
(13, 2, '2013-01-17'),
(16, 2, '2013-01-17'),
(14, 2, '2013-01-18'),
(15, 2, '2013-01-19'),
(21, 3, '2013-01-17'),
(22, 3, '2013-01-17'),
(23, 3, '2013-01-17'),
(24, 3, '2013-01-17'),
(25, 3, '2013-01-17'),
(26, 3, '2013-01-17'),
(27, 3, '2013-01-17');
Now, let's look at the output of this query. For brevity, I'm looking at 5-day intervals, not 30-day intervals.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid;
USERID DISTINCT_DATES
1 5
2 5
3 1
For users that fit the criteria, the number of distinct dates in that 5-day interval will have to be 5, right? So we just need to add that logic to a HAVING clause.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid
having count(distinct post_date) = 5;
USERID DISTINCT_DATES
1 5
2 5
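The reasoning behind the HAVING clause (n distinct dates inside an n-day window must be consecutive) can be checked with a short Python sketch. The function name users_posting_every_day and its parameters are assumptions for this illustration, not part of the answer.

```python
from datetime import date, timedelta

def users_posting_every_day(posts, start, n_days):
    """posts: (user_id, post_date) pairs. Returns users with a post on every day."""
    end = start + timedelta(days=n_days - 1)   # inclusive window, like BETWEEN
    days_by_user = {}
    for user_id, post_date in posts:
        if start <= post_date <= end:
            days_by_user.setdefault(user_id, set()).add(post_date)
    # n distinct dates within an n-day window means no day was skipped
    return sorted(u for u, days in days_by_user.items() if len(days) == n_days)

posts = (
    [(1, date(2013, 1, d)) for d in range(15, 22)]
    + [(2, date(2013, 1, d)) for d in (15, 16, 17, 17, 18, 19)]
    + [(3, date(2013, 1, 17))] * 7
)
every_day = users_posting_every_day(posts, date(2013, 1, 15), 5)
print(every_day)
```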
A more general solution
It doesn't really make sense to say that, if you post every day from 2013-01-01 to 2013-01-31, you've posted 30 consecutive days 2 times. Instead, I'd expect the clock to start over on 2013-01-31. My apologies for implementing in PostgreSQL; I'll try to implement in T-SQL later.
with first_posts as (
select userid, min(post_date) first_post_date
from forumpost
group by userid
),
period_intervals as (
select userid, first_post_date period_start,
(first_post_date + interval '4' day)::date period_end
from first_posts
), user_specific_intervals as (
select
userid,
(period_start + (n || ' days')::interval)::date as period_start,
(period_end + (n || ' days')::interval)::date as period_end
from period_intervals, generate_series(0, 30, 5) n
)
select userid, period_start, period_end,
(select count(distinct post_date)
from forumpost
where forumpost.post_date between period_start and period_end
and forumpost.userid = user_specific_intervals.userid) distinct_dates
from user_specific_intervals
order by userid, period_start;