Project data and cumulative sum forward

I am trying to push the last value of a cumulative dataset forward to present time.
Initialise test data:
drop table if exists test_table;
create table test_table
as select data_date::date, floor(random() * 10) as data_value
from
generate_series('2021-08-25'::date, '2021-08-31'::date, '1 day') data_date;
The above test data, together with a running total, produces something like this:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
What I wish to do is push the last data value (2021-08-31, value 7) forward to the present. For example, if today's date were 2021-09-03, I would want the result to be something like:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
2021-09-01 7 41
2021-09-02 7 48
2021-09-03 7 55

You need to get the value for the last date in the table. A common table expression (CTE) is a good way to do that:
with cte as (
select data_value as last_val
from test_table
order by data_date desc
limit 1)
select
gen_date::date as data_date,
coalesce(data_value, last_val) as data_value,
sum(coalesce(data_value, last_val)) over (order by gen_date) as cumulative_sum
from generate_series('2021-08-25'::date, '2021-09-03', '1 day') as gen_date
left join test_table on gen_date = data_date
cross join cte
Test it in db<>fiddle.
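To make the forward-fill step explicit outside the database, here is a minimal Python sketch of the same idea: repeat the last value for every missing day up to "today", then take a running total. The function name project_forward and the hard-coded sample values are illustrative assumptions, not part of the answer above.

```python
from datetime import date, timedelta
from itertools import accumulate


def project_forward(rows, today):
    """Extend (day, value) rows up to `today` by repeating the last value,
    then attach a running total. `rows` must be sorted by day."""
    last_day, last_value = rows[-1]
    missing_days = (today - last_day).days
    padding = [(last_day + timedelta(days=i), last_value)
               for i in range(1, missing_days + 1)]
    filled = list(rows) + padding
    running_totals = accumulate(value for _, value in filled)
    return [(day, value, total)
            for (day, value), total in zip(filled, running_totals)]


# Sample values from the question (2021-08-25 .. 2021-08-31).
sample = [(date(2021, 8, 25) + timedelta(days=i), v)
          for i, v in enumerate([1, 7, 8, 7, 2, 2, 7])]

for row in project_forward(sample, today=date(2021, 9, 3)):
    print(*row)
```

The padding list plays the role of the generated dates that have no match in test_table, and repeating last_value corresponds to coalesce(data_value, last_val).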

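You may use union all and a scalar subquery to find the latest value of data_value for the new rows. cumulative_value is re-evaluated over the combined rows; note that the window needs its own order by to make the running total deterministic.
select *, sum(data_value) over (order by data_date rows between unbounded preceding and current row) as cumulative_value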
from
(
select data_date, data_value from test_table
UNION all
select rd, (select data_value from test_table where data_date = '2021-08-31')
from generate_series('2021-09-01'::date, '2021-09-03', '1 day') rd
) t
order by data_date;
And here is a more general version without fixed date literals.
with cte(latest_date) as (select max(data_date) from test_table)
select *, sum(data_value) over (order by data_date rows between unbounded preceding and current row) as cumulative_value
from
(
select data_date, data_value from test_table
UNION ALL
select rd::date, (select data_value from test_table, cte where data_date = latest_date)
from generate_series((select latest_date from cte) + 1, CURRENT_DATE, '1 day') rd
) t
order by data_date;
SQL Fiddle here.

Related

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
DATE USERID EVENTS
2021-08-27 1 5
2021-07-25 1 7
2021-07-23 2 3
2021-07-20 3 9
2021-06-22 1 9
2021-05-05 1 4
2021-05-05 2 2
2021-05-05 3 6
2021-05-05 4 8
2021-05-05 5 1
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
DATE ACTIVE_USERS
2021-08-27 1
2021-07-25 3
2021-07-23 2
2021-07-20 2
2021-06-22 1
2021-05-05 5
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with rows between, but it seems to give the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.

This is tricky to do with window functions, because count(distinct) is not permitted as a window function. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t1 join
table t2
on t2.date <= t1.date and
t2.date > t1.date - interval '30 day'
group by t1.date;
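As a rough illustration of what the self-join computes, the following Python sketch pairs every report date with all events in its look-back window and counts the distinct users. It follows the question's wording (the given date plus the preceding 30 days, inclusive), which is an assumption about the intended boundary; the function name active_users is hypothetical.

```python
from datetime import date, timedelta

# (event_date, userid) pairs from the question's sample table.
events = [
    (date(2021, 8, 27), 1), (date(2021, 7, 25), 1), (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3), (date(2021, 6, 22), 1), (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2), (date(2021, 5, 5), 3), (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]


def active_users(events, window_days=30):
    """For each distinct event date, count distinct users with an event on
    that date or in the preceding `window_days` days (the self-join idea)."""
    counts = {}
    for report_date in sorted({d for d, _ in events}, reverse=True):
        window_start = report_date - timedelta(days=window_days)
        users = {u for d, u in events if window_start <= d <= report_date}
        counts[report_date] = len(users)
    return counts


for report_date, n in active_users(events).items():
    print(report_date, n)
```

Like the self-join, this compares every report date with every event, so the cost grows quadratically with the number of rows.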
However, that can be expensive. One solution is to "unpivot" the data. That is, record for each user a +1 when they go "in" to the active state and a -1 when they go "out", and then do a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
select userid, date, +1 as inc
from table
union all
select userid, date + interval '30 day', -1 as inc
from table
),
d2 as ( -- accumulate to get the net actives per day
select date, userid, sum(inc) as change_on_day,
sum(sum(inc)) over (partition by userid order by date) as running_inc
from d
group by date, userid
),
d3 as ( -- summarize into active periods; each period ends on the row where running_inc returns to 0
select userid, active_period, min(date) as start_date, max(date) as end_date
from (select d2.*,
sum(case when running_inc = 0 then 1 else 0 end) over (partition by userid order by date)
- case when running_inc = 0 then 1 else 0 end as active_period
from d2
) d2
group by userid, active_period
)
select d.date, count(d3.userid)
from (select distinct date from table) d left join
d3
on d.date >= d3.start_date and d.date < d3.end_date
group by d.date;
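The "in"/"out" bookkeeping can be sketched in Python for a single user as follows. Each event adds +1 on its own date and -1 on the day the user stops being active; a running sum that returns to zero closes an active period. The function name active_periods is hypothetical, and placing the "out" marker window_days + 1 days after the event (so the user still counts on day 30) is an assumption about the intended boundary.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_periods(event_dates, window_days=30):
    """Merge one user's event dates into half-open [start, end) active periods."""
    changes = defaultdict(int)
    for d in event_dates:
        changes[d] += 1                                    # user goes "in"
        changes[d + timedelta(days=window_days + 1)] -= 1  # user goes "out"
    periods, running, start = [], 0, None
    for day in sorted(changes):
        if running == 0 and changes[day] > 0:
            start = day                    # a new active period begins
        running += changes[day]
        if running == 0 and start is not None:
            periods.append((start, day))   # running sum is back to zero
            start = None
    return periods


# User 1 from the question's sample data.
user_1 = [date(2021, 8, 27), date(2021, 7, 25), date(2021, 6, 22), date(2021, 5, 5)]
for start, end in active_periods(user_1):
    print(start, end)
```

A report date then counts a user as active when it falls inside one of that user's periods, which is what the final join on start_date and end_date expresses.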

How to get values from the previous row?

I have a table like this:
ID NUMBER TIMESTAMP
1 1 05/28/2020 09:00:00
2 2 05/29/2020 10:00:00
3 1 05/31/2020 21:00:00
4 1 06/01/2020 21:00:00
And I want to show data like this:
ID NUMBER TIMESTAMP RANGE
1 1 05/28/2020 09:00:00 0 Days
2 2 05/29/2020 10:00:00 0 Days
3 1 05/31/2020 21:00:00 3,5 Days
4 1 06/01/2020 21:00:00 1 Days
So it takes 3,5 Days to process the number 1 process.
I tried:
select a.id, a.number, a.timestamp, ((a.timestamp-b.timestamp)/24) as days
from my_table a
left join (select number,timestamp from my_table) b
on a.number=b.number
Didn't work as expected. How to do this properly?

Use the window function lag().
With standard interval output:
SELECT *, timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)
FROM tbl
ORDER BY id;
If you need a decimal number, as in your example:
SELECT *, round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2) || ' days'
FROM tbl
ORDER BY id;
If you also need to display '0 days' instead of NULL like in your example:
SELECT *, COALESCE(round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2), 0) || ' days'
FROM tbl
ORDER BY id;
db<>fiddle here
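For readers who want to see what lag() does per partition, here is a small Python sketch that emulates lag(timestamp) OVER (PARTITION BY number ORDER BY id) by remembering the most recent timestamp seen for each number. The function name days_since_previous is hypothetical; the rows are the sample data from the question.

```python
from datetime import datetime

rows = [  # (id, number, timestamp) from the question
    (1, 1, datetime(2020, 5, 28, 9, 0)),
    (2, 2, datetime(2020, 5, 29, 10, 0)),
    (3, 1, datetime(2020, 5, 31, 21, 0)),
    (4, 1, datetime(2020, 6, 1, 21, 0)),
]


def days_since_previous(rows):
    """Return (id, days since the previous row with the same number),
    using 0.0 when there is no previous row (the COALESCE case)."""
    previous = {}  # number -> timestamp of the most recent earlier row
    result = []
    for row_id, number, ts in sorted(rows):
        prev_ts = previous.get(number)
        days = 0.0 if prev_ts is None else (ts - prev_ts).total_seconds() / 86400
        result.append((row_id, round(days, 2)))
        previous[number] = ts
    return result


print(days_since_previous(rows))
```

Dividing the elapsed seconds by 86400 mirrors the extract(epoch FROM ...) / 86400 step in the SQL.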

HIVE - compute statistics over partitions with window based on date

I've seen solutions for problems similar to mine, but none quite work for me. Also I'm confident that there should be a way to make it work.
Given a table with
ID Date target
1 2020-01-01 1
1 2020-01-02 1
1 2020-01-03 0
1 2020-01-04 1
1 2020-01-04 0
1 2020-06-01 1
1 2020-06-02 1
1 2020-06-03 0
1 2020-06-04 1
1 2020-06-04 0
2 2020-01-01 1
ID is BIGINT, target is Int and Date is DATE
I want to compute, for each ID/Date, the sum and the number of rows for the same ID in the 3 months and 12 months before the Date (inclusive). Example of output:
ID Date Sum_3 Count_3 Sum_12 Count_12
1 2020-01-01 1 1 1 1
1 2020-01-02 2 2 2 2
1 2020-01-03 2 3 2 3
1 2020-01-04 3 5 3 5
1 2020-06-01 1 1 4 6
1 2020-06-02 2 2 5 7
1 2020-06-03 2 3 6 8
1 2020-06-04 3 5 7 10
2 2020-01-01 1 1 1 1
How can I get this kind of result in HIVE?
I'm not sure if I should use analytical functions (and how), group by, etc...?

If you can live with an approximation of months as a number of days, then you can use window functions in Hive:
select id, date,
count(*) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding -- 90 days
) as count_3,
sum(target) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding
) as sum_3,
count(*) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding -- 360 days
) as count_12,
sum(target) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding
) as sum_12
from mytable
You can aggregate in the same query:
select id, date,
sum(count(*)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding -- 90 days
) as count_3,
sum(sum(target)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 90 preceding
) as sum_3,
sum(count(*)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding -- 360 days
) as count_12,
sum(sum(target)) over(
partition by id
order by unix_timestamp(date)
range 60 * 60 * 24 * 360 preceding
) as sum_12
from mytable
group by id, date, unix_timestamp(date)
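The range frame can be pictured as "all rows of the same id whose date lies within N days before the current date, inclusive". A minimal Python sketch of that rule, using the question's sample rows, is shown below; the function name windowed_stats is hypothetical and the day counts (90 and 360) are the same approximations of 3 and 12 months used in the queries above.

```python
from datetime import date, timedelta

rows = [  # (id, date, target) from the question
    (1, date(2020, 1, 1), 1), (1, date(2020, 1, 2), 1), (1, date(2020, 1, 3), 0),
    (1, date(2020, 1, 4), 1), (1, date(2020, 1, 4), 0),
    (1, date(2020, 6, 1), 1), (1, date(2020, 6, 2), 1), (1, date(2020, 6, 3), 0),
    (1, date(2020, 6, 4), 1), (1, date(2020, 6, 4), 0),
    (2, date(2020, 1, 1), 1),
]


def windowed_stats(rows, days):
    """For each distinct (id, date), return (sum, count) of target over rows
    with the same id dated within `days` days before that date, inclusive."""
    stats = {}
    for key_id, key_date in sorted({(i, d) for i, d, _ in rows}):
        lower = key_date - timedelta(days=days)
        targets = [t for i, d, t in rows
                   if i == key_id and lower <= d <= key_date]
        stats[(key_id, key_date)] = (sum(targets), len(targets))
    return stats


three_months = windowed_stats(rows, 90)
twelve_months = windowed_stats(rows, 360)
print(three_months[(1, date(2020, 1, 4))], twelve_months[(1, date(2020, 6, 1))])
```

Rows that share a date fall into the same frame, which is why the grouped query can sum the per-day counts and totals instead of the raw rows.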

If you can approximate the interval (1 month = 30 days), this is an improvement on GMB's answer:
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select ID, Date,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) sum_3,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) count_3,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) sum_12,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) count_12
from t
Or if you want exact intervals, you can do self joins (but expensive):
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select
t_3month.ID,
t_3month.Date,
t_3month.sum_3,
t_3month.count_3,
sum(t3.target) sum_12,
sum(t3.c_target) count_12
from (
select
t1.ID,
t1.Date,
sum(t2.target) sum_3,
sum(t2.c_target) count_3
from t t1
left join t t2
on t2.Date > t1.Date - interval 3 month and
t2.Date <= t1.Date and
t1.ID = t2.ID
group by t1.ID, t1.Date
) t_3month
left join t t3
on t3.Date > t_3month.Date - interval 12 month and
t3.Date <= t_3month.Date and
t_3month.ID = t3.ID
group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
order by ID, Date;

Add Missing monthly dates in a timeseries data in Postgresql

I have monthly time series data in a table where each date is the last day of a month. Some of the dates are missing from the data. I want to insert those dates and put a zero value for the other attributes.
Table is as follows:
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-04-30 34
2 2014-05-31 45
2 2014-08-31 47
I want to convert this table to
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-03-31 0
1 2015-04-30 34
2 2014-05-31 45
2 2014-06-30 0
2 2014-07-31 0
2 2014-08-31 47
Is there any way we can do this in Postgresql?
Currently we are doing this in Python. Our data is growing day by day, and it is not efficient to handle the I/O just for this one task.
Thank you

You can do this using generate_series() to generate the dates and then left join to bring in the values:
with m as (
select id, min(report_date) as minrd, max(report_date) as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) as report_date
from m
) m left join
t
on m.report_date = t.report_date and m.id = t.id;
EDIT:
Turns out that the above doesn't quite work, because adding months to the end of month doesn't keep the last day of the month.
This is easily fixed by stepping through the first day of the following month and subtracting one day afterwards:
with t as (
select 1 as id, date '2012-01-31' as report_date, 10 as price union all
select 1 as id, date '2012-04-30', 20
), m as (
select id, min(report_date) + interval '1 day' as minrd, max(report_date) + interval '1 day' as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) - interval '1 day' as report_date
from m
) m left join
t
on m.report_date = t.report_date and m.id = t.id;
The first CTE is just to generate sample data.

This is a slight improvement over Gordon's query which fails to get the last date of a month in some cases.
Essentially you generate all the month end dates between the min and max date for each id (using generate_series) and left join on this generated table to show the missing dates with 0 price.
with minmax as (
select id, min(report_date) as mindt, max(report_date) as maxdt
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select *,
generate_series(date_trunc('MONTH',mindt+interval '1' day),
date_trunc('MONTH',maxdt+interval '1' day),
interval '1' month) - interval '1 day' as report_date
from minmax
) m
left join t on m.report_date = t.report_date and m.id = t.id
Sample Demo
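The month-end generation can also be sketched in Python, which makes the calendar handling explicit: walk month by month and ask the calendar for each month's last day instead of adding a fixed interval to a month-end date. The function names month_ends and fill_missing are hypothetical; the prices are the sample rows from the question.

```python
import calendar
from datetime import date


def month_ends(first, last):
    """Yield the last day of every month from first's month to last's month."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, calendar.monthrange(year, month)[1])
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def fill_missing(prices):
    """prices: {(id, report_date): price}. Return sorted rows with price 0
    for month ends missing between each id's first and last report date."""
    rows = []
    for key_id in sorted({i for i, _ in prices}):
        dates = [d for i, d in prices if i == key_id]
        for d in month_ends(min(dates), max(dates)):
            rows.append((key_id, d, prices.get((key_id, d), 0)))
    return rows


prices = {
    (1, date(2015, 1, 31)): 40, (1, date(2015, 2, 28)): 56,
    (1, date(2015, 4, 30)): 34,
    (2, date(2014, 5, 31)): 45, (2, date(2014, 8, 31)): 47,
}
for row in fill_missing(prices):
    print(*row)
```

The prices.get(..., 0) lookup corresponds to the left join with coalesce(t.price, 0), and keying on both id and date corresponds to joining on both columns.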

SQL issue - calculate max days sequence

There is a table with visits data:
uid (INT) | created_at (DATETIME)
I want to find how many days in a row a user has visited our app. So for instance:
SELECT DISTINCT DATE(created_at) AS d FROM visits WHERE uid = 123
will return:
d
------------
2012-04-28
2012-04-29
2012-04-30
2012-05-03
2012-05-04
There are 5 records and two intervals - 3 days (28 - 30 Apr) and 2 days (3 - 4 May).
My question is how to find the maximum number of days that a user has visited the app in a row (3 days in the example). Tried to find a suitable function in the SQL docs, but with no success. Am I missing something?
UPD:
Thank you guys for your answers! Actually, I'm working with the Vertica analytics database (http://vertica.com/); it is fairly rare and only a few people have experience with it, although it supports the SQL-99 standard.
Well, most of the solutions work with slight modifications. Finally I created my own version of the query:
-- returns the starts of the visit series
SELECT t1.d as s FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
s
---------------------
2012-04-28 01:00:00
2012-05-03 01:00:00
-- returns the ends of the visit series
SELECT t1.d as f FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
f
---------------------
2012-04-30 01:00:00
2012-05-04 01:00:00
So now all we need to do is join them somehow, for instance by row index.
SELECT s, f, DATEDIFF(day, s, f) + 1 as seq FROM (
SELECT t1.d as s, ROW_NUMBER() OVER () as o1 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', -1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl1 LEFT JOIN (
SELECT t1.d as f, ROW_NUMBER() OVER () as o2 FROM testing t1
LEFT JOIN testing t2 ON DATE(t2.d) = DATE(TIMESTAMPADD('day', 1, t1.d))
WHERE t2.d is null GROUP BY t1.d
) tbl2 ON o1 = o2
Sample output:
s | f | seq
---------------------+---------------------+-----
2012-04-28 01:00:00 | 2012-04-30 01:00:00 | 3
2012-05-03 01:00:00 | 2012-05-04 01:00:00 | 2

Another approach, the shortest, do a self-join:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select d, group_number, count(d) over m as consecutive_days
from grouped_result
window m as (partition by group_number)
Output:
d | group_number | consecutive_days
---------------------+--------------+------------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)
Live test: http://www.sqlfiddle.com/#!1/93789/1
sr = second row, fr = first row (or perhaps previous row? ツ). Basically we are backtracking: the join simulates LAG for databases that don't support it (Postgres supports LAG, but that solution is very long, because window functions cannot be nested). So this query uses a hybrid approach: simulate LAG via the join, then apply a windowed SUM over the result, which produces the group number.
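The group-numbering idea can be sketched in Python as follows: mark each day whose previous day is absent (the start of a run), then take a running sum of those marks. The function name group_numbers is hypothetical, and the sketch works on plain dates rather than timestamps, which is a simplifying assumption.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import accumulate

days = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
        date(2012, 5, 3), date(2012, 5, 4)]


def group_numbers(days):
    """Flag each day whose previous day is absent (a run "header"), then take
    a running sum of the flags so every run gets its own group number."""
    present = set(days)
    ordered = sorted(present)
    headers = [int(d - timedelta(days=1) not in present) for d in ordered]
    return list(zip(ordered, accumulate(headers)))


for day, group in group_numbers(days):
    print(day, group)

# Counting the members of each group gives the run lengths.
longest_run = max(Counter(g for _, g in group_numbers(days)).values())
print(longest_run)
```

The membership test against the set plays the role of the left join (fr.d is null), and accumulate plays the role of sum(...) over (order by sr.d).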
UPDATE
I forgot to include the final query. The query above illustrates the underpinnings of group numbering; it needs to be morphed into this:
with grouped_result as
(
select
sr.d,
sum((fr.d is null)::int) over(order by sr.d) as group_number
from tbl sr
left join tbl fr on sr.d = fr.d + interval '1 day'
)
select min(d) as starting_date, max(d) as end_date, count(d) as consecutive_days
from grouped_result
group by group_number
-- order by consecutive_days desc limit 1
STARTING_DATE END_DATE CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700 April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700 May, 04 2012 08:00:00-0700 2
UPDATE
I know why my other solution that uses window functions became long: I was trying to illustrate the logic of group numbering and of counting over the group. If I had cut to the chase as in my MySQL approach, the window-function version could have been shorter. Having said that, here's my old window-function approach, improved:
with headers as
(
select
d,lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over (order by d) as group_number
from headers
)
select min(d) as starting_date,max(d) as ending_date,count(d) as consecutive_days
from sequence_group
group by group_number
-- order by consecutive_days desc limit 1
Live test: http://www.sqlfiddle.com/#!1/93789/21

In MySQL you could do this:
SET @NextDate = CURRENT_DATE;
SET @RowNum = 1;
SELECT MAX(RowNumber) AS ConsecutiveVisits
FROM ( SELECT @RowNum := IF(@NextDate = Created_At, @RowNum + 1, 1) AS RowNumber,
Created_At,
@NextDate := DATE_ADD(Created_At, INTERVAL 1 DAY) AS NextDate
FROM Visits
ORDER BY Created_At
) Visits
Example here:
http://sqlfiddle.com/#!2/6e035/8
However I am not 100% certain this is the best way to do it.
In Postgresql:
;WITH RECURSIVE VisitsCTE AS
( SELECT Created_At, 1 AS ConsecutiveDays
FROM Visits
UNION ALL
SELECT v.Created_At, ConsecutiveDays + 1
FROM Visits v
INNER JOIN VisitsCTE cte
ON 1 + cte.Created_At = v.Created_At
)
SELECT MAX(ConsecutiveDays) AS ConsecutiveDays
FROM VisitsCTE
Example here:
http://sqlfiddle.com/#!1/16c90/9

I know Postgresql has something similar to common table expressions as available in MSSQL. I'm not that familiar with Postgresql, but the code below works for MSSQL and does what you want.
create table #tempdates (
mydate date
)
insert into #tempdates(mydate) values('2012-04-28')
insert into #tempdates(mydate) values('2012-04-29')
insert into #tempdates(mydate) values('2012-04-30')
insert into #tempdates(mydate) values('2012-05-03')
insert into #tempdates(mydate) values('2012-05-04');
with maxdays (s, e, c)
as
(
select mydate, mydate, 1
from #tempdates
union all
select m.s, mydate, m.c + 1
from #tempdates t
inner join maxdays m on DATEADD(day, -1, t.mydate)=m.e
)
select MIN(o.s),o.e,max(o.c)
from (
select m1.s,max(m1.e) e,max(m1.c) c
from maxdays m1
group by m1.s
) o
group by o.e
drop table #tempdates
And here's the SQL fiddle: http://sqlfiddle.com/#!3/42b38/2

All are very good answers, but I think I should contribute by showing another approach utilizing an analytical capability specific to Vertica (after all it is part of what you paid for). And I promise the final query is short.
First, query using conditional_true_event(). From Vertica's documentation:
Assigns an event window number to each row, starting from 0, and
increments the number by 1 when the result of the boolean argument
expression evaluates true.
The example query looks like this:
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits;
And output:
uid created_at seq_id
--- ------------------- ------
123 2012-04-28 00:00:00 0
123 2012-04-29 00:00:00 0
123 2012-04-30 00:00:00 0
123 2012-05-03 00:00:00 1
123 2012-05-04 00:00:00 1
123 2012-06-04 00:00:00 2
123 2012-06-04 00:00:00 2
Now the final query becomes easy:
select uid, seq_id, count(1) num_days, min(created_at) s, max(created_at) f
from
(
select uid, created_at,
conditional_true_event( created_at - lag(created_at) > '1 day' )
over (partition by uid order by created_at) as seq_id
from visits
) as seq
group by uid, seq_id;
Final Output:
uid seq_id num_days s f
--- ------ -------- ------------------- -------------------
123 0 3 2012-04-28 00:00:00 2012-04-30 00:00:00
123 1 2 2012-05-03 00:00:00 2012-05-04 00:00:00
123 2 2 2012-06-04 00:00:00 2012-06-04 00:00:00
One final note:
num_days is actually number of rows of the inner query. If there are two '2012-04-28' visits in the original table (i.e. duplicates), you might want to work around that.

The following should be Oracle friendly, and not require recursive logic.
;WITH
visit_dates (
visit_id,
date_id,
group_id
)
AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY TRUNC(created_at)),
TRUNC(SYSDATE) - TRUNC(created_at),
TRUNC(SYSDATE) - TRUNC(created_at) - ROW_NUMBER() OVER (ORDER BY TRUNC(created_at))
FROM
visits
GROUP BY
TRUNC(created_at)
)
,
group_duration (
group_id,
duration
)
AS
(
SELECT
group_id,
MAX(date_id) - MIN(date_id) + 1 AS duration
FROM
visit_dates
GROUP BY
group_id
)
SELECT
MAX(duration) AS max_duration
FROM
group_duration

Postgresql:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
Divide-and-conquer approach: 3 steps
1st step, find headers:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
select * from headers
Output:
d | header
---------------------+--------
2012-04-28 08:00:00 | t
2012-04-29 08:00:00 | f
2012-04-30 08:00:00 | f
2012-05-03 08:00:00 | t
2012-05-04 08:00:00 | f
(5 rows)
2nd step, designate grouping:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
select * from sequence_group
Output:
d | group_number
---------------------+--------------
2012-04-28 08:00:00 | 1
2012-04-29 08:00:00 | 1
2012-04-30 08:00:00 | 1
2012-05-03 08:00:00 | 2
2012-05-04 08:00:00 | 2
(5 rows)
3rd step, count max days:
with headers as
(
select
d,
lag(d) over m is null or d - lag(d) over m <> interval '1 day' as header
from tbl
window m as (order by d)
)
,sequence_group as
(
select d, sum(header::int) over m as group_number
from headers
window m as (order by d)
)
,consecutive_list as
(
select d, group_number, count(d) over m as consecutive_count
from sequence_group
window m as (partition by group_number)
)
select * from consecutive_list
Output:
d | group_number | consecutive_count
---------------------+--------------+-----------------
2012-04-28 08:00:00 | 1 | 3
2012-04-29 08:00:00 | 1 | 3
2012-04-30 08:00:00 | 1 | 3
2012-05-03 08:00:00 | 2 | 2
2012-05-04 08:00:00 | 2 | 2
(5 rows)

This one is for MySQL; it is the shortest and uses only one variable:
select
min(d) as starting_date, max(d) as ending_date,
count(d) as consecutive_days
from
(
select
sr.d,
IF(fr.d is null,@group_number := @group_number + 1,@group_number)
as group_number
from tbl sr
left join tbl fr on sr.d = adddate(fr.d,interval 1 day)
cross join (select @group_number := 0) as grp
) as x
group by group_number
Output:
STARTING_DATE ENDING_DATE CONSECUTIVE_DAYS
April, 28 2012 08:00:00-0700 April, 30 2012 08:00:00-0700 3
May, 03 2012 08:00:00-0700 May, 04 2012 08:00:00-0700 2
Live test: http://www.sqlfiddle.com/#!2/65169/1

For PostgreSQL 8.4 or later, there is a short and clean way with window functions and no JOIN.
I'd expect this to be the fastest solution posted so far:
WITH x AS (
SELECT created_at AS d
, lag(created_at) OVER (ORDER BY created_at) = (created_at - 1) AS nu
FROM visits
WHERE uid = 1
)
, y AS (
SELECT d, count(NULLIF(nu, TRUE)) OVER (ORDER BY d) AS seq
FROM x
)
SELECT count(*) AS max_days, min(d) AS seq_from, max(d) AS seq_to
FROM y
GROUP BY seq
ORDER BY 1 DESC
LIMIT 1;
Returns:
max_days | seq_from | seq_to
---------+------------+-----------
3 | 2012-04-28 | 2012-04-30
Assuming that created_at is a date and unique.
In CTE x: for every day our user visits, check if he was here yesterday, too.
To calculate "yesterday" just use created_at - 1. The first row is a special case and will produce NULL here.
In CTE y: calculate a running count of "days without yesterday so far" (seq) for every day. NULL values don't count, so count(NULLIF(nu, TRUE)) is the fastest and shortest way, also covering the special case.
Finally, group days per seq and count the days. While being at it I added first and last day of the sequence.
ORDER BY length of the sequence, and pick the longest one.

Upon seeing OP's query approach for their Vertica database, I tried making the two joins run at the same time:
Both the Postgresql and the Sql Server versions of the query should work in Vertica.
Postgresql version:
select
min(gr.d) as start_date,
max(gr.d) as end_date,
date_part('day', max(gr.d) - min(gr.d))+1 as consecutive_days
from
(
select
cr.d, (row_number() over() - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = cr.d - interval '1 day'
left join tbl nr on nr.d = cr.d + interval '1 day'
where pr.d is null <> nr.d is null
) as gr
group by pair_number
order by start_date
Regarding pr.d is null <> nr.d is null: the condition is true when exactly one of the previous row and the next row is missing, i.e. it is an XOR operation. It removes dates in the middle of a run (both neighbours exist) and isolated dates (both neighbours are missing), which leaves only the first and last date of each run of consecutive dates (the headers and footers).
If we are left with consecutive dates only, we can now pair them via row_number:
(row_number() over() - 1) / 2 as pair_number
row_number() starts at 1, so we subtract 1 from it (we could also add 1 instead) and then divide by two; with integer division this gives the start and end date of each run the same pair_number.
Live test: http://www.sqlfiddle.com/#!1/fc440/7
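A small Python sketch of the pairing trick is given below. It keeps only the days where exactly one neighbour exists and pairs them off in date order. The function name runs_by_pairing is hypothetical, and the extra isolated day in the sample is an assumption added to show that single-day visits are dropped by this method.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def runs_by_pairing(days):
    """Keep days where exactly one neighbour exists (run starts and ends),
    then pair them off in date order, like (row_number() - 1) / 2."""
    present = set(days)
    edges = sorted(d for d in present
                   if ((d - ONE_DAY) in present) != ((d + ONE_DAY) in present))
    runs = []
    for start, end in zip(edges[0::2], edges[1::2]):
        runs.append((start, end, (end - start).days + 1))
    return runs


days = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 30),
        date(2012, 5, 3), date(2012, 5, 4),
        date(2012, 5, 10)]  # isolated day, filtered out by the XOR test
for run in runs_by_pairing(days):
    print(*run)
```

Because isolated days never survive the filter, a user whose longest streak is a single day yields no rows at all; that case would need separate handling.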
This is the Sql Server version:
select
min(gr.d) as start_date,
max(gr.d) as end_date,
datediff(day, min(gr.d),max(gr.d)) +1 as consecutive_days
from
(
select
cr.d, (row_number() over(order by cr.d) - 1) / 2 as pair_number
from tbl cr
left join tbl pr on pr.d = dateadd(day,-1,cr.d)
left join tbl nr on nr.d = dateadd(day,+1,cr.d)
where
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
) as gr
group by pair_number
order by start_date
Same logic as above, except for superficial differences in the date functions. Also, Sql Server requires an ORDER BY clause in its OVER, while Postgresql's OVER can be left empty.
Sql Server has no first class boolean, that's why we cannot compare booleans directly:
pr.d is null <> nr.d is null
We must do this in Sql Server:
case when pr.d is null then 1 else 0 end
<> case when nr.d is null then 1 else 0 end
Live test: http://www.sqlfiddle.com/#!3/65df2/17

There have already been several answers to this question. However the SQL statements all seem too complex. This can be accomplished with basic SQL, a way to enumerate rows, and some date arithmetic.
The key observation is that if you have a bunch of days and have a parallel sequence of integers, then the difference is a constant date when the days are in a sequence.
The following query uses this observation to answer the original question:
select uid, min(d) as startdate, count(*) as numdaysinseq
from
(
select uid, d, adddate(d, interval -offset day) as groupstart
from
(
select uid, d, row_number() over (partition by uid order by d) as offset
from
(
SELECT DISTINCT uid, DATE(created_at) AS d
FROM visits
) t
) t
) t
group by uid, groupstart
Alas, MySQL versions before 8.0 do not have the row_number() function. However, there is a work-around with variables (and most other databases do have this function).
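The key observation can be checked with a short Python sketch: subtracting each day's rank from the day itself gives the same anchor date for all days in one consecutive run. The function name longest_streak is hypothetical, and the sketch assumes a non-empty list of visit dates for a single user.

```python
from datetime import date, timedelta


def longest_streak(days):
    """Group days by (day - rank); days in one consecutive run share the same
    result. Return (first day, length) of the longest run."""
    ordered = sorted(set(days))  # DISTINCT dates, in order
    groups = {}
    for rank, d in enumerate(ordered, start=1):  # rank plays row_number()
        groups.setdefault(d - timedelta(days=rank), []).append(d)
    best = max(groups.values(), key=len)
    return best[0], len(best)


visits = [date(2012, 4, 28), date(2012, 4, 29), date(2012, 4, 29),
          date(2012, 4, 30), date(2012, 5, 3), date(2012, 5, 4)]
print(longest_streak(visits))
```

Deduplicating first matters: a repeated date would otherwise shift the ranks and split a run into two groups.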
