I have a log table like the one below and want to simplify it by getting the minimum start date and maximum end date for consecutive Status values for each Id. I tried many window function combinations but had no luck.
This is what I have:
This is what I want to see:
This is a typical gaps-and-islands problem. You want to aggregate groups of consecutive records that have the same Id and Status.
No need for recursion, here is one way to solve it using window functions:
select
Id,
Status,
min(StartDate) StartDate,
max(EndDate) EndDate
from (
select
t.*,
row_number() over(partition by id order by StartDate) rn1,
row_number() over(partition by id, status order by StartDate) rn2
from mytable t
) t
group by
Id,
Status,
rn1 - rn2
order by Id, min(StartDate)
The query works by ranking records over two different partitions (by Id, and by Id and Status). The difference between the ranks gives you the group each record belongs to. You can run the subquery independently to see what it returns and understand the logic.
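For example, running just the inner query on its own (a sketch, reusing the table and column names from the query above) makes the grouping key visible:

select t.*,
       row_number() over(partition by id order by StartDate) rn1,
       row_number() over(partition by id, status order by StartDate) rn2,
       row_number() over(partition by id order by StartDate)
         - row_number() over(partition by id, status order by StartDate) grp
from mytable t
order by Id, StartDate;

Rows that belong to the same run of consecutive Status values for an Id end up with the same grp value, which is what the outer GROUP BY uses.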
Demo on DB Fiddle:
Id | Status | StartDate | EndDate
-: | :----- | :------------------ | :------------------
1 | B | 07/02/2019 00:00:00 | 18/02/2019 00:00:00
1 | C | 18/02/2019 00:00:00 | 10/03/2019 00:00:00
1 | B | 10/03/2019 00:00:00 | 01/04/2019 00:00:00
2 | A | 05/02/2019 00:00:00 | 22/04/2019 00:00:00
2 | D | 22/04/2019 00:00:00 | 05/05/2019 00:00:00
2 | A | 05/05/2019 00:00:00 | 30/06/2019 00:00:00
Try the following query. First order the data by StartDate within each (id, status) group and generate a sequence (rid). Then use a recursive CTE to get the first row (rid = 1) of each group and recursively fetch the next row, comparing the start/end dates.
;WITH cte_r(id,[Status],StartDate,EndDate,rid)
AS
(
SELECT id,[Status],StartDate,EndDate, ROW_NUMBER() OVER(PARTITION BY Id,[Status] ORDER BY StartDate) AS rid
FROM log_table
),
cte_range(id,[Status],StartDate,EndDate,rid)
AS
(
SELECT id,[Status],StartDate,EndDate,rid
FROM cte_r
WHERE rid=1
UNION ALL
SELECT p.id, p.[Status], CASE WHEN c.StartDate<p.EndDate THEN p.StartDate ELSE c.StartDate END AS StartDate, c.EndDate,c.rid
FROM cte_range p
INNER JOIN cte_r c
ON p.id=c.id
AND p.[Status]=c.[Status]
AND p.rid+1=c.rid
)
SELECT id,[Status],StartDate,MAX(EndDate) AS EndDate FROM cte_range GROUP BY id,[Status],StartDate;
Related
I am trying to go from the following table
| user_id | touch      | Date       | Purchase Amount |
|---------|------------|------------|-----------------|
| 1       | Impression | 2020-09-12 | 0               |
| 1       | Impression | 2020-10-12 | 0               |
| 1       | Purchase   | 2020-10-13 | 125$            |
| 1       | Email      | 2020-10-14 | 0               |
| 1       | Impression | 2020-10-15 | 0               |
| 1       | Purchase   | 2020-10-30 | 122             |
| 2       | Impression | 2020-10-15 | 0               |
| 2       | Impression | 2020-10-16 | 0               |
| 2       | Email      | 2020-10-17 | 0               |
to
| user_id | path                             | Number of days between First Touch and Purchase   | Purchase Amount |
|---------|----------------------------------|----------------------------------------------------|-----------------|
| 1       | Impression, Impression, Purchase | 2020-10-13 (Purchase) - 2020-09-12 (Impression)    | 125$            |
| 1       | Email, Impression, Purchase      | 2020-10-30 (Purchase) - 2020-10-14 (Email)         | 122$            |
| 2       | Impression, Impression, Email    | 2020-12-31 (Fixed date) - 2020-10-15 (Impression)  | 0$              |
In essence, I am trying to create a new row for each unique user every time a 'Purchase' is encountered, with the touches collected into a comma-separated string.
Also, I want the difference in days between the first touch and the first purchase for each unique user. When a new row is created we do the same for that user, as shown in the example above.
From the little I have gathered, I need a mixture of a cross join and string_agg, but I tried using a case statement within string_agg and was not able to get the required result.
Is there a better way to do this in SQL (BigQuery)?
Thank you
Below is for BigQuery Standard SQL
#standardSQL
select user_id,
string_agg(touch order by date) path,
date_diff(max(date), min(date), day) days,
sum(amount) amount
from (
select user_id, touch, date, amount,
countif(touch = 'Purchase') over win grp
from `project.dataset.table`
window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
The grp value counts the purchases strictly before the current row (the window frame ends at 1 preceding), so each Purchase row stays in the same group as the touches that precede it. If applied to the sample data from your question, the output is
One more change: in case there is no Purchase among a user's touches, we calculate the number of days up to a fixed date we have set. How can I add this to the query above?
select user_id,
string_agg(touch order by date) path,
date_diff(if(countif(touch = 'Purchase') = 0, date '2020-12-31', max(date)), min(date), day) days,
sum(amount) amount
from (
select user_id, touch, date, amount,
countif(touch = 'Purchase') over win grp
from `project.dataset.table`
window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
with output
This means you need a solution that starts a new group whenever a Purchase appears among the touches.
Use the following query:
Select user_id,
       -- aggregation function according to your requirement,
       Sum(purchase_amount)
From
  (Select t.*,
          Sum(case when touch = 'Purchase' then 1 else 0 end) over (partition by user_id order by date) as sm
   From t) t
Group by user_id, sm
We could approach this as a gaps-and-islands problem, where every island ends with a purchase. How do we define the groups? By counting how many purchases lie ahead of the current row (current row included), hence the descending sort in the query.
select user_id, string_agg(touch order by date),
min(date) as first_date, max(date) as max_date,
date_diff(max(date), min(date), day) as cnt_days
from (
select t.*,
countif(touch = 'Purchase') over(partition by user_id order by date desc) as grp
from mytable t
) t
group by user_id, grp
You can create a value for each row that corresponds to the number of instances where table.touch = 'Purchase', which can then be used to group on:
with r as (select row_number() over(order by t1.user_id) rid, t1.* from table t1)
select t3.user_id, group_concat(t3.touch), sum(t3.amount), date_diff(max(t3.date), min(t3.date))
from (select
(select sum(r1.touch = 'Purchase' AND r1.rid < r2.rid) from r r1) c1, r2.* from r r2
) t3
group by t3.c1;
I have data as below. I need to find the minimum date from which a person is active for the latest continuous period.
SCENARIO 1
NAME | STARTDATE | END DATE
--------------------------------------
name | 01-JAN-2016 | 31-DEC-2017
name | 01-JAN-2017 | 31-OCT-2018
name | 01-JAN-2018 | 31-DEC-2019
name | 01-JAN-2019 | 31-DEC-2020
I need output as:
NAME | STARTDATE | END DATE
--------------------------------------
MIKE | 01-01-2018 | 31-12-2020
Scenario 2:-
NAME | STARTDATE | END DATE
--------------------------------------
name | 01-01-2016 | 31-DEC-2017
name | 01-01-2017 | 31-OCT-2018
name | 01-01-2018 | 31-DEC-2019
name | 01-01-2019 | 31-DEC-2020
I need output as:
NAME | STARTDATE | END DATE
--------------------------------------
name | 01-01-2019 | 31-12-2020
So basically the output is the MIN and MAX for the LATEST continuous period for the person.
Hmmm . . . I think you can do this with the following logic:
select name, max(startdate), max_enddate
from (select t.*,
lag(enddate) over (partition by name order by startdate) as prev_enddate,
max(enddate) over (partition by name) as max_enddate
from t
) t
where startdate <= prev_enddate + interval '1 day'
group by name, max_enddate;
The subquery simply gets the previous end date and the overall end date.
The outer query does two things:
Filters for when a new period of activity begins.
Aggregates to take the latest date when it occurs.
Here is a db<>fiddle
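To see what the subquery feeds into that filter, it can be run on its own (a sketch, reusing the table and column names from the answer above):

select t.*,
       lag(enddate) over (partition by name order by startdate) as prev_enddate,
       max(enddate) over (partition by name) as max_enddate
from t
order by name, startdate;

Each row then shows whether its startdate falls within a day of the previous enddate, which is exactly what the outer WHERE clause checks.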
This is a gaps-and-island problem. Here is one approach that uses lag() and a cumulative sum() to build the groups, and then filter on the first group per name:
select name, min(startdate) startdate, max(enddate) enddate
from (
select t.*,
sum(case when startdate = lag_enddate + interval '1 day' or lag_enddate is null then 0 else 1 end) over(partition by name order by startdate) grp
from (
select t.*,
lag(enddate) over(partition by name order by startdate) lag_enddate
from mytable t
) t
) t
where grp = 0
group by name
I would like to create a view based on data in following structure:
CREATE TABLE my_table (
date date,
daily_cumulative_precip float4
);
INSERT INTO my_table (date, daily_cumulative_precip)
VALUES
('2016-07-28', 3.048)
, ('2016-08-04', 2.286)
, ('2016-08-11', 5.334)
, ('2016-08-12', 0.254)
, ('2016-08-13', 2.794)
, ('2016-08-14', 2.286)
, ('2016-08-15', 3.302)
, ('2016-08-17', 3.81)
, ('2016-08-19', 15.746)
, ('2016-08-20', 46.739998);
I would like to accumulate the precipitation for consecutive days only.
Below is the desired result for a different test case - except that days without rain should be omitted:
I have tried window functions with OVER(PARTITION BY date, rain_on_day) but they do not yield the desired result.
How could I solve this?
SELECT date
, dense_rank() OVER (ORDER BY grp) AS consecutive_group_nr -- optional
, daily_cumulative_precip
, sum(daily_cumulative_precip) OVER (PARTITION BY grp ORDER BY date) AS cum_precipitation_mm
FROM (
SELECT date, t.daily_cumulative_precip
, row_number() OVER (ORDER BY date) - t.rn AS grp
FROM (
SELECT generate_series (min(date), max(date), interval '1 day')::date AS date
FROM my_table
) d
LEFT JOIN (SELECT *, row_number() OVER (ORDER BY date) AS rn FROM my_table) t USING (date)
) x
WHERE daily_cumulative_precip > 0
ORDER BY date;
db<>fiddle here
Returns all rainy days with cumulative sums for consecutive days (and a running group number).
Basics:
Select longest continuous sequence
Here's a way to calculate cumulative precipitation without having to explicitly enumerate all dates:
SELECT date, daily_cumulative_precip, sum(daily_cumulative_precip) over (partition by group_num order by date) as cum_precip
FROM
(SELECT date, daily_cumulative_precip, sum(start_group) over (order by date) as group_num
FROM
(SELECT date, daily_cumulative_precip, CASE WHEN (date != prev_date + 1) THEN 1 ELSE 0 END as start_group
FROM
(SELECT date, daily_cumulative_precip, lag(date, 1, '-infinity'::date) over (order by date) as prev_date
FROM my_table) t1) t2) t3
yields
| date | daily_cumulative_precip | cum_precip |
|------------+-------------------------+------------|
| 2016-07-28 | 3.048 | 3.048 |
| 2016-08-04 | 2.286 | 2.286 |
| 2016-08-11 | 5.334 | 5.334 |
| 2016-08-12 | 0.254 | 5.588 |
| 2016-08-13 | 2.794 | 8.382 |
| 2016-08-14 | 2.286 | 10.668 |
| 2016-08-15 | 3.302 | 13.97 |
| 2016-08-17 | 3.81 | 3.81 |
| 2016-08-19 | 15.746 | 15.746 |
| 2016-08-20 | 46.74 | 62.486 |
I'm trying to create a SQL query to merge rows that have equal dates. The idea is to do this based on the highest amount of hours, so that in the end I get the corresponding id for each date with the highest amount of hours. I've been trying to do it with a simple GROUP BY, but that does not seem to work, since I can't just put an aggregate function on the id column; it should be based on the hours condition.
+----+------------+-------+
| id | date       | hours |
+----+------------+-------+
| 1  | 2012-01-01 | 37    |
| 2  | 2012-01-01 | 10    |
| 3  | 2012-01-01 | 5     |
| 4  | 2012-01-02 | 37    |
+----+------------+-------+
desired result
+----+------------+-------+
| id | date       | hours |
+----+------------+-------+
| 1  | 2012-01-01 | 37    |
| 4  | 2012-01-02 | 37    |
+----+------------+-------+
If you want exactly one row -- even if there are ties -- then use row_number():
select t.*
from (select t.*, row_number() over (partition by date order by hours desc) as seqnum
from t
) t
where seqnum = 1;
Ironically, both Postgres and Oracle (the original tags) have what I would consider to be better ways of doing this, but they are quite different.
Postgres:
select distinct on (date) t.*
from t
order by date, hours desc;
Oracle:
select date, max(hours) as hours,
max(id) keep (dense_rank first order by hours desc) as id
from t
group by date;
Here's one approach using row_number:
select id, dt, hours
from (
select id, dt, hours, row_number() over (partition by dt order by hours desc) rn
from yourtable
) t
where rn = 1
You can use a correlated subquery:
select t.*
from table t
where id = (select t1.id
from table t1
where t1.date = t.date
order by t1.hours desc
limit 1);
In Oracle you can use fetch first 1 row only in the subquery instead of the LIMIT clause.
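A sketch of that variant, keeping the same placeholder table and column names as above:

select t.*
from table t
where id = (select t1.id
            from table t1
            where t1.date = t.date
            order by t1.hours desc
            fetch first 1 row only);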
I have this table structure
+----------------+----------------+
| DATE | VALUE |
|----------------|----------------|
| 2015-01-01 | 5 |
| 2015-01-02 | 4 |
| 2015-01-03 | NULL |
| 2015-02-10 | 2 |
| 2015-02-25 | 1 |
+----------------+----------------+
I'm trying to get the most recent non null value within each month. In this case it would be this:
+----------------+----------------+
| MONTH | VALUE |
|----------------|----------------|
| 2015-01 | 4 |
| 2015-02 | 1 |
+----------------+----------------+
I've tried DENSE_RANK but I'm having a difficult time dealing with the null values.
Using:
SELECT TO_CHAR(date,'YYYY-MM'),
MAX(value) KEEP (DENSE_RANK FIRST ORDER BY date DESC)
FROM mytable
GROUP BY TO_CHAR(date,'YYYY-MM')
I'm getting
+----------------+----------------+
| MONTH | VALUE |
|----------------|----------------|
| 2015-01 | NULL |
| 2015-02 | 1 |
+----------------+----------------+
Obviously I'm doing something wrong.
Can you help me figure this out?
Thanks in advance.
EDIT: Unfortunately, adding the condition "WHERE value IS NOT NULL" can't be applied to my situation.
Unfortunately, MAX() KEEP doesn't have an IGNORE NULLS clause, as far as I know. But LAST_VALUE does. So, how about this:
SELECT mth,
       MAX(last_val)
FROM (SELECT TO_CHAR(d, 'YYYY-MM') mth,
             d,
             n,
             LAST_VALUE(n IGNORE NULLS)
               OVER (PARTITION BY TO_CHAR(d, 'YYYY-MM')
                     ORDER BY d ASC
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_val
      FROM matt_test)
GROUP BY mth
This solution eliminates null values and uses row_number to get the latest date and the corresponding value for each month.
select myr, value
from (
SELECT date, value,TO_CHAR(date,'YYYY-MM') myr,
row_number() over(partition by TO_CHAR(date,'YYYY-MM') order by date desc) rn
FROM mytable
where value is not null) t
where rn = 1
I have a personal aversion to constructions like SELECT ... FROM (SELECT ... FROM ...), so this is my proposal:
SELECT DISTINCT
TRUNC(THE_DATE, 'MM') AS MONTH,
FIRST_VALUE(THE_VALUE IGNORE NULLS) OVER (PARTITION BY TRUNC(THE_DATE, 'MM') ORDER BY THE_DATE DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS VALUE
FROM MY_TABLE;
I could not use LAST_VALUE because of the GROUP BY and other reasons.
So this worked for me; example line:
SELECT
MAX(the_value) KEEP (dense_rank LAST ORDER BY (CASE WHEN the_value IS NOT NULL THEN 1 END) NULLS FIRST, the_date) the_value
FROM ...
or like this:
SELECT
MAX(the_value) KEEP (dense_rank FIRST ORDER BY (CASE WHEN the_value IS NOT NULL THEN 1 END) NULLS LAST, the_date DESC) the_value
FROM ...