Running total sum over date in Presto SQL

I'm trying to calculate the cumulative sum of columns t and s over a date from my sample data below, using Presto SQL.
| Date   | T | S |
|--------|---|---|
| 1/2/19 | 2 | 5 |
| 2/1/19 | 5 | 1 |
| 3/1/19 | 1 | 1 |
I would like to get
| Date   | T | S | cum_T | cum_S |
|--------|---|---|-------|-------|
| 1/2/19 | 2 | 5 | 2     | 5     |
| 2/1/19 | 5 | 1 | 7     | 6     |
| 3/1/19 | 1 | 1 | 8     | 7     |
However, when I run the query below in Presto SQL I receive an unexpected error telling me to put columns T and S into the GROUP BY clause.
Is this expected? When I remove the GROUP BY from my query it runs without error, but produces duplicate date rows.
select
    date_trunc('day', tb1.date),
    sum(tb1.S) over (partition by date_trunc('day', tb1.date) order by date_trunc('day', tb1.date) rows unbounded preceding) as cum_S,
    sum(tb1.T) over (partition by date_trunc('day', tb1.date) order by date_trunc('day', tb1.date) rows unbounded preceding) as cum_T
from esi_dpd_bi_esds_prst.points_tb1_use_dedup_18months_vw tb1
where
    tb1.reason_id not in (45, 264, 418, 983, 990, 997, 999, 1574)
    and tb1.group_id not in (22)
    and tb1.point_status not in (3)
    and tb1.date between cast(DATE '2019-01-01' as date) and cast(DATE '2019-01-03' as date)
group by 1
order by date_trunc('day', tb1.date) desc
Error looks like this:
Error: line 3:1: '"sum"(tb1.S) OVER (PARTITION BY "date_trunc"('day', tb1.tb1) ORDER BY "date_trunc"('day', tb1.tb1) ASC ROWS UNBOUNDED PRECEDING)' must be an aggregate expression or appear in GROUP BY clause.

You have an aggregation query and you want to mix the aggregations with window functions. The correct syntax is:
select date_trunc('day', tb1.date),
       sum(tb1.S) as S,
       sum(tb1.T) as T,
       sum(sum(tb1.S)) over (order by date_trunc('day', tb1.date) rows unbounded preceding) as cum_S,
       sum(sum(tb1.T)) over (order by date_trunc('day', tb1.date) rows unbounded preceding) as cum_T
from esi_dpd_bi_esds_prst.points_tb1_use_dedup_18months_vw tb1
where tb1.reason_id not in (45, 264, 418, 983, 990, 997, 999, 1574) and
      tb1.group_id not in (22) and
      tb1.point_status not in (3) and
      tb1.date between cast(DATE '2019-01-01' as date) and cast(DATE '2019-01-03' as date)
group by 1
order by date_trunc('day', tb1.date) desc;
That is, the window function runs after the aggregation and needs to process the aggregated values.
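To see the pattern in isolation, here is a minimal, self-contained sketch that reproduces the expected output from the question; the inline VALUES table and its column names are illustrative stand-ins for the real view:
-- sample rows inlined so the query runs on any Presto/Trino cluster
with sample (dt, T, S) as (
    values
        (date '2019-01-02', 2, 5),
        (date '2019-02-01', 5, 1),
        (date '2019-03-01', 1, 1)
)
select dt,
       sum(T) as T,
       sum(S) as S,
       -- inner sum() is the per-day aggregate produced by GROUP BY;
       -- the outer sum() over (...) windows across those aggregated rows
       sum(sum(T)) over (order by dt rows unbounded preceding) as cum_T,
       sum(sum(S)) over (order by dt rows unbounded preceding) as cum_S
from sample
group by dt
order by dt;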

Related

How to aggregate over date including all prior dates

I am working with a table in Databricks Delta lake. It gets new records appended every month. The field insert_dt indicates when the records are inserted.
| ID | Mrc | insert_dt |
|----|-----|------------|
| 1 | 40 | 2022-01-01 |
| 2 | 30 | 2022-01-01 |
| 3 | 50 | 2022-01-01 |
| 4 | 20 | 2022-02-01 |
| 5 | 45 | 2022-02-01 |
| 6 | 55 | 2022-03-01 |
Now I want to aggregate by insert_dt and calculate the average of Mrc. For each date, the average is taken not just over the records of that date but over all records with dates up to and including it. In this example, that means 3 rows for 2022-01-01, 5 rows for 2022-02-01 and 6 rows for 2022-03-01. The expected results would look like this:
| Mrc | insert_dt |
|-----|------------|
| 40 | 2022-01-01 |
| 37 | 2022-02-01 |
| 40 | 2022-03-01 |
How do I write a query to do that?
I checked the documentation for Databricks Delta Lake (https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html) and it looks like T-SQL, so I think this will work for you, though you may need to tweak it slightly.
The approach is to condense each day to a single point and then use window functions to get the running totals. Note that different days may have different row counts, so you can't just average the daily averages: the per-day averages here are 40, 32.5 and 55, and (40 + 32.5) / 2 = 36.25, not the correct 37 for 2022-02-01.
--Enter the sample data you gave as a CTE for testing
;with cteSample as (
    SELECT * FROM ( VALUES
          (1, 40, CONVERT(date, '2022-01-01'))
        , (2, 30, CONVERT(date, '2022-01-01'))
        , (3, 50, CONVERT(date, '2022-01-01'))
        , (4, 20, CONVERT(date, '2022-02-01'))
        , (5, 45, CONVERT(date, '2022-02-01'))
        , (6, 55, CONVERT(date, '2022-03-01'))
    ) as TabA(ID, Mrc, insert_dt)
)
--Solution begins here: find the total and count for each date,
--because the window functions need a single row per day
, cteGrouped as (
    SELECT insert_dt, SUM(Mrc) as MRCSum, COUNT(*) as MRCCount
    FROM cteSample
    GROUP BY insert_dt
)
--Now use the window function to get the totals "up to today"
, cteTotals as (
    SELECT insert_dt
         , SUM(MRCSum) OVER (ORDER BY insert_dt RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcSum
         , SUM(MRCCount) OVER (ORDER BY insert_dt RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcCount
    FROM cteGrouped as G
)
--Now divide out to get the average to date
SELECT insert_dt, MrcSum/MrcCount as MRCAverage
FROM cteTotals as T
This gives the following output:
| insert_dt  | MRCAverage |
|------------|------------|
| 2022-01-01 | 40 |
| 2022-02-01 | 37 |
| 2022-03-01 | 40 |
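One caveat: MrcSum and MrcCount are integer columns, and T-SQL integer division truncates. That happens to be exact for this sample (120/3, 185/5, 240/6), but for general data you would promote one operand first, e.g. with a hypothetical tweak of the final SELECT:
SELECT insert_dt, 1.0 * MrcSum / MrcCount as MRCAverage
FROM cteTotals as T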
Calculate a running average using a window function (the inner subquery) and then pick only one row per insert_dt, the one with the highest id. I only tested this on PostgreSQL 13, so I'm not sure how much of the SQL standard Delta Lake supports or whether it will work there.
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from the_table
) t
where rn = 1
order by insert_dt;
DB-fiddle demo
Update: if the_table has no id column, use a CTE to add one.
with t_id as (select *, row_number() over (order by insert_dt) id from the_table)
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from t_id
) t
where rn = 1
order by insert_dt;
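For comparison, here is a one-pass sketch using a standard RANGE frame; I have not verified it on Delta Lake, but the frame type is standard SQL and works on PostgreSQL. With ORDER BY insert_dt, all rows sharing a date are peers of the current row, so the running average at any row already covers every row up to and including that date:
select distinct insert_dt,
       avg(mrc) over (order by insert_dt
                      range between unbounded preceding and current row) as mrc
from the_table
order by insert_dt;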

How to combine Cross Join and String Agg in Bigquery with date time difference

I am trying to go from the following table
| user_id | touch | Date | Purchase Amount
| 1 | Impression| 2020-09-12 |0
| 1 | Impression| 2020-10-12 |0
| 1 | Purchase | 2020-10-13 |125$
| 1 | Email | 2020-10-14 |0
| 1 | Impression| 2020-10-15 |0
| 1 | Purchase | 2020-10-30 |122
| 2 | Impression| 2020-10-15 |0
| 2 | Impression| 2020-10-16 |0
| 2 | Email | 2020-10-17 |0
to
| user_id | path | Number of days between First Touch and Purchase | Purchase Amount |
|---------|------|--------------------------------------------------|-----------------|
| 1 | Impression, Impression, Purchase | 2020-10-13 (Purchase) - 2020-09-12 (Impression) | 125$ |
| 1 | Email, Impression, Purchase      | 2020-10-30 (Purchase) - 2020-10-14 (Email)      | 122$ |
| 2 | Impression, Impression, Email    | 2020-12-31 (Fixed date) - 2020-10-15 (Impression) | 0$ |
In essence, I am trying to create a new row for each unique user every time a 'Purchase' is encountered in the comma-separated path string.
I also want the difference between the first touch and the first purchase for each unique user; when a new row is created, we do the same for that user, as shown in the example above.
From the little I have gathered, I need a mixture of cross join and string_agg, but I tried using a case statement within string_agg and was not able to get the required result.
Is there a better way to do it in SQL (BigQuery)?
Thank you
Below is for BigQuery Standard SQL
#standardSQL
select user_id,
       string_agg(touch order by date) as path,
       date_diff(max(date), min(date), day) as days,
       sum(amount) as amount
from (
    select user_id, touch, date, amount,
           countif(touch = 'Purchase') over win as grp
    from `project.dataset.table`
    window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
Applied to the sample data from your question, this produces the expected output.
A follow-up change was then requested: when there is no Purchase among a user's touches, calculate the number of days from a fixed date we have set. That can be added to the query above like so:
select user_id,
       string_agg(touch order by date) as path,
       date_diff(if(countif(touch = 'Purchase') = 0, date '2020-12-31', max(date)), min(date), day) as days,
       sum(amount) as amount
from (
    select user_id, touch, date, amount,
           countif(touch = 'Purchase') over win as grp
    from `project.dataset.table`
    window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
which again produces the desired output.
This means you need a solution that starts a new group of rows each time a purchase appears in touch. Use the following query:
select user_id,
       -- aggregation function according to your requirement,
       sum(purchase_amount)
from (select t.*,
             sum(case when touch = 'Purchase' then 1 else 0 end) over (partition by user_id order by date) as sm
      from t) t
group by user_id, sm
We could approach this as a gaps-and-islands problem, where every island ends with a purchase. How do we define the groups? By counting how many purchases lie ahead (current row included), hence the descending sort in the query.
select user_id, string_agg(touch order by date) as path,
       min(date) as first_date, max(date) as max_date,
       date_diff(max(date), min(date), day) as cnt_days
from (
    select t.*,
           countif(touch = 'Purchase') over(partition by user_id order by date desc) as grp
    from mytable t
) t
group by user_id, grp
You can create a value for each row that counts the preceding instances where touch = 'Purchase', which can then be used to group on:
with r as (select row_number() over(order by t1.user_id, t1.date) as rid, t1.* from `project.dataset.table` t1)
select t3.user_id,
       string_agg(t3.touch order by t3.date) as path,
       sum(t3.amount) as amount,
       date_diff(max(t3.date), min(t3.date), day) as days
from (select
          (select countif(r1.touch = 'Purchase' and r1.rid < r2.rid) from r r1) as c1,
          r2.*
      from r r2
     ) t3
group by t3.user_id, t3.c1;

Subtracting two columns on group by same table SQL

I have this table
create table events
(
    event_type integer not null,
    value integer not null,
    time timestamp not null,
    unique(event_type, time)
);
I want to write a SQL query that, for each event_type that has been registered more than once, returns the difference between the latest value (i.e. the value at the most recent time) and the second-latest value. The table should be ordered by event_type (in ascending order).
Sample data is:
event_type | value | time
-------------+------------+--------------------
2 | 5 | 2015-05-09 12:42:00
4 | -42 | 2015-05-09 13:19:57
2 | 2 | 2015-05-09 14:48:30
2 | 7 | 2015-05-09 12:54:39
3 | 16 | 2015-05-09 13:19:57
3 | 20 | 2015-05-09 15:01:09
The output should be
event_type | value
------------+-----------
2 | -5
3 | 4
So far I tried doing this
SELECT event_type
FROM events
GROUP BY event_type
HAVING COUNT(event_type) > 1
ORDER BY event_type
I cannot find a way to get the right value for the second column I mentioned. I'm using PostgreSQL 9.4.
One way to do it is using lead, which gets the next value of a given column based on a specified ordering. The second-latest row for a given event_type then carries the latest value in next_val, which is what the subtraction needs. (Run the inner query to see how next_val is assigned.)
select event_type, next_val - value as diff
from (select t.*,
             lead(value) over(partition by event_type order by time) as next_val,
             row_number() over(partition by event_type order by time desc) as rnum
      from events t
     ) t
where next_val is not null and rnum = 2
One more option with DISTINCT ON and lead.
select distinct on (event_type) event_type,next_val-value as diff
from (select t.*,lead(value) over(partition by event_type order by time) as next_val
from events t
) t
where next_val is not null
order by event_type,time desc
You can do this using ANSI/ISO standard window functions:
select event_type,
sum(case when seqnum = 1 then value
when seqnum = 2 then - value
end) as diff_latest
from (select e.*,
row_number() over (partition by event_type order by time desc) as seqnum
from events e
) e
where seqnum in (1, 2)
group by event_type
having count(*) = 2;
Here is a SQL Fiddle.

get the id based on condition in group by

I'm trying to create a SQL query that merges rows with equal dates. The idea is to keep, for each date, the row with the highest number of hours, so that in the end I get the corresponding id for each date. A simple GROUP BY does not seem to work, since I can't just put an aggregate function on the id column; the chosen id has to follow from the hours condition.
+----+------------+-------+
| id | date       | hours |
+----+------------+-------+
| 1  | 2012-01-01 | 37    |
| 2  | 2012-01-01 | 10    |
| 3  | 2012-01-01 | 5     |
| 4  | 2012-01-02 | 37    |
+----+------------+-------+
desired result
+----+------------+-------+
| id | date       | hours |
+----+------------+-------+
| 1  | 2012-01-01 | 37    |
| 4  | 2012-01-02 | 37    |
+----+------------+-------+
If you want exactly one row -- even if there are ties -- then use row_number():
select t.*
from (select t.*, row_number() over (partition by date order by hours desc) as seqnum
from t
) t
where seqnum = 1;
Ironically, both Postgres and Oracle (the original tags) have what I would consider to be better ways of doing this, but they are quite different.
Postgres:
select distinct on (date) t.*
from t
order by date, hours desc;
Oracle:
select date, max(hours) as hours,
       max(id) keep (dense_rank first order by hours desc) as id
from t
group by date;
Here's one approach using row_number:
select id, dt, hours
from (
select id, dt, hours, row_number() over (partition by dt order by hours desc) rn
from yourtable
) t
where rn = 1
You can use a correlated subquery:
select t.*
from yourtable t
where id = (select t1.id
            from yourtable t1
            where t1.date = t.date
            order by t1.hours desc
            limit 1);
In Oracle you can use fetch first 1 row only in the subquery instead of the LIMIT clause.
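For reference, a sketch of that Oracle variant (12c or later), keeping the illustrative table name yourtable from above:
-- same correlated subquery, with Oracle's row-limiting clause;
-- note that a column literally named date would need quoting in Oracle
select t.*
from yourtable t
where id = (select t1.id
            from yourtable t1
            where t1.date = t.date
            order by t1.hours desc
            fetch first 1 row only);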

Oracle - Dealing with NULLS in DENSE_RANK

I have this table structure
+----------------+----------------+
| DATE | VALUE |
|----------------|----------------|
| 2015-01-01 | 5 |
| 2015-01-02 | 4 |
| 2015-01-03 | NULL |
| 2015-02-10 | 2 |
| 2015-02-25 | 1 |
+----------------+----------------+
I'm trying to get the most recent non null value within each month. In this case it would be this:
+----------------+----------------+
| MONTH | VALUE |
|----------------|----------------|
| 2015-01 | 4 |
| 2015-02 | 1 |
+----------------+----------------+
I've tried DENSE_RANK but I'm having a difficult time dealing with the null values.
Using:
SELECT TO_CHAR(date,'YYYY-MM'),
MAX(value) KEEP (DENSE_RANK FIRST ORDER BY date DESC)
FROM mytable
GROUP BY TO_CHAR(date,'YYYY-MM')
I'm getting
+----------------+----------------+
| MONTH | VALUE |
|----------------|----------------|
| 2015-01 | NULL |
| 2015-02 | 1 |
+----------------+----------------+
Obviously I'm doing something wrong.
Can you help me figure this out?
Thanks in advance.
EDIT: Unfortunately, adding the condition "WHERE value IS NOT NULL" can't be applied to my situation.
Unfortunately, MAX() KEEP doesn't have an IGNORE NULLS clause, as far as I know. But LAST_VALUE does. So, how about this:
SELECT mth,
       MAX(last_val)
FROM (SELECT TO_CHAR(d, 'YYYY-MM') AS mth,
             d,
             n,
             LAST_VALUE(n IGNORE NULLS)
                 OVER (PARTITION BY TO_CHAR(d, 'YYYY-MM')
                       ORDER BY d ASC
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_val
      FROM matt_test)
GROUP BY mth
This solution eliminates the null values first and uses row_number to get the latest date and the corresponding value for each month.
select myr, value
from (
SELECT date, value,TO_CHAR(date,'YYYY-MM') myr,
row_number() over(partition by TO_CHAR(date,'YYYY-MM') order by date desc) rn
FROM mytable
where value is not null) t
where rn = 1
I have a personal aversion to constructions like SELECT ... FROM (SELECT ... FROM ...), so this is my proposal:
SELECT DISTINCT
       TRUNC(THE_DATE, 'MM') AS MONTH,
       FIRST_VALUE(THE_VALUE IGNORE NULLS)
           OVER (PARTITION BY TRUNC(THE_DATE, 'MM')
                 ORDER BY THE_DATE DESC
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS VALUE
FROM MY_TABLE;
I could not use LAST_VALUE because of GROUP BY and many other reasons, so this worked for me. Example line:
SELECT
MAX(the_value) KEEP (dense_rank LAST ORDER BY (CASE WHEN the_value IS NOT NULL THEN 1 END) NULLS FIRST, the_date) the_value
FROM ...
or like that:
SELECT
MAX(the_value) KEEP (dense_rank FIRST ORDER BY (CASE WHEN the_value IS NOT NULL THEN 1 END) NULLS LAST, the_date DESC) the_value
FROM ...
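For context, here is a sketch of how that last KEEP snippet could slot into the original month grouping; the_date, the_value and my_table are illustrative names following the renaming in the snippets above:
-- most recent non-NULL value per month: NULL values sort first,
-- so DENSE_RANK LAST lands on the latest-dated non-NULL row
SELECT TO_CHAR(the_date, 'YYYY-MM') AS month,
       MAX(the_value) KEEP (DENSE_RANK LAST
           ORDER BY (CASE WHEN the_value IS NOT NULL THEN 1 END) NULLS FIRST,
                    the_date) AS the_value
FROM my_table
GROUP BY TO_CHAR(the_date, 'YYYY-MM')
ORDER BY month;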