Window functions ordered by date when some dates don't exist - sql

Suppose this example query:
select
id
, date
, sum(var) over (partition by id order by date rows 30 preceding) as roll_sum
from tab
When some dates are missing from the date column, the window does not account for the nonexistent dates. How can I make this window aggregation include those missing dates?
Many thanks!

You can join a sequence containing all dates from a desired interval.
select
*
from (
select
d.date,
q.id,
q.roll_sum
from unnest(sequence(date '2000-01-01', date '2030-12-31')) d
left join ( your_query ) q on q.date = d.date
) v
where v.date > (select min(my_date) from tab2)
and v.date < (select max(my_date) from tab2)

In standard SQL, you would typically use a window range specification, like:
select
id,
date,
sum(var) over (
partition by id
order by date
range interval '30' day preceding
) as roll_sum
from tab
However, I am unsure whether Presto supports this syntax. You can resort to a correlated subquery instead:
select
id,
date,
(
select sum(var)
from tab t1
where
t1.id = t.id
and t1.date >= t.date - interval '30' day
and t1.date <= t.date
) roll_sum
from tab t
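
As a quick sanity check of the correlated subquery idea, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite is only a stand-in for Presto here, so the date arithmetic uses SQLite's date(x, '-30 day') instead of interval '30' day, and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for Presto; dates are stored as ISO-8601 text,
# so plain string comparison orders them correctly.
conn = sqlite3.connect(":memory:")
conn.execute("create table tab (id integer, date text, var integer)")
conn.executemany(
    "insert into tab values (?, ?, ?)",
    [
        (1, "2024-01-01", 10),
        (1, "2024-01-02", 20),
        (1, "2024-03-01", 5),  # nothing recorded between Jan 2 and Mar 1
    ],
)

roll_rows = conn.execute(
    """
    select id, date,
           (select sum(var)
            from tab t1
            where t1.id = t.id
              and t1.date >= date(t.date, '-30 day')
              and t1.date <= t.date) as roll_sum
    from tab t
    order by id, date
    """
).fetchall()

for row in roll_rows:
    print(row)
```

The March row only sums itself, because the January rows fall outside its 30-day calendar window. A frame of rows 30 preceding would have added them in.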

I don't think Presto supports window functions with interval ranges. Alas. There is an old-fashioned way of doing this, by counting the "ins" and "outs" of values:
with t as (
select id, date, var, 1 as is_orig
from tab
union all
select id, date + interval '30' day, -var, 0
from tab
)
select id.*
from (select id, date, sum(sum(var)) over (partition by id order by date) as running_30,
sum(is_orig) as is_orig
from t
group by id, date
) id
where is_orig > 0
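
To see how the "ins and outs" trick behaves, here is a small sketch in Python with an in-memory SQLite database (assuming SQLite 3.25 or later for window functions). SQLite stands in for Presto, the sample rows are invented, and the per-date totals are computed in a separate CTE before the running sum is taken. Note that with a +30 day offset a value stops counting exactly 30 days after its date, so the window covers the current day and the 29 days before it.

```python
import sqlite3

# Assumption: SQLite stands in for Presto; table and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("create table tab (id integer, date text, var integer)")
conn.executemany(
    "insert into tab values (?, ?, ?)",
    [(1, "2024-01-01", 10), (1, "2024-01-02", 20), (1, "2024-03-01", 5)],
)

in_out_rows = conn.execute(
    """
    with events as (
        -- "in": the value enters the window on its own date
        select id, date, var, 1 as is_orig from tab
        union all
        -- "out": the same value leaves the window 30 days later
        select id, date(date, '+30 day'), -var, 0 from tab
    ),
    daily as (
        select id, date, sum(var) as delta, sum(is_orig) as is_orig
        from events
        group by id, date
    )
    select id, date, running_30
    from (
        select id, date, is_orig,
               sum(delta) over (partition by id order by date) as running_30
        from daily
    )
    where is_orig > 0   -- keep only dates that exist in the original table
    order by id, date
    """
).fetchall()

for row in in_out_rows:
    print(row)
```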

Related

BigQuery: 'join lateral' alternative for referencing value in subquery

I have a BigQuery table that holds append-only data - each time an entity is updated a new version of it is inserted. Each entity has its unique ID and each entry has a timestamp of when it was inserted.
When querying for the latest version of the entity, I order by rank, partition by id, and select the most recent version.
I want to take advantage of this and chart the progression of these entities over time. For example, I would like to generate a row for each day since Jan. 1st, with a summary of the entities as they were on that day. In postgres, I would do:
select
...
from generate_series('2022-01-01'::timestamp, '2022-09-01'::timestamp, '1 day'::interval) query_date
left join lateral (
select *
from (
with snapshot as (
select distinct on (id) *
from table
where "createdOn" <= query_date
order by id, "createdOn" desc
)
This basically behaves like a for-each, having each subquery run once for each query_date (day, in this instance) which I can reference in the where clause. Each subquery then filters the data so that it only uses data up to a certain time.
I know that I can create a saved query for the "subquery" logic and then schedule a prefill to run once for each day over the timeline, but I would like to understand how to write an exploratory query.
EDIT 1
Using a correlated subquery is a step in the right direction, but does not work when the subquery needs to join with another table (another append-only table holding a related entity).
So this works:
select
day
, (
select count(*)
from `table` t
where date(createdOn) < day
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
But if I need the subquery to join with another table, like in:
select
day
, (
select as struct *
from (
select
id
, status
, rank() over (partition by id order by createdOn desc) as rank
from `table1`
where date(createdOn) < day
qualify rank = 1
) t1
left join (
select
id
, other
, rank() over (partition by id order by createdOn desc) as rank
from `table2`
where date(createdOn) < day
qualify rank = 1
) t2 on t2.other = t1.id
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
I get an error saying "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN." Another SO question about that error (Avoid correlated subqueries error in BigQuery) solves the issue by moving the correlated query to a join in the top query, which misses what I am trying to achieve.
It took me a while, but I figured out a way to do this using the answer in Bigquery: WHERE clause using column from outside the subquery.
Basically, it requires flipping the order of the queries. Here's how it's done:
select *
from (
select *
from `table1` t1
JOIN (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
ON (t1.createdOn) < day.day
QUALIFY ROW_NUMBER() OVER (PARTITION BY day, t1.id ORDER BY t1.createdOn desc) = 1
) t1
left join (
select
* -- aggregate here
from (
SELECT
id, other, createdOn
FROM `table2` t2
JOIN (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
ON (t2.createdOn) < day.day
QUALIFY ROW_NUMBER() OVER (PARTITION BY day, t2.id ORDER BY t2.createdOn desc) = 1
) snapshot
group by snapshot.other, day
) t2 on t2.other = t1.id and t2.day = t1.day
group by t1.day
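
The core of the flipped approach (join every day to all earlier versions, then keep the newest version per day and id) can be tried out on a tiny dataset. The sketch below uses Python with an in-memory SQLite database as a stand-in for BigQuery, assuming SQLite 3.25 or later. The day list comes from a recursive CTE instead of generate_timestamp_array, QUALIFY is replaced by a filter on a row_number() column, and the sample rows are made up.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; table1 and its rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (id integer, status text, createdOn text)")
conn.executemany(
    "insert into table1 values (?, ?, ?)",
    [
        (1, "new", "2022-01-01"),
        (1, "done", "2022-01-03"),  # newer version of entity 1
        (2, "new", "2022-01-02"),
    ],
)

snapshot_rows = conn.execute(
    """
    with recursive days(day) as (
        select '2022-01-02'
        union all
        select date(day, '+1 day') from days where day < '2022-01-04'
    ),
    ranked as (
        -- every day is paired with all versions created before it
        select d.day, t.id, t.status,
               row_number() over (
                   partition by d.day, t.id
                   order by t.createdOn desc
               ) as rn
        from days d
        join table1 t on t.createdOn < d.day
    )
    select day, id, status
    from ranked
    where rn = 1          -- newest version per (day, id)
    order by day, id
    """
).fetchall()

for row in snapshot_rows:
    print(row)
```

Each output row is the state of one entity as it was on that day, which can then be aggregated per day.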

How to get max date among others ids for current id using BigQuery?

I need to get the max date for each row over the other ids. Of course I can do this with CROSS JOIN and JOIN.
Like this
WITH t AS (
SELECT 1 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-09-01','2021-09-09', INTERVAL 1 DAY)) rep_date
UNION ALL
SELECT 2 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-20','2021-09-03', INTERVAL 1 DAY)) rep_date
UNION ALL
SELECT 3 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-25','2021-09-05', INTERVAL 1 DAY)) rep_date
)
SELECT id, rep_date, MAX(rep_date) OVER (PARTITION BY id) max_date, max_date_over_others FROM t
JOIN (
SELECT t.id, MAX(max_date) max_date_over_others FROM t
CROSS JOIN (
SELECT id, MAX(rep_date) max_date FROM t
GROUP BY 1
) t1
WHERE t1.id <> t.id
GROUP BY 1
) USING (id)
But it's too weird for huge tables, so I'm looking for a simpler way to do this. Any ideas?
Your version is good enough, I think. But if you want to try other options, consider the approach below. It might look more verbose at first glance, but it should be more efficient and cheaper than your version with the cross join:
temp as (
select id,
greatest(
ifnull(max(max_date_for_id) over preceding_ids, '1970-01-01'),
ifnull(max(max_date_for_id) over following_ids, '1970-01-01')
) as max_date_for_rest_ids
from (
select id, max(rep_date) max_date_for_id
from t
group by id
)
window
preceding_ids as (order by id rows between unbounded preceding and 1 preceding),
following_ids as (order by id rows between 1 following and unbounded following)
)
select *
from t
join temp
using (id)
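
For anyone who wants to check the two-window idea on a small example, here is a sketch in Python with an in-memory SQLite database standing in for BigQuery (assuming SQLite 3.25 or later). SQLite has no greatest(), so its two-argument scalar max() is used instead, and only a few of the question's dates are loaded, enough to give the same per-id maximums.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; only the first and last date
# of each id from the question are loaded.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, rep_date text)")
conn.executemany(
    "insert into t values (?, ?)",
    [
        (1, "2021-09-01"), (1, "2021-09-09"),
        (2, "2021-08-20"), (2, "2021-09-03"),
        (3, "2021-08-25"), (3, "2021-09-05"),
    ],
)

others_rows = conn.execute(
    """
    with per_id as (
        select id, max(rep_date) as max_date_for_id
        from t
        group by id
    )
    select id,
           max(  -- scalar two-argument max, used here in place of greatest()
               ifnull(max(max_date_for_id) over (
                   order by id rows between unbounded preceding and 1 preceding
               ), '1970-01-01'),
               ifnull(max(max_date_for_id) over (
                   order by id rows between 1 following and unbounded following
               ), '1970-01-01')
           ) as max_date_for_rest_ids
    from per_id
    order by id
    """
).fetchall()

for row in others_rows:
    print(row)
```

The two frames together cover every id except the current one, so no cross join is needed.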
Assuming your original table data just has columns id and dt - wouldn't this solve it? I'm using the fact that if an id has the max dt of everything, then it gets the second-highest over the other id values.
WITH max_dates AS
(
SELECT
id,
MAX(dt) AS max_dt
FROM
data
GROUP BY
id
),
with_top1_value AS
(
SELECT
*,
MAX(max_dt) OVER () AS max_overall_dt1,
MIN(max_dt) OVER () AS min_overall_dt
FROM
max_dates
),
with_top2_values AS
(
SELECT
*,
MAX(CASE WHEN max_dt = max_overall_dt1 THEN min_overall_dt ELSE max_dt END) OVER () AS max_overall_dt2
FROM
with_top1_value
)
SELECT
*,
CASE WHEN max_dt = max_overall_dt1 THEN max_overall_dt2 ELSE max_overall_dt1 END AS max_dt_of_others
FROM
with_top2_values

Using LAG to compare the data from today and 7 days ago (not between)

I am currently trying to compare aggregated numbers from today and exactly 7 days ago (not between today and 7 days ago, but instead simply comparing these two discrete dates).
I already have a way of doing it using a lot of subqueries, but the performance is bad, and I am now trying to optimize.
This is what I have come up with so far (sample query, not with real table names and columns due to confidentiality):
Select current_date, previous_date, current_sum, previous_sum, percentage
From (Select date as current_date, sum(numbers) as current_sum,
lag (sum(numbers)) over (partition by date order by date) as previous_sum,
(Select max(date)-7 From t1 ) as previous_date,
(current_sum - previous_sum)*100/current_sum as percentage
From t1 where date>=sysdate-7 group by date,previous_date)
But I am definitely doing something wrong since in the output the previous_sum appears null, and naturally the percentage too.
Any ideas on what I am doing wrong? I haven't used LAG before so it must be something there.
Thanks!
Using a join of pre-aggregated subqueries:
with agg as (
select sum(numbers) as sum_numbers, date from t1 group by date
)
select curr.sum_numbers as current_sum,
prev.sum_numbers as prev_sum,
curr.date as curr_date,
prev.date as prev_date
from agg curr
left join agg prev on curr.date-7=prev.date
Using lag:
with agg as (
select sum(numbers) as sum_numbers, date from t1 group by date
)
select sum_numbers as current_sum,
lag(sum_numbers, 7) over(order by date) as prev_sum,
a.date as curr_date,
lag(a.date,7) over(order by date) as prev_date
from agg a
If you want exactly 2 dates only (today and today-7), then it can be done much more simply using conditional aggregation and a filter:
select sum(case when date = trunc(sysdate) then numbers else null end) as current_sum,
sum(case when date = trunc(sysdate-7) then numbers else null end) as previous_sum,
trunc(sysdate) as curr_date,
trunc(sysdate-7) as prev_date,
(current_sum - previous_sum)*100/current_sum as percentage
from t1 where date = trunc(sysdate) or date = trunc(sysdate-7)
You can do this with window (analytic) functions, which should be the fastest method. Your actual aggregation query is a bit unclear, but I think it is:
select date as current_date, sum(numbers) as current_sum
from t1
group by date;
If you have values for all dates, then use:
select date as current_date, sum(numbers) as current_sum,
lag(sum(numbers), 7) over (order by date) as prev_7_sum
from t1
group by date;
If you don't have data for all days, then use a window frame:
select date as current_date, sum(numbers) as current_sum,
max(sum(numbers)) over (order by date range between interval '7' day preceding and interval '7' day preceding) as prev_7_sum
from t1
group by date;
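
The difference between lag(..., 7) and a range frame only shows up when a day is missing, so here is a small sketch that puts them side by side. It uses Python with an in-memory SQLite database as a stand-in engine (assuming SQLite 3.28 or later for range frames with offsets). SQLite needs a numeric ordering expression for such a frame, so the date is converted with julianday(), and the daily table with its values is made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; "daily" holds one
# pre-aggregated row per date, and January 3 is deliberately missing.
conn = sqlite3.connect(":memory:")
conn.execute("create table daily (date text, s integer)")
conn.executemany(
    "insert into daily values (?, ?)",
    [(f"2024-01-{d:02d}", d * 10) for d in (1, 2, 4, 5, 6, 7, 8, 9, 10)],
)

compare_rows = conn.execute(
    """
    select date,
           -- 7 rows back, whatever date that row happens to have
           lag(s, 7) over (order by date) as by_lag,
           -- exactly 7 calendar days back, or NULL if that day is missing
           max(s) over (
               order by cast(julianday(date) as integer)
               range between 7 preceding and 7 preceding
           ) as by_range
    from daily
    order by date
    """
).fetchall()

for row in compare_rows:
    print(row)
```

From January 8 onward the two columns disagree: lag counts rows, so the missing day shifts it by one, while the range frame always looks at the date exactly seven days earlier.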

How to return 0 value if no record exists in bigquery

I need to find out for which dates no record exists in a BigQuery table.
Please find the query below:
select cast(creat_ts as date) as create,IFNULL(count(*) ,0)
FROM table
where cast(creat_ts as date)='2020-06-23' group by 1
Below is for BigQuery Standard SQL
#standardSQL
SELECT DISTINCT day
FROM UNNEST(GENERATE_DATE_ARRAY('2020-06-01', '2020-06-30')) day
LEFT JOIN `project.dataset.table` t
ON CAST(creat_ts AS DATE) = day
WHERE creat_ts IS NULL
You could try something like this:
with calendar as (
select * from unnest(generate_date_array('2020-01-01', '2020-07-01', interval 1 day)) date
),
temp as (
select cast(b.create_ts as date) as date from `project.dataset.table` b
),
daily_count as (
select
date,
count(temp.date) as ct
from calendar
left join temp using(date)
group by 1
)
select * from daily_count
where ct = 0
order by 1
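
The calendar-plus-left-join pattern is easy to try locally. Below is a minimal sketch in Python with an in-memory SQLite database standing in for BigQuery: the calendar comes from a recursive CTE instead of generate_date_array, and the events table with its timestamps is invented for the example.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; "events" and its rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("create table events (create_ts text)")
conn.executemany(
    "insert into events values (?)",
    [
        ("2020-06-01 08:00:00",),
        ("2020-06-01 17:30:00",),
        ("2020-06-03 09:15:00",),
    ],
)

missing_days = [
    r[0]
    for r in conn.execute(
        """
        with recursive calendar(day) as (
            select '2020-06-01'
            union all
            select date(day, '+1 day') from calendar where day < '2020-06-04'
        )
        select c.day
        from calendar c
        left join events e on date(e.create_ts) = c.day
        group by c.day
        having count(e.create_ts) = 0   -- count() ignores the NULLs from unmatched days
        order by c.day
        """
    )
]

print(missing_days)
```

Dropping the having clause and selecting count(e.create_ts) instead gives one row per day with 0 for the days that have no records.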

SQL Get last 7 days from event date

The best way to explain what I need is showing, so, here it is:
Currently I have this query
select
date_
,count(*) as count_
from table
group by date_
which returns the following result
Now I need to add a new column that shows the count of all the previous 7 days, relative to the row's date_.
So, if the row is for day 29/06, I have to count all occurrences on that day (my query is already doing that) and also all occurrences from day 22/06 to 29/06
The result should be something like this:
If you have values for all dates, without gaps, then you can use window functions with a rows frame:
select
date,
count(*) cnt,
sum(count(*)) over(order by date rows between 7 preceding and current row) cnt_d7
from mytable
group by date
order by date
You can try something like this:
select
date_,
count(*) as count_,
(select count(*)
from table as b
where b.date_ <= a.date_ and b.date_ > a.date_ - interval '7 days'
) as count7days_
from table as a
group by date_
If you have gaps, you can do a more complicated solution where you add and subtract the values:
with t as (
select date_, count(*) as count_
from table
group by date_
union all
select date_ + interval '8 day', -count(*) as count_
from table
group by date_
)
select date_,
sum(sum(count_)) over (order by date_ rows between unbounded preceding and current row) - sum(count_)
from t
group by date_;
The - sum(count_) is because you do not seem to want the current day in the cumulated amount.
You can also use the nasty self-join approach . . . which should be okay for 7 days:
with t as (
select date_, count(*) as count_
from table
group by date_
)
select t.date_, t.count_, sum(tprev.count_)
from t left join
t tprev
on tprev.date_ >= t.date_ - interval '7 day' and
tprev.date_ < t.date_
group by t.date_, t.count_;
The performance will get worse and worse as "7" gets bigger.
Try with subquery for the new column:
select
table.date_ as groupdate,
count(table.date_) as date_count,
(select count(table.date_)
from table
where table.date_ <= groupdate and table.date_ >= groupdate - interval '7 day'
) as total7
from table
group by groupdate
order by groupdate