(SQL BigQuery) Using LAG but data contains missing months

I have the following table with monthly data, but we do not have the third month.

DATE        FREQUENCY
2021-01-01  6000
2021-02-01  4533
2021-04-01  7742
2021-05-01  1547
2021-06-01  9857
I want to get the frequency of the previous month into the following table.

DATE        FREQUENCY  PREVIOUS_MONTH_FREQ
2021-01-01  6000       NULL
2021-02-01  4533       6000
2021-04-01  7742       NULL
2021-05-01  1547       7742
2021-06-01  9857       1547
I want the 2021-04-01 record to have NULL for the PREVIOUS_MONTH_FREQ since there is no data for the previous month.
I got so far as...
SELECT DATE,
       FREQUENCY,
       LAG(FREQUENCY) OVER (ORDER BY DATE) AS PREVIOUS_MONTH_FREQ
FROM Table1

Use a CASE expression to check if the previous row contains data of the previous month:
SELECT DATE,
       FREQUENCY,
       CASE WHEN DATE_SUB(DATE, INTERVAL 1 MONTH) = LAG(DATE) OVER (ORDER BY DATE)
            THEN LAG(FREQUENCY) OVER (ORDER BY DATE)
       END AS PREVIOUS_MONTH_FREQ
FROM Table1
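The gap check is easy to sanity-test outside SQL. Here is a minimal Python sketch (data hardcoded from the question, helper names made up) that mirrors the CASE + LAG logic:

```python
from datetime import date

# (month, frequency) rows from the question; 2021-03 is missing
rows = [
    (date(2021, 1, 1), 6000),
    (date(2021, 2, 1), 4533),
    (date(2021, 4, 1), 7742),
    (date(2021, 5, 1), 1547),
    (date(2021, 6, 1), 9857),
]

def prev_month(d):
    # first day of the calendar month before d
    return date(d.year - 1, 12, 1) if d.month == 1 else date(d.year, d.month - 1, 1)

freq = {d: f for d, f in rows}

# None (NULL) whenever the preceding calendar month has no row,
# which is exactly what the CASE around LAG enforces
result = [(d, f, freq.get(prev_month(d))) for d, f in rows]
```

As in the desired output, 2021-04-01 gets None because there is no 2021-03-01 row.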

In BigQuery, you can use a RANGE window specification. The only trick is that you need a number rather than a date:
select t.*,
       max(frequency) over (order by date_diff(date, date '2000-01-01', month)
                            range between 1 preceding and 1 preceding
                           ) as prev_frequency
from t;
The '2000-01-01' is an arbitrary anchor: the DATE_DIFF turns the date column into the number of months since that date, and the actual date chosen is not important.
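The month-numbering trick can be sketched in Python to see why a RANGE frame of 1 PRECEDING lines up with calendar months (an illustrative sketch, not BigQuery code; names are made up):

```python
from datetime import date

def months_since(d, epoch=date(2000, 1, 1)):
    # same idea as DATE_DIFF(date, DATE '2000-01-01', MONTH):
    # the number of whole calendar months between the two dates
    return (d.year - epoch.year) * 12 + (d.month - epoch.month)

# frequencies keyed by month number; 2021-03 is missing
rows = {months_since(date(2021, m, 1)): f
        for m, f in [(1, 6000), (2, 4533), (4, 7742), (5, 1547), (6, 9857)]}

# RANGE BETWEEN 1 PRECEDING AND 1 PRECEDING on this number
# looks at exactly the previous month's position
prev = {pos: rows.get(pos - 1) for pos in rows}
```

Because 2021-03 has no position in `rows`, 2021-04's lookup comes back empty, just as the RANGE frame yields NULL there.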


Snowflake SQL, converting number YYYYWW into date

In Snowflake I have a numeric column with values like 202201, 202305, 202248 that refer to a year/week combination. How can I convert it into the first or last day of that week?
For example, 202251 would be '2022-12-19' or '2022-12-25' (as the week starts on Monday).
Thanks for the help.
I have tried
select distinct week_id
,to_date(concat
(
substr(week_id,1,4)
,substr(week_id,5,2)
)
,'YYYYWW'
) as Date_value
from MyTable
but I only got the error message: Can't parse '202249' as date with format 'YYYYWW'.
If we swap to the TRY_TO_DATE form, it will not explode, which makes for simpler debugging.
And when we try 'YYYYWW' we see it fails:
select week_id
,try_to_date(week_id, 'YYYYWW') as Date_value
from values
('202201'),
('202305'),
('202248'),
('202249')
t(week_id);
WEEK_ID  DATE_VALUE
202201   null
202305   null
202248   null
202249   null
Swapping to a substring for just the year, and a number for the week:
select week_id
,try_to_date(left(week_id,4), 'YYYY') as just_year
,try_to_number(substr(week_id,5,2)) as week_num
from values
('202201'),
('202305'),
('202248'),
('202249')
t(week_id);
Now we get those parts as something workable.

WEEK_ID  JUST_YEAR   WEEK_NUM
202201   2022-01-01  1
202305   2023-01-01  5
202248   2022-01-01  48
202249   2022-01-01  49
Now we can use DATEADD with WEEK like so:
select week_id
,try_to_date(left(week_id,4), 'YYYY') as just_year
,try_to_number(substr(week_id,5,2)) as week_num
,dateadd(week, week_num, just_year) as answer
from values
('202201'),
('202305'),
('202248'),
('202249')
t(week_id);
WEEK_ID  JUST_YEAR   WEEK_NUM  ANSWER
202201   2022-01-01  1         2022-01-08
202305   2023-01-01  5         2023-02-05
202248   2022-01-01  48        2022-12-03
202249   2022-01-01  49        2022-12-10
which can all be merged into one like:
select week_id
,dateadd(week, try_to_number(substr(week_id,5,2)), try_to_date(left(week_id,4), 'YYYY')) as answer
from values
('202201'),
('202305'),
('202248'),
('202249')
t(week_id);
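The arithmetic of that one-liner can be mirrored in Python (an illustrative sketch; note it offsets whole weeks from Jan 1 of the year, which is what this DATEADD does, rather than using ISO Monday-based weeks):

```python
from datetime import date, timedelta

def yyyyww_to_date(week_id):
    # mirrors dateadd(week, <week number>, <Jan 1 of the year>)
    year, week = int(week_id[:4]), int(week_id[4:6])
    return date(year, 1, 1) + timedelta(weeks=week)
```

For instance, `yyyyww_to_date('202248')` gives 2022-12-03, matching the ANSWER column above.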

Query a 30 day interval for every 30 day interval in the last year

I want to query every 30 day interval in 2021, but I don't know how to do it without a for loop in SQL.
Here's pseudocode of what I want to do, with a table called _table and a date column called application_date:
for _day in range(335):
select '2021-01-01' + _day as start_date, count(*) as _count
from _table
where '2021-01-01' + _day <= application_date <= ('2021-01-01' + _day + interval '30' day )
It would output something like this:
start_date  _count
2021-01-01  {number of rows between 2021-01-01 and 2021-01-31}
2021-01-02  {number of rows between 2021-01-02 and 2021-02-01}
...
2021-11-30  {number of rows between 2021-11-30 and 2021-12-30}
2021-12-01  {number of rows between 2021-12-01 and 2021-12-31}
Assuming that you have rows for each day, you can group the data by date, count rows in each group, and then use the sum window function over a frame of 31 rows (the current row plus the 30 following; note that {rows between 2021-01-01 and 2021-01-31} span an interval of 31 days, not 30):
-- sample data
WITH dataset(start_date) AS (
VALUES (date '2021-01-01'),
(date '2021-01-01'),
(date '2021-01-01'),
(date '2021-01-02'),
(date '2021-01-03'),
(date '2021-01-03')
)
-- query
select start_date
, sum(cnt) over (order by start_date ROWS BETWEEN CURRENT ROW AND 30 FOLLOWING) rolling_count_31_days
from (
select start_date
, count(*) cnt
from dataset
where year(start_date) = 2021
group by start_date
)
Output:
start_date  rolling_count_31_days
2021-01-01  6
2021-01-02  3
2021-01-03  2
If some dates are missing, check out answers describing how to generate the missing dates and add them to the grouped result with cnt set to 0.
Note that Trino (the new name for PrestoSQL) has updated support for the RANGE frame type, so you can implement this without needing to insert the missing rows.
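The frame logic can be cross-checked with a small Python sketch (sample data from the answer; names are made up):

```python
from datetime import date, timedelta
from collections import Counter

# one entry per row, as in the sample dataset
dates = [date(2021, 1, 1)] * 3 + [date(2021, 1, 2)] + [date(2021, 1, 3)] * 2

cnt = Counter(dates)  # count(*) grouped by date

# for each distinct start date, count rows in [start, start + 30 days],
# the same 31-day span as ROWS BETWEEN CURRENT ROW AND 30 FOLLOWING
rolling = {d: sum(cnt.get(d + timedelta(days=i), 0) for i in range(31))
           for d in cnt}
```

Unlike the ROWS frame, this version counts by actual calendar dates, so it stays correct even when some dates are missing from the data.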

Select Sum of Grouped Values over Date Range (Window Function)

I have a table of names, dates and numeric values. For each name, I want the first date entry and the sum of the numeric values for the first 90 days after that first date.
E.g.:

name  date        value
Joe   2020-10-30  3
Bob   2020-12-23  5
Joe   2021-01-03  7
Joe   2021-05-30  2
I want a query that returns
name  min_date    sum_first_90_days
Joe   2020-10-30  10
Bob   2020-12-23  5
So far I have
SELECT name, min(date) min_date,
sum(value) over (partition by name
order by date
rows between min(date) and dateadd(day,90,min(date))
) as first_90_days_sum
FROM table
but it's not executing. What's a good approach here? How can I set up a window function to use a dynamic date range for each partition?
You can use window functions and aggregation:
select name, sum(value)
from (select t.*,
min(date) over (partition by name) as min_date
from t
) t
where date <= min_date + interval '90 day'
group by name;
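The same filter-then-aggregate idea in a quick Python sketch (data hardcoded from the question; variable names are made up):

```python
from datetime import date, timedelta

rows = [
    ("Joe", date(2020, 10, 30), 3),
    ("Bob", date(2020, 12, 23), 5),
    ("Joe", date(2021, 1, 3), 7),
    ("Joe", date(2021, 5, 30), 2),
]

# min(date) over (partition by name)
min_date = {}
for name, d, _ in rows:
    min_date[name] = min(min_date.get(name, d), d)

# keep rows within 90 days of each name's first date, then sum per name
sums = {}
for name, d, v in rows:
    if d <= min_date[name] + timedelta(days=90):
        sums[name] = sums.get(name, 0) + v
```

This reproduces the desired output: Joe 10 (the 2021-05-30 row falls outside the 90-day window), Bob 5.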

SQL BigQuery: For the current month, count the number of distinct CustomerIDs in the previous 3 months

The following table has distinct CustomerID values and DATE_TRUNC(Date, MONTH) called DATE.
DATE        CustomerID
2021-01-01  111
2021-01-01  112
2021-02-01  111
2021-03-01  113
2021-03-01  115
2021-04-01  119
For a given month M, I want to get the count of distinct CustomerIDs of the three previous months combined. E.g. for the month of July (7), I want the distinct count of CustomerIDs from April (4) through June (6). I do not want July's own customers to be included in July's record.
So the output will be like:
DATE        CustomerID Count
2021-01-01  535
2021-02-01  657
2021-03-01  777
2021-04-01  436
2021-05-01  879
2021-06-01  691
Consider below
select distinct date,
( select count(distinct id)
from t.prev_3_month_customers id
) customerid_count
from (
select *,
array_agg(customerid) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_customers,
from (
select *, date_diff(date, '2000-01-01', month) pos
from `project.dataset.table`
)
) t
You can also solve this problem by creating a record for each of the three following months and then aggregating:
select date_add(date, interval n month) as month,
count(distinct customerid)
from t cross join
unnest(generate_array(1, 3, 1)) n
group by month;
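The "fan each row out to the three following months" idea can be sketched in Python (sample data from the question; add_months is a hypothetical helper):

```python
from datetime import date
from collections import defaultdict

rows = [
    (date(2021, 1, 1), 111), (date(2021, 1, 1), 112),
    (date(2021, 2, 1), 111),
    (date(2021, 3, 1), 113), (date(2021, 3, 1), 115),
    (date(2021, 4, 1), 119),
]

def add_months(d, n):
    # first day of the month n months after d
    m = d.month - 1 + n
    return date(d.year + m // 12, m % 12 + 1, 1)

# each row contributes its customer to the 3 following months,
# mirroring CROSS JOIN UNNEST(GENERATE_ARRAY(1, 3, 1))
seen = defaultdict(set)
for d, cust in rows:
    for n in range(1, 4):
        seen[add_months(d, n)].add(cust)

counts = {month: len(custs) for month, custs in seen.items()}
```

For 2021-04-01 this counts {111, 112, 113, 115}, i.e. 4 distinct customers from January through March.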
BigQuery ran out of memory running this, since we have lots of data.
In cases like this, the most scalable and performant approach is to use HyperLogLog++ functions, as in the example below:
select distinct date,
( select hll_count.merge(sketch)
from t.prev_3_month_sketches sketch
) customerid_count
from (
select *,
array_agg(customers_hhl) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_sketches,
from (
select date_diff(date, '2000-01-01', month) pos,
min(date) date,
hll_count.init(customerid) customers_hhl
from `project.dataset.table`
group by pos
)
) t
Note: HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.

How to aggregate percentile_disc() function over date time

I have a table like the following:
recorddate  score
2021-05-01  0
2021-05-01  1
2021-05-01  2
2021-05-02  3
2021-05-02  4
2021-05-03  5
2021-05-07  6
And want to get the 60th percentile for score per week. I tried:
select distinct
recorddate
, PERCENTILE_disc(0.60) WITHIN GROUP (ORDER BY score)
OVER (PARTITION BY recorddate) AS top60
from tbl;
It returned something like this:
recorddate  top60
2021-05-01  1
2021-05-02  4
2021-05-03  5
2021-05-07  6
But my desired result is a weekly aggregation (7 days).
For example, for the week ending on 2021-05-07:

recorddate               top60
2021-05-01 ~ 2021-05-07  2
Is there a solution for this?
I think you want this:
SELECT date_trunc('week', recorddate) AS week
, percentile_disc(0.60) WITHIN GROUP(ORDER BY score) AS top60
FROM tbl
GROUP BY 1;
That's the discrete value at the 60th percentile for each week (where actual data exists), i.e. where 60% of the rows in the same group (in the week) are the same or smaller. To be precise, in the words of the manual:
the first value within the ordered set of aggregated argument values whose position in the ordering equals or exceeds the specified fraction.
Adding your format on top of it:
SELECT to_char(week_start, 'YYYY-MM-DD" ~ "')
|| to_char(week_start + interval '6 days', 'YYYY-MM-DD') AS week
, top60
FROM (
SELECT date_trunc('week', recorddate) AS week_start
, percentile_disc(0.60) WITHIN GROUP(ORDER BY score) AS top60
FROM tbl
GROUP BY 1
) sub;
I would rather call it something like "percentile_60".
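For reference, the discrete-percentile rule quoted from the manual can be spelled out in Python (an illustrative sketch with the question's data; week_start mimics date_trunc('week', ...)):

```python
from datetime import date, timedelta
from math import ceil

rows = [
    (date(2021, 5, 1), 0), (date(2021, 5, 1), 1), (date(2021, 5, 1), 2),
    (date(2021, 5, 2), 3), (date(2021, 5, 2), 4),
    (date(2021, 5, 3), 5), (date(2021, 5, 7), 6),
]

def week_start(d):
    # Monday of d's week, like Postgres date_trunc('week', d)
    return d - timedelta(days=d.weekday())

def percentile_disc(values, fraction):
    # first value whose position in the ordered set
    # equals or exceeds the specified fraction
    ordered = sorted(values)
    return ordered[ceil(fraction * len(ordered)) - 1]

groups = {}
for d, score in rows:
    groups.setdefault(week_start(d), []).append(score)

top60 = {wk: percentile_disc(scores, 0.60) for wk, scores in groups.items()}
```

Note that the sample's first five scores fall in the week starting Monday 2021-04-26, so that week's top60 is 2; the week of 2021-05-03 holds only scores 5 and 6.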