How to aggregate percentile_disc() function over date time - sql

I have a table like the following:
recorddate score
2021-05-01 0
2021-05-01 1
2021-05-01 2
2021-05-02 3
2021-05-02 4
2021-05-03 5
2021-05-07 6
I want to get the 60th percentile of score per week. I tried:
select distinct
recorddate
, PERCENTILE_disc(0.60) WITHIN GROUP (ORDER BY score)
OVER (PARTITION BY recorddate) AS top60
from tbl;
It returned something like this:
recorddate top60
2021-05-01 1
2021-05-02 4
2021-05-03 5
2021-05-07 6
But my desired result is a weekly aggregation (7 days).
For example for the week ending on 2021-05-07:
recorddate top60
2021-05-01 ~ 2021-05-07 2
Is there a solution for this?

I think you want this:
SELECT date_trunc('week', recorddate) AS week
, percentile_disc(0.60) WITHIN GROUP(ORDER BY score) AS top60
FROM tbl
GROUP BY 1;
That's the discrete value at the 60th percentile for each week (where actual data exists), i.e. the smallest value for which at least 60% of the rows in the same group (the week) are equal or smaller. To be precise, in the words of the manual:
the first value within the ordered set of aggregated argument values whose position in the ordering equals or exceeds the specified fraction.
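As a minimal worked example (assuming PostgreSQL, where date_trunc('week', ...) starts weeks on Monday), the query above groups the sample rows into the week of 2021-04-26 (scores 0, 1, 2, 3, 4) and the week of 2021-05-03 (scores 5, 6) and returns roughly:
week        top60
2021-04-26  2
2021-05-03  6
Here 2 is the first value whose position in the ordered set of five scores reaches the 0.60 fraction (3/5 = 0.6).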
Adding your format on top of it:
SELECT to_char(week_start, 'YYYY-MM-DD" ~ "')
|| to_char(week_start + interval '6 days', 'YYYY-MM-DD') AS week
, top60
FROM (
SELECT date_trunc('week', recorddate) AS week_start
, percentile_disc(0.60) WITHIN GROUP(ORDER BY score) AS top60
FROM tbl
GROUP BY 1
) sub;
I would rather call it something like "percentile_60".

Related

Project data and cumulative sum forward

I am trying to push the last value of a cumulative dataset forward to present time.
Initialise test data:
drop table if exists test_table;
create table test_table
as select data_date::date, floor(random() * 10) as data_value
from
generate_series('2021-08-25'::date, '2021-08-31'::date, '1 day') data_date;
The above test data, with a running total added, produces something like this:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
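The cumulative_value column is not stored in test_table itself; it is presumably derived with a running sum over the rows, e.g.:
select data_date, data_value,
       sum(data_value) over (order by data_date) as cumulative_value
from test_table;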
What I wish to do is push the last data_value (7, on 2021-08-31) forward to the present. For example, if today's date were 2021-09-03, I would want the result to be something like:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
2021-09-01 7 41
2021-09-02 7 48
2021-09-03 7 55
You need to get the value for the last date in the table. A common table expression is a good way to do that:
with cte as (
    select data_value as last_val
    from test_table
    order by data_date desc
    limit 1
)
select gen_date::date as data_date,
       coalesce(data_value, last_val) as data_value,
       sum(coalesce(data_value, last_val)) over (order by gen_date) as cumulative_sum
from generate_series('2021-08-25'::date, '2021-09-03', '1 day') as gen_date
left join test_table on gen_date = data_date
cross join cte;
You may use union and a scalar subquery to find the latest value of data_value for the new rows. cumulative_value is re-evaluated.
select *, sum(data_value) over (order by data_date rows between unbounded preceding and current row) as cumulative_value
from
(
    select data_date, data_value from test_table
    UNION ALL
    select rd::date, (select data_value from test_table where data_date = '2021-08-31')
    from generate_series('2021-09-01'::date, '2021-09-03', '1 day') rd
) t
order by data_date;
And here is a slightly smarter version without fixed date literals.
with cte(latest_date) as (select max(data_date) from test_table)
select *, sum(data_value) over (order by data_date rows between unbounded preceding and current row) as cumulative_value
from
(
    select data_date, data_value from test_table
    UNION ALL
    select rd::date, (select data_value from test_table, cte where data_date = latest_date)
    from generate_series((select latest_date from cte) + 1, CURRENT_DATE, '1 day') rd
) t
order by data_date;

How to get values from the previous row?

I have a table like this:
ID  NUMBER  TIMESTAMP
1   1       05/28/2020 09:00:00
2   2       05/29/2020 10:00:00
3   1       05/31/2020 21:00:00
4   1       06/01/2020 21:00:00
And I want to show data like this:
ID  NUMBER  TIMESTAMP            RANGE
1   1       05/28/2020 09:00:00  0 Days
2   2       05/29/2020 10:00:00  0 Days
3   1       05/31/2020 21:00:00  3,5 Days
4   1       06/01/2020 21:00:00  1 Days
So it takes 3,5 days to complete the process for NUMBER 1.
I tried:
select a.id, a.number, a.timestamp, ((a.timestamp-b.timestamp)/24) as days
from my_table a
left join (select number,timestamp from my_table) b
on a.number=b.number
This didn't work as expected. How do I do this properly?
Use the window function lag().
With standard interval output:
SELECT *, timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)
FROM tbl
ORDER BY id;
If you need a decimal number like in your example:
SELECT *, round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2) || ' days'
FROM tbl
ORDER BY id;
If you also need to display '0 days' instead of NULL like in your example:
SELECT *, COALESCE(round((extract(epoch FROM timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)) / 86400)::numeric, 2), 0) || ' days'
FROM tbl
ORDER BY id;
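For the four rows in the question, the last query would produce something like the following (a sketch of the expected output; the computed column is shown under the question's RANGE header; for id 3 the difference is 84 hours, and 84 / 24 = 3.5 days):
ID  NUMBER  TIMESTAMP            RANGE
1   1       05/28/2020 09:00:00  0 days
2   2       05/29/2020 10:00:00  0 days
3   1       05/31/2020 21:00:00  3.50 days
4   1       06/01/2020 21:00:00  1.00 days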

Select Sum of Grouped Values over Date Range (Window Function)

I have a table of names, dates and numeric values. For each name, I want to know the first date and the sum of the numeric values in the first 90 days after that first date. E.g.:
name  date        value
Joe   2020-10-30  3
Bob   2020-12-23  5
Joe   2021-01-03  7
Joe   2021-05-30  2
I want a query that returns
name  min_date    sum_first_90_days
Joe   2020-10-30  10
Bob   2020-12-23  5
So far I have
SELECT name, min(date) min_date,
sum(value) over (partition by name
order by date
rows between min(date) and dateadd(day,90,min(date))
) as first_90_days_sum
FROM table
but it's not executing. What's a good approach here? How can I set up a window function to use a dynamic date range for each partition?
You can use window functions and aggregation:
select name, sum(value)
from (select t.*,
min(date) over (partition by name) as min_date
from t
) t
where date <= min_date + interval '90 day'
group by name;
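If you also want min_date in the result, as in the desired output, a small variation along the same lines (a sketch, assuming PostgreSQL-style interval arithmetic) is to group by the windowed minimum as well:
select name, min_date, sum(value) as sum_first_90_days
from (select t.*,
             min(date) over (partition by name) as min_date
      from t
     ) t
where date <= min_date + interval '90 day'
group by name, min_date;
For the sample rows this yields Joe (min_date 2020-10-30, sum 10; the 2021-05-30 row falls outside the 90-day window) and Bob (2020-12-23, 5).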

SQL BigQuery: For the current month, count the number of distinct CustomerIDs in the previous 3 months

The following is the table, with distinct CustomerID values and Trunc_Date(Date, MONTH) as the column called Date.
DATE        CustomerID
2021-01-01  111
2021-01-01  112
2021-02-01  111
2021-03-01  113
2021-03-01  115
2021-04-01  119
For a given month M, I want to get the count of distinct CustomerIDs over the three previous months combined. E.g. for July (7), I want the distinct count of CustomerIDs from April (4), May (5) and June (6). I do not want July's own customers to be counted in July's record.
So the output will be like:
DATE        CustomerID Count
2021-01-01  535
2021-02-01  657
2021-03-01  777
2021-04-01  436
2021-05-01  879
2021-06-01  691
Consider the approach below:
select distinct date,
( select count(distinct id)
from t.prev_3_month_customers id
) customerid_count
from (
select *,
array_agg(customerid) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_customers
from (
select *, date_diff(date, '2000-01-01', month) pos
from `project.dataset.table`
)
) t
If applied to the sample data in your question, it produces the expected output.
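The pos column is what makes the frame work: date_diff(date, '2000-01-01', month) turns each month into a plain month count (for example, 2021-04-01 becomes 255), so range between 3 preceding and 1 preceding covers exactly the three previous calendar months (positions 252 to 254, i.e. January through March 2021). For the small sample above, the result would presumably be:
date        customerid_count
2021-01-01  0
2021-02-01  2
2021-03-01  2
2021-04-01  4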
You can also solve this problem by creating a record for each month in the three following months and then aggregating:
select date_add(date, interval n month) as month,
count(distinct customerid)
from t cross join
unnest(generate_array(1, 3, 1)) n
group by month;
BigQuery ran out of memory running this, since we have lots of data.
In cases like this, the most scalable and performant approach is to use HyperLogLog++ functions, as in the example below:
select distinct date,
( select hll_count.merge(sketch)
from t.prev_3_month_sketches sketch
) customerid_count
from (
select *,
array_agg(customers_hhl) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_sketches
from (
select date_diff(date, '2000-01-01', month) pos,
min(date) date,
hll_count.init(customerid) customers_hhl
from `project.dataset.table`
group by pos
)
) t
If applied to the sample data in your question, it produces the same output.
Note: HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
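For a sense of the difference, a minimal sketch (using the same assumed table and column names as above) comparing an exact distinct count with BigQuery's HLL++-backed approximation, which keeps memory per group roughly constant:
-- exact; memory grows with the number of distinct values
select count(distinct customerid) from `project.dataset.table`;
-- approximate (HLL++ under the hood); fixed-size sketch per group
select approx_count_distinct(customerid) from `project.dataset.table`;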

(SQL BigQuery) Using Lag but data contains missing months

I have the following table with monthly data, but the third month is missing.
DATE        FREQUENCY
2021-01-01  6000
2021-02-01  4533
2021-04-01  7742
2021-05-01  1547
2021-06-01  9857
I want to add the frequency of the previous month, as shown in the following table.
DATE        FREQUENCY  PREVIOUS_MONTH_FREQ
2021-01-01  6000       NULL
2021-02-01  4533       6000
2021-04-01  7742       NULL
2021-05-01  1547       7742
2021-06-01  9857       1547
I want the 2021-04-01 record to have NULL for the PREVIOUS_MONTH_FREQ since there is no data for the previous month.
I got as far as...
SELECT DATE,
FREQUENCY,
LAG(FREQUENCY) OVER(ORDER BY DATE) AS PREVIOUS_MONTH_FREQ
FROM Table1
Use a CASE expression to check if the previous row contains data of the previous month:
SELECT DATE,
FREQUENCY,
CASE WHEN DATE_SUB(DATE, INTERVAL 1 MONTH) = LAG(DATE) OVER(ORDER BY DATE)
THEN LAG(FREQUENCY) OVER(ORDER BY DATE)
END AS PREVIOUS_MONTH_FREQ
FROM Table1
In BigQuery, you can use a RANGE window specification. The only trick is that you need a number rather than a date:
select t.*,
max(frequency) over (order by date_diff(date, date '2000-01-01', month)
range between 1 preceding and 1 preceding
) as prev_frequency
from t;
The '2000-01-01' is an arbitrary date. This turns the date column into the number of months since that date. The actual date is not important.
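For the sample rows, the month positions and the result would presumably look like this (positions counted from 2000-01-01 and shown only for clarity; the query itself returns t.* plus prev_frequency; there is no row at position 254, i.e. 2021-03, so 2021-04-01 gets NULL):
DATE        pos  prev_frequency
2021-01-01  252  NULL
2021-02-01  253  6000
2021-04-01  255  NULL
2021-05-01  256  7742
2021-06-01  257  1547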