I have a table like the following:
recorddate score
2021-05-01 0
2021-05-01 1
2021-05-01 2
2021-05-02 3
2021-05-02 4
2021-05-03 5
2021-05-07 6
And I want to get the 60th percentile of score per week. I tried:
select distinct
       recorddate
     , percentile_disc(0.60) within group (order by score)
         over (partition by recorddate) as top60
from tbl;
It returned something like this:
recorddate top60
2021-05-01 1
2021-05-02 4
2021-05-03 5
2021-05-07 6
But my desired result is a weekly aggregation (7 days).
For example for the week ending on 2021-05-07:
recorddate top60
2021-05-01 ~ 2021-05-07 2
Is there a solution for this?
I think you want this:
SELECT date_trunc('week', recorddate) AS week
, percentile_disc(0.60) WITHIN GROUP(ORDER BY score) AS top60
FROM tbl
GROUP BY 1;
That's the discrete value at the 60th percentile for each week (where actual data exists): the smallest score such that at least 60% of the rows in the same week are equal or smaller. To be precise, in the words of the manual:
the first value within the ordered set of aggregated argument values whose position in the ordering equals or exceeds the specified fraction.
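To see which row that rule picks, here is a minimal sketch (running against the sample table tbl) that mirrors the rule with the window function cume_dist(); the top60 of a week is the smallest score whose cumulative fraction reaches 0.60:
SELECT date_trunc('week', recorddate) AS week_start
     , score
     , cume_dist() OVER (PARTITION BY date_trunc('week', recorddate)
                         ORDER BY score) AS cume_frac
FROM tbl;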
Adding your format on top of the aggregate query above:
SELECT to_char(week_start, 'YYYY-MM-DD" ~ "')
|| to_char(week_start + interval '6 days', 'YYYY-MM-DD') AS week
, top60
FROM (
SELECT date_trunc('week', recorddate) AS week_start
, percentile_disc(0.60) WITHIN GROUP(ORDER BY score) AS top60
FROM tbl
GROUP BY 1
) sub;
I would rather call it something like "percentile_60".
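Aside: date_trunc('week', …) truncates to Monday (ISO weeks). If your weeks really must run Saturday through Friday, so that one week ends on 2021-05-07, a hedged sketch (assuming recorddate is of type date) is to shift the dates before truncating and shift back afterwards:
SELECT date_trunc('week', recorddate + 2)::date - 2 AS week_start  -- +2 days maps Sat to the following Mon
     , percentile_disc(0.60) WITHIN GROUP (ORDER BY score) AS top60
FROM tbl
GROUP BY 1;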
I am trying to push the last value of a cumulative dataset forward to present time.
Initialise test data:
drop table if exists test_table;

create table test_table as
select data_date::date, floor(random() * 10) as data_value
from   generate_series('2021-08-25'::date, '2021-08-31'::date, '1 day') data_date;
The above test data, with a running sum added as cumulative_value (sketched after the table), produces something like this:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
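The cumulative_value column is not stored in test_table; for reference, a minimal sketch of how it is derived:
select data_date, data_value,
       sum(data_value) over (order by data_date) as cumulative_value
from test_table;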
What I wish to do is push the last data value (2021-08-31, value 7) forward to the present. For example, if today's date were 2021-09-03, I would want the result to be something like:
data_date data_value cumulative_value
2021-08-25 1 1
2021-08-26 7 8
2021-08-27 8 16
2021-08-28 7 23
2021-08-29 2 25
2021-08-30 2 27
2021-08-31 7 34
2021-09-01 7 41
2021-09-02 7 48
2021-09-03 7 55
You need to get the value of the last date in the table. A common table expression (CTE) is a good way to do that:
with cte as (
   select data_value as last_val
   from   test_table
   order  by data_date desc
   limit  1
   )
select gen_date::date as data_date
     , coalesce(data_value, last_val) as data_value
     , sum(coalesce(data_value, last_val)) over (order by gen_date) as cumulative_value
from   generate_series('2021-08-25'::date, '2021-09-03', '1 day') as gen_date
left   join test_table on gen_date = data_date
cross  join cte;
You may use union all and a scalar subquery to find the latest value of data_value for the new rows; cumulative_value is then re-evaluated over the whole set.
select *, sum(data_value) over (order by data_date rows between unbounded preceding and current row) as cumulative_value
from
(
  select data_date, data_value from test_table
  union all
  select rd::date, (select data_value from test_table where data_date = '2021-08-31')
  from generate_series('2021-09-01'::date, '2021-09-03', '1 day') rd
) t
order by data_date;
And here it is a bit smarter, without fixed date literals:
with cte(latest_date) as (select max(data_date) from test_table)
select *, sum(data_value) over (order by data_date rows between unbounded preceding and current row) as cumulative_value
from
(
  select data_date, data_value from test_table
  union all
  select rd::date, (select data_value from test_table, cte where data_date = latest_date)
  from generate_series((select latest_date from cte) + 1, current_date, '1 day') rd
) t
order by data_date;
I have a table like this:
ID NUMBER TIMESTAMP
1  1      05/28/2020 09:00:00
2  2      05/29/2020 10:00:00
3  1      05/31/2020 21:00:00
4  1      06/01/2020 21:00:00
And I want to show data like this:
ID NUMBER TIMESTAMP           RANGE
1  1      05/28/2020 09:00:00 0 Days
2  2      05/29/2020 10:00:00 0 Days
3  1      05/31/2020 21:00:00 3,5 Days
4  1      06/01/2020 21:00:00 1 Days
So it takes 3,5 days to process number 1 (the gap between its consecutive rows).
I tried:
select a.id, a.number, a.timestamp, ((a.timestamp-b.timestamp)/24) as days
from my_table a
left join (select number,timestamp from my_table) b
on a.number=b.number
Didn't work as expected. How to do this properly?
Use the window function lag().
With standard interval output:
SELECT *, timestamp - lag(timestamp) OVER(PARTITION BY number ORDER BY id)
FROM tbl
ORDER BY id;
If you need a decimal number like in your example:
SELECT *, round((extract(epoch FROM timestamp - lag(timestamp) OVER (PARTITION BY number ORDER BY id))
              / 86400)::numeric, 2) || ' days'  -- 86400 = seconds per day
FROM tbl
ORDER BY id;
If you also need to display '0 days' instead of NULL like in your example:
SELECT *, coalesce(round((extract(epoch FROM timestamp - lag(timestamp) OVER (PARTITION BY number ORDER BY id))
                   / 86400)::numeric, 2), 0) || ' days'  -- first row per number has no previous row -> 0
FROM tbl
ORDER BY id;
I have a table of names, dates and numeric values. For each name, I want the first date and the sum of the numeric values within the first 90 days after that first date.
E.g.:
name date       value
Joe  2020-10-30 3
Bob  2020-12-23 5
Joe  2021-01-03 7
Joe  2021-05-30 2
I want a query that returns
name min_date   sum_first_90_days
Joe  2020-10-30 10
Bob  2020-12-23 5
So far I have
SELECT name, min(date) min_date,
sum(value) over (partition by name
order by date
rows between min(date) and dateadd(day,90,min(date))
) as first_90_days_sum
FROM table
but it's not executing. What's a good approach here? How can I set up a window function to use a dynamic date range for each partition?
You can use window functions and aggregation:
select name, min_date, sum(value) as sum_first_90_days
from (select t.*,
             min(date) over (partition by name) as min_date
      from t
     ) t
where date <= min_date + interval '90 day'
group by name, min_date;
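As a self-contained sanity check against the sample rows (a sketch in Postgres syntax; the CTE stands in for your real table): Joe's 2021-05-30 row is more than 90 days after his first date, so it is excluded and his sum is 3 + 7 = 10.
with t(name, date, value) as (
   values ('Joe', date '2020-10-30', 3)
        , ('Bob', date '2020-12-23', 5)
        , ('Joe', date '2021-01-03', 7)
        , ('Joe', date '2021-05-30', 2)
)
select name, min_date, sum(value) as sum_first_90_days
from (select t.*, min(date) over (partition by name) as min_date from t) t
where date <= min_date + interval '90 day'
group by name, min_date;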
The following is a table with distinct CustomerID values, where DATE is the month-truncated date, Trunc_Date(Date, MONTH).
DATE       CustomerID
2021-01-01 111
2021-01-01 112
2021-02-01 111
2021-03-01 113
2021-03-01 115
2021-04-01 119
For a given month M, I want the count of distinct CustomerIDs over the three previous months combined. E.g., for July (7), I want the distinct count of CustomerIDs from April (4), May (5) and June (6). Customers appearing in July (7) itself must not be counted in July's record.
So the output will be like:
DATE       CustomerID Count
2021-01-01 535
2021-02-01 657
2021-03-01 777
2021-04-01 436
2021-05-01 879
2021-06-01 691
Consider below:
select distinct date,
  ( select count(distinct id)
    from t.prev_3_month_customers id
  ) customerid_count
from (
  select *,
    array_agg(customerid) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_customers
  from (
    select *, date_diff(date, '2000-01-01', month) pos
    from `project.dataset.table`
  )
) t
You can also solve this problem by creating a record for each row in each of the three following months and then aggregating: a customer seen in April is then counted for May, June and July.
select date_add(date, interval n month) as month,
count(distinct customerid)
from t cross join
unnest(generate_array(1, 3, 1)) n
group by month;
BigQuery ran out of memory running this, since we have lots of data.
In cases like this, the most scalable and performant approach is to use HyperLogLog++ functions, as in the example below:
select distinct date,
  ( select hll_count.merge(sketch)
    from t.prev_3_month_sketches sketch
  ) customerid_count
from (
  select *,
    array_agg(customers_hll) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_sketches
  from (
    select date_diff(date, '2000-01-01', month) pos,
      min(date) date,
      hll_count.init(customerid) customers_hll
    from `project.dataset.table`
    group by pos
  )
) t
Note: HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
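If you want to gauge that trade-off on your own data, a quick sketch (same hypothetical `project.dataset.table` as above) is to run the exact and the approximate distinct count side by side:
select count(distinct customerid) as exact_count,
  hll_count.extract(hll_count.init(customerid)) as approx_count  -- reads the estimate back from the sketch
from `project.dataset.table`;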
I have the following table with monthly data, but the third month is missing.
DATE       FREQUENCY
2021-01-01 6000
2021-02-01 4533
2021-04-01 7742
2021-05-01 1547
2021-06-01 9857
I want to get the previous month's frequency next to each row, as in the following table:
DATE       FREQUENCY PREVIOUS_MONTH_FREQ
2021-01-01 6000      NULL
2021-02-01 4533      6000
2021-04-01 7742      NULL
2021-05-01 1547      7742
2021-06-01 9857      1547
I want the 2021-04-01 record to have NULL for the PREVIOUS_MONTH_FREQ since there is no data for the previous month.
I got as far as:
SELECT DATE,
FREQUENCY,
LAG(FREQUENCY) OVER(ORDER BY DATE) AS PREVIOUS_MONTH_FREQ
FROM Table1
Use a CASE expression to check if the previous row contains data of the previous month:
SELECT DATE,
FREQUENCY,
CASE WHEN DATE_SUB(DATE, INTERVAL 1 MONTH) = LAG(DATE) OVER(ORDER BY DATE)
THEN LAG(FREQUENCY) OVER(ORDER BY DATE)
END AS PREVIOUS_MONTH_FREQ
FROM Table1
In BigQuery, you can use a RANGE window specification. The only trick is that you need a number rather than a date:
select t.*,
  max(frequency) over (order by date_diff(date, date '2000-01-01', month)
                       range between 1 preceding and 1 preceding
                      ) as prev_frequency
from t;
The '2000-01-01' is an arbitrary date. This turns the date column into the number of months since that date. The actual date is not important.
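For illustration, a minimal sketch of the month numbers this produces for the sample dates. Since 2021-03-01 (254) is missing, the 1 preceding frame for 2021-04-01 (255) is empty, which yields the desired NULL:
select date, date_diff(date, date '2000-01-01', month) as months_since_2000
from t;
-- 2021-01-01 -> 252, 2021-02-01 -> 253, 2021-04-01 -> 255, 2021-05-01 -> 256, 2021-06-01 -> 257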