BigQuery approx_quantiles: using a different WHERE filter for different functions

We have a table and we are trying to compute quantiles for different columns in the table:
with t1 as (
select 'a' as category, 0.25 as stat1, 2 as stat1ct, 0.82 as stat2, 3 as stat2ct union all
select 'a' as category, 0.35 as stat1, 4 as stat1ct, 0.68 as stat2, 5 as stat2ct union all
select 'a' as category, 0.45 as stat1, 3 as stat1ct, 0.74 as stat2, 4 as stat2ct union all
select 'a' as category, 0.28 as stat1, 0 as stat1ct, 0.72 as stat2, 0 as stat2ct union all
select 'a' as category, 0.36 as stat1, 0 as stat1ct, 0.65 as stat2, 4 as stat2ct union all
select 'a' as category, 0.63 as stat1, 1 as stat1ct, 0.53 as stat2, 3 as stat2ct union all
select 'a' as category, 0.18 as stat1, 5 as stat1ct, 0.52 as stat2, 1 as stat2ct union all
select 'a' as category, 0.43 as stat1, 3 as stat1ct, 0.57 as stat2, 2 as stat2ct
)
select
approx_quantiles(stat1, 100) as atr2FgPct
,approx_quantiles(stat2, 100) as paint2FgPct
from t1
and this works fine. However, we would like to edit this by filtering each column based on a WHERE criterion using another column. We are looking for something like this:
select
approx_quantiles(stat1 where stat1ct > 2, 100) as atr2FgPct
,approx_quantiles(stat2 where stat2ct > 2, 100) as paint2FgPct
from t1
...where the stat1 quantiles are based only on the stat1 values where stat1ct is greater than 2. If stat1ct is 2 or less, then the value for stat1 should not count towards the quantiles. Is this possible to do in BigQuery?

Consider the approach below:
select
approx_quantiles(if(stat1ct > 2, stat1, null), 100) as atr2FgPct
,approx_quantiles(if(stat2ct > 2, stat2, null), 100) as paint2FgPct
from t1
Note: APPROX_QUANTILES supports IGNORE NULLS and RESPECT NULLS
If IGNORE NULLS is specified, the NULL values are excluded from the result. If RESPECT NULLS is specified, the NULL values are included in the result. If neither is specified, the NULL values are excluded from the result. An error is raised if an array in the final query result contains a NULL element.
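If only a single percentile is needed rather than the whole array, an element can be picked out with an array OFFSET; a minimal sketch (APPROX_QUANTILES(x, 100) returns 101 elements, so OFFSET(50) approximates the median of the filtered values):
select
approx_quantiles(if(stat1ct > 2, stat1, null), 100)[offset(50)] as stat1_median
,approx_quantiles(if(stat2ct > 2, stat2, null), 100)[offset(50)] as stat2_median
from t1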

Related

In BigQuery, left join based on closest value

with
stats as (
select 0.460 as stat1, 1.93 as stat2 union all
select 0.482 as stat1, 2.17 as stat2 union all
select 0.531 as stat1, 2.35 as stat2 union all
select 0.477 as stat1, 1.83 as stat2 union all
select 0.515 as stat1, 1.61 as stat2
),
pctiles as (
select 1 as pctile, .45 as stat1, 1.5 as stat2 union all
select 2 as pctile, .46 as stat1, 1.6 as stat2 union all
select 3 as pctile, .47 as stat1, 1.7 as stat2 union all
select 4 as pctile, .48 as stat1, 1.8 as stat2 union all
select 5 as pctile, .49 as stat1, 1.9 as stat2 union all
select 6 as pctile, .50 as stat1, 2.0 as stat2 union all
select 7 as pctile, .51 as stat1, 2.1 as stat2 union all
select 8 as pctile, .52 as stat1, 2.2 as stat2 union all
select 9 as pctile, .53 as stat1, 2.3 as stat2 union all
select 10 as pctile, .54 as stat1, 2.4 as stat2
)
Is it possible to left join pctiles onto stats, using the closest values in pctiles? We are seeking to assign a 1-10 pctile for each value and column in the stats table. Looking at the first row as an example (select 0.460 as stat1, 1.93 as stat2), we see in the pctiles table that 0.46 for stat1 corresponds exactly with a pctile of 2. For 1.93 and stat2, the closest value in the pctiles table is 2.0, which corresponds with a pctile of 6.
Our objective output for the "left join"
select 0.460 as stat1, 1.93 as stat2, 2 as pctile1, 6 as pctile2 union all
select 0.482 as stat1, 2.17 as stat2, 4 as pctile1, 8 as pctile2 union all
select 0.531 as stat1, 2.35 as stat2, 9 as pctile1, 9 as pctile2 union all
select 0.477 as stat1, 1.83 as stat2, 4 as pctile1, 4 as pctile2 union all
select 0.515 as stat1, 1.61 as stat2, 7 as pctile1, 2 as pctile2
For numbers in stats that fall exactly between two numbers in pctiles (eg. .515 is between 0.51 and 0.52), returning either pctile value 7 or 8 is fine.
I see a number of options; consider the approach below (somehow I ended up with it):
select * except(id) from (
select id, any_value(col) col, any_value(value) value,
array_agg(pctile order by abs(value - stat1) limit 1)[offset(0)] pctile
from (select to_json_string(t) id, 'stat1' col, stat1 value from stats t) s
join pctiles p on col = 'stat1' group by id
union all
select id, any_value(col) col, any_value(value) value,
array_agg(pctile order by abs(value - stat2) limit 1)[offset(0)] pctile
from (select to_json_string(t) id, 'stat2' col, stat2 value from stats t) s
join pctiles p on col = 'stat2' group by id
)
pivot (min(value) as stat, min(pctile) as pctile for replace(col, 'stat', '') in ('1', '2'))
Applied to the sample data in your question, it produces the objective output shown above.
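A more direct sketch of the same idea, assuming stats contains no fully duplicate rows (duplicates would collapse under the GROUP BY): cross join the two tables and keep, per row and per column, the pctile with the smallest absolute distance.
select s.stat1, s.stat2,
array_agg(p.pctile order by abs(p.stat1 - s.stat1) limit 1)[offset(0)] as pctile1,
array_agg(p.pctile order by abs(p.stat2 - s.stat2) limit 1)[offset(0)] as pctile2
from stats s
cross join pctiles p
group by s.stat1, s.stat2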

With SQL, how to drop cases after a long gap in a time series?

my data looks something like this:
     CASE_TIMESTAMP             GROUP
0    2017-12-26 16:12:09+00:00  A
1    2017-12-26 16:12:44+00:00  A
2    2020-04-21 07:00:00+00:00  A
3    2020-07-01 00:05:35+00:00  A
4    2020-08-06 07:00:00+00:00  A
5    2020-08-06 07:00:00+00:00  A
6    2020-08-06 07:00:00+00:00  A
7    2020-08-25 07:00:00+00:00  B
8    2020-09-22 07:00:00+00:00  B
9    2020-09-22 07:00:00+00:00  B
10   2020-12-04 08:00:00+00:00  B
11   2020-12-04 08:00:00+00:00  B
12   2020-12-07 08:00:00+00:00  B
13   2020-12-07 08:00:00+00:00  B
14   2020-12-07 08:00:00+00:00  B
15   2020-12-08 08:00:00+00:00  B
16   2020-12-08 08:00:00+00:00  B
17   2020-12-08 08:00:00+00:00  B
I need to drop cases that occurred before a gap of more than one day, so in group A all cases before 2020-08-06, and in group B all cases before 2020-12-07.
I think I need a window function, but I don't know how to calculate the gaps and then drop everything before them; any ideas?
P.S. I'm on Snowflake.
Using QUALIFY and a windowed MAX to find the latest CASE_TIMESTAMP per GRP:
CREATE TABLE t(CASE_TIMESTAMP TIMESTAMP, GRP VARCHAR)
AS
SELECT '2017-12-26 16:12:09+00:00','A'
UNION ALL SELECT '2017-12-26 16:12:44+00:00','A'
UNION ALL SELECT '2020-04-21 07:00:00+00:00','A'
UNION ALL SELECT '2020-07-01 00:05:35+00:00','A'
UNION ALL SELECT '2020-08-06 07:00:00+00:00','A'
UNION ALL SELECT '2020-08-06 07:00:00+00:00','A'
UNION ALL SELECT '2020-08-06 07:00:00+00:00','A'
UNION ALL SELECT '2020-08-25 07:00:00+00:00','B'
UNION ALL SELECT '2020-09-22 07:00:00+00:00','B'
UNION ALL SELECT '2020-09-22 07:00:00+00:00','B'
UNION ALL SELECT '2020-12-04 08:00:00+00:00','B'
UNION ALL SELECT '2020-12-04 08:00:00+00:00','B'
UNION ALL SELECT '2020-12-07 08:00:00+00:00','B'
UNION ALL SELECT '2020-12-07 08:00:00+00:00','B'
UNION ALL SELECT '2020-12-07 08:00:00+00:00','B'
UNION ALL SELECT '2020-12-08 08:00:00+00:00','B'
UNION ALL SELECT '2020-12-08 08:00:00+00:00','B'
UNION ALL SELECT '2020-12-08 08:00:00+00:00','B';
Query:
SELECT *
FROM t
QUALIFY CASE_TIMESTAMP >= MAX(CASE_TIMESTAMP) OVER(PARTITION BY GRP)
- INTERVAL '1 days';
Output: the three 2020-08-06 rows remain for group A, and the six 2020-12-07/2020-12-08 rows remain for group B.
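Note that the windowed MAX keeps whatever falls within one day of each group's latest case, which happens to coincide with the requested rows here. A sketch of a more literal reading, assuming the cutoff should be the first case after the last gap of more than one day between consecutive cases:
WITH gaps AS (
  SELECT t.*,
         -- days elapsed since the previous case in the same group
         -- (DATEDIFF('day') counts calendar-day boundaries crossed)
         DATEDIFF('day',
                  LAG(CASE_TIMESTAMP) OVER(PARTITION BY GRP ORDER BY CASE_TIMESTAMP),
                  CASE_TIMESTAMP) AS gap_days
  FROM t
)
SELECT CASE_TIMESTAMP, GRP
FROM gaps
-- keep rows from the start of the last segment that follows a gap > 1 day;
-- groups with no such gap keep all rows
QUALIFY CASE_TIMESTAMP >= COALESCE(
  MAX(CASE WHEN gap_days > 1 THEN CASE_TIMESTAMP END) OVER(PARTITION BY GRP),
  CASE_TIMESTAMP);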

Google BigQuery ML ARIMA is not forecasting correctly

I have input data as shown below (actual data removed). I am trying to forecast the next 3 months of VAL using almost 14 months of data (the frequency is monthly). For the data below, I am getting the same value for all of the forecasted months.
In the model evaluation, I am getting all FALSE values for the 'Has_Drift' column, and the AIC values are all negative.
Can anyone help? What is it missing that makes it difficult to forecast 3 months?
Sample input and output below.
CREATE OR REPLACE MODEL <MODEL_NAME>
OPTIONS(MODEL_TYPE = 'ARIMA',
time_series_timestamp_col='DATE_COL',
time_series_data_col='VAL',
DATA_FREQUENCY = 'MONTHLY') AS
SELECT CAST(DATE_COL AS DATE) DATE_COL, VAL from (
SELECT '2019-06-07' DATE_COL ,0.09262066947 VAL union all
SELECT '2019-07-07',0.07495576437 union all
SELECT '2019-08-07',0.09832972783 union all
SELECT '2019-09-07',0.09959302865 union all
SELECT '2019-10-07',0.1445173433 union all
SELECT '2019-11-07',0.1116498012 union all
SELECT '2019-12-07',0.1065453852 union all
SELECT '2020-01-07',0.1403350342 union all
SELECT '2020-02-07',0.105060523 union all
SELECT '2020-03-07',0.2191159052 union all
SELECT '2020-04-07',0.07962838894 union all
SELECT '2020-05-07',0.131412274 union all
SELECT '2020-06-07',0.173012701 union all
SELECT '2020-07-07',0.1504522412 union all
SELECT '2020-08-07',0.1073950999
)
Forecast Values
SELECT forecast_timestamp, forecast_value
FROM ML.FORECAST(MODEL <MODEL_NAME>,
STRUCT(4 AS horizon, 0.6 AS confidence_level))
Row  forecast_timestamp         forecast_value
1    2020-08-30 00:00:00 UTC    0.12230825916400008
2    2020-09-29 00:00:00 UTC    0.12230825916400008
3    2020-10-29 00:00:00 UTC    0.12230825916400008
4    2020-11-28 00:00:00 UTC    0.12230825916400008
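One observation worth making here, as a hedged diagnosis rather than a confirmed answer: the repeated forecast_value of roughly 0.1223 is (to four decimal places) the sample mean of the 15 training points, which is exactly what a non-seasonal ARIMA with no differencing, no AR/MA structure, and no drift (consistent with all-FALSE Has_Drift) predicts at every horizon. A quick check against the same training query:
-- series_mean comes out to ~0.1223, matching every forecast_value above
SELECT AVG(VAL) AS series_mean
FROM <same SELECT used to train the model>
With so little history and no strong trend, the automatic order selection can settle on a mean-only model; supplying more data (ideally two or more seasonal cycles) gives the model structure to extrapolate.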

How to perform rolling sum in BigQuery

I have sample data in BigQuery as below:
with temp as (
select DATE("2016-10-02") date_field , 200 as salary
union all
select DATE("2016-10-09"), 500
union all
select DATE("2016-10-16"), 350
union all
select DATE("2016-10-23"), 400
union all
select DATE("2016-10-30"), 190
union all
select DATE("2016-11-06"), 550
union all
select DATE("2016-11-13"), 610
union all
select DATE("2016-11-20"), 480
union all
select DATE("2016-11-27"), 660
union all
select DATE("2016-12-04"), 690
union all
select DATE("2016-12-11"), 810
union all
select DATE("2016-12-18"), 950
union all
select DATE("2016-12-25"), 1020
union all
select DATE("2017-01-01"), 680
) ,
temp2 as (
select * , DATE("2017-01-01") as current_date
from temp
)
select * from temp2
I want to perform a rolling sum on this table. As an example, I have set the current date to 2017-01-01. Now, this being the current date, I want to go back 30 days and take the sum of the salary field. Hence, with 2017-01-01 being the current date, the total that should be returned is for the month of December 2016, which is 690+810+950+1020. How can I do this using Standard SQL?
Below is BigQuery Standard SQL for a rolling last-30-days SUM:
#standardSQL
SELECT *,
SUM(salary) OVER(
ORDER BY UNIX_DATE(date_field)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS rolling_30_days_sum
FROM `project.dataset.your_table`
You can test and play with the above using the sample data from your question, as below:
#standardSQL
WITH temp AS (
SELECT DATE("2016-10-02") date_field , 200 AS salary UNION ALL
SELECT DATE("2016-10-09"), 500 UNION ALL
SELECT DATE("2016-10-16"), 350 UNION ALL
SELECT DATE("2016-10-23"), 400 UNION ALL
SELECT DATE("2016-10-30"), 190 UNION ALL
SELECT DATE("2016-11-06"), 550 UNION ALL
SELECT DATE("2016-11-13"), 610 UNION ALL
SELECT DATE("2016-11-20"), 480 UNION ALL
SELECT DATE("2016-11-27"), 660 UNION ALL
SELECT DATE("2016-12-04"), 690 UNION ALL
SELECT DATE("2016-12-11"), 810 UNION ALL
SELECT DATE("2016-12-18"), 950 UNION ALL
SELECT DATE("2016-12-25"), 1020 UNION ALL
SELECT DATE("2017-01-01"), 680
)
SELECT *,
SUM(salary) OVER(
ORDER BY UNIX_DATE(date_field)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS rolling_30_days_sum
FROM temp
-- ORDER BY date_field
with the result:
Row date_field salary rolling_30_days_sum
1 2016-10-02 200 null
2 2016-10-09 500 200
3 2016-10-16 350 700
4 2016-10-23 400 1050
5 2016-10-30 190 1450
6 2016-11-06 550 1440
7 2016-11-13 610 1490
8 2016-11-20 480 1750
9 2016-11-27 660 1830
10 2016-12-04 690 2300
11 2016-12-11 810 2440
12 2016-12-18 950 2640
13 2016-12-25 1020 3110
14 2017-01-01 680 3470
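A note on the frame above: RANGE in BigQuery requires a numeric ORDER BY key, which is why date_field is converted with UNIX_DATE (days since epoch); 30 PRECEDING then means 30 days back, not 30 rows. Also, 1 PRECEDING excludes the current day's salary; a sketch of the inclusive variant, should the current day count toward the total:
#standardSQL
SELECT *,
SUM(salary) OVER(
ORDER BY UNIX_DATE(date_field)
RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
) AS rolling_30_days_sum_incl
FROM `project.dataset.your_table`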
This is not exactly a "rolling sum", but it's the exact answer to "I want to go back 30 days and take sum of salary field. Hence, with 2017-01-01 being the current date, the total that should be returned is for the month of December"
with temp as (
select DATE("2016-10-02") date_field , 200 as salary
union all
select DATE("2016-10-09"), 500
union all
select DATE("2016-10-16"), 350
union all
select DATE("2016-10-23"), 400
union all
select DATE("2016-10-30"), 190
union all
select DATE("2016-11-06"), 550
union all
select DATE("2016-11-13"), 610
union all
select DATE("2016-11-20"), 480
union all
select DATE("2016-11-27"), 660
union all
select DATE("2016-12-04"), 690
union all
select DATE("2016-12-11"), 810
union all
select DATE("2016-12-18"), 950
union all
select DATE("2016-12-25"), 1020
union all
select DATE("2017-01-01"), 680
) ,
temp2 as (
select * , DATE("2017-01-01") as current_date_x
from temp
)
select SUM(salary)
from temp2
WHERE date_field BETWEEN DATE_SUB(current_date_x, INTERVAL 30 DAY) AND DATE_SUB(current_date_x, INTERVAL 1 DAY)
Result: 3470
Note that I wasn't able to use current_date as a variable name, as it gets replaced by the actual current date.
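That happens because CURRENT_DATE is a built-in BigQuery function that can be referenced without parentheses, so an unquoted current_date in an expression resolves to the function rather than to a column alias; a minimal illustration:
SELECT CURRENT_DATE AS today,  -- the function, evaluated at query time
DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY) AS window_start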

How to get the sum of rows in Oracle?

I have written a query which gives the following output.
But my actual need is something different.
TRANSACTION_DATE  DETAILS            LEAVE_CREDITED  LEAVE_DEBITED
29-Sep-2012       Sep-2012-Sep-2012  0.11
01-Oct-2012       Oct-2012-Dec-2012  2.5
01-Jan-2013       Jan-2013-Mar-2013  2.5
31-Mar-2013       LAPSE - 540007                     1.9
01-Apr-2013       Apr-2013-Jun-2013  2.5
30-Apr-2013       Lev_102935703                      0.11
There should be a 5th column whose value is LASTBALANCE + (LEAVE_CREDITED) - (LEAVE_DEBITED).
In this case:
BALANCE
0.11-0= 0.11
0.11+(2.5-0)= 2.61
2.61+(2.5-0)= 5.11
5.11+(0-1.9)= 3.02
Please help.
My query is something like
SELECT TRUNC(NVL(C.UPDATED_DATE, C.CREATED_DATE)) TRANSACTION_DATE,
TO_CHAR(C.SERVICE_START_DATE, 'Mon-YYYY') || '-' ||
TO_CHAR(C.SERVICE_END_DATE, 'Mon-YYYY') Details,
C.LEAVE_CREDITED,
NULL LEAVE_DEBITED
FROM LEAVE.GES_LEV_CREDIT_DETAILS C, LEAVE.GES_LEV_CREDIT_MASTER CM
WHERE C.LEV_CREDIT_ID = CM.LEV_CREDIT_ID
AND C.PERSON_ID = 12345
AND CM.COUNTRY_LEAVE_TYPE_ID = 5225
AND c.leave_credited<>0
UNION
SELECT TRUNC(NVL(d.UPDATED_DATE, d.CREATED_DATE)) TRANSACTION_DATE,
d.reference,
NULL,
d.no_of_days LEAVE_DEBITED
FROM LEAVE.GES_LEV_CREDIT_DETAILS C,
LEAVE.GES_LEV_CREDIT_MASTER CM,
leave.ges_lev_debit_req_dtls D
WHERE C.LEV_CREDIT_ID = CM.LEV_CREDIT_ID
AND C.LEV_CREDIT_DETAIL_ID = D.LEV_CREDIT_DETL_ID
AND C.PERSON_ID = 12345
AND CM.COUNTRY_LEAVE_TYPE_ID = 5225
Regarding "5.11+(0-1.9)= 3.02": shouldn't it be 3.21?
Use analytic SUM() OVER() for both the credited and debited columns and then take the difference of the two.
Let's see a working test case. I have built your table using a WITH clause; in reality you just need to use your table instead of DATA:
WITH DATA AS(
SELECT to_date('29-Sep-2012', 'dd-Mon-yyyy') TRANSACTION_DATE, 0.11 LEAVE_CREDITED, NULL LEAVE_DEBITED FROM dual UNION ALL
SELECT to_date('01-Oct-2012', 'dd-Mon-yyyy') TRANSACTION_DATE, 2.5 LEAVE_CREDITED, NULL LEAVE_DEBITED FROM dual UNION ALL
SELECT to_date('01-Jan-2013', 'dd-Mon-yyyy') TRANSACTION_DATE, 2.5 LEAVE_CREDITED, NULL LEAVE_DEBITED FROM dual UNION ALL
SELECT to_date('31-Mar-2013', 'dd-Mon-yyyy') TRANSACTION_DATE, NULL LEAVE_CREDITED, 1.9 LEAVE_DEBITED FROM dual UNION ALL
SELECT to_date('01-Apr-2013', 'dd-Mon-yyyy') TRANSACTION_DATE, 2.5 LEAVE_CREDITED, NULL LEAVE_DEBITED FROM dual UNION ALL
SELECT to_date('30-Apr-2013', 'dd-Mon-yyyy') TRANSACTION_DATE, NULL LEAVE_CREDITED, 0.11 LEAVE_DEBITED FROM dual
)
SELECT t.*,
SUM(NVL(leave_credited,0)) OVER(ORDER BY TRANSACTION_DATE)
-
SUM(NVL(LEAVE_DEBITED,0)) OVER(ORDER BY TRANSACTION_DATE) LASTBALANCE
FROM DATA t
/
TRANSACTI LEAVE_CREDITED LEAVE_DEBITED LASTBALANCE
--------- -------------- ------------- -----------
29-SEP-12            .11                       .11
01-OCT-12            2.5                      2.61
01-JAN-13            2.5                      5.11
31-MAR-13                          1.9        3.21
01-APR-13            2.5                      5.71
30-APR-13                          .11         5.6
6 rows selected.
Your query would look like:
SELECT t.*,
SUM(NVL(leave_credited,0)) OVER(ORDER BY TRANSACTION_DATE)
-
SUM(NVL(LEAVE_DEBITED,0)) OVER(ORDER BY TRANSACTION_DATE) LASTBALANCE
FROM table_name t
/
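Equivalently, since both window sums share the same frame, the running balance can be folded into a single SUM over the net movement per row; a compact variant of the same query:
SELECT t.*,
SUM(NVL(leave_credited,0) - NVL(leave_debited,0))
OVER(ORDER BY TRANSACTION_DATE) LASTBALANCE
FROM table_name t
/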
I am using Oracle.
You can use sum() over (order by transaction_date) to get a running total, which will handle the lapsed leaves etc.
Script:
select to_date(transaction_date) transaction_date, details,
leave_credited, leave_debited,
sum(leave_credited - leave_debited) over (order by to_date(transaction_date) asc) final_balance
from
(select '29-Sep-2012' transaction_date,
'Sep-2012-Sep-2012' details,
0.11 leave_credited,
0 leave_debited
from dual
union all
select '01-Oct-2012' , 'Oct-2012-Dec-2012' , 2.5 , 0 from dual
union all
select '01-Jan-2013' , 'Jan-2013-Mar-2013' , 2.5 , 0 from dual
union all
select '31-Mar-2013' , 'LAPSE - 540007' , 0, 1.9 from dual
union all
select '01-apr-2013' , 'apr-2013-jun-2013' , 2.5, 0 from dual
union all
select '30-Apr-2013' , 'Lev_102935703' , 0, 0.11 from dual
);
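One caveat on this script: to_date(transaction_date) without a format mask relies on the session's NLS_DATE_FORMAT to parse strings like '29-Sep-2012', so it can break under other NLS settings; an explicit mask makes the conversion deterministic. A sketch of the outer query with the mask added:
select to_date(transaction_date, 'DD-Mon-YYYY') transaction_date, details,
leave_credited, leave_debited,
sum(leave_credited - leave_debited) over (order by to_date(transaction_date, 'DD-Mon-YYYY') asc) final_balance
from
(-- two sample rows shown; the full inline data above works the same way
select '29-Sep-2012' transaction_date, 'Sep-2012-Sep-2012' details, 0.11 leave_credited, 0 leave_debited from dual
union all
select '31-Mar-2013', 'LAPSE - 540007', 0, 1.9 from dual
);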