I create a time series model:
CREATE OR REPLACE MODEL mymodel
OPTIONS(
MODEL_TYPE='ARIMA_PLUS',
TIME_SERIES_TIMESTAMP_COL='date',
TIME_SERIES_DATA_COL='cost',
TIME_SERIES_ID_COL='grp',
HOLIDAY_REGION='GB'
) AS
SELECT grp, date, cost FROM mydata
I then evaluate it:
SELECT *
FROM
ML.EVALUATE(MODEL mymodel,
(SELECT grp, date, cost FROM mydata),
STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level,
12 AS horizon))
But all my metrics and forecasted values are NaN. Could it be that I am using the wrong input data (i.e. (SELECT grp, date, cost FROM mydata) above)? The docs say it is "the data that contains the evaluation data", which is unclear to me.
I want to calculate the cumulative product across rows in Snowflake.
Basically I have monthly rates that, multiplied together, accumulate across time.
(Some databases have the product() SQL function for that).
A trick suggested by Sterling Paramore: sum the logs, then exponentiate:
with data as (select $1 x from values (1),(2),(3),(4),(5))
select x
, sum(x) over(order by x) sum
, exp(sum(ln(x)) over(order by x)) mult
from data
If a built-in function doesn't exist, it's usually possible to roll something custom using a User-Defined Table Function.
In this case:
CREATE OR REPLACE FUNCTION CUMULATIVE_PRODUCT(VALUE double)
RETURNS TABLE (PRODUCT double)
LANGUAGE JAVASCRIPT
AS '{
initialize: function(argumentInfo, context) {
// start with a running product of 1
this.cumulativeProduct = 1;
},
processRow: function(row, rowWriter, context) {
// multiply in the current row's value and emit the running product
this.cumulativeProduct = this.cumulativeProduct * row.VALUE;
rowWriter.writeRow({PRODUCT: this.cumulativeProduct});
}
}';
Example table:
create temp table sample_numbers as (
select 1 as index, 5.1::double as current_value
union all
select 2 as index, 4.3::double as current_value
union all
select 3 as index, 3.7::double as current_value
union all
select 4 as index, 3.9::double as current_value
)
Invoking the UDTF:
select index,current_value,PRODUCT as cumulative_product
from sample_numbers,table(CUMULATIVE_PRODUCT(current_value) over ())
Note the empty over() clause, which forces Snowflake to do a single sequential pass over the data instead of splitting it into parallel chunks.
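For comparison, here is a minimal sketch of the log/exp trick mentioned above applied to the same sample table (assuming no zero or negative values):
select index, current_value,
       exp(sum(ln(current_value)) over (order by index)) as cumulative_product
from sample_numbers;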
The output from AI Platform for a tabular dataset looks something like this:
{
"classes": ["a","b","c"],
"scores": [0.9,0.1,0.0]
}
There are two arrays within a record field: predicted_label.classes holds the labels, and predicted_label.scores holds the scores produced by AI Platform.
I would like to select the class with the highest score, i.e. in the above example I would like an output like row=0, class="a", score=0.9.
UNNEST does not immediately solve my issue from my understanding, as it requires the input to be an array. I believe if the output were a repeated RECORD it would be easier.
What SQL query will enable me to extract the right label from the AI Platform batch results?
Try this:
with testdata as (
select struct(["a", "b", "c"] as classes, [0.9, 0.1, 0.0] as scores) as predicted_label
)
select (
select struct(offset, class, score)
from unnest(predicted_label.classes) as class with offset
join unnest(predicted_label.scores) as score with offset
using (offset)
order by score desc
limit 1
) as highest
from testdata
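With the sample row above, highest comes back as (offset 0, class "a", score 0.9), which matches the desired output from the question.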
You should design your prediction list so that each label and score is represented as a key-value pair.
The corresponding BigQuery schema looks like this:
prediction RECORD REPEATED
prediction.label STRING REQUIRED
prediction.score FLOAT REQUIRED
Why?
This is a correct representation of your real-world situation.
You need no further verification that both lists keep the elements in the correct pairing order (on write and on read).
With two loose lists you create a pitfall that will haunt you.
SQL example
with this_model as (
select [
STRUCT ('a' as label, 0.9 as score)
, STRUCT ('b' as label, 0.1 as score)
, STRUCT ('c' as label, 0.0 as score)
] as prediction
)
select pair.label, pair.score
from this_model, UNNEST(prediction) pair
order by pair.score desc
limit 1;
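If your batch output arrives as the two parallel arrays shown in the question, here is a minimal sketch of converting it into this paired representation (reusing the testdata CTE from the first answer):
with testdata as (
  select struct(["a", "b", "c"] as classes, [0.9, 0.1, 0.0] as scores) as predicted_label
)
select array(
  select as struct class as label, score
  from unnest(predicted_label.classes) as class with offset
  join unnest(predicted_label.scores) as score with offset
  using (offset)
) as prediction
from testdata;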
I am using the query below to derive outliers from my data. Using DISTINCT is creating too much shuffle and the final tasks are taking a huge amount of time to complete. Are there any optimizations that can be done to speed it up?
query = """SELECT
DISTINCT NAME,
PERIODICITY,
PERCENTILE(CAST(AMOUNT AS INT), 0.997) OVER(PARTITION BY NAME, PERIODICITY) as OUTLIER_UPPER_THRESHOLD,
CASE
WHEN PERIODICITY = "WEEKLY" THEN 100
WHEN PERIODICITY = "BI_WEEKLY" THEN 200
WHEN PERIODICITY = "MONTHLY" THEN 250
WHEN PERIODICITY = "BI_MONTHLY" THEN 400
WHEN PERIODICITY = "QUARTERLY" THEN 900
ELSE 0
END AS OUTLIER_LOWER_THRESHOLD
FROM base"""
I would suggest rephrasing this so you can filter before aggregating:
SELECT NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD,
MIN(AMOUNT)
FROM (SELECT NAME, PERIODICITY,
RANK() OVER (PARTITION BY NAME, PERIODICITY ORDER BY AMOUNT) as seqnum,
COUNT(*) OVER (PARTITION BY NAME, PERIODICITY) as cnt,
(CASE . . . END) as OUTLIER_LOWER_THRESHOLD
FROM base
) b
WHERE seqnum >= 0.997 * cnt
GROUP BY NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD;
Note: This ranks duplicate amounts based on the lowest rank. That means that some NAME/PERIODICITY pairs may not be in the results. They can easily be added back in using a LEFT JOIN.
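A minimal sketch of that LEFT JOIN, assuming the filtered query above is available as outliers and its MIN(AMOUNT) has been given an alias, say MIN_AMOUNT:
SELECT p.NAME, p.PERIODICITY, o.OUTLIER_LOWER_THRESHOLD, o.MIN_AMOUNT
FROM (SELECT DISTINCT NAME, PERIODICITY FROM base) p
LEFT JOIN outliers o
  ON o.NAME = p.NAME
 AND o.PERIODICITY = p.PERIODICITY;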
The easiest way to deal with a large shuffle, independent of what the shuffle is, is to use a larger cluster. It's the easiest way because you don't have to think much about it. Machine time is usually much cheaper than human time refactoring code.
The second easiest way to deal with a large shuffle that is the union of some independent and constant parts is to break it into smaller shuffles. In your case, you could run separate queries for each periodicity, filtering the data down before the shuffle and then union the results.
If the first two approaches are not applicable for some reason, it's time to refactor. In your case you are doing two shuffles: first to compute OUTLIER_UPPER_THRESHOLD which you associate with every row and then to distinct the rows. In other words, you are doing a manual, two-phase GROUP BY. Why don't you just group by NAME, PERIODICITY and compute the percentile?
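A minimal sketch of that single-pass GROUP BY, assuming the same base table and keeping the CASE expression from the question:
SELECT
  NAME,
  PERIODICITY,
  PERCENTILE(CAST(AMOUNT AS INT), 0.997) AS OUTLIER_UPPER_THRESHOLD,
  CASE
    WHEN PERIODICITY = "WEEKLY" THEN 100
    WHEN PERIODICITY = "BI_WEEKLY" THEN 200
    WHEN PERIODICITY = "MONTHLY" THEN 250
    WHEN PERIODICITY = "BI_MONTHLY" THEN 400
    WHEN PERIODICITY = "QUARTERLY" THEN 900
    ELSE 0
  END AS OUTLIER_LOWER_THRESHOLD
FROM base
GROUP BY NAME, PERIODICITY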
I am trying to get the standard errors of the betas in a linear regression in BigQuery ML. Sorry if I have missed something basic, but I cannot find the answer to this question.
#standard sql
CREATE OR REPLACE MODEL `DATASET.test_lm`
OPTIONS(model_type='LINEAR_REG', input_label_cols= ["y"]) AS
select * from unnest(ARRAY<STRUCT<y INT64, x float64>> [(1,2.028373),
(2,2.347660),(3,3.429958),(4,5.250539),(5,5.976455)])
You can get the weights (without variances) with:
select * from ml.weights(model `DATASET.test_lm`)
Also, you can calculate the standard errors directly like this:
with dat as (
select * from unnest(ARRAY<STRUCT<y INT64, x float64>> [(1,2.028373), (2,2.347660),(3,3.429958),(4,5.250539),(5,5.976455)])),
#get the residual standard error, using simple df-2
rse_dat as (
select sqrt(sum(e2)/((select count(1) from dat)-2)) as rse from (
select pow(y - predicted_y, 2) as e2 from ml.predict(model `DATASET.test_lm`,
(select * from dat)))),
#get the sum of squared deviations of x
xvar_dat as (
select sum(pow(x - (select avg(x) as xbar from dat),2)) as xvar from dat)
#calculate standard error
select sqrt((select pow(rse,2) from rse_dat)/(select xvar from xvar_dat)) as beta_x_se
But this gets to be a heavy lift for many covariates. Is there a direct way to get this pretty basic statistic for confidence intervals?
You could use ML.ADVANCED_WEIGHTS now, which gives standard errors.
https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-advanced-weights
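A minimal sketch of the call, assuming the model above (see the linked docs for any training options the function requires):
select * from ml.advanced_weights(model `DATASET.test_lm`)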
Please help me to optimize my SQL query.
I have a table with the fields date, commodity_id, exp_month_id, exp_year, price, where the first 4 fields are the primary key. The months are designated with alphabet-ordered letters, e.g. F (for Jan), G (for Feb), H (for March), etc., so the letter of a month more distant from January sorts after the letter of a less distant month (F < G < H < ...). Some commodity_ids have all 12 months in the table, some only 5 or 3, and these sets are constant across years.
I need to calculate the difference between the prices (the gradient) of neighboring records in terms of (exp_month_id, exp_year). As a first step, I want to determine for every pair (exp_month_id, exp_year) the valid pair (next_month_id, next_year). The main problem here is that if the current exp_month_id is the last one in the year, then next_year = exp_year + 1 and next_month_id should be the first one in the year.
I have written the following query to do the job:
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id
FROM futures
ORDER BY exp_month_id
)
SELECT DISTINCT f.commodity_id,
f.exp_month_id,
f.exp_year,
(
WITH [temp] AS (
SELECT exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id
)
SELECT exp_month_id
FROM [temp]
WHERE exp_month_id > f.exp_month_id
UNION ALL
SELECT exp_month_id
FROM [temp]
LIMIT 1
)
AS next_month_id,
(
SELECT CASE WHEN EXISTS (
SELECT commodity_id,
exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id AND
exp_month_id > f.exp_month_id
LIMIT 1
)
THEN f.exp_year ELSE f.exp_year + 1 END
)
AS next_year
FROM futures AS f
This query serves as a base for a dynamic table (a view) which is subsequently used for calculating the gradient. However, the execution of this query takes more than one second, and thus the whole process takes minutes. I wonder if you could help me optimize the query.
Note: The following requires SQLite 3.25 or newer for window function support.
Lack of sample data (preferably as CREATE TABLE and INSERT statements for easy importing) and expected results makes this hard to test, but if your end goal is computing the difference in prices between expiration dates (making your question a bit of an XY problem), maybe something like:
SELECT date, commodity_id, price, exp_year, exp_month_id
, price - lag(price, 1) OVER (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id) AS "change from last price"
FROM futures;
Thanks to the hint from @Shawn to use window functions, I could rewrite the query in a much shorter form:
CREATE VIEW "futures_nextmonths_win" AS
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id,
exp_year
FROM futures)
SELECT commodity_id,
exp_month_id,
exp_year,
lead(exp_month_id) OVER w AS next_month_id,
lead(exp_year) OVER w AS next_year
FROM trading_months
WINDOW w AS (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id);
which is also slightly faster than the original one.
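For completeness, a hedged sketch of how this view might feed the gradient calculation, assuming you want the price difference between a contract and its next expiration quoted on the same date:
SELECT f.date,
       f.commodity_id,
       f.exp_month_id,
       f.exp_year,
       n.price - f.price AS gradient
FROM futures AS f
JOIN futures_nextmonths_win AS v
  ON v.commodity_id = f.commodity_id
 AND v.exp_month_id = f.exp_month_id
 AND v.exp_year = f.exp_year
JOIN futures AS n
  ON n.commodity_id = f.commodity_id
 AND n.exp_month_id = v.next_month_id
 AND n.exp_year = v.next_year
 AND n.date = f.date;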