BigQuery ML Standard Scaler "failed to calculate mean" - sql

Trying to build a logistic regression using BigQuery ML, I get the following error:
Failed to calculate mean since the entries in corresponding column 'x' are all NULLs.
Here's a reproducible query - make sure to change your dataset name:
CREATE MODEL `samples.TEST_MODELS_001`
TRANSFORM (
flag,
split_col,
ML.standard_scaler(SAFE_CAST(x as FLOAT64)) OVER() as x
)
OPTIONS
( MODEL_TYPE='LOGISTIC_REG',
AUTO_CLASS_WEIGHTS=TRUE,
INPUT_LABEL_COLS=['flag'],
EARLY_STOP=true,
DATA_SPLIT_METHOD='CUSTOM',
DATA_SPLIT_COL='split_col',
L2_REG = 0.3) AS
SELECT
*
,train_test_split = 0 as split_col
FROM (
select
0 as train_test_split, 1 as flag, "" as x
union all
select 0, 0, "0"
union all
select 0, 1, "1"
union all
select 1, 1, ""
union all
select 1, 0, ""
union all
select 1, 1, "1"
)
The problem seems to be related to the scaling, because if I use ML.MIN_MAX_SCALER instead of ML.STANDARD_SCALER it works as expected. I'm not sure why this is happening, as clearly not all values of x are NULL inside the train/test split groups.
I'm wondering if this is actually a bug or if I'm doing something wrong here.

If you use the ML.STANDARD_SCALER function outside the TRANSFORM, it correctly returns the result. According to the documentation on this function:
When this is used in a TRANSFORM clause, the STDDEV and MEAN calculated to standardize the expression are automatically used in prediction.
Which means that it has to calculate a MEAN and STDDEV to produce the result in the first place, so it seems like it should work.
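For reference, this is roughly what I mean by using it outside the TRANSFORM, applied to the same toy data as in the question (a sketch, not the model query itself):
SELECT
  ML.STANDARD_SCALER(SAFE_CAST(x AS FLOAT64)) OVER() AS x_scaled
FROM (
  SELECT "" AS x UNION ALL
  SELECT "0" UNION ALL
  SELECT "1" UNION ALL
  SELECT "" UNION ALL
  SELECT "" UNION ALL
  SELECT "1"
)
-- Returns a scaled value for every non-NULL x; here the MEAN and STDDEV are
-- computed over all rows of the query, not over a training split.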
I reported it as a BigQuery issue here. I suggest subscribing to the issue tracker in order to receive notifications whenever there's an update from the BigQuery team.
Update
This was answered in the issue tracker.
The ML.STANDARD_SCALER function is applied over the training data only, after the split. This means that the SQL that effectively runs is equivalent to the following:
-- training data: 1, null, null
select ml.standard_scaler(x) over() as x_scaled
from (
select 1 as x
union all select null as x
union all select null as x
)
-- Result:
-- null
-- null
-- null
That's the reason the message says the column is all NULLs. You can further confirm this by adding one record with the value "0" in the x column to the training data:
CREATE MODEL `samples.TEST_MODELS_001`
TRANSFORM (
flag,
split_col,
ML.standard_scaler(SAFE_CAST(x as FLOAT64)) OVER() as x
)
OPTIONS
( MODEL_TYPE='LOGISTIC_REG',
AUTO_CLASS_WEIGHTS=TRUE,
INPUT_LABEL_COLS=['flag'],
EARLY_STOP=true,
DATA_SPLIT_METHOD='CUSTOM',
DATA_SPLIT_COL='split_col',
L2_REG = 0.3) AS
SELECT
*
,train_test_split = 0 as split_col
FROM (
select
0 as train_test_split, 1 as flag, "" as x
union all
select 0, 0, "0"
union all
select 0, 1, "1"
union all
select 1, 1, ""
union all
select 1, 0, ""
union all
select 1, 1, "1"
union all
select 1, 1, "0"
)

Related

How to create fibonacci with recursive query effectively

You can test my code here:
https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=61a67764f626bfadb1e9594e1ae08229
Code to print the first n terms of the Fibonacci sequence:
with fibo(s2, s1, n) as (
select 1, 1, 1 from dual
union all
select s1 + s2, s2, n + 1 from fibo where n < 12
)
select s2 from fibo;
It works, but it probably uses twice as much memory as needed: each row carries both the nth and (n-1)th terms of the sequence before the final selection.
Therefore I tried with the LAG function:
with fibo(s,n) as(
select 1,1 from dual
union all
select LAG(s, 1, 0) OVER ( ORDER BY s) +LAG(s, 2, 0) OVER ( ORDER BY s),n+1 from fibo where n<12
)
Select * from fibo
But I only get a sequence of 1s (same thing with the LEAD function).
I have tried to understand what happens with this:
with test(s,d1,d2,n) as(
select 1,0,0,1 from dual
union all
select
2*s, LAG(s, 1, 0) OVER (ORDER BY s),
LAG(s, 2, 0) OVER (ORDER BY s), n+1
from test where n<12
)
select * from test
It seems that LAG always returns 0. Is it impossible to use LAG and LEAD in a recursive query? Or am I doing something wrong?
"Recursive" queries are not recursive, they are iterative.
The iteration starts with the anchor (the part(s) of the UNION ALL that doesn't refer to the CTE).
For each following iteration, the result set of the previous iteration (aliased by the CTE) is used as the input.
The iteration stops when the result set is empty.
In your specific attempt the anchor returns a single record, and so does every following iteration. Since each iteration only ever sees that one row, LAG(s, 1, 0) and LAG(s, 2, 0) have nothing to look back at, so they always return their default value 0.
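To see why, here is a minimal sketch of what the window inside any single iteration looks like, just one row, so both LAG calls fall back to their default of 0:
select s,
       lag(s, 1, 0) over (order by s) as prev1,
       lag(s, 2, 0) over (order by s) as prev2
from (select 1 as s from dual);
-- prev1 = 0, prev2 = 0 : there is no previous row within the one-row window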

Random data sampling with oracle sql, data generation

I need to generate some sample data from a population. I want to do this with an SQL query on an Oracle 11g database.
Here is a simple working example with population size 4 and sample size 2:
with population as (
select 1 as val from dual union all
select 2 from dual union all
select 3 from dual union all
select 4 from dual)
select val from (
select val, dbms_random.value(0,10) AS RANDORDER
from population
order by randorder)
where rownum <= 2
(The Oracle SAMPLE() function didn't work in combination with the WITH clause for me.)
But now I want to "upscale" or multiply my sample data, so that I can get something like 150% of the population as sample data (population size 4 and sample size 6, for example).
Is there a good way to achieve this with an SQL query?
You could use CONNECT BY:
with population(val, randorder) as (
select level, dbms_random.value(0,10) AS RANDORDER
from dual
connect by level <= 6
ORDER BY RANDORDER
)
select val
FROM population
WHERE rownum <= 4;
db<>fiddle demo
The solution depends on what you want. If you want all rows from the first complete set(s) and random additional rows from the last one, then use:
with params(size_, sample_) as (select 4, 6 from dual)
select val
from (
select mod(level - 1, size_) + 1 val, sample_,
case when level <= size_ * floor(sample_ / size_) then 0
else dbms_random.value()
end rand
from params
connect by level <= size_ * ceil(sample_ / size_)
order by rand)
where rownum <= sample_
But if you allow the possibility of a result like (1, 1, 2, 2, 3, 3), where some values may not appear in the output at all (here 4), then use this:
with params(size_, sample_) as (select 4, 6 from dual)
select val
from (
select mod(level - 1, size_) + 1 val, sample_, dbms_random.value() rand
from params
connect by level <= size_ * ceil(sample_ / size_)
order by rand)
where rownum <= sample_
How does it work? We build the set (1, 2, 3, 4) as many times as the division sample / size requires. Then we assign random sort values. In the first case I assign 0 to the first complete set(s), so they are guaranteed to appear in the output, and random values only to the last set. In the second case random values are assigned to all rows.
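A quick (hypothetical) way to convince yourself of the difference is to count the distinct values each variant returns:
with params(size_, sample_) as (select 4, 6 from dual)
select count(distinct val) as distinct_vals
from (
select mod(level - 1, size_) + 1 val, sample_,
case when level <= size_ * floor(sample_ / size_) then 0
else dbms_random.value()
end rand
from params
connect by level <= size_ * ceil(sample_ / size_)
order by rand)
where rownum <= sample_;
-- First variant: always 4, every population value appears at least once.
-- Second variant (dbms_random.value() for all rows): can be less than 4,
-- e.g. (1, 1, 2, 2, 3, 3) contains only 3 distinct values.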

Conditional select statement

Consider the following table (snapshot):
I would like to write a query to select rows from the table for which at least 4 out of the 7 column values (VAL, EQ, EFF, ..., SY) are not NULL.
Any idea how to do that?
Nothing fancy here, just count the number of non-NULL values per row:
SELECT *
FROM Table1
WHERE
IIF(VAL IS NULL, 0, 1) +
IIF(EQ IS NULL, 0, 1) +
IIF(EFF IS NULL, 0, 1) +
IIF(SIZE IS NULL, 0, 1) +
IIF(FSCR IS NULL, 0, 1) +
IIF(MSCR IS NULL, 0, 1) +
IIF(SY IS NULL, 0, 1) >= 4
Just noticed you tagged sql-server-2005. IIF is SQL Server 2012+, but you can substitute CASE WHEN VAL IS NULL THEN 0 ELSE 1 END.
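In full, the 2005-compatible version would look something like this (same logic, just CASE instead of IIF):
SELECT *
FROM Table1
WHERE
CASE WHEN VAL IS NULL THEN 0 ELSE 1 END +
CASE WHEN EQ IS NULL THEN 0 ELSE 1 END +
CASE WHEN EFF IS NULL THEN 0 ELSE 1 END +
CASE WHEN SIZE IS NULL THEN 0 ELSE 1 END +
CASE WHEN FSCR IS NULL THEN 0 ELSE 1 END +
CASE WHEN MSCR IS NULL THEN 0 ELSE 1 END +
CASE WHEN SY IS NULL THEN 0 ELSE 1 END >= 4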
How about this? Turn your columns into "rows" and use SQL to count the non-NULLs:
select *
from Table1 as t
where
(
select count(*) from (values
(t.VAL), (t.EQ), (t.EFF), (t.SIZE), (t.FSCR), (t.MSCR), (t.SY)
) as a(val) where a.val is not null
) >= 4
I like this solution because it splits the data from the data processing: after you get this derived "table of values", you can do anything with it, and it's easy to change the logic in the future. You can sum, count, or apply any aggregate you want. If it were something like case when t.VAL then ... end + ..., you would have to change the logic in many places.
For example, suppose you want to sum all non-null elements greater than 2. In this solution you just change count to sum, add a where clause, and you're done (see the sketch below). With iif(Val is null, 0, 1) + ..., you'd first have to work out what should happen and then change every item to, for example, case when Val > 2 then Val else 0 end.
sql fiddle demo
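For instance, a sketch of that "sum of non-null values greater than 2" variant (the >= 4 threshold is just carried over from the original example):
select *
from Table1 as t
where
(
select sum(a.val) from (values
(t.VAL), (t.EQ), (t.EFF), (t.SIZE), (t.FSCR), (t.MSCR), (t.SY)
) as a(val) where a.val is not null and a.val > 2
) >= 4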
Since the values are either numeric or NULL you can use ISNUMERIC() for this:
SELECT *
FROM YourTable
WHERE ISNUMERIC(VAL)+ISNUMERIC(EQ)+ISNUMERIC(EFF)+ISNUMERIC(SIZE)
+ISNUMERIC(FSCR)+ISNUMERIC(MSCR)+ISNUMERIC(SY) >= 4

Oracle SQL: SUM(CASE WHEN ... THEN quantity ELSE 0 END) OVER (PARTITION BY ...) - can't get the GROUP BY statement right

I'm trying to select several different sums, one of them being OVER (PARTITION BY column_also_in_select_plan).
However, I never seem to be able to get the GROUP BY statement right.
Example:
Select 1, 2, 3, sum(4) over (partition by 3), sum(case when 6 = etc...)
FROM table
Where filters
GROUP BY ?
Thanks for any tips :)
It doesn't really make much sense to be doing aggregation and using window functions at the same time, so I'm not surprised you're confused. In the above example, you probably want to move the windowing to an outer query, that is:
select 1, 2, 3, sum(sum4) over(partition by 3), ...
from (
select 1, 2, 3, sum(4) as sum4
from table
where filters
group by 1, 2, 3
) x
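To make the pattern concrete, here is a sketch with hypothetical names (an orders table with region, product, qty and order_date columns) in place of the numeric placeholders:
-- aggregate in the inner query, then window over the aggregated result
select region, product, total_qty,
       sum(total_qty) over (partition by region) as region_total
from (
select region, product, sum(qty) as total_qty
from orders
where order_date >= date '2020-01-01'
group by region, product
) x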

Performing a prefix computation using SQL without defined procedures

I have a table with a column of integers - I need a way to generate the "prefix" sums of this column in another table.
For example:
I have 1, 0, 0, 0, 1, 0, 1, 0, 0 as the input
I need 1, 1, 1, 1, 2, 2, 3, 3, 3 as the output
This needs to be done in SQLite's SQL dialect; no user-defined functions or stored procedures are possible.
Try something like this:
select value,
(select sum(t2.value) from table t2 where t2.id <= t1.id ) as accumulated
from table t1
from: SQLite: accumulator (sum) column in a SELECT statement
So to insert from the input table into the output table, you need the following query:
INSERT INTO output
SELECT i2.rowid,
(SELECT sum(i1.value) FROM input AS i1 WHERE i1.rowid <= i2.rowid) as VALUE
FROM input AS i2
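A small self-contained check of the idea, assuming a one-column table named input (names here are just for illustration):
CREATE TABLE input(value INTEGER);
INSERT INTO input(value) VALUES (1),(0),(0),(0),(1),(0),(1),(0),(0);

-- correlated subquery sums every row up to and including the current rowid
SELECT i2.rowid,
       (SELECT sum(i1.value) FROM input AS i1 WHERE i1.rowid <= i2.rowid) AS running_sum
FROM input AS i2;
-- running_sum: 1, 1, 1, 1, 2, 2, 3, 3, 3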