Performing a prefix computation using SQL without defined procedures

I have a table with a column of integers - I need a way to generate the "prefix" of this column in another table.
For example:
I have 1, 0, 0, 0, 1, 0, 1, 0, 0 as the input
I need 1, 1, 1, 1, 2, 2, 3, 3, 3 as the output
This needs to be done in SQLite's SQL dialect; no user-defined functions or stored procedures are possible.

Try something like this:
select value,
       (select sum(t2.value) from table t2 where t2.id <= t1.id) as accumulated
from table t1
from: SQLite: accumulator (sum) column in a SELECT statement

So to insert from the input table into the output table, you need the following query:
INSERT INTO output
SELECT id,
       (SELECT sum(i1.value) FROM input AS i1 WHERE i1.rowid <= i2.rowid) AS value
FROM input AS i2
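If your SQLite is 3.25 or newer, a window function computes the same running sum in a single pass; this is a sketch under the same assumption as above, namely that rowid defines the row order:
INSERT INTO output
SELECT id,
       SUM(value) OVER (ORDER BY rowid) AS value
FROM input;
The correlated subquery re-scans the table for every row, so it is O(n^2) overall; the window function keeps a running total instead.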

Related

BigQuery ML Standard Scaler "failed to calculate mean"

Trying to build a logistic regression using BigQuery ML, I get the following error:
Failed to calculate mean since the entries in corresponding column 'x' are all NULLs.
Here's a reproducible query - make sure to change your dataset name:
CREATE MODEL `samples.TEST_MODELS_001`
TRANSFORM (
flag,
split_col,
ML.standard_scaler(SAFE_CAST(x as FLOAT64)) OVER() as x
)
OPTIONS
( MODEL_TYPE='LOGISTIC_REG',
AUTO_CLASS_WEIGHTS=TRUE,
INPUT_LABEL_COLS=['flag'],
EARLY_STOP=true,
DATA_SPLIT_METHOD='CUSTOM',
DATA_SPLIT_COL='split_col',
L2_REG = 0.3) AS
SELECT
*
,train_test_split = 0 as split_col
FROM (
select
0 as train_test_split, 1 as flag, "" as x
union all
select 0, 0, "0"
union all
select 0, 1, "1"
union all
select 1, 1, ""
union all
select 1, 0, ""
union all
select 1, 1, "1"
)
The problem seems to be related to scaling, because if I use ML.MIN_MAX_SCALER instead of ML.STANDARD_SCALER it works as expected. I'm not sure why this is happening, as clearly not all values of x are NULL inside the train-test split groups.
I'm wondering if this is actually a bug or if I'm doing something wrong here.
If you use the ML.STANDARD_SCALER function outside the TRANSFORM, it correctly returns the result. According to the documentation on this function:
When this is used in a TRANSFORM clause, the STDDEV and MEAN calculated to standardize the expression are automatically used in prediction.
This means that it had to calculate a MEAN and STDDEV to get the result in the first place, so it seems it should work.
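For example, a quick standalone check (a sketch of my own, run over the same six rows, not part of the original post) returns scaled values rather than failing:
select ml.standard_scaler(safe_cast(x as float64)) over() as x_scaled
from (select "" as x
      union all select "0"
      union all select "1"
      union all select ""
      union all select ""
      union all select "1")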
I reported it as a BigQuery issue here. I suggest subscribing to the issue tracker in order to receive notifications whenever there's an update from the BigQuery team.
Update
This was answered in the issue tracker.
The ML.STANDARD_SCALER function is applied over the training data only, after the split. This means the SQL that effectively runs is the following:
-- training data: 1, null, null
select ml.standard_scaler(x) over() from (select 1 as x)
union all select null as x
union all select null as x
-- Result:
-- null
-- null
-- null
That's the reason why the message mentioned the NULL columns. You can further see that this is the case by adding one record with the value "0" in the x column to the training data.
CREATE MODEL `samples.TEST_MODELS_001`
TRANSFORM (
flag,
split_col,
ML.standard_scaler(SAFE_CAST(x as FLOAT64)) OVER() as x
)
OPTIONS
( MODEL_TYPE='LOGISTIC_REG',
AUTO_CLASS_WEIGHTS=TRUE,
INPUT_LABEL_COLS=['flag'],
EARLY_STOP=true,
DATA_SPLIT_METHOD='CUSTOM',
DATA_SPLIT_COL='split_col',
L2_REG = 0.3) AS
SELECT
*
,train_test_split = 0 as split_col
FROM (
select
0 as train_test_split, 1 as flag, "" as x
union all
select 0, 0, "0"
union all
select 0, 1, "1"
union all
select 1, 1, ""
union all
select 1, 0, ""
union all
select 1, 1, "1"
union all
select 1, 1, "0"
)

SQL Statement Insert Into

I'm getting the following errors no matter what I do; any help would be awesome.
Msg 116, Level 16, State 1, Line 15
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
Msg 109, Level 15, State 1, Line 1
There are more columns in the INSERT statement than values specified in the VALUES clause. The number of values in the VALUES clause must match the number of columns specified in the INSERT statement.
My query:
[tableA].[PROJECTID],
[tableA].[STUDYID],
[tableA].[SUBJNO],
[tableA].[CASENUMBER],
[tableA].[CASESTATUS],
[tableA].[MODIFIEDBY]
)VALUES((
SELECT b.PROJECTID,
((SELECT TOP 1 a.STUDYID FROM [PRODVIEW] a WHERE a.DYNAME = b.DYNAME and
a.ProjID = b.PROJID)) as STUDYID,
b.SUBJNO,
(b.SUBJNO + '_' + b.SEQUENCE) as CaseNumber,
'READY' as CASESTATUS,
b.UPLOADEDBY
FROM [dbo].[TableB] b WHERE VIEWED = 0
AND b.UPLOADEDDATE >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)))
If you want to use a SELECT as the source of the data for an INSERT, then don't use VALUES, which is for inserting literal data:
INSERT INTO yourTable ([PROJECTID], [STUDYID], [SUBJNO], [CASENUMBER], [CASESTATUS],
[MODIFIEDBY])
SELECT
b.PROJECTID,
(SELECT TOP 1 a.STUDYID FROM [PRODVIEW] a
WHERE a.DYNAME = b.DYNAME and a.ProjID = b.PROJID),
b.SUBJNO,
(b.SUBJNO + '_' + b.SEQUENCE),
'READY',
b.UPLOADEDBY
FROM [dbo].[TableB] b
WHERE
VIEWED = 0 AND
b.UPLOADEDDATE >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);
There is probably a way to write your query without a correlated subquery in the select clause, e.g. via a join. Also, your TOP 1 subquery is nondeterministic as written: without an ORDER BY clause, there is no rule for which row counts as "top".
Also note that you don't need the column aliases in the SELECT statement. In fact, they will just be ignored, since the INSERT column list determines the target columns.
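For completeness, a hedged sketch of the APPLY-style rewrite (the ORDER BY column is an assumption, since the original query never said which STUDYID should win):
INSERT INTO yourTable ([PROJECTID], [STUDYID], [SUBJNO], [CASENUMBER], [CASESTATUS], [MODIFIEDBY])
SELECT
    b.PROJECTID,
    a.STUDYID,
    b.SUBJNO,
    (b.SUBJNO + '_' + b.SEQUENCE),
    'READY',
    b.UPLOADEDBY
FROM [dbo].[TableB] b
OUTER APPLY (
    SELECT TOP 1 p.STUDYID
    FROM [PRODVIEW] p
    WHERE p.DYNAME = b.DYNAME AND p.ProjID = b.PROJID
    ORDER BY p.STUDYID  -- assumed tie-breaker; replace with the rule that picks the right row
) a
WHERE
    VIEWED = 0 AND
    b.UPLOADEDDATE >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);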

combine multiple select queries into one to avoid multiple pass over a huge table

A very simplified setup of the problem at hand.
Table A has columns rz_id and sHashA. Table A is very big.
Table B has columns scode and sHashB. There can be many sHashB values
corresponding to a particular scode value. Table B is relatively much
smaller than table A.
For each scode value (about 200 of them) I have to execute a query like the following (scode is 500 in this case).
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 500);
For each scode value I write a query like the one above, so I end up with 200 queries:
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 500);
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 501);
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 502);
...
select count(distinct rz_id) from A where substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 700);
The problem is that this ends up going over the big table 200 times, which is time-consuming. I want to be able to achieve this with a single pass (a single query).
I thought of making a table with as many rows as table A and one additional column per scode in table B, via a query like:
select /*+ streamtable(a) */ a.*,
       if(substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 500), 1, 0) as scode_500,
       if(substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 501), 1, 0) as scode_501,
       ...
       if(substr(sHashA, 1, 5) in (select substr(sHashB, 1, 5) from B where scode = 700), 1, 0) as scode_700
from A a;
This would output a 0 or 1 in each of the 200 scode columns for every row of table A. Later I could sum up the columns to get the counts. Since I am also interested in estimating the overlap of counts between any two scodes, I thought of the above table.
But I get a parse error, and I suspect subqueries are not allowed inside IF expressions.
The question, finally, is this: how do I reduce all those queries into a single query so that I go through the rows of the huge table only once? Please also suggest alternate ways of handling this count, keeping in mind that I am also interested in the overlap.
What about something like this:
select count(distinct A.rz_id), B.scode
from A,B
where substr(A.sHashA, 1, 5) = substr(B.sHashB, 1,5)
and B.scode in (500,501,...)
group by B.scode
A single pass gets all the data.
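For the overlap part of the question, here is a hedged sketch along the same lines (join table A to B twice, so each matching row of A contributes to every pair of scodes it belongs to):
select b1.scode as scode_1, b2.scode as scode_2,
       count(distinct a.rz_id) as overlap_count
from A a
join B b1 on substr(a.sHashA, 1, 5) = substr(b1.sHashB, 1, 5)
join B b2 on substr(a.sHashA, 1, 5) = substr(b2.sHashB, 1, 5)
where b1.scode < b2.scode
group by b1.scode, b2.scode
This still reads A once; the fan-out happens on the much smaller B side.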

Conditional select statement

Consider the following table (snapshot):
I would like to write a query to select rows from the table for which
at least 4 out of the 7 column values (VAL, EQ, EFF, ..., SY) are not NULL.
Any idea how to do that?
Nothing fancy here, just count the number of non-NULL columns per row:
SELECT *
FROM Table1
WHERE
IIF(VAL IS NULL, 0, 1) +
IIF(EQ IS NULL, 0, 1) +
IIF(EFF IS NULL, 0, 1) +
IIF(SIZE IS NULL, 0, 1) +
IIF(FSCR IS NULL, 0, 1) +
IIF(MSCR IS NULL, 0, 1) +
IIF(SY IS NULL, 0, 1) >= 4
Just noticed you tagged sql-server-2005. IIF is SQL Server 2012+, but you can substitute CASE WHEN VAL IS NULL THEN 0 ELSE 1 END.
How about this? Turn your columns into "rows" and use SQL to count the non-NULLs:
select *
from Table1 as t
where
(
select count(*) from (values
(t.VAL), (t.EQ), (t.EFF), (t.SIZE), (t.FSCR), (t.MSCR), (t.SY)
) as a(val) where a.val is not null
) >= 4
I like this solution because it splits the data from the data processing: after you get this derived "table with values", you can do anything with it, and it's easy to change the logic in the future. You can sum, count, or apply any aggregate you want. If it were something like case when t.VAL then ... end + ..., then you would have to change the logic in many places.
For example, suppose you want to sum all non-null elements greater than 2. In this solution you just change count to sum, add a where clause, and you're done. If it were iif(Val is null, 0, 1) + ..., you would first have to work out what to do and then change every item to, for example, case when Val > 2 then Val else 0 end.
sql fiddle demo
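As a concrete sketch of that change (the threshold and filter below are illustrative, not from the question), the sum-over-values-greater-than-2 variant only touches the aggregate and the where clause:
select *
from Table1 as t
where
(
    select sum(a.val) from (values
        (t.VAL), (t.EQ), (t.EFF), (t.SIZE), (t.FSCR), (t.MSCR), (t.SY)
    ) as a(val) where a.val is not null and a.val > 2
) >= 4  -- illustrative threshold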
Since the values are either numeric or NULL, you can use ISNUMERIC() for this:
SELECT *
FROM YourTable
WHERE ISNUMERIC(VAL)+ISNUMERIC(EQ)+ISNUMERIC(EFF)+ISNUMERIC(SIZE)
+ISNUMERIC(FSCR)+ISNUMERIC(MSCR)+ISNUMERIC(SY) >= 4

How to do equivalent of "limit distinct"?

How can I limit a result set to n distinct values of a given column(s), where the actual number of rows may be higher?
Input table:
client_id, employer_id, other_value
1, 2, abc
1, 3, defg
2, 3, dkfjh
3, 1, ldkfjkj
4, 4, dlkfjk
4, 5, 342
4, 6, dkj
5, 1, dlkfj
6, 1, 34kjf
7, 7, 34kjf
8, 6, lkjkj
8, 7, 23kj
Desired output, with the result limited to 5 distinct values of client_id:
1, 2, abc
1, 3, defg
2, 3, dkfjh
3, 1, ldkfjkj
4, 4, dlkfjk
4, 5, 342
4, 6, dkj
5, 1, dlkfj
The platform this is intended for is MySQL.
You can use a subselect:
select * from table where client_id in
(select distinct client_id from table order by client_id limit 5)
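One caveat: some MySQL versions reject LIMIT inside an IN subquery with the error "This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'". If you hit that, wrapping the limited subquery in a derived table is a known workaround:
select * from table where client_id in
    (select client_id from
        (select distinct client_id from table order by client_id limit 5) as t)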
This is for SQL Server; MySQL uses a LIMIT keyword instead of TOP. That may make the query more efficient if you can get rid of the innermost subquery by using LIMIT and DISTINCT in the same subquery. (It looks like Vinko used this method and that LIMIT is correct. I'll leave this here for the second possible answer, though.)
SELECT
client_id,
employer_id,
other_value
FROM
MyTable
WHERE
client_id IN
(
SELECT TOP 5
client_id
FROM
(
SELECT DISTINCT
client_id
FROM
MyTable
) SQ
ORDER BY
client_id
)
Of course, add in your own WHERE clause and ORDER BY clause in the subquery.
Another possibility (compare performance and see which works out better) is:
SELECT
client_id,
employer_id,
other_value
FROM
MyTable T1
WHERE
T1.client_id IN
(
SELECT
T2.client_id
FROM
MyTable T2
WHERE
(SELECT COUNT(DISTINCT T3.client_id) FROM MyTable T3 WHERE T3.client_id < T2.client_id) < 5
)
-- Using a Common Table Expression in Microsoft SQL Server.
-- The LIMIT keyword does not exist in MS SQL.
WITH CTE
AS
(SELECT DISTINCT [COLUMN_NAME]
FROM [TABLE_NAME])
SELECT TOP (5) [COLUMN_NAME]
FROM CTE;
This works for MS SQL if anyone is on that platform:
SET ROWCOUNT 10;
SELECT DISTINCT
column1, column2, column3,...
FROM
Table1
WHERE ...
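Note that SET ROWCOUNT applies to the whole session, not just this statement, so reset it afterwards:
SET ROWCOUNT 0;
Also, Microsoft has deprecated SET ROWCOUNT for data-modification statements; TOP is the recommended alternative for new code.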