Get standard errors in BigQuery ML linear regression - sql

I am trying to get the standard errors of the betas in a linear regression in BigQuery ML. Sorry if I have missed something basic, but I cannot find the answer to this question.
#standard sql
CREATE OR REPLACE MODEL `DATASET.test_lm`
OPTIONS(model_type='LINEAR_REG', input_label_cols= ["y"]) AS
select * from unnest(ARRAY<STRUCT<y INT64, x float64>> [(1,2.028373),
(2,2.347660),(3,3.429958),(4,5.250539),(5,5.976455)])
You can get the weights (without their variance) with
select * from ml.weights(model `DATASET.test_lm`)
Also, you can calculate the standard errors directly like this:
with dat as (
select * from unnest(ARRAY<STRUCT<y INT64, x float64>> [(1,2.028373), (2,2.347660),(3,3.429958),(4,5.250539),(5,5.976455)])),
#get the residual standard error, using simple df-2
rse_dat as (
select sqrt(sum(e2)/((select count(1) from dat)-2)) as rse from (
select pow(y - predicted_y, 2) as e2 from ml.predict(model `DATASET.test_lm`,
(select * from dat)))),
#get the variance of x
xvar_dat as (
select sum(pow(x - (select avg(x) as xbar from dat),2)) as xvar from dat)
#calculate the standard error of beta_x
select sqrt((select pow(rse,2) from rse_dat)/(select xvar from xvar_dat)) as beta_x_se
But this gets to be a heavy lift for many covariates. Is there a direct way to get this pretty basic statistic for confidence intervals?

You could use ML.ADVANCED_WEIGHTS now, which gives standard errors.
https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-advanced-weights
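For example, a minimal sketch against the model created in the question (the model may need to meet the requirements described in the linked documentation):
select * from ml.advanced_weights(model `DATASET.test_lm`)
The output should include a standard error alongside each weight.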

Related

SQL ORDER BY before TABLESAMPLE BERNOULLI, or in Python

I am performing a query on PostgreSQL using Python (psycopg2).
The data is point geometries, stored in patches of 600 points per patch.
I am trying to streamline and speed up the process. Previously I would do the following:
explode the geometry
order by x, y, z
save the result to a new table
Use TABLESAMPLE BERNOULLI(1) to sample the data to 1%
save back to the database
To speed things up I'm trying to reduce the amount of writing to the database and keep the data in python as much as possible.
The old code:
Exploding the patches
query = sql.SQL("""INSERT INTO {}.{} (x,y,z)
SELECT
st_x(PC_EXPLODE(pa)::geometry) as x,
st_y(PC_EXPLODE(pa)::geometry) as y,
st_z(PC_EXPLODE(pa)::geometry) as z
from "public".{} order by x,y,z;""").format(
*map(sql.Identifier, (schema_name, table_name2, table_name1)))
sampling the data:
query2 = ("CREATE TABLE {}.{} AS (SELECT * FROM {}.{} TABLESAMPLE BERNOULLI ({}))".format(
schema, table_name_base, schema, imported_table_name_base, sample_base))
This works, but I would like to either:
A) Perform this as a single query, so explode --> order by --> sample.
B) Perform the explode in SQL, then sample in python.
For A) I have attempted to nest/subquery but PostgreSQL will not allow TABLESAMPLE to work on anything that isn't a table or a view.
For B) I use data = gpd.read_postgis(query, con=conn) to get the data directly into a geopandas dataframe, so sorting is then easy, but how do I perform the equivalent of TABLESAMPLE BERNOULLI on a geopandas dataframe?
Option A is my preferred option, but it might be useful to test option B in case I end up allowing different sampling methods.
Edit:
This is the visual result of:
query = """
SELECT
PC_EXPLODE(pa)::geometry as geom,
st_x(PC_EXPLODE(pa)::geometry) as x,
st_y(PC_EXPLODE(pa)::geometry) as y,
st_z(PC_EXPLODE(pa)::geometry) as z
FROM {}.{}
TABLESAMPLE BERNOULLI ({})
ORDER BY x,y,z, geom
;
""".format(schema, pointcloud, sample)
I am a little lost. A random sample is a random sample and doesn't depend on the ordering. If you want a sample that depends on the ordering, then use an nth sample. That would be:
select t.*
from (select t.*,
row_number() over (order by x, y, z) as seqnum
from (select st_x(PC_EXPLODE(pa)::geometry) as x,
st_y(PC_EXPLODE(pa)::geometry) as y,
st_z(PC_EXPLODE(pa)::geometry) as z
from "public".{}
) t
) t
where seqnum % 100 = 1;
Or perhaps you just want to take the sample and then order afterwards, which you can also do with a subquery.
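For example, a sketch of that last approach, using a hypothetical patch table mytable in place of the real table name (the sample is taken from the base table, and the ordering happens afterwards on the exploded points):
select x, y, z
from (select st_x(PC_EXPLODE(pa)::geometry) as x,
             st_y(PC_EXPLODE(pa)::geometry) as y,
             st_z(PC_EXPLODE(pa)::geometry) as z
      from "public".mytable tablesample bernoulli (1)
     ) t
order by x, y, z;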

BigQuery to create features, is it suitable or will it run out of Memory?

Let's imagine this scenario. I have a Bitcoin transaction dataset, ~1 TB in size.
I would like to create features to train a machine learning application. For example, a simple feature can be:
WITH btc AS (SELECT * FROM bitcoin.transactions),
price_feature AS (SELECT datetime, AVG(price) from btc GROUP BY 1)
SELECT * FROM price_feature
However, what if I have 100000s of such features? I know that for each new feature, I can run:
WITH btc AS (SELECT * FROM bitcoin.transactions),
some_feature AS ()
SELECT * FROM some_feature
But at $5 per TB scanned, this would cost me $5 * 100,000 if I were to run one query per feature.
Is it possible to use BQ for something like:
WITH btc AS (SELECT * FROM bitcoin.transactions),
feature_1 AS (),
feature_2 AS (),
feature_... AS (),
feature_n AS ()
SELECT * FROM feature_1
UNION ALL
SELECT * FROM feature_2
...
The problem is that even independent features sometimes take ~1-2 minutes of BQ running time. Will I be able to put them all into one query? I feel like I will face all sorts of memory problems.
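To illustrate the pattern being asked about, here is a minimal sketch with two hypothetical features (average price and transaction count per datetime, assuming the datetime and price columns from the example above), giving each feature the same output schema so the UNION ALL lines up:
WITH btc AS (SELECT * FROM bitcoin.transactions),
-- hypothetical feature 1: average price per datetime
price_feature AS (
  SELECT datetime, 'avg_price' AS feature_name, AVG(price) AS feature_value
  FROM btc GROUP BY datetime),
-- hypothetical feature 2: number of transactions per datetime
count_feature AS (
  SELECT datetime, 'tx_count' AS feature_name, CAST(COUNT(*) AS FLOAT64) AS feature_value
  FROM btc GROUP BY datetime)
SELECT * FROM price_feature
UNION ALL
SELECT * FROM count_feature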

Finding standard deviation using basic math functions

I am trying to get the standard deviation from a table containing income values, using the basic math functions below in PostgreSQL.
This is what I tried:
SELECT sqrt(sum(power(income - (sum(income) / count(income)), 2)) / (count(*) - 1)) FROM income_data
however, I keep getting the following error:
ERROR: aggregate function calls cannot be nested
Has anyone run into this issue? I feel like the logic for obtaining the standard deviation should work, although I haven't had any luck thus far. I appreciate any suggestions on how to resolve it.
You should calculate the mean in a separate query, e.g. in a WITH statement:
with mean as (
select sum(income) / count(income) as mean
from income_data
)
select sqrt(sum(power(income - mean, 2)) / (count(*) - 1))
from income_data
cross join mean;
or in a derived table:
select sqrt(sum(power(income - mean, 2)) / (count(*) - 1))
from income_data
cross join (
select sum(income) / count(income) as mean
from income_data
) s;
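If you prefer a single statement, another way to avoid the nested-aggregate error is to attach the mean to each row with a window function inside a derived table; a sketch along the same lines:
select sqrt(sum(power(income - mean, 2)) / (count(*) - 1))
from (
    select income, avg(income) over () as mean
    from income_data
) s;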

How to overcome the limitation of `PERCENTILE_CONT` that the argument should be constant?

I want to find POINTCOUNT values that cut the input set ADS.PREDICTOR into equally large groups. The parameter POINTCOUNT can have a different value for different predictors, so I don't want to hard-code it in the code.
Unfortunately the code below fails with ORA-30496: Argument should be a constant... How can I overcome this (except for 300 lines of code with hard-coded threshold quantiles, of course)?
define POINTCOUNT=300;
select
*
from (
select
percentile_disc(MYQUANTILE)
within group (
order by PREDICTOR ) as THRESHOLD
from ADS
inner join (
select (LEVEL - 1)/(&POINTCOUNT-1) as MYQUANTILE
from dual
connect by LEVEL <= &POINTCOUNT
)
on 1=1
)
group by THRESHOLD
I want to draw a ROC curve. The curve will be plotted in Excel as a linear interpolation between pairs of points (X, Y) calculated in Oracle.
Each point (X, Y) is calculated using a threshold value.
I will get the best approximation of the ROC curve for a given number of pairs of points if the distance between each adjacent pair of (X, Y) is uniform.
If I cut the domain of the predicted values into N values that separate the 1/Nth quantiles, I should get a fairly good set of threshold values.
PERCENTILE_CONT() only requires that the percentile value be constant within each group. You do not have a group by in your subquery, so I think this might fix your problem:
select MYQUANTILE,
percentile_disc(MYQUANTILE) within group (order by PREDICTOR
) as THRESHOLD
from ADS cross join
(select (LEVEL - 1)/(&POINTCOUNT-1) as MYQUANTILE
from dual
connect by LEVEL <= &POINTCOUNT
)
GROUP BY MYQUANTILE;
Also, note that CROSS JOIN is the same as INNER JOIN . . . ON 1=1.

Thin out many ST_Points

I have many (1,000,000) ST_Points in a Postgres DB with the PostGIS extension. When I show them on a map, the browser gets very busy.
For that reason I would like to write an SQL statement which filters the high density down to only one point.
When a user zooms out, Postgres should give back only one point out of, say, 100 ST_Points, but only if these points are close together.
I tried it with this statement:
select a.id, count(*)
from points as a, points as b
where st_dwithin(a.location, b.location, 0.001)
and a.id != b.id
group by a.id
I would call it thinning out, but did not find anything, maybe because I'm not a native English speaker.
Does anybody have some suggestions?
I agree with tcarobruce that clustering is the term you are looking for. But it is available in PostGIS.
Basically, clustering can be achieved by reducing the number of decimals in the X and Y coordinates and grouping on them:
select
count(*),
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)
from mytable
group by
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)
This will result in a table with coordinates and the number of real points at each coordinate. In this particular sample, you are rounding to 3 decimals, i.e. 0.001, like in your initial statement.
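If you also want one actual point back per group (for example, to plot on the map), a sketch along the same lines, assuming the same hypothetical mytable, is to collapse each group to its centroid:
select
count(*) as point_count,
ST_Centroid(ST_Collect(geom)) as geom
from mytable
group by
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)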
You can cluster nearby points together using ST_ClusterDBSCAN, then keep all single points and, for example:
Select one random point per cluster, or
Select the centroid of each point cluster.
I use eps := 300 to cluster points together that are within 300 meters of each other.
create table buildings_grouped as
SELECT geom, ST_ClusterDBSCAN(geom, eps := 300, minpoints := 2) over () AS cid
FROM buildings
1:
create table buildings_grouped_keep_random as
select geom, cid from buildings_grouped
where cid is null
union
select * from
(SELECT DISTINCT ON (cid) *
FROM buildings_grouped
ORDER BY cid, random()) sub
2:
create table buildings_grouped_keep_centroid as
select geom, cid from buildings_grouped
where cid is null
union
select st_centroid(st_union(geom)) geom, cid
from buildings_grouped
where cid is not null
group by cid
The term you are looking for is "clustering".
There are client-side libraries that do this, as well as commercial services that do it server-side.
But it's not something PostGIS does natively. (There's a ticket for it.)
You'll probably have to write your own solution, and precompute your clusters ahead of time.
ST_ClusterDBSCAN- and KMeans-based clustering works, but it is very slow for big data sets, so it is practically unusable there. PostGIS functions like ST_SnapToGrid and ST_RemoveRepeatedPoints are faster and can help in some cases. But the best approach, I think, is using PDAL thinning filters like the sample filter. I recommend using it with PG Point Cloud.
Edit:
ST_SnapToGrid is pretty fast and useful. Here is an example query for triangulation with optimizations:
WITH step1 AS
(
SELECT geometry, ST_DIMENSION(geometry) AS dim FROM table
)
, step2 AS
(
SELECT ST_SIMPLIFYVW(geometry, :tolerance) AS geometry FROM step1 WHERE dim > 0
UNION ALL
(WITH q1 AS
(
SELECT (ST_DUMP(geometry)).geom AS geometry FROM step1 WHERE dim = 0
)
SELECT ST_COLLECT(DISTINCT(ST_SNAPTOGRID(geometry, :tolerance))) FROM q1)
)
, step3 AS
(
SELECT ST_COLLECT(geometry) AS geometry FROM step2
)
SELECT ST_DELAUNAYTRIANGLES(geometry, :tolerance, 0)::BYTEA AS geometry
FROM step3
OFFSET :offset LIMIT :limit;