BigQuery ML: Running a regression per group and all combined

I have a BQ table that includes multiple groups. I'd like to run a regression per group and all combined.
Regression looks like:
for group id = 1 -> predicted_metric ~ metric 1 + metric 2
for group id = 2 -> predicted_metric ~ metric 1 + metric 2
for group id = 3 -> predicted_metric ~ metric 1 + metric 2
...
for group id = 40 -> predicted_metric ~ metric 1 + metric 2
Is it possible to run these regressions and get the coefficient estimates in a table?

BQML currently doesn't support specifying different group ids in a single CREATE MODEL statement for regression.
As an alternative, you could use BigQuery's procedural language (scripting) to run multiple CREATE MODEL statements (40 in your case), each with a WHERE clause filtering on a single group id, in one script. Similarly, you could run ML.WEIGHTS on each of the 40 trained models to get the coefficients.
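A minimal sketch of how this could look with BigQuery scripting, assuming a table with columns group_id (an integer), predicted_metric, metric_1 and metric_2; all project, dataset, table, model and column names below are placeholders, not from the question:
-- Hypothetical sketch: one linear regression model per group via dynamic SQL.
-- Assumes BigQuery scripting (FOR...IN, EXECUTE IMMEDIATE) driving CREATE MODEL.
FOR g IN (SELECT DISTINCT group_id FROM `my_project.my_dataset.my_table`) DO
  EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE MODEL `my_project.my_dataset.model_group_%d`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['predicted_metric']) AS
    SELECT predicted_metric, metric_1, metric_2
    FROM `my_project.my_dataset.my_table`
    WHERE group_id = %d
  """, g.group_id, g.group_id);
END FOR;

-- For the "all combined" regression, run one more CREATE MODEL without the WHERE filter.
-- Coefficients of one per-group model (repeat per model, or loop similarly and
-- collect the ML.WEIGHTS output into a single results table):
SELECT * FROM ML.WEIGHTS(MODEL `my_project.my_dataset.model_group_1`);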

Related

Create SQL query with dynamic WHERE statement

I'm using Postgresql as my database, in case that's helpful, although I'd like to find a pure SQL approach instead of a Postgresql specific implementation.
I have a large set of test data from manufacturing a piece of electronics, and I'd like to extract from it which units met certain criteria during test, ideally using a separate table that contains the test criteria for each step of manufacturing.
As a simple example, let's say I check the temperature readback from the unit in two different steps of the test. In step 1, the temperature should be in the range of 20C-30C while step 2 should be in the range of 50C-60C.
Let's assume the following table structure with a set of example data (table name 'test_data'):
temperature  step  serial_number
25           1     1
55           2     1
19           1     2
20           2     2
and let's assume the following table that contains the above mentioned pass criteria (table name 'criteria'):
temperature_upper  temperature_lower  step
20                 30                 1
50                 60                 2
At the moment, using a static approach, I can just use the following query:
SELECT * FROM test_data WHERE
( test_data.step = 1 AND test_data.temperature > 20 AND test_data.temperature < 30 ) OR
( test_data.step = 2 AND test_data.temperature > 50 AND test_data.temperature < 60 );
which would effectively yield the following table:
temperature  step  serial_number
25           1     1
55           2     1
I'd like to make my select query more dynamic and, instead of being statically defined, have it construct itself from a list of results from the test_criteria table. The hope is to grow this into a complex query where temperature, voltage and current might be checked in step 1 but only current in step 2, for example.
Thanks for any insight!
You can solve this using a join between the tables:
SELECT t.*
FROM test_data t
INNER JOIN criteria c ON t.step = c.step
AND t.temperature > c.temperature_upper
AND t.temperature < c.temperature_lower
Or, if you want >= and <=:
SELECT t.*
FROM test_data t
INNER JOIN criteria c ON t.step = c.step
AND t.temperature BETWEEN c.temperature_upper AND c.temperature_lower
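To grow this into the multi-parameter case from the question (e.g. temperature, voltage and current checked in some steps but not others), one option is to keep the readings in long form, one row per measured parameter, and key the criteria by (step, parameter). A rough sketch, assuming hypothetical long-form tables test_data(serial_number, step, parameter, reading) and criteria(step, parameter, lower_limit, upper_limit):
SELECT t.*
FROM test_data t
INNER JOIN criteria c ON c.step = t.step
    AND c.parameter = t.parameter
    AND t.reading BETWEEN c.lower_limit AND c.upper_limit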

BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data). I'm now planning to move this data to a partitioned table.
That means I will have 270 partitions. What is the best way to move this data to the partitioned table?
I don't want to run 270 queries, which is a very costly operation, so I'm looking for an optimized solution.
I have multiple tables like this and need to migrate all of them to partitioned tables.
Thanks,
I see three options
Direct Extraction out of the original table:
Actions (how many queries to run) = Days [to extract] = 270
Full Scans (how much data is scanned, measured in full scans of the original table) = Days = 270
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 270 = $459.00
Hierarchical (recursive) Extraction: (described in Mosha's answer)
Actions = 2^log2(Days) - 2 = 510
Full Scans = 2*log2(Days) = 18
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 18 = $30.60
Clustered Extraction: (I will describe it in a sec)
Actions = Days + 1 = 271
Full Scans = [always] 2
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 2 = $3.40
Summary
Method                               Actions Total   Full Scans Total   Cost
Direct Extraction                    270             270                $459.00
Hierarchical (recursive) Extraction  510             18                 $30.60
Clustered Extraction                 271             2                  $3.40
Definitely, for most practical purposes Mosha's solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.
Even though you need to run the query 510 times, the query is "relatively" simple and the orchestration logic is easy to implement with whatever client you usually use.
And the cost savings are quite visible!
From $460 down to $31!
Almost 15 times cheaper!
If you
a) want to lower the cost even further, by yet another factor of 9 (so it will be about 135x lower in total), and
b) like having fun and taking on more of a challenge
- take a look at the third option
“Clustered Extraction” Explanation
Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns – one column per day.
Each column will hold one serialized row for the respective day from the original table.
The total number of rows in this new table will be equal to the number of rows in the most "heavy" day.
This will require just one query (see the example below) with one full scan.
Step 2
After this new table is ready, we extract day by day, querying ONLY the respective column and writing into the final daily table (the daily tables have the very same schema as the original table, and all of them can be pre-created).
This will require 270 queries to be run, with scans approximately equivalent in total (this really depends on how complex your schema is, so it can vary) to one full scan of the original table.
While querying a column, we need to de-serialize the row's value and parse it back into the original schema.
Very simplified example (using BigQuery Standard SQL here):
The purpose of this example is just to give direction, in case you find the idea interesting.
Serialization / de-serialization is extremely simplified to keep the focus on the idea rather than on a particular implementation, which can differ from case to case (it mostly depends on the schema).
So, assume the original table (theTable) looks somewhat like below:
SELECT 1 AS id, "101" AS x, 1 AS ts UNION ALL
SELECT 2 AS id, "102" AS x, 1 AS ts UNION ALL
SELECT 3 AS id, "103" AS x, 1 AS ts UNION ALL
SELECT 4 AS id, "104" AS x, 1 AS ts UNION ALL
SELECT 5 AS id, "105" AS x, 1 AS ts UNION ALL
SELECT 6 AS id, "106" AS x, 2 AS ts UNION ALL
SELECT 7 AS id, "107" AS x, 2 AS ts UNION ALL
SELECT 8 AS id, "108" AS x, 2 AS ts UNION ALL
SELECT 9 AS id, "109" AS x, 2 AS ts UNION ALL
SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
SELECT 20 AS id, "120" AS x, 4 AS ts
Step 1 – transform the table and write the result into tempTable
SELECT
num,
MAX(IF(ts=1, ser, NULL)) AS ts_1,
MAX(IF(ts=2, ser, NULL)) AS ts_2,
MAX(IF(ts=3, ser, NULL)) AS ts_3,
MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
SELECT
ts,
CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,
ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
FROM theTable
)
GROUP BY num
tempTable will look like below:
num  ts_1     ts_2     ts_3      ts_4
1    1|101|1  6|106|2  10|110|3  19|119|4
2    2|102|1  7|107|2  11|111|3  20|120|4
3    3|103|1  8|108|2  12|112|3  null
4    4|104|1  9|109|2  13|113|3  null
5    5|105|1  null     14|114|3  null
6    null     null     15|115|3  null
7    null     null     16|116|3  null
8    null     null     17|117|3  null
9    null     null     18|118|3  null
Here, I am using simple concatenation for serialization
Step 2 – extract rows for a specific day and write the output to the respective daily table.
Please note: in the example below we are extracting rows for ts = 2; this corresponds to column ts_2.
SELECT
CAST(r[OFFSET(0)] AS INT64) AS id, -- parse serialized values back to the original types
r[OFFSET(1)] AS x,
CAST(r[OFFSET(2)] AS INT64) AS ts
FROM (
SELECT SPLIT(ts_2, "|") AS r
FROM tempTable
WHERE ts_2 IS NOT NULL
)
The result will look like below (which is expected):
id x ts
6 106 2
7 107 2
8 108 2
9 109 2
I wish I had more time to write this down, so don't judge too harshly if something is missing – this is more of a directional answer. At the same time the example is pretty reasonable, and if you have a plain, simple schema, almost no extra thinking is required. Of course, with records and nested fields in the schema, the most challenging part is serialization / de-serialization – but that's where the fun is, along with the extra $ savings.
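If the schema does contain records or nested fields, one possible variation on Step 1 is to serialize whole rows as JSON instead of the pipe-delimited CONCAT; this is just a sketch of the idea, not a tested drop-in replacement:
-- Variation on Step 1: serialize each row as JSON so nested fields survive
SELECT
num,
MAX(IF(ts=1, ser, NULL)) AS ts_1,
MAX(IF(ts=2, ser, NULL)) AS ts_2,
MAX(IF(ts=3, ser, NULL)) AS ts_3,
MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
SELECT
ts,
TO_JSON_STRING(t) AS ser,
ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) AS num
FROM theTable AS t
)
GROUP BY num
-- In Step 2, individual fields can then be pulled back out with
-- JSON_EXTRACT_SCALAR(ts_2, '$.id') and so on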
I will add a fourth option to @Mikhail's answer
DML QUERY
Action = 1 query to run
Full scans = 1
Cost = $5 x 0.34 = $1.70 (270 times cheaper than solution #1 \o/)
With the new DML feature of BigQuery you can convert a non-partitioned table into a partitioned one while doing only one full scan of the source table
To illustrate my solution, I will use one of BQ's public tables, namely bigquery-public-data:hacker_news.comments. Below is the table's schema:
name    | type      | description
--------+-----------+---------------------------------------------------------
id      | INTEGER   | ...
by      | STRING    | ...
author  | STRING    | ...
...     |           |
time_ts | TIMESTAMP | human readable timestamp in UTC YYYY-MM-DD hh:mm:ss /!\ /!\ /!\
...     |           |
We are going to partition the comments table based on time_ts
#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS
SELECT *
FROM `bigquery-public-data.hacker_news.comments`
I hope it helps :)
If your data was in sharded tables (i.e. with a YYYYmmdd suffix), you could've used the "bq partition" command. But with the data in a single table, you will have to scan it multiple times, applying different WHERE clauses on your partition key column.
The only optimization I can think of is to do it hierarchically, i.e. instead of 270 queries doing 270 full table scans, first split the table in half, then each half in half, and so on. This way you will need to pay for about 2*log_2(270) ≈ 2*9 = 18 full scans.
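For illustration only, the first level of that hierarchy could look something like this (column and table names are made up; each half is written to its own intermediate table, then split again until every table holds exactly one day):
-- Level 1: two queries, each writing one half of the original table to its own intermediate table
SELECT * FROM mydataset.original_table WHERE the_date <  '2016-05-15'   -- destination: mydataset.first_half
SELECT * FROM mydataset.original_table WHERE the_date >= '2016-05-15'   -- destination: mydataset.second_half
-- Level 2 splits mydataset.first_half and mydataset.second_half the same way, and so on,
-- so each level costs roughly two full scans of the original data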
Once the conversion is done - all the temporary tables can be deleted to eliminate extra storage costs.

How to scale Pivoting in BigQuery?

Let's say I have a music video play stats table, mydataset.stats, for a given day (3B rows, 1M users, 6K artists).
Simplified schema is:
UserGUID String, ArtistGUID String
I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with the play count of each artist by the respective user.
An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn't scale to the numbers in my example.
Can this approach be scaled for my example?
I tried the approach below with up to 6,000 features and it worked as expected. I believe it will work with up to 10K features, which is the hard limit on the number of columns in a table.
STEP 1 - Aggregate plays by user / artist
SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays
FROM [mydataset.stats] GROUP BY 1, 2
STEP 2 – Normalize uid and aid so they are consecutive numbers 1, 2, 3, …
We need this for at least two reasons: a) to make the dynamically created SQL later on as compact as possible and b) to have more usable/friendly column names.
Combined with the first step, it will be:
SELECT u.uid AS uid, a.aid AS aid, plays
FROM (
SELECT userGUID, artistGUID, COUNT(1) AS plays
FROM [mydataset.stats]
GROUP BY 1, 2
) AS s
JOIN (
SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID
Let's write the output to table mydataset.aggs.
STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time.
In my particular example, by experimenting, I found that the basic approach works well for a number of features between 2,000 and 3,000.
To be on the safe side, I decided to use 2,000 features at a time.
The script below dynamically generates a query that is then run to create the partitioned tables:
SELECT 'SELECT uid,' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
The above query produces yet another query, like the one below:
SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid
This should be run and the result written to mydataset.pivot_1_2000.
Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN) we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
STEP 4 – Merge all partial pivot tables into a final pivot table with all features represented as columns in one table.
Same as in the steps above: first we generate the query and then run it.
So, initially we will "stitch" mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, then the result with mydataset.pivot_4001_6000:
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
The output string from above should be run and the result written to mydataset.pivot_1_4000.
Then we repeat STEP 4, like below:
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
The result is to be written to mydataset.pivot_1_6000.
The resulting table has the following schema:
uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
NOTE:
a. I tried this approach only up to 6,000 features and it worked as expected.
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables' sizes are relatively small (30-40 MB), as are the billed bytes. For "before 2016" projects everything is billed as tier 1, but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea.

Efficiently running an SQL query over multiple inputs

Hi, I've got a simulation snapshot that is currently stored in a PostgreSQL database as a table. The schema for the snapshot table is:
simdb=> \d isonew_4.snapshot_102
Table "isonew_4.snapshot_102"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
x | real |
y | real |
z | real |
vx | real |
vy | real |
vz | real |
pot | real |
mass | real |
Indexes:
"snapshot_102_id_idx" btree (id) WITH (fillfactor=100)
I've got a query that calculates the mass enclosed for a single radius fine:
SELECT SUM(mass) AS mass
FROM isonew_4.snapshot_102 AS s
WHERE SQRT(s.x^2 + s.y^2 + s.z^2) < {radius}
However, I would like to run this over a number of different radii.
Since the table has around 100 million rows, it's something I would prefer to do as a SQL query rather than grabbing all of the particles and using something like numpy.histogram in Python to do the binning locally on my machine.
Method #1
This query might work, with for example 10, 20 and 25 as the successive values for the radius:
WITH r(radius) as (values (10),(20),(25))
SELECT radius, SUM(mass) AS mass
FROM isonew_4.snapshot_102 AS s CROSS JOIN r
WHERE SQRT(s.x^2 + s.y^2 + s.z^2) < radius
GROUP BY radius;
The output has two columns: radius and corresponding sum(mass).
Method #2
If the query is too slow because of the CROSS JOIN with the list (presumably, EXPLAIN or better EXPLAIN ANALYZE would tell for sure), a different approach that certainly guarantees a single scan of the big table is to gather all results in a single row, one column per radius, with a generated query looking like this:
SELECT
sum(case when r < 10 then s.mass else 0 end) as radius10,
sum(case when r < 20 then s.mass else 0 end) as radius20,
sum(case when r < 25 then s.mass else 0 end) as radius25
FROM (select mass,SQRT(x^2 + y^2 + z^2) as r from isonew_4.snapshot_102) AS s
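If the list of radii is long, the Method #2 query itself can be generated from that list rather than written by hand; a possible sketch in PostgreSQL, using the example radii:
-- Hypothetical generator: builds the Method #2 SELECT from a list of radii
SELECT 'SELECT '
    || string_agg(
         format('sum(case when r < %s then s.mass else 0 end) as radius%s', radius, radius),
         ', ' ORDER BY radius)
    || ' FROM (select mass, SQRT(x^2 + y^2 + z^2) as r from isonew_4.snapshot_102) AS s'
FROM (VALUES (10),(20),(25)) AS v(radius);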
Method #3
If it's not practical, another completely different approach that might be worth trying would be to pre-compute SQRT(x^2 + y^2 + z^2) in a btree functional index in the hope that the SQL engine can use it with the inequality comparison. Whether this happens and if the query would be faster or not depends mainly on the data distribution.
create index radius_idx on isonew_4.snapshot_102(SQRT(x^2 + y^2 + z^2));
Then use the first query, either repeated with a single radius each time, or method #1 with the GROUP BY and all values at once. If the values are very selective, the execution might be way faster than even a single large sequential scan.
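As noted above, EXPLAIN ANALYZE is the way to confirm whether the planner actually uses the expression index; a minimal check for a single radius might look like this (whether an index scan shows up depends on the data distribution and the selectivity of the radius):
EXPLAIN ANALYZE
SELECT SUM(mass) AS mass
FROM isonew_4.snapshot_102 AS s
WHERE SQRT(s.x^2 + s.y^2 + s.z^2) < 25;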

SQL Help: Complex query

table : metrics
columns:
1. name : Name
2. instance: A name can have several instances
(Name: John, Instances: John at work, John at concert)
3. metric: IQ, KQ, EQ
4. metric_value: Any numeric
Objective of the query
Find out the metrics whose metric_value is 0 for all instances for all names.
Nature of data
A name's metric 'M' for instance 'X' could be 10, but for the same name and the same metric, instance 'Y' could be 0. In this case, 'M' should NOT be returned.
Edit:
Sample data:
NAME  INSTANCE      METRIC  VALUE
John  At work       IQ      0
John  At home       EQ      10
John  At a concert  KQ      0
Jim   At work       IQ      0
Jim   At home       KQ      0
Tina  At home       IQ      100
Tina  At work       EQ      0
Tina  At work       KQ      0
In this case, only KQ should be returned since it is always zero for all Names and their instances.
Are you looking for something like this?
SELECT metric
FROM metrics
GROUP BY metric
HAVING SUM(metric_value) = 0
Here is SQLFiddle demo
UPDATE: If metric_value can have negative values, then use this one:
SELECT metric
FROM metrics
GROUP BY metric
HAVING SUM(ABS(metric_value)) = 0
Here is updated SQLFiddle demo
Even though this looks suspiciously like homework.... see if this gives you what you're after:
SELECT DISTINCT M1.Metric
FROM METRICS M1
WHERE NOT EXISTS (
SELECT * FROM METRICS M2
WHERE M2.metric_value <> 0
AND M1.Metric = M2.Metric
)
A list based on your data:
SELECT name, metric FROM metrics GROUP BY name, metric HAVING SUM(metric_value) = 0
Most of the other answers assume that metric_value is non-negative. The OP says it can be any numeric value. Here are two methods that handle this.
Check on the sum of the absolute values:
SELECT metric
FROM metrics
GROUP BY metric
HAVING SUM(abs(metric_value)) = 0
Explicitly check that there are no non-zero values:
SELECT metric
FROM metrics
GROUP BY metric
HAVING SUM(case when metric_value <> 0 and metric_value is not null then 1 else 0 end) = 0