Processing AI Platform Batch Results in BigQuery - sql

The output from AI Platform for a tabular dataset looks something like this:
{
"classes": ["a","b","c"],
"scores": [0.9,0.1,0.0]
}
There are two arrays within a record field: predicted_label.classes holds the labels, and predicted_label.scores holds the scores produced by AI Platform.
I would like to select the class with the highest score, i.e. in the above example I would like an output like row=0, class="a", score=0.9.
UNNEST does not immediately solve my issue, from my understanding, as it requires the input to be an array. I believe it would be easier if the output were a repeated RECORD.
What SQL query will enable me to extract the right label from the AI Platform batch results?

Try this:
with testdata as (
select struct(["a", "b", "c"] as classes, [0.9, 0.1, 0.0] as scores) as predicted_label
)
select (
select struct(offset, class, score)
from unnest(predicted_label.classes) as class with offset
join unnest(predicted_label.scores) as score with offset
using (offset)
order by score desc
limit 1
) as highest
from testdata
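Run against the testdata above, this returns a single highest struct of (offset=0, class="a", score=0.9), matching the desired output.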

You should design your prediction list so that each label and score is represented as a key-value pair.
The corresponding BigQuery table schema looks like this:
prediction RECORD REPEATED
prediction.label STRING REQUIRED
prediction.score FLOAT REQUIRED
Why?
This is a correct representation of your real-world situation.
You need no further verification that both lists keep their elements in the correct pairing order (on write and on read).
With two loose lists you create a pitfall that will haunt you.
SQL example
with this_model as (
select [
STRUCT ('a' as label, 0.9 as score)
, STRUCT ('b' as label, 0.1 as score)
, STRUCT ('c' as label, 0.0 as score)
] as prediction
)
select pair.label, pair.score
from this_model, UNNEST(prediction) pair
order by pair.score desc
limit 1;
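If your batch results already arrive in the two-array shape from the question, a possible sketch for materializing them into this paired layout (reusing the testdata shape from the answer above; table and column names are illustrative) is:
with testdata as (
select struct(["a", "b", "c"] as classes, [0.9, 0.1, 0.0] as scores) as predicted_label
)
select array(
-- pair each class with the score at the same offset
select as struct class as label, score
from unnest(predicted_label.classes) as class with offset
join unnest(predicted_label.scores) as score with offset
using (offset)
order by offset
) as prediction
from testdata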

Related

Unnest STRUCT with double array in BigQuery

I'm trying to find a way to include some sample data ('Some_Data' below) when sharing a SQL snippet, in order to demonstrate the functionality.
I know I could do this with a series of UNION ALL statements, but I'd like to reduce the line count (not that this matters in this example, but imagine we have 20+ rows).
I thought I could do this using arrays and structs, and I've managed to get it working, but the code looks really messy, and I was hoping someone could suggest how I could simplify it.
WITH
Some_Data AS (
SELECT
[STRUCT
(
["a", "b"] AS letters,
[1, 2] AS numbers
)
] AS Data_Sample
)
SELECT t_numbers, t_letters FROM
((
SELECT letters, numbers
FROM Some_Data s
CROSS JOIN UNNEST(s.Data_Sample)
)) t
CROSS JOIN UNNEST(t.numbers) as t_numbers WITH OFFSET nm
LEFT JOIN UNNEST(t.letters) as t_letters WITH OFFSET lt
ON nm = lt
Output (as expected):
Row | t_numbers | t_letters
1   | 1         | a
2   | 2         | b
Consider the approach below (it looks much simpler and less verbose to me):
select number, letter
from Some_Data t, t.Data_Sample el,
el.letters letter with offset
join el.numbers number with offset
using(offset)
If applied to the sample data in your question, the output is:
number | letter
1      | a
2      | b

Converting arrays to nested fields in BigQuery

I'm streaming Stackdriver logs into BigQuery, and they end up in a textPayload field in the following format:
member_id_hashed=123456789,
member_age -> Float(37.0,244),
operations=[92967,93486,86220,92814,92943,93279,...],
scores=[3.214899,2.3641025E-5,2.5823574,2.3818345,3.9919448,0.0,...],
[etc]
I then define a query/view on the table with the raw logging entries as follows:
SELECT
member_id_hashed as member_id, member_age,
split(operations,',') as operation,
split(scores,',') as score
FROM
(
SELECT
REGEXP_EXTRACT(textPayload, r'member_id=([0-9]+)') as member_id_hashed,
REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
from `myproject.mydataset.mytable`
)
resulting in one row with two scalar fields and two arrays.
Ideally, for further analysis, I would like the two arrays to be nested (e.g. operation.id and operation.score), or to flatten the arrays line by line while keeping the positions (i.e. line 1 of array 1 should appear next to line 1 of array 2, etc.).
Can anybody point me to a way to make nested fields out of the arrays, or to flatten the arrays? I tried unnesting and joining, but that would give me too many possible cross-combinations in the result.
Thanks for your help!
You can zip the two arrays like this. It unnests the array with operation IDs and gets the index of each element, then selects the corresponding element of the array with scores. Note that this assumes that the arrays have the same number of elements. If they don't, you could use SAFE_OFFSET instead of OFFSET in order to get NULL if there are more IDs than scores, for instance.
SELECT
member_id_hashed, member_age,
ARRAY(
SELECT AS STRUCT id, split(scores,',')[OFFSET(off)] AS score
FROM UNNEST(split(operations,',')) AS id WITH OFFSET off
ORDER BY off
) AS operations
FROM (
SELECT
REGEXP_EXTRACT(textPayload, r'member_id=([0-9]+)') as member_id_hashed,
REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
from `myproject.mydataset.mytable`
)
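If the two arrays can end up with different lengths, the SAFE_OFFSET variant mentioned above can be applied like this (a standalone sketch with made-up literals, not the real table); indexing with SAFE_OFFSET returns NULL where scores is shorter instead of raising an error:
with t as (
-- three operations but only two scores, to show the NULL padding
select 'op1,op2,op3' as operations, '0.1,0.2' as scores
)
select array(
select as struct id, split(scores, ',')[safe_offset(off)] as score
from unnest(split(operations, ',')) as id with offset off
order by off
) as operations
from t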

How to overcome the limitation of `PERCENTILE_CONT` that the argument should be constant?

I want to find POINTCOUNT values that cut the input set ADS.PREDICTOR into equally large groups. The parameter POINTCOUNT can have a different value for different predictors, so I don't want to hard-code it.
Unfortunately the code below fails with ORA-30496: Argument should be a constant... How can I overcome this (except for 300 lines of code with hard-coded threshold quantiles, of course)?
define POINTCOUNT=300;
select
*
from (
select
percentile_disc(MYQUANTILE)
within group (
order by PREDICTOR ) as THRESHOLD
from ADS
inner join (
select (LEVEL - 1)/(&POINTCOUNT-1) as MYQUANTILE
from dual
connect by LEVEL <= &POINTCOUNT
)
on 1=1
)
group by THRESHOLD
I want to draw a ROC curve. The curve will be plotted in Excel as a linear interpolation between pairs of points (X, Y) calculated in Oracle.
Each point (X, Y) is calculated using a threshold value.
I will get the best approximation of the ROC curve for a given number of pairs of points if the distance between each adjacent pair of (X, Y) is uniform.
If I cut the domain of the predicted values into N values that separate 1/Nth quantiles, I should get a fairly good set of threshold values.
PERCENTILE_CONT() only requires that the percentile value be constant within each group. You do not have a group by in your subquery, so I think this might fix your problem:
select MYQUANTILE,
percentile_disc(MYQUANTILE) within group (order by PREDICTOR
) as THRESHOLD
from ADS cross join
(select (LEVEL - 1)/(&POINTCOUNT-1) as MYQUANTILE
from dual
connect by LEVEL <= &POINTCOUNT
)
GROUP BY MYQUANTILE;
Also, note that CROSS JOIN is the same as INNER JOIN . . . ON 1=1.
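For illustration (using dual so the statements are runnable as-is), these two queries are equivalent:
-- both return the one-row Cartesian product of the two dual copies
select * from dual d1 cross join dual d2;
select * from dual d1 inner join dual d2 on 1 = 1;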

Query to calculate term frequency * inverse document frequency

I have 2 tables in my Oracle database:
DF (term, doccount)
TF (abstractid, term, freq)
One table is for document frequency (DF), holding terms and a document count; the other is for term frequency (TF), holding the document ID, terms, and frequency.
I want to calculate TF*IDF, where TF = the number of times a term appears in an article (the freq column from table TF) and IDF = log(132225) - log(docCount) + 1.
I want to store my result in a table (TFIDF) with documentID, terms, and the calculated TF*IDF.
Any ideas?
You need to join your TF and DF tables and then insert into the destination TFIDF table.
Try this:
insert into TFIDF (documentID, terms, tf_idf)
select tf.abstractID, df.term, (log(10, 132225) - log(10, df.doccount) + 1) * tf.freq
from tf
join df on tf.term = df.term;
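As a quick sanity check of the formula (the numbers below are purely illustrative, not taken from the question's data):
-- a term with freq = 5 that occurs in 100 documents:
-- IDF = log10(132225) - log10(100) + 1 ≈ 5.12 - 2 + 1 = 4.12
-- TF*IDF ≈ 5 * 4.12 ≈ 20.6
select (log(10, 132225) - log(10, 100) + 1) * 5 as tf_idf_example from dual;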

Thin out many ST_Points

I have many (1,000,000) ST_Points in a Postgres DB with the PostGIS
extension. When I show them on a map, browsers get very busy.
For that reason I would like to write an SQL statement that filters areas of high density down to a single point.
When a user zooms out, Postgres should return only one point out of, say, 100,
but only if those points are close together.
I tried it with this statement:
select a.id, count(*)
from points as a, points as b
where st_dwithin(a.location, b.location, 0.001)
and a.id != b.id
group by a.id
I would call it thinning out, but I did not find anything about it, maybe because I'm
not a native English speaker.
Does anybody have some suggestions?
I agree with tcarobruce that clustering is the term you are looking for. But it is available in PostGIS.
Basically, clustering can be achieved by reducing the number of decimals in the X and Y coordinates and grouping on them:
select
count(*),
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)
from mytable
group by
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)
This results in a table with coordinates and the number of real points at each coordinate. In this particular sample, the rounding to 3 decimals corresponds to the 0.001 tolerance in your initial statement.
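If you also want a geometry to display for each cell rather than just the rounded coordinates, a possible sketch is to rebuild one point per grid cell (SRID 4326 is an assumption; use whatever your data is stored in):
select
count(*) as point_count,
ST_SetSRID(ST_MakePoint(rx, ry), 4326) as cell_point -- one representative point per cell
from (
select
round(cast(ST_X(geom) as numeric),3) as rx,
round(cast(ST_Y(geom) as numeric),3) as ry
from mytable
) g
group by rx, ry;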
You can cluster nearby Points together using ST_ClusterDBSCAN
Then keep all single points and, for example:
(1) select one random point per cluster, or
(2) select the centroid of each point cluster.
I use eps 300 to cluster points together that are within 300 meters.
create table buildings_grouped as
SELECT geom, ST_ClusterDBSCAN(geom, eps := 300, minpoints := 2) over () AS cid
FROM buildings
1:
create table buildings_grouped_keep_random as
select geom, cid from buildings_grouped
where cid is null
union
select * from
(SELECT DISTINCT ON (cid) *
FROM buildings_grouped
ORDER BY cid, random()) sub
2:
create table buildings_grouped_keep_centroid as
select geom, cid from buildings_grouped
where cid is null
union
select st_centroid(st_union(geom)) geom, cid
from buildings_grouped
where cid is not null
group by cid
The term you are looking for is "clustering".
There are client-side libraries that do this, as well as commercial services that do it server-side.
But it's not something PostGIS does natively. (There's a ticket for it.)
You'll probably have to write your own solution, and precompute your clusters ahead of time.
ST_ClusterDBSCAN- and KMeans-based clustering works, but it is very slow for big data sets, so it is practically unusable there. PostGIS functions like ST_SnapToGrid and ST_RemoveRepeatedPoints are faster and can help in some cases. But the best approach, I think, is using PDAL thinning filters such as the sample filter. I recommend using it with PG Point Cloud.
Edit:
ST_SnapToGrid is pretty fast and useful. Here is an example query for triangulation with optimizations:
WITH step1 AS
(
SELECT geometry, ST_DIMENSION(geometry) AS dim FROM table
)
, step2 AS
(
SELECT ST_SIMPLIFYVW(geometry, :tolerance) AS geometry FROM step1 WHERE dim > 0
UNION ALL
(WITH q1 AS
(
SELECT (ST_DUMP(geometry)).geom AS geometry FROM step1 WHERE dim = 0
)
SELECT ST_COLLECT(DISTINCT(ST_SNAPTOGRID(geometry, :tolerance))) FROM q1)
)
, step3 AS
(
SELECT ST_COLLECT(geometry) AS geometry FROM step2
)
SELECT ST_DELAUNAYTRIANGLES(geometry, :tolerance, 0)::BYTEA AS geometry
FROM step3
OFFSET :offset LIMIT :limit;