How to run an array UDF on a BigQuery column - google-bigquery

I would like to process a column in a table as an array using a User Defined Function (UDF) written in Javascript. The function prototype of the UDF looks like this:
CREATE TEMPORARY FUNCTION Summary(items ARRAY<FLOAT64>, seed FLOAT64)
RETURNS ARRAY<FLOAT64>
I would like to turn the output array into an additional column.
If the column of my table containing the FLOAT64 elements is named computed_items, how would I create a new column, item_summaries, containing one value per row of the output of the function applied to computed_items?

Whenever possible, try to make the UDF run on each row rather than on one array holding all rows. If that is not possible, first check that the total number of rows will not exceed the limits on array size.
Your UDF takes an input array items, holding the values of a column, plus a seed value. Add an array of row ids as a further input. The output of the UDF cannot be an array of rows, but it can be an array of structs of id and item value.
See the JS body of the UDF for details. As an example, the new item value is calculated here as the old value plus the seed value.
If your table already has a unique row number, you can skip the first part: your table already plays the role of h1 (see the simplified sketch after the query below).
Your table is given as tbl, and we generate some data. For testing we shuffle the order of the table with order by rand(). The helper table adds a row_number as compute_id. In the next steps we need to query this table twice and must obtain the same row_number mapping both times; this is achieved by pinning the table down as a recursive CTE h1.
The table h2 applies your UDF Summary: the input rows are built up into arrays with the array_agg function, and the output array is named Summary_. Then Summary_ is unnested and the columns are renamed.
The final table h3 joins the Summary output back to the table on the row numbers in column compute_id.
create temp function Summary(items ARRAY<FLOAT64>, id ARRAY<INT64>, seed FLOAT64)
RETURNS ARRAY<STRUCT<computed_items FLOAT64, id INT64>>
language js as
"""
// Example calculation: new item value = old value + seed.
var calculated_items = [];
for (let i in items) {
  calculated_items[i] = items[i] + seed;
}
// Pair each calculated value with its row id so the result can be joined back.
var out = [];
for (let i in calculated_items) {
  out[i] = {computed_items: calculated_items[i], id: id[i]};
}
return out;
""";
with recursive
tbl as (
  select x, x + 1000 as computed_items
  from unnest(generate_array(0, 1000)) x
  order by rand()
),
helper as (
  select row_number() over () as compute_id, x, computed_items
  from tbl
),
h1 as (
  select * from helper
  union all
  select * from h1 where false
),
h2 as (
  select t0.computed_items, t0.id as compute_id
  from (
    select Summary(array_agg(computed_items * 1.0), array_agg(compute_id), 10000.0) as Summary_
    from h1
  ) as XX,
  unnest(XX.Summary_) as t0
),
h3 as (
  select * from h1 left join h2 using (compute_id)
)
select * from h3
order by x
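If your table already has a unique INT64 key, the helper/h1 detour above is unnecessary. A minimal sketch, assuming a table my_table with a unique INT64 column row_id (both names hypothetical) and reusing the Summary UDF defined above:

with h2 as (
  select t0.computed_items as item_summaries, t0.id as row_id
  from (
    -- assumed: my_table has columns row_id (unique INT64) and computed_items
    select Summary(array_agg(computed_items * 1.0), array_agg(row_id), 10000.0) as Summary_
    from my_table
  ) as XX,
  unnest(XX.Summary_) as t0
)
select *
from my_table
left join h2 using (row_id)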

Related

Select rows according to another table with a comma-separated list of items

Have a table test.
select b from test
b is a text column and contains Apartment,Residential
The other table is a parcel table with a classification column. I'd like to use test.b to select the right classifications in the parcels table.
select * from classi where classification in(select b from test)
this returns no rows
select * from classi where classification =any(select '{'||b||'}' from test)
same story with this one
I may make a function to loop through the b column but I'm trying to find an easier solution
Test case:
create table classi as
select 'Residential'::text as classification
union
select 'Apartment'::text as classification
union
select 'Commercial'::text as classification;
create table test as
select 'Apartment,Residential'::text as b;
You don't actually need to unnest the array:
SELECT c.*
FROM classi c
JOIN test t ON c.classification = ANY (string_to_array(t.b, ','));
db<>fiddle here
The problem is that = ANY takes a set or an array, and IN takes a set or a list, and your ambiguous attempts resulted in Postgres picking the wrong variant. My formulation makes Postgres expect an array as it should.
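A quick illustration of the two right-hand sides (values taken from the test case above):

SELECT 'Apartment' IN ('Apartment', 'Residential');           -- IN with a list of values
SELECT 'Apartment' = ANY (ARRAY['Apartment', 'Residential']); -- = ANY with an array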
For a detailed explanation see:
How to match elements in an array of composite type?
IN vs ANY operator in PostgreSQL
Note that my query also works for multiple rows in table test. Your demo only shows a single row, which is a corner case for a table ...
But also note that multiple rows in test may produce (additional) duplicates. You'd have to fold duplicates or switch to a different query style to de-duplicate. Like:
SELECT c.*
FROM classi c
WHERE EXISTS (
SELECT FROM test t
WHERE c.classification = ANY (string_to_array(t.b, ','))
);
This prevents duplicates arising from multiple matching elements within a single test.b, as well as from matches across multiple test.b rows: EXISTS returns each row from classi at most once by definition.
The most efficient query style depends on the complete picture.
You need to first split b into an array and then get the rows. A couple of alternatives:
select *
from nj.parcels p
where classification = any(select unnest(string_to_array(b, ',')) from test);

select p.*
from nj.parcels p
inner join (select unnest(string_to_array(b, ',')) from test) t(classification)
  on t.classification = p.classification;
Essential to both is the unnest surrounding string_to_array.
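For illustration, the inner expression on its own already turns the comma-separated string into one row per element:

select unnest(string_to_array('Apartment,Residential', ',')) as classification;

 classification
----------------
 Apartment
 Residential
(2 rows)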

Unnest string array and transpose in Big query

I'm using BigQuery. I have a table A with a string array column, and I need to cast its elements to INT64/STRING (if possible) so I can join with table B, which is keyed on INT64/STRING.
The main ask here is:
I have a table A where a string array is mapped to a Ref ID, as below:
I'm trying to unnest it, and my desired output should be as below.
I tried the script below:
SELECT a0_String_array,
       ref_id
FROM TableA AS t,
     t.String_array AS a0_String_array
But the challenge with the above script is that I have close to 1000 Ref IDs, yet the output returns only 100 rows.
If I try the below, I'm able to get all 1000 rows.
SELECT string_array,
ref_id
FROM TableA
The end goal is to unnest and cast to INT64/STRING. The above script is not working for my need. Can someone help with this?
You can use CROSS JOIN + UNNEST() in order to get the values from the array attributed to each ref_id:
select
  ref_id,
  unnested_numbers
from tablea
cross join unnest(string_array) as unnested_numbers
order by 2, 1
This should give you the desired output that you specified.
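Since the stated end goal is to cast the elements and join with table B, SAFE_CAST can be layered on top. A sketch, assuming the array holds numeric strings and that table B is named tableb with an INT64 column id (both assumed names):

select
  a.ref_id,
  safe_cast(unnested_numbers as int64) as number_id  -- NULL instead of an error on bad input
from tablea a
cross join unnest(a.string_array) as unnested_numbers
join tableb b  -- hypothetical table and column names
  on b.id = safe_cast(unnested_numbers as int64)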

JSONB array contains like OR and AND operators

Consider a table temp (jsondata jsonb)
Postgres provides a way to check whether a jsonb array contains a value using
SELECT jsondata
FROM temp
WHERE (jsondata->'properties'->'home') ? 'football'
But we can't use the LIKE operator with that containment check. One way to get LIKE over the array elements is:
SELECT jsondata
FROM temp,jsonb_array_elements_text(temp.jsondata->'properties'->'home')
WHERE value like '%foot%'
An OR combination with LIKE can be achieved by:
SELECT DISTINCT jsondata
FROM temp,jsonb_array_elements_text(temp.jsondata->'properties'->'home')
WHERE value like '%foot%' OR value like 'stad%'
But I am unable to perform an AND combination with the LIKE operator over the jsonb array elements.
After unnesting the array with jsonb_array_elements_text() you can flag the values meeting either condition and sum the flags, grouped by the original rows. Example:
drop table if exists temp;
create table temp(id serial primary key, jsondata jsonb);
insert into temp (jsondata) values
('{"properties":{"home":["football","stadium","16"]}}'),
('{"properties":{"home":["football","player","16"]}}'),
('{"properties":{"home":["soccer","stadium","16"]}}');
select jsondata
from temp
cross join jsonb_array_elements_text(temp.jsondata->'properties'->'home')
group by jsondata
-- or better:
-- group by id
having sum((value like '%foot%' or value like 'stad%')::int) = 2
jsondata
---------------------------------------------------------
{"properties": {"home": ["football", "stadium", "16"]}}
(1 row)
Update. The above query may be expensive on a large dataset. There is a simpler and faster solution: cast the array to text and apply LIKE to it, e.g.:
select jsondata
from temp
where jsondata->'properties'->>'home' like all('{%foot%, %stad%}');
jsondata
---------------------------------------------------------
{"properties": {"home": ["football", "stadium", "16"]}}
(1 row)
I have the following, but it was a bit fiddly. There's probably a better way but this is working I think.
The idea is to find the matching JSON array entries, then collect the results. In the join condition we check that the "matches" array has the expected number of entries.
CREATE TABLE temp (jsondata jsonb);
INSERT INTO temp VALUES ('{"properties":{"home":["football","stadium",16]}}');
SELECT jsondata
FROM temp t
INNER JOIN LATERAL (
  SELECT array_agg(value) AS matches
  FROM jsonb_array_elements_text(t.jsondata->'properties'->'home')
  WHERE value LIKE '%foo%' OR value LIKE '%sta%'
  LIMIT 1
) l ON array_length(matches, 1) = 2;
jsondata
-------------------------------------------------------
{"properties": {"home": ["football", "stadium", 16]}}
(1 row)
demo: db<>fiddle
I would cast the array into text. Then you are able to search for keywords with any string operator.
Disadvantage: because it was an array, the text contains characters like brackets and commas, so it's not that simple to search for a keyword with a certain beginning (ABC%): you always have to search with %ABC%.
SELECT jsondata
FROM (
  SELECT
    jsondata,
    jsondata->'properties'->>'home' AS a
  FROM temp
) s
WHERE a LIKE '%stad%' AND a LIKE '%foot%'

How to do calculations on json data in Postgres

I'm storing AdWords report data in Postgres. Each report is stored in a table named Reports, which has a jsonb column named 'data'. Each report has JSON stored in its 'data' field that looks like this:
[
  {
    "match_type": "exact",
    "search_query": "gm hubcaps",
    "conversions": 2,
    "cost": 1.24
  },
  {
    "match_type": "broad",
    "search_query": "gm auto parts",
    "conversions": 34,
    "cost": 21.33
  },
  {
    "match_type": "phrase",
    "search_query": "silverdo headlights",
    "conversions": 63,
    "cost": 244.05
  }
]
What I want to do is query these data hashes and sum up the total number of conversions for a given report. I've looked through the Postgresql docs and it looks like you can only really do calculations on hashes, not arrays of hashes like this. Is what I'm trying to do possible in Postgres? Do I need to make a temp table out of this array and do calculations off that? Or can I use a stored procedure?
I'm using Postgresql 9.4
EDIT
The reason I'm not just using a regular, normalized table is that this is just one example of how report data could be structured. In my project, reports have to allow arbitrary keys, because they are populated by users uploading CSV's with any columns they like. It's basically just a way to get around having arbitrarily many, user-created tables.
What I want to do is query off these data hashes and sum up the conversions
The fastest way should be with jsonb_populate_recordset(). But you need a registered row type for it.
CREATE TEMP TABLE report_data (
-- match_type text -- commented out, because we only need ..
-- , search_query text -- .. conversions for this query
conversions int
-- , cost numeric
);
A temp table is one way to register a row type ad-hoc. More explanation in this related answer:
jsonb query with nested objects in an array
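Alternatively, if you don't mind a permanent object, a plain composite type registers the same row type:

CREATE TYPE report_data AS (conversions int);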
Assuming a table report with report_id as PK, for lack of information.
SELECT r.report_id, sum(d.conversions) AS sum_conversions
FROM report r
LEFT JOIN LATERAL jsonb_populate_recordset(null::report_data, r.data) d ON true
-- WHERE r.report_id = 12345 -- only for given report?
GROUP BY 1;
The LEFT JOIN ensures you get a result, even if data is NULL or empty or the JSON array is empty.
For a sum from a single row in the underlying table, this is faster:
SELECT d.sum_conversions
FROM report r
LEFT JOIN LATERAL (
SELECT sum(conversions) AS sum_conversions
FROM jsonb_populate_recordset(null::report_data, r.data)
) d ON true
WHERE r.report_id = 12345; -- enter report_id here
Alternative with jsonb_array_elements() (no need for a registered row type):
SELECT d.sum_conversions
FROM report r
LEFT JOIN LATERAL (
SELECT sum((value->>'conversions')::int) AS sum_conversions
FROM jsonb_array_elements(r.data)
) d ON true
WHERE r.report_id = 12345; -- enter report_id here
Normally you would implement this as plain, normalized table. I don't see the benefit of JSON here (except that your application seems to require it, like you added).
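For comparison, a minimal sketch of what a normalized layout for this particular report shape could look like (table name and column types guessed from the sample data):

CREATE TABLE report_row (
  report_id    int,  -- FK to report
  match_type   text,
  search_query text,
  conversions  int,
  cost         numeric
);

-- The sum then becomes a plain aggregate:
SELECT report_id, sum(conversions) AS sum_conversions
FROM report_row
GROUP BY 1;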
You could use unnest:
select sum(conv)
from (
  select (d->>'conversions')::int as conv
  from (select unnest(data) as d from <your table>) all_data
) all_conv
Disclaimer: I don't have Pg 9.2 so I couldn't test it myself.
EDIT: this is assuming that the array you mentioned is a Postgresql array, i.e. that the data type of your data column is character varying[]. If you mean the data is a json array, you should be able to use json_array_elements instead of unnest.
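Since the question states the column is jsonb, a sketch of that variant (table name reports and PK column id are assumed; use json_array_elements for plain json):

select r.id, sum((d->>'conversions')::int) as total_conversions
from reports r
cross join jsonb_array_elements(r.data) as d
group by r.id;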

Iterate through a list to get strings in SQL

I have a SQL table as shown below. I want to generate strings using the 2 fields in my table.
A   B
M1  tiger
M1  cat
M1  dog
M3  lion
I want to read in this table, count the number of rows, and store it in string variables like String1 = M1_tiger, String2 = M1_cat, etc. What's the best way to do this?
You could do a concat type query.
SELECT (Table.A + '_' + Table.B) AS A_B, COUNT(*) OVER () AS RowsCount FROM Table
I'm assuming your table name is "Table". The result column A_B holds the strings you want; each record contains two things: the string you asked for, and a second value, RowsCount, which is always the same, the total number of records in your table.
The count part is kinda easy but check this link so you can use the specific count you need: http://www.w3schools.com/sql/sql_func_count.asp
You can try this:
SELECT CONCAT(A, '_', B) FROM yourtable
When you say "read in this table", do you mean read it into a programming language like C#? Or do you want to dynamically create sql variables?
You may want to use a table variable to store your strings rather than individual variables. Regarding getting the row number, you could use something like:
WITH CTE AS
(
  SELECT A, B,
         ROW_NUMBER() OVER (ORDER BY A, B) AS 'RowNumber'
  FROM MyTable
)
SELECT A, B, RowNumber FROM CTE
See this answer for more on how you may choose to use the table variable.
SQL: Dynamic Variable Names
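A sketch of that combination in T-SQL, ordering by A, B since the sample table has no date column:

DECLARE @Strings TABLE (RowNumber int, Value varchar(100));

INSERT INTO @Strings (RowNumber, Value)
SELECT ROW_NUMBER() OVER (ORDER BY A, B),
       A + '_' + B
FROM MyTable;

SELECT * FROM @Strings;  -- 1 | M1_cat, 2 | M1_dog, 3 | M1_tiger, 4 | M3_lion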
If you are using Oracle, you can also do it like:
select A ||'_'||B
from yourTable
Solution for PostgreSQL
CREATE SEQUENCE one;
SELECT array_to_string(array_agg(concat('String',nextval('one'),' = ',A,'_',B)), ', ')
AS result
FROM test_table;
DROP SEQUENCE one;
Explanation:
Create a temporary sequence 'one' in order to use nextval function.
nextval('sequence') - advance sequence and return new value.
concat('input1', ...) - concatenate all arguments.
array_agg(expression) - input values, including nulls,
concatenated into an array.
array_to_string('array', 'delimiter') - concatenates array elements
using supplied delimiter and optional null string.
Drop the sequence 'one'.
The output of the query (for two test rows in test_table):
result
-------------------------------------------
String1 = M1_tiger, String2 = M1_cat
(1 row)
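If you'd rather not create and drop a sequence, a numbered subquery is a possible alternative (same caveat that the aggregation order is not guaranteed):

SELECT array_to_string(
         array_agg(concat('String', rn, ' = ', A, '_', B)),
         ', ') AS result
FROM (
  SELECT A, B, row_number() OVER () AS rn
  FROM test_table
) t;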