Converting arrays to nested fields in BigQuery - google-bigquery

I'm streaming Stackdriver logs into BigQuery, and they end up in a textPayload field in the following format:
member_id_hashed=123456789,
member_age -> Float(37.0,244),
operations=[92967,93486,86220,92814,92943,93279,...],
scores=[3.214899,2.3641025E-5,2.5823574,2.3818345,3.9919448,0.0,...],
[etc]
I then define a query/view on the table with the raw logging entries as follows:
SELECT
  member_id_hashed as member_id, member_age,
  split(operations,',') as operation,
  split(scores,',') as score
FROM
(
  SELECT
    -- the payload key is member_id_hashed, so the pattern must match that name
    REGEXP_EXTRACT(textPayload, r'member_id_hashed=([0-9]+)') as member_id_hashed,
    REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
    REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
    REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
  from `myproject.mydataset.mytable`
)
This results in one row with two scalar fields and two arrays.
Ideally, for further analysis, I would like the two arrays to be nested (e.g. operation.id and operation.score), or flattened row by row while keeping the positions (i.e. element 1 of array 1 should appear next to element 1 of array 2, etc.).
Can anybody point me to a way to make nested fields out of the arrays, or to flatten the arrays? I tried unnesting and joining, but that would give me too many possible cross-combinations in the result.
Thanks for your help!

You can zip the two arrays like this. It unnests the array with operation IDs and gets the index of each element, then selects the corresponding element of the array with scores. Note that this assumes that the arrays have the same number of elements. If they don't, you could use SAFE_OFFSET instead of OFFSET in order to get NULL if there are more IDs than scores, for instance.
SELECT
  member_id_hashed, member_age,
  ARRAY(
    -- pair each operation ID with the score at the same offset
    SELECT AS STRUCT id, split(scores,',')[OFFSET(off)] AS score
    FROM UNNEST(split(operations,',')) AS id WITH OFFSET off
    ORDER BY off
  ) AS operations
FROM (
  SELECT
    -- alias must match the column name selected in the outer query
    REGEXP_EXTRACT(textPayload, r'member_id_hashed=([0-9]+)') as member_id_hashed,
    REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
    REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
    REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
  from `myproject.mydataset.mytable`
)
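The pairing logic of that query can be modelled outside BigQuery. The following is only a Python sketch of the semantics, not BigQuery code: the function name zip_by_offset and the shortened sample strings are assumptions made for illustration. It pairs each ID with the score at the same offset and yields None when the score is missing, which is roughly what SAFE_OFFSET would give you.

```python
from typing import Optional


def zip_by_offset(ids: list, scores: list) -> list:
    """Pair each id with the score at the same offset.

    A missing score becomes None, similar to SAFE_OFFSET returning NULL.
    (Hypothetical helper, not part of any BigQuery client library.)
    """
    rows = []
    for off, id_ in enumerate(ids):  # like UNNEST(...) AS id WITH OFFSET off
        score: Optional[str] = scores[off] if off < len(scores) else None
        rows.append({"id": id_, "score": score})
    return rows


# Shortened, made-up sample with one more ID than scores.
operations = "92967,93486,86220".split(",")
scores = "3.214899,2.3641025E-5".split(",")
rows = zip_by_offset(operations, scores)
```

Each element of rows corresponds to one STRUCT in the nested operations column, so no cross-combinations are produced.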

Related

Unnesting repeated records to a single row in Big Query

I have a dataset that includes repeated records. When I unnest them I get 2 rows. 1 per nested record.
Before unnest raw data:
After unnest using this query:
SELECT
  eventTime,
  participant.id
FROM
  `public.table`,
  UNNEST(people) AS participant
WHERE
  verb = 'event'
These are actually 2 rows that are expanded to 4. I've been trying to unnest into a single row so that I have 3 columns: eventTime, buyer.Id, seller.Id.
I've been trying to use REPLACE to build a struct of the unnested content, but I cannot figure out how to do it. Are there any pointers, documentation, or steps that could help me out?
Consider the approach below:
SELECT * EXCEPT(key) FROM (
SELECT
eventTime,
participant.id,
personEventRole,
TO_JSON_STRING(t) key
FROM `public.table` t,
UNNEST(people) AS participant
WHERE verb = 'event'
)
PIVOT (MIN(id) FOR personEventRole IN ('buyer', 'seller'))
if applied to sample data in your question - output is
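To make the effect of the PIVOT step concrete, here is a small Python sketch of the same idea. It is a model of the semantics only, under assumptions: the helper name pivot_roles and the sample event are invented, and each participant is assumed to carry id and personEventRole fields as in the query.

```python
def pivot_roles(event: dict, roles=("buyer", "seller")) -> dict:
    """Collapse the repeated people records of one event into a single row.

    Mirrors PIVOT (MIN(id) FOR personEventRole IN ('buyer', 'seller')):
    one output column per role, holding the smallest id seen for that role.
    """
    row = {"eventTime": event["eventTime"]}
    for role in roles:
        ids = [p["id"] for p in event["people"] if p["personEventRole"] == role]
        row[role] = min(ids) if ids else None  # NULL when the role is absent
    return row


# Hypothetical sample event, not taken from the question's data.
event = {
    "eventTime": "2021-01-01T00:00:00Z",
    "people": [
        {"id": "A1", "personEventRole": "buyer"},
        {"id": "B2", "personEventRole": "seller"},
    ],
}
row = pivot_roles(event)
```

The result has one entry per role next to eventTime, which is the three-column shape the question asks for.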

function to sum all first value of Results SQL

I have a table with "Number", "Name" and "Result" columns. Result is a 2D text array, and I need to create a column named "Average" that sums the first value of each inner array in Result and divides the sum by 2. Can somebody help me, please? I must use CREATE FUNCTION for this. It looks like this:
Table1
Number | Name  | Result                       | Average
01     | Kevin | {{2.0,10},{3.0,50}}          | 2.5
02     | Max   | {{1.0,10},{4.0,30},{5.0,20}} | 5.0
Average (01) = (2.0 + 3.0) / 2 = 2.5
Average (02) = (1.0 + 4.0 + 5.0) / 2 = 5.0
First of all: you should avoid storing arrays in a table (or generating them in a subquery) unless it is really necessary. Normalize the data; it makes life much easier in nearly every use case.
Second: you should avoid multidimensional arrays. They are very hard to handle. See Unnest array by one level
However, in your special case you could do something like this:
demo:db<>fiddle
SELECT
  number,
  name,
  -- cast is needed when the array holds text, as in the question
  SUM(value::numeric) FILTER (WHERE idx % 2 = 1) / 2 AS average -- 2
FROM mytable,
  unnest(avg_result) WITH ORDINALITY as elements(value, idx) -- 1
GROUP BY number, name
1. unnest() expands the array elements into one element per record. But this is not a one-level expansion: it expands ALL elements, at every depth. To keep track of your elements, you can add an index using WITH ORDINALITY.
2. Because you have nested two-element arrays, the unnested data can be used as follows: you want to sum the first of each pair of elements, which is every second element (the ones with an odd index). The FILTER clause in the aggregation lets you aggregate exactly these elements.
However: if the array was the result of a subquery, you should think about doing the operation BEFORE the array aggregation (if the aggregation is really necessary). This makes things easier.
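The flatten-then-filter idea can be checked by hand with a short Python sketch. This models what unnest() WITH ORDINALITY plus the FILTER clause compute; it is not PostgreSQL code, and the function name half_sum_of_firsts is an assumption for illustration. The two arrays are the ones from the question's table.

```python
def half_sum_of_firsts(result: list) -> float:
    """Sum the first value of every inner pair and divide by 2."""
    # unnest() flattens every level, so model that with a flat list.
    flat = [value for pair in result for value in pair]
    # WITH ORDINALITY numbers elements from 1; the first value of each
    # two-element pair therefore sits at an odd index.
    firsts = [float(v) for idx, v in enumerate(flat, start=1) if idx % 2 == 1]
    return sum(firsts) / 2


kevin = half_sum_of_firsts([["2.0", "10"], ["3.0", "50"]])
max_ = half_sum_of_firsts([["1.0", "10"], ["4.0", "30"], ["5.0", "20"]])
```

Both values match the Average column in the question (2.5 and 5.0). Note that the odd-index trick relies on every inner array having exactly two elements.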
Assumptions:
number column is Primary key.
result column is text or varchar type
Here are the steps for your requirements:
Add the column in your table using following query (you can skip this step if column is already added)
alter table table1 add column average decimal;
Update the calculated value by using below query:
update table1 t1
set average = t2.value_
from
(
select
number,
sum(t::decimal)/2 as value_
from table1
cross join lateral unnest((result::text[][])[1:999][1]) as t
group by 1
) t2
where t1.number=t2.number
Explanation: unnest((result::text[][])[1:999][1]) returns the first value of each child array. The upper bound 999 assumes that your 2D array has at most 999 child arrays; you can increase or decrease it as required.
DEMO
Now you can create your function as per your requirement with above query.

Filtering arrays in hive

I have a Hive table with a column called paid_value that holds an array for each record.
Now I want to filter the array so that only values between 1000 and 10000 are kept for each record.
I don't know how to do it.
I know the array_contains(Array<T>, value) function, but it doesn't solve my problem: it accepts only a single value as the check criterion, whereas I need something like 'between 1000 and 10000'.
You can use LATERAL VIEW EXPLODE to explode the array and then do the filter subsequently. But if your array size is huge, your process will be slow.
Other option definitely needs a UDF to do the filter.
Another workaround I can think of uses a Brickhouse UDF:
-- this will give you an array of numbers between start(st) and end(ed)
-- posexplode yields positions 0..(ed-st), so add st to each position
select collect_set(t.st + pe.i) as range_array
from
(SELECT 1000 as st, 1100 as ed) t
LATERAL VIEW posexplode(split(space(ed-st),' ')) pe AS i,x;
Then I use the brickhouse udf bhouse_intersect_array
select count(1)
from range_array cross join <source_tablename>
where size(bhouse_intersect_array(source_array, range_array)) > 0
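For comparison, the LATERAL VIEW EXPLODE route mentioned at the start of this answer amounts to explode, filter, and re-collect. The Python sketch below models that flow only; it is not Hive code, and the function name, the record ids, and the sample values are assumptions made for illustration.

```python
def explode_filter_collect(records: dict, low: int, high: int) -> dict:
    """Keep, per record, only the array values within [low, high]."""
    kept = {}
    for rec_id, values in records.items():
        for v in values:                 # LATERAL VIEW explode(paid_value)
            if low <= v <= high:         # WHERE v BETWEEN low AND high (inclusive)
                # re-collect the surviving values per record (GROUP BY id)
                kept.setdefault(rec_id, []).append(v)
    return kept


# Hypothetical records: id -> paid_value array.
records = {1: [500, 1500, 20000], 2: [50], 3: [1000, 10000]}
filtered = explode_filter_collect(records, 1000, 10000)
```

As with an explode followed by a WHERE clause, a record whose array has no value in range (record 2 here) disappears from the result entirely.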

Unnesting structs in BigQuery

What is the correct way to flatten a struct of two arrays in BigQuery? I have a dataset like the one pictured here (the struct.destination and struct.visitors arrays are ordered - i.e. the visitor counts correspond specifically to the destinations in the same row):
I want to reorganize the data so that I have a total visitor count for each unique combination of origins and destinations. Ideally, the end result will look like this:
I tried using UNNEST twice in a row - once on struct.destination and then on struct.visitors, but this produces the wrong result (each destination gets mapped to every value in the array of visitor counts when it should only get mapped to the value in the same row):
SELECT
origin,
unnested_destination,
unnested_visitors
FROM
dataset.table,
UNNEST(struct.destination) AS unnested_destination,
UNNEST(struct.visitors) AS unnested_visitors
You have one struct that is repeated. So, I think you want:
SELECT origin,
s.destination,
s.visitors
FROM dataset.table t CROSS JOIN
UNNEST(t.struct) s;
EDIT:
I see, you have a struct of two arrays. You can do:
SELECT origin, d AS destination, v AS visitors
FROM dataset.table t CROSS JOIN
     UNNEST(struct.destination) d WITH OFFSET nd LEFT JOIN
     UNNEST(struct.visitors) v WITH OFFSET nv
     ON nd = nv
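The question's end goal, a total visitor count per origin and destination, is one aggregation step on top of the offset join. A Python sketch of that logic follows, under stated assumptions: the function name total_visitors and the sample rows are invented, and both arrays in a row are assumed to have the same length (zip drops unmatched trailing elements, whereas the LEFT JOIN above would keep them with NULL).

```python
from collections import defaultdict


def total_visitors(rows: list) -> dict:
    """Sum visitors per (origin, destination), pairing arrays by position."""
    totals = defaultdict(int)
    for row in rows:
        # zip pairs elements at equal offsets, like the join ON nd = nv
        for destination, visitors in zip(row["destination"], row["visitors"]):
            totals[(row["origin"], destination)] += visitors
    return dict(totals)


# Hypothetical rows shaped like the struct of two ordered arrays.
rows = [
    {"origin": "A", "destination": ["X", "Y"], "visitors": [3, 5]},
    {"origin": "A", "destination": ["X"], "visitors": [2]},
]
totals = total_visitors(rows)
```

In SQL terms this corresponds to wrapping the joined result in SUM(visitors) with GROUP BY origin, destination.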
This is difficult to test without the underlying data, so I wrote my own query against your dataset. As far as I can tell, destination and visitors are not in ARRAY format but in STRUCT format, so you do not need to UNNEST them. Please also see this thread :)
SELECT
origin,
COUNT(struct.destination),
COUNT(struct.visitors)
FROM dataset.table
GROUP BY 1

Flattening an array in SQL

I am trying to flatten this array so that each neighbor has its own column.
How do I write a query that allows me to flatten this array when I don't know the elements in the array?
SELECT deviceid,
neighbors
FROM
`etl.routing_table_nodes`
WHERE
Parent = 'QMI-YSK'
And results currently look like:
Row  deviceid  neighbors
1    OHX-ZSI   DMR-RLE
               WMI-YEK
2    OHX-ZFI   DMR-RLE
               QMI-YSK
Try
SELECT
deviceid, unnested_neighbors
FROM
`etl.routing_table_nodes` table,
UNNEST(table.neighbors) as unnested_neighbors
WHERE
unnested_neighbors = 'QMI-YSK'