Alias nested struct columns - google-bigquery

How can I alias field1 as index and field2 as value? The query below gives me an error:
#standardsql
with q1 as (select 1 x, ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING,value STRING>>>>
[struct('h1',[('1','a')
,('2','b')
])
,('h2',[('3','c')
,('4','d')
])] hits
)
Select * from q1
ORDER by x
Error: Array element type STRUCT<STRING, ARRAY<STRUCT<STRING, STRING>>> does not coerce to STRUCT<id STRING, cd ARRAY<STRUCT<index STRING, value STRING>>> at [5:26]
Thanks a lot for your time in responding
Cheers!

#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT('1' AS index, 'a' AS value), ('2','b')] AS cd),
STRUCT('h2', [STRUCT('3' AS index, 'c' AS value), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x
Or the below might be even more "readable":
#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT<index STRING, value STRING>('1', 'a'), ('2','b')] AS cd),
STRUCT('h2', [STRUCT<index STRING, value STRING>('3', 'c'), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x

When I simulate data in BigQuery using Standard SQL, I usually try to name all variables and aliases wherever possible. For instance, your data works if you build it like so:
with q1 as (
select 1 x, ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING, value STRING>>>>[
struct('h1' as id, [STRUCT('1' as index, 'a' as value), STRUCT('2' as index, 'b' as value)] as cd),
STRUCT('h2', [STRUCT('3' as index, 'c' as value), STRUCT('4' as index, 'd' as value)] as cd)
] hits
)
select * from q1
order by x
Notice I've built structs and put aliases inside them in order for this to work. If you remove the aliases and the structs it might not work; I found this to be rather intermittent, but if you fully describe the variables, it works every time.
Also, as a recommendation, I try to build simulated data piece by piece, as sketched below. First I create the struct and test it to see if BigQuery accepts it. After the validator is green, I proceed to add more values. If you try to build everything at once, you might find it a somewhat challenging task.
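For example, a minimal sketch of that incremental workflow (the field names are just the ones from the question):
#standardsql
-- step 1: validate a single fully-aliased struct on its own
select struct('1' as index, 'a' as value) as probe
-- step 2: once the validator is green, wrap it in the cd array:
-- select struct('h1' as id, [struct('1' as index, 'a' as value)] as cd) as hit
-- step 3: only then assemble the full hits array as in the query above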

Related

hive convert array<map<string, string>> to string

I have a column in a Hive table whose type is array<map<string, string>>. How can I convert this column into a string using HQL?
I found a post here, Convert Map<string,string> to just string in hive, on converting map<string, string> to string. However, I still failed to convert array<map<string, string>> to string.
Building off of the original answer you linked to, you can first explode the array into the individual maps using posexplode to maintain a position column. Then you can use the method from the original post, but additionally group by the position column to convert each map to a string. Finally, you collect your map strings into the final string. Here's an example:
with test_data as (
select stack(2,
1, array(map('m1key1', 'm1val1', 'm1key2', 'm1val2'), map('m2key1', 'm2val1', 'm2key2', 'm2val2')),
2, array(map('m1key1', 'm1val1', 'm1key2', 'm1val2'), map('m2key1', 'm2val1', 'm2key2', 'm2val2'))
) as (id, arr_col)
),
map_data as (
select id, arr_col as original_arr, m.pos as map_num, m.val as map_col
from test_data d
lateral view posexplode(arr_col) m as pos, val
),
map_strs as (
select id, original_arr, map_num,
concat('{',concat_ws(',',collect_set(concat(m.key,':', m.val))),'}') map_str
from map_data d
lateral view explode(map_col) m as key, val
group by id, original_arr, map_num
)
select id, original_arr, concat('[', concat_ws(',', collect_set(map_str)), ']') as arr_str
from map_strs
group by id, original_arr;

erroneous function call with case when inside UDF

I'm noticing some really weird behavior when using a CASE WHEN statement inside a UDF in one of my queries in Google BigQuery. The result is really bizarre, so either I'm missing something really obvious, or there's some weird behavior in the query execution.
(side note: if there's a more efficient way to implement the query logic below I'm all ears, my query takes forever)
I'm processing some log lines, where each line contains a data: string and a topics: array<string> field that are used for decoding. Each type of log line has a different length of topics and requires different decoding logic. I use a CASE WHEN inside a UDF to switch between different methods of decoding. I originally got a strange error of indexing too far into an array. This would mean either non-conformant data or the wrong decoder getting called at some point. I verified that all the data conformed to the spec, so it must be the latter.
I've narrowed it down to an erroneous / extraneous decoder being executed inside my CASE WHEN for the wrong type.
The weirdest thing is that when I insert a fixed value instead of my decoding functions, the return value of the CASE WHEN doesn't indicate an erroneous match. Somehow the first function is getting called when I use the functions, but when debugging with fixed values I get the proper value from the second WHEN.
I pulled the logic out of the UDF and implemented it with an if(..) instead of CASE WHEN, and everything decodes fine. I'm wondering what's going on here, whether it's a bug in BigQuery or something weird that happens when using UDFs.
Here's a stripped down version of the query
-- helper function to normalize different payloads into a flattened struct
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_field_type1(field1) as field1,
decode_field_type1(field2) as field2,
decode_field_type2(field3) as field3,
-- a bunch more fields
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'field1', 'field2', 'field3', --a bunch more fields
)
)
))
);
-- this topic uses the data and topics in the decoding, and has a topics array of length 4
-- this gets called from the switch with a payload from topic2, which has a shorter topics array of length 1, causing a failure
create temporary function decode_topic1(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(topics[offset(1)], 3) as value),
struct("field2" as name, substring(topics[offset(2)], 3) as value),
struct("field3" as name, substring(topics[offset(3)], 3) as value),
struct("field4" as name, substring(data, 3, 64) as value)
])
);
--this uses only the data_payload, and has a topics array of length 1
create temporary function decode_topic2(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(data, 3, 64) as value),
struct("field5" as name, substring(data, 67, 64) as value),
struct("field6" as name, substring(data, 131, 64) as value)
])
);
create temporary function decode_event_data(data string, topics array<string>) as
(
-- first element of topics denotes the type of event
case
-- somehow the function decode_topic1 gets called when topics[0] == topic2
-- HOWEVER, when i replaced the functions with a fixed value to debug
-- i get the expected results, indicating a proper match.
-- this is not unique to these topics;
-- it happens with other combinations also.
when topics[offset(0)] = 'topic1' then decode_topic1(data, topics)
when topics[offset(0)] = 'topic2' then decode_topic2(data, topics)
-- a bunch more topics
else wrap_struct([])
end
);
select
id, data, topics,
decode_event_data(data, topics) as decoded_payload
from (select * from mytable
where
topics[offset(0)] = 'topic1'
or topics[offset(0)] = 'topic2')
When I change the base query to:
select
id, data, topics, decode_topic2(data, topics)
from (select * from mytable
where
topics[offset(0)] = 'topic2')
it decodes fine.
What's up with the CASE WHEN?
Edit: Here's a query on a public dataset that can generate the problem:
create temporary function decode_address(raw string) as (
concat('0x', substring(raw, 25, 40))
);
create temporary function decode_amount(raw string) as (
concat('0x', raw)
);
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_address(sender) as reserve,
decode_address(`to`) as `to`,
decode_amount(amount1) as amount1,
decode_amount(amount2) as amount2,
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'sender', 'to', 'amount1', 'amount2'
)
)
))
);
create temporary function decode_mint(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value)
])
);
create temporary function decode_burn(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
struct("to" as name, substring(topics[offset(2)], 67, 64) as value)
])
);
select
*,
case
when topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f' then decode_mint(data, topics)
when topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822' then decode_burn(data, topics)
end as decoded_payload
from `public-data-finance.crypto_ethereum_kovan.logs`
where
array_length(topics) > 0
and (
(array_length(topics) = 2 and topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f')
or (array_length(topics) = 3 and topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822')
)
The offset(nn) is killing your function; change it to safe_offset(nn) and that will solve the problem. Also, this `to` field is generally empty in decode_burn, or null in decode_mint, so, at least with this data, it's just causing problems.
The following works and solves the issue:
create temporary function decode_burn(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
struct("to" as name, substring(topics[SAFE_OFFSET(2)], 67, 64) as value)
])
);
Edit1:
I have analyzed in detail the data sent to the functions at each step, and you are right about the filters and how it is working. It seems you have reached a specific corner case (or bug?).
As far as I could understand from the processing steps, BQ is optimizing your functions, as they are very similar except for the fact that there is an additional field in one of them.
So, the BQ engine is using the same optimized function for both sets of data, which is causing the exception when the input is the data of rows with topics[OFFSET(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f'.
Given that this is happening:
using safe_offset is still a good call;
or, create only one function and use the conditional inside it, in which case the query will be processed correctly:
create temporary function decode(data_payload string, topics array<string>) as (
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
if(topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822',
struct("to" as name, topics[offset(2)] as value), null)])
);
select *, decode(data, topics) ...
In parallel, you can open a case in the issue tracker.

Extract information from a json string in BigQuery

I am storing a table in BigQuery with the results of a classification algorithm. The table schema is INT, STRING and looks something like this:
ID      Output
1001    {'Apple Cider': 0.7, 'Coffee' : 0.2, 'Juice' : 0.1}
1002    {'Black Coffee':0.9, 'Tea':0.1}
The problem is how to fetch the first (or second, or any) element of each string together with its score. It doesn't seem likely that JSON_EXTRACT can work here, and most likely it can be done with JavaScript. I was wondering what an elegant solution would look like.
Consider below
select ID,
trim(split(kv, ':')[offset(0)], " '") element,
cast(split(kv, ':')[offset(1)] as float64) score,
element_position
from `project.dataset.table` t,
unnest(regexp_extract_all(trim(Output, '{}'), r"'[^':']+'\s?:\s?[^,]+")) kv with offset as element_position
If applied to the sample data in your question, the output is one row per element, with its name, score, and position.
Note: you can use a less verbose unnest statement if you wish:
unnest(split(trim(Output, '{}'))) kv with offset as element_position
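For reference, here is the full query with that shorter variant substituted in (same placeholder table as above); it relies on split() defaulting to a comma delimiter:
select ID,
trim(split(kv, ':')[offset(0)], " '") element,
cast(split(kv, ':')[offset(1)] as float64) score,
element_position
from `project.dataset.table` t,
unnest(split(trim(Output, '{}'))) kv with offset as element_position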

Using Array(Tuple(LowCardinality(String), Int32)) in ClickHouse

I have a table
CREATE TABLE table (
id Int32,
values Array(Tuple(LowCardinality(String), Int32)),
date Date
) ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (id, date)
but when executing the request
SELECT count(*)
FROM table
WHERE (arrayExists(x -> ((x.1) = toLowCardinality('pattern')), values) = 1)
I get an error
Code: 49. DB::Exception: Received from clickhouse:9000. DB::Exception: Cannot capture column 3 because it has incompatible type: got String, but LowCardinality(String) is expected..
If I replace the column 'values'
values Array(Tuple(String, Int32))
then the request is executed without errors.
What could be the problem when using Array(Tuple(LowCardinality(String), Int32))?
Until it is fixed (see bug 7815), this workaround can be used:
SELECT uniqExact((id, date)) AS count
FROM table
ARRAY JOIN values
WHERE values.1 = 'pattern'
For the case when there is more than one Array column, it can be done this way:
SELECT uniqExact((id, date)) AS count
FROM
(
SELECT
id,
date,
arrayJoin(values) AS v,
arrayJoin(values2) AS v2
FROM table
WHERE v.1 = 'pattern' AND v2.1 = 'pattern2'
)
values Array(Tuple(LowCardinality(String), Int32)),
Do not use Tuple here. It brings only cons: it still stores two files on disk, and it gives a twofold slowdown when you extract only one tuple element (https://gist.github.com/den-crane/f20a2dce94a2926a1e7cfec7cdd12f6d).
Use two parallel arrays instead:
valuesS Array(LowCardinality(String)),
valuesI Array(Int32)
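A minimal sketch of that two-parallel-arrays layout and the equivalent lookup (the table name table2 is just illustrative):
CREATE TABLE table2 (
id Int32,
valuesS Array(LowCardinality(String)),
valuesI Array(Int32),
date Date
) ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (id, date);

-- the arrayExists from the question becomes a has() over the string column;
-- the matching Int32 can be recovered by position with indexOf() if needed
SELECT count(*)
FROM table2
WHERE has(valuesS, 'pattern');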

Cast DATETIME to STRING working on ARRAY of STRUCT on BigQuery Standard SQL

In my table I have an attribute called 'messages' with this exact data type:
ARRAY<STRUCT<created_time DATETIME ,`from` STRUCT<id STRING,
name STRING,email STRING>, id STRING, message STRING>>
and I have defined a UDF named my_func()
Because UDFs in BigQuery don't support the DATETIME type, I need to cast the attribute created_time.
So I tried this:
safe_cast(messages as ARRAY<STRUCT<created_time STRING,
`from` STRUCT<id STRING, name STRING, email STRING>,
id STRING, message STRING>>) as messages_casted
and I get this error
Casting between arrays with incompatible element types is not
supported: Invalid cast from...
Is there an error in the way I cast an array of structs?
Is there some way to use a UDF with this data structure, or is the only way to flatten the array and do the cast?
My goal is to take the array in the JS execution environment in order to make the aggregation with JS code.
When working with JavaScript UDFs, you don't need to specify complex input data types explicitly. Instead, you can use the TO_JSON_STRING function. In your case, you can have the UDF take messages as a STRING, then parse it inside the UDF. You would call my_func(TO_JSON_STRING(messages)), for example.
Here is an example from the documentation:
CREATE TEMP FUNCTION SumFieldsNamedFoo(json_row STRING)
RETURNS FLOAT64
LANGUAGE js AS """
function SumFoo(obj) {
var sum = 0;
for (var field in obj) {
if (obj.hasOwnProperty(field) && obj[field] != null) {
if (typeof obj[field] == "object") {
sum += SumFoo(obj[field]);
} else if (field == "foo") {
sum += obj[field];
}
}
}
return sum;
}
var row = JSON.parse(json_row);
return SumFoo(row);
""";
WITH Input AS (
SELECT STRUCT(1 AS foo, 2 AS bar, STRUCT('foo' AS x, 3.14 AS foo) AS baz) AS s, 10 AS foo UNION ALL
SELECT NULL, 4 AS foo UNION ALL
SELECT STRUCT(NULL, 2 AS bar, STRUCT('fizz' AS x, 1.59 AS foo) AS baz) AS s, NULL AS foo
)
SELECT
TO_JSON_STRING(t) AS json_row,
SumFieldsNamedFoo(TO_JSON_STRING(t)) AS foo_sum
FROM Input AS t;
+---------------------------------------------------------------------+---------+
| json_row | foo_sum |
+---------------------------------------------------------------------+---------+
| {"s":{"foo":1,"bar":2,"baz":{"x":"foo","foo":3.14}},"foo":10} | 14.14 |
| {"s":null,"foo":4} | 4 |
| {"s":{"foo":null,"bar":2,"baz":{"x":"fizz","foo":1.59}},"foo":null} | 1.59 |
+---------------------------------------------------------------------+---------+
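Applied to your messages column, a minimal sketch might look like the following (my_func's body is a placeholder; the essential parts are TO_JSON_STRING on the SQL side and JSON.parse on the JS side):
CREATE TEMP FUNCTION my_func(messages_json STRING)
RETURNS FLOAT64
LANGUAGE js AS """
// parse the serialized ARRAY<STRUCT<...>> back into a JS array
var messages = JSON.parse(messages_json);
// placeholder aggregation: count the messages; replace with your own logic
return messages == null ? 0 : messages.length;
""";
SELECT my_func(TO_JSON_STRING(messages)) AS message_count
FROM `project.dataset.table`;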
Because UDFs in BigQuery don't support the DATETIME type, I need to cast the attribute created_time.
Below is for BigQuery Standard SQL and is a very simple way of casting a specific element of an ARRAY while leaving everything else as is.
In the example below, it casts created_time from DATETIME to STRING (though you can use any compatible type you need).
#standardSQL
SELECT messages,
ARRAY(SELECT AS STRUCT * REPLACE (SAFE_CAST(created_time AS STRING) AS created_time)
FROM UNNEST(messages) message
) casted_messages
FROM `project.dataset.table`
If you run it against your data, you will see the original and the casted messages. All elements should be the same (value/type), with the exception of (as expected) created_time, which will be of the casted type (STRING in this particular case), or NULL if not compatible.
You can test / play with the above using dummy data, as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT [STRUCT<created_time DATETIME,
`from` STRUCT<id STRING, name STRING, email STRING>,
id STRING,
message STRING>
(DATETIME '2018-01-01 13:00:00', ('1', 'mike', 'zzz#ccc'), 'a', 'abc'),
(DATETIME '2018-01-02 14:00:00', ('2', 'john', 'yyy#bbb'), 'b', 'xyz')
] messages
)
SELECT messages,
ARRAY(SELECT AS STRUCT * REPLACE (SAFE_CAST(created_time AS STRING) AS created_time)
FROM UNNEST(messages) message
) casted_messages
FROM `project.dataset.table`