Pig: How to join data on a key in a nested bag - apache-pig

I'm simply trying to merge the values from data2 into data1 on the 'value1'/'value2' keys seen in both data1 and data2 (note the nested structure of data1).
Easy, right? In object-oriented code it's a nested for loop. But in Pig it feels like solving a Rubik's cube.
data1 = 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
data2 = 'value1' 'result1'
'value2' 'result2'
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
expected: 'item1', 111, {('thing1', 222, {('value1','result1'), ('value2','result2')})}
^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^
For the curious: data1 comes from an object oriented datastore, which explains the double nesting (simple object oriented format).

It sounds like you basically just want to do a join (it's unclear from the question whether this should be INNER, LEFT, RIGHT, or FULL). I think @SNeumann basically has the right answer, but I'll add some code to make it clearer.
Assuming the data looks like:
data1 = 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
...
data2 = 'value1' 'result1'
'value2' 'result2'
...
I'd do something like (untested):
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
A_flattened = FOREACH A GENERATE item, d, things.thing AS thing, things.d1 AS d1, FLATTEN(things.values) AS value;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1'
--'item1', 111, 'thing1', 222, 'value2'
A_B_joined = JOIN A_flattened BY value, B BY v;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1', 'value1', 'result1'
--'item1', 111, 'thing1', 222, 'value2', 'value2', 'result2'
A_B_joined1 = FOREACH A_B_joined GENERATE item, d, thing, d1, A_flattened::value AS value, r AS result;
A_B_grouped = GROUP A_B_joined1 BY (item, d, thing, d1);
From there, rebagging however you like should be trivial.
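For example, a minimal untested sketch of that rebagging back to the expected nested shape, using the aliases above (my addition, not part of the original answer):
A_rebagged = FOREACH A_B_grouped GENERATE
    FLATTEN(group) AS (item, d, thing, d1),
    A_B_joined1.(value, result) AS values;  -- inner bag of (value, result) pairs
-- regroup by (item, d) to rebuild the outer things bag
A_final = FOREACH (GROUP A_rebagged BY (item, d)) GENERATE
    FLATTEN(group) AS (item, d),
    A_rebagged.(thing, d1, values) AS things;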
EDIT: The above should have used '.' as the projection operator on tuples; I've switched that in. It also assumed things was a tuple, which it isn't: it's a bag of one item. If the OP never plans to have more than one item in that bag, I'd highly recommend using a tuple instead and loading as:
A = load 'data1' as (item:chararray, d:int, things:(thing:chararray, d1:int, values:bag{(v:chararray)}));
and then using the rest of the code essentially as is (note: still untested).
If a bag is absolutely required, then the entire problem changes, and it becomes unclear what the OP wants to happen when there are multiple things objects in the bag. Bag projection is also quite a bit more complicated, as noted here.
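To illustrate the difference (my example, untested): projecting a field through a bag yields another bag of single-field tuples rather than a scalar, so each such field needs its own FLATTEN or a nested FOREACH:
-- with things as a bag, this produces a bag like {('thing1')}, not the chararray 'thing1'
T = FOREACH A GENERATE things.thing AS thing_bag;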

I'd try to flatten the bag in A that contains the values ('value1', 'value2'), join with B (inner, outer, whatever you're after), and then group again and project the desired structure using TOBAG, etc.

Related

BigQuery Dynamic JSON attributes as columnar data

I have a table with one of the columns containing JSON.
Col_A | Col_B | Col_C
1     | Abc1  | {"a": "a_val", "b": "b_val"}
2     | Abc2  | {"a": "a_val2", "c": "c_val2"}
3     | Abc3  | {"b": "b_val3", "c": "c_val3", "d": {"x": "x_val", "y": "y_val"}}
How can I put together BigQuery SQL to extract the attributes of the JSON as additional columns? I need to go only one level deep into the JSON, so the output should look like:
Col_A | Col_B | A      | B      | C      | D
1     | Abc1  | a_val  | b_val  |        |
2     | Abc2  | a_val2 |        | c_val2 |
3     | Abc3  |        | b_val3 | c_val3 | {"x": "x_val", "y": "y_val"}
Consider the approach below:
create temp function json_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function json_path(json string, json_path string)
returns string
language js
options (library=["gs://my-storage/jsonpath-0.8.0.js"])
as """
try { var parsed = JSON.parse(json);
return JSON.stringify(jsonPath(parsed, json_path));
} catch (e) { return null }
""";
select * from (
select t.* except(col_c), key, trim(json_path(col_c, '$.' || key), '"[]') value
from your_table t,
unnest(json_keys(col_c)) key
)
pivot (min(value) for key in ('a', 'b', 'c', 'd'))
If applied to the sample data in your question:
with your_table as (
select 1 col_a, 'abc1' col_b, '{"a": "a_val", "b": "b_val"}' col_c union all
select 2, 'abc2', '{"a": "a_val2", "c": "c_val2"}' union all
select 3, 'abc3', '{"b": "b_val3", "c": "c_val3", "d": {"x": "x_val", "y": "y_val"}}'
)
The output matches the expected table shown in the question.
In order to use the above, you need to upload jsonpath-0.8.0.js (which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads) into your GCS bucket gs://my-storage/.
As you can see, the solution above assumes you know the key names in advance.
Obviously, when you know the keys in advance you would simply use json_extract, but that does not work when you don't know the keys ahead of time!
So, if you don't know them, the solution above can be used as a template for dynamically generating the query (with the real keys in for key in ('a', 'b', 'c', 'd')) to be executed with EXECUTE IMMEDIATE.
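A minimal sketch of that dynamic variant (my addition, untested; it assumes the temp functions above are visible to the dynamically built statement and that your_table and col_c exist as in the question):
execute immediate format("""
select * from (
  select t.* except(col_c), key, trim(json_path(col_c, '$.' || key), '"[]') value
  from your_table t,
  unnest(json_keys(col_c)) key
)
pivot (min(value) for key in (%s))
""", (
  -- collect the distinct keys, quoted, e.g. 'a','b','c','d'
  select string_agg(distinct "'" || key || "'")
  from your_table, unnest(json_keys(col_c)) key
));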

erroneous function call with case when inside UDF

I'm noticing some really weird behavior when using a CASE WHEN statement inside a UDF in one of my queries in Google BigQuery. The result is really bizarre, so either I'm missing something really obvious or there's some weird behavior in the query execution.
(Side note: if there's a more efficient way to implement the query logic below, I'm all ears; my query takes forever.)
I'm processing some log lines, where each line contains a data: string and topics: array<string> field that are used for decoding. Each type of log line has a different number of topics and requires different decoding logic, so I use a CASE WHEN inside a UDF to switch between decoding methods. I originally got a strange error about indexing too far into an array, which would mean either non-conformant data or the wrong decoder being called at some point. I verified that all the data conformed to the spec, so it must be the latter.
I've narrowed it down to an erroneous / extraneous decoder being executed inside my CASE WHEN for the wrong type.
The weirdest thing is that when I insert a fixed value instead of my decoding functions, the return value of the CASE WHEN doesn't indicate an erroneous match. Somehow the first function gets called when I use the functions, but when debugging with fixed values I get the proper value from the second WHEN.
I pulled the logic out of the UDF and implemented it with an if(...) instead of CASE WHEN, and everything decodes fine. I'm wondering what's going on here: is it a bug in BigQuery, or does something weird happen when using UDFs?
Here's a stripped down version of the query
-- helper function to normalize different payloads into a flattened struct
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_field_type1(field1) as field1,
decode_field_type1(field2) as field2,
decode_field_type2(field3) as field3,
-- a bunch more fields
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'field1', 'field2', 'field3', --a bunch more fields
)
)
))
);
-- this topic uses the data and topics in the decoding, and has a topics array of length 4
-- it gets called from the switch with a payload from topic2, which has a shorter topics array of length 1, causing a failure
create temporary function decode_topic1(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(topics[offset(1)], 3) as value),
struct("field2" as name, substring(topics[offset(2)], 3) as value),
struct("field3" as name, substring(topics[offset(3)], 3) as value),
struct("field4" as name, substring(data, 3, 64) as value)
])
);
-- this uses only the data payload, and has a topics array of length 1
create temporary function decode_topic2(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(data, 3, 64) as value),
struct("field5" as name, substring(data, 67, 64) as value),
struct("field6" as name, substring(data, 131, 64) as value)
])
);
create temporary function decode_event_data(data string, topics array<string>) as
(
-- first element of topics denotes the type of event
case
-- somehow the function decode_topic1 gets called when topics[0] == 'topic2'
-- HOWEVER, when I replaced the functions with a fixed value to debug,
-- I got the expected results, indicating a proper match.
-- this is not unique to these topics;
-- it happens with other combinations also.
when topics[offset(0)] = 'topic1' then decode_topic1(data, topics)
when topics[offset(0)] = 'topic2' then decode_topic2(data, topics)
-- a bunch more topics
else wrap_struct([])
end
);
select
id, data, topics,
decode_event_data(data, topics) as decoded_payload
from (select * from mytable
where
topics[offset(0)] = 'topic1'
or topics[offset(0)] = 'topic2')
When I change the base query to:
select
id, data, topics, decode_topic2(data, topics)
from (select * from mytable
where
topics[offset(0)] = 'topic2')
it decodes fine.
What's up with the CASE WHEN?
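For reference, a sketch of the if() variant described above (my reconstruction of what the question reports, untested):
select
  id, data, topics,
  if(topics[offset(0)] = 'topic1',
     decode_topic1(data, topics),
     if(topics[offset(0)] = 'topic2',
        decode_topic2(data, topics),
        wrap_struct([]))) as decoded_payload
from mytable
where topics[offset(0)] = 'topic1'
   or topics[offset(0)] = 'topic2';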
edit: Here's a query on a public dataset that can reproduce the problem:
create temporary function decode_address(raw string) as (
concat('0x', substring(raw, 25, 40))
);
create temporary function decode_amount(raw string) as (
concat('0x', raw)
);
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_address(sender) as reserve,
decode_address(`to`) as `to`,
decode_amount(amount1) as amount1,
decode_amount(amount2) as amount2,
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'sender', 'to', 'amount1', 'amount2'
)
)
))
);
create temporary function decode_mint(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value)
])
);
create temporary function decode_burn(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
struct("to" as name, substring(topics[offset(2)], 67, 64) as value)
])
);
select
*,
case
when topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f' then decode_mint(data, topics)
when topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822' then decode_burn(data, topics)
end as decoded_payload
from `public-data-finance.crypto_ethereum_kovan.logs`
where
array_length(topics) > 0
and (
(array_length(topics) = 2 and topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f')
or (array_length(topics) = 3 and topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822')
)
The offset(nn) is killing your function; change it to safe_offset(nn) and that will solve the problem. Also, this to field is generally empty in decode_burn, and null in decode_mint, so, at least with this data, it is just causing problems.
The following works and solves the issue:
create temporary function decode_burn(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
struct("to" as name, substring(topics[SAFE_OFFSET(2)], 67, 64) as value)
])
);
Edit1:
I have analyzed in detail the data sent to the functions and the processing steps, and you are right about the filters and how they are working. It seems you have reached a specific corner case (or bug?):
As far as I could understand from the processing steps, BQ is optimizing your functions, as they are very similar, except for the fact that there is an additional field in one of them.
So, the BQ engine is using the same optimized function for both sets of data, which is causing the exception when the input is the data of rows with topics[OFFSET(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f'
Given that this is happening:
using safe_offset is still a good call;
or, create only one function and use the conditional inside it, in which case the query will be processed correctly:
create temporary function decode(data_payload string, topics array<string>) as (
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
if(topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822',
struct("to" as name, topics[offset(2)] as value), null)])
);
select *, decode(DATA,topics) ...
In parallel, you can open a case on the issue tracker.

How to make a IN query with hstore?

I have a field (content) in a table containing keys and values (hstore), like this:
content: {"price"=>"15.2", "quantity"=>"3", "product_id"=>"27", "category_id"=>"2", "manufacturer_id"=>"D"}
I can easily select product having ONE category_id with :
SELECT * FROM table WHERE content @> 'category_id=>27'
I want to select all lines having (for example) category_id IN a list of value.
In classic SQL it would be something like this :
SELECT * FROM TABLE WHERE category_id IN (27, 28, 29, ....)
Thank you in advance.
De-reference the key and test it with IN as normal.
CREATE TABLE hstoredemo(content hstore not null);
INSERT INTO hstoredemo(content) VALUES
('"price"=>"15.2", "quantity"=>"3", "product_id"=>"27", "category_id"=>"2", "manufacturer_id"=>"D"');
Then run one of these. The first is cleaner, as it casts the extracted value to integer rather than doing string comparisons on numbers.
SELECT *
FROM hstoredemo
WHERE (content -> 'category_id')::integer IN (2, 27, 28, 29);
SELECT *
FROM hstoredemo
WHERE content -> 'category_id' IN ('2', '27', '28', '29');
If you had to test more complex hstore contains operations, say with multiple keys, you could use #> ANY, e.g.
SELECT *
FROM hstoredemo
WHERE
content @> ANY(
ARRAY[
'"category_id"=>"27","product_id"=>"27"',
'"category_id"=>"2","product_id"=>"27"'
]::hstore[]
);
but it's not pretty, and it'll be a lot slower, so don't do this unless you have to.
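If these lookups are frequent, an expression index on the de-referenced key can speed up the plain IN form above (my suggestion, not part of the original answer):
-- supports WHERE (content -> 'category_id')::integer IN (...)
CREATE INDEX hstoredemo_category_id_idx
ON hstoredemo (((content -> 'category_id')::integer));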
category_ids = ["27", "28", "29"]
Tablename.where("content -> 'category_id' IN (?)", category_ids)

SQL IN operator, separate input values at commas

User inputs to text_field_tag, e.g. a,b,c, are seen as one value, i.e. value1 = 'a,b,c', in a SQL IN operator (value1, value2, value3, ...) instead of value1 = 'a', value2 = 'b' and value3 = 'c'.
I'm using Sequel's db.fetch to write the SQL. split and join and their associated regexp forms don't seem to give the form 'a','b','c', i.e. separate values in a SQL IN operator.
Any thoughts?
Assuming you have some user input as a string:
user_input = 'a,b,c'
and a posts table with a value1 column. You can get posts with values a, b or c with the following query:
values = user_input.split(',')
#=> ["a", "b", "c"]
DB = Sequel.sqlite
#=> #<Sequel::SQLite::Database: {:adapter=>:sqlite}>
dataset = DB[:posts]
#=> #<Sequel::SQLite::Dataset: "SELECT * FROM `posts`">
dataset.where(:value1 => values).sql
#=> "SELECT * FROM `posts` WHERE (`value1` IN ('a', 'b', 'c'))"
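If the raw input may contain stray whitespace or empty entries (my addition, not in the original answer), it's worth normalizing it before building the query:
values = user_input.split(',').map(&:strip).reject(&:empty?)
#=> ["a", "b", "c"]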

how to create a small constant relation(table) in pig?

Is there a way to create a small constant relation (table) in Pig?
I need to create a relation with only 1 tuple that contains constant values.
something along the lines of:
A = LOAD using ConstantLoader('{(1,2,3)}');
thanks, Ido
I'm not sure why you would need that, but here's an ugly solution:
A = LOAD 'some/sample/file' ;
B = FOREACH A GENERATE '' ;
C = LIMIT B 1 ;
Now, you can use 'C' as the 'empty relation' that has one empty tuple.
DEFINE GenerateRelationFromString(string) RETURNS relation {
temp = LOAD 'somefile';
tempLimit1 = LIMIT temp 1;
$relation = FOREACH tempLimit1 GENERATE FLATTEN(TOKENIZE('$string', ','));
};
usage:
fourRows = GenerateRelationFromString('1,2,3,4');
myConstantRelation = FOREACH fourRows GENERATE (
CASE $0
WHEN '1' THEN (1, 'Ivan')
WHEN '2' THEN (2, 'Boris')
WHEN '3' THEN (3, 'Vladimir')
WHEN '4' THEN (4, 'Olga')
END
) as myTuple;
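If the tuple fields are needed as top-level columns, a small untested follow-up (my addition) can flatten them out, assuming the tuple schema can be asserted with AS:
idNamePairs = FOREACH myConstantRelation GENERATE FLATTEN(myTuple) AS (id:int, name:chararray);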
This is hacky for sure; the right way, in my mind, would be to implement a StringLoader() that would work like this:
fourRows = LOAD '1,2,3,4' USING StringLoader(',');
The argument typically used for the file location would just be used as literal string input.
Fast answer: no.
I asked about it on the pig-dev mailing list.