I'm seeing some very strange behavior when using a CASE WHEN expression inside a UDF in one of my Google BigQuery queries. The result is bizarre enough that either I'm missing something obvious, or something odd is happening in query execution.
(Side note: if there's a more efficient way to implement the query logic below I'm all ears; this query takes forever.)
I'm processing log lines, where each line contains a data string and a topics array<string> field used for decoding. Each type of log line has a different number of topics and requires different decoding logic, so I use a CASE WHEN inside a UDF to dispatch to the right decoder. I originally got a strange error about indexing past the end of an array, which would mean either non-conformant data or the wrong decoder being called at some point. I verified that all the data conforms to the spec, so it must be the latter.
I've narrowed it down to an erroneous/extraneous decoder being executed inside my CASE WHEN for the wrong type.
The weirdest part is that when I substitute fixed values for my decoding functions, the return value of the CASE WHEN doesn't indicate a bad match. Somehow the first function gets called when I use the functions, but when debugging with fixed values I get the proper value from the second WHEN.
I pulled the logic out of the UDF and implemented it with an if(..) instead of the CASE WHEN, and everything decodes fine (a rough sketch of that variant appears further down). I'm wondering what's going on here: is it a bug in BigQuery, or does something strange happen when using UDFs?
Here's a stripped-down version of the query:
-- helper function to normalize different payloads into a flattened struct
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_field_type1(field1) as field1,
decode_field_type1(field2) as field2,
decode_field_type2(field3) as field3,
-- a bunch more fields
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'field1', 'field2', 'field3', --a bunch more fields
)
)
))
);
-- this topic uses both the data and topics fields in the decoding, and has a topics array of length 4
-- it gets called from the CASE with a payload from topic2, which has a shorter topics array of length 1, causing a failure
create temporary function decode_topic1(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(topics[offset(1)], 3) as value),
struct("field2" as name, substring(topics[offset(2)], 3) as value),
struct("field3" as name, substring(topics[offset(3)], 3) as value),
struct("field4" as name, substring(data, 3, 64) as value)
])
);
-- this uses only the data field, and has a topics array of length 1
create temporary function decode_topic2(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(data, 3, 64) as value),
struct("field5" as name, substring(data, 67, 64) as value),
struct("field6" as name, substring(data, 131, 64) as value)
])
);
create temporary function decode_event_data(data string, topics array<string>) as
(
-- first element of topics denotes the type of event
case
-- somehow the function decode_topic1 gets called when topics[0] == topic2
-- HOWEVER, when I replaced the functions with fixed values to debug,
-- I got the expected results, indicating a proper match.
-- this is not unique to these topics;
-- it happens with other combinations as well.
when topics[offset(0)] = 'topic1' then decode_topic1(data, topics)
when topics[offset(0)] = 'topic2' then decode_topic2(data, topics)
-- a bunch more topics
else wrap_struct([])
end
);
select
id, data, topics,
decode_event_data(data, topics) as decoded_payload
from (select * from mytable
where
topics[offset(0)] = 'topic1'
or topics[offset(0)] = 'topic2')
When I change the base query to:
select
id, data, topics, decode_topic2(data, topics)
from (select * from mytable
where
topics[offset(0)] = 'topic2')
it decodes fine.
What's up with the CASE WHEN?
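For reference, the if()-based version I mentioned was roughly along these lines (a reconstructed sketch, inlined into the base query rather than wrapped in the dispatch UDF, and covering only the two topics shown here):
select
id, data, topics,
if(topics[offset(0)] = 'topic1',
decode_topic1(data, topics),
if(topics[offset(0)] = 'topic2', decode_topic2(data, topics), wrap_struct([]))) as decoded_payload
from mytable
where topics[offset(0)] in ('topic1', 'topic2')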
Edit: Here's a query on a public dataset that can reproduce the problem:
create temporary function decode_address(raw string) as (
concat('0x', substring(raw, 25, 40))
);
create temporary function decode_amount(raw string) as (
concat('0x', raw)
);
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_address(sender) as reserve,
decode_address(`to`) as `to`,
decode_amount(amount1) as amount1,
decode_amount(amount2) as amount2,
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'sender', 'to', 'amount1', 'amount2'
)
)
))
);
create temporary function decode_mint(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value)
])
);
create temporary function decode_burn(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
struct("to" as name, substring(topics[offset(2)], 67, 64) as value)
])
);
select
*,
case
when topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f' then decode_mint(data, topics)
when topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822' then decode_burn(data, topics)
end as decoded_payload
from `public-data-finance.crypto_ethereum_kovan.logs`
where
array_length(topics) > 0
and (
(array_length(topics) = 2 and topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f')
or (array_length(topics) = 3 and topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822')
)
The offset(nn) is killing your function; change it to safe_offset(nn) and that will solve the problem. Also, the to field is generally empty in decode_burn, or null for decode_mint, so at least with this data it is just causing problems.
The following works and solves the issue:
create temporary function decode_burn(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
struct("to" as name, substring(topics[SAFE_OFFSET(2)], 67, 64) as value)
])
);
Edit1:
I have analyzed the data sent to the functions and the processing steps in detail, and you are right about the filters and how they work. It seems you have hit a specific corner case (or bug?):
As far as I can tell from the processing steps, BigQuery is optimizing your functions because they are very similar, except for the additional field in one of them.
So the BigQuery engine is using the same optimized function for both, which causes the exception when the input comes from rows with topics[OFFSET(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f'.
Given that this is happening:
using safe_offset is still a good call;
or, create only one function and put the conditional inside it, in which case the query is processed correctly:
create temporary function decode(data_payload string, topics array<string>) as (
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
if(topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822',
struct("to" as name, topics[offset(2)] as value), null)
])
);
select *, decode(data, topics) ...
In parallel, you can open a case on the issue tracker.
Related
I have an Athena table where one column contains JSON key/value pairs.
Ex:
Select test_client, test_column from ABC;
test_client, test_column
john, {"d":13, "e":210}
mark, {"a":1,"b":10,"c":1}
john, {"e":100,"a":110,"b":10, "d":10}
mark, {"a":56,"c":11,"f":9, "e": 10}
I need to sum the values per key for each client and return something like the output below (the exact output format doesn't matter; I just want the sums):
john: d: 23, e:310, a:110, b:10
mark: a:57, b:10, c:12, f:9, e:10
This can be done with a combination of a few useful functions in Trino:
WITH example_table AS
(SELECT 'john' as person, '{"d":13, "e":210}' as _json UNION ALL
SELECT 'mark', ' {"a":1,"b":10,"c":1}' UNION ALL
SELECT 'john', '{"e":100,"a":110,"b":10, "d":10}' UNION ALL
SELECT 'mark', '{"a":56,"c":11,"f":9, "e": 10}')
SELECT person, reduce(
array_agg(CAST(json_parse(_json) AS MAP(VARCHAR, INTEGER))),
MAP(ARRAY['a'],ARRAY[0]),
(s, x) -> map_zip_with(
s,x, (k, v1, v2) ->
if(v1 is null, 0, v1) +
if(v2 is null, 0, v2)
),
s -> s
)
FROM example_table
GROUP BY person
json_parse - Parses the string to a JSON object
CAST ... AS MAP... - Creates a MAP from the JSON object
array_agg - Aggregates the maps for each Person based on the group by
reduce - steps through the aggregated array and reduce it to a single map
map_zip_with - applies a function on each similar key in two maps
if(... is null ...) - puts 0 instead of null if the key is not present
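Against the sample rows above, this should return one summed map per person, matching the expected output in the question (key order may vary):
john | {a=110, b=10, d=23, e=310}
mark | {a=57, b=10, c=12, e=10, f=9}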
I have a table
CREATE TABLE table (
id Int32,
values Array(Tuple(LowCardinality(String), Int32)),
date Date
) ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (id, date)
but when executing the query
SELECT count(*)
FROM table
WHERE (arrayExists(x -> ((x.1) = toLowCardinality('pattern')), values) = 1)
I get an error
Code: 49. DB::Exception: Received from clickhouse:9000. DB::Exception: Cannot capture column 3 because it has incompatible type: got String, but LowCardinality(String) is expected..
If I change the column 'values' to
values Array(Tuple(String, Int32))
then the query executes without errors.
What could be the problem when using Array(Tuple(LowCardinality(String), Int32))?
Until it is fixed (see bug 7815), this workaround can be used:
SELECT uniqExact((id, date)) AS count
FROM table
ARRAY JOIN values
WHERE values.1 = 'pattern'
For the case where there is more than one Array column, it can be done this way:
SELECT uniqExact((id, date)) AS count
FROM
(
SELECT
id,
date,
arrayJoin(values) AS v,
arrayJoin(values2) AS v2
FROM table
WHERE v.1 = 'pattern' AND v2.1 = 'pattern2'
)
values Array(Tuple(LowCardinality(String), Int32)),
Do not use Tuple. It brings only cons:
It still results in 2 files on disk (one per tuple element), the same as two separate columns.
It makes extracting only one tuple element about twice as slow.
https://gist.github.com/den-crane/f20a2dce94a2926a1e7cfec7cdd12f6d
Use two parallel arrays instead:
valuesS Array(LowCardinality(String)),
valuesI Array(Int32)
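A minimal sketch of that layout (the table name t_parallel is just a placeholder; it assumes the same id/date columns as the original table):
CREATE TABLE t_parallel
(
id Int32,
valuesS Array(LowCardinality(String)),
valuesI Array(Int32),
date Date
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (id, date);

-- the arrayExists filter becomes a simple has() on the string array
SELECT count(*)
FROM t_parallel
WHERE has(valuesS, 'pattern');

-- if the paired Int32 is needed, indexOf() recovers it (the arrays stay aligned by position)
SELECT valuesI[indexOf(valuesS, 'pattern')]
FROM t_parallel
WHERE has(valuesS, 'pattern');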
Test data
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3}')
, ('{"a":11,"b":12, "c":13}')
, ('{"a":21,"b":22, "c":23}')
Problem statement: I want to receive an arbitrary JSONB parameter which acts as a filter on column t.data, such as
{ "b":{ "from":0, "to":20 }, "c":13 }
and use this to select matching rows from my test table t.
In this example, I want rows where b is between 0 and 20 and c = 13.
No error is required if the filter specifies a "column" (or "tag") which does not exist in t.data - it just fails to find a match.
I've used numeric values for simplicity but would like an approach which generalises to text as well.
What I have tried so far: I looked at the containment approach, which works for equality conditions, but I'm stumped on a generic way of handling range conditions:
select * from t
where t.data @> '{"c":13}'::jsonb;
Background: This problem arose when building a generic table-preview page on a website (for Admin users).
The page displays a filter based on various columns in whichever table is selected for preview.
The filter is then passed to a function in Postgres DB which applies this dynamic filter condition to the table.
It returns a jsonb array of the rows matching the filter specified by the user.
This jsonb array is then used to populate the Preview resultset.
The columns which make up the filter may change.
My Postgres version is 9.6 - thanks.
If you want to parse { "b":{ "from":0, "to":20 }, "c":13 } generically, you need a parser; that is out of the scope of the JSON functions. But you can write a "generic" query using AND and OR to filter by such JSON, e.g.:
https://www.db-fiddle.com/f/jAPBQggG3p7CxqbKLMbPKw/0
with filt(f) as (values('{ "b":{ "from":0, "to":20 }, "c":13 }'::json))
select *
from t
join filt on
(f->'b'->>'from')::int < (data->>'b')::int
and
(f->'b'->>'to')::int > (data->>'b')::int
and
(data->>'c')::int = (f->>'c')::int
;
Thanks for the comments/suggestions.
I will definitely look at GraphQL when I have more time - I'm working under a tight deadline at the moment.
It seems the consensus is that a fully generic solution is not achievable without a parser.
However, I got a workable first draft - it's far from ideal but we can work with it. Any comments/improvements are welcome ...
Test data (expanded to include dates & text fields)
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3, "d":"2018-03-10", "e":"2018-03-10", "f":"Blah blah" }')
, ('{"a":11,"b":12, "c":13, "d":"2018-03-14", "e":"2018-03-14", "f":"Howzat!"}')
, ('{"a":21,"b":22, "c":23, "d":"2018-03-14", "e":"2018-03-14", "f":"Blah blah"}')
First draft of code to apply a jsonb filter dynamically, but with restrictions on what syntax is supported.
Also, it just fails silently if the syntax supplied does not match what it expects.
Timestamp handling a bit kludgey, too.
-- Handle timestamp & text types as well as int
-- See is_timestamp(text) function at bottom
with cte as (
select t.data, f.filt, fk.key
from t
, ( values ('{ "a":11, "b":{ "from":0, "to":20 }, "c":13, "d":"2018-03-14", "e":{ "from":"2018-03-11", "to": "2018-03-14" }, "f":"Howzat!" }'::jsonb ) ) as f(filt) -- equiv to cross join
, lateral (select * from jsonb_each(f.filt)) as fk
)
select data, filt --, key, jsonb_typeof(filt->key), jsonb_typeof(filt->key->'from'), is_timestamp((filt->key)::text), is_timestamp((filt->key->'from')::text)
from cte
where
case when (filt->key->>'from') is null then
case jsonb_typeof(filt->key)
when 'number' then (data->>key)::numeric = (filt->>key)::numeric
when 'string' then
case is_timestamp( (filt->key)::text )
when true then (data->>key)::timestamp = (filt->>key)::timestamp
else (data->>key)::text = (filt->>key)::text
end
when 'boolean' then (data->>key)::boolean = (filt->>key)::boolean
else false
end
else
case jsonb_typeof(filt->key->'from')
when 'number' then (data->>key)::numeric between (filt->key->>'from')::numeric and (filt->key->>'to')::numeric
when 'string' then
case is_timestamp( (filt->key->'from')::text )
when true then (data->>key)::timestamp between (filt->key->>'from')::timestamp and (filt->key->>'to')::timestamp
else (data->>key)::text between (filt->key->>'from')::text and (filt->key->>'to')::text
end
when 'boolean' then false
else false
end
end
group by data, filt
having count(*) = ( select count(distinct key) from cte ) -- must match on all filter elements
;
create or replace function is_timestamp(s text) returns boolean as $$
begin
perform s::timestamp;
return true;
exception when others then
return false;
end;
$$ strict language plpgsql immutable;
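A quick sanity check of the helper (assuming the function above has been created):
select is_timestamp('2018-03-14') as looks_like_ts, is_timestamp('Howzat!') as not_a_ts;
-- expected: true, false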
How can I alias field1 as index and field2 as value?
The query gives me an error:
#standardsql
with q1 as (select 1 x, ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING,value STRING>>>>
[struct('h1',[('1','a')
,('2','b')
])
,('h2',[('3','c')
,('4','d')
])] hits
)
Select * from q1
ORDER by x
Error: Array element type STRUCT<STRING, ARRAY<STRUCT<STRING, STRING>>> does not coerce to STRUCT<id STRING, cd ARRAY<STRUCT<index STRING, value STRING>>> at [5:26]
Thanks a lot for your time in responding
Cheers!
#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT('1' AS index, 'a' AS value), ('2','b')] AS cd),
STRUCT('h2', [STRUCT('3' AS index, 'c' AS value), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x
Or the below might be even more "readable":
#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT<index STRING, value STRING>('1', 'a'), ('2','b')] AS cd),
STRUCT('h2', [STRUCT<index STRING, value STRING>('3', 'c'), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x
When I simulate data in BigQuery using Standard SQL, I usually try to name all variables and aliases everywhere possible. For instance, your data works if you build it like so:
with q1 as (
select 1 x, ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING, value STRING>>>>[
struct('h1' as id, [STRUCT('1' as index, 'a' as value), STRUCT('2' as index, 'b' as value)] as cd),
STRUCT('h2', [STRUCT('3' as index, 'c' as value), STRUCT('4' as index, 'd' as value)] as cd)
] hits
)
select * from q1
order by x
Notice I've built structs and put aliases inside them in order for this to work (if you remove the aliases and the structs it may not work; I found this to be rather intermittent, but if you fully describe the variables it works all the time).
Also, as a recommendation, I build simulated data piece by piece: first I create one struct and test whether BigQuery accepts it, and only after the validator is green do I add more values. If you try to build everything at once you may find it a somewhat challenging task.
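For example, a minimal illustration of that piece-by-piece approach (the literal values are just placeholders):
-- step 1: validate a single struct
select struct('1' as index, 'a' as value) as cd_element;
-- step 2: wrap it in an array and add the parent struct
select struct('h1' as id, [struct('1' as index, 'a' as value)] as cd) as hit;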
I am having some trouble writing some SQL (for SQL Server 2008).
I have a table of tasks where each row holds a priority-ordered, comma-delimited list of tasks:
Id = 1, LongTaskName = "a,b,c"
Id = 2, LongTaskName = "a,c"
Id = 3, LongTaskName = "b,c"
Id = 4, LongTaskName = "a"
etc...
I am trying to build a new table that groups them by the first task, along with the id:
GroupName: "a", TaskId: 1
GroupName: "a", TaskId: 2
GroupName: "a", TaskId: 4
GroupName: "b", TaskId: 3
Here is the naive, slow LINQ code:
foreach(var t in Tasks)
{
var gt = new GroupedTasks();
gt.TaskId = t.Id;
var firstWord = t.LongTaskName.Split(',');
if(firstWord.Count() > 0)
{
gt.GroupName = firstWord.First();
}
else
{
gt.GroupName = t.LongTaskName;
}
GroupedTasks.InsertOnSubmit(gt);
}
I wrote a SQL function to do the string split:
create function fn_Split(
@String nvarchar(4000),
@Delimiter nvarchar(10)
)
returns nvarchar(4000)
as
begin
declare @FirstComma int
set @FirstComma = charindex(@Delimiter, @String)
if(@FirstComma = 0)
return @String
return substring(@String, 0, @FirstComma)
end
go
However, I am getting stuck on the real sql to do the work.
I can get the group by alone:
SELECT dbo.fn_Split(LongTaskName, ',')
FROM [dbo].[Tasks]
GROUP BY dbo.fn_Split(LongTaskName, ',')
And I know I need to head down something like this:
DECLARE @RowSet TABLE (GroupName nvarchar(1024), Id nvarchar(5))
insert into @RowSet
select ???
FROM [dbo].Tasks as T
INNER JOIN
(
SELECT dbo.fn_Split(LongTaskName, ',')
FROM [dbo].[Tasks]
GROUP BY dbo.fn_Split(LongTaskName, ',')
) G
ON T.??? = G.???
ORDER BY ???
INSERT INTO dbo.GroupedTasks(GroupName, Id)
select * from @RowSet
But I am not quite grokking how to reference the grouped relationships, and I'm confused about having to call split multiple times.
Any thoughts?
If you only care about the first item in the list, there's really no need for a function; I would recommend the approach below. You also don't need the @RowSet table variable for any temporary holding.
INSERT dbo.GroupedTasks(GroupName, Id)
SELECT
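-- CHARINDEX returns 0 when there is no comma; NULLIF/COALESCE then falls back to taking the whole value (up to 1024 chars)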
LEFT(LongTaskName, COALESCE(NULLIF(CHARINDEX(',', LongTaskName)-1, -1), 1024)),
Id
FROM dbo.Tasks;
It is even easier if the tasks are one character long: you can use LEFT(LongTaskName, 1) instead of the ugly SUBSTRING/CHARINDEX mess. But I'm guessing your task names are not one character long (if they are, you should include some sample data that varies a bit so that others don't make assumptions about length).
Now, keep in mind that you'll have to do something like this to keep dbo.GroupedTasks up to date every time a dbo.Tasks row is inserted, updated or deleted. How are you going to keep these two tables in sync?
More to the point, you should consider storing the top priority task separately in the first place, either by using a computed column or separating it out before the insert. Munging data together is something that you do with hash tables and arrays in application code, but it rarely has any positive attributes inside a database. You almost always spend more time and effort extracting the data apart than you ever saved by keeping it together in the first place. This will negate the need for a second table at all.
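A minimal sketch of the computed-column idea, reusing the same expression (the column name FirstTask is just a placeholder):
-- persisted computed column that always holds the first (top-priority) task
ALTER TABLE dbo.Tasks
ADD FirstTask AS LEFT(LongTaskName, COALESCE(NULLIF(CHARINDEX(',', LongTaskName) - 1, -1), 1024)) PERSISTED;

-- grouping then needs no split at all
SELECT FirstTask AS GroupName, Id
FROM dbo.Tasks
ORDER BY FirstTask;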
Select Id, dbo.fn_Split(LongTaskName, ',') as GroupName into TasksWithGroupInfo from dbo.Tasks
Does this answer your question?