How to implement generic Oracle DECODE function in BigQuery? - google-bigquery

I'm looking into implementing the Oracle DECODE function as a UDF.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions040.htm
Below is the outward functionality and syntax of DECODE in Oracle:
Oracle:
DECODE( <expr> , <search1> , <result1> [ , <search2> , <result2> ... ] [ , <default> ] )
SELECT product_id,
DECODE (warehouse_id, 1, 'Southlake',
2, 'San Francisco',
3, 'New Jersey',
4, 'Seattle',
'Non domestic')
"Location of inventory" FROM inventories;
With BigQuery UDFs, SQL or JavaScript, when you define the function you need to know the number of parameters you are accepting and their types. A SQL UDF can also accept an array of ANY TYPE, but I am not sure whether an array-based SQL UDF can do what we want here. Based on the JavaScript UDF documentation, all parameters are named and typed up front.
Is there a way to accomplish this with a BigQuery UDF? It has to be dynamic like Oracle's DECODE and fit any scenario you put in front of it, not static in knowing what you are decoding.

Below is for BigQuery Standard SQL
CREATE TEMP FUNCTION DECODE(expr ANY TYPE, map ANY TYPE, `default` ANY TYPE ) AS ((
IFNULL((SELECT result FROM UNNEST(map) WHERE search = expr), `default`)
));
You can see how it works using the below example:
#standardSQL
CREATE TEMP FUNCTION DECODE(expr ANY TYPE, map ANY TYPE, `default` ANY TYPE ) AS ((
IFNULL((SELECT result FROM UNNEST(map) WHERE search = expr), `default`)
));
WITH `project.dataset.inventories` AS (
SELECT 1 product_id, 4 warehouse_id UNION ALL
SELECT 2, 2 UNION ALL
SELECT 3, 5
)
SELECT product_id, warehouse_id,
DECODE(warehouse_id,
[STRUCT<search INT64, result STRING>
(1,'Southlake'),
(2,'San Francisco'),
(3,'New Jersey'),
(4,'Seattle')
], 'Non domestic') AS `Location_of_inventory`
FROM `project.dataset.inventories`
with result
Row product_id warehouse_id Location_of_inventory
1 1 4 Seattle
2 2 2 San Francisco
3 3 5 Non domestic
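The lookup-with-default semantics of this UDF can be sketched in plain Python (an illustrative analogy only, not BigQuery code):

```python
def decode(expr, mapping, default):
    """Return the result paired with the first search value equal to expr,
    or default when nothing matches - the same lookup the SQL UDF performs."""
    for search, result in mapping:
        if search == expr:
            return result
    return default

# mirrors the ARRAY<STRUCT<search, result>> argument
warehouse_map = [(1, 'Southlake'), (2, 'San Francisco'),
                 (3, 'New Jersey'), (4, 'Seattle')]

print(decode(4, warehouse_map, 'Non domestic'))  # Seattle
print(decode(5, warehouse_map, 'Non domestic'))  # Non domestic
```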
Another example of use is:
#standardSQL
CREATE TEMP FUNCTION DECODE(expr ANY TYPE, map ANY TYPE, `default` ANY TYPE ) AS ((
IFNULL((SELECT result FROM UNNEST(map) WHERE search = expr), `default`)
));
WITH `project.dataset.inventories` AS (
SELECT 1 product_id, 4 warehouse_id UNION ALL
SELECT 2, 2 UNION ALL
SELECT 3, 5
), map AS (
SELECT 1 search, 'Southlake' result UNION ALL
SELECT 2, 'San Francisco' UNION ALL
SELECT 3, 'New Jersey' UNION ALL
SELECT 4, 'Seattle'
)
SELECT product_id, warehouse_id,
DECODE(warehouse_id, kv, 'Non domestic') AS `Location_of_inventory`
FROM `project.dataset.inventories`,
(SELECT ARRAY_AGG(STRUCT(search, result)) AS kv FROM map) arr
with the same output
Update to address - "for a reusable UDF, not having to name the fields makes it closer to Oracle's implementation."
CREATE TEMP FUNCTION DECODE(expr ANY TYPE, map ANY TYPE, `default` ANY TYPE ) AS (
IFNULL((
SELECT result FROM (
SELECT NULL AS search, NULL AS result UNION ALL SELECT * FROM UNNEST(map)
)
WHERE search = expr
), `default`)
);
So now the previous examples can be used without explicitly naming the fields, as in the example below:
#standardSQL
CREATE TEMP FUNCTION DECODE(expr ANY TYPE, map ANY TYPE, `default` ANY TYPE ) AS (
IFNULL((
SELECT result FROM (
SELECT NULL AS search, NULL AS result UNION ALL SELECT * FROM UNNEST(map)
)
WHERE search = expr
), `default`)
);
WITH `project.dataset.inventories` AS (
SELECT 1 product_id, 4 warehouse_id UNION ALL
SELECT 2, 2 UNION ALL
SELECT 3, 5
)
SELECT product_id, warehouse_id,
DECODE(warehouse_id,
[ (1,'Southlake'),
(2,'San Francisco'),
(3,'New Jersey'),
(4,'Seattle')
], 'Non domestic') AS `Location_of_inventory`
FROM `project.dataset.inventories`
still with same output as before

Alternatively, the same lookup can be written as a plain CASE expression, without a UDF:

```
SELECT product_id,
case
when warehouse_id = 1 then 'Southlake'
when warehouse_id = 2 then 'San Francisco'
when warehouse_id = 3 then 'New Jersey'
when warehouse_id = 4 then 'Seattle'
else
'Non domestic'
end as `Location of inventory` FROM inventories;
```

Related

ROW type/constructor in BigQuery

Does BigQuery have the concept of a ROW, for example, similar to MySQL or Postgres or Oracle or Snowflake? I know it sort of implicitly uses it when doing an INSERT ... VALUES (...) , for example:
INSERT dataset.Inventory (product, quantity)
VALUES('top load washer', 10),
('front load washer', 20)
Each of the values would be implicitly be a ROW type of the Inventory table, but is this construction allowed elsewhere in BigQuery? Or is this a feature that doesn't exist in BQ?
I think below is the simplest / naïve example of such a constructor in BigQuery:
with t1 as (
select 'top load washer' product, 10 quantity, 'a' type, 'x' category union all
select 'front load washer', 20, 'b', 'y'
), t2 as (
select 1 id, 'a' code, 'x' value union all
select 2, 'd', 'z'
)
select *
from t1
where (type, category) = (select as struct code, value from t2 where id = 1)
Besides use in simple queries, it can also be used in BQ scripts - for example (another simplistic example):
declare type, category string;
create temp table t2 as (
select 1 id, 'a' code, 'x' value union all
select 2, 'd', 'z'
);
set (type, category) = (select as struct code, value from t2 where id = 1);
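The struct comparison in these examples behaves much like tuple equality; a rough Python analogy (illustrative only, with the tables modeled as plain data structures):

```python
# t1 rows as dicts; t2 maps id -> (code, value)
t1 = [
    {'product': 'top load washer', 'quantity': 10, 'type': 'a', 'category': 'x'},
    {'product': 'front load washer', 'quantity': 20, 'type': 'b', 'category': 'y'},
]
t2 = {1: ('a', 'x'), 2: ('d', 'z')}

# where (type, category) = (select as struct code, value from t2 where id = 1)
wanted = t2[1]
rows = [r for r in t1 if (r['type'], r['category']) == wanted]
print(rows[0]['product'])  # top load washer
```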

How to convert the following dictionary format column into different format in Hive or Presto?

I have a Hive table as below:
event_name  attendees_per_countries
a           {'US':5}
b           {'US':4, 'UK': 3, 'CA': 2}
c           {'UK':2, 'CA': 1}
And I want to get a new table like below:
country  number_of_people
US       9
UK       5
CA       4
How to write a query in Hive or Presto?
If the column type for attendees_per_countries is a string, you may use the following:
WITH sample_data AS (
select
event_name,
str_to_map(
regexp_replace(attendees_per_countries,'[{|}]',''),
',',
':'
) as attendees_per_countries
FROM
raw_data
)
select
regexp_replace(cm.key,"[' ]","") as country,
SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC
However, if the column type for attendees_per_countries is already a map then you may use the following
select
regexp_replace(cm.key,"[' ]","") as country,
SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC
Full reproducible example below
with raw_data AS (
select 'a' as event_name, "{'US':5}" as attendees_per_countries
UNION ALL
select 'b', "{'US':4, 'UK': 3, 'CA': 2}"
UNION ALL
select 'c', "{'UK':2, 'CA': 1}"
),
sample_data AS (
select
event_name,
str_to_map(
regexp_replace(attendees_per_countries,'[{}]',''),
',',
':'
) as attendees_per_countries
FROM
raw_data
)
select
regexp_replace(cm.key,"[' ]","") as country,
SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC
Let me know if this works for you
In Presto, if you have attendees_per_countries as a map you can use map_values and then sum the values with array_sum or reduce (I need to use the latter because Athena does not support the former). If not, you can treat your data as JSON, cast it to MAP(VARCHAR, INTEGER), and then use the mentioned functions:
WITH dataset(event_name, attendees_per_countries) AS (
VALUES
('a', JSON '{"US":5}'),
('b', JSON '{"US":4, "UK": 3, "CA": 2}'),
('c', JSON '{"UK":2, "CA": 1}')
)
SELECT event_name as country,
reduce(
map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
0,
(agg, curr) -> agg + curr,
s -> s
) as number_of_people
FROM dataset
order by 2 desc
Output:
country  number_of_people
b        9
a        5
c        3
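In Python terms, the Presto expression amounts to parsing each JSON value into a map and folding its values with a reduce (illustrative sketch only):

```python
import json
from functools import reduce

rows = [('a', '{"US":5}'),
        ('b', '{"US":4, "UK": 3, "CA": 2}'),
        ('c', '{"UK":2, "CA": 1}')]

out = []
for event, j in rows:
    counts = json.loads(j)  # like cast(attendees_per_countries as MAP(VARCHAR, INTEGER))
    # reduce(map_values(...), 0, (agg, curr) -> agg + curr, s -> s)
    total = reduce(lambda agg, curr: agg + curr, counts.values(), 0)
    out.append((event, total))

out.sort(key=lambda r: -r[1])
print(out)  # [('b', 9), ('a', 5), ('c', 3)]
```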

Big query find data that could be in multiple columns

I have a table with the following data
id|task1_name|task1_date|task2_name|task2_date
1,breakfast,1/1/20,,
2,null,null,breakfast,1/1/20
3,null,null,lunch,1/1/20
4,dinner,1/1/20,lunch,1/1/10
I'd like to build a view that always displayed the task names in the same column or null if it could not be found in any of the columns e.g.
id|dinner_date|lunch_date|breakfast_date
1,1/1/20, null, null
2,null, null, 1/1/20
4,1/1/20, 1/1/10, null
I've tried using a nested IF statement e.g.
SELECT *,
IF(task1_name = 'dinner', task1_date, IF(task2_name = 'dinner', task2_date, NULL)) as `dinner_date`
FROM t
But as there are 50 or so columns in the real dataset, this seems like a stupid solution and would get complex very quickly, is there a smarter way here?
One method uses case expressions:
select t.*,
(case when task1_name = 'dinner' then task1_date
when task2_name = 'dinner' then task2_date
when task3_name = 'dinner' then task3_date
end) as dinner_date
from t;
Below is for BigQuery Standard SQL and is generic enough to address the concerns expressed in the question. You don't need to know the number of columns or the task names in advance (although they should not contain , or : which should not be a big limitation here and can be addressed if needed).
#standardSQL
CREATE TEMP TABLE ttt AS
SELECT id,
SPLIT(k, '_')[OFFSET(0)] task,
MAX(IF(SPLIT(k, '_')[OFFSET(1)] = 'name', v, NULL)) AS name,
MAX(IF(SPLIT(k, '_')[OFFSET(1)] = 'date', v, NULL)) AS DAY
FROM (
SELECT id,
TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') k,
TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') v
FROM `project.dataset.table` t,
UNNEST(SPLIT(TRIM(TO_JSON_STRING(t), '{}'))) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') != 'id'
AND TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') != 'null'
)
GROUP BY id, task;
EXECUTE IMMEDIATE '''
SELECT id, ''' || (
SELECT STRING_AGG(DISTINCT "MAX(IF(name = '" || name || "', day, NULL)) AS " || name || "_date")
FROM ttt
) || '''
FROM ttt
GROUP BY 1
ORDER BY 1
'''
Note: the only assumption here is that columns are named task<N>_name and task<N>_date.
If applied to sample data similar to yours in the question:
WITH `project.dataset.table` AS (
SELECT 1 id, 'breakfast' task1_name, '1/1/21' task1_date, NULL task2_name, NULL task2_date UNION ALL
SELECT 2, NULL, NULL, 'breakfast', '1/1/22' UNION ALL
SELECT 3, NULL, NULL, 'lunch', '1/1/23' UNION ALL
SELECT 4, 'dinner', '1/1/24', 'lunch', '1/1/10'
)
output is
Row id breakfast_date lunch_date dinner_date
1 1 1/1/21 null null
2 2 1/1/22 null null
3 3 null 1/1/23 null
4 4 null 1/1/10 1/1/24
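The core trick above - serializing each row with TO_JSON_STRING, splitting it into key/value pairs, and regrouping by id and task slot - can be sketched in Python (illustrative analogy only):

```python
from collections import defaultdict

# sample rows with NULLs as None; columns follow the task<N>_name / task<N>_date pattern
rows = [
    {'id': 1, 'task1_name': 'breakfast', 'task1_date': '1/1/21',
     'task2_name': None, 'task2_date': None},
    {'id': 4, 'task1_name': 'dinner', 'task1_date': '1/1/24',
     'task2_name': 'lunch', 'task2_date': '1/1/10'},
]

# unpivot each row into (task_slot, field, value) triples, skipping id and NULLs,
# then regroup per id and task slot - mirroring the GROUP BY id, task
pivoted = defaultdict(dict)
for row in rows:
    for k, v in row.items():
        if k == 'id' or v is None:
            continue
        task, field = k.split('_')            # e.g. ('task1', 'name')
        pivoted[row['id']].setdefault(task, {})[field] = v

# final pivot: one <task name>_date column per distinct task name
result = {rid: {t['name'] + '_date': t['date'] for t in tasks.values()}
          for rid, tasks in pivoted.items()}
print(result)
# {1: {'breakfast_date': '1/1/21'}, 4: {'dinner_date': '1/1/24', 'lunch_date': '1/1/10'}}
```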
Here is another solution which doesn't use dynamic SQL, doesn't rely on specific column names and works with arbitrary number of columns:
WITH table AS (
SELECT 1 id, 'breakfast' task1_name, '1/1/21' task1_date, NULL task2_name, NULL task2_date UNION ALL
SELECT 2, NULL, NULL, 'breakfast', '1/1/22' UNION ALL
SELECT 3, NULL, NULL, 'lunch', '1/1/23' UNION ALL
SELECT 4, 'dinner', '1/1/24', 'lunch', '1/1/10'
)
SELECT
REGEXP_EXTRACT(f, r'breakfast\, ([^\,\)]*)'),
REGEXP_EXTRACT(f, r'lunch\, ([^\,\)]*)'),
REGEXP_EXTRACT(f, r'dinner\, ([^\,\)]*)')
FROM (
SELECT FORMAT("%t", t) f FROM table t
)
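The idea behind FORMAT("%t", t) is to render the whole row as a single string and then regex-extract whatever follows each task name; roughly, in Python (illustrative only; the exact %t rendering is an assumption here):

```python
import re

# FORMAT("%t", t) renders a row roughly like this single string:
f = "(4, dinner, 1/1/24, lunch, 1/1/10)"

def date_after(task, formatted):
    # same pattern as the SQL: capture everything after "<task>, " up to ',' or ')'
    m = re.search(task + r', ([^,)]*)', formatted)
    return m.group(1) if m else None

print(date_after('dinner', f))     # 1/1/24
print(date_after('lunch', f))      # 1/1/10
print(date_after('breakfast', f))  # None
```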

How to query on fields from nested records without referring to the parent records in BigQuery?

I have data structured as follows:
{
"results": {
"A": {"first": 1, "second": 2, "third": 3},
"B": {"first": 4, "second": 5, "third": 6},
"C": {"first": 7, "second": 8, "third": 9},
"D": {"first": 1, "second": 2, "third": 3},
... },
...
}
i.e. nested records, where the lowest level has the same schema for all records in the level above. The schema would be similar to this:
results RECORD NULLABLE
results.A RECORD NULLABLE
results.A.first INTEGER NULLABLE
results.A.second INTEGER NULLABLE
results.A.third INTEGER NULLABLE
results.B RECORD NULLABLE
results.B.first INTEGER NULLABLE
...
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level? Put differently, can I do a query on first for all records in results without having to specify A, B, ... in my query?
I would for example want to achieve something like
SELECT SUM(results.*.first) FROM table
in order to get 1+4+7+1 = 13,
but SELECT results.*.first isn't supported.
(I've tried playing around with STRUCTs, but haven't gotten far.)
Below is a trick for BigQuery Standard SQL:
#standardSQL
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
You can test and play with the above using dummy/sample data from your question, as in the below example:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
with output
Row id sum_first sum_second sum_third
1 1 13 17 21
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level?
Below is for BigQuery Standard SQL and totally avoids referencing parent records (A, B, C, D, etc.)
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
Applied to the sample data from your question, as in the below example:
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
output is
Row id first_sum second_sum third_sum forth_sum
1 1 13 17 21 null
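What Nested_SUM does - serialize the struct to JSON, pull out each child record's body, and sum the requested field - can be sketched in Python (illustrative analogy only, not BigQuery code):

```python
import json
import re

results = {'A': {'first': 1, 'second': 2, 'third': 3},
           'B': {'first': 4, 'second': 5, 'third': 6},
           'C': {'first': 7, 'second': 8, 'third': 9},
           'D': {'first': 1, 'second': 2, 'third': 3}}

def nested_sum(entries, field_name):
    # TO_JSON_STRING(entries) with compact separators, then the same
    # REGEXP_EXTRACT_ALL(r'":{(.*?)}') to pull out each child record's body
    bodies = re.findall(r'":{(.*?)}', json.dumps(entries, separators=(',', ':')))
    total, found = 0, False
    for body in bodies:
        for kv in body.split(','):            # UNNEST(SPLIT(entry)) kv
            key, value = kv.split(':')
            if key.strip('"') == field_name:  # TRIM(SPLIT(kv, ':')[OFFSET(0)], '"')
                total += int(value)
                found = True
    return total if found else None           # SUM over zero rows is NULL

print(nested_sum(results, 'first'))  # 13
print(nested_sum(results, 'forth'))  # None
```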
I adapted Mikhail's answer in order to support grouping on the values of the lowest-level fields:
#standardSQL
CREATE TEMP FUNCTION Nested_AGGREGATE(entries ANY TYPE, field_name STRING) AS ((
SELECT ARRAY(
SELECT AS STRUCT TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') AS value, COUNT(SPLIT(kv, ':')[OFFSET(1)]) AS count
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
GROUP BY TRIM(SPLIT(kv, ':')[OFFSET(1)], '"')
)
));
SELECT id,
Nested_AGGREGATE(results, 'first') AS first_agg,
Nested_AGGREGATE(results, 'second') AS second_agg,
Nested_AGGREGATE(results, 'third') AS third_agg,
FROM `project.dataset.table`
Output for WITH `project.dataset.table` AS (SELECT 1 AS id, STRUCT( STRUCT(1 AS first, 2 AS second, 3 AS third) AS A, STRUCT(4 AS first, 5 AS second, 6 AS third) AS B, STRUCT(7 AS first, 8 AS second, 9 AS third) AS C, STRUCT(1 AS first, 2 AS second, 3 AS third) AS D) AS results ):
Row  id  first_agg.value  first_agg.count  second_agg.value  second_agg.count  third_agg.value  third_agg.count
1    1   1                2                2                 2                 3                2
         4                1                5                 1                 6                1
         7                1                8                 1                 9                1

BigQuery UDF define constant dictionary and match for a given function argument

What is the best way to define a map: Map(1 -> "One", 2 -> "Two") and define a function which will match a given argument against that map? I am thinking of defining a dictionary via JavaScript and matching in the function body. An example would be great. Thanks
Below is an example for BigQuery Standard SQL:
#standardSQL
CREATE TEMP FUNCTION MAP(expr ANY TYPE, map ANY TYPE, `default` ANY TYPE ) AS (
IFNULL((
SELECT result
FROM (SELECT NULL AS search, NULL AS result UNION ALL SELECT * FROM UNNEST(map))
WHERE search = expr), `default`)
);
WITH `project.dataset.table` AS (
SELECT 1 id, 4 location_id UNION ALL
SELECT 2, 2 UNION ALL
SELECT 3, 5
)
SELECT id, location_id,
MAP(location_id,
[ (1, 'Los Angeles'),
(2, 'San Francisco'),
(3, 'New York'),
(4, 'Seattle')
], 'Non US') AS `Location`
FROM `project.dataset.table`
with result
Row id location_id Location
1 1 4 Seattle
2 2 2 San Francisco
3 3 5 Non US
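For reference, this mapping-with-default pattern is essentially dict.get in Python (illustrative analogy only):

```python
# same mapping as the ARRAY<STRUCT> argument, with 'Non US' as the default
location_map = {1: 'Los Angeles', 2: 'San Francisco',
                3: 'New York', 4: 'Seattle'}

for loc_id in (4, 2, 5):
    print(loc_id, location_map.get(loc_id, 'Non US'))
# 4 Seattle
# 2 San Francisco
# 5 Non US
```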