BigQuery - Count how many words in array are equal

I want to count how many matching words two paths have in common (each path will be split at the delimiter /) and return a corresponding array of integers.
Input data will be something like:
I want to add another column, match_count, with an array of integers. For example:
To replicate this case, this is the query I'm working with:
CREATE TEMP FUNCTION HOW_MANY_MATCHES_IN_PATH(src_path ARRAY<STRING>, test_path ARRAY<STRING>) RETURNS ARRAY<INTEGER> AS (
-- WHAT DO I PUT HERE?
);
SELECT
*,
HOW_MANY_MATCHES_IN_PATH(src_path, test_path) as dir_path_match_count
FROM (
SELECT
ARRAY_AGG(x) AS src_path,
ARRAY_AGG(y) as test_path
FROM
UNNEST([
'lib/client/core.js',
'lib/server/core.js'
]) AS x, UNNEST([
'test/server/core.js'
]) as y
)
I've tried working with ARRAY and UNNEST in the HOW_MANY_MATCHES_IN_PATH function, but I either end up with an error or with an array of 4 items (in this example).

Consider the approach below:
create temp function how_many_matches_in_path(src_path string, test_path string) returns integer as (
(select count(distinct src)
from unnest(split(src_path, '/')) src,
unnest(split(test_path, '/')) test
where src = test)
);
select *,
array( select how_many_matches_in_path(src, test)
from t.src_path src with offset
join t.test_path test with offset
using(offset)
) dir_path_match_count
from your_table t
If applied to the sample input data in your question:
with your_table as (
select
['lib/client/core.js', 'lib/server/core.js'] src_path,
['test/server/core.js', 'test/server/core.js'] test_path
)
output is
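For intuition, the per-element logic of the UDF can be sketched outside SQL. Below is a minimal Python equivalent (the Python names are illustrative, not part of the query), pairing the two arrays by offset and counting distinct shared segments:

```python
def how_many_matches_in_path(src_path, test_path):
    # Count distinct segments of src_path that also appear in test_path,
    # mirroring COUNT(DISTINCT src) ... WHERE src = test in the SQL UDF.
    return len(set(src_path.split('/')) & set(test_path.split('/')))

src_path = ['lib/client/core.js', 'lib/server/core.js']
test_path = ['test/server/core.js', 'test/server/core.js']

# Pair the arrays element by element, as the WITH OFFSET ... USING(offset) join does.
match_counts = [how_many_matches_in_path(s, t) for s, t in zip(src_path, test_path)]
```

For the sample arrays this yields [1, 2]: the first pair shares only core.js, the second shares both server and core.js.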

Related

BigQuery: joining a field and unnested JSON returns only the left-hand side table

I need help getting a SELECT statement that returns both the normal text records and the unnested JSON answers.
I am getting only the left-hand normal text records. Am I missing something?
#standardSQL
CREATE TEMP FUNCTION jsonunnest(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(input).map(j=>JSON.stringify(j));
""";
WITH `Impact_JSON` AS (
SELECT
Impact_Question_id,
Impact_Question_text, json,
ROW_NUMBER() OVER (PARTITION BY bdmp_id, DATE(Impact_Question_aktualisiert_am_ts)
ORDER BY
Impact_Question_aktualisiert_am_ts DESC) AS ROW
FROM
`<project.dataset.table>` basetable
),
json_answers AS (
SELECT
regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(1)], "[^0-9]+"," " ) AS Interview_ID,
regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(3)], "[^0-9]+"," " ) AS Quest_ID,
STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')), ',\n')
AS Impact_antwort_id,
STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_daten_typ')),',\n')
AS Impact_reply_data_type,
IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_topic_text'), 'Empty') AS Impact_topic_text,
IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_reply_get'), 'Empty') AS Impact_reply_get,
FROM `Impact_JSON`,
UNNEST(jsonunnest(JSON_EXTRACT(json, '$.reply'))) Impact
GROUP by 5,6
),
Impact_Question_id_TBL AS (
select Impact_Question_id from `Impact_JSON` AS C
)
SELECT
Impact_Question_id
FROM
`Impact_JSON` AS T
left join
json_answers as J
ON
(SAFE_CAST(J.Interview_ID as INT64))
=
T.Impact_Question_id
Both the left-hand side records and the right-hand side records should be captured in the same table.
What do you expect as output?
For me the UDF did not work, so I generated some sample data and shortened your query for testing. (Sample data like this would have been a good starting point for the question!)
Also, I changed the join to a full outer join.
For each record there is a number in the column Impact_Question_id, and a json column containing a nested array data structure. The JSON data is unnested and grouped by the column Impact_reply_get. The first part of the column Impact_antwort_id is extracted and named Interview_ID. Because of the grouping, ANY_VALUE picks an arbitrary value. You then join on this value to the column Impact_Question_id of the master table.
Picking an arbitrary value of the column Impact_antwort_id (Interview_ID) via ANY_VALUE could be the cause of the mismatch. I would group by this value as well and accept the risk of duplicate matches.
#standardSQL
CREATE TEMP FUNCTION jsonunnest(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try{
return [].concat(JSON.parse(input) || [] ).map(j=>JSON.stringify(j));
} catch(e) { return ["no", input] } // return the raw input; re-parsing here would throw again
""";
WITH
basetable as (Select row_number() over () Impact_Question_id,
"txt" as Impact_Question_text,
json
#1 as bdmp_id,
#current_date() Impact_Question_aktualisiert_am_ts,
from unnest([ '{"reply":[{"Impact_antwort_id":"anytext_2","Impact_reply_get":"ok test 2"}]}','{"reply":[{"Impact_antwort_id":"anytext_3"}]}']) as json
),
`Impact_JSON` AS (
SELECT
Impact_Question_id,
Impact_Question_text, json,
#ROW_NUMBER() OVER (PARTITION BY bdmp_id, DATE(Impact_Question_aktualisiert_am_ts)
# ORDER BY
# Impact_Question_aktualisiert_am_ts DESC) AS ROW
FROM
basetable
),
json_answers AS (
SELECT
regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(1)], "[^0-9]+"," " ) AS Interview_ID,
# regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(3)], "[^0-9]+"," " ) AS Quest_ID,
# STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')), ',\n')
# AS Impact_antwort_id,
#STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_daten_typ')),',\n')
# AS Impact_reply_data_type,
# IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_topic_text'), 'Empty') AS Impact_topic_text,
IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_reply_get'), 'Empty') AS Impact_reply_get,
string_agg(Impact) as impact_parsed_json,
FROM `Impact_JSON`,
UNNEST(jsonunnest(JSON_EXTRACT(json, '$.reply'))) Impact
GROUP by 2 #5,6
)
SELECT
*
FROM
`Impact_JSON` AS T
full join
json_answers as J
ON
(SAFE_CAST(J.Interview_ID as INT64))
=
T.Impact_Question_id
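To trace why the second sample row joins on Impact_Question_id = 2, the extraction pipeline can be reproduced in a small Python sketch (a hand-rolled equivalent of the JS UDF plus the SPLIT/REGEXP_REPLACE chain; the names here are illustrative):

```python
import json
import re

def jsonunnest(input_str):
    # Mirror the JS UDF: parse a JSON array and re-serialize each element as a string.
    try:
        parsed = json.loads(input_str) or []
        if not isinstance(parsed, list):
            parsed = [parsed]
        return [json.dumps(j) for j in parsed]
    except ValueError:
        return ["no", input_str]

row_json = '{"reply":[{"Impact_antwort_id":"anytext_2","Impact_reply_get":"ok test 2"}]}'
reply = json.dumps(json.loads(row_json)['reply'])  # like JSON_EXTRACT(json, '$.reply')

impact = json.loads(jsonunnest(reply)[0])
# Interview_ID as the query builds it: split on '_', take offset 1,
# replace non-digit runs with a space.
interview_id = re.sub('[^0-9]+', ' ', impact['Impact_antwort_id'].split('_')[1])
```

Here interview_id ends up as '2', which SAFE_CAST turns into the integer 2 for the join.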

Is there something like Spark's unionByName in BigQuery?

I'd like to concatenate tables with different schemas, filling unknown values with null.
Simply using UNION ALL of course does not work like this:
WITH
x AS (SELECT 1 AS a, 2 AS b ),
y AS (SELECT 3 AS b, 4 AS c )
SELECT * FROM x
UNION ALL
SELECT * FROM y
a b
1 2
3 4
(unwanted result)
In Spark, I'd use unionByName to get the following result:
a    b    c
1    2    null
null 3    4
(wanted result)
Of course, I can manually create the needed query (adding NULLs) in BigQuery like so:
SELECT a, b, NULL c FROM x
UNION ALL
SELECT NULL a, b, c FROM y
But I'd prefer to have a generic solution, not requiring me to generate something like that.
So, is there something like unionByName in BigQuery? Or can one come up with a generic SQL function for this?
Consider below approach (I think it is as generic as one can get)
create temp function json_extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function json_extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp table temp_table as (
select json, key, value
from (
select to_json_string(t) json from table_x as t
union all
select to_json_string(t) from table_y as t
) t, unnest(json_extract_keys(json)) key with offset
join unnest(json_extract_values(json)) value with offset
using(offset)
order by key
);
execute immediate(select '''
select * except(json) from temp_table
pivot (any_value(value) for key in ("''' || string_agg(distinct key, '","') || '"))'
from temp_table
)
If applied to the sample data in your question, the output is:
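For intuition, the union-by-name idea behind the JSON key/value pivot trick can be sketched in Python, treating rows as dicts (a simplified model of the behavior, not BigQuery semantics):

```python
def union_by_name(*tables):
    # Collect every column name seen across all tables, in first-seen order
    # (analogous to gathering the distinct JSON keys).
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Emit each row with missing columns filled with None (SQL NULL),
    # as the pivot does for keys absent from a given row's JSON.
    return [{c: row.get(c) for c in columns} for table in tables for row in table]

x = [{'a': 1, 'b': 2}]
y = [{'b': 3, 'c': 4}]
result = union_by_name(x, y)
```

This produces the wanted result: row one gets c = None, row two gets a = None.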

ARRAY_CONCAT() returns empty array

I am compiling a list of values per users from 2 different columns into a single array like:
with test as (
select 1 as userId, 'something' as value1, cast(null as string) as value2
union all
select 1 as userId, cast(null as string), cast(null as string)
)
select
userId,
ARRAY_CONCAT(
ARRAY_AGG(distinct value1 ignore nulls ),
ARRAY_AGG(distinct value2 ignore nulls )
) as combo,
from test
group by userId
Everything works fine up until ARRAY_AGG(), but then ARRAY_CONCAT() just won't have it and returns an empty array [], whereas I expect it to be ['something'].
I am at loss as to why this is happening and whether I can force a workaround here.
I am at loss as to why this is happening ...
The ARRAY_CONCAT function returns NULL if any input argument is NULL.
... and whether I can force a workaround here
Use below workaround
select
userid,
array_concat(
ifnull(array_agg(distinct value1 ignore nulls ), []),
ifnull(array_agg(distinct value2 ignore nulls ), [])
) as combo,
from test
group by userid
If applied to the sample data in your question, the output is:
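The NULL-propagation semantics and the IFNULL guard can be modeled in a few lines of Python, with None standing in for SQL NULL (a sketch of the behavior, not the BigQuery implementation):

```python
def array_concat(*arrays):
    # Like BigQuery's ARRAY_CONCAT: a NULL (None) anywhere makes the result NULL.
    if any(a is None for a in arrays):
        return None
    return [x for a in arrays for x in a]

def ifnull(value, default):
    return default if value is None else value

# ARRAY_AGG(DISTINCT value2 IGNORE NULLS) over only NULLs yields NULL,
# so the unguarded concat collapses to NULL:
unguarded = array_concat(['something'], None)

# The workaround substitutes an empty array before concatenating:
combo = array_concat(ifnull(['something'], []), ifnull(None, []))
```

Here unguarded is None while combo is ['something'], matching the observed behavior and the fix.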

Split string in bigquery

I've the following string that I would like to split and given in rows.
Example values in my column are:
['10000', '10001', '10002', '10003', '10004']
Using the SPLIT function, I get the following result:
I've two questions:
How do I split it so that I get '10000' instead of ['10000'?
How do I remove the apostrophes (')?
Consider below example
with t as (
select ['10000', '10001', '10002', '10003', '10004'] col
)
select cast(item as int64) num
from t, unnest(col) item
The above assumes that col is an array. If it is a string, use the query below:
with t as (
select "['10000', '10001', '10002', '10003', '10004']" col
)
select cast(trim(item, " '[]") as int64) num
from t, unnest(split(col)) item
Both queries produce the output:
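The string branch relies on TRIM's character-set form, which strips any of the listed characters from both ends of each piece. Python's str.strip behaves the same way, so the transformation can be checked locally:

```python
col = "['10000', '10001', '10002', '10003', '10004']"

# SPLIT(col) splits on ',' by default; TRIM(item, " '[]") strips any of the
# characters space, apostrophe, '[' and ']' from both ends of each piece.
nums = [int(item.strip(" '[]")) for item in col.split(',')]
```

This yields the five integers 10000 through 10004, one per row in the SQL version.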

Bigquery array of STRINGs to array of INTs

I'm trying to pull an array of INT64s in BigQuery standard SQL from a column which is a long string of numbers separated by commas (for example, 2013,1625,1297,7634). I can pull an array of strings easily with:
SELECT
SPLIT(string_col,",")
FROM
table
However, I want to return an array of INT64s, not an array of strings. How can I do that? I've tried
CAST(SPLIT(string_col,",") AS ARRAY<INT64>)
but that doesn't work.
Below is for BigQuery Standard SQL
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, '2013,1625,1297,7634' AS string_col UNION ALL
SELECT 2, '1,2,3,4,5'
)
SELECT id,
(SELECT ARRAY_AGG(CAST(num AS INT64))
FROM UNNEST(SPLIT(string_col)) AS num
) AS num,
ARRAY(SELECT CAST(num AS INT64)
FROM UNNEST(SPLIT(string_col)) AS num
) AS num_2
FROM yourTable
Mikhail beat me to it and his answer is more extensive, but I'm adding this as a more minimal repro:
SELECT CAST(num as INT64) from unnest(SPLIT("2013,1625,1297,7634",",")) as num;
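The same split-then-cast pattern, checked quickly in Python (an equivalent of the SQL per-element CAST, not BigQuery itself):

```python
string_col = "2013,1625,1297,7634"

# SPLIT(string_col, ',') then CAST(num AS INT64) applied to each element.
nums = [int(num) for num in string_col.split(',')]
```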