How to get count of matches in field of table for list of phrases from another table in bigquery? - google-bigquery

Given an arbitrary list of phrases phrase1, phrase2*, ... phraseN (say these are in another table Phrase_Table), how would one get the count of matches for each phrase in a field F in a bigquery table?
Here, "*" means there must be some non-empty/non-blank string after the phrase.
Lets say you have a table with and ID field and two string fields Field1, Field2
Output would look something like
id, CountOfPhrase1InField1, CountOfPhrase2InField1, CountOfPhrase1InField2, CountOfPhrase2InField2
or I guess instead of all of those output fields maybe there's a single json object field
id, [{"fieldName": Field1, "counts": {phrase1: m, phrase2: mm, ...},
{"fieldName": Field2, "counts": {phrase1: m2, phrase2: mm2, ...},...]
Thanks!

Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'foo1 foo foo40' str UNION ALL
SELECT 'test1 test test2 test'
), `project.dataset.keywords` AS (
SELECT 'foo' key UNION ALL
SELECT 'test'
)
SELECT str, ARRAY_AGG(STRUCT(key, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) as matches)) all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str
with result
Row str all_matches.key all_matches.matches
1 foo1 foo foo40 foo 2
test 0
2 test1 test test2 test foo 0
test 2
If you prefer output as json you can add TO_JSON_STRING() as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'foo1 foo foo40' str UNION ALL
SELECT 'test1 test test2 test'
), `project.dataset.keywords` AS (
SELECT 'foo' key UNION ALL
SELECT 'test'
)
SELECT str, TO_JSON_STRING(ARRAY_AGG(STRUCT(key, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) as matches))) all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str
with output
Row str all_matches
1 foo1 foo foo40 [{"key":"foo","matches":2},{"key":"test","matches":0}]
2 test1 test test2 test [{"key":"foo","matches":0},{"key":"test","matches":2}]
there are endless ways of presenting outputs like above - hope you will adjust it to whatever exactly you need :o)

Related

Put strings inside quotes with string_to_array()

I am using the following query:
WITH a as (SELECT unnest(string_to_array(animals, ',')) as "pets" FROM all_animals where id = 100)
select * from a
which returns the following data:
1 Cat
2 Dog
3 Bird
My question is, how can I format my string_to_array select above to include single quotes for the returned data to look like this:
1 'Cat'
2 'Dog'
3 'Bird'
Use quote_literal() to safely single-quote strings:
WITH a AS (
SELECT unnest(string_to_array(animals, ',')) AS pets
FROM all_animals
WHERE id = 100
)
SELECT quote_literal(pets) AS pets
FROM a;
Or shorter without the CTE:
SELECT quote_literal(unnest(string_to_array(animals, ','))) AS pets
FROM all_animals
WHERE id = 100;
db<>fiddle here

In BigQuery, how to filter on json for rows where sum of some subelements is positive?

Following up on How to get count of matches in field of table for list of phrases from another table in bigquery?
Where you end up with something like:
Row str all_matches
1 foo1 foo foo40 [{"key":"foo","matches":2},{"key":"test","matches":0}]
2 test1 test test2 test [{"key":"foo","matches":0},{"key":"test","matches":2}]
How could you further filter on those rows for which sum(matches over all keys) > 0 with StandardSQL?
To keep it simple - just add below line to the end of referenced query
HAVING SUM(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]')))) > 0
So, the final query (BigQuery Standard SQL) will be
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'foo1 foo foo40' str UNION ALL
SELECT 'test1 test test2 test' UNION ALL
SELECT 'abc xyz'
), `project.dataset.keywords` AS (
SELECT 'foo' key UNION ALL
SELECT 'test'
)
SELECT str,
TO_JSON_STRING(ARRAY_AGG(STRUCT(key, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) AS matches))) all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str
HAVING SUM(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]')))) > 0
with result
Row str all_matches
1 foo1 foo foo40 [{"key":"foo","matches":2},{"key":"test","matches":0}]
2 test1 test test2 test [{"key":"foo","matches":0},{"key":"test","matches":2}]
Note: I added one more row into dummy data and it is filtered out from output because there is no matches at all in that row

Array column and aggregation in BigQuery SQL: Why the values are not all aggregated?

I've executed the code below in BigQuery
SELECT ( --inner query
SELECT STRING_AGG(c) FROM t1.array_column c
)
FROM (
select 1 as f1, ['1','2','3'] as array_column
union all
select 2 as f1, ['5','6','7'] as array_column
) t1;
I expected something like
Row|f0_
1 | 1,2,3,4,5,6,7
because there is no GROUP BY in the inner query. So, I'm expecting STRING_AGG to be evaluated on all the lines.
SELECT STRING_AGG(c) FROM t1.array_column c
Instead I'm getting something like this:
Row|f0_
1 |1,2,3
2 |5,6,7
I'm having troubles understand why I have this result
This is your query:
SELECT (SELECT STRING_AGG(c) FROM t1.array_column c
)
FROM (select 1 as f1, ['1', '2', '3'] as array_column
union all
select 2 as f1, ['5', '6', '7'] as array_column
) t1;
First, I'm surprised it works. I thought you needed unnest():
SELECT (SELECT STRING_AGG(c) FROM UNNEST(t1.array_column) c
)
What is happening? Well, this would be more obvious if you selected f1. Then you would get:
1 1,2,3
2 5,6,7
This should make it more clear. For each row in t1 (and there are two rows), your code is:
unnesting the array into rows with a column called c.
reaggregating those rows into a string (with no name)
If you want to combine the elements in the arrays, use array_concat_agg():
SELECT array_concat_agg(array_column)
FROM (select 1 as f1, ['1','2','3'] as array_column
union all
select 2 as f1, ['5','6','7'] as array_column
) t1;
If you want this represented as a string instead of an array, use array_to_string():
SELECT array_to_string(array_concat_agg(array_column), ',')
FROM (select 1 as f1, ['1','2','3'] as array_column
union all
select 2 as f1, ['5','6','7'] as array_column
) t1;
Below is for BigQuery Standard SQL
#standardSQL
SELECT STRING_AGG((SELECT STRING_AGG(c) FROM t1.array_column c))
FROM (
SELECT 1 AS f1, ['1','2','3'] AS array_column UNION ALL
SELECT 2 AS f1, ['5','6','7'] AS array_column
) t1
and produces
Row f0_
1 1,2,3,5,6,7
Note 1: you were almost there - you were just missing extra STRING_AGG that does final grouping of strings created off of respective array in each row
Note 2: because array_column is of ARRAY type it is treated as inner table referenced as t1.array_column as as such - FROM t1.array_column c is equivalent to FROM UNNEST(array_column) c - very cool hidden feature :o)

Extracting Key-worlds out of string and show them in another column

I need to write a query to extract specific names out of String and have them show in another column for example a column has this field
Column:
Row 1: jasdhj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj
Row 2:jasnasc8918212,ahsahkdjjMina67,
Row 3:kasdhakshd,asda,asdasd,121,121,Sina878788kasas
Key Words: Amin,Mina,Sina
How could I have these key words in another column? I dont want to insert another column but if that's the only solution let me know.
Any help appreciated!
Below is for BigQuery Standard SQL
#standardSQL
WITH keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
You can test, play with above using dummy data from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
), keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
with results as
Row str keywords_in_str
1 jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj Amin,Mina
2 jasnasc8918212,ahsahkdjjMina67, Mina
3 kasdhakshd,asda,asdasd,121,121,Sina878788kasas Sina
to count the no of keywords
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
)
select str,array(select as struct countif(lower(x) ="amin") amin,countif(lower(x) ="mina") mina,countif(lower(x)="sina") sina from unnest(x)x)keyword from
(select str,regexp_extract_all(str,"(?i)(Amin|Mina|Sina)")x from `project.dataset.table`)

How do I combine 2 records with a single field into 1 row with 2 fields (Oracle 11g)?

Here's a sample data
record1: field1 = test2
record2: field1 = test3
The actual output I want is
record1: field1 = test2 | field2 = test3
I've looked around the net but can't find what I'm looking for. I can use a custom function to get it in this format but I'm trying to see if there's a way to make it work without resorting to that.
thanks a lot
You need to use pivot:
with t(id, d) as (
select 1, 'field1 = test2' from dual union all
select 2, 'field1 = test3' from dual
)
select *
from t
pivot (max (d) for id in (1, 2))
If you don't have the id field you can generate it, but you will have XML type:
with t(d) as (
select 'field1 = test2' from dual union all
select 'field1 = test3' from dual
), t1(id, d) as (
select ROW_NUMBER() OVER(ORDER BY d), d from t
)
select *
from t1
pivot xml (max (d) for id in (select id from t1))
There are several ways to approach this - google pivot rows to columns. Here is one set of answers: http://www.dba-oracle.com/t_converting_rows_columns.htm