Extracting keywords out of a string and showing them in another column - google-bigquery

I need to write a query that extracts specific names from a string and shows them in another column. For example, a column has these values:
Column:
Row 1: jasdhj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj
Row 2: jasnasc8918212,ahsahkdjjMina67,
Row 3:kasdhakshd,asda,asdasd,121,121,Sina878788kasas
Keywords: Amin,Mina,Sina
How can I show these keywords in another column? I don't want to insert another column, but if that's the only solution, let me know.
Any help appreciated!

Below is for BigQuery Standard SQL
#standardSQL
WITH keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
You can test and play with the above using dummy data from your question, as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
), keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
with the result:
Row str keywords_in_str
1 jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj Amin,Mina
2 jasnasc8918212,ahsahkdjjMina67, Mina
3 kasdhakshd,asda,asdasd,121,121,Sina878788kasas Sina
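For anyone who wants to check the matching logic outside BigQuery, here is a minimal Python sketch of what the query above does: pair every row with every keyword, keep the case-insensitive matches, and join them per row. The function name `keywords_in_rows` is made up for this illustration. Unlike STRING_AGG without ORDER BY, the sketch always lists keywords in the order they were given.

```python
import re

def keywords_in_rows(rows, keywords):
    """Rough Python equivalent of CROSS JOIN + REGEXP_CONTAINS + STRING_AGG.

    Hypothetical helper, not part of any BigQuery client library.
    """
    result = {}
    for row in rows:
        # Case-insensitive search, like the (?i) prefix in the query.
        # re.escape is a precaution; the sample keywords are plain letters.
        found = [k for k in keywords
                 if re.search(re.escape(k), row, re.IGNORECASE)]
        if found:  # rows without any match are dropped, as the WHERE clause does
            result[row] = ",".join(found)
    return result

rows = [
    "jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj",
    "jasnasc8918212,ahsahkdjjMina67,",
    "kasdhakshd,asda,asdasd,121,121,Sina878788kasas",
]
matches = keywords_in_rows(rows, ["Amin", "Mina", "Sina"])
for row, found in matches.items():
    print(row, found)
```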

To count the number of occurrences of each keyword:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
)
SELECT str, ARRAY(
SELECT AS STRUCT
COUNTIF(LOWER(x) = "amin") amin,
COUNTIF(LOWER(x) = "mina") mina,
COUNTIF(LOWER(x) = "sina") sina
FROM UNNEST(x) x
) keyword
FROM (
SELECT str, REGEXP_EXTRACT_ALL(str, "(?i)(Amin|Mina|Sina)") x
FROM `project.dataset.table`
)
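The counting step can be sketched in Python as well, assuming the same idea of extracting all case-insensitive matches and then tallying them per keyword. `count_keywords` is a hypothetical name. As with REGEXP_EXTRACT_ALL, matches are non-overlapping, so a string like "Amina" would count only as "amin".

```python
import re
from collections import Counter

def count_keywords(text, keywords):
    """Count case-insensitive, non-overlapping keyword hits in one string."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Normalise every hit to lower case before counting, like LOWER(x) in the query.
    counts = Counter(hit.lower() for hit in pattern.findall(text))
    # Report a zero for keywords that never appear.
    return {k.lower(): counts.get(k.lower(), 0) for k in keywords}

sample = "jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj"
print(count_keywords(sample, ["Amin", "Mina", "Sina"]))
```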

Related

How to find and extract substring BIGQUERY

I have a string column in a BigQuery table, for example:
name
WW_for_all_feed
EU_param_1_for_all_feed
AU_for_all_full_settings_18+
WW_for_us_param_5_for_us_feed
WW_for_us_param_5_feed
WW_for_all_25+
and I also have a list of variables, for example:
param_1_for_all
param_5_for_us
param_5
full_settings
If a string in the "name" column contains one of these substrings, I need to extract it:
name param
WW_for_all_feed None
EU_param_1_for_all_feed param_1_for_all
AU_for_all_full_settings_18+ full_settings
WW_for_us_param_5_for_us_feed param_5_for_us
WW_for_us_param_5_feed param_5
WW_for_all_25+ None
I wanted to try regexp and replace, but I don't know the pattern for finding the substring.
Use below
select name, param
from your_table
left join params
on regexp_contains(name, param)
If applied to the sample data in your question:
with your_table as (
select 'WW_for_all_feed' name union all
select 'EU_param_1_for_all_feed' union all
select 'AU_for_all_full_settings_18+' union all
select 'WW_for_us_param_5_for_us_feed' union all
select 'WW_for_all_25+'
), params as (
select 'param_1_for_all' param union all
select 'param_5_for_us' union all
select 'full_settings'
)
output is
But I have another issue (updated question): what if one of the params is a substring of another?
Use below then
select name, string_agg(param order by length(param) desc limit 1) param
from your_table
left join params
on regexp_contains(name, param)
group by name
Applied to your updated sample data, this returns the longest matching param for each name (NULL when nothing matches).
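As a quick cross-check outside BigQuery, the "prefer the longest matching param" rule can be sketched in Python. `longest_param` is a hypothetical helper, and it uses a plain substring test instead of a regex, which is equivalent here only because the sample params contain no regex metacharacters.

```python
def longest_param(name, params):
    """Return the longest param contained in name, or None if none match."""
    matches = [p for p in params if p in name]
    # Equivalent in spirit to ORDER BY LENGTH(param) DESC LIMIT 1.
    return max(matches, key=len) if matches else None

params = ["param_1_for_all", "param_5_for_us", "param_5", "full_settings"]
names = [
    "WW_for_all_feed",
    "EU_param_1_for_all_feed",
    "AU_for_all_full_settings_18+",
    "WW_for_us_param_5_for_us_feed",
    "WW_for_us_param_5_feed",
    "WW_for_all_25+",
]
for name in names:
    print(name, longest_param(name, params))
```

On the sample names this reproduces the param column requested in the question, with None for names that contain no param.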

How to select columns of data in BigQuery that has all non-NULL values

I found this question on here: How to select columns of data in BigQuery that has all NULL values
but I would like to do the opposite and find all the columns with non-null values. How would I flip this previous solution to accomplish the opposite? I am not that familiar with regexp syntax and I couldn't figure out a solution trying to research this online.
Thank you for your help in advance.
The script from "How to select columns of data in BigQuery that has all NULL values" can be modified as follows:
WITH `project.dataset.table` AS (
SELECT 77 A, 1 B, NULL C UNION ALL
SELECT NULL, 6, NULL UNION ALL
SELECT NULL, 2, NULL UNION ALL
SELECT NULL, 3, NULL
)
SELECT all_column, count(null_column) as count_null, count(1) as total_rows
FROM `project.dataset.table` AS t,
UNNEST(REGEXP_EXTRACT_ALL(
TO_JSON_STRING(t),
r'\"([a-zA-Z0-9\_]+)\":')
) AS all_column
left join UNNEST(REGEXP_EXTRACT_ALL(
TO_JSON_STRING(t),
r'\"([a-zA-Z0-9\_]+)\":null')
) AS null_column
on null_column=all_column
GROUP BY 1
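HAVING count(null_column) = 0
TO_JSON_STRING converts each row to a JSON string of "column_name":value pairs.
The first REGEXP_EXTRACT_ALL( ... , r'\"([a-zA-Z0-9\_]+)\":') extracts every column name from that string.
The second REGEXP_EXTRACT_ALL, whose pattern ends in :null, extracts a column name only if the value is null.
The HAVING clause then keeps only the columns whose null count is zero, i.e. the columns with no NULL values.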
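The underlying idea (count the NULLs per column and keep the columns where that count is zero) can be sketched in plain Python. This is only an illustration on the sample rows above, with None standing in for NULL; `non_null_columns` is a hypothetical name.

```python
def non_null_columns(rows):
    """Return the columns that never hold None in any row."""
    null_counts = {}
    for row in rows:
        for column, value in row.items():
            # Register every column, and count how often it is None.
            null_counts.setdefault(column, 0)
            if value is None:
                null_counts[column] += 1
    return [column for column, nulls in null_counts.items() if nulls == 0]

table = [
    {"A": 77, "B": 1, "C": None},
    {"A": None, "B": 6, "C": None},
    {"A": None, "B": 2, "C": None},
    {"A": None, "B": 3, "C": None},
]
print(non_null_columns(table))
```

On this sample only B qualifies: A is NULL in three rows and C is NULL in all of them.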

How to search for multiple matches using the IN operator in BigQuery?

Right now I am filtering my rows by using the WHERE operator and 2 conditional statements. It seems somewhat inefficient that I am writing 2 conditions. Would it be possible to check if "amznbida" and "ksga" are in the array by only writing in one statement?
#standardSQL
-- Get all the keys
SELECT
*
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE
"amznbida" IN UNNEST(ARRAY(SELECT name FROM UNNEST(keywords)))
AND
"ksga" IN UNNEST(ARRAY(SELECT name FROM UNNEST(keywords)))
Just remove the UNNEST(ARRAY( part and leave the subquery - you should be fine.
working example:
SELECT
*,
t in (select * from unnest(a)) condition
FROM unnest([
struct('a' as t, ['a', 'b', 'c'] as a),
('b',['r', 'f'])
])
Below is for BigQuery Standard SQL
#standardSQL
SELECT *
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE 2 = (SELECT COUNT(DISTINCT name) FROM UNNEST(keywords) WHERE name IN ("amznbida", "ksga"))
You can test and play with the above using dummy data as below
#standardSQL
WITH `encoded-victory-198215.DFP_TEST.test3` AS (
SELECT
ARRAY<STRUCT<value ARRAY<STRING>, name STRING>>[
STRUCT(['ksg-1', 'ksg-2'], 'ksga'), STRUCT(['amznbid-1', 'amznbid-2'], 'amznbida')
] keywords,
1 impression UNION ALL
SELECT
ARRAY<STRUCT<value ARRAY<STRING>, name STRING>>[
STRUCT(['xxx-1', 'xxx-2'], 'xxxa'), STRUCT(['amznbid-1', 'amznbid-2'], 'amznbida')
] keywords,
2 impression
)
SELECT *
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE 2 = (SELECT COUNT(DISTINCT name) FROM UNNEST(keywords) WHERE name IN ("amznbida", "ksga"))
with result
Row keywords.value keywords.name impression
1 ksg-1 ksga 1
ksg-2
amznbid-1 amznbida
amznbid-2
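The `2 = COUNT(DISTINCT name) ... WHERE name IN (...)` condition is a subset check in disguise: a row passes only if every required name appears among its keywords. A minimal Python sketch of that check on the dummy data follows; `has_all_names` is a hypothetical helper.

```python
def has_all_names(keywords, required):
    """True if every required name occurs among the keyword structs."""
    names = {keyword["name"] for keyword in keywords}
    return set(required) <= names

rows = [
    {"keywords": [{"value": ["ksg-1", "ksg-2"], "name": "ksga"},
                  {"value": ["amznbid-1", "amznbid-2"], "name": "amznbida"}],
     "impression": 1},
    {"keywords": [{"value": ["xxx-1", "xxx-2"], "name": "xxxa"},
                  {"value": ["amznbid-1", "amznbid-2"], "name": "amznbida"}],
     "impression": 2},
]
kept = [row["impression"] for row in rows
        if has_all_names(row["keywords"], ["amznbida", "ksga"])]
print(kept)
```

Only the row with impression 1 is kept, which matches the query result shown above.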

How to get yes/no statistics from SQL of how often strings occur each

Is there a way to query a table from BigQuery project HTTPArchive by checking how often certain strings occur by a certain file type?
I was able to write a query for a single check but how to perform this query on multiple strings at once without needing to send the same query every time just with a different string check and process the ~800GB of table data every time?
Getting the results as array might work somehow? I want to publish in-depth monthly statistics to the public for free so the option to send those queries separately and get billed for querying of roughly $2000/month is no option for me as a student.
SELECT matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (
SELECT url, (LOWER(body) CONTAINS 'document.write') AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies
WHERE url LIKE "%.js"
)
GROUP BY matched
Please note that this is just one example of many (~50) and the pre-generated stats are not what I am looking for as it doesn't contain the needed information.
Below is for BigQuery Standard SQL
#standardSQL
WITH strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php']) AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
-- ORDER BY ext, str
You can test / play with above using [totally] dummy data as below
#standardSQL
WITH `httparchive.har.2017_09_01_chrome_requests_bodies` AS (
SELECT '1234.js' AS url, 'abc=1;x=2' AS body UNION ALL
SELECT 'qaz.js', 'y=1;xyz=0' UNION ALL
SELECT 'edc.go', 's=1;xyz=2;abc=3' UNION ALL
SELECT 'edc.go', 's=1;xyz=4;abc=5' UNION ALL
SELECT 'rfv.php', 'd=1' UNION ALL
SELECT 'tgb.txt', '?abc=xyz' UNION ALL
SELECT 'yhn.php', 'like v' UNION ALL
SELECT 'ujm.go', 'lkjsad' UNION ALL
SELECT 'ujm.go', 'yhj' UNION ALL
SELECT 'ujm.go', 'dfgh' UNION ALL
SELECT 'ikl.js', 'werwer'
), strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php']) AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
ORDER BY ext, str
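The point of the query is that the table is read once while every (extension, string) pair is tallied. Here is a small Python sketch of that single pass over the dummy data. `match_stats` is a hypothetical name, and it uses a plain substring test where the query uses REGEXP_CONTAINS, so the two agree only for search strings without regex metacharacters.

```python
from collections import defaultdict

def match_stats(requests, strings, extensions):
    """One pass over (url, body) pairs; returns {(ext, str): (total, matches, ratio)}."""
    strings = [s.lower() for s in strings]
    extensions = [e.lower() for e in extensions]
    totals = defaultdict(int)
    matches = defaultdict(int)
    for url, body in requests:
        body_lower = body.lower()
        for ext in extensions:
            # Same test as LOWER(url) LIKE CONCAT('%.', ext).
            if not url.lower().endswith("." + ext):
                continue
            for s in strings:
                totals[(ext, s)] += 1
                if s in body_lower:
                    matches[(ext, s)] += 1
    return {key: (totals[key], matches[key], round(matches[key] / totals[key], 3))
            for key in sorted(totals)}

requests = [
    ("1234.js", "abc=1;x=2"), ("qaz.js", "y=1;xyz=0"),
    ("edc.go", "s=1;xyz=2;abc=3"), ("edc.go", "s=1;xyz=4;abc=5"),
    ("rfv.php", "d=1"), ("tgb.txt", "?abc=xyz"), ("yhn.php", "like v"),
    ("ujm.go", "lkjsad"), ("ujm.go", "yhj"), ("ujm.go", "dfgh"),
    ("ikl.js", "werwer"),
]
stats = match_stats(requests, ["abc", "XYZ"], ["JS", "go", "php"])
for key, value in stats.items():
    print(key, value)
```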
One method is to bring in a table with the different strings. This is the idea:
SELECT str, matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (SELECT crb.url, s.str, (LOWER(crb.body) CONTAINS s.str) AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies crb CROSS JOIN
(SELECT 'document.write' as str UNION ALL
SELECT 'xxx' as str
) s
WHERE url LIKE "%.js"
)
GROUP BY str, matched;
You would just add more strings to s.

SQL - Query column that does not exist

I have the following query where I am querying ISIN field.
SELECT Isin FROM FundPriceDetails
WHERE Isin IN
(
'ES06139009N6' , 'MAD',
'GB0002634946' , 'LSE',
'SG1L01001701' , 'SGX'
)
The second column does not exist but I wish to show it against ISIN values without inserting the row in my select query
How do I go about doing it? At the moment I have only ISIN in my select statement. I need to create an anonymous column that contains the next column.
Use a join:
SELECT x.*
FROM (SELECT 'ES06139009N6' AS Isin, 'MAD' AS col2 UNION ALL
SELECT 'GB0002634946', 'LSE' UNION ALL
SELECT 'SG1L01001701', 'SGX'
) x JOIN
FundPriceDetails fpd
ON fpd.Isin = x.Isin;
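The join treats the hard-coded pairs as a small lookup table. The same idea in Python is a dictionary lookup. In the sketch below the list standing in for FundPriceDetails is invented sample data (including the non-matching ISIN), and `attach_exchange` is a hypothetical name.

```python
# Inline lookup table, same pairs as in the query.
exchange_by_isin = {
    "ES06139009N6": "MAD",
    "GB0002634946": "LSE",
    "SG1L01001701": "SGX",
}

def attach_exchange(isins, lookup):
    """Inner-join style: keep only ISINs present in the lookup, with their code."""
    return [(isin, lookup[isin]) for isin in isins if isin in lookup]

# Assumed contents of the Isin column, for illustration only.
fund_price_isins = ["GB0002634946", "XX0000000000", "ES06139009N6"]
print(attach_exchange(fund_price_isins, exchange_by_isin))
```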