Count unique within combination of json keys in BigQuery - sql

In BigQuery I do have a json stored in 1 column like this:
{"key1": "value1", "key3":"value3"}
{"key2": "value2"}
{"key3": "value3"}
What I'd want to know is how to calculate number of unique combinations, paying attention that there can be up to 100+ different keys so avoiding listing them would be beneficial.
In example above end result of unique number will be 2, because first and third matched by "key3", while second didn't matched with anything.
I understand how to build this with writing an app that will calculate it, but would like to see if there is any solution possible with 1 query

If your JSON values are formatted with no spaces after the :, then you can treat this as string manipulations:
with t as (
select '{"key1":"value1", "key3":"value3"}' as kv union all
select '{"key2":"value2"}' union all
select '{"key3":"value3"}'
)
select x, count(*)
from t cross join
unnest(regexp_extract_all(t.kv, '"[^,]+"')) x
group by x
having count(*) = 1;
With the spaces, you can use replace() to get rid of them:
with t as (
select '{"key1": "value1", "key3":"value3"}' as kv union all
select '{"key2": "value2"}' union all
select '{"key3": "value3"}'
)
select replace(x, '": "', '":"'), count(*)
from t cross join
unnest(regexp_extract_all(t.kv, '"[^,]+"')) x
group by 1
having count(*) = 1;

Related

BigQuery - concatenate ignoring NULL

I'm very new to SQL. I understand in MySQL there's the CONCAT_WS function, but BigQuery doesn't recognise this.
I have a bunch of twenty fields I need to CONCAT into one comma-separated string, but some are NULL, and if one is NULL then the whole result will be NULL. Here's what I have so far:
CONCAT(m.track1, ", ", m.track2))) As Tracks,
I tried this but it returns NULL too:
CONCAT(m.track1, IFNULL(m.track2,CONCAT(", ", m.track2))) As Tracks,
Super grateful for any advice, thank you in advance.
Unfortunately, BigQuery doesn't support concat_ws(). So, one method is string_agg():
select t.*,
(select string_agg(track, ',')
from (select t.track1 as track union all select t.track2) x
) x
from t;
Actually a simpler method uses arrays:
select t.*,
array_to_string([track1, track2], ',')
Arrays with NULL values are not supported in result sets, but they can be used for intermediate results.
I have a bunch of twenty fields I need to CONCAT into one comma-separated string
Assuming that these are the only fields in the table - you can use below approach - generic enough to handle any number of columns and their names w/o explicit enumeration
select
(select string_agg(col, ', ' order by offset)
from unnest(split(trim(format('%t', (select as struct t.*)), '()'), ', ')) col with offset
where not upper(col) = 'NULL'
) as Tracks
from `project.dataset.table` t
Below is oversimplified dummy example to try, test the approach
#standardSQL
with `project.dataset.table` as (
select 1 track1, 2 track2, 3 track3, 4 track4 union all
select 5, null, 7, 8
)
select
(select string_agg(col, ', ' order by offset)
from unnest(split(trim(format('%t', (select as struct t.*)), '()'), ', ')) col with offset
where not upper(col) = 'NULL'
) as Tracks
from `project.dataset.table` t
with output

How to search for multiple matches using the IN operator in BigQuery?

Right now I am filtering my rows by using the WHERE operator and 2 conditional statements. It seems somewhat inefficient that I am writing 2 conditions. Would it be possible to check if "amznbida" and "ksga" are in the array by only writing in one statement?
standardSQL
-- Get all the keys
SELECT
*
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE
"amznbida" IN UNNEST(ARRAY(SELECT name FROM UNNEST(keywords)))
AND
"ksga"IN UNNEST(ARRAY(SELECT name FROM UNNEST(keywords)))
Just remove the UNNEST(ARRAY( part and leave the subquery - you should be fine.
working example:
SELECT
*,
t in (select * from unnest(a)) condition
FROM unnest([
struct('a' as t, ['a', 'b', 'c'] as a),
('b',['r', 'f'])
])
Below is for BigQuery Standard SQL
#standardSQL
SELECT *
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE 2 = (SELECT COUNT(DISTINCT name) FROM UNNEST(keywords) WHERE name IN ("amznbida", "ksga"))
Yu can test, play with above using dummy data as below
#standardSQL
WITH `encoded-victory-198215.DFP_TEST.test3` AS (
SELECT
ARRAY<STRUCT<value ARRAY<STRING>, name STRING>>[
STRUCT(['ksg-1', 'ksg-2'], 'ksga'), STRUCT(['amznbid-1', 'amznbid-2'], 'amznbida')
] keywords,
1 impression UNION ALL
SELECT
ARRAY<STRUCT<value ARRAY<STRING>, name STRING>>[
STRUCT(['xxx-1', 'xxx-2'], 'xxxa'), STRUCT(['amznbid-1', 'amznbid-2'], 'amznbida')
] keywords,
2 impression
)
SELECT *
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE 2 = (SELECT COUNT(DISTINCT name) FROM UNNEST(keywords) WHERE name IN ("amznbida", "ksga"))
with result
Row keywords.value keywords.name impression
1 ksg-1 ksga 1
ksg-2
amznbid-1 amznbida
amznbid-2

Extracting Key-worlds out of string and show them in another column

I need to write a query to extract specific names out of String and have them show in another column for example a column has this field
Column:
Row 1: jasdhj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj
Row 2:jasnasc8918212,ahsahkdjjMina67,
Row 3:kasdhakshd,asda,asdasd,121,121,Sina878788kasas
Key Words: Amin,Mina,Sina
How could I have these key words in another column? I dont want to insert another column but if that's the only solution let me know.
Any help appreciated!
Below is for BigQuery Standard SQL
#standardSQL
WITH keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
You can test, play with above using dummy data from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
), keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
with results as
Row str keywords_in_str
1 jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj Amin,Mina
2 jasnasc8918212,ahsahkdjjMina67, Mina
3 kasdhakshd,asda,asdasd,121,121,Sina878788kasas Sina
to count the no of keywords
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
)
select str,array(select as struct countif(lower(x) ="amin") amin,countif(lower(x) ="mina") mina,countif(lower(x)="sina") sina from unnest(x)x)keyword from
(select str,regexp_extract_all(str,"(?i)(Amin|Mina|Sina)")x from `project.dataset.table`)

Using a case statement as an if statement

I am attempting to create an IF statement in BigQuery. I have built a concept that will work but it does not select the data from a table, I can only get it to display 1 or 0
Example:
SELECT --AS STRUCT
CASE
WHEN (
Select Count(1) FROM ( -- If the records are the same, then return = 0, if the records are not the same then > 1
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
Except Distinct
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
)
)= 0
THEN
(Select * from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest`) -- This Does not
work Scalar subquery cannot have more than one column unless using SELECT AS
STRUCT to build STRUCT values at [16:4] END
SELECT --AS STRUCT
CASE
WHEN (
Select Count(1) FROM ( -- If the records are the same, then return = 0, if the records are not the same then > 1
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
Except Distinct
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
)
)= 0
THEN 1 --- This does work
Else
0
END
How can I Get this query to return results from an existing table?
You question is still a little generic, so my answer same as well - and just mimic your use case at extend I can reverse engineer it from your comments
So, in below code - project.dataset.yourtable mimics your table ; whereas
project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered mimic your respective views
#standardSQL
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'bbb' cols, 'latest' filter
), `project.dataset.yourtable_Prior_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'prior'
), `project.dataset.yourtable_Latest_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'latest'
), check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
the result is
Row cols filter
1 aaa prior
2 bbb latest
if you changed your table to
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'aaa' cols, 'latest' filter
) ......
the result will be
Row cols filter
Query returned zero records.
I hope this gives you right direction
Added more explanations:
I can be wrong - but per your question - it looks like you have one table project.dataset.yourtable and two views project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered which present state of your table prior and after some event
So, first three CTE in the answer above just mimic those table and views which you described in your question.
They are here so you can see concept and can play with it without any extra work before adjusting this to your real use-case.
For your real use-case you should omit them and use your real table and views names and whatever columns the have.
So the query for you to play with is:
#standardSQL
WITH check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
It should be a very simple IF statement in any language.
Unfortunately NO! it cannot be done with just simple IF and if you see it fit you can submit a feature request to BigQuery team for whatever you think makes sense

How to get yes/no statistics from SQL of how often strings occur each

Is there a way to query a table from BigQuery project HTTPArchive by checking how often certain strings occur by a certain file type?
I was able to write a query for a single check but how to perform this query on multiple strings at once without needing to send the same query every time just with a different string check and process the ~800GB of table data every time?
Getting the results as array might work somehow? I want to publish in-depth monthly statistics to the public for free so the option to send those queries separately and get billed for querying of roughly $2000/month is no option for me as a student.
SELECT matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (
SELECT url, (LOWER(body) CONTAINS 'document.write') AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies
WHERE url LIKE "%.js"
)
GROUP BY matched
Please note that this is just one example of many (~50) and the pre-generated stats are not what I am looking for as it doesn't contain the needed information.
Below is for BigQuery Standard SQL
#standardSQL
WITH strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
-- ORDER BY ext, str
You can test / play with above using [totally] dummy data as below
#standardSQL
WITH `httparchive.har.2017_09_01_chrome_requests_bodies` AS (
SELECT '1234.js' AS url, 'abc=1;x=2' AS body UNION ALL
SELECT 'qaz.js', 'y=1;xyz=0' UNION ALL
SELECT 'edc.go', 's=1;xyz=2;abc=3' UNION ALL
SELECT 'edc.go', 's=1;xyz=4;abc=5' UNION ALL
SELECT 'rfv.php', 'd=1' UNION ALL
SELECT 'tgb.txt', '?abc=xyz' UNION ALL
SELECT 'yhn.php', 'like v' UNION ALL
SELECT 'ujm.go', 'lkjsad' UNION ALL
SELECT 'ujm.go', 'yhj' UNION ALL
SELECT 'ujm.go', 'dfgh' UNION ALL
SELECT 'ikl.js', 'werwer'
), strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
ORDER BY ext, str
One method is to bring in a table with the different strings. This is the idea:
SELECT str, matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (SELECT crb.url, s.str, (LOWER(crb.body) CONTAINS s.str) AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies crb CROSS JOIN
(SELECT 'document.write' as str UNION ALL
SELECT 'xxx' as str
) s
WHERE url LIKE "%.js"
)
GROUP BY str, matched;
You would just add more strings to s.