How to get yes/no statistics from SQL on how often each string occurs - sql

Is there a way to query a table from the BigQuery project HTTPArchive to check how often certain strings occur for a certain file type?
I was able to write a query for a single check, but how can I run it for multiple strings at once, without sending the same query again with a different string each time and processing the ~800GB of table data every time?
Would getting the results as an array work somehow? I want to publish in-depth monthly statistics to the public for free, so sending those queries separately and being billed roughly $2000/month is not an option for me as a student.
SELECT matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (
SELECT url, (LOWER(body) CONTAINS 'document.write') AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies
WHERE url LIKE "%.js"
)
GROUP BY matched
Please note that this is just one example of many (~50), and the pre-generated stats are not what I am looking for, as they don't contain the needed information.

Below is for BigQuery Standard SQL
#standardSQL
WITH strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
-- ORDER BY ext, str
You can test / play with above using [totally] dummy data as below
#standardSQL
WITH `httparchive.har.2017_09_01_chrome_requests_bodies` AS (
SELECT '1234.js' AS url, 'abc=1;x=2' AS body UNION ALL
SELECT 'qaz.js', 'y=1;xyz=0' UNION ALL
SELECT 'edc.go', 's=1;xyz=2;abc=3' UNION ALL
SELECT 'edc.go', 's=1;xyz=4;abc=5' UNION ALL
SELECT 'rfv.php', 'd=1' UNION ALL
SELECT 'tgb.txt', '?abc=xyz' UNION ALL
SELECT 'yhn.php', 'like v' UNION ALL
SELECT 'ujm.go', 'lkjsad' UNION ALL
SELECT 'ujm.go', 'yhj' UNION ALL
SELECT 'ujm.go', 'dfgh' UNION ALL
SELECT 'ikl.js', 'werwer'
), strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
ORDER BY ext, str
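
For readers who want to see the single-pass idea outside BigQuery, here is a minimal Python sketch of the same logic: every row is read once and all (extension, string) pairs are counted together. The rows, the function name `match_stats`, and the variable names are hypothetical stand-ins, not part of the original answer. Note one difference: the sketch uses a plain substring test, whereas REGEXP_CONTAINS treats the search string as a regular expression (so the `.` in 'document.write' would match any character).

```python
from collections import defaultdict

# Hypothetical stand-in rows (url, body) for the requests_bodies table.
rows = [
    ("1234.js", "abc=1;x=2"),
    ("qaz.js", "y=1;xyz=0"),
    ("edc.go", "s=1;xyz=2;abc=3"),
    ("rfv.php", "d=1"),
    ("tgb.txt", "?abc=xyz"),
]
strings = [s.lower() for s in ["abc", "XYZ"]]
extensions = [e.lower() for e in ["JS", "go", "php"]]


def match_stats(rows, strings, extensions):
    """Return {(ext, string): (total, matches, ratio)} from one scan of rows."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for url, body in rows:
        url_lc = url.lower()
        body_lc = body.lower()
        for ext in extensions:
            if not url_lc.endswith("." + ext):
                continue
            for s in strings:
                totals[(ext, s)] += 1
                if s in body_lc:  # plain substring test, not a regex
                    matches[(ext, s)] += 1
    return {
        key: (totals[key], matches[key], round(matches[key] / totals[key], 3))
        for key in totals
    }


stats = match_stats(rows, strings, extensions)
for key in sorted(stats):
    print(key, stats[key])
```

The point is only that adding a string costs another counter, not another scan of the data.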

One method is to bring in a table with the different strings. This is the idea:
SELECT str, matched, count(*) AS total, RATIO_TO_REPORT(total) OVER(PARTITION BY str) AS ratio
FROM (SELECT crb.url, s.str, (LOWER(crb.body) CONTAINS s.str) AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies crb CROSS JOIN
(SELECT 'document.write' as str UNION ALL
SELECT 'xxx' as str
) s
WHERE url LIKE "%.js"
)
GROUP BY str, matched;
You would just add more strings to s.

Related

How to convert fields to JSON in Postgresql

I have a table with the following schema (postgresql 14):
message | sentiment | classification
any text | positive | mobile, communication
message is a string (a phrase).
sentiment is a string of exactly one word.
classification is a string of one or more words, comma separated.
I would like to create a JSON field from these columns, like this:
{"msg":"any text", "sentiment":"positive","classification":["mobile","communication"]}
Also, if possible, is there a way to represent the classification this way:
{"msg":"any text", "sentiment":"positive","classification 1":"mobile","classification 2":"communication"}
The first part of question is easy - Postgres provides functions for splitting string and converting to json:
with t(message, sentiment, classification) as (values
('any text','positive','mobile, communication')
)
select row_to_json(x.*)
from (
select t.message
, t.sentiment
, array_to_json(string_to_array(t.classification, ', ')) as classification
from t
) x
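
To make the target shape concrete, here is a small Python sketch of the same transformation: split the comma-separated classification and emit it as a JSON array. It is only an illustration of the logic, not Postgres code; the function name `row_to_json_text` is hypothetical, and the `msg` key follows the question rather than the column name.

```python
import json


def row_to_json_text(message, sentiment, classification):
    """Build the first requested shape: classification becomes a JSON array."""
    # Split on commas and trim whitespace, so 'mobile, communication'
    # yields ['mobile', 'communication'].
    labels = [c.strip() for c in classification.split(",") if c.strip()]
    return json.dumps(
        {"msg": message, "sentiment": sentiment, "classification": labels}
    )


result = row_to_json_text("any text", "positive", "mobile, communication")
print(result)
```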
The second part is harder: you want the JSON to have a variable number of attributes, mixing grouped and non-grouped data. I suggest unwinding all attributes and then assembling them back (note that the numbered CTE is not needed if your real table has an id; I just needed some column to group by):
with t(message, sentiment, classification) as (values
('any text','positive','mobile, communication')
)
, numbered (id, message, sentiment, classification) as (
select row_number() over (order by null)
, t.*
from t
)
, extracted (id,message,sentiment,classification,index) as (
select n.id
, n.message
, n.sentiment
, l.c
, l.i
from numbered n
join lateral unnest(string_to_array(n.classification, ', ')) with ordinality l(c,i) on true
), unioned (id, attribute, value) as (
select id, concat('classification ', index::text), classification
from extracted
union all
select id, 'message', message
from numbered
union all
select id, 'sentiment', sentiment
from numbered
)
select json_object_agg(attribute, value)
from unioned
group by id;
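
The unwind-and-reassemble idea can also be sketched in a few lines of Python, which may help when checking what the query should return. This is an illustration under assumptions, not Postgres code: the function name `row_to_flat_json` is hypothetical, and the `msg` key follows the question (the query above emits `message`).

```python
import json


def row_to_flat_json(message, sentiment, classification):
    """Build the second requested shape: one numbered key per classification."""
    labels = [c.strip() for c in classification.split(",") if c.strip()]
    obj = {"msg": message, "sentiment": sentiment}
    # Number the labels from 1, like WITH ORDINALITY does in the query.
    for i, label in enumerate(labels, start=1):
        obj[f"classification {i}"] = label
    return json.dumps(obj)


flat = row_to_flat_json("any text", "positive", "mobile, communication")
print(flat)
```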
Use jsonb_build_object and concatenate the columns you want
SELECT
jsonb_build_object(
'msg',message,
'sentiment',sentiment,
'classification',
string_to_array(classification, ', '))
FROM mytable;
The second output is definitely not trivial. The SQL code would be much larger and harder to maintain, not to mention that parsing such a file also requires a little more effort.
You can use a cte to handle the flattening of the classification attributes and then perform the necessary grouping in the main queries for each problem component:
with cte(r, m, s, k) as (
select v.ord, t.message, t.sentiment, v.val from tbl t
cross join json_array_elements(array_to_json(string_to_array(t.classification, ', '))) with ordinality v(val, ord)
)
-- first part --
select json_build_object('msg', t1.message, 'sentiment', t1.sentiment, 'classification', string_to_array(t1.classification, ', ')) from tbl t1
-- second part --
select jsonb_build_object('msg', t1.m, 'sentiment', t1.s)||('{'||t1.g||'}')::jsonb
from (select c.m, c.s, array_to_string(array_agg('"classification '||c.r||'":'||c.k), ', ') g
from cte c group by c.m, c.s) t1

How to find and extract a substring in BigQuery

I have a string column in a BigQuery table, for example:
name
WW_for_all_feed
EU_param_1_for_all_feed
AU_for_all_full_settings_18+
WW_for_us_param_5_for_us_feed
WW_for_us_param_5_feed
WW_for_all_25+
and also have a list of variables, for example :
param_1_for_all
param_5_for_us
param_5
full_settings
If the string in column "name" contains one of these substrings, I need to extract it:
name | param
WW_for_all_feed | None
EU_param_1_for_all_feed | param_1_for_all
AU_for_all_full_settings_18+ | full_settings
WW_for_us_param_5_for_us_feed | param_5_for_us
WW_for_us_param_5_feed | param_5
WW_for_all_25+ | None
I want to try regexp and replace, but I don't know what pattern to use to find the substring.
Use below
select name, param
from your_table
left join params
on regexp_contains(name, param)
If applied to the sample data in your question:
with your_table as (
select 'WW_for_all_feed' name union all
select 'EU_param_1_for_all_feed' union all
select 'AU_for_all_full_settings_18+' union all
select 'WW_for_us_param_5_for_us_feed' union all
select 'WW_for_all_25+'
), params as (
select 'param_1_for_all' param union all
select 'param_5_for_us' union all
select 'full_settings'
)
output is
But I have another issue (updated question): what if one of the params is a substring of another?
Use the below then:
select name, string_agg(param order by length(param) desc limit 1) param
from your_table
left join params
on regexp_contains(name, param)
group by name
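
The "longest match wins" rule in that query can be checked against the sample data with a short Python sketch. This is only an illustration of the selection logic, not BigQuery code; the function name `extract_param` is hypothetical, and it uses a plain substring test where the query uses regexp_contains.

```python
def extract_param(name, params):
    """Return the longest param contained in name, or None if none match."""
    hits = [p for p in params if p in name]
    return max(hits, key=len) if hits else None


params = ["param_1_for_all", "param_5_for_us", "param_5", "full_settings"]
names = [
    "WW_for_all_feed",
    "EU_param_1_for_all_feed",
    "AU_for_all_full_settings_18+",
    "WW_for_us_param_5_for_us_feed",
    "WW_for_us_param_5_feed",
    "WW_for_all_25+",
]

result = {name: extract_param(name, params) for name in names}
for name, param in result.items():
    print(name, param)
```

Both 'param_5' and 'param_5_for_us' occur in 'WW_for_us_param_5_for_us_feed'; ordering by length descending and keeping one row is what picks the longer one.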
if applied to your updated data sample - output is

Count unique within combination of json keys in BigQuery

In BigQuery I do have a json stored in 1 column like this:
{"key1": "value1", "key3":"value3"}
{"key2": "value2"}
{"key3": "value3"}
What I'd want to know is how to calculate number of unique combinations, paying attention that there can be up to 100+ different keys so avoiding listing them would be beneficial.
In the example above, the resulting number of unique combinations is 2, because the first and third rows match on "key3", while the second doesn't match anything.
I understand how to build this with writing an app that will calculate it, but would like to see if there is any solution possible with 1 query
If your JSON values are formatted with no spaces after the :, then you can treat this as string manipulations:
with t as (
select '{"key1":"value1", "key3":"value3"}' as kv union all
select '{"key2":"value2"}' union all
select '{"key3":"value3"}'
)
select x, count(*)
from t cross join
unnest(regexp_extract_all(t.kv, '"[^,]+"')) x
group by x
having count(*) = 1;
With the spaces, you can use replace() to get rid of them:
with t as (
select '{"key1": "value1", "key3":"value3"}' as kv union all
select '{"key2": "value2"}' union all
select '{"key3": "value3"}'
)
select replace(x, '": "', '":"'), count(*)
from t cross join
unnest(regexp_extract_all(t.kv, '"[^,]+"')) x
group by 1
having count(*) = 1;
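
To see what the string-manipulation query actually extracts, here is a Python sketch that applies the same regular expression and the same space normalization, then counts each key/value pair. It mirrors the query as written (pairs that occur exactly once), which is an assumption about the intended logic rather than a full answer to the grouping question; the function name `pair_counts` is hypothetical.

```python
import re
from collections import Counter

docs = [
    '{"key1": "value1", "key3":"value3"}',
    '{"key2": "value2"}',
    '{"key3": "value3"}',
]


def pair_counts(docs):
    """Count each "key":"value" pair across all documents."""
    counts = Counter()
    for doc in docs:
        # Same pattern as the query: a quoted run containing no commas.
        for pair in re.findall(r'"[^,]+"', doc):
            counts[pair.replace('": "', '":"')] += 1
    return counts


counts = pair_counts(docs)
singles = sorted(pair for pair, n in counts.items() if n == 1)
print(singles)
```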

Extracting keywords out of a string and showing them in another column

I need to write a query that extracts specific names from a string and shows them in another column. For example, a column has these values:
Column:
Row 1: jasdhj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj
Row 2: jasnasc8918212,ahsahkdjjMina67,
Row 3: kasdhakshd,asda,asdasd,121,121,Sina878788kasas
Keywords: Amin,Mina,Sina
How can I get these keywords into another column? I don't want to add another column to the table, but if that's the only solution, let me know.
Any help appreciated!
Below is for BigQuery Standard SQL
#standardSQL
WITH keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
You can test, play with above using dummy data from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
), keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
with results as
Row str keywords_in_str
1 jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj Amin,Mina
2 jasnasc8918212,ahsahkdjjMina67, Mina
3 kasdhakshd,asda,asdasd,121,121,Sina878788kasas Sina
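
The same case-insensitive matching can be sketched in Python to double-check the results above. This is an illustration only, not BigQuery code; the function name `keywords_in` is hypothetical, and it uses a lowercase substring test in place of the `(?i)` regex flag.

```python
def keywords_in(text, keywords):
    """Return the keywords that occur in text, ignoring case."""
    text_lc = text.lower()
    return [k for k in keywords if k.lower() in text_lc]


keywords = "Amin,Mina,Sina".split(",")
rows = [
    "jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj",
    "jasnasc8918212,ahsahkdjjMina67,",
    "kasdhakshd,asda,asdasd,121,121,Sina878788kasas",
]

found = [",".join(keywords_in(row, keywords)) for row in rows]
for row, hits in zip(rows, found):
    print(row, hits)
```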
To count the number of occurrences of each keyword:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
)
SELECT str,
  ARRAY(SELECT AS STRUCT
      COUNTIF(LOWER(k) = 'amin') AS amin,
      COUNTIF(LOWER(k) = 'mina') AS mina,
      COUNTIF(LOWER(k) = 'sina') AS sina
    FROM UNNEST(x) AS k) AS keyword
FROM (SELECT str, REGEXP_EXTRACT_ALL(str, r'(?i)(Amin|Mina|Sina)') AS x
  FROM `project.dataset.table`)

Using a case statement as an if statement

I am attempting to create an IF statement in BigQuery. I have built a concept that works, but it does not select the data from a table; I can only get it to display 1 or 0.
Example:
SELECT --AS STRUCT
CASE
WHEN (
Select Count(1) FROM ( -- If the records are the same, then return = 0, if the records are not the same then > 1
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
Except Distinct
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
)
)= 0
THEN
(Select * from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest`)
-- This does not work: "Scalar subquery cannot have more than one column unless using SELECT AS STRUCT to build STRUCT values at [16:4]"
END
SELECT --AS STRUCT
CASE
WHEN (
Select Count(1) FROM ( -- If the records are the same, then return = 0, if the records are not the same then > 1
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
Except Distinct
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
)
)= 0
THEN 1 --- This does work
Else
0
END
How can I get this query to return results from an existing table?
Your question is still a little generic, so my answer is as well. It mimics your use case to the extent I can reverse engineer it from your comments.
So, in below code - project.dataset.yourtable mimics your table ; whereas
project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered mimic your respective views
#standardSQL
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'bbb' cols, 'latest' filter
), `project.dataset.yourtable_Prior_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'prior'
), `project.dataset.yourtable_Latest_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'latest'
), check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
the result is
Row cols filter
1 aaa prior
2 bbb latest
if you changed your table to
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'aaa' cols, 'latest' filter
) ......
the result will be
Row cols filter
Query returned zero records.
I hope this points you in the right direction.
Added more explanations:
I may be wrong, but from your question it looks like you have one table, project.dataset.yourtable, and two views, project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered, which present the state of your table before and after some event.
So the first three CTEs in the answer above just mimic the table and views you described in your question.
They are there so you can see the concept and play with it without any extra work before adjusting it to your real use case.
For your real use case you should omit them and use your real table and view names and whatever columns they have.
So the query for you to play with is:
#standardSQL
WITH check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
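
The pattern in that query (compute one boolean, then cross join it to gate the whole result) behaves like the following Python sketch. This is only an illustration of the control flow, not BigQuery code; the function name `rows_if_changed` and the sample lists are hypothetical and reuse the dummy data from above.

```python
def rows_if_changed(table, prior, latest):
    """Return every row of table if latest has values absent from prior, else []."""
    # Equivalent of: SELECT COUNT(1) > 0 FROM (latest EXCEPT DISTINCT prior)
    changed = len(set(latest) - set(prior)) > 0
    # Equivalent of: CROSS JOIN check WHERE check.changed
    return list(table) if changed else []


table = [("aaa", "prior"), ("bbb", "latest")]
first = rows_if_changed(table, prior=["aaa"], latest=["bbb"])

table_same = [("aaa", "prior"), ("aaa", "latest")]
second = rows_if_changed(table_same, prior=["aaa"], latest=["aaa"])

print(first)
print(second)
```

To return rows only when nothing changed, as in the original CASE attempt, the condition would be negated.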
It should be a very simple IF statement in any language.
Unfortunately, no! It cannot be done with just a simple IF. If you see fit, you can submit a feature request to the BigQuery team for whatever you think makes sense.