get a distinct count and repeating per row in BigQuery SQL - google-bigquery

I haven't been able to find an answer to this, but I think it should be possible with a single BigQuery query. I'm looking for something close to this, even with approximate results.
Let's say I have a table that looks like this, with one column called myValue:
====
myValue
====
foo |
bar |
bar |
baz |
I would like a single query that adds a column next to it containing the distinct counts of myValue, repeated on every row:
=======|=====================|
myValue|myNewCounts |
=======|=====================|
foo |[foo:1, bar:2, baz:1]
bar |[foo:1, bar:2, baz:1]
bar |[foo:1, bar:2, baz:1]
baz |[foo:1, bar:2, baz:1]
I know that you can use ARRAY_AGG(distinct) to get the distinct values on every row, but I haven't been able to find a way to also get the counts on every row, even in an approximate fashion.
It's important that this be done in a single query. I could obviously have a separate query that calculates the distinct counts and then join that back to this table, but I'm trying to do this in one query.
One would think that, in a columnar database, returning myValue and myNewCounts in one pass should be doable somehow.

Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM `project.dataset.table`, (
SELECT '[' || STRING_AGG(x, ', ') || ']' myNewCounts FROM (
SELECT FORMAT('%s:%i', myValue, COUNT(1)) x
FROM `project.dataset.table`
GROUP BY myValue
))
Applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'foo' myValue UNION ALL
SELECT 'bar' UNION ALL
SELECT 'bar' UNION ALL
SELECT 'baz'
)
SELECT * FROM `project.dataset.table`, (
SELECT '[' || STRING_AGG(x, ', ') || ']' myNewCounts FROM (
SELECT FORMAT('%s:%i', myValue, COUNT(1)) x
FROM `project.dataset.table`
GROUP BY myValue
))
the result is
Row myValue myNewCounts
1 foo [foo:1, bar:2, baz:1]
2 bar [foo:1, bar:2, baz:1]
3 bar [foo:1, bar:2, baz:1]
4 baz [foo:1, bar:2, baz:1]
If myNewCounts is expected to be an array, use the version below instead:
#standardSQL
SELECT * FROM `project.dataset.table`, (
SELECT ARRAY_AGG(x) myNewCounts FROM (
SELECT FORMAT('%s:%i', myValue, COUNT(1)) x
FROM `project.dataset.table`
GROUP BY myValue
))
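
To make the logic of the queries above easier to follow, here is a minimal Python sketch (not BigQuery code) of the same two steps: count each distinct value once, then attach the same formatted summary to every row. The function name attach_counts is hypothetical and only serves this illustration.

```python
from collections import Counter

def attach_counts(values):
    """Pair every value with one shared summary of distinct-value counts."""
    # Equivalent of GROUP BY myValue with COUNT(1)
    counts = Counter(values)
    # Equivalent of FORMAT('%s:%i', ...) followed by STRING_AGG(x, ', ')
    summary = "[" + ", ".join(f"{v}:{n}" for v, n in counts.items()) + "]"
    # Equivalent of the cross join: every row receives the same summary
    return [(v, summary) for v in values]

rows = attach_counts(["foo", "bar", "bar", "baz"])
for value, summary in rows:
    print(value, summary)
```

The sketch lists values in first-seen order. In BigQuery, STRING_AGG and ARRAY_AGG give no guaranteed order unless an ORDER BY is added inside the aggregate.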

Related

Geography function over a column

I am trying to use the ST_MAKELINE() function to create a line between every point and the next one in a single column.
Do I need to create another column that already contains the two points?
with t1 as(
SELECT *, ST_GEOGPOINT(cast(long as float64) , cast(lat as float64)) geometry FROM `my_table.faissal.trajets_flix`
where id = 1
order by index_loc
)
select index_loc, geometry
from t1
Thanks for your help
You seem to want something like the query below, which uses ST_MAKELINE:
https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_makeline
WITH t1 as (
SELECT *, ST_GEOGPOINT(cast(long as float64), cast(lat as float64)) geometry
FROM `my_table.faissal.trajets_flix`
-- WHERE id = 1
)
SELECT id, ST_MAKELINE(ARRAY_AGG(geometry ORDER BY index_loc)) traj
FROM t1
GROUP BY id;
The output is one line geography per id, which can be visualized on a map.
Also consider the simple and cheap option below:
select st_geogfromtext(format('linestring(%s)',
string_agg(long || ' ' || lat order by index_loc))
) as path
from `my_table.faissal.trajets_flix`
where id = 1
Applied to the sample data in your question, the output is a single linestring that can be visualized on a map.
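
The string-based option above can be sketched outside BigQuery as well. The Python snippet below is only an illustration of how the WKT text is assembled before ST_GEOGFROMTEXT parses it: points are sorted by index_loc and joined as "long lat" pairs. The function name linestring_wkt and the sample coordinates are made up for this example.

```python
def linestring_wkt(points):
    """Build a WKT LINESTRING from (index_loc, long, lat) tuples."""
    # Equivalent of ORDER BY index_loc inside STRING_AGG
    ordered = sorted(points, key=lambda p: p[0])
    # Equivalent of STRING_AGG(long || ' ' || lat)
    coords = ", ".join(f"{lon} {lat}" for _, lon, lat in ordered)
    return f"LINESTRING({coords})"

# Hypothetical sample points, deliberately out of order
sample = [(2, 2.35, 48.85), (1, 2.29, 48.86), (3, 2.4, 48.84)]
wkt = linestring_wkt(sample)
print(wkt)
```

Note that WKT puts longitude first and latitude second, which is why the query concatenates long before lat.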

How to partially filter substring from a table with count

I am trying to filter out partial substrings of a string, which I achieve with the query below. I can print the count value, but it is always 1. I need to print the real count.
#standardSQL
WITH `project.dataset.table` AS (
select term from(
select LOWER(REGEXP_EXTRACT(textPayload,"Search term:(.*)")) as term from `log_dataset.backend_*`
where REGEXP_CONTAINS(textPayload, "Search term:.*")=true
)
group by term
order by count(*) desc
), temp AS (
SELECT term, COUNT(1) `count`
FROM `project.dataset.table`
GROUP BY term
)
SELECT term , `count` FROM (
SELECT term, `count`, STARTS_WITH(prev_str, term) AND
ARRAY_LENGTH(REGEXP_EXTRACT_ALL(term, r' ')) = ARRAY_LENGTH(REGEXP_EXTRACT_ALL(prev_str, r' ')) AS flag
FROM (
SELECT term, `count`, LAG(term) OVER(ORDER BY term DESC) AS prev_str
FROM temp
)
)
WHERE NOT IFNULL(flag, FALSE)
This is the list of terms:
anderstand
anderstan
andersta
anderst
understand
understan
understa
underst
unders
under
understand i
understand i
understand it
understand it
understand it y
understand it ye
understand it yes
understand it yes it
understand it yes it
The desired output is:
Row str count
1 understand it yes it 2
2 anderstand 1
3 understand it yes 1
4 understand 1
5 understand it 2
To obtain the desired output you can employ a GROUP BY statement as follows:
SELECT
str,
COUNT(*) AS count
FROM
`project_id.dataset.table`
GROUP BY
str
In addition, the LIKE operator can be used to filter the words in the str field.
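
As a cross-check of the filtering logic in the question's query, here is a rough Python sketch of the same idea. It counts the raw terms first, then walks them in descending order and drops any term that is a prefix of the previous term with the same number of spaces. Counting before any grouping matters: in the question's query the first CTE already groups by term, so the later COUNT(1) can only ever return 1. The function name final_terms is hypothetical.

```python
from collections import Counter

def final_terms(terms):
    """Keep only 'complete' terms, together with their raw occurrence counts."""
    counts = Counter(terms)                 # count before any de-duplication
    ordered = sorted(counts, reverse=True)  # like ORDER BY term DESC
    kept = []
    prev = None                             # plays the role of LAG(term)
    for term in ordered:
        is_partial = (
            prev is not None
            and prev.startswith(term)
            and prev.count(" ") == term.count(" ")
        )
        if not is_partial:
            kept.append((term, counts[term]))
        prev = term
    return kept

terms = [
    "anderstand", "anderstan", "andersta", "anderst",
    "understand", "understan", "understa", "underst", "unders", "under",
    "understand i", "understand i", "understand it", "understand it",
    "understand it y", "understand it ye", "understand it yes",
    "understand it yes it", "understand it yes it",
]
result = final_terms(terms)
```

With the terms listed in the question, this keeps the same five terms and counts as the desired output, in descending term order rather than the order shown there.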

How to get count of matches in field of table for list of phrases from another table in bigquery?

Given an arbitrary list of phrases phrase1, phrase2*, ... phraseN (say these are in another table Phrase_Table), how would one get the count of matches for each phrase in a field F in a bigquery table?
Here, "*" means there must be some non-empty/non-blank string after the phrase.
Let's say you have a table with an ID field and two string fields, Field1 and Field2.
Output would look something like
id, CountOfPhrase1InField1, CountOfPhrase2InField1, CountOfPhrase1InField2, CountOfPhrase2InField2
or I guess instead of all of those output fields maybe there's a single json object field
id, [{"fieldName": Field1, "counts": {phrase1: m, phrase2: mm, ...}},
{"fieldName": Field2, "counts": {phrase1: m2, phrase2: mm2, ...}},...]
Thanks!
Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'foo1 foo foo40' str UNION ALL
SELECT 'test1 test test2 test'
), `project.dataset.keywords` AS (
SELECT 'foo' key UNION ALL
SELECT 'test'
)
SELECT str, ARRAY_AGG(STRUCT(key, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) as matches)) all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str
with result
Row str all_matches.key all_matches.matches
1 foo1 foo foo40 foo 2
test 0
2 test1 test test2 test foo 0
test 2
If you prefer the output as JSON, you can add TO_JSON_STRING() as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'foo1 foo foo40' str UNION ALL
SELECT 'test1 test test2 test'
), `project.dataset.keywords` AS (
SELECT 'foo' key UNION ALL
SELECT 'test'
)
SELECT str, TO_JSON_STRING(ARRAY_AGG(STRUCT(key, ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]'))) as matches))) all_matches
FROM `project.dataset.table`
CROSS JOIN `project.dataset.keywords`
GROUP BY str
with output
Row str all_matches
1 foo1 foo foo40 [{"key":"foo","matches":2},{"key":"test","matches":0}]
2 test1 test test2 test [{"key":"foo","matches":0},{"key":"test","matches":2}]
There are endless ways of presenting output like the above. I hope you can adjust it to exactly what you need :o)
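
For readers who want to check the matching rule outside BigQuery, the following Python sketch mirrors the query: for every string and every key it counts non-overlapping matches of the key followed by one non-whitespace character. The function name count_matches is hypothetical, and the use of re.escape is an addition that the SQL version does not have (there, a key containing regex metacharacters would be interpreted as a pattern).

```python
import re

def count_matches(strings, keys):
    """Count, per string and key, occurrences of key followed by a non-space char."""
    result = {}
    for s in strings:
        # Equivalent of the CROSS JOIN: every string is checked against every key
        result[s] = [
            # Equivalent of ARRAY_LENGTH(REGEXP_EXTRACT_ALL(str, CONCAT(key, r'[^\s]')))
            {"key": k, "matches": len(re.findall(re.escape(k) + r"\S", s))}
            for k in keys
        ]
    return result

all_matches = count_matches(
    ["foo1 foo foo40", "test1 test test2 test"],
    ["foo", "test"],
)
```

A bare "foo" or a trailing "test" is not counted, because nothing non-blank follows it, which is the meaning of "*" in the question.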

How to parse JSON in Standard SQL BigQuery?

After streaming some json data into BQ, we have a record that looks like:
"{\"Type\": \"Some_type\", \"Identification\": {\"Name\": \"First Last\"}}"
How would I extract the type from this? E.g. I would like to get Some_type.
I tried all possible combinations shown in https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions without success, namely, I thought:
SELECT JSON_EXTRACT_SCALAR(raw_json , "$[\"Type\"]") as parsed_type FROM `table` LIMIT 1000
is what I need. However, I get:
Invalid token in JSONPath at: ["Type"]
Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "{\"Type\": \"Some_type\", \"Identification\": {\"Name\": \"First Last\"}}" raw_json UNION ALL
SELECT 2, '{"Type": "Some_type", "Identification": {"Name": "First Last"}}'
)
SELECT id, JSON_EXTRACT_SCALAR(raw_json , "$.Type") AS parsed_type
FROM `project.dataset.table`
with result
Row id parsed_type
1 1 Some_type
2 2 Some_type
See the updated example below. Take a look at the third record, which I think mimics your case:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "{\"Type\": \"Some_type\", \"Identification\": {\"Name\": \"First Last\"}}" raw_json UNION ALL
SELECT 2, '''{"Type": "Some_type", "Identification": {"Name": "First Last"}}''' UNION ALL
SELECT 3, '''"{\"Type\": \"
null1\"}"
'''
)
SELECT id,
JSON_EXTRACT_SCALAR(REGEXP_REPLACE(raw_json, r'^"|"$', '') , "$.Type") AS parsed_type
FROM `project.dataset.table`
with result
Row id parsed_type
1 1 Some_type
2 2 Some_type
3 3 null1
Note: I use null1 instead of null so you can easily see that it is not a NULL but rather the string null1.
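
If the record really is a JSON document that was stored as a JSON string (double-encoded), the same idea can be illustrated in Python by decoding twice instead of stripping the outer quotes with a regex. This is a different technique from the REGEXP_REPLACE above and only a sketch under that assumption. The function name extract_type is hypothetical.

```python
import json

def extract_type(raw_json):
    """Return the top-level "Type" value, tolerating one extra layer of JSON encoding."""
    value = json.loads(raw_json)
    if isinstance(value, str):
        # The first decode produced a string, so the document was double-encoded
        value = json.loads(value)
    return value.get("Type")

plain = '{"Type": "Some_type", "Identification": {"Name": "First Last"}}'
wrapped = json.dumps(plain)  # simulates a record wrapped in quotes with escaped inner quotes

print(extract_type(plain))
print(extract_type(wrapped))
```

Both calls return the same value, since the second decode only happens when the first one yields a string.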

SQL query to get column names if it has specific value

I have a table in which each column holds a flag (like 'Y' or 'N'). I have to select the column names of a row if they have a specific value.
My Table:
Name|sub-1|sub-2|sub-3|sub-4|sub-5|sub-6|
-----------------------------------------
Tom | Y | | Y | Y | | Y |
Jim | Y | Y | | | Y | Y |
Ram | | Y | | Y | Y | |
So I need to find all the subs that have the 'Y' flag for a particular Name.
For example:
If I select Tom, I need to get the list of 'Y' column names in the query output.
Subs
____
sub-1
sub-3
sub-4
sub-6
Your help is much appreciated.
The problem is that your database model is not normalized. If it was properly normalized the query would be easy. So the workaround is to normalize the model "on-the-fly" to be able to make the query:
select col_name
from (
select name, sub_1 as val, 'sub_1' as col_name
from the_table
union all
select name, sub_2, 'sub_2'
from the_table
union all
select name, sub_3, 'sub_3'
from the_table
union all
select name, sub_4, 'sub_4'
from the_table
union all
select name, sub_5, 'sub_5'
from the_table
union all
select name, sub_6, 'sub_6'
from the_table
) t
where name = 'Tom'
and val = 'Y'
The above is standard SQL and should work on any (relational) DBMS.
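
One way to try the UNION ALL approach without a full database server is SQLite through Python's built-in sqlite3 module. The sketch below is only a stand-in for whatever DBMS you actually use. It recreates the sample table with underscore column names (sub_1 to sub_6, as in the query above), leaves unflagged cells as NULL, and builds the six UNION ALL branches in a loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table the_table "
    "(name text, sub_1 text, sub_2 text, sub_3 text, sub_4 text, sub_5 text, sub_6 text)"
)
conn.executemany(
    "insert into the_table values (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Tom", "Y", None, "Y", "Y", None, "Y"),
        ("Jim", "Y", "Y", None, None, "Y", "Y"),
        ("Ram", None, "Y", None, "Y", "Y", None),
    ],
)

# One "select name, <col> as val, '<col>' as col_name" branch per column
columns = [f"sub_{i}" for i in range(1, 7)]
branches = " union all ".join(
    f"select name, {c} as val, '{c}' as col_name from the_table" for c in columns
)
query = (
    f"select col_name from ({branches}) t "
    "where name = ? and val = 'Y' order by col_name"
)
subs = [row[0] for row in conn.execute(query, ("Tom",))]
print(subs)
```

Generating the branches from a column list keeps the query short to maintain, but the column names must come from a trusted source since they are interpolated into the SQL text.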
Below code works for me.
select t.subs from (select name, u.subs, u.val
from TableName s
unpivot
(
val
for subs in ([sub-1], [sub-2], [sub-3], [sub-4], [sub-5], [sub-6])
) u where u.val = 'Y') t
where t.name = 'Tom'
I am close to a solution. I can get the columns for all rows (I used just 2 columns here):
select col from ( select col, case s.col when 'sub-1' then sub-1 when 'sub-2' then sub-2 end AS val from mytable cross join ( select 'sub-1' AS col union all select 'sub-2' ) s ) s where val ='Y'
It gives the columns for all rows. I need the same data for a single row: if I select "Tom", I need the column names that have the 'Y' value.
I'm answering this under a few assumptions. The first is that you KNOW the names of the columns of the table in question. The second is that this is SQL Server. Oracle and MySQL have ways of performing this, but I don't know the syntax for them.
Anyway, what I'd do is perform an UNPIVOT on the data.
There are a lot of parentheses there, so to explain: the actual UNPIVOT statement (aliased as unpvt) takes the data and turns the columns into rows, and the SELECT associated with it provides the data that is returned. Here I kept 'Name', placed the column names under the 'Subs' column, and put the corresponding values into the 'Val' column. To be precise, I'm talking about this part of the code:
SELECT [Name], [Subs], [Val]
FROM
(SELECT [Name], [Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6]
FROM pvt) p
UNPIVOT
([Val] FOR [Subs] IN
([Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6])
) AS unpvt
My next step was to make that a sub-select in which I could filter for the specific name and val being looked for. That leaves you with a SQL statement along these lines:
SELECT [Name], [Subs], [Val]
FROM (
SELECT [Name], [Subs], [Val]
FROM
(SELECT [Name], [Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6]
FROM pvt) p
UNPIVOT
([Val] FOR [Subs] IN
([Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6])
) AS unpvt
) AS pp
WHERE 1 = 1
AND pp.[Val] = 'Y'
AND pp.[Name] = 'Tom'
select col from (
select col,
case s.col
when 'sub-1' then sub-1
when 'sub-2' then sub-2
when 'sub-3' then sub-3
when 'sub-4' then sub-4
when 'sub-5' then sub-5
when 'sub-6' then sub-6
end AS val
from mytable
cross join
(
select 'sub-1' AS col union all
select 'sub-2' union all
select 'sub-3' union all
select 'sub-4' union all
select 'sub-5' union all
select 'sub-6'
) s on name="Tom"
) s
where val ='Y'
This version includes the join condition that restricts the result to a single name:
on name="Tom"