SQL Query validation failure on GCP BigQuery with github_repos dataset - sql

I would like to get a list of all unique repositories on GitHub by using the following command:
SELECT DISTINCT repo_name FROM `bigquery-public-data.github_repos.commits`
However I get the following error:
Column repo_name of type ARRAY cannot be used in SELECT DISTINCT at [1:17]
In the schema it says repo_name is of type STRING, what am I doing wrong?

repo_name is defined as a "string" with mode "repeated" in the table schema which roughly means an ARRAY of STRING in BigQuery.
https://cloud.google.com/bigquery/docs/nested-repeated
What does REPEATED field in Google Bigquery mean?

As another user posted, in the schema of the bigquery-public-data.github_repos.commits table you can see that the repo_name field is defined as a STRING REPEATED which means that each entry of repo_name is an array constituted by string-type elements. You can see this with the following query:
#standardSQL
SELECT repo_name
FROM `bigquery-public-data.github_repos.commits`
LIMIT 100;
To find the distinct repository names, you can use the UNNEST operator to expand the elements of each repo_name array. The following query performs a CROSS JOIN with the unnested array, which adds a new column repo_name_unnest holding the individual repository names. DISTINCT can then be applied to that column.
#standardSQL
SELECT DISTINCT repo_name_unnest
FROM `bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest;
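To make the effect of UNNEST concrete, here is a small Python model of what the query above computes. The sample rows and the helper name distinct_unnested are made up for illustration; this only mimics the semantics and does not talk to BigQuery.

```python
# Hypothetical rows: each commit carries a list of repo names,
# which is how a REPEATED STRING column behaves.
rows = [
    {"commit": "c1", "repo_name": ["noondaysun/sakai", "OpenCollabZA/sakai"]},
    {"commit": "c2", "repo_name": ["noondaysun/sakai"]},
    {"commit": "c3", "repo_name": []},
]

def distinct_unnested(rows, field):
    """Model of SELECT DISTINCT x FROM t CROSS JOIN UNNEST(t.field) AS x."""
    seen = set()
    for row in rows:
        # UNNEST yields one output row per array element.
        for element in row[field]:
            seen.add(element)
    return seen

unique_repos = distinct_unnested(rows, "repo_name")
print(sorted(unique_repos))
```

Note that a row whose array is empty contributes nothing to the result, which matches the behaviour of CROSS JOIN UNNEST.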

You can use the below query
SELECT
commit
, repo_name
FROM
`bigquery-public-data.github_repos.commits`,
UNNEST(repo_name) as repo_name
WHERE
commit = 'c87298e36356ac19519a93dee3dfac8ebffe45e8'
Which will give a result like below
Row | commit | repo_name
===================================================================
1 | c87298e36356ac19519a93dee3dfac8ebffe45e8 | noondaysun/sakai
2 | c87298e36356ac19519a93dee3dfac8ebffe45e8 | OpenCollabZA/sakai

Related

BigQuery SQL for JSON field returns no data

Yet again I find myself flummoxed by my SQL against a JSON field in BigQuery.
This is the contents of a field called json_data - https://storage.googleapis.com/greyrock_storage/misc/freepik.json
The record has an id of 1675816490
This is my SQL:
SELECT
##JSON_EXTRACT(json_data, '$data.resources.boost.url_source') AS url_source,
JSON_VALUE(boost, "$.url_source") AS url_source,
FROM `my database` ,
UNNEST(JSON_QUERY_ARRAY(json_data.data)) AS data,
UNNEST(JSON_QUERY_ARRAY(data.resources)) AS resources,
UNNEST(JSON_QUERY_ARRAY(resources.boost)) AS boost
WHERE
id = 1675816490
I expected to see a list of all the values in the record for data.resources.boost.url_source BUT it returns 'There is no data to display.'
Try it like this:
SELECT JSON_VALUE(boosts.url_source) AS url_source
FROM `my database` AS a
CROSS JOIN UNNEST(JSON_QUERY_ARRAY(a.json_data.data.resources.boost)) AS boosts

JSON Extract JSON in Metabase SQL

I have this table:
id | status | outgoing
1 | paid | {"a945248027_14454878":"processing"}
2 | unpaid | {"old.a945248027_14454878":"cancelled"}
I am trying to extract the value after the underscore, i.e. 14454878.
I tried extracting the keys using this query on Metabase:
select id, outgoing,
substring(key from '_([^_]+)$') as key
from table,
cross join lateral jsonb_object_keys(outgoing) as j(key);
but I keep getting the error:
ERROR: function jsonb_object_keys(json) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 129
Please help
The column is defined as json, but I used a function that expects jsonb.
So I changed jsonb_object_keys() to json_object_keys():
select id, outgoing,
substring(key from '_([^_]+)$') as key
from table
cross join lateral json_object_keys(outgoing) as j(key);
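As a quick way to check what the pattern '_([^_]+)$' captures, here is a Python sketch that applies the same regular expression to the keys of the two sample values from the question. The helper name key_suffixes is hypothetical, and Python's re module stands in for PostgreSQL's substring(... from ...), so treat it as an illustration of the pattern rather than of the database function.

```python
import json
import re

# Same pattern as in the SQL: capture everything after the last underscore.
SUFFIX = re.compile(r"_([^_]+)$")

def key_suffixes(outgoing_json):
    """Return the part after the last underscore for each top-level key."""
    result = []
    for key in json.loads(outgoing_json).keys():
        match = SUFFIX.search(key)
        # Mirror the SQL behaviour of yielding NULL when nothing matches.
        result.append(match.group(1) if match else None)
    return result

print(key_suffixes('{"a945248027_14454878":"processing"}'))
print(key_suffixes('{"old.a945248027_14454878":"cancelled"}'))
```

Both sample rows yield 14454878, because the pattern anchors at the end of the key and ignores any prefix such as "old.".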

Bigquery UDF to repeat queries. Error : Scalar subquery cannot have more than one column

I am trying to get unique values from multiple columns, but since each column is an array I can't directly do DISTINCT on all columns. I am using UNNEST() for each column and performing a UNION ALL for each column.
My idea is to create a UDF so that I can simply give the column name each time instead of performing the select every time.
I would like to replace this Query with a UDF since there are many feature columns and I need to do many UNION ALL.
SELECT DISTINCT user_log as unique_value,
'user_log' as feature
FROM `my_table`
left join UNNEST(user_Log) AS user_log
union all
SELECT DISTINCT page_name as unique_value,
'user_login_page_name' as feature
FROM `my_table`
left join UNNEST(PageName) AS page_name
order by feature;
My UDF
CREATE TEMP FUNCTION get_uniques(feature_name ARRAY<STRING>, feature STRING)
AS (
(SELECT DISTINCT feature as unique_value,
'feature' as feature
FROM `my_table`
left join UNNEST(feature_name) AS feature));
SELECT get_uniques(user_Log, log_feature);
However, the UDF to select the column doesn't really work and gives the error:
Scalar subquery cannot have more than one column unless using SELECT AS STRUCT to build STRUCT values; failed to parse CREATE [TEMP] FUNCTION statement at [8:1]
There is probably a better way of doing this. Appreciate your help.
From your description, this is what you are trying to achieve:
My idea is to create a UDF so that i can simply give the column name each time instead of performing the select every time.
One approach is to use FORMAT in combination with EXECUTE IMMEDIATE to build your custom query and get the desired output.
The example below shows a function that uses FORMAT to return a custom query string, and EXECUTE IMMEDIATE to run the combined query and retrieve its output. I'm using a public dataset so you can also try it out on your side:
CREATE TEMP FUNCTION GetUniqueValues(table_name STRING, col_name STRING, nest_col_name STRING)
AS (format("SELECT DISTINCT %s.%s as unique_val,'%s' as featured FROM %s ", col_name,nest_col_name,col_name,table_name));
EXECUTE IMMEDIATE (
select CONCAT(
(SELECT GetUniqueValues('bigquery-public-data.github_repos.commits','Author','name'))
,' union all '
,(SELECT GetUniqueValues('bigquery-public-data.github_repos.commits','Committer','name'))
,' limit 100'))
output
Row | unique_val | featured
1 | Sergio Garcia Murillo | Committer
2 | klimek | Committer
3 | marclaporte#gmail.com | Committer
4 | acoul | Committer
5 | knghtbrd | Committer
... | ... | ...
100 | Gustavo Narea | Committer
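To see the query text that EXECUTE IMMEDIATE receives, here is a Python sketch that mirrors the FORMAT call and the CONCAT from the example. The function name get_unique_values is a hypothetical stand-in for the temp function, and Python's % formatting is assumed to behave like FORMAT for plain %s placeholders.

```python
def get_unique_values(table_name, col_name, nest_col_name):
    """Mirror of the FORMAT call inside the GetUniqueValues temp function."""
    return "SELECT DISTINCT %s.%s as unique_val,'%s' as featured FROM %s " % (
        col_name, nest_col_name, col_name, table_name)

TABLE = "bigquery-public-data.github_repos.commits"

# Same pieces, in the same order, as the CONCAT in the SQL example.
query = (get_unique_values(TABLE, "Author", "name")
         + " union all "
         + get_unique_values(TABLE, "Committer", "name")
         + " limit 100")
print(query)
```

Because the column and table names are spliced directly into the SQL text, this pattern should only be fed trusted identifiers.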

add table_id to the result from multiple tables in BigQuery

Below is how I structured the data in BigQuery database.
test
-> sales
-> monthly-2015
-> monthly-2016
-> ...
I want to combine the data of all tables with the table name monthly-*, and below is the SQL I wrote based on examples I found.
Running this SQL leads to the following error: Scalar subquery produced more than one element. How can I fix this error?
SELECT
*,
(
SELECT
table_id
FROM
`test.sales.__TABLES_SUMMARY__`
WHERE
table_id LIKE 'monthly-%')
FROM
`test.sales.monthly*`
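I want to combine the data of all tables with the table name monthly-*
Try the following, which uses the _TABLE_SUFFIX pseudo-column of the wildcard table to rebuild each table name:
SELECT *, 'monthly-' || _TABLE_SUFFIX as table_name
FROM `test.sales.monthly-*`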
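As a rough model of how a wildcard table and _TABLE_SUFFIX work together, the Python sketch below tags each row with the name of the table it came from. The table contents, the amount column, and the helper name wildcard_rows are invented for illustration; the prefix is whatever precedes the * in the wildcard.

```python
# Hypothetical dataset contents: table_id -> list of rows.
tables = {
    "monthly-2015": [{"amount": 10}, {"amount": 20}],
    "monthly-2016": [{"amount": 30}],
    "yearly-2016": [{"amount": 99}],
}

def wildcard_rows(tables, prefix):
    """Model of SELECT *, prefix || _TABLE_SUFFIX FROM `prefix*`."""
    out = []
    for table_id, rows in tables.items():
        if not table_id.startswith(prefix):
            continue  # table is not matched by the wildcard
        suffix = table_id[len(prefix):]  # what _TABLE_SUFFIX would hold
        for row in rows:
            out.append({**row, "table_name": prefix + suffix})
    return out

result = wildcard_rows(tables, "monthly-")
for row in result:
    print(row)
```

Each row gets exactly one table name, which is why this avoids the "more than one element" problem of the scalar subquery.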

Hive - getting the column names count of a table

How can I get the number of columns of a Hive table using HQL? I know we can use describe tablename to get the names of the columns. How do we get the count?
create table mytable(i int,str string,dt date, ai array<int>,strct struct<k:int,j:int>);
select count(*)
from (select transform ('')
using 'hive -e "desc mytable"'
as col_name,data_type,comment
) t
;
5
Some additional playing around:
create table mytable (id int,first_name string,last_name string);
insert into mytable values (1,'Dudu',null);
select size(array(*)) from mytable limit 1;
This is not bulletproof, since not all combinations of column types can be combined into an array.
It also requires that the table contains at least 1 row.
Here is a more complex solution that is more robust with respect to column types, but it also requires that the table contains at least 1 row:
select size(str_to_map(val)) from (select transform (struct(*)) using 'sed -r "s/.(.*)./\1/"' as val from mytable) t;