BigQuery: Query to GroupBy Array Column - google-bigquery

I have two columns in a BigQuery table:
1. url
2. tags
URL is a single value, and TAGS is an array (example below):
row | URL   | TAGS
1   | x.com | donkey
    |       | donkey
    |       | lives
    |       | here
How can I group by TAGS array in BigQuery?
What's the trick to get the following query working?
SELECT TAGS FROM `URL_TAGS_TABLE`
group by unnest(TAGS)
I have tried GROUP BY TO_JSON_STRING, but it does not give me the desired results.
I'd like to see the following output:
x.com | donkey | count 2
x.com | lives | count 1
x.com | here | count 1

The query below is for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'x.com' url, ['donkey','donkey','lives','here'] tags UNION ALL
SELECT 'y.com' url, ['abc','xyz','xyz','xyz'] tags
)
SELECT url, tag, COUNT(1) AS `count`
FROM `project.dataset.table`, UNNEST(tags) tag
GROUP BY url, tag
with the result:
Row | url   | tag    | count
1   | x.com | donkey | 2
2   | x.com | lives  | 1
3   | x.com | here   | 1
4   | y.com | abc    | 1
5   | y.com | xyz    | 3
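To make the logic of the query above concrete: UNNEST turns each array element into its own (url, tag) row, and GROUP BY then counts identical pairs. Below is a minimal Python sketch of those two steps, using the sample data from the WITH clause. The function name `count_tags_per_url` is made up for illustration and is not part of BigQuery.

```python
from collections import Counter

# Sample rows mirroring the WITH clause above: (url, tags)
rows = [
    ("x.com", ["donkey", "donkey", "lives", "here"]),
    ("y.com", ["abc", "xyz", "xyz", "xyz"]),
]


def count_tags_per_url(rows):
    """Flatten each (url, tags) row into (url, tag) pairs, then count each pair.

    The generator plays the role of UNNEST; Counter plays the role of
    GROUP BY url, tag with COUNT(1).
    """
    return Counter((url, tag) for url, tags in rows for tag in tags)


counts = count_tags_per_url(rows)
for (url, tag), n in counts.items():
    print(url, tag, n)
```

Duplicates inside one array are counted separately, which is why ('x.com', 'donkey') ends up with 2.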

Related

How can I aggregate values inside a jsonb array using SQL

My data looks something like this:
tags             | fullName
-----------------+---------
["tag1", "tag2"] | John
["tag3", "tag1"] | Jane
["tag1", "tag3"] | Bob
tags is of type jsonb and fullName is text, in a Postgres database.
What I'm struggling to do is create a view such as:
tags | count
-----+------
tag1 | 3
tag2 | 1
tag3 | 2
You may use the jsonb_array_elements function to expand each array into one row per element before grouping and counting.
See the example below.
SELECT
tag_names as tag,
COUNT(1) as count
FROM (
SELECT
jsonb_array_elements(tags) as tag_names
FROM
my_table
) t
GROUP BY tag_names;
tag  | count
-----+------
tag1 | 3
tag2 | 1
tag3 | 2
Or shorter
SELECT
jsonb_array_elements(tags) as tag,
COUNT(1) as count
FROM
my_table
GROUP BY
tag

Count string occurrences within a list column - Snowflake/SQL

I have a table with a column that contains a list of strings like below:
EXAMPLE:
STRING User_ID [...]
"[""null"",""personal"",""Other""]" 2122213 ....
"[""Other"",""to_dos_and_thing""]" 2132214 ....
"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]" 4342323 ....
QUESTION:
I want to get a count of the number of times each unique string appears (the strings are separated by commas within the STRING column), but I only know how to do the following:
SELECT u.STRING, count(u.User_ID) as cnt
FROM table u
group by u.STRING
order by cnt desc;
However, the above method doesn't work, as it only counts the number of user IDs that use a specific grouping of strings.
The ideal output using the example above would look like this:
DESIRED OUTPUT:
STRING COUNT_Instances
"null" 1223
"personal" 543
"Other" 324
"to_dos_and_thing" 221
"getting_things_done" 146
"Work!!!!!" 22
Based on your description, here is my sample table:
create table u (user_id number, string varchar);
insert into u values
(2122213, '"[""null"",""personal"",""Other""]"'),
(2132214, '"[""Other"",""to_dos_and_thing""]"'),
(2132215, '"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]"' );
I used SPLIT_TO_TABLE to split each string as a row, and then REGEXP_SUBSTR to clean the data. So here's the query and output:
select REGEXP_SUBSTR( s.VALUE, '""(.*)""', 1, 1, 'i', 1 ) extracted, count(*)
from u,
lateral SPLIT_TO_TABLE( string , ',' ) s
GROUP BY extracted
order by count(*) DESC;
+---------------------+----------+
| EXTRACTED | COUNT(*) |
+---------------------+----------+
| Other | 2 |
| null | 1 |
| personal | 1 |
| to_dos_and_thing | 1 |
| getting_things_done | 1 |
| TO_dos_and_thing | 1 |
| Work!!!!! | 1 |
+---------------------+----------+
SPLIT_TO_TABLE https://docs.snowflake.com/en/sql-reference/functions/split_to_table.html
REGEXP_SUBSTR https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
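The split-then-extract idea above can be tried outside Snowflake as well. The following Python sketch mirrors it on the same three sample rows: split on commas, pull out the text between the doubled quotes with the same regular expression, and count. The function name `count_strings` and the `fold_case` option are illustrative additions, not Snowflake features; `fold_case` shows one possible way to merge "TO_dos_and_thing" with "to_dos_and_thing" if that is wanted.

```python
import re
from collections import Counter

# Same sample rows as the INSERT statement above: (user_id, string)
rows = [
    (2122213, '"[""null"",""personal"",""Other""]"'),
    (2132214, '"[""Other"",""to_dos_and_thing""]"'),
    (2132215, '"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]"'),
]


def count_strings(rows, fold_case=False):
    """Count each extracted string across all rows.

    raw.split(",") stands in for SPLIT_TO_TABLE, and the regex stands in
    for REGEXP_SUBSTR with the pattern '""(.*)""'.
    """
    counts = Counter()
    for _user_id, raw in rows:
        for part in raw.split(","):
            match = re.search(r'""(.*)""', part)
            if match:
                value = match.group(1)
                counts[value.lower() if fold_case else value] += 1
    return counts


exact = count_strings(rows)
folded = count_strings(rows, fold_case=True)
print(exact.most_common())
print(folded.most_common())
```

Like the SQL version, this assumes no individual string contains a comma; if one did, it would be split into two pieces.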

Aggregate rows according to JSON array content

I have a PostgreSQL table with JSON tags, which are always strings stored in a JSON array:
id | tags (json)
--------------------------
1 | ["tag1", "tag2"]
2 | ["tag12", "tag2"]
122 | []
I would like to query, for instance, the count of entries in the table containing each tag.
For instance, I'd like to get:
tag | count
--------------------------
tag1 | 1
tag2 | 2
tag12 | 1
I tried
SELECT tags::text AS tag, COUNT(id) AS cnt FROM my_table GROUP BY tag;
but it does not work, since it gives
tag | cnt
--------------------------
["tag1", "tag2"] | 1
["tag12", "tag2"] | 1
I guess I need to get the list of all tags in an inner query, and then for each tag count the rows that contain it, but I can't find how to do that. Can you help me with that?
Use json[b]_array_elements_text() and a lateral join to unnest the array:
select x.tag, count(*) cnt
from my_table t
cross join lateral json_array_elements_text(t.tags) as x(tag)
group by x.tag
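One detail worth checking: the query above counts tag occurrences, which equals the number of rows containing the tag only as long as no array repeats a tag (in SQL, count(distinct t.id) would guard against repeats). The Python sketch below illustrates the "rows containing each tag" reading on the sample data from the question. The function name `rows_per_tag` is hypothetical.

```python
import json
from collections import Counter

# Sample rows from the question: (id, tags as JSON text)
rows = [
    (1, '["tag1", "tag2"]'),
    (2, '["tag12", "tag2"]'),
    (122, '[]'),
]


def rows_per_tag(rows):
    """Count, for each tag, how many rows contain it at least once."""
    counts = Counter()
    for _row_id, tags_json in rows:
        # set() removes repeats within one row, so a row counts once per tag
        counts.update(set(json.loads(tags_json)))
    return counts


tag_counts = rows_per_tag(rows)
print(dict(tag_counts))
```

The row with the empty array contributes nothing, just as the lateral join produces no rows for it.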

Select all rows that have at least a list of features with wildcard support

Given the following table definitions:
Objects:
obj_id | obj_name
-------|--------------
1 | object1
2 | object2
3 | object3
Tags:
tag_id | tag_name
-------|--------------
1 | code:python
2 | code:cpp
3 | color:green
4 | colorful
5 | image
objects_tags:
obj_id | tag_id
-------|---------
1 | 1
1 | 2
2 | 1
2 | 3
3 | 1
3 | 2
3 | 3
I'd like to select objects that contain all tags from a given list with wildcards. Similar questions have been asked several times, and the answer to the simpler variant looks more or less like this:
SELECT obj_id, count(*) c FROM objects_tags
INNER JOIN objects USING(obj_id)
INNER JOIN tags USING(tag_id)
WHERE (tag_name GLOB 'code*' OR tag_name GLOB 'color*')
GROUP BY obj_id
HAVING (c==2)
However, this solution doesn't work with wildcards. Is it possible to create a similar query that returns only the objects for which each given wildcard pattern matches at least 1 tag? Checking c>=2 doesn't work, because one wildcard can match multiple tags while another matches 0, so the object still passes even though it shouldn't.
I considered having the client software build a dynamic query consisting of N INTERSECTs (one per tag), because there probably won't be many of them, but it sounds like a really dirty solution, and if there's a more SQL way I'd prefer to use it.
SQLite supports the WITH clause, so I would use it to determine all matching tags first, and then use these tags to find the objects, as follows.
The example (demo) is written for PostgreSQL because I could not run SQLite on any online tester, but I believe you can convert it to SQLite easily.
This query retrieves all matching tags:
WITH tagss AS (
SELECT * FROM Tags
WHERE tag_name LIKE 'code:%' OR tag_name LIKE 'color:%'
)
SELECT * FROM tagss;
| tag_id | tag_name |
|--------|-------------|
| 1 | code:python |
| 2 | code:cpp |
| 3 | color:green |
and the final query uses the above subquery in this way:
WITH tagss AS (
SELECT * FROM Tags
WHERE tag_name LIKE 'code:%' OR tag_name LIKE 'color:%'
)
SELECT obj_id,count(*) c
FROM objects_tags
INNER JOIN tagss USING(tag_id)
WHERE tag_name IN ( SELECT tag_name FROM tagss)
GROUP BY obj_id
HAVING count(*) >= (
SELECT count(*) FROM tagss
)
| obj_id | c |
|--------|---|
| 3 | 3 |
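Note that the query above keeps only objects that carry every tag matched by any pattern, which is why only object 3 is returned. Object 2 (code:python, color:green) also has at least one tag per pattern. If "at least one tag per wildcard" is the goal, one possible variation is to store the patterns in a table and count the distinct patterns each object matches. The sketch below runs that idea in SQLite through Python's sqlite3 module. The `patterns` table and the function name `objects_matching_all` are assumptions made for this example.

```python
import sqlite3


def objects_matching_all(patterns):
    """Return obj_ids that have at least one tag matching every GLOB pattern."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
        CREATE TABLE objects_tags (obj_id INTEGER, tag_id INTEGER);
        CREATE TABLE patterns (pattern TEXT PRIMARY KEY);  -- hypothetical helper table
        INSERT INTO tags VALUES
            (1, 'code:python'), (2, 'code:cpp'), (3, 'color:green'),
            (4, 'colorful'), (5, 'image');
        INSERT INTO objects_tags VALUES
            (1, 1), (1, 2), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3);
    """)
    conn.executemany(
        "INSERT INTO patterns VALUES (?)", [(p,) for p in set(patterns)]
    )
    rows = conn.execute("""
        SELECT ot.obj_id
        FROM objects_tags ot
        JOIN tags t ON t.tag_id = ot.tag_id
        JOIN patterns p ON t.tag_name GLOB p.pattern
        GROUP BY ot.obj_id
        -- every pattern must be matched by at least one of the object's tags
        HAVING COUNT(DISTINCT p.pattern) = (SELECT COUNT(*) FROM patterns)
        ORDER BY ot.obj_id
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(objects_matching_all(["code*", "color*"]))
```

Counting distinct patterns rather than rows means an object with three tags matching 'code*' and none matching 'color*' is still rejected.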

Join multiple rows in a table by field value

I have a company table with rows like this:
id(int) |name(string) |maincategory(int) |subcategory(string)
1       |Google       |1                 |1,2,3
2       |yahoo        |4                 |4,1
and another table, category, like this:
id(int) |name(string)
1       |Search
2       |Email
3       |Image
4       |Video
I want to join the two tables on company.subcategory = category.id.
Is this possible in SQL?
Start by splitting your subcategory column. In the end you should have an additional company_category table with company_id and category_id as columns.
company_id(int) |category_id(int)
1 |1
1 |2
1 |3
2 |4
2 |1
Your design is invalid. You should have another table called companySubcategories or something like that.
This table should have two columns, companyId and categoryId.
Then your select would look like this:
select <desired fields> from
company c
join companySubcategories cs on cs.companyId = c.id
join category ct on ct.id = cs.categoryId
You can do it like below:
select * from
company c, category cc
where ',' || c.subcategory || ',' like '%,' || cc.id || ',%';
Wrapping both sides in commas keeps an ID such as 1 from matching inside 12. It works as expected in an Oracle database.
You could introduce a new table company_subcategory to keep track of subcategories
id (int) | subcategory(int)
1 | 1
1 | 2
1 | 3
2 | 1
2 | 4
Then you would be able to run a select such as
select company.name AS company, category.name AS category
FROM company
JOIN company_subcategory
ON company.id = company_subcategory.id
JOIN category
ON company_subcategory.subcategory = category.id;
to get
+---------+----------+
| company | category |
+---------+----------+
| google | search |
| google | email |
| google | image |
| yahoo | search |
| yahoo | video |
+---------+----------+
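For an existing table, the link table can be filled by splitting the comma-separated column once. Below is a rough sketch of that migration followed by the join, run against SQLite through Python's sqlite3 module with the sample data from the question. The function name `company_categories` is made up for illustration, and the link table uses the columns id and subcategory as listed above.

```python
import sqlite3


def company_categories():
    """Split company.subcategory into link rows, then join through them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE company (
            id INTEGER PRIMARY KEY, name TEXT,
            maincategory INTEGER, subcategory TEXT
        );
        CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE company_subcategory (id INTEGER, subcategory INTEGER);
        INSERT INTO company VALUES (1, 'Google', 1, '1,2,3'), (2, 'yahoo', 4, '4,1');
        INSERT INTO category VALUES (1, 'Search'), (2, 'Email'), (3, 'Image'), (4, 'Video');
    """)
    # One (company id, category id) pair per element of the comma-separated list
    pairs = [
        (company_id, int(part))
        for company_id, subcategory in conn.execute(
            "SELECT id, subcategory FROM company"
        )
        for part in subcategory.split(",")
    ]
    conn.executemany("INSERT INTO company_subcategory VALUES (?, ?)", pairs)
    rows = conn.execute("""
        SELECT company.name, category.name
        FROM company
        JOIN company_subcategory ON company.id = company_subcategory.id
        JOIN category ON company_subcategory.subcategory = category.id
        ORDER BY company.id, category.id
    """).fetchall()
    conn.close()
    return rows


for company, category in company_categories():
    print(company, category)
```

Once the link table exists, the subcategory text column in company is redundant and could be dropped.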
SELECT *
FROM COMPANY CMP, CATEGORY CT
WHERE (SELECT CASE
WHEN INSTR(',' || CMP.SUB_CATEGORY || ',', ',' || CT.ID || ',') > 0 THEN
'TRUE'
ELSE
'FALSE'
END
FROM DUAL) = 'TRUE'
This query looks for the ID in the SUB_CATEGORY, using the INSTR function.
In case it does exist, the row is returned.
The output is as below
ID NAME MAIN_CATEGORY SUB_CATEGORY ID NAME
1 Google 1 1,2,3 1 Search
1 Google 1 1,2,3 2 Email
1 Google 1 1,2,3 3 Image
2 yahoo 2 4,1 1 Search
2 yahoo 2 4,1 4 Video
Hope it helps.
However, I suggest you avoid this kind of column: each ID should be stored as a separate entry rather than combined into one value. It may create problems in the future, so it is better to avoid it now.
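One concrete problem with searching for an ID inside a comma-separated string: a plain substring search produces false positives as soon as IDs have more than one digit, because 1 is a substring of 12. Wrapping both the list and the ID in delimiters avoids this. A small Python illustration follows; the function names `substring_match` and `delimited_match` are hypothetical.

```python
def substring_match(subcategory, category_id):
    """Plain substring test: what a bare INSTR or LIKE '%id%' amounts to."""
    return str(category_id) in subcategory


def delimited_match(subcategory, category_id):
    """Wrap both sides in commas so only whole list elements can match."""
    return f",{category_id}," in f",{subcategory},"


# ID 1 is not in the list "12,3", but the plain substring test says it is
print(substring_match("12,3", 1))
print(delimited_match("12,3", 1))
print(delimited_match("1,2,3", 2))
```

This only patches the lookup; a separate link table still avoids the issue altogether and lets the database use indexes.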