How to count the frequency of elements in a BigQuery array field - google-bigquery

I have a table that looks like this:
I am looking for a table that gives a frequency count of the elements in the fields l_0, l_1, l_2, l_3.
For example the output should look like this:
| author_id | year | l_0.name | l_0.count | l_1.name | l_1.count | l_2.name | l_2.count | l_3.name | l_3.count |
| 2164089123 | 1987 | biology | 3 | botany | 3 | | | | |
| 2595831531 | 1987 | computer science | 2 | simulation | 2 | computer simulation | 2 | mathematical model | 2 |
Edit:
In some cases the array field might have more than one type of element. For example, l_0 could be ['biology', 'biology', 'geometry', 'geometry']. In that case, each of the output fields l_0, l_1, l_2, l_3 would be a nested repeated field, with all the distinct elements in l_0.name and their corresponding counts in l_0.count.

This should work, assuming you want to count on a per-array basis:
SELECT
  author_id,
  year,
  (SELECT AS STRUCT ANY_VALUE(l_0) AS name, COUNT(*) AS count
   FROM UNNEST(l_0) AS l_0) AS l_0,
  (SELECT AS STRUCT ANY_VALUE(l_1) AS name, COUNT(*) AS count
   FROM UNNEST(l_1) AS l_1) AS l_1,
  (SELECT AS STRUCT ANY_VALUE(l_2) AS name, COUNT(*) AS count
   FROM UNNEST(l_2) AS l_2) AS l_2,
  (SELECT AS STRUCT ANY_VALUE(l_3) AS name, COUNT(*) AS count
   FROM UNNEST(l_3) AS l_3) AS l_3
FROM YourTable;
To avoid so much repetition, you can make use of a SQL UDF:
CREATE TEMP FUNCTION GetNameAndCount(elements ARRAY<STRING>) AS (
(SELECT AS STRUCT ANY_VALUE(elem) AS name, COUNT(*) AS count
FROM UNNEST(elements) AS elem)
);
SELECT
author_id,
year,
GetNameAndCount(l_0) AS l_0,
GetNameAndCount(l_1) AS l_1,
GetNameAndCount(l_2) AS l_2,
GetNameAndCount(l_3) AS l_3
FROM YourTable;
If you potentially need to account for multiple different names within an array, you can have the UDF return an array of them with associated counts instead:
CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
  ARRAY(
    SELECT AS STRUCT elem AS name, COUNT(*) AS count
    FROM UNNEST(elements) AS elem
    GROUP BY elem
    ORDER BY count
  )
);
SELECT
author_id,
year,
GetNamesAndCounts(l_0) AS l_0,
GetNamesAndCounts(l_1) AS l_1,
GetNamesAndCounts(l_2) AS l_2,
GetNamesAndCounts(l_3) AS l_3
FROM YourTable;
Note that if you want to perform counting across rows, however, you'll need to unnest the arrays and perform the GROUP BY at the outer level, but it doesn't look like this is your intention based on the question.
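The difference between counting per array and counting across rows can be illustrated outside BigQuery. The following is a minimal Python sketch, not BigQuery code: the sample rows are made up for illustration, only the l_0 field is shown, and `names_and_counts` is a hypothetical helper that mimics what GetNamesAndCounts returns.

```python
from collections import Counter

# Hypothetical sample rows: (author_id, year, l_0)
rows = [
    (2164089123, 1987, ["biology", "biology", "biology"]),
    (2595831531, 1987, ["computer science", "computer science"]),
    (2595831531, 1988, ["biology", "geometry", "geometry"]),
]

def names_and_counts(elements):
    """Count names within one array, ordered by count (ascending)."""
    counts = Counter(elements)
    return sorted(counts.items(), key=lambda item: item[1])

# Per-array counting: one list of (name, count) pairs for each row.
per_row = [(author_id, year, names_and_counts(l_0)) for author_id, year, l_0 in rows]

# Across-row counting: flatten (unnest) every array first, then group once.
across_rows = Counter(elem for _, _, l_0 in rows for elem in l_0)
```

In the per-array result, 'biology' is counted separately for each row; in the across-row result, it is counted once over the whole table.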

Related

Count string occurrences within a list column - Snowflake/SQL

I have a table with a column that contains a list of strings like below:
EXAMPLE:
STRING User_ID [...]
"[""null"",""personal"",""Other""]" 2122213 ....
"[""Other"",""to_dos_and_thing""]" 2132214 ....
"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]" 4342323 ....
QUESTION:
I want to get a count of the number of times each unique string appears (the strings within the STRING column are separated by commas), but I only know how to do the following:
SELECT u.STRING, count(u.USER_ID) as cnt
FROM table u
group by u.STRING
order by cnt desc;
However the above method doesn't work as it only counts the number of user ids that use a specific grouping of strings.
The ideal output using the example above would look like this:
DESIRED OUTPUT:
STRING COUNT_Instances
"null" 1223
"personal" 543
"Other" 324
"to_dos_and_thing" 221
"getting_things_done" 146
"Work!!!!!" 22
Based on your description, here is my sample table:
create table u (user_id number, string varchar);
insert into u values
(2122213, '"[""null"",""personal"",""Other""]"'),
(2132214, '"[""Other"",""to_dos_and_thing""]"'),
(2132215, '"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]"' );
I used SPLIT_TO_TABLE to split each string as a row, and then REGEXP_SUBSTR to clean the data. So here's the query and output:
select REGEXP_SUBSTR( s.VALUE, '""(.*)""', 1, 1, 'i', 1 ) extracted, count(*)
from u,
lateral SPLIT_TO_TABLE( string , ',' ) s
GROUP BY extracted
order by count(*) DESC;
+---------------------+----------+
| EXTRACTED | COUNT(*) |
+---------------------+----------+
| Other | 2 |
| null | 1 |
| personal | 1 |
| to_dos_and_thing | 1 |
| getting_things_done | 1 |
| TO_dos_and_thing | 1 |
| Work!!!!! | 1 |
+---------------------+----------+
SPLIT_TO_TABLE https://docs.snowflake.com/en/sql-reference/functions/split_to_table.html
REGEXP_SUBSTR https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html

How do I list the values being counted?

I use an aggregate function to count the distinct values per ID and pick the ID with the most (let's say 5). I now want to list the distinct values that were counted in a new column, but I'm struggling with how to do that. Can I even do that? I'm using PostgreSQL.
SELECT IDs,
COUNT(DISTINCT people) AS num_people
FROM class
GROUP BY IDs
ORDER BY COUNT(DISTINCT people) desc
LIMIT 1
Current Sample Result:
-------------------------------------
| **IDs** | **num_people** |
-------------------------------------
| Aabbcc | 5 |
-------------------------------------
I want this result with the new column at the end. (It could be separate rows too; it does not have to be all in one row, but that would be ideal.)
-----------------------------------------------------------------------
| **IDs** | **num_people** | **people_listed** |
-----------------------------------------------------------------------
| Aabbcc | 5 | Coco, Riley, Allan, Betty, Cici |
-----------------------------------------------------------------------
You could use the aggregate function ARRAY_AGG or STRING_AGG for that:
SELECT IDs,
COUNT(DISTINCT people) AS num_people,
STRING_AGG(DISTINCT people, ', ') AS people_listed
FROM class
GROUP BY IDs
ORDER BY COUNT(DISTINCT people) desc
LIMIT 1
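The same pattern, a distinct count next to a list of the distinct values, can be tried locally. Below is a minimal sketch using Python's sqlite3 module; it is SQLite, not PostgreSQL, so `group_concat(DISTINCT ...)` stands in for STRING_AGG and joins with a plain comma. The sample rows (including the second ID 'Ddeeff') are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (IDs TEXT, people TEXT)")
conn.executemany(
    "INSERT INTO class VALUES (?, ?)",
    [
        ("Aabbcc", "Coco"), ("Aabbcc", "Riley"), ("Aabbcc", "Allan"),
        ("Aabbcc", "Betty"), ("Aabbcc", "Cici"), ("Aabbcc", "Coco"),  # duplicate Coco
        ("Ddeeff", "Riley"), ("Ddeeff", "Dana"),                      # hypothetical second ID
    ],
)

# group_concat(DISTINCT people) is SQLite's rough counterpart of STRING_AGG(DISTINCT ...).
ids, num_people, people_listed = conn.execute(
    """
    SELECT IDs,
           COUNT(DISTINCT people) AS num_people,
           group_concat(DISTINCT people) AS people_listed
    FROM class
    GROUP BY IDs
    ORDER BY num_people DESC
    LIMIT 1
    """
).fetchone()
```

The order of names inside the aggregated string is not guaranteed here, so compare it as a set rather than as an exact string.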

Select first matching string value in list from SQL table

I'd like to query Table and return the most granular frequency in the table for a given row. Sample table and desired result are below. I've tried a few iterations of the query but haven't cracked it yet.
By "most granular frequency" I mean that I'd like to return the first match for any row in this set ['hourly', 'daily', 'weekly', 'monthly'] as a new column called min_frequency
Table
----------------------------------
id | name | frequency
----------------------------------
----------------------------------
1 | apples | hourly
----------------------------------
2 | apples | daily
----------------------------------
3 | oranges | weekly
----------------------------------
4 | oranges | monthly
----------------------------------
Desired result:
name | min_frequency
----------------------------------
----------------------------------
apples | hourly
----------------------------------
oranges | weekly
----------------------------------
Current attempt:
SELECT name, (
CASE
WHEN frequency='hourly' then frequency
WHEN frequency='daily' then frequency
WHEN frequency='weekly' then frequency
WHEN frequency='yearly' then frequency
END
) as min_frequency from Table
GROUP BY name, min_frequency
You could use distinct on with conditional sorting logic:
select distinct on (name) *
from mytable
order by
name,
case frequency
when 'hourly' then 1
when 'daily' then 2
when 'weekly' then 3
when 'monthly' then 4
end
Although you can use a giant case expression, arrays are convenient for this:
select distinct on (name) t.*
from t
order by name,
array_position(array['hourly', 'daily', 'weekly', 'monthly'], frequency)
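Note that if you have frequencies other than those listed, this may not work as expected: array_position returns NULL for a value that is not in the array, and with the default ascending sort those rows come last.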
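The ranking idea, including what happens to an unlisted frequency, can be checked locally. This is a minimal sketch with Python's sqlite3 module rather than PostgreSQL: SQLite has no DISTINCT ON, so it relies on SQLite's documented behaviour that bare columns in a MIN() aggregate query come from the row holding the minimum. The 'pears' row, the `freq_rank` alias, and the fallback rank 99 are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, frequency TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "apples", "hourly"),
        (2, "apples", "daily"),
        (3, "oranges", "weekly"),
        (4, "oranges", "monthly"),
        (5, "pears", "yearly"),  # hypothetical row with an unlisted frequency
    ],
)

# Rank each frequency, keep the row with the smallest rank per name.
# ELSE 99 pushes unlisted frequencies to the end instead of leaving them NULL.
rows = conn.execute(
    """
    SELECT name,
           frequency AS min_frequency,
           MIN(CASE frequency
                 WHEN 'hourly'  THEN 1
                 WHEN 'daily'   THEN 2
                 WHEN 'weekly'  THEN 3
                 WHEN 'monthly' THEN 4
                 ELSE 99
               END) AS freq_rank
    FROM t
    GROUP BY name
    ORDER BY name
    """
).fetchall()
```

A name whose only frequency is unlisted still produces a row; whether that is acceptable depends on the data, and filtering on the rank is one way to exclude it.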

Counting SQLite rows that might match multiple times in a single query

I have a SQLite table which has a column containing categories that each row may fall into. Each row has a unique ID, but may fall into zero, one, or more categories, for example:
|-------+-------|
| name | cats |
|-------+-------|
| xyzzy | a b c |
| plugh | b |
| quux | |
| quuux | a c |
|-------+-------|
I'd like to obtain counts of how many items are in each category. In other words, output like this:
|------------+-------|
| categories | total |
|------------+-------|
| a | 2 |
| b | 2 |
| c | 2 |
| none | 1 |
|------------+-------|
I tried to use the case statement like this:
select case
when cats like "%a%" then 'a'
when cats like "%b%" then 'b'
when cats like "%c%" then 'c'
else 'none'
end as categories,
count(*)
from test
group by categories
But the problem is this only counts each row once, so it can't handle multiple categories. You then get this output instead:
|------------+-------|
| categories | total |
|------------+-------|
| a | 2 |
| b | 1 |
| none | 1 |
|------------+-------|
One possibility is to use as many union statements as you have categories:
select case
when cats like "%a%" then 'a'
end as categories, count(*)
from test
group by categories
union
select case
when cats like "%b%" then 'b'
end as categories, count(*)
from test
group by categories
union
...
but this seems really ugly and the opposite of DRY.
Is there a better way?
Fix your data structure! You should have a table with one row per name and per category:
create table nameCategories (
name varchar(255),
category varchar(255)
);
Then your query would be easy:
select category, count(*)
from namecategories
group by category;
Why is your data structure bad? Here are some reasons:
A column should contain a single value.
SQL has pretty lousy string functionality.
SQL queries to do what you want cannot be optimized.
SQL has a great data structure for storing lists. It is called a table, not a string.
With that in mind, here is one brute force method for doing what you want:
with categories as (
select 'a' as category union all
select 'b' union all
. . .
)
select c.category, count(t.name)
from categories c left join
test t
on ' ' || t.cats || ' ' like '% ' || c.category || ' %'
group by c.category;
If you already have a table of valid categories, then the CTE is not needed.
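To see the brute-force join produce the counts from the question, here is a minimal sketch using Python's sqlite3 module with the question's sample data. It assumes the table is `test` with columns `name` and `cats`, that categories are separated by single spaces, and that the full category list is a, b, c; the extra 'none' branch is an addition to match the desired output and is not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (name TEXT, cats TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("xyzzy", "a b c"), ("plugh", "b"), ("quux", ""), ("quuux", "a c")],
)

# Padding both sides with spaces makes 'a' match only as a whole token.
rows = conn.execute(
    """
    WITH categories AS (
        SELECT 'a' AS category UNION ALL
        SELECT 'b' UNION ALL
        SELECT 'c'
    )
    SELECT c.category, COUNT(t.name) AS total
    FROM categories c
    LEFT JOIN test t
           ON ' ' || t.cats || ' ' LIKE '% ' || c.category || ' %'
    GROUP BY c.category
    UNION ALL
    SELECT 'none', COUNT(*) FROM test WHERE TRIM(cats) = ''
    ORDER BY 1
    """
).fetchall()

for category, total in rows:
    print(category, total)
```

Because the join matches a row once per category it contains, 'xyzzy' contributes to a, b, and c, which is exactly what the single CASE expression could not do.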

Select only partial result but get total number of rows

I got stuck on SQL subquery selection. Right now, I have a table products:
id | name | description
----+-------+----------------------
6 | 123 | this is a +
| | girl.
7 | 124 | this is a +
| | girl.
8 | 125 | this is a +
| | girl.
9 | 126 | this is a +
| | girl.
10 | 127 | this is a +
| | girl.
11 | 127 | this is a +
| | girl. Isn't this?
12 | 345 | this is a cute slair
13 | ggg | this is a +
| | girl
14 | shout | this is a monster
15 | haha | they are cute
16 | 123 | this is cute
What I want to do is to find ( the total number of records and the first 5 records ) which contains '1' or 'this' in either name or description columns.
What I can figure out is so ugly:
SELECT *, (select count(id)
           from (SELECT * from products
                 where description like any (array['%1%','%this%'])
                 or name like any (array['%1%','%this%'])
                ) as foo
          ) as total
from (SELECT * from products
      where description like any (array['%1%','%this%'])
      or name like any (array['%1%','%this%'])
     ) as fooo
limit 5;
You can use the aggregate function count() as window function to compute the total count in the same query level:
SELECT id, name, description, count(*) OVER () AS total
FROM products p
WHERE description LIKE ANY ('{%1%,%this%}'::text[])
OR name LIKE ANY ('{%1%,%this%}'::text[])
ORDER BY id
LIMIT 5;
Quoting the manual on window functions:
In addition to these functions, any built-in or user-defined aggregate function can be used as a window function
This works, because LIMIT is applied after window functions.
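That order of evaluation can be observed locally. The following is a minimal sketch with Python's sqlite3 module, not PostgreSQL; it assumes the bundled SQLite is 3.25 or newer (window function support), uses plain LIKE with OR in place of LIKE ANY, and fills a made-up `products` table with 20 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, description TEXT)")
# Hypothetical data: odd ids get a description containing 'this'.
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(i, f"item{i}", "this is a test" if i % 2 else "other") for i in range(1, 21)],
)

# count(*) OVER () is computed over every row that passes WHERE;
# LIMIT trims the result only afterwards.
rows = conn.execute(
    """
    SELECT id, name, description, count(*) OVER () AS total
    FROM products
    WHERE description LIKE '%this%' OR name LIKE '%1%'
    ORDER BY id
    LIMIT 5
    """
).fetchall()

for row in rows:
    print(row)
```

Only five rows come back, yet each carries the total number of matching rows (15 with this sample data: ten odd ids plus item10, item12, item14, item16, item18).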
I also use an alternative syntax for array literals. One is as good as the other. This one is shorter for longer arrays. And sometimes an explicit type cast is needed. I am assuming text here.
It is simpler and a bit faster than the version with a CTE in my test.
BTW, this WHERE clause with a regular expression is shorter - but slower:
WHERE description ~ '(1|this)'
OR name ~ '(1|this)'
Ugly, but fast
One more test: I found the primitive version (similar to what you had already) to be even faster:
SELECT id, name, description
, (SELECT count(*)
FROM products p
WHERE description LIKE ANY ('{%1%,%this%}'::text[])
OR name LIKE ANY ('{%1%,%this%}'::text[])
) AS total
FROM products p
WHERE description LIKE ANY ('{%1%,%this%}'::text[])
OR name LIKE ANY ('{%1%,%this%}'::text[])
ORDER BY id
LIMIT 5;
Assuming you are using PostgreSQL 9.0+, you can use a CTE for this.
E.g.
WITH p AS (
SELECT *
FROM products
WHERE description LIKE ANY (ARRAY['%1%','%this%']) OR name LIKE ANY (ARRAY['%1%','%this%'])
)
SELECT *,
(select count(*) from p) as total
FROM p
ORDER BY id LIMIT 5;