Count string occurrences within a list column - Snowflake/SQL - sql

I have a table with a column that contains a list of strings like below:
EXAMPLE:
STRING User_ID [...]
"[""null"",""personal"",""Other""]" 2122213 ....
"[""Other"",""to_dos_and_thing""]" 2132214 ....
"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]" 4342323 ....
QUESTION:
I want to get a count of the number of times each unique string appears (the strings within the STRING column are separated by commas), but I only know how to do the following:
SELECT u.STRING, count(u.USERID) as cnt
FROM table u
group by u.STRING
order by cnt desc;
However, the above method doesn't work, as it only counts the number of user IDs that use a specific grouping of strings.
The ideal output using the example above would look like this:
DESIRED OUTPUT:
STRING COUNT_Instances
"null" 1223
"personal" 543
"Other" 324
"to_dos_and_thing" 221
"getting_things_done" 146
"Work!!!!!" 22

Based on your description, here is my sample table:
create table u (user_id number, string varchar);
insert into u values
(2122213, '"[""null"",""personal"",""Other""]"'),
(2132214, '"[""Other"",""to_dos_and_thing""]"'),
(2132215, '"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]"' );
I used SPLIT_TO_TABLE to split each string into rows, and then REGEXP_SUBSTR to clean the data. So here's the query and output:
select REGEXP_SUBSTR( s.VALUE, '""(.*)""', 1, 1, 'i', 1 ) extracted, count(*)
from u,
lateral SPLIT_TO_TABLE( string , ',' ) s
GROUP BY extracted
order by count(*) DESC;
+---------------------+----------+
| EXTRACTED           | COUNT(*) |
+---------------------+----------+
| Other               |        2 |
| null                |        1 |
| personal            |        1 |
| to_dos_and_thing    |        1 |
| getting_things_done |        1 |
| TO_dos_and_thing    |        1 |
| Work!!!!!           |        1 |
+---------------------+----------+
SPLIT_TO_TABLE https://docs.snowflake.com/en/sql-reference/functions/split_to_table.html
REGEXP_SUBSTR https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
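An alternative sketch, assuming the stored value really is a CSV-style quoted JSON array as in the sample: unescape the doubled quotes, parse the result with PARSE_JSON, and use LATERAL FLATTEN instead of splitting on commas (untested against the real data):
-- Hedged alternative on the same sample table u
select f.value::string as extracted, count(*) as cnt
from u,
lateral flatten(input => parse_json(trim(replace(u.string, '""', '"'), '"'))) f
group by extracted
order by cnt desc;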

Related

How to loop over an array in a string in a WHERE clause

I have an information table with a column that holds an array in string format. The length is unknown and may be 0. How can I use it in a WHERE clause in PostgreSQL?
* hospital_information_table
| ID  | main_name | alternative_name  |
| --- | --------- | ----------------- |
| 111 | 'abc'     | 'abe, abx'        |
| 222 | 'bbc'     | ''                |
| 333 | 'cbc'     | 'cbe,cbd,cbf,cbg' |

* record
| ID | name    | hospital_id |
| -- | ------- | ----------- |
| 1  | 'abc-1' |             |
| 2  | 'bbe+2' |             |
| 3  | 'cbf*3' |             |
For example, this column holds alternative names of hospitals, say 'abc,abd,abe,abf' for the row whose ID is '111'. I have a record with the hospital name 'cbf*3' ('3' is the department name) and I would like to look up its ID. How can I check the names in 'cbe,cbd,cbf,cbg' one by one and get its ID '333'?
--update--
In the example record table I used '-', '*', and '+', meaning that I cannot split the record names by any fixed pattern. What I can guarantee is that some of the alternative names may appear inside the record name as a substring, e.g. 'cbf' in 'cbf*3'. I would like to check all names: is 'abe' in 'cbf*3'? No. Is 'abx' in 'cbf*3'? No. Then move on to the next row, and so on.
--update--
Thanks for the answers! They are great!
For more detail: the original dataset is not in an alphabetic language, and the text in the record name is not separable; it is really hard to find one separator, or even several. Therefore solutions that rely on a regex like '[-*+]' will not work here.
Thanks in advance!
You could use regexp_split_to_array to convert the comma-delimited string to a proper array, and then use the ANY operator to search inside it:
SELECT r.*, h.id
FROM record r
JOIN hospital_information h
  ON SPLIT_PART(r.name, '-', 1) = ANY(REGEXP_SPLIT_TO_ARRAY(h.alternative_name, ','))
SQLFiddle demo
Substring can be used with a regular expression to get the hospital name from the record's name.
And String_to_array can transform a CSV string to an array.
SELECT
r.id as record_id
, r.name as record_name
, h.id as hospital_id
FROM record r
LEFT JOIN hospital_information h
ON SUBSTRING(r.name from '^(.*)[+*\-]\w+$') = ANY(STRING_TO_ARRAY(h.alternative_name,',')||h.main_name)
WHERE r.hospital_id IS NULL;
| record_id | record_name | hospital_id |
| --------- | ----------- | ----------- |
| 1         | abc-1       | 111         |
| 2         | bbe+2       | 222         |
| 3         | cbf*3       | 333         |
Demo on db<>fiddle here
Btw, text[] can be used as a datatype in a table.
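Given the later update that the record names cannot be reliably split on any separator, a hedged sketch of a plain substring-containment check (same assumed table and column names as above; it may match differently from the regex approach, and a record could in principle match more than one hospital):
-- Sketch: match when any alternative name (or the main name) appears anywhere
-- inside the record name, without splitting the record name at all.
SELECT r.id AS record_id, r.name AS record_name, h.id AS hospital_id
FROM record r
LEFT JOIN hospital_information h
  ON EXISTS (
       SELECT 1
       FROM unnest(string_to_array(h.alternative_name, ',') || h.main_name) AS alt(nm)
       WHERE trim(nm) <> '' AND position(trim(nm) IN r.name) > 0
     );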

Separate phone numbers from string in cell - random order

I have a bunch of data that contains a phone number and a birthday as well as other data.
{1997-06-28,07742367858}
{07791100873,1996-07-14}
{30/01/1997,07974335488}
{1997-04-04,07701003703}
{1996-03-11,07480227283}
{1998-06-20,07713817233}
{1996-09-13,07435148920}
{"21 03 2000",07548542539,1st}
{1996-03-09,07539248008}
{07484642432,1996-03-01}
I am trying to extract the phone number from this, however I am unsure how to get it out when the data is not always in the same order.
I would expect one column that returns a phone number, the next returning a birthday, and another returning any arbitrary value in the third slot.
You can try to sort parts of each string by the number of digits they contain. This can be done with the expression:
select length(regexp_replace('1997-06-28', '\D', '', 'g'))
length
--------
8
(1 row)
The query removes curly brackets from strings, splits them by comma, sorts elements by the number of digits and aggregates back to arrays:
with my_data(str) as (
  values
    ('{1997-06-28,07742367858}'),
    ('{07791100873,1996-07-14}'),
    ('{30/01/1997,07974335488}'),
    ('{1997-04-04,07701003703}'),
    ('{1996-03-11,07480227283}'),
    ('{1998-06-20,07713817233}'),
    ('{1996-09-13,07435148920}'),
    ('{"21 03 2000",07548542539,1st}'),
    ('{1996-03-09,07539248008}'),
    ('{07484642432,1996-03-01}')
)
select id, array_agg(elem order by length(regexp_replace(elem, '\D', '', 'g')) desc)
from (
  select id, trim(unnest(string_to_array(str, ',')), '"') as elem
  from (
    select trim(str, '{}') as str, row_number() over () as id
    from my_data
  ) s
) s
group by id
Result:
id | array_agg
----+--------------------------------
1 | {07742367858,1997-06-28}
2 | {07791100873,1996-07-14}
3 | {07974335488,30/01/1997}
4 | {07701003703,1997-04-04}
5 | {07480227283,1996-03-11}
6 | {07713817233,1998-06-20}
7 | {07435148920,1996-09-13}
8 | {07548542539,"21 03 2000",1st}
9 | {07539248008,1996-03-09}
10 | {07484642432,1996-03-01}
(10 rows)
See also this answer: Looking for solution to swap position of date format DMY to YMD, if you want to normalize the dates. You will need to modify the function:
create or replace function iso_date(text)
returns date language sql immutable as $$
select case
when $1 like '__/__/____' then to_date($1, 'DD/MM/YYYY')
when $1 like '____/__/__' then to_date($1, 'YYYY/MM/DD')
when $1 like '____-__-__' then to_date($1, 'YYYY-MM-DD')
when trim($1, '"') like '__ __ ____' then to_date(trim($1, '"'), 'DD MM YYYY')
end
$$;
and use it:
select id, a[1] as phone, iso_date(a[2]) as birthday, a[3] as comment
from (
select id, array_agg(elem order by length(regexp_replace(elem, '\D', '', 'g')) desc) as a
from (
select id, trim(unnest(string_to_array(str, ',')), '"') as elem
from (
select trim(str, '{}') as str, row_number() over () as id
from my_data
) s
) s
group by id
) s
id | phone | birthday | comment
----+-------------+------------+---------
1 | 07742367858 | 1997-06-28 |
2 | 07791100873 | 1996-07-14 |
3 | 07974335488 | 1997-01-30 |
4 | 07701003703 | 1997-04-04 |
5 | 07480227283 | 1996-03-11 |
6 | 07713817233 | 1998-06-20 |
7 | 07435148920 | 1996-09-13 |
8 | 07548542539 | 2000-03-21 | 1st
9 | 07539248008 | 1996-03-09 |
10 | 07484642432 | 1996-03-01 |
(10 rows)
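A hedged shortcut, if every phone number really is an 11-digit string starting with 0 as in the sample: pull it straight out of the raw string with regexp_match (PostgreSQL 10+), without splitting at all. A self-contained sketch on two of the sample rows:
-- Sketch only: grabs the first run of 11 digits that starts with 0.
with my_data(str) as (
  values ('{1997-06-28,07742367858}'),
         ('{"21 03 2000",07548542539,1st}')
)
select str, (regexp_match(str, '0\d{10}'))[1] as phone
from my_data;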

SQL select counts on 1 value

I have a table like this:
+---------+------------+--------+--------------+
| Id | Name | Status | Content_type |
+---------+------------+--------+--------------+
| 2960671 | PostJob | Error | general_url |
| 2960670 | auto_index | Done | general_url |
| 2960669 | auto_index | Done | document |
| 2960668 | auto_index | Error | document |
| 2960667 | auto_index | Error | document |
+---------+------------+--------+--------------+
And I want to count how many rows of each type have 'Error' as the status, so the result would be 1x general_url and 2x document.
I tried something like this:
SELECT COUNT(DISTINCT Content_type) from Indexing where Status = 'Error';
But I could not figure out how to get the content_type out of it.
You want this:
select Content_type,
       count(Status)
from Indexing
where Status = 'Error'
group by Content_type;
GROUP BY should do the job:
SELECT Content_type, COUNT(Id) from Indexing where Status = 'Error' GROUP BY Content_type;
Explanation:
COUNT(x) counts the number of rows in the group where x is not NULL; COUNT(*) counts every row in the group, which gives the same result here.
COUNT(DISTINCT x) counts the number of distinct non-NULL values of x in the group.
Without a GROUP BY clause the group is the whole set of records, so in your example you would have seen a single value (2) as your result, i.e. there are 2 distinct Content_types in the filtered set.
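As a small illustration against the sample rows above (a sketch reusing the assumed Indexing table and column names):
SELECT COUNT(*)                     AS error_rows,      -- 3
       COUNT(Content_type)          AS non_null_types,  -- 3
       COUNT(DISTINCT Content_type) AS distinct_types   -- 2
FROM Indexing
WHERE Status = 'Error';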
SQL Fiddle Oracle
Schema
create table test
(id varchar2(10),
name varchar2(30),
status varchar2(20),
content_type varchar2(30)
);
insert into test values('2960671','PostJob','Error','general_url');
insert into test values('2960670','auto_index','Done','general_url');
insert into test values('2960669','auto_index','Done','document');
insert into test values('2960668','auto_index','Error','document');
insert into test values('2960667','auto_index','Error','document');
Select Query
SELECT LISTAGG(content_type, ',') WITHIN GROUP (ORDER BY rownum) AS content_type,
count(content_type) as content_type_count
from
(
select distinct(content_type) content_type
FROM test
where status='Error'
);
Output
| CONTENT_TYPE         | CONTENT_TYPE_COUNT |
|----------------------|--------------------|
| document,general_url |                  2 |
The idea here is to print comma-separated content_type values so that you can see the count of distinct content_types along with the actual values.
Try this one:
SELECT count(`content_type`) as 'count', content_type as 'x content type'
FROM `tablename`
WHERE status = 'Error'
GROUP BY `content_type`;

Search an SQL table that already contains wildcards?

I have a table that contains patterns for phone numbers, where x can match any digit.
+----+--------------+----------------------+
| ID | phone_number | phone_number_type_id |
+----+--------------+----------------------+
| 1 | 1234x000x | 1 |
| 2 | 87654311100x | 4 |
| 3 | x111x222x | 6 |
+----+--------------+----------------------+
Now, I might have 511132228, which will match row 3, and it should return its type. So it's kind of like SQL wildcards, but the other way around, and I'm confused about how to achieve this.
Give this a go:
select * from my_table
where '511132228' like replace(phone_number, 'x', '_')
select *
from yourtable
where '511132228' like (replace(phone_number, 'x','_'))
Try the query below:
SELECT ID,phone_number,phone_number_type_id
FROM TableName
WHERE '511132228' LIKE REPLACE(phone_number,'x','_');
Query with test data:
With TableName as
(
SELECT 3 ID, 'x111x222x' phone_number, 6 phone_number_type_id from dual
)
SELECT 'true' value_available
FROM TableName
WHERE '511132228' LIKE REPLACE(phone_number,'x','_');
The above query will return data if pattern match is available and will not return any row if no match is available.
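For completeness, a hedged usage sketch against the sample rows (same assumed TableName), returning only the matching type:
-- Expected to return 6: '511132228' matches the stored pattern x111x222x (row 3).
SELECT phone_number_type_id
FROM TableName
WHERE '511132228' LIKE REPLACE(phone_number, 'x', '_');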

Splitting a string column in BigQuery

Let's say I have a table in BigQuery containing 2 columns. The first column represents a name, and the second is a delimited list of values, of arbitrary length. Example:
Name | Scores
-----+------------
Bob  | 10;20;20
Sue  | 14;12;19;90
Joe  | 30;15
I want to transform this so that the first column is still the name and the second holds a single score value, like so:
Name,Score
Bob,10
Bob,20
Bob,20
Sue,14
Sue,12
Sue,19
Sue,90
Joe,30
Joe,15
Can this be done in BigQuery alone?
Good news everyone! BigQuery can now SPLIT()!
Look at "find all two word phrases that appear in more than one row in a dataset".
There is no current way to split() a value in BigQuery to generate multiple rows from a string, but you could use a regular expression to look for the commas and find the first value. Then run a similar query to find the 2nd value, and so on. They can all be merged into only one query, using the pattern presented in the above example (UNION through commas).
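A minimal sketch of that one-position-per-query idea, written here in today's Standard SQL against a hypothetical table `my_dataset.scores` (rows that have no Nth value come back as NULL and would need filtering out):
SELECT Name, REGEXP_EXTRACT(Scores, r'^([^;]+)') AS Score FROM `my_dataset.scores`
UNION ALL
SELECT Name, REGEXP_EXTRACT(Scores, r'^[^;]+;([^;]+)') FROM `my_dataset.scores`
UNION ALL
-- third position; yields NULL for Joe, who only has two scores
SELECT Name, REGEXP_EXTRACT(Scores, r'^(?:[^;]+;){2}([^;]+)') FROM `my_dataset.scores`;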
Trying to rewrite Elad Ben Akoune's answer in Standard SQL, the query becomes this:
WITH name_score AS (
  SELECT Name, split(Scores, ';') AS Score
  FROM (
    (SELECT * FROM (SELECT 'Bob' AS Name, '10;20;20' AS Scores))
    UNION ALL
    (SELECT * FROM (SELECT 'Sue' AS Name, '14;12;19;90' AS Scores))
    UNION ALL
    (SELECT * FROM (SELECT 'Joe' AS Name, '30;15' AS Scores))
  )
)
SELECT name, score
FROM name_score
CROSS JOIN UNNEST(name_score.score) AS score;
And this outputs:
+------+-------+
| name | score |
+------+-------+
| Bob | 10 |
| Bob | 20 |
| Bob | 20 |
| Sue | 14 |
| Sue | 12 |
| Sue | 19 |
| Sue | 90 |
| Joe | 30 |
| Joe | 15 |
+------+-------+
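Against a real table (hypothetical name `my_dataset.scores`, with columns Name and Scores as above), the same Standard SQL idea collapses to a single comma join with UNNEST; a sketch:
SELECT Name, score
FROM `my_dataset.scores`,
     UNNEST(SPLIT(Scores, ';')) AS score;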
If someone is still looking for an answer:
select Name, split(Scores, ';') as Score
from (
  # replace the inner custom select with your source table
  select *
  from
    (select 'Bob' as Name, '10;20;20' as Scores),
    (select 'Sue' as Name, '14;12;19;90' as Scores),
    (select 'Joe' as Name, '30;15' as Scores)
);