PLSQL - Count of all characters within a string - sql

I want to be able to generate a count of all characters in a given string from the result of an Oracle PLSQL query.
For instance, given the string "strings", output would be as such
character | count
-----------------
g | 1
i | 1
n | 1
r | 1
s | 2
t | 1
My thinking was something along the lines of
SELECT COLUMN, COUNT(COLUMN) FROM TABLE GROUP BY COLUMN
but that would require converting a string into a set of characters which is where I'm stuck.
Ideally this extends to a count of all ASCII characters not just A-Z, in order to perform analysis on the contents of the database.
I'm curious if there's a better way to do this than creating a procedure and whitelisting characters to count and running that on a given string.

This is a commonly used way to split a string into characters;
once you have one record for each character, counting them is quite straightforward:
select single_char, count(*)
from (
    select substr(x, level, 1) as single_char
    from (select 'abbabbaccb' x from dual)
    connect by level <= length(x)
)
group by single_char
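The query above splits a single literal. Run against a table with several rows, a bare `connect by level <= length(x)` multiplies rows across the whole table, so the usual idiom adds two `prior` conditions. This is a minimal sketch for Oracle, assuming a hypothetical table `my_strings(id, x)` where `id` is unique; neither name comes from the question.

```sql
-- Hypothetical table (an assumption): my_strings(id NUMBER PRIMARY KEY, x VARCHAR2(4000))
select id, single_char, count(*) as char_count
from (
    select id, substr(x, level, 1) as single_char
    from my_strings
    where x is not null                        -- skip empty strings (NULL in Oracle)
    connect by level <= length(x)
           and prior id = id                   -- stay within the same source row
           and prior sys_guid() is not null    -- keeps Oracle from reporting a cycle
)
group by id, single_char
order by id, single_char;
```

Because `substr` returns whatever character sits at each position, nothing needs to be whitelisted; digits, punctuation and spaces are counted like letters.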

Related

Big Query String Manipulation using SubQuery

I would appreciate a push in the right direction with how this might be achieved using GCP Big Query, please.
I have a column in my table of type string. Inside this string there is a repeating sequence of characters, and I need to extract and process each occurrence. To illustrate, let's say the column name is 'instruments'. A possible value for instruments could be:
'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'
In which case I need to extract 'basoon', 'cello' and 'guitar'.
I'm more or less a SQL newbie, sorry. So far I have:
SELECT
bandId,
REGEXP_EXTRACT(instruments, r'inst=.*?\;') AS INSTS
FROM `inventory.band.mytable`;
This extracts the instruments substring ('inst=basoon,inst=cello;inst=guitar;') and gives me an output column 'INSTS' but now I think I need to split the values in that column on the comma and do some further processing. This is where I'm stuck as I cannot see how to structure additional queries or processing blocks.
How can I reference the INSTS in order to do subsequent processing? Documentation suggests I should be building subqueries using WITH but I can't seem to get anything going. Could some kind soul give me a push in the right direction, please?
BigQuery has a function SPLIT() that returns an array; indexing that array with [SAFE_ORDINAL(i)] gives you what SPLIT_PART() does in other databases.
Assuming that you don't alternate between the comma and the semicolon for separating your «key»=«value» pairs, and only use the semicolon,
first you split your instruments string into its parts and keep those that start with inst=. To do that, you use an in-line table of consecutive integers to CROSS JOIN with, so that you can take SPLIT(instruments, ';')[SAFE_ORDINAL(i)] with an increasing integer value for i. You will get strings in the format inst=%, of which you want the part after the equal sign. You get that part by applying another SPLIT(), this time with the equal sign as the delimiter, and taking the second element:
WITH indata AS (
  -- some input, don't use in real query ...
  -- I assume that you don't alternate between comma and semicolon for the delimiter, and stick to semicolon
  SELECT
    1 AS bandid, 'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false' AS instruments
  UNION ALL
  SELECT
    2, 'band=true;inst=drum;inst=cello;inst=bass;inst=flute;cases=false;permits=true'
  UNION ALL
  SELECT
    3, 'band=false;inst=12string;inst=banjo;inst=triangle;inst=tuba;cases=false;permits=true'
)
-- real query starts here, replace following comma with "WITH" ...
,
-- need a series of consecutive integers ...
seq AS (
  SELECT 1 AS i
  UNION ALL SELECT 2
  UNION ALL SELECT 3
  UNION ALL SELECT 4
  UNION ALL SELECT 5
  UNION ALL SELECT 6
)
SELECT
  bandid
, i
  -- SAFE_ORDINAL is 1-based and returns NULL when i is past the last element
, SPLIT(SPLIT(instruments, ';')[SAFE_ORDINAL(i)], '=')[SAFE_ORDINAL(2)] AS instrument
FROM indata CROSS JOIN seq
WHERE SPLIT(instruments, ';')[SAFE_ORDINAL(i)] LIKE 'inst=%'
ORDER BY 1, 2
-- out bandid | i | instrument
-- out --------+---+------------
-- out 1 | 2 | basoon
-- out 1 | 3 | cello
-- out 1 | 4 | guitar
-- out 2 | 2 | drum
-- out 2 | 3 | cello
-- out 2 | 4 | bass
-- out 2 | 5 | flute
-- out 3 | 2 | 12string
-- out 3 | 3 | banjo
-- out 3 | 4 | triangle
-- out 3 | 5 | tuba
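If the separators do alternate between comma and semicolon, as in the sample string from the question, one possible approach is to match the values directly instead of splitting. The sketch below is for BigQuery and uses an inline `mytable` CTE as a stand-in for the real table (an assumption made only to keep the example self-contained). When the pattern has a capturing group, `REGEXP_EXTRACT_ALL` returns the captured part of every match as an array, and `UNNEST` turns that array into rows that later steps can reference by name.

```sql
WITH mytable AS (
  -- inline stand-in for `inventory.band.mytable`, using the question's sample value
  SELECT 1 AS bandId,
         'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false' AS instruments
)
SELECT bandId, inst
FROM mytable,
     -- capture everything after "inst=" up to the next ';' or ','
     UNNEST(REGEXP_EXTRACT_ALL(instruments, r'inst=([^;,]+)')) AS inst
```

For the sample value this yields one row each for basoon, cello and guitar.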
Consider the few options below (just to demonstrate different techniques here)
Option 1
select bandId,
  ( select string_agg(split(kv, '=')[offset(1)])
    from unnest(split(instruments, ';')) kv
    where split(kv, '=')[offset(0)] = 'inst'
  ) as insts
from `inventory.band.mytable`
Option 2 (for obvious reasons this one would be my choice)
select bandId,
  array_to_string(regexp_extract_all(instruments, r'inst=([^;]+)'), ',') instrs
from `inventory.band.mytable`
If applied to sample data in your question - output in both cases is

Aggregating / Concatenation of very long Varchar2 strings and find key words in the text || Oracle

I have been given a task to develop a script/ function/ query to aggregate groups of rows in a table and then search for specific keywords in it. The column to be aggregated is a varchar2 column with size 3200 and some of the aggregated rows have lengths way beyond 5000.
(I understand that the size of varchar2 is 4000)
When I try to aggregate the data into a single column, it gives a "result of string concatenation is too long" error (ORA-01489)
I have tried inbuilt aggregators like LISTAGG, XMLAGG, and also some custom functions but I have been asked to prefer a SQL query over a function or procedure.
Once I can get the data to be aggregated, I have to then search through the rows for matching keywords.
(I can't just search the rows without aggregating, as some of the words are split across the rows, e.g. row1 ends with "KEYW" and row2 starts with "ORD" if I need to look for "KEYWORD" in the table.)
my table kind of looks like this (can't post the real table data, sorry),
id_1 | id_2 | name | row_num | description
1    | 5    | A    | 0       | this has so
1    | 5    | A    | 1       | me keyword
1    | 5    | B    | 0       | this is
1    | 3    | E    | 0       | new some
2    | 12   | A    | 0       | diff str
Here the groups are identified using the first 3 columns, and the 4th column gives the order in which the "description" strings need to be concatenated.
I would like to get the output as:
id_1 | id_2 | name | description (concatenated)
1    | 5    | A    | this has **some** keyword
1    | 3    | E    | new **some**
when looking for the keyword "some".
Please help as I am fairly new to DBs and any help will be highly appreciated.
Thanks & Regards
Kunal

matching array in Postgres with string manipulation

I was working with the "<@" operator and two arrays of strings.
anyarray <@ anyarray → boolean
Every string is formed in this way: ${name}_${number}, and I would like to check if the name part is included and the number is equal or lower than the one in the other array.
['elementOne_10'] & [['elementOne_7' , 'elementTwo20']] → true
['elementOne_10'] & [['elementOne_17', 'elementTwo20']] → false
what would be an efficient way to do this?
Assuming your sample data elementTwo20 in fact follows your described schema and should be elementTwo_20:
step-by-step demo:db<>fiddle
SELECT
    id
FROM (
    SELECT
        *,
        split_part(u, '_', 1) as name, -- 3
        split_part(u, '_', 2)::int as num,
        split_part(compare, '_', 1) as comp_name,
        split_part(compare, '_', 2)::int as comp_num
    FROM
        t,
        unnest(data) u, -- 1
        (SELECT unnest('{elementOne_10}'::text[]) as compare) s -- 2
) s
GROUP BY id -- 4
HAVING
    ARRAY_AGG(name) @> ARRAY_AGG(comp_name) -- 5
    AND MAX(comp_num) BETWEEN MIN(num) AND MAX(num)
1. unnest() your array elements into one element per record
2. JOIN and unnest() your comparison data
3. Split the element strings into their name and num parts
4. unnest() creates several records per original array; they can be grouped by an identifier (best is an id column)
5. Filter with your criteria in the HAVING clause: compare the name parts, for example with array operators; for the BETWEEN comparison you can use MIN and MAX on the num part.
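As a different way to express the same check, the rule can also be tested element by element rather than on aggregates. The sketch below is an alternative formulation for PostgreSQL, not the query from the answer: it keeps a row only if every comparison element has a stored element with the same name and a number that is equal or lower. The inline `t(id, data)` CTE and its two rows are assumptions based on the examples in the question.

```sql
WITH t(id, data) AS (
    -- assumed sample rows, following the ${name}_${number} scheme
    VALUES (1, '{elementOne_7,elementTwo_20}'::text[]),
           (2, '{elementOne_17,elementTwo_20}'::text[])
)
SELECT t.id
FROM t
WHERE NOT EXISTS (
    -- look for a comparison element that has no acceptable counterpart in data
    SELECT 1
    FROM unnest('{elementOne_10}'::text[]) AS c(elem)
    WHERE NOT EXISTS (
        SELECT 1
        FROM unnest(t.data) AS d(elem)
        WHERE split_part(d.elem, '_', 1) = split_part(c.elem, '_', 1)
          AND split_part(d.elem, '_', 2)::int <= split_part(c.elem, '_', 2)::int
    )
);
-- returns id 1 only: 7 <= 10 holds, 17 <= 10 does not
```

The double NOT EXISTS reads as "there is no comparison element without a match", which keeps the name and number conditions tied to the same element.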
Note:
As @a_horse_with_no_name correctly mentioned: If possible, think about your database design and normalize it:
Don't store arrays -> You don't need to unnest them on every operation
Relevant data should be kept separated, not concatenated as a string -> You don't need to split them on every operation
id | name | num
---------------------
1 | elementOne | 7
1 | elementTwo | 20
2 | elementOne | 17
2 | elementTwo | 20
This is exactly the result of the inner subquery. You have to create this every time you need these data. It's better to store the data like this.
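To illustrate why the normalized layout is easier to work with, here is a minimal PostgreSQL sketch of the same check against such a table. The names `elements` and `wanted` are hypothetical, and the rows are the ones from the table above.

```sql
WITH elements(id, name, num) AS (
    -- the normalized rows shown above
    VALUES (1, 'elementOne', 7),
           (1, 'elementTwo', 20),
           (2, 'elementOne', 17),
           (2, 'elementTwo', 20)
),
wanted(name, max_num) AS (
    -- the comparison data, already split into name and number
    VALUES ('elementOne', 10)
)
SELECT e.id
FROM elements e
JOIN wanted w
  ON e.name = w.name
 AND e.num <= w.max_num
GROUP BY e.id
-- keep only ids that satisfied every wanted name
HAVING count(DISTINCT w.name) = (SELECT count(*) FROM wanted);
-- returns id 1 only
```

No unnest() or split_part() is needed, and the join columns can be indexed.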

Number of palindromes in character strings

I'm trying to gather a list of 6 letter palindromes and the number of times they occur using Postgres 9.3.5.
This is the query I've tried:
SELECT word, count(*)
FROM ( SELECT regexp_split_to_table(read_sequence, '([ATCG])([ATCG])([ATCG])(\3)(\2)(\1)') as word
FROM reads ) t
GROUP BY word;
However this brings up results that a) aren't palindromic and b) greater or less than 6 letters long.
\d reads
Table "public.reads"
Column | Type | Modifiers
--------------+---------+-----------
read_header | text | not null
read_sequence | text |
option | text |
quality_score | text |
pair_end | text | not null
species_id | integer |
Indexes:
"reads_pkey" PRIMARY KEY, btree (read_header, pair_end)
read_sequence contains DNA sequences, 'ATGCTGATGCGGCGTAGCTGGATCGA' for example.
I'd like to see the number of palindromes in each sequence, so the example would contain 1, another sequence could have 4, another 3, and so on.
Count per row:
SELECT read_header, pair_end, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM reads r
, generate_series(1, length(r.read_sequence) - 5 ) i
WHERE substr(read_sequence, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1'
GROUP BY 1,2,3
ORDER BY 1,2,3,4 DESC;
Count per read_header and palindrome:
SELECT read_header, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP BY 1,2
ORDER BY 1,2,3 DESC;
Count per read_header:
SELECT read_header, count(*) AS ct
FROM
...
GROUP BY 1
ORDER BY 1,2 DESC;
Count per palindrome:
SELECT substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP BY 1
ORDER BY 1,2 DESC;
SQL Fiddle.
Explain
A palindrome could start at any position 5 characters shy of the end to allow a length of 6. And palindromes can overlap. So:
Generate a list of possible starting positions with generate_series() in a LATERAL join, and based on this all possible 6-character strings.
Test for palindrome with regular expression with back references, similar to what you had, but regexp_split_to_table() is not the right function here. Use a regular expression match (~).
Aggregate, depending on what you actually want.
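The steps above can be tried without the reads table. This sketch for PostgreSQL 9.3 or later runs the same sliding-window test on the sample sequence from the question, supplied as a literal; the column aliases `seq` and `start_pos` are made up for the illustration.

```sql
SELECT i AS start_pos,
       substr(s.seq, i, 6) AS word
FROM (SELECT 'ATGCTGATGCGGCGTAGCTGGATCGA'::text AS seq) s
-- one candidate start position per 6-character window
CROSS JOIN LATERAL generate_series(1, length(s.seq) - 5) AS i
-- keep only windows whose last three letters mirror the first three
WHERE substr(s.seq, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1';
-- start_pos | word
-- 9         | GCGGCG
```

The single hit matches the expectation in the question that this sequence contains one palindrome.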

SQL Select where id is in `column`

I have a column that has multiple numbers separated by a comma. Example for a row:
`numbers`:
1,2,6,66,4,9
I want to make a query that will select the row only if the number 6 (for example) is in the column numbers.
I can't use LIKE because if there is 66 it'll match too.
You can use like. Concatenate the field separators at the beginning and end of the list and then use like. Here is the SQL Server syntax:
where ','+numbers+',' like '%,'+'6'+',%'
SQL Server uses + for string concatenation. Other databases use || or the concat() function.
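For comparison, here is the same idea with the `||` operator, written as a self-contained sketch for PostgreSQL. The inline table `t` and its second row are assumptions added to show a non-matching case.

```sql
WITH t(id, numbers) AS (
    VALUES (1, '1,2,6,66,4,9'),
           (2, '1,66,9')        -- contains 66 but not 6
)
SELECT id
FROM t
-- wrap both the list and the search value in commas so 6 cannot match inside 66
WHERE ',' || numbers || ',' LIKE '%,' || '6' || ',%';
-- returns id 1 only
```

In MySQL, `||` is a logical OR by default, so `CONCAT(',', numbers, ',')` would be used there instead; MySQL also has `FIND_IN_SET('6', numbers)` for comma-separated lists.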
A better option is to change your database design: add a new table that links each number to the row of your current table. So if your row looks like this:
id numbers
1 1,2,6,66,4,9
You would have a new table that joins those values like so
row_id number
1 1
1 2
1 6
1 66
1 4
1 9
Then you can search for the number 6 in the number column and get the row_id
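A minimal sketch of that lookup, written for PostgreSQL. The table name `row_numbers` and the extra rows for `row_id` 2 are assumptions; the rows for `row_id` 1 are the ones listed above.

```sql
WITH row_numbers(row_id, number) AS (
    VALUES (1, 1), (1, 2), (1, 6), (1, 66), (1, 4), (1, 9),
           (2, 1), (2, 66)      -- assumed second row without a 6
)
SELECT DISTINCT row_id
FROM row_numbers
WHERE number = 6;
-- returns row_id 1 only
```

Because the comparison is on a whole integer value, 66 can never be mistaken for 6, and the column can be indexed.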