Number of palindromes in character strings - sql

I'm trying to gather a list of 6 letter palindromes and the number of times they occur using Postgres 9.3.5.
This is the query I've tried:
SELECT word, count(*)
FROM ( SELECT regexp_split_to_table(read_sequence, '([ATCG])([ATCG])([ATCG])(\3)(\2)(\1)') as word
FROM reads ) t
GROUP BY word;
However, this returns results that a) aren't palindromic and b) are longer or shorter than 6 letters.
\d reads
Table "public.reads"
Column | Type | Modifiers
--------------+---------+-----------
read_header | text | not null
read_sequence | text |
option | text |
quality_score | text |
pair_end | text | not null
species_id | integer |
Indexes:
"reads_pkey" PRIMARY KEY, btree (read_header, pair_end)
read_sequence contains DNA sequences, 'ATGCTGATGCGGCGTAGCTGGATCGA' for example.
I'd like to see the number of palindromes in each sequence, so the example would contain 1, another sequence could have 4, another 3, and so on.

Count per row:
SELECT read_header, pair_end, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM reads r
, generate_series(1, length(r.read_sequence) - 5 ) i
WHERE substr(read_sequence, i, 6) ~ '([ATCG])([ATCG])([ATCG])\3\2\1'
GROUP BY 1,2,3
ORDER BY 1,2,3,4 DESC;
Count per read_header and palindrome:
SELECT read_header, substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP BY 1,2
ORDER BY 1,2,3 DESC;
Count per read_header:
SELECT read_header, count(*) AS ct
FROM
...
GROUP BY 1
ORDER BY 1,2 DESC;
Count per palindrome:
SELECT substr(read_sequence, i, 6) AS word, count(*) AS ct
FROM
...
GROUP BY 1
ORDER BY 1,2 DESC;
SQL Fiddle.
Explain
A palindrome can start at any position up to 5 characters before the end of the string, which leaves room for a length of 6. Palindromes can also overlap. So:
Generate a list of possible starting positions with generate_series() in a LATERAL join, and based on this all possible 6-character strings.
Test for palindrome with regular expression with back references, similar to what you had, but regexp_split_to_table() is not the right function here. Use a regular expression match (~).
Aggregate, depending on what you actually want.
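The three steps above can also be sketched outside the database, which is handy for cross-checking what the query returns. This is a minimal Python sketch, not part of the original answer; the function name six_letter_palindromes is hypothetical, and it assumes sequences contain only the letters A, T, C and G, as the regular expression does.

```python
from collections import Counter

def six_letter_palindromes(sequence, size=6):
    """Return every overlapping window of `size` letters that is a palindrome."""
    hits = []
    # Step 1: every possible start position (the generate_series() part).
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        # Step 2: keep windows made of A/T/C/G that read the same reversed.
        if set(window) <= set("ATCG") and window == window[::-1]:
            hits.append(window)
    return hits

# Step 3: aggregate, here as a count per palindrome for one sequence.
example = "ATGCTGATGCGGCGTAGCTGGATCGA"
found = six_letter_palindromes(example)
counts = Counter(found)
```

For the example sequence from the question, found holds the single palindrome GCGGCG, which agrees with the expected count of 1.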

Related

Group data series into variable width windows based on first event

I have a computational task which can be reduced to the following problem:
I have a large set of pairs of integers (key, val) which I want to group into windows. The first window starts with the first pair p ordered by key attribute and spans all the pairs where p[i].key belongs to [p[0].key; p[0].key + N), with some arbitrary integer N, positive and common to all windows.
The next window starts with the first pair ordered by key not included in the previous windows and again spans all the pairs from its key to key + N, and so on for the following windows.
The last step is to sum the second attribute (val) for each window and display it together with the first key of the window.
For example, given list of records with values:
key | val
----+-----
  1 |   3
  2 |   7
  5 |   1
  6 |   4
  7 |   1
 10 |   3
 13 |   5
and N=3, the windows would be:
{(1,3),(2,7)},
{(5,1),(6,4),(7,1)},
{(10,3)}
{(13,5)}
The final result:
key | sum_of_values
----+---------------
  1 |            10
  5 |             6
 10 |             3
 13 |             5
This is easy to program with a standard programming language but I have no clue how to solve this with SQL.
Note: If ClickHouse doesn't support the RECURSIVE keyword, just remove that keyword from the expression.
ClickHouse seems to use non-standard syntax for the WITH clause. The query below uses standard SQL; adjust as needed.
Sorry, ClickHouse may not support this approach. If not, we would need to find another method of walking through the data.
Standard SQL:
There are a few ways. Here's one approach. First assign row numbers to allow recursively stepping through the rows. We could use LEAD as well.
Assign a group (key value) to each row based on the current key and the last group/key value and whether they are within some distance (N = 3, in this case).
The last step is to just SUM these values per group start_key and to use the start_key value as the starting key in each group.
WITH RECURSIVE nrows (xkey, val, n) AS (
SELECT xkey, val, ROW_NUMBER() OVER (ORDER BY xkey) FROM test
)
, cte (xkey, val, n, start_key) AS (
SELECT xkey, val, n, xkey FROM nrows WHERE n = 1
UNION ALL
SELECT t1.xkey, t1.val, t1.n
-- stay in the current window while xkey < start_key + N (N = 3 here), else start a new window
, CASE WHEN t1.xkey <= t2.start_key + (3-1) THEN t2.start_key ELSE t1.xkey END
FROM nrows AS t1
JOIN cte AS t2
ON t2.n = t1.n-1
)
SELECT start_key
, SUM(val) AS sum_values
FROM cte
GROUP BY start_key
ORDER BY start_key
;
Result:
+-----------+------------+
| start_key | sum_values |
+-----------+------------+
| 1 | 10 |
| 5 | 6 |
| 10 | 3 |
| 13 | 5 |
+-----------+------------+
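As the question notes, the same grouping is straightforward in a general-purpose language. The following Python sketch (the name sum_windows is hypothetical, not from the answer) mirrors the recursive step: a row stays in the current window while its key is below start_key + N, otherwise it opens a new window. It can serve as a reference when checking the SQL output.

```python
def sum_windows(pairs, n):
    """Group (key, val) pairs into windows [start, start + n) and sum val per window."""
    result = []  # each entry is [start_key, running_sum]
    for key, val in sorted(pairs):
        if result and key < result[-1][0] + n:
            # Key falls inside the current window: add to its sum.
            result[-1][1] += val
        else:
            # Key is outside the current window: it starts a new one.
            result.append([key, val])
    return [tuple(entry) for entry in result]

pairs = [(1, 3), (2, 7), (5, 1), (6, 4), (7, 1), (10, 3), (13, 5)]
windows = sum_windows(pairs, 3)
```

With the sample data and N = 3, windows contains the same four (start_key, sum) rows as the result table above.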

How can I get the dates from a text string?

I use Vertica SQL and have a field "Note" that is a free text field (no consistent way to enter data). I'd like to create another field with only dates or extract the last date in the field.
E.g
"1st order on 3/2/21, second 5/5/21" -> "3/2/21 5/5/21" or "5/5/21"
"first delivery 2/2/21 second one 8/30/21" -> "2/2/21 8/30/21" or "8/30/21"
"reported 1st: 2/2/21." -> "2/2/21"
Thanks!
You can use REGEXP_SUBSTR() to grab the patterns: one or more digits; slash; one or more digits; slash; one or more digits.
If you have more than one of those patterns, then, create one row as output for each pattern found. For that, CROSS JOIN with a consecutive series of integers, so you can output the n-th occurrence of the pattern. Then, cast the found string as DATE.
Finally, and only if you need just the last date, apply Vertica's analytic LIMIT clause to output only the row with the highest i value for each id (a column I had to add to the input).
WITH
-- need a sequence of integers ...
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
)
,
indata(id,s) AS (
SELECT 1,'1st order on 3/2/21, second 5/5/21'
UNION ALL SELECT 2,'first delivery 2/2/21 second one 8/30/21'
UNION ALL SELECT 3,'reported 1st: 2/2/21.'
)
SELECT
id
, i
, s
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i) AS found_token
, REGEXP_SUBSTR(s,'\d+/\d+/\d+',1,i)::DATE AS found_date
FROM indata CROSS JOIN i
WHERE REGEXP_SUBSTR(s,'(\d+/\d+/\d+)',1,i,'',1) <>''
-- remove the following line if you want all dates from all strings
-- and keep it if you only want the last date in the string
LIMIT 1 OVER(PARTITION BY id ORDER BY i DESC)
;
id | i | s | found_token | found_date
----+---+------------------------------------------+-------------+------------
1 | 2 | 1st order on 3/2/21, second 5/5/21 | 5/5/21 | 2021-05-05
2 | 2 | first delivery 2/2/21 second one 8/30/21 | 8/30/21 | 2021-08-30
3 | 1 | reported 1st: 2/2/21. | 2/2/21 | 2021-02-02
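The same digits/slash/digits/slash/digits pattern can be tried out in Python before running it against the table. This is only a sketch for checking the pattern against sample notes; the helper names extract_dates and last_date are hypothetical, and it assumes month/day/two-digit-year order, as the sample output (8/30/21 becoming 2021-08-30) suggests.

```python
import re
from datetime import datetime

# Same pattern as in the REGEXP_SUBSTR() calls: digits, slash, digits, slash, digits.
DATE_PATTERN = re.compile(r"\d+/\d+/\d+")

def extract_dates(note):
    """Return every date-like token in the note, in order of appearance."""
    return DATE_PATTERN.findall(note)

def last_date(note):
    """Return the last date-like token as a date, or None if there is none."""
    tokens = extract_dates(note)
    if not tokens:
        return None
    # Assumption: month/day/two-digit year.
    return datetime.strptime(tokens[-1], "%m/%d/%y").date()

all_tokens = extract_dates("1st order on 3/2/21, second 5/5/21")
latest = last_date("first delivery 2/2/21 second one 8/30/21")
```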
Consistency is critical when parsing string data. If it will always end with a date preceded by a space, pulling the last date should be fairly simple. Consider:
Trim(Mid(Note, InStrRev(Note, " ")))

matching array in Postgres with string manipulation

I was working with the "<@" operator and two arrays of strings.
anyarray <@ anyarray → boolean
Every string is formed in this way: ${name}_${number}, and I would like to check if the name part is included and the number is equal or lower than the one in the other array.
['elementOne_10'] & [['elementOne_7' , 'elementTwo20']] → true
['elementOne_10'] & [['elementOne_17', 'elementTwo20']] → false
what would be an efficient way to do this?
Assuming your sample data elementTwo20 in fact follows your described schema and should be elementTwo_20:
step-by-step demo:db<>fiddle
SELECT
id
FROM (
SELECT
*,
split_part(u, '_', 1) as name, -- 3
split_part(u, '_', 2)::int as num,
split_part(compare, '_', 1) as comp_name,
split_part(compare, '_', 2)::int as comp_num
FROM
t,
unnest(data) u, -- 1
(SELECT unnest('{elementOne_10}'::text[]) as compare) s -- 2
)s
GROUP BY id -- 4
HAVING
ARRAY_AGG(name) @> ARRAY_AGG(comp_name) -- 5
AND MAX(comp_num) BETWEEN MIN(num) AND MAX(num)
1. unnest() your array elements into one element per record.
2. JOIN and unnest() your comparison data.
3. Split the element strings into their name and num parts.
4. unnest() creates several records per original array; they can be grouped by an identifier (best is an id column).
5. Filter with your criteria in the HAVING clause: compare the name parts, for example with array operators; for the BETWEEN comparison you can use MIN and MAX on the num part.
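To make the intended rule concrete, here is a small Python sketch of the check as the question states it: a required element matches when the other array holds the same name with a number equal to or lower than the required one. It is not a translation of the HAVING clause above; the function names parse and matches are hypothetical, and it assumes every element follows the name_number schema.

```python
def parse(element):
    """Split 'name_number' at the last underscore into (name, int(number))."""
    name, _, number = element.rpartition("_")
    return name, int(number)

def matches(required, candidates):
    """True if every required name occurs in candidates with a number <= the required one."""
    lowest = {}
    for element in candidates:
        name, number = parse(element)
        # Keep the lowest number seen for each name.
        lowest[name] = min(number, lowest.get(name, number))
    return all(
        name in lowest and lowest[name] <= limit
        for name, limit in map(parse, required)
    )

first = matches(["elementOne_10"], ["elementOne_7", "elementTwo_20"])
second = matches(["elementOne_10"], ["elementOne_17", "elementTwo_20"])
```

The two calls reproduce the examples from the question: the first is true (7 is not above 10), the second false.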
Note:
As @a_horse_with_no_name correctly mentioned: If possible, think about your database design and normalize it:
Don't store arrays -> You don't need to unnest them on every operation
Relevant data should be kept separated, not concatenated as a string -> You don't need to split them on every operation
id | name | num
---------------------
1 | elementOne | 7
1 | elementTwo | 20
2 | elementOne | 17
2 | elementTwo | 20
This is exactly the result of the inner subquery. You have to create this every time you need these data. It's better to store the data like this.

PLSQL - Count of all characters within a string

I want to be able to generate a count of all characters in a given string from the result of an Oracle PLSQL query.
For instance, given the string "strings", output would be as such
character | count
-----------------
g | 1
i | 1
n | 1
r | 1
s | 2
t | 1
My thinking was something along the lines of
SELECT COLUMN, COUNT(COLUMN) FROM TABLE GROUP BY COLUMN
but that would require converting a string into a set of characters which is where I'm stuck.
Ideally this extends to a count of all ASCII characters not just A-Z, in order to perform analysis on the contents of the database.
I'm curious if there's a better way to do this than creating a procedure and whitelisting characters to count and running that on a given string.
This is a commonly used way to split a string into characters;
once you have one record for each character, counting them is quite straightforward:
select single_char, count(*)
from (
select substr(x, level, 1) as single_char
from (select 'abbabbaccb' x from dual)
connect by level <= length(x)
)
group by single_char
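The split-then-count idea is easy to verify on a sample string outside Oracle. A minimal Python sketch (the name char_counts is hypothetical) that counts every character, not just A-Z, and sorts the result by character like the desired output in the question:

```python
from collections import Counter

def char_counts(text):
    """Return (character, count) pairs for every character in text, sorted by character."""
    return sorted(Counter(text).items())

result = char_counts("strings")
```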

Combine elements of array into different array

I need to split text elements in an array and combine the elements (array_agg) by index into different rows
E.g., input is
'{cat$ball$x... , dog$bat$y...}'::text[]
I need to split each element by '$' and the desired output is:
{cat,dog} - row 1
{ball,bat} - row 2
{x,y} - row 3
...
Sorry for not being clear the first time. I have edited my question. I tried similar options but was unable to figure out how to get it with multiple text elements separated by the '$' symbol.
Exactly two parts per array element (original question)
Use unnest(), split_part() and array_agg():
SELECT array_agg(split_part(t, '$', 1)) AS col1
, array_agg(split_part(t, '$', 2)) AS col2
FROM unnest('{cat$ball, dog$bat}'::text[]) t;
Related:
Split comma separated column data into additional columns
General solution (updated question)
For any number of arrays with any number of elements containing any number of parts.
Demo for a table tbl:
CREATE TABLE tbl (tbl_id int PRIMARY KEY, arr text[]);
INSERT INTO tbl VALUES
(1, '{cat1$ball1, dog2$bat2}') -- 2 parts per array element, 2 elements
, (2, '{cat$ball$x, dog$bat$y}') -- 3 parts ...
, (3, '{a1$b1$c1$d1, a2$b2$c2$d2, a3$b3$c3$d3}'); -- 4 parts, 3 elements
Query:
SELECT tbl_id, idx, array_agg(elem ORDER BY ord) AS pivoted_array
FROM tbl t
, unnest(t.arr) WITH ORDINALITY a1(string, ord)
, unnest(string_to_array(a1.string, '$')) WITH ORDINALITY a2(elem, idx)
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
We are looking at two (nested) LATERAL joins here. LATERAL requires Postgres 9.3 or later. Details:
What is the difference between LATERAL and a subquery in PostgreSQL?
WITH ORDINALITY for the first unnest() is up for debate. A simpler query normally works, too. It's just not guaranteed to work according to SQL standards:
SELECT tbl_id, idx, array_agg(elem) AS pivoted_array
FROM tbl t
, unnest(t.arr) string
, unnest(string_to_array(string, '$')) WITH ORDINALITY a2(elem, idx)
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
Details:
PostgreSQL unnest() with element number
WITH ORDINALITY requires Postgres 9.4 or later. An equivalent query for Postgres 9.3, which lacks it:
SELECT tbl_id, idx, array_agg(arr2[idx]) AS pivoted_array
FROM tbl t
, LATERAL (
SELECT string_to_array(string, '$') AS arr2 -- convert string to array
FROM unnest(t.arr) string -- unnest org. array
) x
, generate_subscripts(arr2, 1) AS idx -- unnest 2nd array with ord. numbers
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
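With the demo data above, the first query returns the following. The other variants return the same rows, but only the first one guarantees the element order inside each array:
tbl_id | idx | pivoted_array
--------+-----+---------------
1 | 1 | {cat1,dog2}
1 | 2 | {ball1,bat2}
2 | 1 | {cat,dog}
2 | 2 | {ball,bat}
2 | 3 | {x,y}
3 | 1 | {a1,a2,a3}
3 | 2 | {b1,b2,b3}
3 | 3 | {c1,c2,c3}
3 | 4 | {d1,d2,d3}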
SQL Fiddle (still stuck on pg 9.3).
The only requirement for these queries is that the number of parts in elements of the same array is constant. We could even make it work for a varying number of parts using crosstab() with two parameters to fill in NULL values for missing parts, but that's beyond the scope of this question:
PostgreSQL Crosstab Query
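The pivot itself (split every element, then collect the parts by index) can be illustrated for a single array in Python. This is only a sketch for checking expected results, with a hypothetical function name pivot; like the queries above, it assumes all elements of the array have the same number of parts, and zip() would silently cut off extra parts otherwise.

```python
def pivot(arr, sep="$"):
    """Split each element on sep and regroup the parts by their index."""
    parts = [element.split(sep) for element in arr]
    # zip(*parts) pairs up the first parts, then the second parts, and so on.
    return [list(column) for column in zip(*parts)]

rows = pivot(["cat$ball$x", "dog$bat$y"])
```

Each entry of rows corresponds to one output row (one idx value) of the SQL queries.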
A bit messy but you could unnest the array, use regex to separate the text and then aggregate back up again:
with a as (select unnest('{cat$ball, dog$bat}'::_text) some_text),
b as (select regexp_matches(a.some_text, '(^[a-z]*)\$([a-z]*$)') animal_object from a)
select array_agg(animal_object[1]) animal, array_agg(animal_object[2]) a_object
from b
If you're processing multiple records at once you may want to use something like a row number before the unnest so that you have a group by to aggregate back to an array in your final select statement.