SQL: Divide long text into multiple rows

I would like to split a long text into multiple rows. There are other questions similar to this one, but none of their answers worked for me.
What I have
ID | Message
----------------------------------
1 | Very looooooooooooooooong text
2 | Short text
What I would like to do is divide that string every n characters
Result if n = 15:
Id | Message
------------------------------------------
1 | Very looooooooo
1 | oooooooong text
2 | Short text
Even better if the split is done at the first space after n characters.
I tried with string_split and substring but I cannot find anything that works.
I thought to use something similar to this:
SELECT index, element FROM table, CAST(message AS SUPER) AS element AT index;
But it doesn't take into account the length and I don't like casting a varchar variable into a super.

You can use generate_series() to accomplish this:
select m.*, gs.posn, substring(m.message, gs.posn, 15) as split_message
from messages m
cross join lateral generate_series(1, length(message), 15) gs(posn);
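To make the arithmetic of the fixed-width split explicit, here is a minimal Python sketch of the same idea. It is only an illustration, not part of the SQL answer: the function name split_fixed and the sample rows are assumptions, and the long message is rebuilt so that it has exactly 30 characters, as the expected result in the question implies.

```python
def split_fixed(message: str, n: int = 15) -> list[str]:
    """Cut message into consecutive pieces of at most n characters."""
    # Start offsets 0, n, 2n, ... play the role of generate_series(1, length, n),
    # which is 1-based in SQL and 0-based here.
    return [message[start:start + n] for start in range(0, len(message), n)]


# Hypothetical sample rows modelled on the question's table.
rows = {1: "Very l" + "o" * 17 + "ng text", 2: "Short text"}

for row_id, message in rows.items():
    for piece in split_fixed(message, 15):
        print(row_id, "|", piece)
```

An empty message yields no pieces at all, which matches generate_series returning no rows when the stop value is below the start value.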
Splitting on spaces after the length is a little trickier. We would have to split the message into words and then figure out how to break them into groups and then reaggregate.
I could not figure out how to split on spaces without recursion. I hope you don't mind that it treats all whitespace as word boundaries:
with recursive by_words as (
select m.*, s.n, s.word, length(s.word) as word_len,
max(s.n) over (partition by m.id) as num_words
from messages m
cross join lateral regexp_split_to_table(m.message, '\s+')
with ordinality as s(word, n)
), rejoin as (
select id, n, array[word] as words, word_len as cum_word_len,
word_len >= 15 as keep
from by_words
where n = 1
union all
select p.id, c.n,
case
when p.cum_word_len >= 15 then array[c.word]
else p.words||c.word
end as words,
case
when p.cum_word_len >= 15 then c.word_len
else p.cum_word_len + c.word_len + 1
end as cum_word_len,
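-- test the length of the segment being built, not the one just closed
(case
when p.cum_word_len >= 15 then c.word_len
else p.cum_word_len + c.word_len + 1
end >= 15)
or (c.n = c.num_words) as keep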
from rejoin p
join by_words c on (c.id, c.n) = (p.id, p.n + 1)
)
select id,
row_number() over (partition by id
order by n) as segnum,
array_to_string(words, ' ') as split_message
from rejoin
where keep
order by 1, 2
;
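The grouping rule the recursive query is aiming for can be sketched outside SQL as well. The following Python version is an illustration under assumptions (the function name split_on_words is made up, and it is not a translation of the query): words are collected until the segment reaches n characters, then the segment is closed and a new one starts.

```python
def split_on_words(message: str, n: int = 15) -> list[str]:
    """Group words into segments, closing a segment once it reaches n characters."""
    segments: list[str] = []
    current: list[str] = []
    for word in message.split():            # any run of whitespace is a word boundary
        current.append(word)
        if len(" ".join(current)) >= n:     # segment is long enough: emit it
            segments.append(" ".join(current))
            current = []
    if current:                             # leftover words form the last segment
        segments.append(" ".join(current))
    return segments


print(split_on_words("Very l" + "o" * 17 + "ng text", 15))
print(split_on_words("Short text", 15))
```

As in the query, a segment always ends on a word boundary, so it can be longer than n when the word that crosses the limit is itself long.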
Edit to add:
Can you please tell me whether the below works in Redshift?
with gs as (
select generate_series as posn
from generate_series(1, 150000, 15)
)
select *, substring(m.message, gs.posn, 15) as split_message
from messages m
join gs
on gs.posn <= greatest(1, length(m.message))
order by m.id, gs.posn
;

Thanks to @Mike Organek's answer and his help, I found a solution that works with Redshift too.
The problem with Mike's answer on Redshift is that generate_series is not well supported there, so here's a workaround.
with row as (
select t.*, row_number() over () as x
from table t -- big enough table
limit 100
),
result as
(
select (x-1)*15+1 as posn from row --change 15 to a number to split the long text with
)
select * into gs
from result
And then Mike's answer:
select *, substring(m.message from gs.posn for 15) as split_message
from messages m
join gs
on gs.posn <= greatest(1, length(m.message))
order by m.id, gs.posn

Related

Perform loop and calculation on BigQuery Array type

My original data, B is an array of INT64:
I want to calculate the differences B[n+1] - B[n] and store the result in a new table, as follows:
I figured out I can somehow achieve this by using LOOP and IF condition:
DECLARE x INT64 DEFAULT 0;
LOOP
SET x = x + 1
IF(x < array_length(table.B))
THEN INSERT INTO newTable (SELECT A, B[OFFSET(x+1)] - B[OFFSET(x)]) from table
END IF;
END LOOP;
The problem is that the idea above doesn't work for each row of my data, because I would still need to loop through every row in the table, and I can't find a way to integrate the scripting part into a normal query, where I can
SELECT A, [calculation script] from table
Can someone point me how can I do it? Or any better way to solve this problem?
Thank you.
Below actually works - BigQuery
select * replace(
array(select diff from (
select offset, lead(el) over(order by offset) - el as diff
from unnest(B) el with offset
) where not diff is null
order by offset
) as B
)
from `project.dataset.table` t
if to apply to sample data in your question - output is
You can use unnest() with offset for this purpose:
select id, a,
array_agg(b_el - prev_b_el order by n) as b_diffs
from (select t.*, b_el, lag(b_el) over (partition by t.id order by n) as prev_b_el
from t cross join
unnest(b) b_el with offset n
) t
where prev_b_el is not null
group by t.id, t.a
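Both queries pair every array element with its neighbour and subtract. For readers who want to check the expected values by hand, here is a small Python sketch of that calculation. The sample rows and the name adjacent_differences are assumptions made for illustration, since the data in the question was posted as images.

```python
def adjacent_differences(values: list[int]) -> list[int]:
    """Return B[n+1] - B[n] for every consecutive pair in values."""
    # zip pairs each element with its successor: (B[0], B[1]), (B[1], B[2]), ...
    return [nxt - cur for cur, nxt in zip(values, values[1:])]


# Hypothetical rows: A is a label, B is an array of integers.
table = [{"A": "x", "B": [1, 4, 9, 16]}, {"A": "y", "B": [10, 7]}]

result = [{"A": row["A"], "B": adjacent_differences(row["B"])} for row in table]
print(result)
```

An array with n elements produces n - 1 differences, which is why both queries filter out the row where no neighbouring value exists.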

Select by length of characters

I have to select the longest phrase that has points>0 but is contained in a phrase which has points=0. If you look at the demo, then the rows in the output would be numbers 3 and 6:
http://sqlfiddle.com/#!18/e954f/1/0
many thanks in advance.
You can use a CTE to find all phrases with positive points which are a substring of a phrase with 0 points. Then you can find the maximum length of the substrings associated with each 0 point phrase, and JOIN that back to the CTE to get the phrase that matches that condition:
WITH cte AS (
SELECT w1.*, w2.id AS w2_id
FROM words w1
JOIN (SELECT *
FROM words
WHERE points = 0) w2 ON w1.phrase = LEFT(w2.phrase, LEN(w1.phrase))
WHERE w1.points > 0
)
SELECT cte.id, cte.phrase, points
FROM cte
JOIN (SELECT w2_id, MAX(LEN(phrase)) AS max_len
FROM cte
GROUP BY w2_id) cte_max ON cte_max.w2_id = cte.w2_id AND cte_max.max_len = LEN(cte.phrase)
Output:
id phrase points
3 tool box online 1
6 stone road 1
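The selection logic can be checked on a small example outside the database. The Python sketch below is an illustration only: the rows are invented (the original fiddle data is not reproduced here), the name longest_prefix_matches is an assumption, and it uses the same prefix match as the query rather than a general "contained anywhere" test.

```python
def longest_prefix_matches(words: list[tuple[int, str, int]]) -> list[tuple[int, str, int]]:
    """For each zero-point phrase, pick the longest positive-point phrase that is its prefix."""
    best: dict[int, tuple[int, str, int]] = {}
    for zero_id, zero_phrase, zero_points in words:
        if zero_points != 0:
            continue
        for row in words:
            _, phrase, points = row
            if points > 0 and zero_phrase.startswith(phrase):
                # Keep the longest candidate seen so far for this zero-point phrase.
                if zero_id not in best or len(phrase) > len(best[zero_id][1]):
                    best[zero_id] = row
    return sorted(best.values())


# Invented sample rows: (id, phrase, points).
words = [
    (1, "tool", 1),
    (2, "tool box", 1),
    (3, "tool box online", 1),
    (4, "tool box online free", 0),
    (5, "stone", 1),
    (6, "stone road", 1),
    (7, "stone road map", 0),
]
print(longest_prefix_matches(words))
```

One difference from the query: if two candidates tie on length, this sketch keeps only the first one, whereas the join on the maximum length returns all of them.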
You can use an inner join comparing the phrases with a LIKE to get only the ones contained in another phrase. Filter for the point in a WHERE clause. Then get the rank() partitioned by the phrase from the joined instance and ordered by the length descending. In an outer SELECT only get the ones with a rank of one.
SELECT x.id,
x.phrase,
x.points
FROM (SELECT w1.id,
w1.phrase,
w1.points,
rank() OVER (PARTITION BY w2.phrase
ORDER BY len(w1.phrase) DESC) r
FROM words w1
INNER JOIN words w2
ON w2.phrase LIKE concat(w1.phrase, '%')
WHERE w2.points = 0
AND w1.points > 0) x
WHERE x.r = 1;
Edit:
To include the other phrase:
SELECT x.id,
x.phrase,
x.other_phrase,
x.points
FROM (SELECT w1.id,
w1.phrase,
w2.phrase other_phrase,
w1.points,
rank() OVER (PARTITION BY w2.phrase
ORDER BY len(w1.phrase) DESC) r
FROM words w1
INNER JOIN words w2
ON w2.phrase LIKE concat(w1.phrase, '%')
WHERE w2.points = 0
AND w1.points > 0) x
WHERE x.r = 1;
This returns the phrases with points>0, ordered from the longest to the shortest:
SELECT *, LEN(phrase) AS Length FROM words where points>0 ORDER BY LEN(phrase) DESC
And if you only want the longest phrase:
SELECT TOP 1 *, LEN(phrase) AS Length FROM words where points>0 ORDER BY LEN(phrase) DESC

Replace string with random text - Oracle SQL

I have a table table1 with 1 column - edi_value which is of type CLOB.
These are the entries:
seq edi_message
1 ISA*00* *00* *08*9254110060 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~
GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~
ST*824*021390001*005010X186A1~
2 ISA*00* *00* *08*56789876678 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~
GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~
ST*824*021390001*005010X186A1~
Please note - there can be varying number of lines, from 3 to 500.
What I'm looking for is the following conditions:
Ignore the text before the first * on every line; it should not change. For example, GS and ST should stay as they are. Only the text after the first * should be randomized
Replace digits [0-9] with random digits; for example, if 0 is replaced with 1, then it should be 1 throughout.
Replace letters [A-Za-z] with random letters; for example, if A is replaced with W, then it should be replaced with W throughout
Leave special characters as is
One character/number should ONLY map to one random character/number
Output can be:
seq edi_message
1 ISA*11* *11* *13*4030111101 *QQ*102030234 *101010*1313*U*11311*111143121*1*V*>~
GS*WE*3122000233*102030234*01101010*1313*43121*X*113111~
ST*300*101241111*113111X130A1~
2 ISA*11* *11* *13*30234320023 *QQ*102030234 *101010*1313*U*11311*111143121*1*V*>~
GS*WE*3122000233*102030234*01101010*1313*43121*X*113111~
ST*300*101241111*113111X130W1~
How can this be achieved in Oracle SQL?
You can use translate with a helper function for generating random strings (though @LukStorms has a much neater SQL solution for that using LISTAGG), along with a method to tokenise and then re-concatenate the values into lines (I use a pure SQL method here for demonstration):
create or replace function f(p_low integer, p_high integer)
return varchar as
r varchar(2000) := '';
x integer;
begin
for i in p_low..p_high loop
x := dbms_random.value(0,length(r)+1);
r := substr(r,1,x)||chr(i)||substr(r,x+1);
end loop;
return r;
end;
/
select * from table1;
| EDI_VALUE |
| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ISA*00* *00* *08*9254110060 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~<br> GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~<br> ST*824*021390001*005010X186A1~ |
| ISA*00* *00* *08*56789876678 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~<br> GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~<br> ST*824*021390001*005010X186A |
with t as (select f(48,57)||f(65,90) translate_chars from dual)
select (select new_value
from (select substr(sys_connect_by_path(r_line,'
'),2) new_value, connect_by_isleaf isleaf
from (select lvl
, substr(line,1,instr(line,'*')-1)||
translate(substr(line,instr(line,'*'))
,'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
,(select translate_chars from t)) r_line
from (select level lvl
, regexp_substr(edi_value,'^.*$',1,level,'m') line
from (select table1.edi_value from dual)
connect by level <= regexp_count(edi_value,'^.*$',1,'m')))
start with lvl=1 connect by lvl=(prior lvl)+1)
where isleaf=1)
from table1;
| (SELECTNEW_VALUEFROM(SELECTSUBSTR(SYS_CONNECT_BY_PATH(R_LINE,''),2)NEW_VALUE,CONNECT_BY_ISLEAFISLEAFFROM(SELECTLVL,SUBSTR(LINE,1,INSTR(LINE,'*')-1)||TRANSLATE(SUBSTR(LINE,INSTR(LINE,'*')),'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ',(SELECTTRANSLATE_CHARSFR |
| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ISA*66* *66* *67*1935006626 *VV*098532471 *650902*6763*K*66360*666613640*6*P*>~<br> GS*GZ*3084295877*098532471*96650902*6763*13640*I*663606~<br> ST*795*690816660*663606I072G0~ |
| ISA*66* *66* *67*32471742247 *VV*098532471 *650902*6763*K*66360*666613640*6*P*>~<br> GS*GZ*3084295877*098532471*96650902*6763*13640*I*663606~<br> ST*795*690816660*663606I072G |
You can use CTEs with CONNECT BY to generate the strings for the letters and numbers.
Then use the ordered and scrambled strings in the translate.
A CROSS APPLY can be used to REGEX split the message into parts.
Then only translate those that start with a *.
And use LISTAGG to glue the parts back together.
WITH
NUMS as
(
select
LISTAGG(n, '') WITHIN GROUP (ORDER BY n) as n_from,
LISTAGG(n, '') WITHIN GROUP (ORDER BY DBMS_RANDOM.VALUE) as n_to
from (select level-1 n from dual connect by level <= 10)
),
LETTERS as
(
select
LISTAGG(c, '') WITHIN GROUP (ORDER BY c) as c_from,
LISTAGG(c, '') WITHIN GROUP (ORDER BY DBMS_RANDOM.VALUE) as c_to
from (select chr(ascii('A')+level-1 ) c from dual connect by level <= 26)
)
SELECT ca.scrambled as scrambled_message
FROM table1 t
CROSS JOIN NUMS
CROSS JOIN LETTERS
CROSS APPLY
(
SELECT LISTAGG(CASE WHEN part like '*%' then translate(part, n_from||c_from, n_to||c_to) else part end, '') WITHIN GROUP (ORDER BY lvl) as scrambled
FROM
(
SELECT
level AS lvl,
REGEXP_SUBSTR(t.edi_message,'[*]\S+|[^*]+',1,level,'m') AS part
FROM dual
CONNECT BY level <= regexp_count(t.edi_message, '[*]\S+|[^*]+')+1
) parts
) ca;
Example output:
SCRAMBLED_MESSAGE
-----------------------------------------------------------------------------------------------------------
ISA*99* *99* *92*3525999959 *PP*950525023 *959595*9292*A*99299*999932909*9*J*>~
GS*WQ*2900555022*950525023*59959595*9292*32909*I*992999~
ST*255*959039999*992999I925V9~
ISA*99* *99* *92*25023205502 *PP*950525023 *959595*9292*A*99299*999932909*9*J*>~
GS*WQ*2900555022*950525023*59959595*9292*32909*I*992999~
ST*255*959039999*992999I925W9~
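Both answers rely on the same idea: build one shuffled alphabet, then translate every character after the first * through it. A minimal Python sketch of that idea follows, as an illustration rather than an Oracle solution. The names build_mapping and scramble are assumptions, and, like the queries above, it only remaps digits and uppercase letters (lowercase letters would pass through unchanged).

```python
import random
import string


def build_mapping(rng: random.Random) -> dict[int, int]:
    """Map each digit to a random digit and each uppercase letter to a random letter."""
    digits = list(string.digits)
    letters = list(string.ascii_uppercase)
    # Shuffling yields a one-to-one mapping, so one character never maps to two.
    shuffled = rng.sample(digits, len(digits)) + rng.sample(letters, len(letters))
    return str.maketrans("".join(digits + letters), "".join(shuffled))


def scramble(message: str, mapping: dict[int, int]) -> str:
    """Translate everything after the first '*' of each line; keep the rest as is."""
    out = []
    for line in message.splitlines(keepends=True):
        head, star, tail = line.partition("*")   # head is the segment id, e.g. ISA, GS
        out.append(head + star + tail.translate(mapping))
    return "".join(out)


mapping = build_mapping(random.Random())
print(scramble("ISA*00*ZZ*12A~\nGS*AG*513~\n", mapping))
```

Because the mapping is a permutation, two different input characters never collapse into the same output character, which is stricter than the sample output in the question.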

How do I create a list of all possible anagrams of a word/string in PostgreSQL

How do I create a list of all possible anagrams of a word/string in PostgreSQL.
For example if String is 'act'
then the desired output should be:
act,
atc,
cta,
cat,
tac,
tca
I have a table 'tbl_words' which contains millions of words.
I then want to search this list of anagrams for only the valid words in my database table.
For example, from the list of anagrams above the valid words are: act, cat.
Is there any way to do this?
Update 1:
I need output like this:
(all permutation for given word )
any idea ??
The query generates all permutations of a 3-element set:
with recursive numbers as (
select generate_series(1, 3) as i
),
rec as (
select i, array[i] as p
from numbers
union all
select n.i, p || n.i
from numbers n
join rec on cardinality(p) < 3 and not n.i = any(p)
)
select p as permutation
from rec
where cardinality(p) = 3
order by 1
permutation
-------------
{1,2,3}
{1,3,2}
{2,1,3}
{2,3,1}
{3,1,2}
{3,2,1}
(6 rows)
Modify the final query to generate permutations of the letters of a given word:
with recursive numbers as (
select generate_series(1, 3) as i
),
rec as (
select i, array[i] as p
from numbers
union all
select n.i, p || n.i
from numbers n
join rec on cardinality(p) < 3 and not n.i = any(p)
)
select a[p[1]] || a[p[2]] || a[p[3]] as result
from rec
cross join regexp_split_to_array('act', '') as a
where cardinality(p) = 3
order by 1
result
--------
act
atc
cat
cta
tac
tca
(6 rows)
Here is a solution:
with recursive params as (
select *
from (values ('cata')) v(str)
),
nums as (
select str, 1 as n
from params
union all
select str, 1 + n
from nums
where n < length(str)
),
pos as (
select str, array[n] as poses, array_remove(array_agg(n) over (partition by str), n) as rests, 1 as lev
from nums
union all
select pos.str, array_append(pos.poses, nums.n), array_remove(rests, nums.n), lev + 1
from pos join
nums
on pos.str = nums.str and array_position(pos.rests, nums.n) > 0
where cardinality(rests) > 0
)
select distinct pos.str , string_agg(substr(pos.str, thepos, 1), '')
from pos cross join lateral
unnest(pos.poses) thepos
where cardinality(rests) = 0
group by pos.str, pos.poses;
This is quite tricky, particularly when there are repeated letters in the string. The approach taken here generates all permutations of the numbers from 1 to n, where n is the length of the string. It then uses these as indexes to extract characters from the original string.
Those who are keen will notice that this uses select distinct with group by. That seems like the easiest way to avoid duplication in the resultant strings.
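To see what the final result should look like, including the lookup of valid words, here is a short Python sketch. It is not a PostgreSQL solution: it swaps the recursive CTE for the standard library's itertools.permutations, which also permutes by position, and the dictionary set stands in for the tbl_words table (an assumption made for illustration).

```python
from itertools import permutations


def anagrams(word: str) -> list[str]:
    """Return every distinct rearrangement of the letters in word."""
    # permutations() treats positions as distinct, so a set removes the
    # duplicates that repeated letters (e.g. "cata") would otherwise produce.
    return sorted({"".join(p) for p in permutations(word)})


def valid_anagrams(word: str, dictionary: set[str]) -> list[str]:
    """Keep only the anagrams that appear in the dictionary."""
    return [candidate for candidate in anagrams(word) if candidate in dictionary]


print(anagrams("act"))                                # ['act', 'atc', 'cat', 'cta', 'tac', 'tca']
print(valid_anagrams("act", {"act", "cat", "dog"}))   # ['act', 'cat']
```

The number of permutations grows factorially with the word length, so for long words it may be cheaper to look words up by their sorted letters than to generate every anagram first.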

Find all possible combinations of array without permutations

Input is an array of 'n' length.
I need all combinations inside this array stored into new array.
IN: j='{A, B, C ..}'
OUT: k='{A, B, C, AB, AC, BC, ABC ..}'
Without repetitions, so without BA, CA etc.
Generic solution using a recursive CTE
Works for any number of elements and any base data type that supports the > operator.
WITH RECURSIVE t(i) AS (SELECT * FROM unnest('{A,B,C}'::text[])) -- provide array
, cte AS (
SELECT i::text AS combo, i, 1 AS ct
FROM t
UNION ALL
SELECT cte.combo || t.i::text, t.i, ct + 1
FROM cte
JOIN t ON t.i > cte.i
)
SELECT ARRAY (
SELECT combo
FROM cte
ORDER BY ct, combo
) AS result;
Result is an array of text in the example.
Note that you can have any number of additional non-recursive CTEs when using the RECURSIVE keyword.
More generic yet
If any of the following apply:
Array elements are non-unique (like '{A,B,B}').
The base data type does not support the > operator (like json).
Array elements are very big - for better performance.
Use a row number instead of comparing elements:
WITH RECURSIVE t AS (
SELECT i::text, row_number() OVER () AS rn
FROM unnest('{A,B,B}'::text[]) i -- duplicate element!
)
, cte AS (
SELECT i AS combo, rn, 1 AS ct
FROM t
UNION ALL
SELECT cte.combo || t.i, t.rn, ct + 1
FROM cte
JOIN t ON t.rn > cte.rn
)
SELECT ARRAY (
SELECT combo
FROM cte
ORDER BY ct, combo
) AS result;
Or use WITH ORDINALITY in Postgres 9.4+ (see: PostgreSQL unnest() with element number).
Special case: generate decimal numbers
To generate decimal numbers with 5 digits along these lines:
WITH RECURSIVE t AS (
SELECT i
FROM unnest('{1,2,3,4,5}'::int[]) i
)
, cte AS (
SELECT i AS nr, i
FROM t
UNION ALL
SELECT cte.nr * 10 + t.i, t.i
FROM cte
JOIN t ON t.i > cte.i
)
SELECT ARRAY (
SELECT nr
FROM cte
ORDER BY nr
) AS result;
If n is small (< 20), all possible combinations can be found using a bitmask approach. An array of n elements has 2^n different combinations, and each number from 0 to (2^n - 1) represents one of them.
e.g. n=3
0 = 000b represents {}, the empty combination
2^3-1 = 7 = 111b represents the combination abc
Pseudocode as follows:
for b=0 to 2^n - 1 do # each combination
res=""
for i=0 to (n-1) do # which elements are included
if (b & (1<<i)) != 0 # bitwise AND: is bit i set in b?
res= res+arr[i]
end
end
print res
end
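A runnable version of the bitmask idea, as a minimal Python sketch. The function name combinations_by_bitmask is an assumption, and the empty combination (mask 0) is skipped because the expected output in the question does not contain it.

```python
def combinations_by_bitmask(items: list[str]) -> list[str]:
    """Return every non-empty combination of items, smallest combinations first."""
    n = len(items)
    found = []
    for mask in range(1, 1 << n):            # 1 .. 2^n - 1; 0 would be the empty set
        # Bit i of mask decides whether items[i] is part of this combination.
        combo = "".join(items[i] for i in range(n) if mask & (1 << i))
        size = bin(mask).count("1")          # number of elements in the combination
        found.append((size, combo))
    found.sort()                             # order by size, then alphabetically
    return [combo for _, combo in found]


print(combinations_by_bitmask(["A", "B", "C"]))
# ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
```

Elements are always taken in array order, so BA or CA can never appear. The loop runs 2^n - 1 times, which is why this only suits small n.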