Return five rows of random DNA instead of just one - sql

This is the code I have to create a string of DNA:
prepare dna_length(int) as
with t1 as (
select chr(65) as s
union select chr(67)
union select chr(71)
union select chr(84) )
, t2 as ( select s, row_number() over() as rn from t1)
, t3 as ( select generate_series(1,$1) as i, round(random() * 4 + 0.5) as rn )
, t4 as ( select t2.s from t2 join t3 on (t2.rn=t3.rn))
select array_to_string(array(select s from t4),'') as dna;
execute dna_length(20);
I am trying to figure out how to re-write this to give a table of 5 rows of strings of DNA of length 20 each, instead of just one row. This is for PostgreSQL.
I tried:
CREATE TABLE dna_table(g int, dna text);
INSERT INTO dna_table (1, execute dna_length(20));
But this does not seem to work. I am an absolute beginner. How to do this properly?

PREPARE creates a prepared statement that can be used "as is". If your prepared statement returns one string then you can only get one string. You can't use it in other operations like insert, e.g.
In your case you may create a function:
create or replace function dna_length(int) returns text as
$$
with t1 as (
select chr(65) as s
union
select chr(67)
union
select chr(71)
union
select chr(84))
, t2 as (select s,
row_number() over () as rn
from t1)
, t3 as (select generate_series(1, $1) as i,
round(random() * 4 + 0.5) as rn)
, t4 as (select t2.s
from t2
join t3 on (t2.rn = t3.rn))
select array_to_string(array(select s from t4), '') as dna
$$ language sql;
And use it in a way like this:
insert into dna_table(g, dna) select generate_series(1,5), dna_length(20)
From the official doc:
PREPARE creates a prepared statement. A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
About functions.

This can be much simpler and faster:
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1,100) g -- 100 = 5 rows * 20 nucleotides
GROUP BY g%5;
random() produces random value in the range 0.0 <= x < 1.0. Multiply by 4 and take the mathematical ceiling with ceil() (cheaper than round()), and you get a random distribution of the numbers 1-4. Convert to ACTG, and aggregate with GROUP BY g%5 - % being the modulo operator.
About string_agg():
Concatenate multiple result rows of one column into one, group by another column
As prepared statement, taking
$1 ... the number of rows
$2 ... the number of nucleotides per row
PREPARE dna_length(int, int) AS
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1, $1 * $2) g
GROUP BY g%$1;
Call:
EXECUTE dna_length(5,20);
Result:
| dna |
| :------------------- |
| ATCTTCGACACGTCGGTACC |
| GTGGCTGCAGATGAACAGAG |
| ACAGCTTAAAACACTAAGCA |
| TCCGGACCTCTCGACCTTGA |
| CGTGCGGAGTACCCTAATTA |
db<>fiddle here
If you need it a lot, consider a function instead. See:
What is the difference between a prepared statement and a SQL or PL/pgSQL function, in terms of their purposes?

Related

Replace string with random text - Oracle SQL

I have a table table1 with 1 column - edi_value which is of type CLOB.
These are the entries:
seq edi_message
1 ISA*00* *00* *08*9254110060 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~
GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~
ST*824*021390001*005010X186A1~
2 ISA*00* *00* *08*56789876678 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~
GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~
ST*824*021390001*005010X186A1~
Please note - there can be varying number of lines, from 3 to 500.
What I'm looking for is the following conditions:
Ignore text before first * in each line, for every line, before the first *, it should not change. For ex. GS, ST should not change. ONLY after the first * should randomize
Replace numbers [0-9] with random numbers, for ex. if 0 is replaced with 1, then it should be 1 througout.
Replace text [A-Za-z] with random text, for ex. if A is replaced with W, then it should be replaced with W throughout
Leave special characters as is
One character/number should ONLY map to one random character/number
Output can be:
seq edi_message
1 ISA*11* *11* *13*4030111101 *QQ*102030234 *101010*1313*U*11311*111143121*1*V*>~
GS*WE*3122000233*102030234*01101010*1313*43121*X*113111~
ST*300*101241111*113111X130A1~
2 ISA*11* *11* *13*30234320023 *QQ*102030234 *101010*1313*U*11311*111143121*1*V*>~
GS*WE*3122000233*102030234*01101010*1313*43121*X*113111~
ST*300*101241111*113111X130W1~
How can this be achieved in Oracle SQL?
You can use translate with a helper function for generating random strings (though #LukStorms has a much neater SQL solution for that using LISTAGG), along with a method to tokenise and then re-concatenate the values into lines (I use a pure SQL method here for demonstration):
create or replace function f(p_low integer, p_high integer)
return varchar as
r varchar(2000) := '';
x integer;
begin
for i in p_low..p_high loop
x := dbms_random.value(0,length(r)+1);
r := substr(r,1,x)||chr(i)||substr(r,x+1);
end loop;
return r;
end;
/
select * from table1;
| EDI_VALUE |
| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ISA*00* *00* *08*9254110060 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~<br> GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~<br> ST*824*021390001*005010X186A1~ |
| ISA*00* *00* *08*56789876678 *ZZ*123456789 *041216*0805*U*00501*000095071*0*P*>~<br> GS*AG*5137624388*123456789*20041216*0805*95071*X*005010~<br> ST*824*021390001*005010X186A |
with t as (select f(48,57)||f(65,90) translate_chars from dual)
select (select new_value
from (select substr(sys_connect_by_path(r_line,'
'),2) new_value, connect_by_isleaf isleaf
from (select lvl
, substr(line,1,instr(line,'*')-1)||
translate(substr(line,instr(line,'*'))
,'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
,(select translate_chars from t)) r_line
from (select level lvl
, regexp_substr(edi_value,'^.*$',1,level,'m') line
from (select table1.edi_value from dual)
connect by level <= regexp_count(edi_value,'^.*$',1,'m')))
start with lvl=1 connect by lvl=(prior lvl)+1)
where isleaf=1)
from table1;
| (SELECTNEW_VALUEFROM(SELECTSUBSTR(SYS_CONNECT_BY_PATH(R_LINE,''),2)NEW_VALUE,CONNECT_BY_ISLEAFISLEAFFROM(SELECTLVL,SUBSTR(LINE,1,INSTR(LINE,'*')-1)||TRANSLATE(SUBSTR(LINE,INSTR(LINE,'*')),'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ',(SELECTTRANSLATE_CHARSFR |
| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ISA*66* *66* *67*1935006626 *VV*098532471 *650902*6763*K*66360*666613640*6*P*>~<br> GS*GZ*3084295877*098532471*96650902*6763*13640*I*663606~<br> ST*795*690816660*663606I072G0~ |
| ISA*66* *66* *67*32471742247 *VV*098532471 *650902*6763*K*66360*666613640*6*P*>~<br> GS*GZ*3084295877*098532471*96650902*6763*13640*I*663606~<br> ST*795*690816660*663606I072G |
db<>fiddle here
You can use CTE's with a CONNECT to generate the strings for the letters and numbers.
Then use the ordered and scrambled strings in the translate.
A CROSS APPLY can be used to REGEX split the message into parts.
Then only translate those that start with a *.
And use LISTAGG to glue the parts back together.
WITH
NUMS as
(
select
LISTAGG(n, '') WITHIN GROUP (ORDER BY n) as n_from,
LISTAGG(n, '') WITHIN GROUP (ORDER BY DBMS_RANDOM.VALUE) as n_to
from (select level-1 n from dual connect by level <= 10)
),
LETTERS as
(
select
LISTAGG(c, '') WITHIN GROUP (ORDER BY c) as c_from,
LISTAGG(c, '') WITHIN GROUP (ORDER BY DBMS_RANDOM.VALUE) as c_to
from (select chr(ascii('A')+level-1 ) c from dual connect by level <= 26)
)
SELECT ca.scrambled as scrambled_message
FROM table1 t
CROSS JOIN NUMS
CROSS JOIN LETTERS
CROSS APPLY
(
SELECT LISTAGG(CASE WHEN part like '*%' then translate(part, n_from||c_from, n_to||c_to) else part end, '') WITHIN GROUP (ORDER BY lvl) as scrambled
FROM
(
SELECT
level AS lvl,
REGEXP_SUBSTR(t.edi_message,'[*]\S+|[^*]+',1,level,'m') AS part
FROM dual
CONNECT BY level <= regexp_count(t.edi_message, '[*]\S+|[^*]+')+1
) parts
) ca;
A test on db<>fiddle here
Example output:
SCRAMBLED_MESSAGE
-----------------------------------------------------------------------------------------------------------
ISA*99* *99* *92*3525999959 *PP*950525023 *959595*9292*A*99299*999932909*9*J*>~
GS*WQ*2900555022*950525023*59959595*9292*32909*I*992999~
ST*255*959039999*992999I925V9~
ISA*99* *99* *92*25023205502 *PP*950525023 *959595*9292*A*99299*999932909*9*J*>~
GS*WQ*2900555022*950525023*59959595*9292*32909*I*992999~
ST*255*959039999*992999I925W9~

can't use JOIN with generate_series on Redshift

generate_series function on Redshift works as expected, when used in a simple select statement.
WITH series AS (
SELECT n as id from generate_series (-10, 0, 1) n
) SELECT * FROM series;
-- Works fine
As soon as I add a JOIN condition, redshift throws
com.amazon.support.exceptions.ErrorException: Function
generate_series(integer,integer,integer)" not supported"
DROP TABLE testing;
CREATE TABLE testing (
id INT
);
WITH series AS (
SELECT n as id from generate_series (-10, 0, 1) n
) SELECT * FROM series S JOIN testing T ON S.id = T.id;
-- Function "generate_series(integer,integer,integer)" not supported.
Redshift Version
SELECT version();
-- PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.1485
Are there any workarounds to make this work?
generate_series is not supported by Redshift. It works only standalone on a leader node.
A workaround would be using row_number against any table that has sufficient number of rows:
with
series as (
select (row_number() over ())-11 from some_table limit 10
) ...
also, this question was asked multiple times already
You are correct that this does not work on Redshift.
See here.
The easiest workaround is to create a permanent table "manually" beforehand with the values within that table, e.g. you could have rows on that table for -1000 to +1000, then select the range from that table,
So for your example you would have something like
WITH series AS (
SELECT n as id from (select num as n from newtable where num between -10 and 0) n
) SELECT * FROM series S JOIN testing T ON S.id = T.id;
Does that work for you?
Alternatively, if you cannot create the table beforehand or prefer not to, you could use something like this
with ten_numbers as (select 1 as num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 0)
,generted_numbers AS
(
SELECT (1000*t1.num) + (100*t2.num) + (10*t3.num) + t4.num-5000 as gen_num
FROM ten_numbers AS t1
JOIN ten_numbers AS t2 ON 1 = 1
JOIN ten_numbers AS t3 ON 1 = 1
JOIN ten_numbers AS t4 ON 1 = 1
)
select gen_num from generted_numbers
where gen_num between -10 and 0
order by 1;

Select where record does not exists

I am trying out my hands on oracle 11g. I have a requirement such that I want to fetch those id from list which does not exists in table.
For example:
SELECT * FROM STOCK
where item_id in ('1','2'); // Return those records where result is null
I mean if item_id '1' is not present in db then the query should return me 1.
How can I achieve this?
You need to store the values in some sort of "table". Then you can use left join or not exists or something similar:
with ids as (
select 1 as id from dual union all
select 2 from dual
)
select ids.id
from ids
where not exists (select 1 from stock s where s.item_id = ids.id);
You can use a LEFT JOIN to an in-line table that contains the values to be searched:
SELECT t1.val
FROM (
SELECT '1' val UNION ALL SELECT '2'
) t1
LEFT JOIN STOCK t2 ON t1.val = t2.item_id
WHERE t2.item_id IS NULL
First create the list of possible IDs (e.g. 0 to 99 in below query). You can use a recursive cte for this. Then select these IDs and remove the IDs already present in the table from the result:
with possible_ids(id) as
(
select 0 as id from dual
union all
select id + 1 as id from possible_ids where id < 99
)
select id from possible_ids
minus
select item_id from stock;
A primary concern of the OP seems to be a terse notation of the query, notably the set of values to test for. The straightforwwrd recommendation would be to retrieve these values by another query or to generate them as a union of queries from the dual table (see the other answers for this).
The following alternative solution allows for a verbatim specification of the test values under the following conditions:
There is a character that does not occur in any of the test values provided ( in the example that will be - )
The number of values to test stays well below 2000 (to be precise, the list of values plus separators must be written as a varchar2 literal, which imposes the length limit ). However, this should not be an actual concern - If the test involves lists of hundreds of ids, these lists should definitely be retrieved froma table/view.
Caveat
Whether this method is worth the hassle ( not to mention potential performance impacts ) is questionable, imho.
Solution
The test values will be provided as a single varchar2 literal with - separating the values which is as terse as the specification as a list argument to the IN operator. The string starts and ends with -.
'-1-2-3-156-489-4654648-'
The number of items is computed as follows:
select cond, regexp_count ( cond, '[-]' ) - 1 cnt_items from (select '-1-2-3-156-489-4654648-' cond from dual)
A list of integers up to the number of items starting with 1 can be generated using the LEVEL pseudocolumn from hierarchical queries:
select level from dual connect by level < 42;
The n-th integer from that list will serve to extract the n-th value from the string (exemplified for the 4th value) :
select substr ( cond, instr(cond,'-', 1, 4 )+1, instr(cond,'-', 1, 4+1 ) - instr(cond,'-', 1, 4 ) - 1 ) si from (select cond, regexp_count ( cond, '[-]' ) - 1 cnt_items from (select '-1-2-3-156-489-4654648-' cond from dual) );
The non-existent stock ids are generated by subtracting the set of stock ids from the set of values. Putting it all together:
select substr ( cond, instr(cond,'-',1,level )+1, instr(cond,'-',1,level+1 ) - instr(cond,'-',1,level ) - 1 ) si
from (
select cond
, regexp_count ( cond, '[-]' ) - 1 cnt_items
from (
select '-1-2-3-156-489-4654648-' cond from dual
)
)
connect by level <= cnt_items + 1
minus
select item_id from stock
;

T-SQL "Dynamic" Join

Given the following SQL Server table with a single char(1) column:
Value
------
'1'
'2'
'3'
How do I obtain the following results in T-SQL?
Result
------
'1+2+3'
'1+3+2'
'2+1+3'
'2+3+1'
'3+2+1'
'3+1+2'
This needs to be dynamic too, so if my table only holds rows '1' and '2' I'd expect:
Result
------
'1+2'
'2+1'
It seems like I should be able to use CROSS JOIN to do this, but since I don't know how many rows there will be ahead of time, I'm not sure how many times to CROSS JOIN back on myself..?
SELECT a.Value + '+' + b.Value
FROM MyTable a
CROSS JOIN MyTable b
WHERE a.Value <> b.Value
There will always be less than 10 (and really more like 1-3) rows at any given time. Can I do this on-the-fly in SQL Server?
Edit: ideally, I'd like this to happen in a single stored proc, but if I have to use another proc or some user defined functions to pull this off I'm fine with that.
This SQL will compute the permutations without repetitions:
WITH recurse(Result, Depth) AS
(
SELECT CAST(Value AS VarChar(100)), 1
FROM MyTable
UNION ALL
SELECT CAST(r.Result + '+' + a.Value AS VarChar(100)), r.Depth + 1
FROM MyTable a
INNER JOIN recurse r
ON CHARINDEX(a.Value, r.Result) = 0
)
SELECT Result
FROM recurse
WHERE Depth = (SELECT COUNT(*) FROM MyTable)
ORDER BY Result
If MyTable contains 9 rows, it will take some time to compute, but it will return 362,880 rows.
Update with explanation:
The WITH statement is used to define a recursive common table expression. In effect, the WITH statement is looping multiple times performing a UNION until the recursion is finished.
The first part of SQL sets the starting records. Assuming 3 rows named 'A', 'B', and 'C' in MyTable, this will generate these rows:
Result Depth
------ -----
A 1
B 1
C 1
Then the next block of SQL performs the first level of recursion:
SELECT CAST(r.Result + '+' + a.Value AS VarChar(100)), r.Depth + 1
FROM MyTable a
INNER JOIN recurse r
ON CHARINDEX(a.Value, r.Result) = 0
This takes all of the records generated so far (which will be in the recurse table) and joins them to all of the records in MyTable again. The ON clause filters the list of records in MyTable to only return the ones that do not exist already in this row's permutation. This would result in these rows:
Result Depth
------ -----
A 1
B 1
C 1
A+B 2
A+C 2
B+A 2
B+C 2
C+A 2
C+B 2
Then the recursion loops again giving these rows:
Result Depth
------ -----
A 1
B 1
C 1
A+B 2
A+C 2
B+A 2
B+C 2
C+A 2
C+B 2
A+B+C 3
A+C+B 3
B+A+C 3
B+C+A 3
C+A+B 3
C+B+A 3
At this point, the recursion stops because the UNION does not create any more rows because the CHARINDEX will always be 0.
The last SQL filters all of the resulting rows where the computed Depth column matches the # of records in MyTable. This throws out all of the rows except for the ones generated by the last depth of recursion. So the final result will be these rows:
Result
------
A+B+C
A+C+B
B+A+C
B+C+A
C+A+B
C+B+A
You can do this with a recursive CTE:
with t as (
select 'a' as value union all
select 'b' union all
select 'c'
),
const as (select count(*) as cnt from t),
cte as (
select cast(value as varchar(max)) as value, 1 as level
from t
union all
select cte.value + '+' + t.value, 1 + level
from cte join
t
on '+'+cte.value+'+' not like '%+'+t.value+'+%' cross join
const
where level <= const.cnt
)
select cte.value
from cte cross join
const
where level = const.cnt;

is there a PRODUCT function like there is a SUM function in Oracle SQL?

I have a coworker looking for this, and I don't recall ever running into anything like that.
Is there a reasonable technique that would let you simulate it?
SELECT PRODUCT(X)
FROM
(
SELECT 3 X FROM DUAL
UNION ALL
SELECT 5 X FROM DUAL
UNION ALL
SELECT 2 X FROM DUAL
)
would yield 30
select exp(sum(ln(col)))
from table;
edit:
if col always > 0
DECLARE #a int
SET #a = 1
-- re-assign #a for each row in the result
-- as what #a was before * the value in the row
SELECT #a = #a * amount
FROM theTable
There's a way to do string concat that is similiar:
DECLARE #b varchar(max)
SET #b = ""
SELECT #b = #b + CustomerName
FROM Customers
Here's another way to do it. This is definitely the longer way to do it but it was part of a fun project.
You've got to reach back to school for this one, lol. They key to remember here is that LOG is the inverse of Exponent.
LOG10(X*Y) = LOG10(X) + LOG10(Y)
or
ln(X*Y) = ln(X) + ln(Y) (ln = natural log, or simply Log base 10)
Example
If X=5 and Y=6
X * Y = 30
ln(5) + ln(6) = 3.4
ln(30) = 3.4
e^3.4 = 30, so does 5 x 6
EXP(3.4) = 30
So above, if 5 and 6 each occupied a row in the table, we take the natural log of each value, sum up the rows, then take the exponent of the sum to get 30.
Below is the code in a SQL statement for SQL Server. Some editing is likely required to make it run on Oracle. Hopefully it's not a big difference but I suspect at least the CASE statement isn't the same on Oracle. You'll notice some extra stuff in there to test if the sign of the row is negative.
CREATE TABLE DUAL (VAL INT NOT NULL)
INSERT DUAL VALUES (3)
INSERT DUAL VALUES (5)
INSERT DUAL VALUES (2)
SELECT
CASE SUM(CASE WHEN SIGN(VAL) = -1 THEN 1 ELSE 0 END) % 2
WHEN 1 THEN -1
ELSE 1
END
* CASE
WHEN SUM(VAL) = 0 THEN 0
WHEN SUM(VAL) IS NOT NULL THEN EXP(SUM(LOG(ABS(CASE WHEN SIGN(VAL) <> 0 THEN VAL END))))
ELSE NULL
END
* CASE MIN(ABS(VAL)) WHEN 0 THEN 0 ELSE 1 END
AS PRODUCT
FROM DUAL
The accepted answer by tuinstoel is correct, of course:
select exp(sum(ln(col)))
from table;
But notice that if col is of type NUMBER, you will find tremendous performance improvement when using BINARY_DOUBLE instead. Ideally, you would have a BINARY_DOUBLE column in your table, but if that's not possible, you can still cast col to BINARY_DOUBLE. I got a 100x improvement in a simple test that I documented here, for this cast:
select exp(sum(ln(cast(col as binary_double))))
from table;
Is there a reasonable technique that would let you simulate it?
One technique could be using LISTAGG to generate product_expression string and XMLTABLE + GETXMLTYPE to evaluate it:
WITH cte AS (
SELECT grp, LISTAGG(l, '*') AS product_expression
FROM t
GROUP BY grp
)
SELECT c.*, s.val AS product_value
FROM cte c
CROSS APPLY(
SELECT *
FROM XMLTABLE('/ROWSET/ROW/*'
PASSING dbms_xmlgen.getXMLType('SELECT ' || c.product_expression || ' FROM dual')
COLUMNS val NUMBER PATH '.')
) s;
db<>fiddle demo
Output:
+------+---------------------+---------------+
| GRP | PRODUCT_EXPRESSION | PRODUCT_VALUE |
+------+---------------------+---------------+
| b | 2*6 | 12 |
| a | 3*5*7 | 105 |
+------+---------------------+---------------+
More roboust version with handling single NULL value in the group:
WITH cte AS (
SELECT grp, LISTAGG(l, '*') AS product_expression
FROM t
GROUP BY grp
)
SELECT c.*, s.val AS product_value
FROM cte c
OUTER APPLY(
SELECT *
FROM XMLTABLE('/ROWSET/ROW/*'
passing dbms_xmlgen.getXMLType('SELECT ' || c.product_expression || ' FROM dual')
COLUMNS val NUMBER PATH '.')
WHERE c.product_expression IS NOT NULL
) s;
db<>fiddle demo
*CROSS/OUTER APPLY(Oracle 12c) is used for convenience and could be replaced with nested subqueries.
This approach could be used for generating different aggregation functions.
There are many different implmentations of "SQL". When you say "does sql have" are you referring to a specific ANSI version of SQL, or a vendor specific implementation. DavidB's answer is one that works in a few different environments I have tested but depending on your environment you could write or find a function exactly like what you are asking for. Say you were using Microsoft SQL Server 2005, then a possible solution would be to write a custom aggregator in .net code named PRODUCT which would allow your original query to work exactly as you have written it.
In c# you might have to do:
SELECT EXP(SUM(LOG([col])))
FROM table;