Select by length of characters - SQL

I have to select the longest phrase that has points > 0 and is contained in a phrase which has points = 0. If you look at the demo, the rows in the output would be numbers 3 and 6:
http://sqlfiddle.com/#!18/e954f/1/0
Many thanks in advance.

You can use a CTE to find all phrases with positive points which are a leading substring (prefix) of a phrase with 0 points. Then you can find the maximum length of the matching phrases associated with each 0-point phrase, and JOIN that back to the CTE to get the phrase that matches that condition:
WITH cte AS (
    SELECT w1.*, w2.id AS w2_id
    FROM words w1
    JOIN (SELECT *
          FROM words
          WHERE points = 0) w2 ON w1.phrase = LEFT(w2.phrase, LEN(w1.phrase))
    WHERE w1.points > 0
)
SELECT cte.id, cte.phrase, points
FROM cte
JOIN (SELECT w2_id, MAX(LEN(phrase)) AS max_len
      FROM cte
      GROUP BY w2_id) cte_max ON cte_max.w2_id = cte.w2_id
                             AND cte_max.max_len = LEN(cte.phrase)
Output:
id  phrase           points
3   tool box online  1
6   stone road       1
Updated SQLFiddle

You can use an inner join, comparing the phrases with a LIKE, to get only the ones contained in another phrase. Filter for the points in a WHERE clause. Then get the rank(), partitioned by the phrase from the joined instance and ordered by the length descending. In an outer SELECT, only keep the ones with a rank of one.
SELECT x.id,
       x.phrase,
       x.points
FROM (SELECT w1.id,
             w1.phrase,
             w1.points,
             rank() OVER (PARTITION BY w2.phrase
                          ORDER BY len(w1.phrase) DESC) r
      FROM words w1
      INNER JOIN words w2
              ON w2.phrase LIKE concat(w1.phrase, '%')
      WHERE w2.points = 0
            AND w1.points > 0) x
WHERE x.r = 1;
SQL Fiddle
Edit:
To include the other phrase:
SELECT x.id,
       x.phrase,
       x.other_phrase,
       x.points
FROM (SELECT w1.id,
             w1.phrase,
             w2.phrase other_phrase,
             w1.points,
             rank() OVER (PARTITION BY w2.phrase
                          ORDER BY len(w1.phrase) DESC) r
      FROM words w1
      INNER JOIN words w2
              ON w2.phrase LIKE concat(w1.phrase, '%')
      WHERE w2.points = 0
            AND w1.points > 0) x
WHERE x.r = 1;

This returns the phrases with points > 0 ordered from the longest to the shortest:
SELECT *, LEN(phrase) AS Length FROM words WHERE points > 0 ORDER BY LEN(phrase) DESC
And if you only want the longest phrase:
SELECT TOP 1 *, LEN(phrase) AS Length FROM words WHERE points > 0 ORDER BY LEN(phrase) DESC

Related

SQL: Divide long text into multiple rows

I would like to divide a long text into multiple rows; there are other questions similar to this one, but none of them worked for me.
What I have
ID | Message
----------------------------------
1 | Very looooooooooooooooong text
2 | Short text
What I would like to do is divide that string every n characters
Result if n = 15:
Id | Message
------------------------------------------
1 | Very looooooooo
1 | oooooooong text
2 | Short text
Even better if the split is done at the first space after n characters.
I tried with string_split and substring but I cannot find anything that works.
I thought to use something similar to this:
SELECT index, element FROM table, CAST(message AS SUPER) AS element AT index;
But it doesn't take into account the length and I don't like casting a varchar variable into a super.
You can use generate_series() to accomplish this:
select m.*, gs.posn, substring(m.message, gs.posn, 15) as split_message
from messages m
cross join lateral generate_series(1, length(message), 15) gs(posn);
Splitting on spaces after the length is a little trickier. We would have to split the message into words and then figure out how to break them into groups and then reaggregate.
I could not figure out how to split on spaces without recursion. I hope you don't mind that it treats all whitespace as word boundaries:
with recursive by_words as (
  select m.*, s.n, s.word, length(s.word) as word_len,
         max(s.n) over (partition by m.id) as num_words
    from messages m
   cross join lateral regexp_split_to_table(m.message, '\s+')
              with ordinality as s(word, n)
), rejoin as (
  select id, n, array[word] as words, word_len as cum_word_len,
         word_len >= 15 as keep
    from by_words
   where n = 1
  union all
  select p.id, c.n,
         case
           when p.cum_word_len >= 15 then array[c.word]
           else p.words||c.word
         end as words,
         case
           when p.cum_word_len >= 15 then c.word_len
           else p.cum_word_len + c.word_len + 1
         end as cum_word_len,
         (p.cum_word_len + c.word_len + 1 >= 15)
           or (c.n = c.num_words) as keep
    from rejoin p
    join by_words c on (c.id, c.n) = (p.id, p.n + 1)
)
select id,
       row_number() over (partition by id
                          order by n) as segnum,
       array_to_string(words, ' ') as split_message
  from rejoin
 where keep
 order by 1, 2
;
db<>fiddle here
Edit to add:
Can you please tell me whether the below works in Redshift?
with gs as (
select generate_series as posn
from generate_series(1, 150000, 15)
)
select *, substring(m.message, gs.posn, 15) as split_message
from messages m
join gs
on gs.posn <= greatest(1, length(m.message))
order by m.id, gs.posn
;
Thanks to Mike Organek's answer and his help, I found a solution that works with Redshift too.
The problem with Mike's answer on Redshift is that generate_series is not well supported there, so here's a workaround.
with row as (
select t.*, row_number() over () as x
from table t -- big enough table
limit 100
),
result as
(
select (x-1)*15+1 as posn from row --change 15 to a number to split the long text with
)
select * into gs
from result
And then Mike's answer:
select *, substring(m.message from gs.posn for 15) as split_message
from messages m
join gs
on gs.posn <= greatest(1, length(m.message))
order by m.id, gs.posn

Getting Number of Common Values from 2 comma-separated strings

I have a table in Postgres that contains comma-separated values in a column.
ID PRODS
--------------------------------------
1 ,142,10,75,
2 ,142,87,63,
3 ,75,73,2,58,
4 ,142,2,
Now I want a query where I can give a comma-separated string and it will tell me the number of matches between the input string and the string present in the row.
For instance, for input value ',142,87,', I want the output like
ID PRODS No. of Match
------------------------------------------------------------------------
1 ,142,10,75, 1
2 ,142,87,63, 2
3 ,75,73,2,58, 0
4 ,142,2, 1
Try this:
SELECT
    *,
    ARRAY(
        SELECT *
        FROM unnest(string_to_array(trim(both ',' from prods), ','))
        WHERE unnest = ANY(string_to_array(',142,87,', ','))
    )
FROM
    prods_table;
Output is:
1 ,142,10,75, {142}
2 ,142,87,63, {142,87}
3 ,75,73,2,58, {}
4 ,142,2, {142}
Add the cardinality(anyarray) function to the last column to get just a number of matches.
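For example, wrapping the array in cardinality() would look roughly like this (a sketch based on the query above; the no_of_match alias is just illustrative):
SELECT
    *,
    cardinality(ARRAY(
        SELECT *
        FROM unnest(string_to_array(trim(both ',' from prods), ','))
        WHERE unnest = ANY(string_to_array(',142,87,', ','))
    )) AS no_of_match
FROM
    prods_table;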
And consider changing your database design.
Check this.
select T.*,
       COALESCE(No_of_Match, '0')
from TT T
left join (
    select ID, count(ID) No_of_Match
    from (select ID,
                 unnest(string_to_array(trim(t.prods, ','), ',')) A
          from TT t) a
    where A in ('142', '87')
    group by ID
) B on T.Id = B.id
Demo Here
If you install the intarray extension, this gets quite easy:
select id, prods, cardinality(string_to_array(trim(prods, ','), ',')::int[] & array[142,87])
from bad_design;
Otherwise it's a bit more complicated:
select bd.id, bd.prods, m.matches
from bad_design bd
join lateral (
select bd.id, count(v.p) as matches
from unnest(string_to_array(trim(bd.prods, ','), ',')) as l(p)
left join (
values ('142'),('87') --<< these are your input values
) v(p) on l.p = v.p
group by bd.id
) m on m.id = bd.id
order by bd.id;
Online example: http://rextester.com/ZIYS97736
But you should really fix your data model.
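As a rough illustration of what fixing the model could mean (hypothetical table and column names), storing one product per row turns the match count into a plain aggregate:
create table product_list (
    id      int not null,   -- the id from the original table
    prod_id int not null,   -- one product per row instead of a CSV string
    primary key (id, prod_id)
);

select id, count(*) as no_of_match
from product_list
where prod_id in (142, 87)
group by id;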
with data as
(
select *,
unnest(string_to_array(trim(both ',' from prods), ',') ) as v
from myTable
),
counts as
(
select id, count(t) as c from data
left join
( select unnest(string_to_array(',142,87,', ',') ) as t) tmp on tmp.t = data.v
group by id
order by id
)
select t1.id, t1.prods, t2.c as "No. of Match"
from myTable t1
inner join counts t2 on t1.id = t2.id;

ROW_NUMBER() Query Plan SORT Optimization

The query below accesses the Votes table, which contains over 30 million rows. The result set is then filtered with WHERE n = 1. In the query plan, the SORT operation for the ROW_NUMBER() window function is 95% of the query's cost, and the query takes over 6 minutes to complete.
I already have an index on same_voter, eid, country INCLUDE (vid, nid, sid, vote, time_stamp, new) to cover the WHERE clause.
Is the most efficient way to correct this to add an index on vid, nid, sid, new DESC, time_stamp DESC, or is there an alternative to the ROW_NUMBER() function that achieves the same results more efficiently?
SELECT v.vid, v.nid, v.sid, v.vote, v.time_stamp, v.new, v.eid,
       ROW_NUMBER() OVER (
           PARTITION BY v.vid, v.nid, v.sid ORDER BY v.new DESC, v.time_stamp DESC) AS n
FROM dbo.Votes v
WHERE v.same_voter <> 1
  AND v.eid <= @EId
  AND v.eid > (@EId - 5)
  AND v.country = @Country
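For clarity, the index I'm considering would look something like this (the name is just a placeholder, and the INCLUDE list is my guess at covering the selected columns):
CREATE NONCLUSTERED INDEX IX_Votes_vid_nid_sid_new_ts
    ON dbo.Votes (vid, nid, sid, new DESC, time_stamp DESC)
    INCLUDE (same_voter, eid, country, vote);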
One possible alternative to using ROW_NUMBER():
SELECT
    V.vid,
    V.nid,
    V.sid,
    V.vote,
    V.time_stamp,
    V.new,
    V.eid
FROM
    dbo.Votes V
    LEFT OUTER JOIN dbo.Votes V2 ON
        V2.vid = V.vid AND
        V2.nid = V.nid AND
        V2.sid = V.sid AND
        V2.same_voter <> 1 AND
        V2.eid <= @EId AND
        V2.eid > (@EId - 5) AND
        V2.country = @Country AND
        (V2.new > V.new OR (V2.new = V.new AND V2.time_stamp > V.time_stamp))
WHERE
    V.same_voter <> 1 AND
    V.eid <= @EId AND
    V.eid > (@EId - 5) AND
    V.country = @Country AND
    V2.vid IS NULL
The query basically says to get all rows matching your criteria, then join to any other rows that match the same criteria but would be ranked higher for the partition based on the new and time_stamp columns. If none are found then this must be the row you want (it's ranked highest), and in that case V2.vid will be NULL. I'm assuming that vid otherwise can never be NULL. If it's a NULLable column in your table then you'll need to adjust that last line of the query.
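If vid is nullable, one possible adjustment (a sketch with the same filters as above, not tested against your schema) is to express the same anti-join with NOT EXISTS, so that no column of V2 has to be checked for NULL:
SELECT V.vid, V.nid, V.sid, V.vote, V.time_stamp, V.new, V.eid
FROM dbo.Votes V
WHERE V.same_voter <> 1
  AND V.eid <= @EId
  AND V.eid > (@EId - 5)
  AND V.country = @Country
  AND NOT EXISTS (SELECT 1
                  FROM dbo.Votes V2
                  WHERE V2.vid = V.vid
                    AND V2.nid = V.nid
                    AND V2.sid = V.sid
                    AND V2.same_voter <> 1
                    AND V2.eid <= @EId
                    AND V2.eid > (@EId - 5)
                    AND V2.country = @Country
                    AND (V2.new > V.new
                         OR (V2.new = V.new AND V2.time_stamp > V.time_stamp)));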

Query with equation

I have 3 queries that return 3 values. I'd like to join the queries to perform the following expression:
(MgO + CaO)/SiO2
How can I do that?
MgO:
SELECT sampled_date, result
FROM AF_VW WHERE element = 'MgO' AND ROWNUM = 1 ORDER BY sampled_date DESC;
CaO:
SELECT sampled_date, result
FROM AF_VW WHERE element = 'CaO' AND ROWNUM = 1 ORDER BY sampled_date DESC;
SiO2:
SELECT sampled_date, result
FROM AF_VW WHERE element = 'SiO2' AND ROWNUM = 1 ORDER BY sampled_date DESC;
with x as (
  SELECT sampled_date, result, element,
         row_number() over (partition by element order by sampled_date desc) rn
  FROM AF_VW)
, y as (
  select
    max(case when element = 'MgO' then result end) as MGO,
    max(case when element = 'CaO' then result end) as CaO,
    max(case when element = 'SiO2' then result end) as SiO2
  FROM x where rn = 1)
select (mgo+cao)/sio2 from y;
You can use the row_number function instead of rownum to pick the latest row per element; the conditional aggregation (the max(case ...) expressions) then pivots those three rows into one so the expression can be computed.
This is a bit long for a comment.
The queries in your question are probably not doing what you expect. Oracle evaluates the WHERE clause (including ROWNUM) before the ORDER BY. So the following chooses one arbitrary row with MgO and then performs the trivial ordering of that one row by date:
SELECT sampled_date, result
FROM AF_VW
WHERE element = 'MgO' AND ROWNUM = 1
ORDER BY sampled_date DESC;
Really, to get the equivalent result, you would need to emulate the same, unstable logic. Unstable, because the results are not guaranteed to be the same if the query is run multiple times:
with mgo as (
      SELECT sampled_date, result
      FROM AF_VW
      WHERE element = 'MgO' AND ROWNUM = 1
     ),
     cao as (
      SELECT sampled_date, result
      FROM AF_VW
      WHERE element = 'CaO' AND ROWNUM = 1
     ),
     sio2 as (
      SELECT sampled_date, result
      FROM AF_VW
      WHERE element = 'SiO2' AND ROWNUM = 1
     )
select (mgo.result + cao.result) / sio2.result
from mgo cross join cao cross join sio2;
I suspect you really want the most recent sample date, which is what VKP's answer provides. I just thought you should know that is not what your current queries are doing.
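For completeness, the usual pattern to make each per-element query actually return the latest row is to sort in a subquery and apply ROWNUM outside of it; a sketch for MgO:
SELECT sampled_date, result
FROM (SELECT sampled_date, result
      FROM AF_VW
      WHERE element = 'MgO'
      ORDER BY sampled_date DESC)
WHERE ROWNUM = 1;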

Fastest way to check if the most recent result for a patient has a certain value

Mssql < 2005
I have a complex database with lots of tables, but for now only the patient table and the measurements table matter.
What I need is the number of patients whose most recent value of 'code' matches a certain value. Also, datemeasurement has to be after '2012-04-01'. I have solved this in two different ways:
SELECT
  COUNT(P.patid)
FROM T_Patients P
WHERE P.patid IN (SELECT patid
                  FROM T_Measurements M
                  WHERE (M.code = 'xxxx' AND result = 'xx')
                    AND datemeasurement = (SELECT MAX(datemeasurement)
                                           FROM T_Measurements
                                           WHERE datemeasurement > '2012-01-04' AND patid = M.patid
                                           GROUP BY patid)
                  GROUP BY patid)
AND:
SELECT
COUNT(P.patid)
FROM T_Patient P
WHERE 1 = (SELECT TOP 1 case when result = 'xx' then 1 else 0 end
FROM T_Measurements M
WHERE (M.code ='xxxx') AND datemeasurement > '2012-01-04' AND patid = P.patid
ORDER by datemeasurement DESC
)
This works just fine, but it makes the query incredibly slow because it has to join the outer table on the subquery (if you know what I mean). The query takes 10 seconds without the most recent check, and 3 minutes with the most recent check.
I'm pretty sure this can be done a lot more efficiently, so please enlighten me if you will :).
I tried implementing HAVING datemeasurement = MAX(datemeasurement), but that keeps throwing errors at me.
So my approach would be to write a query that just gets each patient's most recent measurement since 01-04-2012, and then filter that for your codes and results. Something like:
select
count(1)
from
T_Measurements M
inner join (
SELECT PATID, MAX(datemeasurement) as lastMeasuredDate from
T_Measurements M
where datemeasurement > '01-04-2012'
group by patID
) lastMeasurements
on lastMeasurements.lastmeasuredDate = M.datemeasurement
and lastMeasurements.PatID = M.PatID
where
M.Code = 'Xxxx' and M.result = 'XX'
The fastest way may be to use row_number():
SELECT COUNT(m.patid)
from (select m.*,
ROW_NUMBER() over (partition by patid order by datemeasurement desc) as seqnum
FROM T_Measurements m
where datemeasurement > '2012-01-04'
) m
where seqnum = 1 and code = 'XXX' and result = 'xx'
Row_number() enumerates the records for each patient, so the most recent gets a value of 1. The outer query then just selects the rows with seqnum = 1 that match the code and result.