Consecutive Pattern replacing is not happening with REGEXP_REPLACE - sql

I have a string as below
Welcome to the world of the Hackers
I am trying to remove every occurrence of the listed words (of, to, the) from within the string using the query below, but it fails when the patterns are consecutive.
SELECT regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' )
FROM dual;
Output: Welcome the world the Hackers
It also fails when the same pattern repeats consecutively, e.g.
SELECT regexp_replace( 'Welcome to to the world of the Hackers', '( to )|( the )|( of )', ' ' )
FROM dual;
Output: Welcome to world the Hackers
Whereas my expected output is: Welcome world Hackers
Is there any alternative/solution for this using REGEXP_REPLACE?

You can use the regular expression (^|\s+)((to|the|of)(\s+|$))+:
Query 1:
WITH test_data ( sentence ) AS (
SELECT 'to the of' FROM DUAL UNION ALL
SELECT 'woof breathe toto' FROM DUAL UNION ALL -- has all the words as sub-strings of words
SELECT 'theory of the offer to total' FROM DUAL -- mix of words to replace and words starting with those words
)
SELECT sentence,
regexp_replace(
sentence,
'(^|\s+)((to|the|of)(\s+|$))+',
'\1'
) AS replaced
FROM test_data
Results:
| SENTENCE | REPLACED |
|------------------------------|--------------------|
| to the of | (null) | -- All words replaced
| woof breathe toto | woof breathe toto |
| theory of the offer to total | theory offer total |
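As a quick check against the two strings from the question, the same pattern can be run on both of them. This is only a sketch; the CTE name question_data is made up for illustration.

```sql
-- question_data is a hypothetical name; the strings are the ones from the question
WITH question_data ( sentence ) AS (
  SELECT 'Welcome to the world of the Hackers'    FROM DUAL UNION ALL
  SELECT 'Welcome to to the world of the Hackers' FROM DUAL
)
SELECT sentence,
       regexp_replace(
         sentence,
         '(^|\s+)((to|the|of)(\s+|$))+',  -- a whole run of listed words is one match
         '\1'                             -- keep only the leading whitespace
       ) AS replaced   -- both rows should come back as: Welcome world Hackers
FROM   question_data;
```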
Why doesn't regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' ) work with successive matches?
Because the regular expression parser will look for the second match after the end of the first match and will not include the already parsed part of the string or the replacement text when looking for subsequent matches.
So the first match will be:
'Welcome to the world of the Hackers'
        ^^^^
The second match will look in the sub-string following that match
'the world of the Hackers'
          ^^^^
The 'the ' at the start of the sub-string will not be matched because it has no leading space character. There was a space before it, but that space was consumed by the previous match; and although that match was replaced with a space, regular expressions neither produce overlapping matches nor re-scan replacement text.
So the second match is the ' of ' in the middle of the remaining sub-string.
There will be no third match as the remaining un-parsed sub-string is:
'the Hackers'
and, again, the 'the ' is not matched as there is no leading space character to match.
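One way to see that the shared space is the whole problem is to give every word its own leading and trailing space before matching. The following is a sketch of that idea, not something from the original answer: it doubles every space, removes the listed words together with one space on each side, and then collapses the leftover runs of spaces.

```sql
SELECT regexp_replace(
         regexp_replace(
           replace( 'Welcome to the world of the Hackers', ' ', '  ' ),  -- double every space
           ' (to|the|of) '   -- each listed word now has its own leading and trailing space
         ),                  -- no replacement argument, so the matches are removed
         ' +', ' '           -- collapse the remaining runs of spaces to one
       ) AS replaced         -- should come back as: Welcome world Hackers
FROM   dual;
```

Note that this sketch does not touch a listed word at the very start or end of the string, and it also collapses any multiple spaces that were in the original text.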

REGEXP_REPLACE does not match a second pattern which is a part of the already matched pattern. This is more apparent when you use the multi-pattern matching like |. Thus, you can't rely on spaces for word boundaries to match multiple patterns this way. One solution could be to split and combine the characters. This may not be the best way, but works nonetheless. I would be glad to know a better solution.
This also assumes that you are OK with single spaces in the combined string where the original string had more than one. Also, words ending with a comma or semicolon aren't considered. You may enhance it using NOT REGEXP_LIKE instead of NOT IN for such cases.
WITH t (id,s)
AS (
SELECT 1 , 'Welcome to the world of the Hackers, you told me these words at the'
FROM DUAL
UNION ALL
SELECT 2, 'The second line.Welcome to the world of the Hackers, you told me these words at the'
FROM DUAL
)
SELECT LISTAGG(word, ' ') WITHIN GROUP (ORDER BY w)
FROM (
SELECT id,
LEVEL AS w
,REGEXP_SUBSTR(s, '[^ ]+', 1, LEVEL) AS word
FROM t CONNECT BY LEVEL <= REGEXP_COUNT(s, '[^ ]+')
AND PRIOR id = id
AND PRIOR SYS_GUID() IS NOT NULL
)
WHERE lower(word) NOT IN (
'to'
,'the'
,'of'
)
GROUP BY id;
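The NOT REGEXP_LIKE enhancement mentioned above might look like the following sketch. The sample sentence and the set of trailing punctuation characters ([,;]) are assumptions chosen for illustration; note that a listed word is dropped together with its trailing punctuation.

```sql
WITH t (id, s) AS (
  -- hypothetical sample: listed words followed by a comma or semicolon
  SELECT 1, 'Welcome to the, world of the; Hackers' FROM DUAL
),
words AS (
  SELECT id,
         LEVEL AS w,
         REGEXP_SUBSTR(s, '[^ ]+', 1, LEVEL) AS word
  FROM   t
  CONNECT BY LEVEL <= REGEXP_COUNT(s, '[^ ]+')
         AND PRIOR id = id
         AND PRIOR SYS_GUID() IS NOT NULL
)
SELECT id,
       LISTAGG(word, ' ') WITHIN GROUP (ORDER BY w) AS cleaned
FROM   words
-- 'i' makes the match case-insensitive; [,;]? allows one trailing comma or semicolon
WHERE  NOT REGEXP_LIKE(word, '^(to|the|of)[,;]?$', 'i')
GROUP BY id;
```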

Related

Oracle replace some duplicated characters (non digits )

Can anyone help me build the proper syntax for regexp_replace to remove any repeated non-digit, non-letter characters from a string? If a digit or letter is repeated, it should not be changed.
eg.
source and expected result:
'ABBC000001223, ABC00000212,,, '
'ABBC000001223, ABC00000212, '
(removed the second occurrence of the space after the comma, and the second and third commas)
Use this REGEXP_REPLACE to match any non-alphanumeric character in the first group
([^[:alnum:]])
followed by one or more of the same character (a back-reference to group 1)
([^[:alnum:]])(\1)+
and replace the whole match with the original character (group 1).
I added some other data to demonstrate the result:
with dta as (
select 'ABBC000001223, ABC00000212,,, ' txt from dual union all
select ',.,;,;;;;,,,,,,,,,,,,#''++`´' txt from dual union all
select 'ABBC000001223ABC00000212' txt from dual)
select txt,
regexp_replace(txt,'([^[:alnum:]])(\1)+', '\1') result
from dta
| TXT | RESULT |
|-------------------------------|--------------------------------|
| ABBC000001223, ABC00000212,,, | ABBC000001223, ABC00000212, |
| ,.,;,;;;;,,,,,,,,,,,,#'++`´ | ,.,;,;,#'+`´ |
| ABBC000001223ABC00000212 | ABBC000001223ABC00000212 |

How to get first string after character Oracle SQL

I'm trying to get first string after a character.
Example is like
ABCDEF||GHJ||WERT
I need only
GHJ
I tried to use REGEXP but I couldn't do it.
Can anyone help me with this, please?
Thank you
Somewhat simpler:
SQL> -- the last argument (2) means: give me the 2nd "word"
SQL> select regexp_substr('ABCDEF||GHJ||WERT', '\w+', 1, 2) result from dual;

RES
---
GHJ

SQL>
which reads as: give me the 2nd word out of that string. Won't work properly if GHJ consists of several words (but that's not what your example suggests).
Something like this, as I interpret it, with a separator in place. In this case the separator is || (or |); the example is for an Oracle database.
-- pattern --> [^||] (equivalent to [^|]) matches any character other than |, and + means one or more of them, so each match is a run of characters between separators
-- 3rd parameter --> starting position
-- 4th parameter --> nth occurrence
WITH tbl(str) AS
(SELECT 'ABCDEF||GHJ||WERT' str FROM dual)
SELECT regexp_substr(str
,'[^||]+'
,1
,2) output
FROM tbl;
I think the most general solution is:
WITH tbl(str) AS (
SELECT 'ABCDEF||GHJ||WERT' str FROM dual UNION ALL
SELECT 'ABC|DEF||GHJ||WERT' str FROM dual UNION ALL
SELECT 'ABClDEF||GHJ||WERT' str FROM dual
)
SELECT regexp_replace(str, '^.*\|\|(.*)\|\|.*', '\1')
FROM tbl;
Note that this works even if the individual elements contain punctuation or a single vertical bar -- which the other solutions do not. Here is a comparison.
Presumably, the double vertical bar is being used for maximum flexibility.
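A side-by-side comparison of the three approaches can be sketched as below. The third input row (an element containing a space) is a hypothetical addition, and the column aliases are made up for illustration.

```sql
WITH tbl ( str ) AS (
  SELECT 'ABCDEF||GHJ||WERT'  FROM dual UNION ALL
  SELECT 'ABC|DEF||GHJ||WERT' FROM dual UNION ALL
  SELECT 'ABCDEF||GH J||WERT' FROM dual   -- hypothetical extra row: element with a space
)
SELECT str,
       regexp_substr( str, '\w+',   1, 2 )              AS second_word,
       regexp_substr( str, '[^|]+', 1, 2 )              AS second_non_bar_run,
       regexp_replace( str, '^.*\|\|(.*)\|\|.*', '\1' ) AS between_double_bars
FROM   tbl;
```

For 'ABC|DEF||GHJ||WERT' the first two columns return DEF, because a single bar already ends a "word" or a run of non-bar characters. For 'ABCDEF||GH J||WERT' the \w+ version returns only GH. Only the last column depends solely on the double-bar separator.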
You should use regexp_substr function
select regexp_substr('ABCDEF||GHJ||WERT ', '\|{2}([^|]+)', 1, 1, 'i', 1) str
from dual;
STR
---
GHJ

PLSQL show digits from end of the string

I have the following problem.
There is a String:
There is something 2015.06.06. in the air 1234567 242424 2015.06.07. 12125235
I need to show only just the last date from this string: 2015.06.07.
I tried regexp_substr together with instr, but it doesn't work.
This is just a test; once it works, I need to use the same approach in a query against a CLOB that contains multiple dates, of which I need only the last one. I know regexp_count exists and would help here, but the database I use is Oracle 10g, so it is not available.
Can somebody help me?
The key to find the solution of this problem is the idea of reversing the words in the string presented in this answer.
Here is the possible solution:
WITH words AS
(
SELECT regexp_substr(str, '[^[:space:]]+', 1, LEVEL) word,
rownum rn
FROM (SELECT 'There is something 2015.06.06. in the air 1234567 242424 2015.06.07. 2015.06.08 2015.06.17. 2015.07.01. 12345678999 12125235' str
FROM dual) tab
CONNECT BY LEVEL <= LENGTH(str) - LENGTH(REPLACE(str, ' ')) + 1
)
, words_reversed AS
(
SELECT *
FROM words
ORDER BY rn DESC
)
SELECT regexp_substr(word, '\d{4}\.\d{2}\.\d{2}', 1, 1)
FROM words_reversed
WHERE regexp_like(word, '\d{4}\.\d{2}\.\d{2}')
AND rownum = 1;
From the documentation on regexp_substr, I see one problem immediately:
The . (period) matches any character. You need to escape those with a backslash: \. in order to match only a period character.
For reference, I am linking this post which appears to be the approach you are taking with substr and instr.
Relevant documentation from Oracle:
INSTR(string , substring [, position [, occurrence]])
When position is negative, then INSTR counts and searches backward from the end of string. The default value of position is 1, which means that the function begins searching at the beginning of string.
The problem here is that your regular expression only returns a single value, as explained here, so you may not be giving the instr function the appropriate match in the case of multiple dates.
Now, because of this limitation, I recommend using the approach that was proposed in this question, namely reverse the entire string (and your regular expression, i.e. \d{2}\.\d{2}\.\d{4}) and then the first match will be the 'last match'. Then, perform another string reversal to get the original date format.
Maybe this isn't the best solution, but it should work.
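The backward search described in the INSTR documentation quoted above can also be used directly, without reversing anything. The following sketch rests on a strong assumption: the last period in the string is the one that closes the last date, as it is in the sample. It uses only INSTR and SUBSTR, so it does not need regexp_count.

```sql
WITH tbl AS (
  SELECT 'There is something 2015.06.06. in the air 1234567 242424 2015.06.07. 12125235' AS str
  FROM   dual
)
SELECT SUBSTR( str,
               INSTR( str, '.', -1 ) - 10,  -- last period, found by searching backward from the end
               10 ) AS last_date            -- the 10 characters before it: 2015.06.07
FROM   tbl;
```

If anything containing a period can follow the last date, this assumption breaks and a pattern-based approach is needed instead.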
There are three different PL/SQL functions that will get you there.
1. The INSTR function will identify where the first "period" in the date string appears.
2. SUBSTR applied to the entire string, using the value from (1) as the start point.
3. TO_DATE for a specific date mask: YYYY.MM.DD will convert the result from (2) into an Oracle date time type.
To make this work in procedural code, the standard blocks apply:
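DECLARE
  v_string   VARCHAR2(100) := '1234567 242424 2015.06.07. 12125235';
  v_position PLS_INTEGER;
  v_date     DATE;
BEGIN
  v_position := INSTR(v_string, '.', 1);  -- (1) position of the first period
  -- (2) the date starts 4 characters before it; (3) convert it to a DATE
  v_date := TO_DATE(SUBSTR(v_string, v_position - 4, 10), 'YYYY.MM.DD');
  DBMS_OUTPUT.PUT_LINE(TO_CHAR(v_date, 'YYYY.MM.DD'));
END;
/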
Oracle 11g R2 Schema Setup:
CREATE TABLE finddate
(column1 varchar2(11), column2 varchar2(39))
;
INSERT ALL
INTO finddate (column1, column2)
VALUES ('row1', '1234567 242424 2015.06.07. 12125235')
INTO finddate (column1, column2)
VALUES ('string2', '1234567 242424 2015.06.07. 12125235')
SELECT * FROM dual
;
Query 1:
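-- (1) position of the first period (20 for this data)
select instr(column2,'.',1) from finddate
where column1 = 'string2';
-- (2) the date starts 4 characters before that position
select substr(column2,(20-4),10) from finddate;
-- (3) convert the extracted text to a date
select to_date('2015.06.07','YYYY.MM.DD') from finddate;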
Results:
| TO_DATE('2015.06.07','YYYY.MM.DD') |
|------------------------------------|
| June, 07 2015 00:00:00 |
| June, 07 2015 00:00:00 |
Here's a way using regexp_replace() that should work with 10g, assuming the format of the lines will be the same:
with tbl(col_string) as
(
select 'There is something 2015.06.06. in the air 1234567 242424 2015.06.07. 12125235'
from dual
)
select regexp_replace(col_string, '^.*(\d{4}\.\d{2}\.\d{2})\. \d*$', '\1')
from tbl;
The regex can be read as:
^ - Match the start of the line
. - followed by any character
* - followed by 0 or more of the previous character (which is any character)
( - Start a remembered group
\d{4}\.\d{2}\.\d{2} - 4 digits followed by a literal period followed by 2 digits, etc
) - End the first remembered group
\. - followed by a literal period
(space) - followed by a space
\d* - followed by any number of digits
$ - followed by the end of the line
regexp_replace then replaces all that with the first remembered group (\1).
Basically describe the whole line as a regular expression, group around what you want to return. You will most likely need to tweak the regex for the end of the line if it could be other characters than digits but this should give you an idea.
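One possible tweak for the end of the line, offered as a sketch rather than as part of the original answer, is to allow anything after the captured date. Because the leading .* is greedy, the remembered group should still land on the last date in the string. The sample string is the longer one used in another answer above.

```sql
WITH tbl AS (
  SELECT 'There is something 2015.06.06. in the air 1234567 242424 2015.06.07. 2015.06.08 2015.06.17. 2015.07.01. 12345678999 12125235' AS col_string
  FROM   dual
)
SELECT regexp_replace( col_string,
                       '^.*(\d{4}\.\d{2}\.\d{2}).*$',  -- greedy leading .* pushes the group to the last date
                       '\1' ) AS last_date             -- should come back as: 2015.07.01
FROM   tbl;
```

Two caveats: when the string contains no date at all, regexp_replace returns it unchanged, and . does not match a newline unless the 'n' match parameter is supplied, which may matter for multi-line CLOB content.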
For the sake of argument this works too ONLY IF there are 2 occurrences of the date pattern:
with tbl(col_string) as
(
select 'There is something 2015.06.06. in the air 1234567 242424 2015.06.07. 12125235' from dual
)
select regexp_substr(col_string, '\d{4}\.\d{2}\.\d{2}', 1, 2)
from tbl;
returns the second occurrence of the pattern. I expect the above regexp_replace more accurately describes the solution.

Counting word lengths in a string

I am using an Oracle regular expression to extract the first letter of each word in a string. The results are returned in a single cell, with spaces representing hard breaks. Here is an example...
input:
'I hope that some kind person
browsing stack overflow
can help me'
output:
ihtskp bso chm
What I am trying to do next is count the length of each "word" in my output, like this:
6 3 3
Alternatively, a count of the words in each line of the original string would be acceptable, as it would yield the same result.
Thanks!
Count the number of spaces and add one:
select (length(your_col) - length(replace(your_col, ' '))+1) from your_table;
It will give you the number of words per line. From there you can get all counts on one line by using listagg function:
select LISTAGG(cnt,' ') within group (order by null) from (
select (length(a)-length(replace(a,' '))+1) cnt from (
select 'apa bpa bv' a from dual
union all
select 'n bb gg' a from dual
union all
select 'ff ff rr gg' a from dual))
group by null;
Perhaps you also need to split the strings if they contain newlines or are they split already?
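If the lines still need to be split, one possible sketch is to break the text on newlines first and then count the words in each line. It assumes the lines are separated by CHR(10) and the words by single spaces, and it uses REGEXP_COUNT, which requires Oracle 11g or later. The names src, lines, line_no and word_counts are made up for illustration.

```sql
WITH src ( txt ) AS (
  SELECT 'I hope that some kind person' || CHR(10) ||
         'browsing stack overflow'      || CHR(10) ||
         'can help me'
  FROM   dual
),
lines AS (
  -- one row per line: each match is a run of characters that are not newlines
  SELECT LEVEL AS line_no,
         REGEXP_SUBSTR( txt, '[^' || CHR(10) || ']+', 1, LEVEL ) AS line
  FROM   src
  CONNECT BY LEVEL <= REGEXP_COUNT( txt, '[^' || CHR(10) || ']+' )
)
SELECT LISTAGG( REGEXP_COUNT( line, '[^ ]+' ), ' ' )
         WITHIN GROUP ( ORDER BY line_no ) AS word_counts   -- 6 3 3 for this input
FROM   lines;
```

The CONNECT BY here has no PRIOR condition because src holds a single row; with several source rows it would need the same row-correlation conditions used elsewhere on this page.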
I tried to edit my original post but it hasn't appeared, but I figured out a way to solve my issue. I just decided to break the words into rows, since I know how to character count rows, and then reassembled the character counts into a single cell using listagg:
with my_string as (
select regexp_substr (words,'[0-9]+|[a-z]+|[A-Z]+',1,lvl) parsed, lvl
from (
select words, level lvl
from letters connect by level <= length(words) - length(replace(words,' ')) + 1)
)
-- order by the word's position (lvl), not by its text, so the counts keep their original order
select listagg(length(parsed),' ') within group (order by lvl) word_count
from my_string

Delete certain character based on the preceding or succeeding character - ORACLE

I have used the REPLACE function to delete email addresses from hundreds of records. However, the semicolon is the separator between one email address and another, so a lot of stray semicolons are left behind.
For example: the field:
123#hotmail.com;456#yahoo.com;789#gmail.com;xyz#msn.com
Let's say that after I deleted two email addresses, the field content became like:
;456#yahoo.com;789#gmail.com;
I need to clean these fields from these extra undesired semicolons to be like
456#yahoo.com;789#gmail.com
For double semicolons I have used REPLACE as well by replacing each ;; with ;
Is there any way to delete any semicolon that is not preceded or followed by another character?
If you only need to replace semicolons at the start or end of the string, using a regular expression with the anchor '^' (beginning of string) / '$' (end of string) should achieve what you want:
with v_data as (
select '123#hotmail.com;456#yahoo.com;789#gmail.com;xyz#msn.com' value
from dual union all
select ';456#yahoo.com;789#gmail.com;' value from dual
)
select
value,
regexp_replace(regexp_replace(value, '^;', ''), ';$', '') as normalized_value
from v_data
If you also need to replace stray semicolons from the middle of the string, you'll probably need regexes with lookahead/lookbehind.
You remove leading and trailing characters with TRIM:
select trim(both ';' from ';456#yahoo.com;;;789#gmail.com;') from dual;
To replace multiple characters with only one occurrence use REGEXP_REPLACE:
select regexp_replace(';456#yahoo.com;;;789#gmail.com;', ';+', ';') from dual;
Both methods combined:
select regexp_replace( trim(both ';' from ';456#yahoo.com;;;789#gmail.com;'), ';+', ';' ) from dual;
A regular expression replace can help:
select regexp_replace('123#hotmail.com;456#yahoo.com;;456#yahoo.com;;789#gmail.com',
'456#yahoo.com(;)+') as result from dual;
Output:
| RESULT |
|-------------------------------|
| 123#hotmail.com;789#gmail.com |