I’m trying to dynamically extract a substring from a very long URL. For example, I may have the following URLs:
https://www.google.com/ABCDEF Version=“0.0.00.0” GHIJK
https://www.google.com/ABCDEFGH Version=“0.0.0.0” IJKLM
https://www.google.com/ABC Version=“0.0.0.00” 12345
I am trying to extract the version code only (0.0.0.0).
This is what I have so far:
SELECT SUBSTR(col, INSTR(col, 'Version="')+9)
FROM table
This query returns the following result:
0.0.00.0” GHIJK … (url continues on)
So, I attempt to find “Version” in the link, so I can start from the same position in each row. This works fine, however I’m having a hard time dynamically locating the ending quote (“). I tried using INSTR in the third parameter of my SUBSTR function, like so:
SELECT SUBSTR(col, INSTR(col, 'Version="')+9, INSTR(col, '"'))
FROM table
I figured that this would find the position of the ending quote, and then use that number for the length, but it returns a strange output. I’ve also used POSITION, CHARINDEX, LENGTH, and LOCATE. None of these functions work in Oracle.
I think maybe when I put +9 after the first INSTR function, it’s setting the query to a fixed position instead of a dynamic one, but I’m not sure how else to remove ‘Version=“‘.
Here's one option (which, actually, selects what's between double quotes - that's version in your example; if there were some other similar substring, you'd get a wrong result).
with test (col) as
(select 'https://www.google.com/ABCDEF Version="0.0.00.0" GHIJK' from dual union all
select 'https://www.google.com/ABCDEFGH Version="0.0.0.0" IJKLM' from dual union all
select 'https://www.google.com/ABC Version="0.0.0.00" 12345' from dual
)
select col,
replace(regexp_substr(col, '".+"'), '"') version
from test;
which results in
https://www.google.com/ABCDEF Version="0.0.00.0" GHIJK 0.0.00.0
https://www.google.com/ABCDEFGH Version="0.0.0.0" IJKLM 0.0.0.0
https://www.google.com/ABC Version="0.0.0.00" 12345 0.0.0.00
You can still use INSTR to locate the second " in the string, then subtract the location of the first " to get the length that you need. Below is an example query:
SELECT col,
SUBSTR (col, INSTR (col, '"') + 1, INSTR (col, '"', 1, 2) - INSTR (col, '"') - 1) version
FROM test;
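A variation on the same idea, as a rough sketch: the third argument of SUBSTR is a length, not an end position, which is why passing INSTR(col, '"') directly gives a strange result. The query below anchors on Version=" instead of on the first double quote, so it should still work if another quoted value appears earlier in the URL. The test CTE and its col column are just sample data taken from the question.

```sql
WITH test (col) AS (
  SELECT 'https://www.google.com/ABCDEF Version="0.0.00.0" GHIJK' FROM dual UNION ALL
  SELECT 'https://www.google.com/ABCDEFGH Version="0.0.0.0" IJKLM' FROM dual UNION ALL
  SELECT 'https://www.google.com/ABC Version="0.0.0.00" 12345' FROM dual
)
SELECT col,
       SUBSTR(col,
              -- start: first character after Version="
              INSTR(col, 'Version="') + 9,
              -- length: position of the closing quote minus the start position
              INSTR(col, '"', INSTR(col, 'Version="') + 9)
                - (INSTR(col, 'Version="') + 9)) AS version
FROM test;
```

The second INSTR starts searching at the version's first character, so it finds the closing quote rather than the opening one.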
You can use REGEXP_SUBSTR() with the pattern Version="(\d.*\d)" in order to extract the piece between Version=" and the closing " (your quotes are presumed to be regular double quotes " ")
SELECT REGEXP_SUBSTR(url,'Version="(\d.*\d)"',1,1,null,1) AS version
FROM t
where
the third argument (1) is the position,
the fourth argument (1) is the occurrence, and
the last argument (1) is the capture group to return, which is the especially important one here.
Indeed, the shorter pattern '"(\d.*\d)"' is enough for the current data set
or
REGEXP_REPLACE() with capture group \2 as
SELECT REGEXP_REPLACE(url,'^(.*Version=")([^"]*).*','\2') AS version
FROM t
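A possible refinement, offered only as a sketch: because .* is greedy, the pattern \d.*\d could run past the closing quote if the URL contained a second quoted value ending in a digit. Using [^"]* for the captured part stops at the first closing quote. The table t and column url are the same placeholder names used above; the inline sample row is made up for illustration.

```sql
WITH t (url) AS (
  -- hypothetical row with a second quoted value after the version
  SELECT 'https://www.google.com/ABC Version="0.0.0.00" Build="7" 12345' FROM dual
)
SELECT REGEXP_SUBSTR(url, 'Version="([^"]*)"', 1, 1, NULL, 1) AS version
FROM t;
```

For this row the capture group holds 0.0.0.00, since [^"]* cannot cross the quote that ends the version.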
I have a field with the following values. I want to extract only those rows with "xyz" in the field value, as shown below. Can you please help?
Mydata_xyz_aug21
Mydata2_zzz_aug22
Mydata3_xyz_aug33
One more requirement
I want to extract only "aIBM_MyProjectFile" from following string below, can you please help me with this?
finaldata/mydata/aIBM_MyProjectFile.exe.ld
I've tried this but it didn't work.
select
regexp_substr('FinalProject/MyProject/aIBM_MyProjectFile.exe.ld','([^/]*)[\.]') exp
from dual;
To extract substrings between the first pair of underscores, you need to use
regexp_substr('Mydata_xyz_aug21','_([^_]+)_', 1, 1, NULL, 1)
To get the file name without the extension, you need
regexp_substr('FinalProject/MyProject/aIBM_MyProjectFile.exe.ld','.*/([^.]+)', 1, 1, NULL, 1)
Note that each regex contains a capturing group (a pattern inside (...)) and this value is accessed with the last 1 argument to the regexp_substr function.
The _([^_]+)_ pattern finds the first _, then places 1 or more chars other than _ into Group 1 and then matches another _.
The .*/([^.]+) pattern matches the whole text up to the last /, then captures 1 or more chars other than . into Group 1 using ([^.]+).
For the first requirement, it would suffice to use LIKE, as posted in answer above:
SELECT column
FROM table
WHERE column LIKE '%xyz%';
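Note that LIKE '%xyz%' matches xyz anywhere in the value. If the match should be limited to the part between the first pair of underscores, one option is to reuse the capturing-group pattern from the answer above as a filter. This is a minimal sketch; the CTE t and the column name val are assumptions standing in for your real table and column.

```sql
WITH t (val) AS (
  SELECT 'Mydata_xyz_aug21'  FROM dual UNION ALL
  SELECT 'Mydata2_zzz_aug22' FROM dual UNION ALL
  SELECT 'Mydata3_xyz_aug33' FROM dual
)
SELECT val
FROM t
-- compare only the text between the first two underscores
WHERE REGEXP_SUBSTR(val, '_([^_]+)_', 1, 1, NULL, 1) = 'xyz';
```

With the three sample values this keeps Mydata_xyz_aug21 and Mydata3_xyz_aug33.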
For your second requirement (extraction) you will have to use REGEXP_SUBSTR function:
SELECT REGEXP_SUBSTR ('FinalProject/MyProject/aIBM_MyProjectFile.exe.ld', '.*/([^.]+)', 1, 1, NULL, 1)
FROM DUAL
I hope it helped!
Another way to do this is to skip regexp completely:
WITH
aset AS
(SELECT 'with_extension.txt' txt FROM DUAL
UNION ALL
SELECT 'without_extension' FROM DUAL)
SELECT CASE
WHEN INSTR (txt, '.', -1) > 0
THEN
SUBSTR (txt, 1, INSTR (txt, '.', -1) - 1)
ELSE
txt
END
txt
FROM aset
The result of this is
with_extension
without_extension
A BIG Caveat where the regexp is better:
My method doesn't handle this case correctly:
\this\is.a\test
So after I have gone to all this effort, stay with the regexp solutions. I'll leave this here so that others may learn from it.
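For completeness, a sketch of how the non-regexp approach might be adapted to that case: first cut off everything up to the last path separator, then look for the dot only in what remains. It assumes a backslash separator, as in the example above, and it removes only the last extension (so a name like file.exe.ld would become file.exe, unlike the regexp, which stops at the first dot). The names CTE and the fname and base_name aliases are introduced here for illustration.

```sql
WITH aset AS (
  SELECT 'with_extension.txt' txt FROM dual UNION ALL
  SELECT 'without_extension'      FROM dual UNION ALL
  SELECT '\this\is.a\test'        FROM dual
),
names AS (
  -- INSTR returns 0 when there is no backslash, so SUBSTR then starts at 1
  SELECT txt, SUBSTR(txt, INSTR(txt, '\', -1) + 1) AS fname
  FROM aset
)
SELECT txt,
       CASE
         WHEN INSTR(fname, '.', -1) > 0
         THEN SUBSTR(fname, 1, INSTR(fname, '.', -1) - 1)
         ELSE fname
       END AS base_name
FROM names;
```

For the three sample rows this gives with_extension, without_extension and test.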
I have a string like "1490/2334/5166400411000434" from which I need to derive value after second slash. I tried below logic
select REGEXP_SUBSTR('1490/2334/5166400411000434','[^/]+',1,3) from dual;
This works fine. But when I don't have a value between the first and second slash, it returns blank.
For example, my string is "1490//5166400411000434" and I am trying
select REGEXP_SUBSTR('1490//5166400411000434','[^/]+',1,3) from dual;
It returns blank. Please suggest what I am missing.
If I understand well, you may need
regexp_substr(t, '(([^/]*/){2})([^/]*)', 1, 1, 'i', 3)
This handles the first 2 parts like 'xxx/' and then checks for a sequence of non / characters; the parameter 3 is used to get the 3rd matching subexpression, which is what you want.
For example:
with test(t) as (
select '1490/2334/5166400411000434' from dual union all
select '1490//5166400411000434' from dual union all
select '1490//5166400411000434/ramesh/3344' from dual
)
select t, regexp_substr(t, '(([^/]*/){2})([^/]*)', 1, 1, 'i', 3) as substr
from test
gives:
T SUBSTR
---------------------------------- ----------------------------------
1490/2334/5166400411000434 5166400411000434
1490//5166400411000434 5166400411000434
1490//5166400411000434/ramesh/3344 5166400411000434
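As for why the original attempt returns blank: [^/]+ needs at least one character, so the empty field between the two slashes is not counted as an occurrence, and '1490//5166400411000434' only has two occurrences of it. A commonly used alternative, shown here as a sketch, matches each field lazily up to the next slash or the end of the string, so empty fields count too and the occurrence argument can be used directly as the field number.

```sql
WITH test (t) AS (
  SELECT '1490/2334/5166400411000434'         FROM dual UNION ALL
  SELECT '1490//5166400411000434'             FROM dual UNION ALL
  SELECT '1490//5166400411000434/ramesh/3344' FROM dual
)
SELECT t,
       -- 3rd occurrence of "field followed by / or end of string", return group 1
       REGEXP_SUBSTR(t, '(.*?)(/|$)', 1, 3, NULL, 1) AS third_field
FROM test;
```

All three rows should return 5166400411000434, the same result as the query above.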
You can REVERSE() your string and take the value before the first slash. And then reverse again to obtain the desired output.
select reverse(regexp_substr(reverse('1490//5166400411000434'), '[^/]+', 1, 1)) from dual;
It can also be done with basic substring and instr function:
select reverse(SUBSTR(reverse('1490//5166400411000434'), 0, INSTR(reverse('1490//5166400411000434'), '/')-1)) from dual;
Use the other options of REGEXP_SUBSTR to match a pattern
select REGEXP_SUBSTR('1490//5166400411000434','(/\d*)/(\d+)',1,1,'x',2) from dual
Basically, it looks for the pattern of two slashes with optional digits between them, starting at position 1 and taking the 1st occurrence, ignores whitespace in the pattern ('x'), and then returns the 2nd subexpression, i.e. the content of the second pair of parentheses:
... pattern,1,1,'x',subexp2)
How to replace multiple whole characters, except those in combinations...?
The below code replaces multiple characters, but it also disturbing those in combinations.
SELECT regexp_replace('a,ca,va,ea,r,y,q,b,g','(a|y|q|g)','X') RESULT FROM dual;
Current output:
RESULT
--------------------
X,cX,vX,eX,r,X,X,b,X
Expected output:
RESULT
------------------------
X,ca,va,ea,r,X,X,b,X
I just want to replace only the separate whole characters ('a','y','q','g'), but not the ones in combinations ('ca','va','ea')...
Because you are delimiting with a comma ',' you can combine that like ',a,'
and this will replace only single a's.
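One way this idea could be written out, as a sketch only: wrap the string in commas and double the inner commas so that every element has its own pair of delimiters (otherwise two neighbouring matches would compete for the same comma), replace ,a, / ,y, / ,q, / ,g, with ,X, and then undo the padding. It assumes the list has no empty elements, because the doubled commas and the final TRIM would otherwise be ambiguous.

```sql
WITH t AS (
  SELECT 'a,ca,va,ea,r,y,q,b,g' str FROM dual
)
SELECT TRIM(BOTH ',' FROM                      -- 4. drop the outer commas again
         REPLACE(                              -- 3. collapse ,, back to ,
           REGEXP_REPLACE(
             ',' || REPLACE(str, ',', ',,') || ',',  -- 1. pad: every element gets its own commas
             ',(a|y|q|g),',                    -- 2. match whole single-letter elements only
             ',X,'),
           ',,', ',')) AS result
FROM t;
```

For the sample string this yields X,ca,va,ea,r,X,X,b,X.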
You can try the following:
with t as
(
select 'a,ca,va,ea,r,y,q,b,g' str
from dual
)
select substr(sys_connect_by_path(regexp_replace(regexp_substr(str, '[^,]+', 1, level), '^(a|y|q|g)$', 'X'), ','), 2) as str
from t
where connect_by_isleaf = 1
connect by level <= length(regexp_replace(str, '[^,]*')) + 1;
Sadly, Oracle doesn't support lookahead and lookbehind. But this is a solution I came up with.
SELECT regexp_replace
(regexp_replace
('a,ca,va,ea,r,y,q,b,g',
'^[ayqg](,)|(,)[ayqg](,)|(,)[ayqg]$',
'\2\4X\1\3'),'(,)[ayqg](,)','\1X\2')
RESULT FROM dual;
Unfortunately I had to apply the regexp twice, because a single pass does not replace two matching values that directly follow each other: the comma between them is consumed by the first match, so ..,a,y,.. becomes ..,X,y,... The inner call replaces the first and last values and the non-adjacent ones; the outer call then replaces the [ayqg] values that the first pass skipped.
Maybe this could be simplified into one expression, but I am not that familiar with Oracle's regex flavor.
By way of explanation: I group the commas and basically replace every ,[ayqg], with ,X, by backreferencing the commas.
You would look for word boundaries, which is \b, and which is unfortunately not supported by Oracle's regexp_replace.
So let's look for a non-word character \W or the beginning ^ or ending $ of the text.
select
regexp_replace('a,ca,va,ea,r,y,q,b,g','(^|$|\W)(a|y|q|g)(^|$|\W)','\1X\3') as result
from dual;
In order not to remove the non-word characters, we must have them in the replace string: \1 for the expression in the first parentheses, \3 for the one in the third. Thus we only change the expression in the second parentheses, which is a, y, q or g, to X.
Unfortunately above gives
X,ca,va,ea,r,X,q,b,X
The q was not replaced because, after matching ',y,', we are positioned at 'q,', whereas we would need to be positioned at ',q,' to recognize q as a word, too.
So we need to replace in iterations (i.e. recursively):
with results(txt, num) as
(
select 'a,ca,va,ea,r,y,q,b,g' as txt, 0 as num from dual
union all
select regexp_replace(txt, '(^|$|\W)(a|y|q|g)(^|$|\W)','\1X\3'), num + 1 as num
from results
where txt <> regexp_replace(txt, '(^|$|\W)(a|y|q|g)(^|$|\W)','\1X\3')
)
select max(txt) keep (dense_rank last order by num) as result
from results;
EDIT: Kevin Esche is right; of course one has to do it only twice. Hence you can also do:
select
regexp_replace(txt, search_str, replace_str) as result
from
(
select
regexp_replace(txt, search_str, replace_str) as txt, search_str, replace_str
from
(
select
'a,ca,va,ea,r,y,q,y,q,b,g' as txt,
'(^|$|\W)(a|y|q|g)(^|$|\W)' as search_str,
'\1X\3' as replace_str
from dual
)
);
with replaced_values as (
SELECT case when length(val)=1 then regexp_replace(val,'(a|y|q|g)','X') else val end new_val, lvl
from (
SELECT regexp_substr('a,ca,va,ea,r,y,q,b,g','[^,]+', 1, LEVEL) val, level lvl FROM dual
connect by regexp_substr('a,ca,va,ea,r,y,q,b,g','[^,]+',1, LEVEL) is not null
) all_values
)
select LISTAGG(new_val, ',') WITHIN GROUP (ORDER BY lvl) RESULT
from replaced_values
This statement splits the data into rows and replaces only those rows which contain a single character.
The rows are then aggregated back into one row.
This SQL also works with empty entries like 'a,,,b,c' and with more complex regular expressions:
with t as
(select ',a,,ca,va,ea,bbb,ba,r,y,q,b,g,,,' as str,
',' as delimiter,
'(a|y|q|g|ea|[b]*)' as regexp_expr,
'X' as replace_expr
from dual)
(select substr (sys_connect_by_path(regexp_replace(substr(str,
decode(level - 1, 0, 0, instr(str, ',', 1, level - 1)) + 1,
decode(instr(str, ',', 1, level),
0,
length(str),
instr(str, ',', 1, level) - 1) -
decode(level - 1, 0, 0, instr(str, ',', 1, level - 1))),
'^' || regexp_expr || '$',
replace_expr), ','), 2)
from t
where connect_by_isleaf = 1
connect by level <= length(regexp_replace(str, '[^'|| delimiter||']')) + 1)
Result
,X,,ca,va,X,X,ba,r,X,X,X,X,,,
I don't know much Oracle, but I would have thought something like this could work, assuming the delimiter is always a comma.
SELECT
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace('a,ca,va,ea,r,y,q,b,g','(,a,|,y,|,q,|,g,)',',X,') ,'(,a,|,y,|,q,|,g,)',',X,'), '(^a,|^y,|^q,|^g,)','X,'), '(,a$|,y$|,q$|,g$)',',X'), '(^a$|^y$|^q$|^g$)','X')
RESULT FROM dual;
The first two parts replace a single character between commas in the middle (the pass is repeated to catch adjacent matches), the third part gets those at the start of the string, the fourth is for the end of the string, and the fifth is for when the string has just one character.
This answer could probably be simplified with more advanced regexp use.
How can I replace words?
RS & OS ===> D, LS & IS ===> P
SECTION_ID    Output required
1-LS-1991     1-P-1991
1-IS-1991     1-P-1991
1-RS-1991     1-D-1991
1-OS-1991     1-D-1991
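One possible approach, sketched under a few assumptions: the target for LS and IS is P (as the sample output suggests), the code always sits between two hyphens, and the table and column names below (sections, section_id) are placeholders for your real ones. Two nested REGEXP_REPLACE calls with alternation cover both mappings.

```sql
WITH sections (section_id) AS (
  SELECT '1-LS-1991' FROM dual UNION ALL
  SELECT '1-IS-1991' FROM dual UNION ALL
  SELECT '1-RS-1991' FROM dual UNION ALL
  SELECT '1-OS-1991' FROM dual
)
SELECT section_id,
       REGEXP_REPLACE(
         REGEXP_REPLACE(section_id, '-(RS|OS)-', '-D-'),  -- RS, OS -> D
         '-(LS|IS)-', '-P-') AS output_required           -- LS, IS -> P
FROM sections;
```

Keeping the hyphens in the pattern avoids touching the same letters if they happen to appear elsewhere in the value.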