Oracle RegExp_Substr not able to escape comma present between quotes - sql

I have a JSON File, as shown below.
"orderingCustomer":{
"#class":"com.worldwide.sector",
"option":"K",
"addressLine1":"DYNAMIC OFFICE, STREET 2",
"addressLine2":null,
"addressLine3":null,
"partyId":null,
"partyName":"DYNAMIC LTD",
"partyBic":null,
"accountNumber":null
}
My Query does parsing of this JSON, and returns rows based on Comma(,) as delimiter.
SELECT CAST( TRIM( REGEXP_SUBSTR( ( SELECT REPLACE( REGEXP_SUBSTR( DBMS_LOB.SUBSTR( SWIFT_DATA, 32000 ), '["]orderingCustomer["]:{[^}]+' ), '"orderingCustomer":"', '' )
FROM TXN_SWIFT
WHERE ID_TXN_SWIFT = 123 ),
'[^,]+',
1,
3 ) ) AS VARCHAR2( 320 ) )
TRANSTYPE
FROM TXN_SWIFT_OUT_MSG
WHERE MESSAGE_UUID = 12345;
This query works fine, and gives me row-wise results for each keyword based on the delimiter (comma). But I have a problem when I search for "addressLine1", where the result is shown as
"addressLine1":"DYNAMIC OFFICE
instead of
"addressLine1":"DYNAMIC OFFICE, STREET 2"
I have tried changing the regular expression to the regex shown
,(?=(?:[^"]*"[^"]*")*[^"]*$)
But I am still unable to get the data required as shown above. After replacing the regex [^,]+ with ,(?=(?:[^"]*"[^"]*")*[^"]*$)
I no longer get any values at all. Please suggest what could be wrong with my query.
(Using 11g version)

Exploit the json syntax:
"[^"]+":("[^"]+"|null)
Basically this mirrors the syntax for representing a single property in json. Supported types are strings and null.
Test it at regex101.
Caveat
This regex does not handle the case of escaped double quotes inside strings. The following modification addresses this issue:
"[^:]+?:("([^\\"]+(\\.)*)+"|null|true|false)
Basically, this pattern interprets a string delimited by double quotes as a sequence of strings without \ or " separated by sequences of escaped characters. The construct includes the case that there are no such escaped characters. (The additional literals cover boolean attributes.)
Test it at regex101.
Note
Note that neither arrays nor objects as property values are supported, but as the original pattern obviously does not account for them either, this does not seem to be part of your requirements.
If it were, regex matching would get nasty quickly, since you would have to cater for recursion in your pattern (and I am not even sure whether Oracle's regex engine would support that).
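As a minimal sketch of how the first pattern could be used from REGEXP_SUBSTR: the inline literal below is only a stand-in for the JSON held in SWIFT_DATA, and the column aliases are made up for illustration. The WITH column list assumes 11g R2. The first expression picks the third key/value pair by position; the second looks the property up by name and returns only the value through the subexpression argument.

```sql
-- Sketch only: the literal stands in for the JSON stored in SWIFT_DATA.
WITH t(json_text) AS (
  SELECT '{"#class":"com.worldwide.sector","option":"K",'
      || '"addressLine1":"DYNAMIC OFFICE, STREET 2","addressLine2":null}'
  FROM dual
)
SELECT
  -- Third key/value pair; the comma inside the quoted value stays intact.
  REGEXP_SUBSTR(json_text, '"[^"]+":("[^"]+"|null)', 1, 3) AS third_pair,
  -- Look the property up by name and return only the text between the quotes.
  REGEXP_SUBSTR(json_text, '"addressLine1":"([^"]*)"', 1, 1, NULL, 1) AS address_line1
FROM t;
```

With this sample, third_pair should come back as "addressLine1":"DYNAMIC OFFICE, STREET 2" and address_line1 as DYNAMIC OFFICE, STREET 2. Looking the key up by name avoids depending on the order of the properties.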

Related

Extracting string between two characters in sql oracle database

I need to extract a string that is located between two characters and always follows the same pattern
sample string:
A CRN_MOB_H_001 a--> <AVLB>
AVLB is the part I want to extract. The whole string will always have the same pattern, and everything before the < is irrelevant to me.
The string will always have the same pattern:
Some string with possible special characters such as <>, although very unlikely so, it can be ignored if too complicated
a space
then -->
a space
and then the part that is interesting <XXXXXXX>
The XXXXXXX representing the part I want to extract
thank you for your time.
I have tried several things, could not get anywhere I wanted.
Please try this REGEXP_SUBSTR(), which selects what is in the angled brackets when they occur at the end of the string.
Note the WITH clause just sets up test data and is a good way to supply data for people to help you here.
WITH tbl(str) AS (
SELECT 'A CRN_MOB_H_001 a--> <AVLB>' FROM dual
)
SELECT REGEXP_SUBSTR(str, '.*<(.*)>$', 1, 1, NULL, 1) DATA
FROM tbl;
DATA
----
AVLB
1 row selected.
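Two alternatives, sketched under the assumption that the bracketed part always sits at the very end of the string: a tighter regex that does not let the capture run across other angle brackets, and a plain SUBSTR/INSTR version that searches from the end of the string. The column aliases are illustrative only.

```sql
WITH tbl(str) AS (
  SELECT 'A CRN_MOB_H_001 a--> <AVLB>' FROM dual
)
SELECT
  -- Capture only characters that are not angle brackets, anchored at the end.
  REGEXP_SUBSTR(str, '<([^<>]*)>$', 1, 1, NULL, 1) AS via_regexp,
  -- INSTR with a negative position searches backwards from the end of the string.
  SUBSTR(str,
         INSTR(str, '<', -1) + 1,
         INSTR(str, '>', -1) - INSTR(str, '<', -1) - 1) AS via_substr
FROM tbl;
```

Both columns should return AVLB for the sample row.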

Remove template text on regexp_replace in Oracle's SQL

I am trying to remove template text like &#x; or &#xx; or &#xxx; from a long string
Note: x / xx / xxx is a number whose length is unknown, and the column type is CLOB
for example:
SELECT 'H&#39;ello wor&#177;ld' FROM dual
A desirable result:
Hello world
I know that regexp_replace should be used, But how do you use this function to remove this text?
You can use
SELECT REGEXP_REPLACE(col,'&&#\d+;')
FROM t
where
& is put twice to provide escaping for the substitution character
\d represents a digit, and the following + allows one or more of them
; ends the pattern
Alternatively, use a single ampersand ('&#\d+;') in the pattern, as in the Demo. The ampersand needs this care only because client tools such as SQL*Plus treat it as the substitution-variable prefix.
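One way to sidestep the substitution-character question entirely is to build the ampersand with CHR(38). The following is a sketch with an inline sample string; the table t and column col are stand-ins, as in the answer above.

```sql
-- CHR(38) is the ampersand; using it keeps client tools from prompting
-- for a substitution variable.
WITH t(col) AS (
  SELECT 'H' || CHR(38) || '#39;ello wor' || CHR(38) || '#177;ld' FROM dual
)
SELECT REGEXP_REPLACE(col, CHR(38) || '#\d+;') AS cleaned
FROM t;
```

For the sample string this should return Hello world. REGEXP_REPLACE replaces every occurrence by default, and omitting the replacement argument removes the matches.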
In case you wanted to remove the entities because you don't know how to replace them by their character values, here is a solution:
UTL_I18N.UNESCAPE_REFERENCE( xmlquery( 'the_double_quoted_original_string' RETURNING content).getStringVal() )
In other words, the original 'H&#39;ello wor&#177;ld' should be passed to XMLQUERY as '"H&#39;ello wor&#177;ld"'.
And the result will be 'H'ello wor±ld'

Extract string between different special symbols

I am having following string in my query
.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt
beginning with a period from which I need to extract the segment between the final \ and the file extension period, meaning following expected result
ABC__123_123_123_ABC123
I am fairly new to using REGEXP and could not find an elegant (or even workable) solution in the Q&A here or elsewhere. In all queries the pattern is the same in quantity and order, but for the sake of learning I would prefer not to just count and cut.
You can use REGEXP_REPLACE function such as
REGEXP_REPLACE(col,'(.*\\)(.*)\.(.*)','\2')
in order to extract the piece that starts after the last backslash and runs up to the last dot. The leading backslashes in \\ and \. are escape characters: they turn the special characters \ and . into the literal characters we intend to match.
Demo
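As a quick check, here is the expression applied to the sample path from the question. This is only a sketch with an inline literal; in a real query col would be the column holding the path.

```sql
-- Backslashes need no doubling inside an Oracle string literal;
-- only the regex itself escapes them.
SELECT REGEXP_REPLACE(
         '.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
         '(.*\\)(.*)\.(.*)',   -- (path up to last \)(file name).(extension)
         '\2'                  -- keep only the file name group
       ) AS file_name
FROM dual;
```

This should return ABC__123_123_123_ABC123. Note that REGEXP_REPLACE returns the input unchanged when the pattern does not match at all, so a value without a backslash or a dot would come back as is.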
You need just regexp_substr and simple regexp ([^\]+)\.[^.]*$
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'([^\]+)\.[^.]*$',
1, -- position
1, -- occurrence
null, -- match_parameter
1 -- subexpr
) substring
from dual;
([^\]+)\.[^.]*$ means:
([^\]+) - one or more (+) characters other than a backslash ([] is a set, ^ negates it), captured as group \1 (subexpression #1)
\. - then a literal dot (. is a special character meaning "any character", so we need to escape it with \, the escape character)
[^.]* - zero or more characters other than .
$ - end of line
So this regexp means: find a substring that consists of one or more characters other than a backslash, followed by a dot, followed by zero or more characters other than a dot, anchored at the end of the string. The subexpr parameter = 1 tells Oracle to return the first subexpression (i.e. the first matched group in (...)).
Other parameters you can find in the doc.
Here is a simple example that is fully compatible with Oracle 11g R2, PCRE2 and some other regex flavors.
Oracle 11g R2 using the function regexp_substr (Reference documentation)
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}',
1,
1
) substring
from dual;
Pattern: ((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}
Result: ABC__123_123_123_ABC123
This is about as simple as it can be. Because the pattern sticks to constructs that most regular expression flavors share, it is also portable, in case someone else is interested in going the simplest way.
Hopefully, this will help you out!

How do I extract data between two strings based on a pattern in Oracle SQL

I want to extract data from a column of type CLOB in Oracle SQL based on a specific pattern. I tried different things with regex, but nothing has worked so far.
Please find below an example of how the data looks and the expected output.
Sample Data:
I need to extract the part of the CLOB column that starts at the word LIST and ends one word before the . (dot)
PS: CLOB can have CR LF / Carriage return within the pattern.
Expected Output:
Here is how I would do this. Note a couple of things:
The output preserves newlines that existed in the input. You didn't say anything about removing them; however, your output doesn't show them. In any case, they can be removed if needed, but that is an unrelated process.
You say "word" but obviously you are using that in a sense different from the common usage in regular expressions. In regexp, "word characters" are only letters, digits and underscore; yet your "words" include brackets, equal sign, and who knows what else. I interpreted the term "word" to mean any sequence of consecutive non-whitespace characters.
Here is how we can recreate your data. When you ask a question here, this is how you should provide sample data - not as an image that we can't copy and paste in an SQL editor.
CREATE TABLE sample_data( col_a varchar2(20), col_b CLOB );
INSERT INTO sample_data VALUES
('12345', to_clob(
'Created:2/28/2019
Updated:1/19/2021
LIST:[ABC][DEF][GHI]
[LMNO][PQRST]
[Location=BLAH].[City=BLAH]'));
INSERT INTO sample_data VALUES
('12346', to_clob(
'Created:2/28/2019
Updated:1/19/2021
LIST:[ABC][DEF][GHI]
[LMNO][PQRST]
[SOC].[RAW]'));
commit;
Then here is the query and the output. Note that, depending on your interface (in my case: SQL Developer, which uses a SQL*Plus-like interface), you may need to change some settings so that the output is not truncated. In particular, in SQL*Plus, CLOB columns are truncated to 80 characters by default; I had to
set long 100
So - query and output:
select col_a, col_b,
regexp_substr(col_b, '(\s|^)(LIST:[^.]*?)\s+\S+\.', 1, 1, null, 2)
as result
from sample_data
;
COL_A COL_B                          RESULT
----- ------------------------------ ------------------------------
12345 Created:2/28/2019              LIST:[ABC][DEF][GHI]
      Updated:1/19/2021              [LMNO][PQRST]
      LIST:[ABC][DEF][GHI]
      [LMNO][PQRST]
      [Location=BLAH].[City=BLAH]
12346 Created:2/28/2019              LIST:[ABC][DEF][GHI]
      Updated:1/19/2021              [LMNO][PQRST]
      LIST:[ABC][DEF][GHI]
      [LMNO][PQRST]
      [SOC].[RAW]
The regular expression matches a single whitespace character or the beginning of the string ((\s|^)), then the characters LIST: followed by as few consecutive, non-period characters (this will match spaces and newline characters, in particular) as needed to allow a match - which continues with one or more whitespace characters, followed by a single word (string of 1 or more non-whitespace characters) and a literal period (\.).
The expression we must return is enclosed in parentheses, so that we can return it from regexp_substr. Such an expression is called a "capture group". The regexp includes another capture group, (\s|^), out of necessity (alternation), so the capture group we must return is the second in the regexp. This is what the last argument to regexp_substr does: it instructs the function to return the second capture group.
Note a peculiar thing about the period (related to the much more general concept of escaping within bracket expressions): the period must be escaped to represent a literal period, rather than "any character", at the end of the regular expression; however, within the (negated) bracket expression [^.]*?, the period - representing a literal period, not "any character" - is not escaped. Oracle follows the ERE (extended regular expressions) dialect of the POSIX standard, and that standard says that escape sequences are invalid within bracket expressions. This is different from other regular expression dialects, and confuses a lot of users.
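The effect can be seen with a small test string that contains both a backslash and a period. This is a sketch; the literal and the column aliases are made up for illustration.

```sql
-- Inside a bracket expression the backslash is an ordinary character,
-- so [^\.] excludes both the backslash and the period.
SELECT REGEXP_SUBSTR('a\b.c', '[^.]+')  AS period_only,
       REGEXP_SUBSTR('a\b.c', '[^\.]+') AS backslash_and_period
FROM dual;
```

Under the POSIX rule described above, period_only should return a\b, while backslash_and_period should stop at the backslash and return just a.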
One option would be to use REPLACE() to remove line feeds (CHR(10)) and carriage returns (CHR(13)), and to nest those calls inside REGEXP_REPLACE() in order to extract the substring after LIST: up to the dot, such as
SELECT col_a,
'LIST:'||REGEXP_REPLACE(REPLACE(REPLACE(col_b,CHR(10)),CHR(13)),'(.*LIST:)(\S+)(\..*)','\2') AS result
FROM t;
col_a result
------ -------
12345 LIST:[ABC][DEF][GHI][LMNO][PQRST][Location=BLAH]
12346 LIST:[ABC][DEF][GHI][LMNO][PQRST][SOC]
Demo
There may be more efficient ways to do this, but the following seems to work:
First I replace newline characters using TRANSLATE (line feeds become spaces and carriage returns are dropped), then I use a regex to find everything between LIST: and the next period. Then I remove the final "word" using SUBSTR and INSTR. I've used a subquery to avoid having to repeat the first steps.
SELECT
SubQuery.COL_A,
SUBSTR(SubQuery.WithWordAndDot, 1, INSTR(SubQuery.WithWordAndDot,' ',-1)-1) AS Result
FROM
(
SELECT
COL_A,
REGEXP_SUBSTR(TRANSLATE(COL_B, CHR(10)||CHR(13), ' '),'LIST:[^\.]+\.') as WithWordAndDot
FROM MyTable
) SubQuery
;

REGEXP to insert special characters, not remove

How would I put double quotes around the two fields that are missing them? Would I be able to use something like INSTR/SUBSTR/REPLACE in one statement to accomplish it?
string := '"ES26653","ABCBEVERAGES","861526999728",606.32,"2017-01-26","2017-01-27","","",77910467,"DOROTHY","","RAPP","14219 PIERCE STREET, APT1","","OMAHA","NE","68144"';
Expected string := '"ES26653","ABCBEVERAGES","861526999728","606.32","2017-01-26","2017-01-27","","","77910467","DOROTHY","","RAPP","14219 PIERCE STREET, APT1","","OMAHA","NE","68144"';
Please suggest! Thank you.
This answer does not work in this case, because some fields contain commas. I am leaving it in case it helps anyone else.
One rather brute force method for internal fields is:
replace(replace(string, ',', '","'), '""', '"')
This adds double quotes on either side of a comma and then removes double double quotes. You don't need to worry about "". It becomes """" and then back to "".
This can be adapted for the first and last fields as well, but it complicates the expression.
This offering attempts to address a number of end cases:
Addressing issues with first and last fields. Here only the last field is a special case as we look out for the end-of-string $ rather than a comma.
Empty unquoted fields i.e. leading commas, consecutive commas and trailing commas.
Preserving a pair of double quotes within a field representing a single double quote.
The SQL:
WITH orig(str) AS (
SELECT '"ES26653","ABCBEVERAGES","861526999728",606.32,"2017-01-26","2017-01-27","","",77910467,"DOROTHY","","RAPP","14219 PIERCE STREET, APT1","","OMAHA","NE","68144"'
FROM dual
),
rpl_first(str) AS (
SELECT REGEXP_REPLACE(str, '("(([^"]|"")*)"|([^,]*))(,|$)','"\2\4"\5')
FROM orig
)
SELECT REGEXP_REPLACE(str, '"""$','"') fixed_string
FROM rpl_first;
The technique is to find either a quoted field and remember it or a non-quoted field and remember it, terminated by a comma or end-of-string, and to remember that terminator as well. The answer is then a " followed by one of the fields followed by " and then the terminator.
The quoted field is basically "[^"]*" where [^"] is any character that is not a quote and * means repeated zero or more times. This is complicated by the fact that the not-a-quote character could also be a pair of quotes, so we need an OR construct (|), i.e. "([^"]|"")*". However, we must remember just the field inside the quotes, so we add brackets so that we can later back reference just that, i.e. "(([^"]|"")*)".
The unquoted field is simply a non-comma repeated zero or more times where we want to remember it all ([^,]*).
So we want to find either of these, the OR construct again i.e. ("(([^"]|"")*)"|([^,]*)). Followed by the terminator, either a comma or end-of-string, which we want to remember i.e. (,|$).
Now we can replace this with one of the two types of field we found enclosed in quotes followed by the terminator i.e. "\2\4"\5. The number n for the back reference \n is just a matter of counting the open brackets.
The second REGEXP_REPLACE is to work around something I suspect is an Oracle bug. If the last field is quoted then an extra pair of quotes is added to the end of the string. This suggests that the end-of-string is being processed twice when it is parsed, which would be a bug. However, regexp processing is probably done by a standard library routine, so it may be my interpretation of the regexp rules. Comments are welcome.
Oracle regexp documentation can be found at Using Regular Expressions in Database Applications.
My thanks to @Gary_W for his template. Here I am keeping the two separate regexp blocks to separate the bit I can explain from the bit I can't (the bug?).
This method makes 2 passes on the string. First look for a grouping of a double-quote followed by a comma, followed by a character that is not a double-quote. Replace them by referring to them with the shorthand of their group: the first group '\1', the missing double-quote, then the second group '\2'. Then do it again, but the other way around. Sure, you could nest the regexp_replace calls and end up with one big ugly statement, but just make it 2 statements for easier maintenance. The guy working on this after you will thank you, and this is ugly enough as it is.
SQL> with orig(str) as (
select '"ES26653","ABCBEVERAGES","861526999728",606.32,"2017-01-26","2017-01-27","","",77910467,"DOROTHY","","RAPP","14219 PIERCE STREET, APT1","","OMAHA","NE","68144"'
from dual
),
rpl_first(str) as (
select regexp_replace(str, '(",)([^"])', '\1"\2')
from orig
)
select regexp_replace(str, '([^"])(,")', '\1"\2') fixed_string
from rpl_first;
FIXED_STRING
--------------------------------------------------------------------------------
"ES26653","ABCBEVERAGES","861526999728","606.32","2017-01-26","2017-01-27","",""
,"77910467","DOROTHY","","RAPP","14219 PIERCE STREET, APT1","","OMAHA","NE","681
44"
SQL>
EDIT: Changed the regexes and added a third step to allow for empty, unquoted fields per Unoembre's comment. Good catch! Also added additional test cases. Always expect the unexpected and make sure to add test cases for all data combinations.
SQL> with orig(str) as (
select '"ES26653","ABCBEVERAGES","861526999728",606.32,"2017-01-26","2017-01-27","","",77910467,"DOROTHY","","RAPP","14219 PIERCE STREET, APT1","","OMAHA","NE","68144"'
from dual union
select 'ES26653,"ABCBEVERAGES","861526999728"' from dual union
select '"ES26653","ABCBEVERAGES",861526999728' from dual union
select '1S26653,"ABCBEVERAGES",861526999728' from dual union
select '"ES26653",,861526999728' from dual
),
rpl_empty(str) as (
select regexp_replace(str, ',,', ',"",')
from orig
),
rpl_first(str) as (
select regexp_replace(str, '(",|^)([^"])', '\1"\2')
from rpl_empty
)
select regexp_replace(str, '([^"])(,"|$)', '\1"\2') fixed_string
from rpl_first;
FIXED_STRING
--------------------------------------------------------------------------------
"ES26653","ABCBEVERAGES","861526999728","606.32","2017-01-26","2017-01-27","",""
,"77910467","DOROTHY","","RAPP","14219 PIERCE STREET, APT1","","OMAHA","NE","681
44"
"ES26653","ABCBEVERAGES","861526999728"
"ES26653","","861526999728"
"1S26653","ABCBEVERAGES","861526999728"
"ES26653","ABCBEVERAGES","861526999728"
SQL>