What is this Oracle regexp matching in this production code?

Here's the code that is in production:
dynamic_sql := q'[ with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]]' || q'[') AND
user_code not in ('A','E','I')
order by 1]';
Start at the beginning and search bizz_buzz
Match any one character that is NOT Z
Match any two characters that are not Y6
What's the ']' after the 6?
Then what?

I think that StackOverflow's formatting is causing some of the confusion in the answers. Oracle has a syntax for a string literal, q'[...]', which means that the ... portion is to be interpreted exactly as-is; so for instance it can include single quotes without having to escape each one individually.
But the code formatting here doesn't understand that syntax, so it is treating each single quote as a string delimiter, which makes the result look different than how Oracle really sees it.
The expression is concatenating two such string literals together. (I'm not sure why - it looks like it would be possible to write this as a single string literal with no issues.) As pointed out in another answer/comment, the resulting SQL string is actually:
with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]') AND
user_code not in ('A','E','I')
order by 1
And also as pointed out in another answer, the [^Y6] portion of the regex matches a single character, not two. So this expression should simply match any string whose first character is not 'Z' and whose second character is neither 'Y' nor '6'.
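To make the concatenation concrete, here is a small sketch in Python rather than Oracle SQL. It assumes Python's `re` module behaves like Oracle's regex engine for this particular pattern, and the sample `bizz_buzz` values are made up for illustration.

```python
import re

# Bodies of the two q'[...]' literals, reduced to the line that matters.
# The first literal ends at the "]'" that follows "6]"; the second begins at q'[.
first_literal = "where regexp_like (bizz_buzz,'^[^Z][^Y6]"
second_literal = "') AND"
combined = first_literal + second_literal
print(combined)

# The regex Oracle actually receives once the two literals are joined.
pattern = re.compile(r"^[^Z][^Y6]")

def is_match(value: str) -> bool:
    """True when the first char is not Z and the second is neither Y nor 6."""
    return pattern.search(value) is not None

# Hypothetical bizz_buzz values, not taken from the real table.
for value in ["AB123", "ZB123", "AY123", "A6123", "A"]:
    print(value, is_match(value))
```

Note that a one-character value such as "A" does not match, because the pattern needs two characters.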

When it is not paired with an opening [, ] means... Well... Itself:
^[^Z][^Y6]]/
^ assert position at start of the string
[^Z] match a single character not present in the list below
Z the literal character Z (case sensitive)
[^Y6] match a single character not present in the list below
Y6 a single character in the list Y6 literally (case sensitive)
] matches the character ] literally
Start at the beginning and search bizz_buzz
Match any one character that is NOT Z
Match any one character that is not Y or 6
What's the ']' after the 6? it's a ]

I'm afraid I have to post this here as the comment section is inappropriate for the formatting required. After your edit above that shows the entire statement, I ran this to see what the string ends up being:
select q'[ with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]]' || q'[') AND
user_code not in ('A','E','I')
order by 1]' txt
from dual;
It ended up yielding this:
with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]') AND
user_code not in ('A','E','I')
order by 1
It is apparent now that the closing bracket and quote at the end of the regex belong to the first alternate quote string and not to the regex. This is concatenating 2 alternate quoted strings which is a tad confusing as it sure looked like part of the regex. If anything you are learning the importance of comments for the poor person behind you! Please comment this accordingly when you are done figuring this out. Even include a link to this post.

Related

How do I extract data between two strings based on a pattern in Oracle SQL

I want to extract the data from a column which is of type CLOB in oracle SQL based on a specific pattern. I tried different things with regex nothing worked so far.
Please find below an example of what the data looks like and the expected output.
Sample Data:
I need to extract the part of the CLOB column that starts at the word LIST and ends one word before the . (dot)
PS: CLOB can have CR LF / Carriage return within the pattern.
Expected Output:
Here is how I would do this. Note a couple of things:
The output preserves newlines that existed in the input. You didn't
say anything about removing them; however, your output doesn't show
them. In any case - they can be removed, if needed, but that is an
unrelated process.
You say "word" but obviously you are using that in a sense different
from the common usage in regular expressions. In regexp, "word
characters" are only letters, digits and underscore; yet your
"words" include brackets, equal sign, and who knows what else. I interpreted the term "word" to mean any
sequence of consecutive non-whitespace characters.
Here is how we can recreate your data. When you ask a question here, this is how you should provide sample data - not as an image that we can't copy and paste in an SQL editor.
CREATE TABLE sample_data( col_a varchar2(20), col_b CLOB );
INSERT INTO sample_data VALUES
('12345', to_clob(
'Created:2/28/2019
Updated:1/19/2021
LIST:[ABC][DEF][GHI]
[LMNO][PQRST]
[Location=BLAH].[City=BLAH]'));
INSERT INTO sample_data VALUES
('12346', to_clob(
'Created:2/28/2019
Updated:1/19/2021
LIST:[ABC][DEF][GHI]
[LMNO][PQRST]
[SOC].[RAW]'));
commit;
Then here is the query and the output. Note that, depending on your interface (in my case: SQL Developer, which uses a SQL*Plus-like interface), you may need to change some settings so that the output is not truncated. In particular, in SQL*Plus, CLOB columns are truncated to 80 characters by default; I had to
set long 100
So - query and output:
select col_a, col_b,
regexp_substr(col_b, '(\s|^)(LIST:[^.]*?)\s+\S+\.', 1, 1, null, 2)
as result
from sample_data
;
COL_A COL_B                          RESULT
----- ------------------------------ ------------------------------
12345 Created:2/28/2019              LIST:[ABC][DEF][GHI]
      Updated:1/19/2021              [LMNO][PQRST]
      LIST:[ABC][DEF][GHI]
      [LMNO][PQRST]
      [Location=BLAH].[City=BLAH]
12346 Created:2/28/2019              LIST:[ABC][DEF][GHI]
      Updated:1/19/2021              [LMNO][PQRST]
      LIST:[ABC][DEF][GHI]
      [LMNO][PQRST]
      [SOC].[RAW]
The regular expression matches a single whitespace character or the beginning of the string ((\s|^)), then the characters LIST: followed by as few consecutive, non-period characters (this will match spaces and newline characters, in particular) as needed to allow a match - which continues with one or more whitespace characters, followed by a single word (string of 1 or more non-whitespace characters) and a literal period (\.).
The expression we must return is enclosed in parentheses, so that we can return it from regexp_substr. Such an expression is called a "capture group". The regexp includes another capture group, (\s|^), out of necessity (alternation), so the capture group we must return is the second in the regexp. This is what the last argument to regexp_substr does: it instructs the function to return the second capture group.
Note a peculiar thing about the period (related to the much more general concept of escaping within bracket expressions): the period must be escaped to represent a literal period, rather than "any character", at the end of the regular expression; however, within the (negated) bracket expression [^.]*?, the period - representing a literal period, not "any character" - is not escaped. Oracle follows the ERE (extended regular expressions) dialect of the POSIX standard, and that standard says that escape sequences are invalid within bracket expressions. This is different from other regular expression dialects, and confuses a lot of users.
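The capture-group logic can be tried outside the database. The following is a rough Python sketch, not Oracle code. It assumes Python's `re` treats this pattern the same way as Oracle does, and `extract_list` is a hypothetical helper name.

```python
import re

# First sample row from the question, with LF line breaks.
sample = (
    "Created:2/28/2019\n"
    "Updated:1/19/2021\n"
    "LIST:[ABC][DEF][GHI]\n"
    "[LMNO][PQRST]\n"
    "[Location=BLAH].[City=BLAH]"
)

# Same pattern as in the query: group 1 is (\s|^), group 2 is the part to return.
pattern = re.compile(r"(\s|^)(LIST:[^.]*?)\s+\S+\.")

def extract_list(text):
    """Return the second capture group, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(2) if m else None

print(extract_list(sample))
```

Asking for `group(2)` plays the role of the last argument to regexp_substr.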
One option would be to use nested REPLACE() calls to remove line feed (CHR(10)) and carriage return (CHR(13)) characters, then REGEXP_REPLACE() to extract the substring after LIST: up to the dot, such as
SELECT col_a,
'LIST:'||REGEXP_REPLACE(REPLACE(REPLACE(col_b,CHR(10)),CHR(13)),'(.*LIST:)(\S+)(\..*)','\2') AS result
FROM t;
col_a result
------ -------
12345 LIST:[ABC][DEF][GHI][LMNO][PQRST][Location=BLAH]
12346 LIST:[ABC][DEF][GHI][LMNO][PQRST][SOC]
Demo
There may be more efficient ways to do this, but the following seems to work:
First I replace newline characters with spaces using TRANSLATE, then use a regex to find everything from LIST: up to the first period (.). Then I remove the final "word" using SUBSTR and INSTR. I've used a subquery to avoid having to repeat the first steps.
SELECT
SubQuery.COL_A,
SUBSTR(SubQuery.WithWordAndDot, 1, INSTR(SubQuery.WithWordAndDot,' ',-1)-1) AS Result
FROM
(
SELECT
COL_A,
REGEXP_SUBSTR(TRANSLATE(COL_B, CHR(10)||CHR(13), ' '),'LIST:[^.]+\.') as WithWordAndDot
FROM MyTable
) SubQuery
;
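The same three steps (flatten the line breaks, grab everything from LIST: to the first period, drop the last word) can be sketched in Python as a sanity check. This is only an illustration of the idea, and `extract_list_flat` is a made-up name.

```python
import re

def extract_list_flat(text):
    # Mirrors TRANSLATE(col, CHR(10)||CHR(13), ' '): LF becomes a space, CR is dropped.
    flat = text.replace("\n", " ").replace("\r", "")
    # Everything from LIST: up to and including the first period.
    m = re.search(r"LIST:[^.]+\.", flat)
    if m is None:
        return None
    with_word_and_dot = m.group(0)
    # Cut at the last space, which removes the final "word" and the dot.
    last_space = with_word_and_dot.rfind(" ")
    if last_space == -1:
        return None
    return with_word_and_dot[:last_space]

# Second sample row from the question, here with CR LF line breaks.
sample = (
    "Created:2/28/2019\r\nUpdated:1/19/2021\r\n"
    "LIST:[ABC][DEF][GHI]\r\n[LMNO][PQRST]\r\n[SOC].[RAW]"
)
print(extract_list_flat(sample))
```

Unlike the capture-group query, this version returns the result with spaces where the line breaks used to be.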

How can I remove characters in a string after a specific special character (~) in snowflake sql?

I am using Snowflake SQL. I would like to remove characters from a string after a special character ~. How can I do that?
Here is the whole scenario. I have a string like 'CK#123456~fndkjfgdjkg'. I want only the number after #, and nothing after ~. The length of the number varies for that field value; it might be 1, 5, or 3 digits. I want to add a condition in the WHERE clause requiring that this number equal check_num from another table after joining. I am trying REGEXP_SUBSTR(A.SRC_TXT, '(?<=CK#)(.+?\b)') = C.CHK_NUM in the WHERE condition, and I am getting the error 'No repititive argument after ?'
You can use a regex for this
-- To remove just the character after a ~
select regexp_replace('fo~o bar','~.', '');
-- returns 'fo bar'
--If you want to keep the ~
select regexp_replace('fo~o bar','~.', '~');
-- returns 'fo~ bar'
--If you want to remove everything after the ~
select regexp_replace('fo~o bar','~.*', '');
--returns 'fo'
If you need to remove other specific character sets after a ~, you can probably do this with a slightly more complicated regex, but I'd need examples of your desired input/output to help with that.
EDIT for updated question
This regex replace should get what you need.
select regexp_replace('CK#123456~fndkjfgdjkg','CK#(\\d*)~.*', '\\1');
-- returns 123456
(\\d*) gets ANY number of digits in a row, and the \\1 causes it to replace the match with what was in the first set of parentheses, which is your list of digits. The CK# and ~.* are there to make sure the whole string gets matched and replaced.
If the CK# can vary as well, you can use .*? like this.
select regexp_replace('ABCD123HI#123456~fndkjfgdjkg','.*?#(\\d*)~.*', '\\1')
-- returns 123456
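For anyone who wants to experiment with these patterns outside Snowflake, here is a hedged Python equivalent. Python raw strings need only a single backslash where the Snowflake string literals above use two, and `digits_after_hash` is a hypothetical helper name.

```python
import re

# The three basic replacements from the start of the answer.
print(re.sub(r"~.", "", "fo~o bar"))    # drop the ~ and one character after it
print(re.sub(r"~.", "~", "fo~o bar"))   # keep the ~, drop one character after it
print(re.sub(r"~.*", "", "fo~o bar"))   # drop the ~ and everything after it

def digits_after_hash(value: str) -> str:
    """Replace the whole string with the digits captured between # and ~."""
    return re.sub(r".*?#(\d*)~.*", r"\1", value)

print(digits_after_hash("CK#123456~fndkjfgdjkg"))
print(digits_after_hash("ABCD123HI#123456~fndkjfgdjkg"))
```

If the pattern does not match at all, the input comes back unchanged, so a value without # or ~ is not reduced to digits.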
I'd probably do something like the following, easy enough but not as cool as RegEx type of functions.
set my_string='fooo~12345';
set search_for_me = '~';
SELECT SUBSTR($my_string, 1, DECODE(position($search_for_me, $my_string), 0, length($my_string), position($search_for_me, $my_string)));
I hope this helps...Rich
It looks like lookaheads and lookbehinds are not supported in the REGEXP functions, although they seem to work in the PATTERN clause of a LIST command. The Snowflake documentation makes no mention of lookaheads or lookbehinds either way.
In your example:
It seems that the query engine is looking for that repeating argument, where you are attempting a lookbehind
You have not specified what you wanted extracted. You have two capture groups, but in this scenario everything would be returned
Since you are looking to remove everything after ~ you have a delimiter, why not use it in your REGEXP_SUBSTR function?
Try the following:
SELECT $1,REGEXP_SUBSTR($1,'\\w+#(.+?)~',1,1,'is',1)
FROM VALUES
('CK#123456~fndkjfgdjkg')
,('QH#128fklj924~fndkjfgdjkg')
;
This looks for:
One or more word characters
Followed by #
Capturing one or more characters up to and not including ~
Returns the characters within the capture group
You can change the .+? to \\d+? to make sure the pattern is only digits. Backslashes must be escaped with a backslash.
The descriptions for each argument of the function can be found here:
https://docs.snowflake.net/manuals/sql-reference/functions/regexp_substr.html
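As a rough cross-check of the capture-group approach, the same pattern can be run in Python. The 'is' parameters are approximated here with `re.IGNORECASE` and `re.DOTALL`, which is an assumption about how they map, and `check_number` is a hypothetical name.

```python
import re

# One or more word characters, a #, then a lazy capture up to the first ~.
pattern = re.compile(r"\w+#(.+?)~", re.IGNORECASE | re.DOTALL)

def check_number(src_txt: str):
    """Return the text between # and the first ~, or None without a match."""
    m = pattern.search(src_txt)
    return m.group(1) if m else None

for value in ["CK#123456~fndkjfgdjkg", "QH#128fklj924~fndkjfgdjkg"]:
    print(value, "->", check_number(value))
```

Swapping `.+?` for `\d+?` would make the second value return None, since the text between # and ~ is not all digits.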
You could check this!!
select substr('CK#123456~fndkjfgdjkg',4,6) from dual;
OUTPUT
123456
https://docs.snowflake.net/manuals/sql-reference/functions/substr.html

Find phone numbers with unexpected characters using SQL in Oracle?

I need to find rows where the phone number field contains unexpected characters.
Most of the values in this field look like:
123456-7890
This is expected. However, we are also seeing character values in this field such as * and #.
I want to find all rows where these unexpected character values exist.
Expected:
Numbers are expected
Hyphen with numbers is expected (hyphen alone is not)
NULL is expected
Empty is expected
Tried this:
WHERE phone_num is not like ' %[0-9,-,' ' ]%
Still getting rows where phone has numbers.
At https://regexr.com/3c53v you can edit the regex to match your needs.
I am going to use example regex for this purpose
select * from Table1
Where NOT REGEXP_LIKE(PhoneNumberColumn, '^[+]*[(]{0,1}[0-9]{1,4}[)]{0,1}[-\s\./0-9]*$')
You can use translate()
...
WHERE translate(Phone_Number,'a1234567890-', 'a') is NOT NULL
This will strip out all valid characters leaving behind the invalid ones. If all the characters are valid, the result would be NULL. This does not validate the format, for that you'd need to use REGEXP_LIKE or something similar.
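The strip-the-valid-characters idea is easy to try in Python as well. This sketch assumes the same set of valid characters (digits and hyphen) and maps an empty result to None to imitate Oracle treating an empty string as NULL; `unexpected_chars` is a made-up helper name.

```python
VALID_CHARS = "1234567890-"
# Translation table that deletes every valid character.
_strip_valid = str.maketrans("", "", VALID_CHARS)

def unexpected_chars(phone):
    """Return the characters left after removing valid ones, or None if none remain."""
    if phone is None:
        return None
    leftover = phone.translate(_strip_valid)
    # Oracle treats an empty string as NULL, so mirror that with None.
    return leftover or None

for phone in ["123456-7890", "123*456#", "", None]:
    print(repr(phone), "->", repr(unexpected_chars(phone)))
```

Rows to flag are the ones where the helper returns something other than None.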
You can use regexp_like().
...
WHERE regexp_like(phone_num, '[^ 0123456789-]|^-|-$')
[^ 0123456789-] matches any character that is not a space nor a digit nor a hyphen. ^- matches a hyphen at the beginning and -$ on the end of the string. The pipes are "ors" i.e. a|b matches if pattern a matches or if pattern b matches.
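Here is a quick way to see which values this pattern flags, sketched in Python on the assumption that its `re` module handles this bracket expression and the anchors the same way Oracle does. The phone values are invented, and NULL or empty input is treated as expected, per the question.

```python
import re

# Any character other than space, digit, or hyphen, or a leading or trailing hyphen.
bad_phone = re.compile(r"[^ 0123456789-]|^-|-$")

def is_unexpected(phone) -> bool:
    """True when the value should show up in the 'unexpected characters' report."""
    if phone is None or phone == "":
        return False
    return bad_phone.search(phone) is not None

for phone in ["123456-7890", "123*4567", "-1234", "1234-", "-", None]:
    print(repr(phone), is_unexpected(phone))
```

A lone hyphen is flagged because it is both a leading and a trailing hyphen.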
Oracle has REGEXP_LIKE for regex compares:
WHERE REGEXP_LIKE(phone_num,'[^0-9''\-]')
If you're unfamiliar with regular expressions, there are plenty of good sites to help you build them. I like this one

regexp after a word appear

I'm using regexp to find the text that appears after a word.
Fiddle demo
The problem is that some addresses use different abbreviations for big house; some have a space, some have a dot:
Quinta
QTA
Qta.
I want all the text after any of those appears, ignoring case.
I tried this one but I'm not sure how to include multiple starting words
SELECT
REGEXP_SUBSTR ("Address", '[^QUINTA]+') "REGEXPR_SUBSTR"
FROM Address;
Solution:
I believe this will match the abbreviations you want:
SELECT
REGEXP_REPLACE("Address", '^.*Q(UIN)?TA\.? *|^.*', '', 1, 1, 'i')
"REGEXPR_SUBSTR"
FROM Address;
Demo in SQL fiddle
Explanation:
It tries to match everything from the beginning of the string:
until it finds Q + UIN (optional) + TA + . (optional) + any number of spaces.
if it doesn't find it, then it matches the whole string with ^.*.
Since I'm using REGEXP_REPLACE, it replaces the match with an empty string, thus removing all characters until "QTA", any of its alternations, or the whole string.
Notice the last parameter passed to REGEXP_REPLACE: 'i'. That is a flag that sets a case-insensitive match (flags described here).
The part you were interested in making optional uses a ( pattern ) that is a group with the ? quantifier (which makes it optional). Therefore, Q(UIN)?TA matches either "QUINTA" or "QTA".
Alternatively, in the scope of your question, if you wanted different options, you need to use alternation with a |. For example (pattern1|pattern2|etc) matches any one of the 3 options. Also, the regex (QUINTA|QTA) matches exactly the same as Q(UIN)?TA
What was wrong with your pattern:
The construct you were trying ([^QUINTA]+) uses a character class, and it matches any character except Q, U, I, N, T or A, repeated 1 or more times. But it's applied to characters, not words. For example, [^QUINTA]+ matches the string "BCDEFGHJKLMOPRSVWXYZ" completely, and it fails to match "TIA".
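Both points can be checked with a short Python sketch. It assumes Python's `re` treats these patterns like Oracle does; the addresses are invented and `after_quinta` is a hypothetical helper name.

```python
import re

# Same alternation as in the REGEXP_REPLACE call, with the 'i' flag as re.IGNORECASE.
strip_prefix = re.compile(r"^.*Q(UIN)?TA\.? *|^.*", re.IGNORECASE)

def after_quinta(address: str) -> str:
    """Remove everything up to and including the abbreviation (first match only)."""
    return strip_prefix.sub("", address, count=1)

for address in [
    "Av. Bolivar Quinta Los Pinos",
    "Calle 5 QTA. La Rosa",
    "Calle 5 qta La Rosa",
    "Calle sin nombre",
]:
    print(repr(address), "->", repr(after_quinta(address)))

# The character class from the question works on single characters, not words.
print(re.fullmatch(r"[^QUINTA]+", "BCDEFGHJKLMOPRSVWXYZ") is not None)
print(re.search(r"[^QUINTA]+", "TIA") is None)
```

An address without any of the abbreviations comes back as an empty string, because the `^.*` alternative consumes the whole value.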

Escaping a single quote in Oracle regex query

This is really starting to hurt!
I'm attempting to write a query in Oracle developer using a regex condition
My objective is to find all last names that contain characters not commonly found in names (that is, anything other than letters, spaces, hyphens and single quotes)
i.e.
I need to find
J00ls
McDonald "Macca"
Smithy (Smith)
and NOT find
Smith
Mckenzie-Smith
El Hassan
O'Dowd
My present query is
select * from dm_name
WHERE regexp_like(last_name, '([^A-Za-z -])')
and batch_id = 'ATEST';
which excludes everything expected except the single quote. When it comes to adding the single quote character, the Oracle SQL Developer parser takes it as a literal.
I've tried:
\' -- but got a "missing right parenthesis" error
||chr(39)|| -- but the search returned nothing
'' -- negated the previous character in the matching group e.g. '([^A-Za-z -''])' made names with '-' return.
I'd appreciate anything you could offer.
Just double the single quote to escape your quote.
So
select *
from dm_name
where regexp_like(last_name, '[^A-Za-z ''-]')
and batch_id = 'ATEST'
See also this sqlfiddle. Note, I tried a similar query in SQL developer and that worked as well as the fiddle.
Note also, for this to work the - character has to be the last character in the group as otherwise it tries to find the group SPACE to ' rather than the character -.
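The effect of the hyphen's position can be seen with a small Python experiment. This assumes Python's `re` parses these two bracket expressions the way Oracle does; the names are the ones from the question. In a Python string there is no need to double the single quote.

```python
import re

# Hyphen last: it is a literal hyphen.
hyphen_last = re.compile(r"[^A-Za-z '-]")
# Hyphen between space and quote: it defines the range from ' ' to "'".
hyphen_as_range = re.compile(r"[^A-Za-z -']")

should_find = ["J00ls", 'McDonald "Macca"', "Smithy (Smith)"]
should_skip = ["Smith", "Mckenzie-Smith", "El Hassan", "O'Dowd"]

for name in should_find + should_skip:
    print(name, hyphen_last.search(name) is not None)

# With the range version, the hyphen is no longer an allowed character.
print(hyphen_as_range.search("Mckenzie-Smith") is not None)
```

The last line shows why names containing '-' were returned by the attempt in the question: the hyphen falls outside the range from space to single quote.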
The following works:
select *
from dm_name
WHERE regexp_like(last_name, '([^A-Za-z ''-])');
See this SQLFiddle.
Whether SQL Developer will like it or not is something I cannot attest to as I don't have that product installed.
Share and enjoy.