regexp after a word appear - sql

Im using regexp to find the text after a word appear.
Fiddle demo
The problem is some address use different abreviations for big house: Some have space some have dot
Quinta
QTA
Qta.
I want all the text after any of those appear. Ignoring Case.
I try this one but not sure how include multiple start
SELECT
REGEXP_SUBSTR ("Address", '[^QUINTA]+') "REGEXPR_SUBSTR"
FROM Address;

Solution:
I believe this will match the abbreviations you want:
SELECT
REGEXP_REPLACE("Address", '^.*Q(UIN)?TA\.? *|^.*', '', 1, 1, 'i')
"REGEXPR_SUBSTR"
FROM Address;
Demo in SQL fiddle
Explanation:
It tries to match everything from the begging of the string:
until it finds Q + UIN (optional) + TA + . (optional) + any number of spaces.
if it doesn't find it, then it matches the whole string with ^.*.
Since I'm using REGEXP_REPLACE, it replaces the match with an empty string, thus removing all characters until "QTA", any of its alternations, or the whole string.
Notice the last parameter passed to REGEXP_REPLACE: 'i'. That is a flag that sets a case-insensitive match (flags described here).
The part you were interested in making optional uses a ( pattern ) that is a group with the ? quantifier (which makes it optional). Therefore, Q(UIN)?TA matches either "QUINTA" or "QTA".
Alternatively, in the scope of your question, if you wanted different options, you need to use alternation with a |. For example (pattern1|pattern2|etc) matches any one of the 3 options. Also, the regex (QUINTA|QTA) matches exactly the same as Q(UIN)?TA
What was wrong with your pattern:
The construct you were trying ([^QUINTA]+) uses a character class, and it matches any character except Q, U, I, N, T or A, repeated 1 or more times. But it's applied to characters, not words. For example, [^QUINTA]+ matches the string "BCDEFGHJKLMOPRSVWXYZ" completely, and it fails to match "TIA".

Related

Postgres Regular Expression Positive Lookbehind with Repetition

Working for the first time with Postgres flavor regular expressions and stumped on it's behavior here.
Input: 'xxxaaabc'
Expected: 'xxxaaazzzbc'
Logic: Match M a's, capture group, and add 'zzz' directly after the sequence
Attempts:
select regexp_replace('aaabc','(?!\Sa+)','zzz','i')
Result: zzzxxxaaabc
select regexp_replace('aaabc','(?!\Sa+)','zzz','i')
Result: xxxazzzaabc
select select regexp_replace('xxxaaabc','(?!=a+)','zzz','i')
Result: zzzxxxaaabc
This one seems to get pretty close regexp_replace('aaabc','(?!\Sa+)','zzz','i'), but the repetition of a's isn't working.
Note the (?!\Sa+) pattern matches an empty location that is not immediately followed with a single non-whitespace char and then one or more a chars.
The (?!=a+) pattern matches a location in a string, that is not immediately followed with = and then one or more as.
You can capture one or more as and then use a backreference to the match and then just append the zzz:
select regexp_replace('xxxaaabc','(a+)','\1zzz','i') AS Result;
See the DB fiddle.

How can I remove characters in a string after a specific special character (~) in snowflake sql?

I am using Snowflake SQL. I would like to remove characters from a string after a special character ~. How can I do that?
here is the whole scenario. Let me explain. I do have a string like 'CK#123456~fndkjfgdjkg'. Now, i want only the number after #.And not anything after ~. This is number length varies for that field value. It might be 1 or 5 or 3. And i want to add the condition in where class where this number is equal to check_num from other table after joining. I am trying REGEXP_SUBSTR(A.SRC_TXT, '(?<=CK#)(.+?\b)') = C.CHK_NUM in the where condition. I am getting the error as 'No repititive argument after ?'
You can use a regex for this
-- To remove just the character after a ~
select regexp_replace('fo~o bar','~.', '');
-- returns 'fo bar'
--If you want to keep the ~
select regexp_replace('fo~o bar','~.', '~');
-- returns 'fo~ bar'
--If you want to remove everything after the ~
select regexp_replace('fo~o bar','~.*', '');
--returns 'fo'
If you need to remove other specific character sets after a ~, you can probably do this with a slightly more complicated regex, but I'd need examples of your desired input/output to help with that.
EDIT for updated question
This regex replace should get what you need.
select regexp_replace('CK#123456~fndkjfgdjkg','CK#(\\d*)~.*', '\\1');
-- returns 123456
(\\d*) gets ANY number of digits in a row, and the \\1 causes it to replace the match with what was in the first set of parenthesis, which is your list of digits. the CK# and ~.* are there to make sure the whole string gets matched and replaced.
If the CK# can vary as well, you can use .*? like this.
select regexp_replace('ABCD123HI#123456~fndkjfgdjkg','.*?#(\\d*)~.*', '\\1')
-- returns 123456
I'd probably do something like the following, easy enough but not as cool as RegEx type of functions.
set my_string='fooo~12345';
set search_for_me = '~';
SELECT SUBSTR($my_string, 1, DECODE(position($search_for_me, $my_string), 0, length($my_string), position($search_for_me, $my_string)));
I hope this helps...Rich
It looks like lookahead and lookbehinds are not supported in REGEXP functions, they seem to work in the PATTERN clause of a LIST command. Snowflake documentation makes no mention either way of lookahead or lookbehinds.
In your example:
It seems that the query engine is looking for that repeating argument, where you are attempting a lookbehind
You have not specified what you wanted extracted. You have two capture groups, but in this scenario everything would be returned
Since you are looking to remove everything after ~ you have a delimiter, why not use it in your REGEXP_SUBSTR function?
Try the following:
SELECT $1,REGEXP_SUBSTR($1,'\\w+#(.+?)~',1,1,'is',1)
FROM VALUES
('CK#123456~fndkjfgdjkg')
,('QH#128fklj924~fndkjfgdjkg')
;
This looks for:
One or more word characters
Followed by #
Capturing one or more characters upto and not including ~
Returns the characters within the capture group
You can change the .+? to \\d+? to make sure the pattern is only digits. Backslashes must be escaped with a backslash.
The descriptions for each argument of the function can be found here:
https://docs.snowflake.net/manuals/sql-reference/functions/regexp_substr.html
You could check this!!
select substr('CK#123456~fndkjfgdjkg',4,6) from dual;
OUTPUT
123456
https://docs.snowflake.net/manuals/sql-reference/functions/substr.html

Regex not matching correct string

I am busy building a lookup table for specific names of merchants. I tried to make use of the following regex but it's returning less results than the standard "like" function in Netezza SQL. Please refer to below:
SQL Like function: where trim(upper(a.MRCH_NME)) like '%CNA %' -- returns 4622 matches
Regex function in Netezza SQL: where array_combine(regexp_extract_all(trim(upper(a.MRCH_NME)),'.*CNA\s','i'),'|') = 'CNA' -- returns 2226 matches
I looked at the two result sets and found that strings such as the following aren't matched:
!C CNA INT ARR
*CNA PLATZ 0400
015764 CNA CRAD
C#CNA PARK 0
I made use of the following regex expression: /.*CNA\s'/
Any idea why the above strings aren't being returned as matches?
Thank you.
You probably should be using regexp_like:
SELECT *
FROM yourTable
WHERE REGEXP_LIKE(MRCH_NME, 'CNA[ ]', 'i');
This would be logically identical to the following query using LIKE:
SELECT *
FROM yourTable
WHERE MRCH_NME LIKE '%CNA ';
It seems to me the problem is more with your code rather than the regex. Look: like '%CNA %' returns all entries that contain a CNA substring followed with a literal space anywhere inside the entry. The '.*CNA\s' regex matches any 0+ chars other than newline followed with CNA and **any whitespace char*.
Acc. to this reference, \s matches "a white space character. White space is defined as [\t\n\f\r\p{Z}].
Thus, you should in fact just use
WHERE REGEXP_LIKE(MRCH_NME, 'CNA ', 'i')
or, better with a word boundary check:
WHERE REGEXP_LIKE(MRCH_NME, '\bCNA\b', 'i')
where \b marks a transition from a word to non-word and non-word to word character, thus ensuring a whole word search and justifying the regex usage.
If you do not need to match the merchant name as a whole word, use the regular LIKE with '%CNA %', it should be more efficient.

How can I extract a substring from a character column without using SUBSTR()?

I have a questions regarding below data.
You clearly can see each EMP_IDENTIFIER has connected with EMP_ID.
So I need to pull only identifier which is 10 characters that will insert another column.
How would I do that?
I did some traditional way, using INSTR, SUBSTR.
I just want to know is there any other way to do it but not using INSTR, SUBSTR.
EMP_ID(VARCHAR2)EMP_IDENTIFIER(VARCHAR2)
62049 62049-2162400111
6394 6394-1368000222
64473 64473-1814702333
61598 61598-0876000444
57452 57452-0336503555
5842 5842-0000070666
75778 75778-0955501777
76021 76021-0546004888
76274 76274-0000454999
73910 73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the second argument of the function is 1); it looks for the second occurrence (the argument 2); it doesn't do any special matching such as ignore case (the argument "null" to the function), and when the match is found, return the fragment of the match pattern included in the second set of parentheses (the last argument, 2, to the regexp function). The second fragment is the \d+ - the sequence of digits, without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.

What is this Oracle regexp matching in this production code?

Here's the code that is in production:
dynamic_sql := q'[ with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]]' || q'[') AND
user_code not in ('A','E','I')
order by 1]';
Start at the beginning and search bizz_buzz
Match any one character that is NOT Z
Match any two characters that are not Y6
What's the ']' after the 6?
Then what?
I think that StackOverflow's formatting is causing some of the confusion in the answers. Oracle has a syntax for a string literal, q'[...]', which means that the ... portion is to be interpreted exactly as-is; so for instance it can include single quotes without having to escape each one individually.
But the code formatting here doesn't understand that syntax, so it is treating each single-quote as a string delimiter, which makes the result look different that how Oracle really sees it.
The expression is concatenating two such string literals together. (I'm not sure why - it looks like it would be possible to write this as a single string literal with no issues.) As pointed out in another answer/comment, the resulting SQL string is actually:
with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]') AND
user_code not in ('A','E','I')
order by 1
And also as pointed out in another answer, the [^Y6] portion of the regex matches a single character, not two. So this expression should simply match any string whose first character is not 'Z' and whose second character is neither 'Y' nor '6'.
When not in couples ] means... Well... Itself:
^[^Z][^Y6]]/
^ assert position at start of the string
[^Z] match a single character not present in the list below
Z the literal character Z (case sensitive)
[^Y6] match a single character not present in the list below
Y6 a single character in the list Y6 literally (case sensitive)
] matches the character ] literally
Start at the beginning and search bizz_buzz
Match any one character that is NOT Z
Match any two one characters that is not Y or 6
What's the ']' after the 6? it's a ]
I'm afraid I have to post this here as the comment section is inappropriate for the formatting required. After your edit above that shows the entire statement, I ran this to see what the string ends up being:
select q'[ with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]]' || q'[') AND
user_code not in ('A','E','I')
order by 1]' txt
from dual;
It ended up yielding this:
with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]') AND
user_code not in ('A','E','I')
order by 1
It is apparent now that the closing bracket and quote at the end of the regex belong to the first alternate quote string and not to the regex. This is concatenating 2 alternate quoted strings which is a tad confusing as it sure looked like part of the regex. If anything you are learning the importance of comments for the poor person behind you! Please comment this accordingly when you are done figuring this out. Even include a link to this post.