regexp for all accented characters in Oracle - sql

I am trying to find data that has accented characters. I've tried this:
select *
from xml_tmp
where regexp_like (XMLTYpe.getClobVal(xml_tmp.xml_data), unistr('\0090'))
And it works. It finds all records where the XML data field contains É. The problem is that it only matches the upper-case E with an accent. I tried to write a more generic query to find ALL data with accented vowels (a, e, i, o, u, upper and lowercase, with any accents) using equivalence classes. I wanted a regex to match only accented vowels, but I'm not sure how to get it, as equivalence classes such as [[=e=]] match all e's (with or without accents).
Also, this does not actually work:
select *
from xml_tmp
where regexp_like (XMLTYpe.getClobVal(xml_data),'É');
(using Oracle 10g)

How about
SELECT *
FROM xml_tmp
WHERE REGEXP_LIKE
( REGEXP_REPLACE
( XMLTYpe.getClobVal(xml_tmp.xml_data),
'[aeiouAEIOU]',
'-'
),
'[[=a=][=e=][=i=][=o=][=u=]]'
)
;
? That will eliminate any unaccented vowels before performing the REGEXP_LIKE.
(It's ugly, I know. But it should work.)
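The goal of that query (flag accented vowels while ignoring plain ones) can be illustrated outside the database. This is a minimal sketch in Python, which has no equivalence classes, so Unicode decomposition stands in for them here. The function name has_accented_vowel is hypothetical, and this is not how Oracle evaluates [[=e=]].

```python
import unicodedata

VOWELS = set("aeiou")

def has_accented_vowel(text: str) -> bool:
    """Return True if a vowel in text is followed by a combining mark."""
    # NFD splits U+00C9 into 'E' followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    for base, following in zip(decomposed, decomposed[1:]):
        if base.lower() in VOWELS and unicodedata.combining(following):
            return True
    return False

print(has_accented_vowel("CAF\u00c9"))     # accented E present
print(has_accented_vowel("CAFE"))          # plain vowels only
print(has_accented_vowel("fa\u00e7ade"))   # c with cedilla is accented but not a vowel
```

Letters that have no decomposition (for example ø) are not detected by this approach.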

After some more experimenting, I have found that this seems to work ok:
select *
from xml_tmp
where regexp_like(XMLTYpe.getClobVal(xml_data),'[^[:graph:][:space:]]')
I had thought that [:graph:] would include all upper and lower case characters, with or without accents, but it seems that it only matches unaccented characters.
Further experimentation shows that this might not work in all cases. Try these queries:
select *
from dual
where regexp_like (unistr('\0090'),'[^[:graph:][:space:]]');
DUMMY
-------
X
(the match succeeded)
So it looks like the character that's been causing me trouble matches this pattern.
select *
from dual
where regexp_like ('É','[^[:graph:][:space:]]');
DUMMY
-------
(the match failed)
When I try to run this query with the accented E as copied-and-pasted, the match fails! I guess whatever I copied-and-pasted is actually different. Ugh, I think I now hate working with changing character encodings.
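One possible explanation, offered as an assumption rather than a diagnosis of this data: U+0090 is a control character, while É is U+00C9. In the legacy DOS code pages CP437 and CP850 the byte 0x90 is É, so text written under such a code page and later read as Latin-1 would end up storing É as U+0090. A short Python sketch of that mix-up:

```python
import unicodedata

pasted = "\u00c9"   # the real É, as typed or pasted into a query
stored = "\u0090"   # the character that unistr('\0090') produces

print(unicodedata.name(pasted))       # LATIN CAPITAL LETTER E WITH ACUTE
print(unicodedata.category(stored))   # Cc (a control character)

# Assumed scenario: the byte 0x90 was written under CP437 but decoded as Latin-1.
raw = b"\x90"
print(raw.decode("cp437") == pasted)     # True
print(raw.decode("latin-1") == stored)   # True
```

If that is what happened, the control character would match [^[:graph:][:space:]] while a pasted É would not, which is consistent with the two results above.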

Related

Regexp_Like to Validate Uppercase Characters [A-Z] and Numbers [0-9] Only

I would like a query using regexp_like within Oracle's SQL which only validates uppercase characters [A-Z] and numbers [0-9]
SELECT *
FROM dual
WHERE REGEXP_LIKE('AAAA1111', '[A-Z, 0-9]')
The SELECT statement should probably look like this:
SELECT 'Yes' as MATCHING
FROM dual
WHERE REGEXP_LIKE ('AAAA1111', '^[A-Z0-9]+$')
This means that every character, from the start of the string (^) to its end ($), must be an uppercase letter or a digit. Important: there is no comma or space between Z and 0, because inside brackets those would be treated as allowed characters. The + stands for one or more characters.
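The difference between the unanchored pattern from the question and the anchored one can be checked quickly. The sketch below uses Python's re module purely as a stand-in for illustration; it is not Oracle's regex engine, and the sample values are made up.

```python
import re

loose = re.compile(r"[A-Z, 0-9]")     # one matching character anywhere is enough
strict = re.compile(r"^[A-Z0-9]+$")   # every character must be A-Z or 0-9

for value in ["AAAA1111", "aaaa 1111", "AAAA-1111", ""]:
    # Columns: value, loose result, strict result
    print(repr(value), bool(loose.search(value)), bool(strict.search(value)))
```

Only 'AAAA1111' passes the strict pattern, while the loose pattern accepts every non-empty sample because each contains at least one allowed character.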
Edit: Based on the answer of Barbaros another way of selecting would be possible
SELECT 'Yes' as MATCHING
FROM DUAL
WHERE regexp_like('AAAA1111','^[[:digit:][:upper:]]+$')
Edit: added a DBFiddle
A quick help may be found here and for oracle regular expressions here.
You can use :
select str as "Result String"
from tab
where not regexp_like(str,'[[:lower:] ]')
and regexp_like(str,'[[:alnum:]]')
Here, not regexp_like with the POSIX pattern [[:lower:] ] rejects any string that contains a lowercase letter or a space (the space inside the brackets is part of the character class),
and regexp_like with the POSIX pattern [[:alnum:]] requires the string to contain at least one letter or digit.
Note that the second condition alone does not reject symbols: a value such as 'A#1' still passes both tests, so use an anchored pattern like ^[[:digit:][:upper:]]+$ if symbols must be excluded.
Demo

Find phone numbers with unexpected characters using SQL in Oracle?

I need to find rows where the phone number field contains unexpected characters.
Most of the values in this field look like:
123456-7890
This is expected. However, we are also seeing character values in this field such as * and #.
I want to find all rows where these unexpected character values exist.
Expected:
Numbers are expected
Hyphen with numbers is expected (hyphen alone is not)
NULL is expected
Empty is expected
Tried this:
WHERE phone_num is not like ' %[0-9,-,' ' ]%
Still getting rows where phone has numbers.
You can edit the regex at https://regexr.com/3c53v to match your needs.
I am going to use the example regex from that page for this purpose:
select * from Table1
Where NOT REGEXP_LIKE(PhoneNumberColumn, '^[+]*[(]{0,1}[0-9]{1,4}[)]{0,1}[-\s\./0-9]*$')
You can use translate()
...
WHERE translate(Phone_Number,'a1234567890-', 'a') is NOT NULL
This will strip out all valid characters leaving behind the invalid ones. If all the characters are valid, the result would be NULL. This does not validate the format, for that you'd need to use REGEXP_LIKE or something similar.
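The same strip-the-valid-characters idea, sketched in Python for illustration only. The helper name leftover is hypothetical. Two Oracle details behind the SQL version: an empty string is treated as NULL, which is why the query tests IS NOT NULL, and the extra 'a' is needed because TRANSLATE returns NULL when its third argument is empty.

```python
VALID = "1234567890-"
strip_valid = str.maketrans("", "", VALID)   # delete every valid character

def leftover(phone: str) -> str:
    """Return the characters that remain after deleting every valid one."""
    return phone.translate(strip_valid)

for phone in ["123456-7890", "12345*-7890", "#1234"]:
    # A non-empty result corresponds to "IS NOT NULL" in the Oracle query.
    print(phone, "->", repr(leftover(phone)))
```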
You can use regexp_like().
...
WHERE regexp_like(phone_num, '[^ 0123456789-]|^-|-$')
[^ 0123456789-] matches any character that is not a space, a digit, or a hyphen. ^- matches a hyphen at the beginning of the string and -$ matches one at the end. The pipes are "ors", i.e. a|b matches if pattern a matches or if pattern b matches.
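To see which values that pattern flags, here is a small sketch using Python's re module as a stand-in (not Oracle's engine); the sample values are invented for illustration.

```python
import re

unexpected = re.compile(r"[^ 0123456789-]|^-|-$")

samples = ["123456-7890", "123 456", "-1234567", "1234567-", "12*4567", "#123"]
for s in samples:
    # True means the row would be reported as containing something unexpected.
    print(repr(s), bool(unexpected.search(s)))
```

The first two samples are accepted; the rest are flagged, either for a leading or trailing hyphen or for a character outside the allowed set.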
Oracle has REGEXP_LIKE for regex compares:
WHERE REGEXP_LIKE(phone_num,'[^0-9''\-]')
If you're unfamiliar with regular expressions, there are plenty of good sites to help you build them. I like this one

Regex not matching correct string

I am busy building a lookup table for specific names of merchants. I tried to make use of the following regex but it's returning less results than the standard "like" function in Netezza SQL. Please refer to below:
SQL Like function: where trim(upper(a.MRCH_NME)) like '%CNA %' -- returns 4622 matches
Regex function in Netezza SQL: where array_combine(regexp_extract_all(trim(upper(a.MRCH_NME)),'.*CNA\s','i'),'|') = 'CNA' -- returns 2226 matches
I looked at the two result sets and found that strings such as the following aren't matched:
!C CNA INT ARR
*CNA PLATZ 0400
015764 CNA CRAD
C#CNA PARK 0
I made use of the following regex expression: /.*CNA\s'/
Any idea why the above strings aren't being returned as matches?
Thank you.
You probably should be using regexp_like:
SELECT *
FROM yourTable
WHERE REGEXP_LIKE(MRCH_NME, 'CNA[ ]', 'i');
This would be logically equivalent to the following query using LIKE (apart from the case-insensitive 'i' flag):
SELECT *
FROM yourTable
WHERE MRCH_NME LIKE '%CNA %';
It seems to me the problem is more with your code than with the regex. Look: like '%CNA %' returns all entries that contain a CNA substring followed by a literal space anywhere inside the entry. The '.*CNA\s' regex matches any 0+ chars other than newline, followed by CNA and any whitespace char.
According to this reference, \s matches "a white space character. White space is defined as [\t\n\f\r\p{Z}]."
Thus, you should in fact just use
WHERE REGEXP_LIKE(MRCH_NME, 'CNA ', 'i')
or, better with a word boundary check:
WHERE REGEXP_LIKE(MRCH_NME, '\bCNA\b', 'i')
where \b marks a transition from a word to non-word and non-word to word character, thus ensuring a whole word search and justifying the regex usage.
If you do not need to match the merchant name as a whole word, use the regular LIKE with '%CNA %', it should be more efficient.
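The three conditions discussed above (a literal space, any whitespace, and a whole-word match) can be compared side by side. This sketch uses Python's re module as a stand-in, not Netezza's regex functions. The first four names come from the question; the last two are made up to show where the conditions differ.

```python
import re

names = [
    "!C CNA INT ARR", "*CNA PLATZ 0400", "015764 CNA CRAD", "C#CNA PARK 0",
    "MCNA SHOP",   # hypothetical: CNA is only the tail of a longer word
    "CNA\tTAB",    # hypothetical: CNA followed by a tab, not a space
]

def checks(name: str) -> tuple:
    like = "CNA " in name.upper()                              # LIKE '%CNA %'
    space = bool(re.search(r"CNA\s", name, re.IGNORECASE))     # any whitespace
    word = bool(re.search(r"\bCNA\b", name, re.IGNORECASE))    # whole word
    return like, space, word

for name in names:
    print(repr(name), checks(name))
```

All four names from the question satisfy every condition, which supports the point that the pattern itself was not what excluded them.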

What is this Oracle regexp matching in this production code?

Here's the code that is in production:
dynamic_sql := q'[ with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]]' || q'[') AND
user_code not in ('A','E','I')
order by 1]';
Start at the beginning and search bizz_buzz
Match any one character that is NOT Z
Match any two characters that are not Y6
What's the ']' after the 6?
Then what?
I think that StackOverflow's formatting is causing some of the confusion in the answers. Oracle has a syntax for a string literal, q'[...]', which means that the ... portion is to be interpreted exactly as-is; so for instance it can include single quotes without having to escape each one individually.
But the code formatting here doesn't understand that syntax, so it is treating each single-quote as a string delimiter, which makes the result look different than how Oracle really sees it.
The expression is concatenating two such string literals together. (I'm not sure why - it looks like it would be possible to write this as a single string literal with no issues.) As pointed out in another answer/comment, the resulting SQL string is actually:
with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]') AND
user_code not in ('A','E','I')
order by 1
And also as pointed out in another answer, the [^Y6] portion of the regex matches a single character, not two. So this expression should simply match any string whose first character is not 'Z' and whose second character is neither 'Y' nor '6'.
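A quick way to confirm that reading is to try the pattern on a few two-character strings. The sketch below uses Python's re module as a stand-in for Oracle's engine; the test values are invented. It also shows what the pattern would require if the trailing ] really were part of the regex.

```python
import re

pattern = re.compile(r"^[^Z][^Y6]")        # the regex Oracle actually receives
with_bracket = re.compile(r"^[^Z][^Y6]]")  # what it would be if ] belonged to it

for value in ["AB", "ZB", "AY", "A6", "A"]:
    print(value, bool(pattern.search(value)))

# With the extra ], a literal ] is required as the third character.
print(bool(with_bracket.search("AB]")), bool(with_bracket.search("AB")))
```

Only 'AB' matches the first pattern: 'ZB' fails on the first character, 'AY' and 'A6' fail on the second, and 'A' is too short.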
When it is not paired with an opening bracket, ] means... Well... Itself:
^[^Z][^Y6]]
^ assert position at start of the string
[^Z] match a single character not present in the list below
Z the literal character Z (case sensitive)
[^Y6] match a single character not present in the list below
Y6 a single character in the list Y6 literally (case sensitive)
] matches the character ] literally
Start at the beginning and search bizz_buzz
Match any one character that is NOT Z
Match any one character that is not Y or 6
What's the ']' after the 6? it's a ]
I'm afraid I have to post this here as the comment section is inappropriate for the formatting required. After your edit above that shows the entire statement, I ran this to see what the string ends up being:
select q'[ with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]]' || q'[') AND
user_code not in ('A','E','I')
order by 1]' txt
from dual;
It ended up yielding this:
with cte as
select user_id,
user_name
from user_table
where regexp_like (bizz_buzz,'^[^Z][^Y6]') AND
user_code not in ('A','E','I')
order by 1
It is apparent now that the closing bracket and quote at the end of the regex belong to the first alternate quote string and not to the regex. This is concatenating 2 alternate quoted strings which is a tad confusing as it sure looked like part of the regex. If anything you are learning the importance of comments for the poor person behind you! Please comment this accordingly when you are done figuring this out. Even include a link to this post.

Regular Expression "a{2}" not working

I have a record with emp_name = "Rajat" and it is not getting returned.
My query is -
SELECT * FROM employees WHERE emp_name regexp "a{2}"
Please explain why it is not working
Your regex searches for aa
https://www.regex101.com/r/yA1qA2/1
What you need is:
a.*a
https://www.regex101.com/r/hA5yS8/1
a{2} means two consecutive as. Your string doesn't match it.
To match a string with two a characters that might not be consecutive characters you should use a.*a.
a{2} will match two straight a characters, i.e.: aa.
To match 2 non-consecutive a characters you can use:
select * from employees where emp_name REGEXP "^.*a.*a.*$";
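The difference is easy to verify on the name from the question. This sketch uses Python's re module as a stand-in for MySQL's REGEXP; 'Baal' is a made-up value added for contrast.

```python
import re

name = "Rajat"

print(bool(re.search(r"a{2}", name)))     # False: no two consecutive a's
print(bool(re.search(r"a.*a", name)))     # True: 'aja' has two a's with 'j' between
print(bool(re.search(r"a{2}", "Baal")))   # True: 'aa' appears in 'Baal'
```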
I think all you need is LIKE with %:
LIKE - Simple pattern matching
Character   Description
%           Matches any number of characters, even zero characters
_           Matches exactly one character
LIKE pattern match... succeeds only if the pattern matches the entire value
So, you can use a more specific
select * from employees where emp_name like '%a_a%';
Or, a more "generic" one (allowing more than 1 character between a and a):
select * from employees where emp_name like '%a%a%';
However, SQL patterns in MySQL are case-insensitive by default, so you might have to use REGEXP with BINARY to narrow down your search results:
Prior to MySQL 3.23.4, REGEXP is case sensitive.
From MySQL 3.23.4 on, if you really want to force a REGEXP comparison
to be case sensitive, use the BINARY keyword to make one of the
strings a binary string.
SELECT * FROM employees WHERE emp_name REGEXP BINARY 'a.*a';