Regular Expression for alphanumeric and some special characters not adjacent - sql

I would like to have a regular expression to make an Oracle SQL REGEXP_LIKE query that checks
if a string starts with one alphanumeric character
if the string ends with one alphanumeric character
if the "body" of the string contains only alphanumeric character OR these authorized characters (written) : hyphen (dash), dot, apostrophe, space
if the authorised characters are NOT adjacent (to avoid something like "he--'''l..'-lo")
I started with this :
^[a-zA-Z0-9]+(a-zA-Z0-9\-\.'|([^\-\.'])\1)*[a-zA-Z0-9]$
I used backslash to escape assuming that dot and hyphen are metacharacters

I think this is what you want:
^[a-zA-Z0-9]+([-.' ][a-zA-Z0-9]|[a-zA-Z0-9])*\w?$
It looks for
at least 1 alphanumeric (alnum),
followed by
either an authorized character followed by an alphanumeric or just an alphanumeric, repeated any number of times (including 0).
optionally followed by
an alnum
This meets your specification. I'm not sure if starts with one alnum and ends with one alnum means that there must be at least 2 alnums, or if they can be the same. If there must be at least 2 of them, remove the last ? (which make the last alnum optional).
Regards

assuming you meant "authorised characters are NOT adjacent to each other"
try something along these lines
^[a-zA-Z0-9]+([a-zA-Z0-9]+[\-\.' ]?)*[a-zA-Z0-9]$
so that the repeating middle part always has one alphanumeric character followed by zero to one special characters.

Related

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression on hive.
We need to exclude the numbers with +37 or 0037 at the beginning of the record (it could be a false result on the regex like) and without letters or space.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.
To exclude numbers which start with +01 0r +001 or +0001 and having only digits without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d in Hive should be escaped using double-slash: \\+ and \\d.
The general question is, if you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern or if you want to describe a well-formed telephone number and include everything that matches the pattern.
Which way to go, depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
Your number can start with either + or 00: ^(\+|00)
It cannot be followed by a 37 which in regex can be expressed by the following set of alternatives:
a. It is followed first by a 3 then by anything but 7: 3[0-689]
b. It is followed first by anything but 3 then by any number: [0-24-9]\d
After that there is a sequence of numbers of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note: as leftjoin has pointed out: To use this regex in hive you might need to additionally escape the backslashes \ in the pattern.
You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
See the regex demo. Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.

Query to replace special characters in phone number field

Can anyone help with a query on how to replace special/non-numeric/hidden characters from a phone number column.
I've tried
LTRIM(RTRIM(REGEXP_REPLACE(
PHONE_NBR,
'[^[:digit:]][:cntrl:][:alpha:][:graph:][:blank:][:print:][:punct:][:space:]~',
'')))
but no luck, there are still a few records which contain non-numeric values.
Your regex is saying to ONLY replace a string consisting of: a non-numeric character followed by a control character, an alpha, a graph, a blank, a print, a punct, a space, and then a tilde.
You should be able to just use '[^[:digit:]]' as your regex, to remove all non-numeric characters.

Oracle REGEXP_LIKE doesn't work as expected

I was testing a regular expression in Oracle SQL and found something I could not understand:
-- NO MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof[^\s]*(\s|$)');
Above doesn't match, while the following matches:
-- MATCH
SELECT 1 FROM DUAL WHERE REGEXP_LIKE ('Professor Frank', '(^|\s)Prof\S*(\s|$)');
In other regex flavors, It will be like \bProf[^\s]*\b versus \bProf\S*\b and have similar results. Note: Oracle SQL regex does not have \b or word boundary.
Question: Why don't [^\s]* and \S* work the same way in Oracle SQL?
I notice if I remove the (\s|$) at the end, the first regex will match.
In Oracle regular expressions, \s is indeed the escape sequence for a space, but NOT in a matching character set (that is, [.....], or [^....] for excluding one character). In a matching character set, only two characters have a special meaning, - for ranges and ] for closing the set enumeration. They can't be escaped; if needed in the matching set, ] must always be the first character right after the opening [ (it is the ONLY position in which a closing ] stands for itself as a character, and does not denote the end of the matching set), and - must be first or last (best to leave it always to the end of the matching set) - anywhere else it is seen as a range marker. To include (or exclude, if using the [^.....] syntax) a space, just type an actual physical space in the matching set.
Edit: What I said above is not entirely right. There is another special character in a matching set, namely ^. If it is used in the first position, it means "match any character OTHER THAN." In any other position it stands for itself. For example, '[^^]' will match any single character OTHER THAN ^ (the first ^ has special meaning, the second stands in for itself). And, a closing bracket ] stands for itself if it is the second character in brackets, if the first character is ^ (with its SPECIAL meaning). That is, to match any single character OTHER THAN ], we can use the matching pattern '[^]]'.

Query to get records with special characters only

I am trying to write a query which will tell me if certain record is having only the special characters. e.g- "%^&%&^%&" will error however "%HH678*(*))" is fine (as it's having alphanumeric values as well. I have written following query however, it's working fine only for English alphabets and numbers, if column is having some other characters like mandarin then also it's not giving expected value.Any help is highly appreciated.
SELECT * FROM test WHERE REGEXP_LIKE(sampletext, '[^]^A-Z^a-z^0-9^[^.^{^}^ ]' );
You may try this,
regexp_like(text, '^[^A-Za-z0-9]+$')
This would match the text only if the input text contains special chars ie, only chars which are not of letters or digits.
To detect strings containing only characters other than unaccented alphabetic, numeric, or white space characters try this:
regexp_like(text,'^[^a-zA-Z0-9[:space:]]+$')
If you don't think punctuation characters are special than add [:punct:] to the class of characters to ignore.
If you are looking for a specific set of characters you can use a character class of just those characters of interest, for example some common accented characters (note the lack of a leading ^ in the character class []:
regexp_like(text,'^[àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$')

Which Unicode characters are "composing" characters (whose sole purpose is to add accent, tilda)?

This is related to
What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?
This is how I plan to do this:
Use http://msdn.microsoft.com/en-us/library/dd374126%28v=vs.85%29.aspx to turn the string into
KD form.
Basically it'll turn most variation such as superscript into the normal number. Also it decompose tilda and accent into 2 characters.
Next step would be to remove all characters whose sole purpose is tildaing or accenting character.
How do I know which characters are like that? Which characters are just "composing characters"
How do I find such characters? After I find those, how do I get rid of it? Should I scan character by character and remove all such "combining characters?"
For example:
Character from 300 to 362 can be gotten rid off.
Then what?
Combining characters are listed in UnicodeData.txt as having a nonzero Canonical_Combining_Class, and a General_Category of Mn (Mark, nonspacing).
For each character in the string, call GetUnicodeCategory and check the UnicodeCategory for NonSpacingMark, SpacingCombiningMark or EnclosingMark.
You may be able to do it more efficiently using regex, eg Regex.Replace(str, "\p{M}", "").