How can I negate this expression - regex-negation

I have absolutely no clue how to work with regex's. I am less than a beginner.
I want to find any invalid css names from a string, so I can exclude them. Looking online I found a way to select the valid names using this:
/-?[_a-zA-Z]+[_a-zA-Z0-9-]*/g
What I want to do is negate this expression, so that only '1999' is matched in this example input:
holding-page single 1999 contact id-12 contact single single

To "negate" an expression, turn it into a negative look ahead:
/(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*)\S+(?!\S)/g
See live demo.
What this does is match a complete term, but one that does not match your positive regex.
A "complete term" is matched using (?<!\S)\S+(?!\S), which is \S+ (one or more non-whitespace) wrapped in negative look arounds for not a non-whitespace to prevent matching part of a term.
Note that "not a non-whitespace" is not the same as "whitespace", because "not a non-whitespace" also matches the start and end of the input, so leading and trailing terms that are invalid will match too.
Your positive regex has been turned into a negative look ahead by enclosing it in (?!...).

Related

How can I use negative lookbehind to exclude fractions?

I have a list of measurements that need to be deconstructed into quantity (numeric) and unit (string). Things like
1 gal.
500lbs
none
2.25gal
4feet twine
2lbs regular and 2lbs lite
All was well and good using \d+(\.\d+)?, but now I have a fraction thrown into the mix:
3/4gal
I need to exclude the fraction from this search so that I can deal with it separately. I'm successfully excluding the numerator (3) by inserting a negative lookahead-- \d+(?!\/)(\.\d+)?, but I can't figure out how to exclude the denominator (4). I think I'm supposed to use a negative lookbehind but I can't figure out how. \d+(?!\/)(?<!\/)(\.\d+)? and \d+(?!\/)(\.\d+)?(?<!\/) still match the 4.
Thanks!
In a construct like this \d+(?!\/)(?<!\/)(\.\d+)? the lookbehind (?<!\/) is always true as the only thing you can match (not assert) before is a digit.
You might also exclude a / on the left of the digits part, and add the lookahead after matching the decimal part.
(?<!/)\d+(?:\.\d+)?(?!/)
Explanation
(?<!/) Negative lookbehind, assert directly to the left of the current postion is not /
\d+ Match 1+ digits
(?:\.\d+)? Match an optional . and 1+ digits
(?!/) Negative lookahead, assert directly to the right of the current position is not /
regex demo
You can match and skip all occurrences of [digits]/[digits] pattern:
\d+\/\d+(*SKIP)(*F)|\d+(?:\.\d+)?
See the regex demo.
The \d+\/\d+(*SKIP)(*F)| part matches one or more digits, /, one or more digits, and then (*SKIP)(*F) makes the regex engine fail the match and start searching for the next match from the failure position, so the 3/5-like substrings won't be able to mess with your output.

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression on hive.
We need to exclude the numbers with +37 or 0037 at the beginning of the record (it could be a false result on the regex like) and without letters or space.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.
To exclude numbers which start with +01 0r +001 or +0001 and having only digits without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d in Hive should be escaped using double-slash: \\+ and \\d.
The general question is, if you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern or if you want to describe a well-formed telephone number and include everything that matches the pattern.
Which way to go, depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
Your number can start with either + or 00: ^(\+|00)
It cannot be followed by a 37 which in regex can be expressed by the following set of alternatives:
a. It is followed first by a 3 then by anything but 7: 3[0-689]
b. It is followed first by anything but 3 then by any number: [0-24-9]\d
After that there is a sequence of numbers of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note: as leftjoin has pointed out: To use this regex in hive you might need to additionally escape the backslashes \ in the pattern.
You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
See the regex demo. Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.

regex not working correctly when the test is fine

For my database, I have a list of company numbers where some of them start with two letters. I have created a regex which should eliminate these from a query and according to my tests, it should. But when executed, the result still contains the numbers with letters.
Here is my regex, which I've tested on https://www.regexpal.com
([^A-Z+|a-z+].*)
I've tested it against numerous variations such as SC08093, ZC000191 and NI232312 which shouldn't match and don't in the tests, which is fine.
My sql query looks like;
SELECT companyNumber FROM company_data
WHERE companyNumber ~ '([^A-Z+|a-z+].*)' order by companyNumber desc
To summerise, strings like SC08093 should not match as they start with letters.
I've read through the documentation for postgres but I couldn't seem to find anything regarding this. I'm not sure what I'm missing here. Thanks.
The ~ '([^A-Z+|a-z+].*)' does not work because this is a [^A-Z+|a-z+].* regex matching operation that returns true even upon a partial match (regex matching operation does not require full string match, and thus the pattern can match anywhere in the string). [^A-Z+|a-z+].* matches a letter from A to Z, +,|or a letter fromatoz`, and then any amount of any zero or more chars, anywhere inside a string.
You may use
WHERE companyNumber NOT SIMILAR TO '[A-Za-z]{2}%'
See the online demo
Here, NOT SIMILAR TO returns the inverse result of the SIMILAR TO operation. This SIMILAR TO operator accepts patterns that are almost regex patterns, but are also like regular wildcard patterns. NOT SIMILAR TO '[A-Za-z]{2}%' means all records that start with two ASCII letters ([A-Za-z]{2}) and having anything after (%) are NOT returned and all others will be returned. Note that SIMILAR TO requires a full string match, same as LIKE.
Your pattern: [^A-Z+|a-z+].* means "a string where at least some characters are not A-Z" - to extend that to the whole string you would need to use an anchored regex as shown by S-Man (the group defined with (..) isn't really necessary btw)
I would probably use a regex that specifies want the valid pattern is and then use !~ instead.
where company !~ '^[0-9].*$'
^[0-9].*$ means "only consists of numbers" and the !~ means "does not match"
or
where not (company ~ '^[0-9].*$')
Not start with a letter could be done with
WHERE company ~ '^[^A-Za-z].*'
demo: db<>fiddle
The first ^ marks the beginning. The [^A-Za-z] says "no letter" (including small and capital letters).
Edit: Changed [A-z] into the more precise [A-Za-z] (Why is this regex allowing a caret?)

REGEXP_REPLACE explanation

Hi may i know what does the below query means?
REGEXP_REPLACE(number,'[^'' ''-/0-9:-#A-Z''[''-`a-z{-~]', 'xy') ext_number
part 1
In terms of explaining what the function function call is doing:
It is a function call to analyse an input string 'number' with a regex (2nd argument) and replace any parts of the string which match a specific string. As for the name after the parenthesis I am not sure, but the documentation for the function is here
part 2
Sorry to be writing a question within an answer here but I cannot respond in comments yet (not enough rep)
Does this regex work? Unless sql uses different syntax this would appear to be a non-functional regex. There are some red flags, e.g:
The entire regex is wrapped in square parenthesis, indicating a set of characters but seems to predominantly hold an expression
There is a range indicator between a single quote and a character (invalid range: if a dash was required in the match it should be escaped with a '\' (backslash))
One set of square brackets is never closed
After some minor tweaks this regex is valid syntax:
^'' ''\-\/0-9:-#A-Z''[''-a-z{-~]`, but does not match anything I can think of, it is important to know what string is being examined/what the context is for the program in order to identify what the regex might be attempting to do
It seems like it is meant to replaces all ASCII control characters in the column or variable number with xy.
[] encloses a class of characters. Any character in that class matches. [^] negates that, hence all characters match, that are not in the class.
- is a range operator, e.g. a-z means all characters from a to z, like abc...xyz.
It seams like characters enclosed in ' should be escaped (The second ' is to escape the ' in the string itself.) At least this would make some sense. (But for none of the DBMS I found having a regexp_replace() function (Postgres, Oracle, DB2, MariaDB, MySQL), I found something in the docs, that would indicate this escape mechanism. They all use \, but maybe I missed something? Unfortunately you didn't tag which DBMS you're actually using!)
Now if you take an ASCII table you'll see, that the ranges in the expression make up all printable characters (counting space as printable) in groups from space to /, 0 to 9, : to #, etc.. Actually it might have been shorter to express it as '' ''-~, space to ~.
Given the negation, all these don't match. The ones left are from NUL to US and DEL. These match and get replaced by xy one by one.

How to negate regex with lookarounds

I have a regex-expression
(?<=#)'|'(?=%)
It successfully matches any apostrophe that is placed around %# in this objective-c string
#"UPDATE RESTAURANTS SET CITY='%#', NAME='%#' ", city, #"Joy's Restaurant";
But I want the opposite thing, to match any apostrophe that is NOT around %# i.e. to only match the apostrophe in Joy's Restaurant in this example.
Any ideas how to do that?
Negative lookarounds are pretty straight forward. Use (?!…) for a negative lookahead and (?<!…) for a negative lookbehind. For example:
(?<!#)'(?!%)
Will match any apostrophe so long as it is not immediately preceded by a # and it is not followed by a %. Notice that you have to remove the alternation (|) as you want to make sure that both lookarounds are satisfied.
Use a Negative Lookbehind and Negative Lookahead instead.
(?<!#)'(?!%)
Live Demo
Alternatively you can use the alternation operator in context placing what you want to exclude on the left, ( saying throw this away, it's garbage ) and place what you want to match in a capturing group on the right side.
'%#'|(')
Live Demo