How can I use negative lookbehind to exclude fractions? - regex-lookarounds

I have a list of measurements that need to be deconstructed into quantity (numeric) and unit (string). Things like
1 gal.
500lbs
none
2.25gal
4feet twine
2lbs regular and 2lbs lite
All was well and good using \d+(\.\d+)?, but now I have a fraction thrown into the mix:
3/4gal
I need to exclude the fraction from this search so that I can deal with it separately. I'm successfully excluding the numerator (3) by inserting a negative lookahead-- \d+(?!\/)(\.\d+)?, but I can't figure out how to exclude the denominator (4). I think I'm supposed to use a negative lookbehind but I can't figure out how. \d+(?!\/)(?<!\/)(\.\d+)? and \d+(?!\/)(\.\d+)?(?<!\/) still match the 4.
Thanks!

In a construct like \d+(?!\/)(?<!\/)(\.\d+)? the lookbehind (?<!\/) always succeeds: it is evaluated right after \d+ has matched, so the character directly to its left is always a digit and never a /.
You might also exclude a / on the left of the digits part, and add the lookahead after matching the decimal part.
(?<!/)\d+(?:\.\d+)?(?!/)
Explanation
(?<!/) Negative lookbehind, assert directly to the left of the current position is not /
\d+ Match 1+ digits
(?:\.\d+)? Match an optional . and 1+ digits
(?!/) Negative lookahead, assert directly to the right of the current position is not /
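As a rough check, the pattern can be run against the sample measurements from the question. This sketch uses Python's built-in re module purely for illustration (the question does not name a regex flavor), and the names QUANTITY, SAMPLES and extract_quantities are made up for the example.

```python
import re

# Pattern from the answer: digits with an optional decimal part,
# not directly preceded or followed by "/".
QUANTITY = re.compile(r"(?<!/)\d+(?:\.\d+)?(?!/)")

# Sample inputs taken from the question.
SAMPLES = [
    "1 gal.", "500lbs", "none", "2.25gal",
    "4feet twine", "2lbs regular and 2lbs lite", "3/4gal",
]

def extract_quantities(text):
    """Return every non-fraction number found in text, as strings."""
    return QUANTITY.findall(text)

for sample in SAMPLES:
    print(f"{sample!r:32} -> {extract_quantities(sample)}")
```

One caveat: with multi-digit fractions such as 13/45, backtracking lets the pattern match the leading 1 and the trailing 5, so this form is only reliable when numerator and denominator are single digits.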

You can match and skip all occurrences of [digits]/[digits] pattern:
\d+\/\d+(*SKIP)(*F)|\d+(?:\.\d+)?
The \d+\/\d+(*SKIP)(*F)| part matches one or more digits, /, one or more digits, and then (*SKIP)(*F) makes the regex engine fail the match and start searching for the next match from the failure position, so the 3/5-like substrings won't be able to mess with your output.
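The (*SKIP)(*F) verbs are a PCRE feature and are not available in every engine. Python's built-in re module, for instance, does not support them. A similar effect can be sketched there with a different technique: match the fraction first in an alternation and capture only the wanted alternative. The names below are hypothetical.

```python
import re

# Left alternative consumes whole fractions and captures nothing;
# only the right alternative fills group 1.
NUMBER_OR_FRACTION = re.compile(r"\d+/\d+|(\d+(?:\.\d+)?)")

def extract_non_fractions(text):
    """Return numbers in text, skipping anything that is part of digits/digits."""
    return [
        m.group(1)
        for m in NUMBER_OR_FRACTION.finditer(text)
        if m.group(1) is not None
    ]

print(extract_non_fractions("3/4gal, 2.25gal, 13/45 cup, 500lbs"))
```

Because the fraction is consumed as a whole before the number alternative is tried, multi-digit fractions such as 13/45 are skipped as well.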

Related

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression on hive.
We need to exclude the numbers with +37 or 0037 at the beginning of the record (it could be a false result on the regex like) and without letters or space.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.
To exclude numbers which start with +01 or +001 or +0001 and contain only digits, without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d in Hive should be escaped using a double backslash: \\+ and \\d.
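To see which values the exclusion catches, the same regex can be tried outside Hive. This is a sketch in Python, assuming the doubled backslashes in the Hive string literal collapse to single ones before the regex engine sees them. The function name is_kept is made up for the example.

```python
import re

# The pattern as the regex engine sees it (single backslashes).
EXCLUDED_SHAPE = re.compile(r"^\+0{1,3}1\d+$")

def is_kept(tel_number):
    """Mimic `NOT rlike`: keep values that do not match the excluded shape."""
    return EXCLUDED_SHAPE.search(tel_number) is None

for number in ["+0112345", "+00112345", "+000112345", "+3712345", "+01 12345"]:
    print(number, is_kept(number))
```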
The general question is, if you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern or if you want to describe a well-formed telephone number and include everything that matches the pattern.
Which way to go, depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
1. Your number can start with either + or 00: ^(\+|00)
2. It cannot be followed by 37, which in regex can be expressed by the following set of alternatives:
   a. It is followed first by a 3, then by anything but 7: 3[0-689]
   b. It is followed first by anything but 3, then by any digit: [0-24-9]\d
3. After that there is a sequence of digits of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note: as leftjoin has pointed out: To use this regex in hive you might need to additionally escape the backslashes \ in the pattern.
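A quick way to try the combined pattern locally is a small Python sketch. It uses single backslashes, since the extra escaping is only needed inside Hive string literals, and the helper name is_well_formed is hypothetical.

```python
import re

# "+" or "00", then a two-digit group that is not "37",
# then at least one more digit up to the end of the string.
WELL_FORMED = re.compile(r"^(\+|00)(3[0-689]|[0-24-9]\d)\d+$")

def is_well_formed(tel_number):
    """True if the number has an international prefix other than +37/0037."""
    return WELL_FORMED.match(tel_number) is not None

for number in ["+4912345", "003912345", "+3712345", "003712345", "12345"]:
    print(number, is_well_formed(number))
```

Note that this pattern requires a + or 00 prefix, so a plain national number such as 12345 is rejected too.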
You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
See the regex demo. Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.
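The lookahead version can be exercised the same way. The sketch below uses Python's re module with single backslashes; the function name is_valid_number is an assumption for the example, returning True for an accepted number and False otherwise, as the question asks.

```python
import re

# Reject anything starting with +37 or 0037, then require an optional "+"
# followed by digits only.
VALID_NUMBER = re.compile(r"^(?!\+37|0037)\+?\d+$")

def is_valid_number(tel_number):
    """True for digit-only numbers that do not start with +37 or 0037."""
    return VALID_NUMBER.match(tel_number) is not None

for number in ["+3712345", "003712345", "+4912345", "12345", "12 345", "12a45"]:
    print(number, is_valid_number(number))
```

Unlike the prefix-based pattern, this one also accepts numbers with no international prefix at all.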

REGEX Extract Amount Without Currency

SELECT
ocr_text,
bucket,
REGEXP_EXTRACT('-?[0-9]+(\.[0-9]+)?', ocr_text)
FROM temp
I am trying to extract amounts from a string that will not have currency present. Any number that does not have decimals should not match. Commas should be allowed assuming they follow the correct rules (at hundreds marker)
56 no (missing decimals)
56.45 yes
120 no (missing decimals)
120.00 yes
1200.00 yes
1,200.00 yes
1,200 no (missing decimals)
1200 no (missing decimals)
134.5 no (decimal not followed by 2 digits)
23,00.00 no (invalid comma location)
I'm a noob to REGEX so I know my above statement already does not meet the criteria i've listed. However, i'm already stuck getting the error (INVALID_FUNCTION_ARGUMENT) premature end of char-class on my REGEX_EXTRACT line
Can someone point me in the right direction? How can I resolve my current issue? How can I modify to correctly incorporate the other criteria listed?
Here is a general regex pattern for a positive/negative number with two decimal places and optional thousands comma separators:
(?<!\S)(?:-?[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{2})|-?[0-9]+(\.[0-9]{2}))(?!\S)
Your updated query:
SELECT
ocr_text,
bucket,
REGEXP_EXTRACT(ocr_text, '(?<!\S)(?:-?[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{2})|-?[0-9]+(\.[0-9]{2}))(?!\S)')
FROM temp;
From the Presto docs I read, it supposedly supports Java's regex syntax. In the event that lookarounds are not working, you may try this version:
SELECT
ocr_text,
bucket,
REGEXP_EXTRACT(ocr_text, '(\s|^)(?:-?[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{2})|-?[0-9]+(\.[0-9]{2}))(\s|$)')
FROM temp;
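The lookaround pattern can be checked against the cases listed in the question before running it in the database. This sketch uses Python's re module as a stand-in for Presto's regex engine (an assumption: the two are not identical, though this pattern uses only constructs common to both). The names CASES and extract_amount are made up.

```python
import re

AMOUNT = re.compile(
    r"(?<!\S)(?:-?[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{2})|-?[0-9]+(\.[0-9]{2}))(?!\S)"
)

# Inputs from the question, mapped to whether they should match.
CASES = {
    "56": False, "56.45": True, "120": False, "120.00": True,
    "1200.00": True, "1,200.00": True, "1,200": False, "1200": False,
    "134.5": False, "23,00.00": False,
}

def extract_amount(ocr_text):
    """Return the first amount found in ocr_text, or None."""
    match = AMOUNT.search(ocr_text)
    return match.group(0) if match else None

for text, expected in CASES.items():
    found = extract_amount(text)
    print(f"{text:10} expected={expected!s:5} found={found}")
```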
REGEXP_EXTRACT(ocr_text, '^[-]?(\d*\.\d*)')
Pattern: ^[-]?(\d*\.\d*)
Explanation:
^ - Start of line
[-]? - With or without negative dash (-)
\d* - 0 or more digits
\. - a literal decimal point (escaped, because an unescaped . is a special character in regex that matches any character)
\d* - 0 or more digits (the decimal part);
$ - End of the line (not part of the pattern above; append it if the number must run to the end of the line).
Bonus tip: There are helpful tools online to test your regex!
The pattern below extracts numbers in general, but it catches every number; it does not work well when only numbers next to specific letters are wanted. Suggestions are welcome.
-?\d+\.?\d*
I have done work on NLP using Regex.

How can I negate this expression

I have absolutely no clue how to work with regex's. I am less than a beginner.
I want to find any invalid css names from a string, so I can exclude them. Looking online I found a way to select the valid names using this:
/-?[_a-zA-Z]+[_a-zA-Z0-9-]*/g
What I want to do is negate this expression, so that only '1999' is matched in this example input:
holding-page single 1999 contact id-12 contact single single
To "negate" an expression, turn it into a negative look ahead:
/(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*)\S+(?!\S)/g
What this does is match a complete term, but one that does not match your positive regex.
A "complete term" is matched using (?<!\S)\S+(?!\S), which is \S+ (one or more non-whitespace characters) wrapped in negative lookarounds that assert there is no non-whitespace character on either side, which prevents matching only part of a term.
Note that "not a non-whitespace" is not the same as "whitespace", because "not a non-whitespace" also matches the start and end of the input, so leading and trailing terms that are invalid will match too.
Your positive regex has been turned into a negative look ahead by enclosing it in (?!...).
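Applied to the example input from the question, the negated pattern can be tried like this. The sketch uses Python's re module, whose syntax for these lookarounds matches the JavaScript-style pattern above; the function name invalid_css_names is hypothetical.

```python
import re

# A whole whitespace-delimited term that does not start like a valid CSS name.
INVALID_TERM = re.compile(
    r"(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*)\S+(?!\S)"
)

def invalid_css_names(class_string):
    """Return the terms in class_string that the positive regex rejects."""
    return INVALID_TERM.findall(class_string)

sample = "holding-page single 1999 contact id-12 contact single single"
print(invalid_css_names(sample))
```

Keep in mind that the lookahead only checks how a term begins, so a term that starts like a valid name but contains an invalid character later on is not reported.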

How to negate regex with lookarounds

I have a regex-expression
(?<=#)'|'(?=%)
It successfully matches any apostrophe that is placed around %# in this objective-c string
#"UPDATE RESTAURANTS SET CITY='%#', NAME='%#' ", city, #"Joy's Restaurant";
But I want the opposite thing, to match any apostrophe that is NOT around %# i.e. to only match the apostrophe in Joy's Restaurant in this example.
Any ideas how to do that?
Negative lookarounds are pretty straightforward. Use (?!…) for a negative lookahead and (?<!…) for a negative lookbehind. For example:
(?<!#)'(?!%)
Will match any apostrophe so long as it is not immediately preceded by a # and it is not followed by a %. Notice that you have to remove the alternation (|) as you want to make sure that both lookarounds are satisfied.
Use a Negative Lookbehind and Negative Lookahead instead.
(?<!#)'(?!%)
Alternatively, you can use the alternation operator in context: place what you want to exclude on the left (in effect saying "throw this away, it's garbage") and place what you want to match in a capturing group on the right.
'%#'|(')
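Both approaches can be compared on the string from the question. The following is a sketch in Python's re module rather than in Objective-C, with made-up names, and it reports the position of each matched apostrophe.

```python
import re

TEXT = '''#"UPDATE RESTAURANTS SET CITY='%#', NAME='%#' ", city, #"Joy's Restaurant";'''

# Approach 1: both negative lookarounds must hold.
LOOKAROUNDS = re.compile(r"(?<!#)'(?!%)")

# Approach 2: consume the unwanted '%#' first, capture any other apostrophe.
ALTERNATION = re.compile(r"'%#'|(')")

def positions_with_lookarounds(text):
    """Positions of apostrophes not preceded by # and not followed by %."""
    return [m.start() for m in LOOKAROUNDS.finditer(text)]

def positions_with_alternation(text):
    """Positions of apostrophes captured by group 1 of the alternation."""
    return [m.start(1) for m in ALTERNATION.finditer(text) if m.group(1)]

print(positions_with_lookarounds(TEXT))
print(positions_with_alternation(TEXT))
```

On this input both functions report a single position, the apostrophe in Joy's.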

Regular expression to match specific variations of function

I am trying to construct a regular expression to find the text of the following variations.
NSLocalizedString(#"TEXT")
NSLocalizedStringFromTable(#"TEXT")
NSLocalizedStringWithDefaultValue(#"TEXT")
...
The goal is to extract TEXT. I have been able to construct a regex for each individual function or macro, e.g., (?<=NSLocalizedString)\(#"(.*?)". However, I am looking for a solution that does the job no matter what the name of the function as long as it starts with NSLocalizedString.
I assumed it was as simple as (?<=NSLocalizedString\w+)\(#"(.*?)", but that doesn't seem to do the trick.
How about this one?
/NSLocalizedString\w*\(#"(.*)"\)/
Explanation:
NSLocalizedString    'NSLocalizedString'
\w*                  word characters (a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible))
\(                   '('
#"                   '#"'
(                    group and capture to \1:
  .*                 any character except \n (0 or more times (matching the most amount possible))
)                    end of \1
"                    '"'
\)                   ')'
The only reason your regex doesn't work is because the regex engine doesn't support variable length lookbehinds. The (?<=NSLocalizedString\w+) is variable length so can't be used.
Firstly it needs to be \w* not \w+, to allow your first example string to match.
If you move the \w* outside the lookbehind (?<=NSLocalizedString)\w* it will work just fine.
Alternatively, since you have to use a capturing group to grab the text value anyway, there's no need for the lookbehind at all. Change the (?<= to a (?: and it becomes a non-capturing group (which can be variable length), and then just grab your text value from group 1.
Your attempt was:
(?<=NSLocalizedString\w+)\(#"(.*?)"
Both of these minor changes should make it work:
(?<=NSLocalizedString)\w*\(#"(.*?)"
(?:NSLocalizedString\w*)\(#"(.*?)"
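The two corrected patterns can be tried on the variations from the question. This sketch uses Python's re module, which happens to share the fixed-width lookbehind restriction, so it also shows the original attempt being rejected. All names are hypothetical.

```python
import re

SOURCE = '''
NSLocalizedString(#"TEXT")
NSLocalizedStringFromTable(#"TEXT")
NSLocalizedStringWithDefaultValue(#"TEXT")
'''

# Fixed-width lookbehind, with \w* moved outside it.
WITH_LOOKBEHIND = re.compile(r'(?<=NSLocalizedString)\w*\(#"(.*?)"')
# No lookbehind at all: a non-capturing group may be variable length.
WITH_GROUP = re.compile(r'(?:NSLocalizedString\w*)\(#"(.*?)"')

def extract_texts(pattern, source):
    """Return the contents of capture group 1 for every match."""
    return pattern.findall(source)

def variable_lookbehind_is_rejected():
    """True if this engine refuses the variable-length lookbehind."""
    try:
        re.compile(r'(?<=NSLocalizedString\w+)\(#"(.*?)"')
    except re.error:
        return True
    return False

print(extract_texts(WITH_LOOKBEHIND, SOURCE))
print(extract_texts(WITH_GROUP, SOURCE))
print(variable_lookbehind_is_rejected())
```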
The following is actually not supported in Objective-C:
The solution that will extract exactly TEXT without using any groups is:
NSLocalizedString\w*\(#"\K[^"]*
It avoids the need for a lookbehind (which can't be used for reasons I explain below) by using \K, which drops everything matched before it from the reported match.