REGEX Extract Amount Without Currency - sql

SELECT
ocr_text,
bucket,
REGEXP_EXTRACT('-?[0-9]+(\.[0-9]+)?', ocr_text)
FROM temp
I am trying to extract amounts from a string that will not contain a currency symbol. Any number without decimals should not match. Commas should be allowed as long as they follow the correct rules (as thousands separators).
56 no (missing decimals)
56.45 yes
120 no (missing decimals)
120.00 yes
1200.00 yes
1,200.00 yes
1,200 no (missing decimals)
1200 no (missing decimals)
134.5 no (decimal not followed by 2 digits)
23,00.00 no (invalid comma location)
I'm new to regex, so I know my statement above does not yet meet the criteria I've listed. However, I'm already stuck on the error (INVALID_FUNCTION_ARGUMENT) premature end of char-class on my REGEXP_EXTRACT line.
Can someone point me in the right direction? How can I resolve my current issue? How can I modify the pattern to correctly incorporate the other criteria listed?

Here is a general regex pattern for a positive/negative number with two decimal places and optional thousands comma separators:
(?<!\S)(?:-?[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{2})|-?[0-9]+(\.[0-9]{2}))(?!\S)
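Your updated query (note that REGEXP_EXTRACT takes the string first and the pattern second; your original call had them swapped, so the contents of ocr_text were being parsed as the pattern, which would explain the char-class error):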
SELECT
ocr_text,
bucket,
REGEXP_EXTRACT(ocr_text, '(?<!\S)(?:-?[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{2})|-?[0-9]+(\.[0-9]{2}))(?!\S)')
FROM temp;
From the Presto docs I read, it supposedly supports Java's regex syntax. In the event that lookarounds are not working, you may try this version, which matches the surrounding whitespace explicitly and returns only capture group 1 (the amount itself):
SELECT
ocr_text,
bucket,
REGEXP_EXTRACT(ocr_text, '(?:\s|^)(-?[0-9]{1,3}(?:,[0-9]{3})*\.[0-9]{2}|-?[0-9]+\.[0-9]{2})(?:\s|$)', 1)
FROM temp;
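The lookaround pattern can be sanity-checked outside Presto against the examples from the question. This is a minimal sketch in Python, assuming Python's re module treats this pattern the same way Presto's engine does (it only uses features common to both); the helper name extract_amount is hypothetical.

```python
import re

# Same pattern as in the answer above; only the inner groups were made
# non-capturing, which does not change what is matched.
AMOUNT = re.compile(
    r"(?<!\S)(?:-?[0-9]{1,3}(?:,[0-9]{3})*\.[0-9]{2}|-?[0-9]+\.[0-9]{2})(?!\S)"
)

def extract_amount(text):
    """Return the first well-formed amount in text, or None if there is none."""
    match = AMOUNT.search(text)
    return match.group(0) if match else None

# Examples taken from the question.
examples = ["56", "56.45", "120", "120.00", "1200.00", "1,200.00",
            "1,200", "1200", "134.5", "23,00.00"]
for value in examples:
    print(f"{value!r:12} -> {extract_amount(value)!r}")
```

Only the values marked "yes" in the question should print a match; the rest should print None.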

REGEXP_EXTRACT(ocr_text, '^[-]?(\d*\.\d*)$')
Pattern: ^[-]?(\d*\.\d*)$
Explanation:
^ - Start of line
[-]? - With or without negative dash (-)
\d* - 0 or more digits
\. - a decimal point (escaped, because an unescaped dot in regex is a special character that matches any character)
\d* - 0 or more digits (the decimal part)
$ - End of the line.
Bonus tip: There are helpful tools online to test your regex!

The pattern below extracts numbers, but it catches every number; it does not work well when the numbers sit next to certain letters. Suggestions are welcome.
-?\d+\.?\d*
I have used regex for NLP work.

Related

regex decimal with negative

With the help of this thread I have tweaked a regex for my use:
Decimal number regular expression, where digit after decimal is optional
So far I have this /^-?[1-9]$|^,\d+$|^0,\d$|^[1-9]\d*,\d*$
It works for
-12
0
1
It does not work for
-1,
-1,1
what I want is:
-1,1
Could a kind soul explain what I am missing or doing wrong with this regex, please? Thank you very much!
I am tweaking it at https://regexr.com/6sjkh, but it's been 2 hours and I still don't know what I am missing. I don't know if the language is important, but I am using Visual Basic.
The thread you are quoting in your question already has some good answers. However, if you do not need to capture a leading +, do not care about numbers without a leading zero before the comma, and do not need thousands separators, I would recommend the following:
/^-?\d+(,\d+)?/
The ^ indicates the beginning, so no other text is allowed in front
The -? allows an optional negative sign (outside a character class, - is not a special character and does not need to be escaped)
\d+ expects 1 or more digits
The last part in parentheses covers the fractional part, which as a whole can occur 0 or 1 times due to the question mark at the end
, is the decimal separator (if you want dots instead, put \. here, as an unescaped dot is the special character for any character and must be escaped to match a literal dot)
\d+ is again one or more digits after the comma
I hope this helps a bit. Besides this, I can also recommend using an online regular expression tool like https://regex101.com/
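To see how the recommended pattern behaves on the inputs from the question, here is a minimal sketch in Python rather than Visual Basic; for a pattern this simple the two engines should agree, but that is an assumption. The trailing $ is also an addition that is not in the answer: without it, an input such as -1, would still match on its leading -1. The helper name is_decimal is hypothetical.

```python
import re

# The answer's pattern, with "$" added (an assumption) so that trailing
# text such as a dangling comma is rejected.
DECIMAL = re.compile(r"^-?\d+(,\d+)?$")

def is_decimal(text):
    """True if text is an optionally negative number with an optional ,fraction."""
    return DECIMAL.match(text) is not None

for value in ["-12", "0", "1", "-1,1", "-1,", ",5", "1.5"]:
    print(f"{value!r:8} -> {is_decimal(value)}")
```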

How can I use negative lookbehind to exclude fractions?

I have a list of measurements that need to be deconstructed into quantity (numeric) and unit (string). Things like
1 gal.
500lbs
none
2.25gal
4feet twine
2lbs regular and 2lbs lite
All was well and good using \d+(\.\d+)?, but now I have a fraction thrown into the mix:
3/4gal
I need to exclude the fraction from this search so that I can deal with it separately. I'm successfully excluding the numerator (3) by inserting a negative lookahead-- \d+(?!\/)(\.\d+)?, but I can't figure out how to exclude the denominator (4). I think I'm supposed to use a negative lookbehind but I can't figure out how. \d+(?!\/)(?<!\/)(\.\d+)? and \d+(?!\/)(\.\d+)?(?<!\/) still match the 4.
Thanks!
In a pattern like \d+(?!\/)(?<!\/)(\.\d+)?, the lookbehind (?<!\/) always succeeds: it is checked right after \d+ has matched, so the character directly to its left is always a digit, never a /.
Instead, put the lookbehind before the digits to exclude a / on the left, and put the lookahead after the optional decimal part.
(?<!/)\d+(?:\.\d+)?(?!/)
Explanation
(?<!/) Negative lookbehind, assert directly to the left of the current position is not /
\d+ Match 1+ digits
(?:\.\d+)? Match an optional . and 1+ digits
(?!/) Negative lookahead, assert directly to the right of the current position is not /
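A quick way to check the pattern above against the sample measurements is the following Python sketch. One caveat it exposes: with a multi-digit numerator or denominator (for example 13/4), the engine can backtrack and match part of the number. The STRICT variant is a hypothetical tightening that is not part of the answer; it also refuses to start or stop next to another digit.

```python
import re

# Pattern from the answer above.
BASIC = re.compile(r"(?<!/)\d+(?:\.\d+)?(?!/)")

# Hypothetical stricter variant (an assumption, not from the answer): a match
# may not begin right after, or end right before, a digit or a slash.
STRICT = re.compile(r"(?<![\d/])\d+(?:\.\d+)?(?![\d/])")

samples = ["1 gal.", "500lbs", "none", "2.25gal", "4feet twine",
           "2lbs regular and 2lbs lite", "3/4gal", "13/4gal"]
for text in samples:
    print(f"{text!r:32} basic={BASIC.findall(text)} strict={STRICT.findall(text)}")
```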
If your engine supports PCRE backtracking control verbs, you can match and skip all occurrences of the [digits]/[digits] pattern:
\d+\/\d+(*SKIP)(*F)|\d+(?:\.\d+)?
The \d+\/\d+(*SKIP)(*F)| part matches one or more digits, /, one or more digits, and then (*SKIP)(*F) makes the regex engine fail the match and start searching for the next match from the failure position, so the 3/5-like substrings won't be able to mess with your output.
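Python's built-in re module has no (*SKIP)(*F), so the sketch below swaps in a different technique with the same effect: the first alternative consumes a whole fraction without capturing it, and only matches that filled the capture group are kept. The helper name plain_numbers is hypothetical.

```python
import re

# Alternative 1 swallows whole fractions; alternative 2 captures plain numbers.
NUMBER_NOT_FRACTION = re.compile(r"\d+/\d+|(\d+(?:\.\d+)?)")

def plain_numbers(text):
    """Return all numbers in text that are not part of a digits/digits fraction."""
    return [m.group(1)
            for m in NUMBER_NOT_FRACTION.finditer(text)
            if m.group(1) is not None]

for text in ["3/4gal", "2.25gal", "13/4gal and 2lbs", "2lbs regular and 2lbs lite"]:
    print(f"{text!r:32} -> {plain_numbers(text)}")
```

Because the fraction is consumed as a unit, multi-digit numerators and denominators are skipped as well.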

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression in Hive.
We need to exclude numbers that start with +37 or 0037 (these could produce a false result with regexp_like), as well as numbers that contain letters or spaces.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.
To exclude numbers which start with +01 or +001 or +0001 and contain only digits, without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d must be escaped in Hive with a double backslash: \\+ and \\d.
The general question is whether you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern, or describe a well-formed telephone number and include everything that matches the pattern.
Which way to go depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
1. Your number can start with either + or 00: ^(\+|00)
2. It cannot be followed by 37, which in regex can be expressed by the following set of alternatives:
a. It is followed first by a 3, then by any digit but 7: 3[0-689]
b. It is followed first by any digit but 3, then by any digit: [0-24-9]\d
3. After that there is a sequence of digits of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note: as leftjoin has pointed out: To use this regex in hive you might need to additionally escape the backslashes \ in the pattern.
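The combined pattern can be tried on a few numbers before moving it into Hive. This is a minimal Python sketch; the sample numbers and the helper name is_allowed are made up for illustration, and the pattern is written with single backslashes because Python raw strings do not need the extra escaping.

```python
import re

# Prefix (+ or 00), then two digits that are not "37", then digits to the end.
WELL_FORMED = re.compile(r"^(\+|00)(3[0-689]|[0-24-9]\d)\d+$")

def is_allowed(tel_number):
    """True for a prefixed, digits-only number whose country code is not 37."""
    return WELL_FORMED.match(tel_number) is not None

# Hypothetical sample numbers.
for number in ["+3912345", "+3712345", "003712345", "004412345", "+39 12345", "12345"]:
    print(f"{number!r:14} -> {is_allowed(number)}")
```

Note that this pattern requires a + or 00 prefix, so a bare number such as 12345 is rejected.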
You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.
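The lookahead version can be checked the same way. This is a Python sketch with made-up sample numbers and a hypothetical helper name; it uses single backslashes because Python raw strings do not need the doubling that Hive string literals do.

```python
import re

# Reject +37 / 0037 up front, then allow an optional + followed by digits only.
NOT_37 = re.compile(r"^(?!\+37|0037)\+?\d+$")

def is_valid(tel_number):
    """True for a digits-only number (optional leading +) not starting with +37 or 0037."""
    return NOT_37.match(tel_number) is not None

# Hypothetical sample numbers.
for number in ["+3912345", "+3712345", "003712345", "0039123", "12345", "+39 12345"]:
    print(f"{number!r:14} -> {is_valid(number)}")
```

Unlike a pattern that demands a + or 00 prefix, this one also accepts a bare number such as 12345.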

Regexp - recognize trailing 0 for decimal numbers?

I need a regular expression that can find a run of trailing "0"s in the decimal part.
For example, the following formats should be recognized:
1.0
1.00
1.000
etc...
Is there some kind of "wildcard" for that?
Any idea?
Thanks,
KS
Please see the following regular expression that matches trailing zeros on the end of any decimal. Note that this does match trailing zeros in the case where there is a meaningful decimal value, but zeros occur after.
https://regex101.com/r/BJvLrO/1/
When working with regex, it is very valuable to always use a tool like https://regex101.com or https://regexr.com. Both of these tools will help you truly understand the regular expression. Try hovering your mouse over the different elements of the regex in my example and the tool will describe each part. You can also read the "Explanation" section on the right side.
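The regex behind the link is not reproduced in the post, so the following Python sketch uses a hypothetical pattern of its own with the behaviour described: it captures the run of trailing zeros at the end of a decimal, including when meaningful decimal digits come first. The helper name trailing_zeros is also an assumption.

```python
import re

# Hypothetical pattern, not the one behind the link above:
# integer part, decimal point, leading decimals (lazy), then the trailing zeros.
TRAILING_ZEROS = re.compile(r"^\d+\.\d*?(0+)$")

def trailing_zeros(text):
    """Return the run of trailing zeros in a decimal string, or None if there is none."""
    match = TRAILING_ZEROS.match(text)
    return match.group(1) if match else None

for value in ["1.0", "1.00", "1.000", "1.250", "1.0500", "1.5", "10"]:
    print(f"{value!r:10} -> {trailing_zeros(value)!r}")
```

The lazy \d*? is the design choice that matters here: it gives up as many digits as possible to the 0+ group, so the whole run of trailing zeros is captured rather than just the last one.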

Regex issue using ICU regex/regexkitlite

Starting a new question as my other question solved a different issue with the regex.
Here's my regex:
(?i)\\d{1,4}(?<!v(?:ol)?\\.?\\s?)(?![^\\(]*\\))
Regex split up for clarity:
(?i) - case insensitive
\\d{1,4} - a number with 1-4 digits
(?<!v(?:ol)?\\.?\\s?) the number cannot be preceded by 'v', 'v.', 'vol', 'vol.', with or without a space on the end.
(?![^\\(]*\\)) - Number cannot be inside parentheses.
It all works except for the 'vol.' bit.:
#"Words words 342 words (2342) (words 2 words) (words).ext" result 342 - correct.
#"Words - words words (2010) (words 2 words) (words).ext" result nil - correct.
#"words words v34 35.ext" result 34 - incorrect.
#"Words vol.342 343 (1234) (3 words) (desc).ext" result 342 - incorrect.
What am I doing wrong with my 'vol.' section?
You need to put the lookbehind before the number. Also, you need to add digits as illegal characters inside the lookbehind, or the 4 in v.34 will match. Try
(?i)(?<!v(?:ol)?\\.?\\s*\\d*)\\d{1,4}(?![^(]*\\))
This assumes (edit: wrongly, as it turns out) that RegexKitLite supports unbounded repetition inside lookbehind, which few regex flavors do.
A look into the docs shows that it does support finite (but variable) repetition inside lookbehind. As long as you accept that the following only works when there is at most one space between vol. and the number, you could try
(?i)(?<!v(?:ol)?\\.?\\s?)(?<!\\d)\\d{1,4}(?![^(]*\\))
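The final pattern can be tried on the four file names from the question. Python's built-in re module only accepts fixed-width lookbehind, so this sketch (an adaptation, not the ICU pattern itself) expands the variable-width lookbehind into a chain of fixed-width ones, one per form of the prefix. Backslashes are single because Python raw strings need no doubling; the helper name first_number is hypothetical.

```python
import re

# One fixed-width lookbehind per prefix form: v, vol, v., vol.,
# each optionally followed by a single whitespace character.
PREFIXES = [r"v", r"vol", r"v\.", r"vol\.", r"v\s", r"vol\s", r"v\.\s", r"vol\.\s"]
LOOKBEHINDS = "".join(f"(?<!{prefix})" for prefix in PREFIXES)

# Not preceded by a volume prefix or a digit, 1-4 digits, not inside parentheses.
NUMBER = re.compile(LOOKBEHINDS + r"(?<!\d)\d{1,4}(?![^(]*\))", re.IGNORECASE)

def first_number(name):
    """Return the first qualifying number in a file name, or None."""
    match = NUMBER.search(name)
    return match.group(0) if match else None

names = [
    "Words words 342 words (2342) (words 2 words) (words).ext",
    "Words - words words (2010) (words 2 words) (words).ext",
    "words words v34 35.ext",
    "Words vol.342 343 (1234) (3 words) (desc).ext",
]
for name in names:
    print(f"{first_number(name)!r:6} <- {name}")
```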