How to negate regex with lookarounds - objective-c

I have a regular expression:
(?<=@)'|'(?=%)
It successfully matches any apostrophe that is placed around %@ in this Objective-C string:
@"UPDATE RESTAURANTS SET CITY='%@', NAME='%@' ", city, @"Joy's Restaurant";
But I want the opposite: to match any apostrophe that is NOT around %@, i.e. to match only the apostrophe in Joy's Restaurant in this example.
Any ideas how to do that?

Negative lookarounds are pretty straightforward. Use (?!…) for a negative lookahead and (?<!…) for a negative lookbehind. For example:
(?<!@)'(?!%)
This matches any apostrophe as long as it is not immediately preceded by @ and not immediately followed by %. Notice that you have to remove the alternation (|), because you want both lookarounds to be satisfied.

Use a negative lookbehind and a negative lookahead instead.
(?<!@)'(?!%)
Live Demo
Alternatively, you can use the alternation operator in context: place what you want to exclude on the left (saying "throw this away, it's garbage") and place what you want to match in a capturing group on the right.
'%@'|(')
Live Demo
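Both suggestions can be tried side by side. The following is a minimal sketch in Python's re module (a stand-in for NSRegularExpression, which also supports lookarounds), assuming the placeholder is Objective-C's %@ format specifier; the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: two quoted placeholders plus one stray apostrophe.
sample = "CITY='%@', NAME='%@', Joy's Restaurant"

# Negative lookbehind + negative lookahead: both conditions must hold.
NOT_AROUND = re.compile(r"(?<!@)'(?!%)")
lookaround_hits = [m.start() for m in NOT_AROUND.finditer(sample)]

# Alternation: the left branch consumes '%@' whole and is ignored;
# only stray apostrophes ever reach capturing group 1.
DISCARD_OR_KEEP = re.compile(r"'%@'|(')")
alternation_hits = [
    m.start(1) for m in DISCARD_OR_KEEP.finditer(sample) if m.group(1) is not None
]

print(lookaround_hits, alternation_hits)
```

With the alternation approach the caller has to check group 1, because the overall match also succeeds on the discarded '%@' branch.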

Related

How can I use negative lookbehind to exclude fractions?

I have a list of measurements that need to be deconstructed into quantity (numeric) and unit (string). Things like
1 gal.
500lbs
none
2.25gal
4feet twine
2lbs regular and 2lbs lite
All was well and good using \d+(\.\d+)?, but now I have a fraction thrown into the mix:
3/4gal
I need to exclude the fraction from this search so that I can deal with it separately. I'm successfully excluding the numerator (3) by inserting a negative lookahead-- \d+(?!\/)(\.\d+)?, but I can't figure out how to exclude the denominator (4). I think I'm supposed to use a negative lookbehind but I can't figure out how. \d+(?!\/)(?<!\/)(\.\d+)? and \d+(?!\/)(\.\d+)?(?<!\/) still match the 4.
Thanks!
In a construct like \d+(?!\/)(?<!\/)(\.\d+)?, the lookbehind (?<!\/) always succeeds: it is evaluated right after \d+, so the character before the current position is always a digit, never a /.
You might also exclude a / on the left of the digits part, and add the lookahead after matching the decimal part.
(?<!/)\d+(?:\.\d+)?(?!/)
Explanation
(?<!/) Negative lookbehind: assert that what is directly to the left of the current position is not /
\d+ Match 1+ digits
(?:\.\d+)? Match an optional . and 1+ digits
(?!/) Negative lookahead: assert that what is directly to the right of the current position is not /
regex demo
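As a quick check, here is a small Python sketch that runs the lookaround pattern above over the measurements listed in the question. The variable names are illustrative only.

```python
import re

# Pattern from the answer: no "/" directly before the digits or directly after the number.
QTY = re.compile(r"(?<!/)\d+(?:\.\d+)?(?!/)")

samples = [
    "1 gal.",
    "500lbs",
    "none",
    "2.25gal",
    "4feet twine",
    "2lbs regular and 2lbs lite",
    "3/4gal",
]

# Map each measurement to the list of plain numbers found in it.
found = {s: QTY.findall(s) for s in samples}
for text, numbers in found.items():
    print(f"{text!r}: {numbers}")
```

One caveat worth testing on real data: with a multi-digit numerator such as 13/4gal, backtracking lets \d+ settle for the leading 1, so this pattern assumes single-digit fraction parts.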
You can match and skip all occurrences of the [digits]/[digits] pattern (the (*SKIP)(*F) verbs used here require a PCRE-compatible engine):
\d+\/\d+(*SKIP)(*F)|\d+(?:\.\d+)?
See the regex demo.
The \d+\/\d+(*SKIP)(*F)| part matches one or more digits, /, one or more digits, and then (*SKIP)(*F) makes the regex engine fail the match and start searching for the next match from the failure position, so the 3/5-like substrings won't be able to mess with your output.
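Python's built-in re module does not support the (*SKIP)(*F) verbs, so the sketch below swaps in a different but related technique: match whole fractions in the left branch of an alternation and capture plain numbers in a group on the right. The function name quantities is hypothetical.

```python
import re

# Left branch consumes an entire fraction (and is ignored by the caller);
# right branch captures an integer or decimal in group 1.
NUMBER_OR_FRACTION = re.compile(r"\d+/\d+|(\d+(?:\.\d+)?)")

def quantities(text):
    """Return plain numbers in text, ignoring anything shaped like digits/digits."""
    return [
        m.group(1)
        for m in NUMBER_OR_FRACTION.finditer(text)
        if m.group(1) is not None
    ]

print(quantities("3/4gal"))
print(quantities("13/45gal and 2.25gal"))
print(quantities("2lbs regular and 2lbs lite"))
```

Because the fraction is consumed as a unit, multi-digit numerators and denominators cannot leak partial matches into the result.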

Objective C - RegEx - Invalid Range when trying to match spaces [duplicate]

How can I rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match a hyphen along with the existing characters?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
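The four cases above can be checked mechanically. This is a small sketch using Python's re module (the rules listed are the same there); the helper chars_matched is a made-up name for illustration.

```python
import re

def chars_matched(char_class, candidates="abcd-"):
    """Return the candidate characters that the given character class accepts."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))

print(chars_matched(r"[-]"))     # hyphen only
print(chars_matched(r"[abc-]"))  # a, b, c and hyphen
print(chars_matched(r"[-abc]"))  # a, b, c and hyphen
print(chars_matched(r"[ab-d]"))  # a through d, no hyphen

# The class from the question with a trailing, unescaped hyphen added.
ALLOWED = re.compile(r"[a-zA-Z0-9!$* \t\r\n-]+")
print(bool(ALLOWED.fullmatch("well-formed input!")))
```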
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all the same. A hyphen between two ranges is treated as a literal symbol. [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" (without the quotes) to match any type of hyphen. The '-' character is just one type of hyphen, which also happens to be a special character in regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

How can I negate this expression

I have absolutely no clue how to work with regex's. I am less than a beginner.
I want to find any invalid css names from a string, so I can exclude them. Looking online I found a way to select the valid names using this:
/-?[_a-zA-Z]+[_a-zA-Z0-9-]*/g
What I want to do is negate this expression, so that only '1999' is matched in this example input:
holding-page single 1999 contact id-12 contact single single
To "negate" an expression, turn it into a negative look ahead:
/(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*)\S+(?!\S)/g
See live demo.
What this does is match a complete term, but one that does not match your positive regex.
A "complete term" is matched using (?<!\S)\S+(?!\S), which is \S+ (one or more non-whitespace characters) wrapped in negative lookarounds asserting "not a non-whitespace" on both sides, which prevents matching only part of a term.
Note that "not a non-whitespace" is not the same as "whitespace", because "not a non-whitespace" also matches the start and end of the input, so leading and trailing terms that are invalid will match too.
Your positive regex has been turned into a negative look ahead by enclosing it in (?!...).
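To see the negated expression at work on the sample input, here is a minimal sketch in Python rather than JavaScript. The pattern body is copied from the regex literal above, and findall plays the role of the g flag.

```python
import re

# Same pattern as the /.../g literal: a whole whitespace-delimited term
# that does not start like a valid CSS name.
INVALID_NAME = re.compile(r"(?<!\S)(?!-?[_a-zA-Z]+[_a-zA-Z0-9-]*)\S+(?!\S)")

classes = "holding-page single 1999 contact id-12 contact single single"
invalid = INVALID_NAME.findall(classes)
print(invalid)
```

A leading or trailing invalid term is found as well, since the lookarounds also succeed at the start and end of the input.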

Regular expression to extract a number of steps

I have a localized string that looks something like this in English:
"
5 Mile(s)
5,252 Step(s)
"
My app is localized both in left-to-right and right-to-left languages so I don't want to make assumptions either about the ordering of the step(s) or about the formatting of the number (e.g. 5,252 can be 5.252 depending on user locale). So I need to account for possibilities that can include things like
Step(s) 5.252
as well as what's above.
A few other caveats
All I know is that if the Step(s) line is in there, it will be on its own line (hence in my regex I require \n at each end of the string)
No guarantee that the Mile(s) information will be in the string at all, let alone whether it will be before or after Step(s)
Here's my attempt at pattern extraction:
NSString *patternString = [NSString stringWithFormat:@"\\n(([0-9,\\.]*)\s*%@|%@\s*([0-9,\\.]*))\\n",
NSLocalizedString(@"Step(s)",nil), NSLocalizedString(@"Step(s)",nil)];
There appear to be two problems with this:
XCode is indicating Unknown escape sequence '\s' for the second \s in the pattern string above
No matches are being found even for strings like the following:
0.2 Mile(s)
1,482 Step(s)
Ideally I would extract the 1,482 out of this string in a way that is localization friendly. How should I modify my regex?
As far as the regex goes, perhaps this approach might work: it simply matches (with named groups) each couplet of numbers in sequence, with the assumption that the first is miles and the second is steps. Decimals in the . or , form are optional:
(?<miles>\d+(?:[.,]\d+)?).*?(?<steps>\d+(?:[.,]\d+)?)
(And I think it should be \\s.) I'm not an iOS guy, but if you can use a regex literal it would be way more readable.
regular expression demo
First I'd like to ask - Why is Mile(s) mentioned in the question at all?
And now to my two bits - you could simply use a positive look-ahead:
^(?=.*Step\(s\))[^\d]*(\d+(?:[.,]\d+)?)
It makes sure the expected word is present on the line, and then captures the number on it, allowing for a localized, optional decimal separator and decimals. This way it doesn't matter if the number is before or after the "word".
It doesn't take localization of the "word" into account, but that you seem to have handled by yourself ;)
See it here at regex101.
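The lookahead pattern above can be exercised on both orderings from the question. This is a sketch in Python's re module, with the multiline flag added as an assumption so that ^ can match at the start of the Step(s) line inside a longer string.

```python
import re

# Pattern from the answer; re.MULTILINE is an added assumption so ^ matches per line.
STEP_NUMBER = re.compile(r"^(?=.*Step\(s\))[^\d]*(\d+(?:[.,]\d+)?)", re.MULTILINE)

number_first = "5 Mile(s)\n5,252 Step(s)"
label_first = "Step(s) 5.252"

first = STEP_NUMBER.search(number_first)
second = STEP_NUMBER.search(label_first)
print(first.group(1), second.group(1))
```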
Your regex is close, although in Obj-C you need to double-escape the \s and (s):
^(([0-9,.]*)\\s*%@|%@\\s*([0-9,.]*))$
In your NSLocalizedString you likely also need to escape the parentheses enclosing (s):
NSString *patternString = [NSString stringWithFormat:@"^(([\\d,.]+)\\s%@|%@\\s([\\d,.]+))$",
NSLocalizedString(@"Step\\(s\\)",nil), NSLocalizedString(@"Step\\(s\\)",nil)];
If you don't escape (s) then the regex engine is probably going to interpret it as a capture group.
Looking at NSLog you can see what the pattern actually reads like:
NSLog(@"patternString: %@", patternString);
Output:
patternString: ^(([\d,.]+)\sStep\(s\)|Step\(s\)\s([\d,.]+))$
Since you mentioned the Mile(s) part may not be in the string at all I'm assuming it isn't relevant to the regular expression. As I understand from the question, you just need to capture the number of steps and nothing else. On this basis, here's a modified version of your existing regex:
NSString *patternString =
[NSString stringWithFormat:@"^(?:([0-9,.]*)\\s*%@|%@\\s*([0-9,.]*))$",
NSLocalizedString(@"Step\\(s\\)",nil), NSLocalizedString(@"Step\\(s\\)",nil)];
Demo:
https://www.regex101.com/r/Q6ff1b/1
This is based on the following tips/modifications:
Use the m (= UREGEX_MULTILINE) flag option when creating the regex to specify that ^ and $ match the start and end of each line (in NSRegularExpression this is the NSRegularExpressionAnchorsMatchLines option). This is more robust than using \n, as it also handles the start and end of the string, where a newline might not be present.
Always use a double backslash (\\) for regex escaping; otherwise the single backslash is interpreted as escaping the next character in the string literal and is converted (or rejected) before it gets to the regex.
Literal parentheses need to be escaped - e.g. Step\\(s\\) instead of Step(s).
Most metacharacters within a character class (i.e. within the [] square brackets) don't need to be escaped, so it would be . rather than \\. there.
If you are using (x|y|...) as a choice and don't need it to be a capturing group, use ?: after the first parenthesis to ensure it doesn't get captured - i.e. (?:x|y|...).
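The tips above can be sketched end to end in Python. The function extract_steps and its label parameter are hypothetical, re.escape stands in for hand-escaping the parentheses in the localized label, and + is used instead of * so that an empty number is never accepted.

```python
import re

def extract_steps(text, label="Step(s)"):
    """Return the number on the line carrying the (localized) label, or None."""
    escaped = re.escape(label)  # turns Step(s) into Step\(s\)
    pattern = rf"^(?:([0-9,.]+)\s*{escaped}|{escaped}\s*([0-9,.]+))$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        return None
    # Exactly one of the two groups participates, depending on the ordering.
    return match.group(1) or match.group(2)

print(extract_steps("\n5 Mile(s)\n5,252 Step(s)\n"))
print(extract_steps("Step(s) 5.252"))
print(extract_steps("0.2 Mile(s)"))
```

In Objective-C, +[NSRegularExpression escapedPatternForString:] plays the same role as re.escape here, which avoids putting backslashes into the NSLocalizedString key.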

Regex in Objective-C and regexpal

I created a regular expression and tested it with a string in RegexPal.
Then I tested it in Objective-C with the same string, and I get no result.
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"^Page(\\d+| \\d+)(:| :|)$" options:NSRegularExpressionCaseInsensitive error:&regError];
As you can see, I did add another '\' char before the 'd', but I get no result at all.
If I change the expression from "^Page(\d+| \d+)(:| :|)$" to "Page(\d+| \d+)(:| :|)", I get way too many results. It's as if my 'AND' statements were understood as 'OR' statements. Does anybody have an idea of what is happening?
EDIT:
The expression "^page(\d+| \d+)(:| :|)$" applied to the string "page 15 :" returns 3 results: "page 15 :", "15", and ":". I only want the first one. Like I said, it's as if my AND were transformed into an OR/AND. I would like the number and the colon (optional, as my regex says) to always be attached to 'page'.
Turn on the multi-line option; then the anchors ^ and $ will mean beginning and end of line.
By default, ^ and $ mean beginning and end of the entire string.
In RegexPal you can see that the "Match at line breaks (m)" option is checked.
Edit:
If you are getting too much sub-expression data, then you should replace the capture groups with cluster (non-capturing) groups: (..) -> (?: ..).
This is an 'extended' construct.
If you can't do that, then just go with the data in group 0, which is the entire match, and ignore the rest. Not sure how to do this.
As pointed out by sln (who solved that issue), you don't match anything in Objective-C because you have to turn the multiline option on (with m), and you do match in RegexPal because it is on there.
Regarding your regex, it could be improved with ^(Page\s*\d+\s*:?\s*)$. The question mark means that the preceding character is optional, and \s matches any whitespace character (space, tab, etc.).
Regarding your selection issue, parentheses in a regex are what capture values. So if you write (Page( \d+|\d+)) you'll get two different captures. What you wanted was (Page(?: \d+|\d+)), since (?: ) groups without capturing. But | isn't usually needed when a simple ? does the trick.
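Both effects discussed in this thread (anchors without the multiline option, and capturing versus non-capturing groups) can be reproduced in a few lines. This is a sketch in Python's re module with a made-up multi-line input. The flag names are Python's, not NSRegularExpression's.

```python
import re

# Hypothetical input with several lines, as in a pasted document.
text = "Page 15 :\nPage16\nsee page 3\nPage 7:"

# Without the multiline flag, ^ and $ only anchor to the whole string: no result.
no_multiline = re.findall(r"^Page(\d+| \d+)(:| :|)$", text, re.IGNORECASE)

# With capturing groups, each result is the tuple of sub-matches, not the whole match.
flags = re.IGNORECASE | re.MULTILINE
capturing = re.findall(r"^Page(\d+| \d+)(:| :|)$", text, flags)

# With (?: ...) groups nothing is captured, so each result is the whole match.
non_capturing = re.findall(r"^Page(?:\d+| \d+)(?::| :|)$", text, flags)

print(no_multiline)
print(capturing)
print(non_capturing)
```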