NSRegularExpression - Problem with a Pattern - objective-c

I've written a small program to find a string within a string, and it works fine so far. But I have a problem with NSRegularExpression: I need the right pattern for my special case and I'm stuck.
NSString *strRegExp = [NSString stringWithFormat:@"?trunk/%@/%@/+\\([a-zA-Z0-9_\\-\\.])+/Host-1", inputstrse , inputstrsno];
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:strRegExp options: NSRegularExpressionCaseInsensitive error:NULL];
NSArray *arrayOfAllMatches = [regex matchesInString:inputurl options:0 range:NSMakeRange(0, [inputurl length])];
The NSRegularExpression pattern should match strings that look like this:
trunk/%@/%@/some-text-1/Host-1
trunk/test/1/5-text-text/Host-1
Where trunk/%@/%@/ and /Host-1 always stay the same. Only the part in the middle is variable, and it always looks like this:
NUMBER-Some-Text -> 5-Hello-World -> /trunk/test/1/5-hello-world/Host-1
I've tried different regular expressions, as you can see here: "?trunk/%@/%@/+\([a-zA-Z0-9_\-\.])+/Host-1", but it still doesn't seem to work; maybe someone can help me.
Maybe there is a problem when I build the pattern with:
NSString *strRegExp = [NSString stringWithFormat:@"?trunk/%@/%@/+\\([a-zA-Z0-9_\\-\\.])+/Host-1", inputstrse , inputstrsno];
And use it later like that:
regularExpressionWithPattern:strRegExp
I hope someone can help me. I'm new to regular expressions.

Generally, expressing a Regex as "I want to match a number of letters, then a dash, then a number" and so on is the easiest way to construct one. Also, using a tool such as http://www.regexr.com simplifies a lot.
From what I understand you want to match the following:
trunk/test/1/[some number]-[some text]-[some other text]/Host-1
If so, then the following regular expression should cut it:
trunk\/test\/1\/[0-9]*-[a-zA-Z]*-[a-zA-Z]*\/Host-1
It does the following:
trunk\/test\/1\/: Match the constant string trunk/test/1/ (The backslashes are escapes)
[0-9]*-: Match any number of digits followed by a -
[a-zA-Z]*-: Match any number of letters followed by a -
[a-zA-Z]*: Match any number of letters
\/Host-1: Match the constant string /Host-1
Here is a link to RegExr which you can use if you want to experiment with different input data or changes to the regex: http://regexr.com/39tgn
The following pattern was suggested in the comments: trunk\/test\/1\/.*\/Host-1. It's a bit less strict but does the job as well.

I don't know Objective-C but your regex has a bunch of oddities, if I remove those I get something that I think you'd want to achieve.
Your first character is a ?, that can't be, it's a quantifier in regex that says something about the preceding character (or class or group). If it's the first character, there is no preceding char.
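/+\\( <-- unsure what you were trying to do here. Once the string literal is unescaped, the regex engine sees /+\(, which means '1 or more / followed by a literal ('. The closing ) further on then has no opening partner.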
[a-zA-Z0-9_\\-\\.] can be done much shorter like: [\w.-] and if you place the + within the parentheses it will capture the entire unknown string in capture group 1.
From comments: So %@ is a variable text, the first is always just letters, the 2nd is always just numbers. That would be [a-zA-Z]+ and \d+ respectively in a regex. But actually I would use [^/]+ (any character that isn't /) so that the code doesn't break when someone puts a different character in this path like trunk/this_text/4/.../Host-1 which would break on the _.
Combined this makes (changed after comments):
trunk/[^/]+/[^/]+/([\w.-]+)/Host-1
Now note that this is the raw pattern, without the escaping needed to get it into a string literal. In C#, a string that starts with @"..." is verbatim and needs no escaping, but an Objective-C @"..." literal still processes backslashes, so there the \w has to be written as \\w.
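A minimal sketch of how the combined pattern could be built and used from Objective-C, assuming both path components arrive as plain strings. The function name MiddleComponent and its parameters are hypothetical and not from the question; escapedPatternForString: is used so that any regex metacharacters in the inputs are matched literally.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the variable middle component of
// "trunk/<section>/<number>/<middle>/Host-1", or nil if the URL does not match.
static NSString *MiddleComponent(NSString *url, NSString *section, NSString *number) {
    // Escape the inputs so characters such as '.' or '+' are matched literally.
    // Note the doubled backslash: the regex engine receives [\w.-]+
    NSString *pattern = [NSString stringWithFormat:@"trunk/%@/%@/([\\w.-]+)/Host-1",
                         [NSRegularExpression escapedPatternForString:section],
                         [NSRegularExpression escapedPatternForString:number]];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        // Passing a real NSError makes malformed patterns visible instead of silent.
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }

    NSTextCheckingResult *match =
        [regex firstMatchInString:url options:0 range:NSMakeRange(0, url.length)];
    if (match == nil) {
        return nil;
    }
    // Range 1 is the first capture group, i.e. the middle component.
    return [url substringWithRange:[match rangeAtIndex:1]];
}
```

Calling MiddleComponent(@"/trunk/test/1/5-hello-world/Host-1", @"test", @"1") should return 5-hello-world. Checking the error instead of passing NULL is a design choice that would also have exposed the leading ? in the original pattern.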

Related

Regular expression to extract a number of steps

I have a localized string that looks something like this in English:
"
5 Mile(s)
5,252 Step(s)
"
My app is localized both in left-to-right and right-to-left languages so I don't want to make assumptions either about the ordering of the step(s) or about the formatting of the number (e.g. 5,252 can be 5.252 depending on user locale). So I need to account for possibilities that can include things like
Step(s) 5.252
as well as what's above.
A few other caveats
All I know is that if the Step(s) line is in there, it will be on its own line (hence in my regex I require \n at each end of the string)
No guarantee that the Mile(s) information will be in the string at all, let alone whether it will be before or after Step(s)
Here's my attempt at pattern extraction:
NSString *patternString = [NSString stringWithFormat:@"\\n(([0-9,\\.]*)\s*%@|%@\s*([0-9,\\.]*))\\n",
NSLocalizedString(@"Step(s)",nil), NSLocalizedString(@"Step(s)",nil)];
There appear to be two problems with this:
Xcode reports Unknown escape sequence '\s' for the second \s in the pattern string above
No matches are being found even for strings like the following:
0.2 Mile(s)
1,482 Step(s)
Ideally I would extract the 1,482 out of this string in a way that is localization friendly. How should I modify my regex?
As far as the regex goes, perhaps this approach might work. It simply matches (with named groups) each pair of numbers in sequence, on the assumption that the first is miles and the second is steps. Decimals in the . or , form are optional:
(?<miles>\d+(?:[.,]\d+)?).*?(?<steps>\d+(?:[.,]\d+)?)
(And I think it should be \\s.) I'm not an iOS guy, but if you can use a regex literal it would be far more readable.
First I'd like to ask - Why is Mile(s) mentioned in the question at all?
And now to my two bits - you could simply use a positive look-ahead:
^(?=.*Step\(s\))[^\d]*(\d+(?:[.,]\d+)?)
It makes sure the expected word is present on the line, and then captures the number on it, allowing for a localized, optional decimal separator and decimals. This way it doesn't matter whether the number is before or after the "word".
It doesn't take localization of the "word" into account, but that you seem to have handled by yourself ;)
See it here at regex101.
Your regex is close, although in Obj-C you need to double-escape the \s and (s):
^(([0-9,.]*)\\s*%@|%@\\s*([0-9,.]*))$
In your NSLocalizedString you likely also need to escape the parentheses enclosing (s):
NSString *patternString = [NSString stringWithFormat:@"^(([\\d,.]+)\\s%@|%@\\s([\\d,.]+))$",
NSLocalizedString(@"Step\\(s\\)",nil), NSLocalizedString(@"Step\\(s\\)",nil)];
If you don't escape (s) then the regex engine is probably going to interpret it as a capture group.
Looking at NSLog you can see what the pattern actually reads like:
NSLog(@"patternString: %@", patternString);
Output:
patternString: ^(([\d,.]+)\sStep\(s\)|Step\(s\)\s([\d,.]+))$
Since you mentioned the Mile(s) part may not be in the string at all I'm assuming it isn't relevant to the regular expression. As I understand from the question, you just need to capture the number of steps and nothing else. On this basis, here's a modified version of your existing regex:
NSString *patternString =
[NSString stringWithFormat:@"^(?:([0-9,.]*)\\s*%@|%@\\s*([0-9,.]*))$",
NSLocalizedString(@"Step\\(s\\)",nil), NSLocalizedString(@"Step\\(s\\)",nil)];
Demo:
https://www.regex101.com/r/Q6ff1b/1
This is based on the following tips/modifications:
Use the m (= UREGEX_MULTILINE) flag option, which NSRegularExpression exposes as NSRegularExpressionAnchorsMatchLines, when creating the regex to specify that ^ and $ match the start and end of each line. This is more robust than using \n, as it also handles the start and end of the string, where a newline might not be present.
Always use a double backslash (\\) for regex escaping - otherwise NSString will interpret the single backslash to be escaping the next character and convert it before it gets to the regex.
Literal parentheses need to be escaped - e.g. Step\\(s\\) instead of Step(s).
Most characters within a character class (i.e. anything within the [] square brackets) don't need to be escaped, so it would be . rather than \\. there.
If you are using (x|y|...) as a choice and don't need it to be a capturing group, use ?: after the first parenthesis to ensure it doesn't get captured - i.e. (?:x|y|...).
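The tips above can be sketched as follows. This is only an illustration under stated assumptions: the function name StepCountText is hypothetical, the label is assumed to be the already-localized text, and it is escaped with escapedPatternForString: rather than by changing the NSLocalizedString key (altering the key would also alter the localization lookup). The sketch uses + instead of * so that an empty number cannot match.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the number text (e.g. @"1,482") from the line
// that contains the step label, or nil if there is no such line.
static NSString *StepCountText(NSString *text, NSString *stepLabel) {
    // Escape the label for the regex engine, so "Step(s)" becomes "Step\(s\)".
    NSString *label = [NSRegularExpression escapedPatternForString:stepLabel];
    NSString *pattern = [NSString stringWithFormat:
        @"^(?:([0-9,.]+)\\s*%@|%@\\s*([0-9,.]+))$", label, label];

    // AnchorsMatchLines makes ^ and $ match at the start and end of every line.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:text options:0 range:NSMakeRange(0, text.length)];
    if (match == nil) {
        return nil;
    }
    // Only one of the two groups takes part in a match; the other one
    // reports a range whose location is NSNotFound.
    NSRange range = [match rangeAtIndex:1];
    if (range.location == NSNotFound) {
        range = [match rangeAtIndex:2];
    }
    return [text substringWithRange:range];
}
```

A call would look like StepCountText(string, NSLocalizedString(@"Step(s)", nil)). Turning the returned text into a number is left out here; a locale-aware NSNumberFormatter would be one option.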

Regex (searching for function(@"string content") to get "string content"

I have a little regex problem (don't we all sometimes).
The few pieces of code are from Objective C but regex expressions are still the same I believe.
I have two functions called
NSString * CRLocalizedString(NSString *key)
NSString * CRLocalizedArgString(NSString *key, ...)
These are scattered around my project for localisation.
Now I want to find them all.
Well go to directory, parse all files, etc
All fine there.
The regexes I use on the files are
[NSRegularExpression regularExpressionWithPattern:@"CRLocalizedString\\(@\\\"[^)]+\\\"\\)" options:0 error:&error];
[NSRegularExpression regularExpressionWithPattern:@"CRLocalizedArgString\\([^)]+\\)" options:0 error:&error];
This works perfectly, except that my terminating character is a ).
The problem occurs with function calls like this
CRLocalizedString(@"Happy =), o so happy =D");
CRLocalizedArgString(@"Filter (%i)", 0.75f);
The regex ends the string at "Filter (%i" and at "Happy =)".
And this is where my regex knowledge ends, and I do not know what to do anymore.
I thought of using ");" as the end, but that isn't always the case.
So I was hoping someone here knows a solution (approaches completely different from regex are also welcome, of course).
Kind regards
Saren
Let's write your first regex without the extra level of C escapes:
CRLocalizedString\(@\"[^)]+\"\)
You don't have to escape a " for a regex, so let's get rid of those extra backslashes:
CRLocalizedString\(@"[^)]+"\)
So, you want to match a quoted string using "[^)]+". But that doesn't match every quoted string.
What is a quoted string? It's a ", followed by any number of string atoms, followed by another ". What is a string atom? It's any character except " or \, or a \ followed by any character. So here's a regex for a quoted string:
"([^"\\]|\\.)*"
Sticking that back into your first regex, we get this:
CRLocalizedString\(@"([^"\\]|\\.)*"\)
Here's a link to a regex tester demonstrating that regex.
Quoting it in an Objective-C string literal gives us this:
@"CRLocalizedString\\(@\"([^\"\\\\]|\\\\.)*\"\\)"
It is impossible to write a regex to match calls to CRLocalizedArgString in the general case, because such calls can take arbitrary expressions as arguments, and regexes cannot match arbitrary expressions (because they can contain arbitrary levels of nested parentheses, which regexes cannot match).
You could just hope that there are no parentheses in the argument list, and use this regex:
CRLocalizedArgString\(@"([^"\\]|\\.)*"[^)]*\)
Here's a link to a regex tester demonstrating that regex.
Quoting it in an Objective-C string literal gives us this:
@"CRLocalizedArgString\\(@\"([^\"\\\\]|\\\\.)*\"[^)]*\\)"
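As a rough sketch of how the quoted-string pattern could be used to pull out the keys: the helper name LocalizedKeys is hypothetical, and the only change to the pattern above is that the alternation is made non-capturing and wrapped in one capture group, so group 1 holds the literal's contents.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the raw string-literal contents (escape
// sequences left untouched) of every CRLocalizedString(@"...") call in source.
static NSArray *LocalizedKeys(NSString *source) {
    // Regex seen by the engine: CRLocalizedString\(@"((?:[^"\\]|\\.)*)"\)
    NSString *pattern = @"CRLocalizedString\\(@\"((?:[^\"\\\\]|\\\\.)*)\"\\)";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];

    NSMutableArray *keys = [NSMutableArray array];
    for (NSTextCheckingResult *match in
         [regex matchesInString:source options:0 range:NSMakeRange(0, source.length)]) {
        // Group 1 is everything between the two unescaped quotes.
        [keys addObject:[source substringWithRange:[match rangeAtIndex:1]]];
    }
    return keys;
}
```

For the problematic call from the question, CRLocalizedString(@"Happy =), o so happy =D");, this should yield the full key including the ) inside the string.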

Regex in Objective-C and regexpal

I created a regular expression and tested it against a string in RegexPal.
Then I tested it in Objective-C with the same string and got no results.
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"^Page(\\d+| \\d+)(:| :|)$" options:NSRegularExpressionCaseInsensitive error:&regError];
As you can see, I added another '\' before the 'd', but I still get no results at all.
If I change the expression from "^Page(\d+| \d+)(:| :|)$" to "Page(\d+| \d+)(:| :|)", I get far too many results. It's as if my 'AND' conditions were understood as 'OR' conditions. Does anybody have an idea of what is happening?
EDIT :
For the expression "^page(\d+| \d+)(:| :|)$" and the string "page 15 :", I get three results: "page 15 :", "15", and ":". I only want the first one. As I said, it's as if my AND were turned into an OR/AND. I would like the number and the colon (which is optional, as my regex says) to always be attached to 'page'.
Turn on the multi-line option; the anchors ^ and $ will then match at the beginning and end of each line.
By default, ^ and $ match only at the beginning and end of the entire string.
In RegexPal you can see that the "Match at line breaks (m)" option is checked.
edit
If you are getting too much sub-expression data, replace the capture groups with non-capturing (cluster) groups: (..) -> (?: ..).
This is an 'extended' group construct.
If you can't do that, then just go with the data in group 0, which is the entire match, and ignore the rest. I'm not sure how to do this in Objective-C.
As pointed out by sln (who solved that issue), you don't match anything in Objective-C because the multiline option has to be turned on (the m flag, which NSRegularExpression exposes as NSRegularExpressionAnchorsMatchLines), and you do match in RegexPal because it is on there.
Regarding your regex, it could be improved with ^(Page\s*\d+\s*:?\s*)$. The question mark means that the preceding character is optional; \s matches any whitespace character (space, tab, etc.).
Regarding your selection issue: parentheses in a regex are what capture values. So if you write (Page( \d+|\d+)) you'll get two different captures. What you wanted was (Page(?: \d+|\d+)), since (?: ) groups without capturing anything. But | isn't usually needed when a simple ? does the trick.
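A small sketch combining both answers, assuming the input is a multi-line string and only the whole match is wanted. The helper name PageLines is hypothetical. It uses [ \t] instead of \s so that a match cannot run across a line break, and it has no capture groups, so each result carries only its overall range (group 0).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every line of text that consists solely of
// "Page", a number and an optional colon (case-insensitive).
static NSArray *PageLines(NSString *text) {
    // Regex seen by the engine: ^Page[ \t]*\d+[ \t]*:?[ \t]*$
    NSString *pattern = @"^Page[ \\t]*\\d+[ \\t]*:?[ \\t]*$";
    NSRegularExpressionOptions options =
        NSRegularExpressionCaseInsensitive | NSRegularExpressionAnchorsMatchLines;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:options error:NULL];

    NSMutableArray *lines = [NSMutableArray array];
    for (NSTextCheckingResult *match in
         [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)]) {
        // match.range is the range of the entire match (group 0).
        [lines addObject:[text substringWithRange:match.range]];
    }
    return lines;
}
```

Each NSTextCheckingResult corresponds to one match; capture groups, if any, are read from that same result with rangeAtIndex: rather than arriving as extra results.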

OS X Using literal asterisk in regular expression

I'm writing a program to make text that begins with /* and ends with */ a different color (syntax highlighting for a C comment). When I try this
@"/\*.*\*/";
I get unknown escape sequence. So I figured that to get a literal asterisk I had to use this
@"/[*].*[*]/";
and I get no errors, but when I use this code
commentPattern = @"/[*].*[*]/";
reg = [NSRegularExpression regularExpressionWithPattern:commentPattern options:kNilOptions error:nil];
results = [reg matchesInString:self.string options:kNilOptions range:NSMakeRange(0, [self.string length])];
for (NSTextCheckingResult *result in results)
{
[self setTextColor:[NSColor colorWithCalibratedRed:0.0 green:0.7 blue:0.0 alpha:1.0] range:result.range];
}
the text color of the comments doesn't change, but I don't see anything wrong with my regular expression. Can someone tell me why this won't work? I don't think it's a problem with the way I get the results or change their color, because I use the same method for other regular expressions.
You want to use this: "\\*".
\* is the escape sequence for * in regular expressions, but in C strings, \ also begins an escaped character token, so you have to escape that as well.
@"/\*.*\*/";
I get unknown escape sequence.
A string first converts escape sequences in the string, then the result is handed over to the regex engine. For instance, an escape sequence might be \t, which represents a tab, or \n which represents a newline. The string first converts an escape sequence to a special code. Your error is saying that \* is not a legal escape sequence for an NSString.
The regex engine needs to see a literal backslash followed by a *. To get a literal backslash in a string you need to write \\. However, for readability I prefer using a character class like you did with your second attempt.
You should NSLog what the results array contains to see what matches you are getting. If the matches are what you expect, then the problem is not with the regex.
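A hedged sketch of the double-escaped pattern in use. The helper name CommentStrings is hypothetical. It also adds two things that are not in the answers above and are only assumptions about the input: a lazy .*? so that two comments are not merged into one match, and NSRegularExpressionDotMatchesLineSeparators, because by default . does not match line breaks and a comment spanning several lines would otherwise be missed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of every /* ... */ comment in source.
static NSArray *CommentStrings(NSString *source) {
    // Regex seen by the engine: /\*.*?\*/
    NSString *pattern = @"/\\*.*?\\*/";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionDotMatchesLineSeparators
                                                    error:NULL];

    NSMutableArray *comments = [NSMutableArray array];
    for (NSTextCheckingResult *match in
         [regex matchesInString:source options:0 range:NSMakeRange(0, source.length)]) {
        [comments addObject:[source substringWithRange:match.range]];
    }
    return comments;
}
```

Logging the returned array is one way to do the check suggested above: if the expected comments show up, the problem lies in the coloring code rather than in the regex.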

Objective C Regex for tokenizing a sentence

I have some text in the following format:
{{st1:[[word1]]-[[word2]]s [[word1]] [[word3]]}} {{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}
I want to filter out sentences (signature {{st[0-9]: }}) which contain the given word (signature [[word]] ). Hence, if I am searching for [[word1]] , the output should be
{{st1:[[word1]]-[[word2]]s [[word1]] [[word3]]}}
{{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}
while if I am searching for [[word4]] , the output should be
{{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}
I have written the following code so far, but can't achieve the above. Please help me correct it.
NSString* aString = @"{{st1:[[word1]]-[[word2]]s [[word1]] [[word3]]}} {{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}";
NSString *regexString = @"\\{\\{st[1-9]:.*(word).*\\}\\}";
for(NSString *match in [aString componentsMatchedByRegex:regexString])
    NSLog(@"%@", match);
I am using RegexKitLite , but am open to any other suggestions.
The issue with your current regex is that .* will also match the }} at the end of a sentence. You need to make sure that the inner part of your regex can never proceed past a }}. Here is one way to do this:
\\{\\{st[1-9]:[^}]*(word)[^}]*\\}\\}
If a single closing curly brace is a valid character for part of a word it gets a little more complicated, but you should be able to replace the [^}]* with (\}?[^}])* or (?:(?!\}\}).)*.
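A minimal sketch of the (?:(?!\}\}).)* variant, written against NSRegularExpression rather than RegexKitLite (that swap is my assumption, since the question says other suggestions are welcome). The helper name SentencesContainingWord is hypothetical; the word is passed without its [[ ]] brackets and is escaped before being inserted into the pattern.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every {{stN:...}} sentence that contains [[word]].
static NSArray *SentencesContainingWord(NSString *text, NSString *word) {
    // (?:(?!\}\}).)* matches any run of characters that does not contain "}}",
    // so a match can never run past the end of the current sentence.
    NSString *pattern = [NSString stringWithFormat:
        @"\\{\\{st[0-9]+:(?:(?!\\}\\}).)*?\\[\\[%@\\]\\](?:(?!\\}\\}).)*\\}\\}",
        [NSRegularExpression escapedPatternForString:word]];
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];

    NSMutableArray *sentences = [NSMutableArray array];
    for (NSTextCheckingResult *match in
         [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)]) {
        [sentences addObject:[text substringWithRange:match.range]];
    }
    return sentences;
}
```

With the sample text from the question, searching for word1 should return both sentences and searching for word4 only the st2 sentence.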