Objective C Regex for tokenizing a sentence - objective-c

I have some text in the following format:
{{st1:[[word1]]-[[word2]]s [[word1]] [[word3]]}} {{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}
I want to filter out sentences (signature {{st[0-9]: }}) which contain the given word (signature [[word]] ). Hence, if I am searching for [[word1]] , the output should be
{{st1:[[word1]]-[[word2]]s [[word1]] [[word3]]}}
{{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}
while if I am searching for [[word4]] , the output should be
{{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}
I have written the following code so far, but can't achieve the above. Please help me correct it.
NSString *aString = @"{{st1:[[word1]]-[[word2]]s [[word1]] [[word3]]}} {{st2:[[word2]] [[word3]] [[word1]]-[[word4]]s.}}";
NSString *regexString = @"\\{\\{st[1-9]:.*(word).*\\}\\}";
for (NSString *match in [aString componentsMatchedByRegex:regexString])
    NSLog(@"%@", match);
I am using RegexKitLite, but am open to any other suggestions.

The issue with your current regex is that .* is greedy and will match past the }} at the end of a sentence. You need to make sure that the inner part of your regex can never proceed past a }}. Here is one way to do this:
\\{\\{st[1-9]:[^}]*(word)[^}]*\\}\\}
If a single closing curly brace is a valid character for part of a word it gets a little more complicated, but you should be able to replace the [^}]* with (\}?[^}])* or (?:(?!\}\}).)*.
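A minimal sketch of the first pattern above, using NSRegularExpression rather than RegexKitLite. The helper name SentencesContainingWord is made up for illustration, the (word) placeholder is replaced by the escaped [[word]] token, and it assumes a sentence body never contains a } character.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every {{stN:...}} sentence that contains [[word]].
// Assumes no '}' appears inside a sentence body.
static NSArray<NSString *> *SentencesContainingWord(NSString *text, NSString *word) {
    // Escape the word so any regex metacharacters in it are matched literally.
    NSString *escaped = [NSRegularExpression escapedPatternForString:word];
    // [^}]* can never run past the closing }} of the current sentence.
    NSString *pattern = [NSString stringWithFormat:
        @"\\{\\{st[1-9]:[^}]*\\[\\[%@\\]\\][^}]*\\}\\}", escaped];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *sentences = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, text.length);
    for (NSTextCheckingResult *match in [regex matchesInString:text options:0 range:whole]) {
        [sentences addObject:[text substringWithRange:match.range]];
    }
    return sentences;
}
```

Called with the sample text from the question, searching for word1 should return both sentences and searching for word4 only the st2 sentence.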

Related

NSRegularExpression - Probleme with a Pattern

I've written a little program to find a string within a string, which works fine so far. But I have a problem with NSRegularExpression: I need the right pattern for my special case and I'm stuck.
NSString *strRegExp = [NSString stringWithFormat:@"?trunk/%@/%@/+\\([a-zA-Z0-9_\\-\\.])+/Host-1", inputstrse, inputstrsno];
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:strRegExp options: NSRegularExpressionCaseInsensitive error:NULL];
NSArray *arrayOfAllMatches = [regex matchesInString:inputurl options:0 range:NSMakeRange(0, [inputurl length])];
The NSRegularExpression pattern should match strings that look like this:
trunk/%@/%@/some-text-1/Host-1
trunk/test/1/5-text-text/Host-1
Where trunk/%@/%@/ and /Host-1 always stay the same. Only the part in the middle is variable, and it always looks like this:
NUMBER-Some-Text -> 5-Hello-World -> /trunk/test/1/5-hello-world/Host-1
I've tried it with different regular expressions, as you can see here: "?trunk/%@/%@/+\([a-zA-Z0-9_\-\.])+/Host-1", but it still doesn't seem to work. Maybe someone can help me.
Maybe there is a problem when I build the pattern with:
NSString *strRegExp = [NSString stringWithFormat:@"?trunk/%@/%@/+\\([a-zA-Z0-9_\\-\\.])+/Host-1", inputstrse, inputstrsno];
And use it later like that:
regularExpressionWithPattern:strRegExp
I hope someone can help me. I'm new to regular expressions.
Generally, expressing a Regex as "I want to match a number of letters, then a dash, then a number" and so on is the easiest way to construct one. Also, using a tool such as http://www.regexr.com simplifies a lot.
From what I understand you want to match the following:
trunk/test/1/[some number]-[some text]-[some other text]/Host-1
If so, then the following regular expression should cut it:
trunk\/test\/1\/[0-9]*-[a-zA-Z]*-[a-zA-Z]*\/Host-1
It does the following:
trunk\/test\/1\/: Match the constant string trunk/test/1/ (The backslashes are escapes)
[0-9]*-: Match any number of digits followed by a -
[a-zA-Z]*-: Match any number of letters followed by a -
[a-zA-Z]*: Match any number of letters
\/Host-1: Match the constant string /Host-1
Here is a link to RegExr which you can use if you want to experiment with different input data or changes to the regex: http://regexr.com/39tgn
The following pattern was provided in the comments: trunk\/test\/1\/.*\/Host-1. It's a bit less strict but does the job as well.
I don't know Objective-C but your regex has a bunch of oddities, if I remove those I get something that I think you'd want to achieve.
Your first character is a ?, that can't be, it's a quantifier in regex that says something about the preceding character (or class or group). If it's the first character, there is no preceding char.
/+\\ <-- unsure what you were trying to do here, but it means '1 or more / followed by \'
[a-zA-Z0-9_\\-\\.] can be done much shorter like: [\w.-] and if you place the + within the parentheses it will capture the entire unknown string in capture group 1.
From comments: So each %@ is variable text; the first is always just letters, the second is always just numbers. That would be [a-zA-Z]+ and \d+ respectively in a regex. But actually I would use [^/]+ (any character that isn't /) so that the code doesn't break when someone puts a different character in this path, like trunk/this_text/4/.../Host-1, which would break on the _.
Combined this makes (changed after comments):
trunk/[^/]+/[^/]+/([\w.-]+)/Host-1
Debuggex Demo
Note that this is the raw pattern, without the escaping needed to get it through a string literal into the regex engine. Unlike a C# verbatim string, an Objective-C @"..." literal still processes backslash escapes, so each backslash has to be doubled (\\w rather than \w).
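A rough sketch of how such a pattern might be built and used from Objective-C. MiddlePathComponent is a hypothetical helper name, and this version assumes the two variable path components should be matched literally, which is why they go through escapedPatternForString: before being formatted into the pattern.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: extracts the variable middle component from a path such as
// trunk/test/1/5-hello-world/Host-1. Returns nil when the path does not match.
static NSString *MiddlePathComponent(NSString *path, NSString *section, NSString *number) {
    // Backslashes are doubled because this is an Objective-C string literal.
    NSString *pattern = [NSString stringWithFormat:@"trunk/%@/%@/([\\w.-]+)/Host-1",
        [NSRegularExpression escapedPatternForString:section],
        [NSRegularExpression escapedPatternForString:number]];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }

    NSTextCheckingResult *match =
        [regex firstMatchInString:path options:0 range:NSMakeRange(0, path.length)];
    if (match == nil) {
        return nil;
    }
    // Capture group 1 holds the part between the fixed prefix and /Host-1.
    return [path substringWithRange:[match rangeAtIndex:1]];
}
```

Passing the error pointer instead of NULL makes a malformed pattern, such as one starting with ?, show up in the log rather than silently yielding a nil regex.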

Regex (searching for function(@"string content")) to get "string content"

I have a little regex problem (don't we all sometimes).
The few pieces of code are from Objective C but regex expressions are still the same I believe.
I have two functions called
NSString * CRLocalizedString(NSString *key)
NSString * CRLocalizedArgString(NSString *key, ...)
These are scattered around my project for localisation.
Now I want to find them all.
Well go to directory, parse all files, etc
All fine there.
The regexes I use on the files are
[NSRegularExpression regularExpressionWithPattern:@"CRLocalizedString\\(@\\\"[^)]+\\\"\\)" options:0 error:&error];
[NSRegularExpression regularExpressionWithPattern:@"CRLocalizedArgString\\([^)]+\\)" options:0 error:&error];
And this works perfectly, except that my terminating character is a ).
The problem occurs with function calls like this
CRLocalizedString(@"Happy =), o so happy =D");
CRLocalizedArgString(@"Filter (%i)", 0.75f);
The regex ends the string at "Filter (%i" and at "Happy =)".
And this is where my regex knowledge ends and I do not know what to do anymore.
I thought of using ");" as the end, but this isn't always the case.
So I was hoping someone here has a suggestion (completely different approaches than regex are also welcome, of course).
Kind regards
Saren
Let's write your first regex without the extra level of C escapes:
CRLocalizedString\(@\"[^)]+\"\)
You don't have to escape a " for a regex, so let's get rid of those extra backslashes:
CRLocalizedString\(@"[^)]+"\)
So, you want to match a quoted string using "[^)]+". But that doesn't match every quoted string.
What is a quoted string? It's a ", followed by any number of string atoms, followed by another ". What is a string atom? It's any character except " or \, or a \ followed by any character. So here's a regex for a quoted string:
"([^"\\]|\\.)*"
Sticking that back into your first regex, we get this:
CRLocalizedString\(@"([^"\\]|\\.)*"\)
Here's a link to a regex tester demonstrating that regex.
Quoting it in an Objective-C string literal gives us this:
@"CRLocalizedString\\(@\"([^\"\\\\]|\\\\.)*\"\\)"
It is impossible to write a regex to match calls to CRLocalizedArgString in the general case, because such calls can take arbitrary expressions as arguments, and regexes cannot match arbitrary expressions (because they can contain arbitrary levels of nested parentheses, which regexes cannot match).
You could just hope that there are no parentheses in the argument list, and use this regex:
CRLocalizedArgString\(@"([^"\\]|\\.)*"[^)]*\)
Here's a link to a regex tester demonstrating that regex.
Quoting it in an Objective-C string literal gives us this:
@"CRLocalizedArgString\\(@\"([^\"\\\\]|\\\\.)*\"[^)]*\\)"
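A small sketch of the CRLocalizedString pattern in use. LocalizedKeysInSource is a made-up helper name, and the one change to the pattern is an assumption for convenience: the string contents are wrapped in an extra capture group (with the alternation made non-capturing) so the key can be read from group 1.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects the string argument of every
// CRLocalizedString(@"...") call found in a piece of source text.
static NSArray<NSString *> *LocalizedKeysInSource(NSString *source) {
    // Raw regex: CRLocalizedString\(@"((?:[^"\\]|\\.)*)"\)
    NSString *pattern = @"CRLocalizedString\\(@\"((?:[^\"\\\\]|\\\\.)*)\"\\)";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *keys = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, source.length);
    for (NSTextCheckingResult *match in [regex matchesInString:source options:0 range:whole]) {
        // Group 1 is the text between the quotes, still in its escaped source form.
        [keys addObject:[source substringWithRange:[match rangeAtIndex:1]]];
    }
    return keys;
}
```

Because the quoted-string part no longer stops at the first ), a call such as CRLocalizedString(@"Happy =), o so happy =D") should yield the full key.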

Regex in Objective-C and regexpal

I created a regular expression and tested it with a string in RegexPal.
Then I tested it in Objective-C with the same string and I get no result.
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"^Page(\\d+| \\d+)(:| :|)$" options:NSRegularExpressionCaseInsensitive error:&regError];
As you can see, I did add another '\' before the 'd', but I get no result at all.
If I change the regular expression from "^Page(\d+| \d+)(:| :|)$" to "Page(\d+| \d+)(:| :|)", I get way too many results. It's as if my 'AND' conditions were understood as 'OR' conditions. Does anybody have an idea of what is happening?
EDIT :
The regular expression "^page(\d+| \d+)(:| :|)$" applied to the string "page 15 :" returns three results: "page 15 :", "15", and ":". I only want the first one. Like I said, it's as if my AND is transformed into an OR/AND. I would like the number and the colon (optional, as my regex says) to always be attached to 'page'.
Turn on the multi-line option. Then the anchors ^ and $ will mean beginning and end of line.
By default, ^ and $ mean beginning and end of the entire string.
In RegexPal you can see that the "Match at line-breaks (m)" option is checked.
edit
If you are getting too much sub-expression data, then you should replace the capturing groups with cluster (non-capturing) groups: (..) -> (?: ..).
This is an 'extended' context.
If you can't do that, then just go with the data in group 0, which is the entire match, and ignore the rest. Not sure how to do this.
As pointed out by sln (who solved that issue), you don't match anything in Objective-C because the multiline option is off by default, whereas in RegexPal it is on. With NSRegularExpression you can turn it on by passing NSRegularExpressionAnchorsMatchLines or by prefixing the pattern with (?m).
Regarding your regex, it could be improved with ^(Page\s*\d+\s*:?\s*)$. The question mark means that the preceding character is optional, and \s matches any whitespace character (space, tab, etc.).
Regarding your selection issue, parentheses in a regex are what create capture groups. So if you write (Page( \d+|\d+)) you'll get two different captures. What you wanted was (Page(?: \d+|\d+)), since (?: ) groups without capturing anything. But | isn't usually needed when a simple ? does the trick.
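A minimal sketch combining both suggestions: line-anchored matching and reading only the whole match (group 0). PageHeadings is a hypothetical helper name. Note that \s also matches line breaks, so on input with blank lines a match may include a trailing newline.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns each line of the form "Page 15 :" found in text.
static NSArray<NSString *> *PageHeadings(NSString *text) {
    // AnchorsMatchLines makes ^ and $ match at line boundaries instead of only
    // at the start and end of the whole string.
    NSRegularExpressionOptions options =
        NSRegularExpressionCaseInsensitive | NSRegularExpressionAnchorsMatchLines;

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^Page\\s*\\d+\\s*:?\\s*$"
                                                  options:options
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *headings = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, text.length);
    for (NSTextCheckingResult *match in [regex matchesInString:text options:0 range:whole]) {
        // match.range is group 0, the entire match; no capture groups are involved.
        [headings addObject:[text substringWithRange:match.range]];
    }
    return headings;
}
```

Each NSTextCheckingResult is one match; the "15" and ":" seen earlier were capture groups of a single match, not separate matches.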

OS X Using literal asterisk in regular expression

I'm writing a program to make text that begins with /* and ends with */ a different color (syntax highlighting for a C comment). When I try this
@"/\*.*\*/";
I get unknown escape sequence. So I figured that to get a literal asterisk I had to use this
@"/[*].*[*]/";
and I get no errors, but when I use this code
commentPattern = @"/[*].*[*]/";
reg = [NSRegularExpression regularExpressionWithPattern:commentPattern options:kNilOptions error:nil];
results = [reg matchesInString:self.string options:kNilOptions range:NSMakeRange(0, [self.string length])];
for (NSTextCheckingResult *result in results)
{
[self setTextColor:[NSColor colorWithCalibratedRed:0.0 green:0.7 blue:0.0 alpha:1.0] range:result.range];
}
the text color of the comments doesn't change, but I don't see anything wrong with my regular expression. Can someone tell me why this won't work? I don't think it's a problem with the way I get the results or change their color, because I use the same method for other regular expressions.
You want to use this: "\\*".
\* is the escape sequence for * in regular expressions, but in C strings, \ also begins an escaped character token, so you have to escape that as well.
@"/\*.*\*/";
I get unknown escape sequence.
A string first converts escape sequences in the string, then the result is handed over to the regex engine. For instance, an escape sequence might be \t, which represents a tab, or \n which represents a newline. The string first converts an escape sequence to a special code. Your error is saying that \* is not a legal escape sequence for an NSString.
The regex engine needs to see a literal backslash followed by a *. To get a literal backslash in a string you need to write \\. However, for readability I prefer using a character class, as you did in your second attempt.
You should NSLog what the results array contains to see what matches you are getting. If the matches are what you expect, then the problem is not with the regex.
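A short sketch of that debugging step, using the doubled-backslash form of the pattern. CommentMatches is a hypothetical helper name; it only collects and logs what the regex matched and does not touch any text coloring.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns (and logs) every substring matched by the
// C-comment pattern, so the matches can be inspected before coloring anything.
static NSArray<NSString *> *CommentMatches(NSString *source) {
    // "\\*" in the literal becomes \* in the regex: a literal asterisk.
    NSString *commentPattern = @"/\\*.*\\*/";

    NSError *error = nil;
    NSRegularExpression *reg =
        [NSRegularExpression regularExpressionWithPattern:commentPattern
                                                  options:0
                                                    error:&error];
    if (reg == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *matches = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, source.length);
    for (NSTextCheckingResult *result in [reg matchesInString:source options:0 range:whole]) {
        NSString *text = [source substringWithRange:result.range];
        NSLog(@"match at %@: %@", NSStringFromRange(result.range), text);
        [matches addObject:text];
    }
    return matches;
}
```

If the logged ranges are the ones you expect, the pattern is fine and the coloring code is the next place to look.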

Filtering out substring from NSString . . .maybe using regex

Here is my problem:
I am trying to filter out html tags from an NSString object.
Most fixes for this simply remove everything falling between a < and a >, as well as those characters themselves. I am trying to figure out a way to remove the "< . . . >" substring ONLY if it does not contain whitespace or newline characters.
The way I was thinking about doing it looks something like this
while ([source rangeOfString:@"someRegEx" options:NSRegularExpressionSearch].location != NSNotFound) {
//find the range of the substring
//check for newlines/whitespace characters
//replace occurrences of the string with "" if it doesn't have them
}
Firstly, does this seem like a good approach? Secondly, I'm having a lot of problems with figuring out what that regex would look like... does anyone have any ideas what it might look like?
This seems like a fine approach, provided the tags you're looking for really never contain whitespace, as m.buettner points out. The regex would look something like this:
<[^\s]*?>
The [^\s] is a negated character class which matches anything but whitespace characters. The ? makes the * lazy instead of greedy. So this regex in English means "Match a '<', then the smallest possible number of non-whitespace characters, then a '>'."
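A minimal sketch of applying that regex without a manual loop, assuming plain removal is all that is needed. StripSimpleTags is a made-up helper name; the NSString replacement method with NSRegularExpressionSearch handles every occurrence in one call.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes every "<...>" run that contains no whitespace.
static NSString *StripSimpleTags(NSString *source) {
    // The backslash is doubled for the string literal; the regex is <[^\s]*?>
    return [source stringByReplacingOccurrencesOfString:@"<[^\\s]*?>"
                                             withString:@""
                                                options:NSRegularExpressionSearch
                                                  range:NSMakeRange(0, source.length)];
}
```

Anything with whitespace between the angle brackets, such as "a < b > c" or a tag with attributes, is left untouched, which matches the requirement in the question.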
This is a helpful page.
Maybe you should consider employing an NSXMLParser, described here.
You get quite a rich set of delegate methods to extract whatever you like from the string.