NSRegularExpression matching and replacing with exclude - objective-c

I'm working on a small iOS app and got stuck creating a pattern for the NSRegularExpression class. I need a pattern that matches a specific word so I can replace it, but it must skip the word when it has already been replaced. That way, if the user processes the same text several times, the replacement happens only once.
Example:
I need to find and replace every "yes" in a given text with "probably yes". But the "yes" inside "probably yes" must not be replaced when the user processes the text again, so the result never becomes "probably probably yes".
NSRegularExpression *regexYesReplace = [NSRegularExpression regularExpressionWithPattern:@"some pattern" options:0 error:&error];
NSString *replacementStringYesReplace = @"probably yes";
replacedText = [regexYesReplace stringByReplacingMatchesInString:afterText options:options range:range withTemplate:replacementStringYesReplace];
I tried to adapt the pattern from this question and fixed the syntax for NSRegularExpression, but it didn't work:
Regex replace text but exclude when text is between specific tag
Maybe someone has had the same problem. Thanks in advance.

You can use a negative look-behind:
(?<!probably )yes
Regex Demo
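To make the look-behind concrete, here is a minimal sketch of how it could be wired into NSRegularExpression. The helper name ReplaceYesOnce is hypothetical, and the pattern is used exactly as given above, so it also matches "yes" inside longer words.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces "yes" with "probably yes" unless the match
// is already preceded by "probably ".
static NSString *ReplaceYesOnce(NSString *text) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"(?<!probably )yes"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return text;
    }
    return [regex stringByReplacingMatchesInString:text
                                           options:0
                                             range:NSMakeRange(0, text.length)
                                      withTemplate:@"probably yes"];
}
```

Running the helper on its own output should leave the text unchanged, because every "yes" is then preceded by "probably ".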

Related

CoreData NSPredicate MATCHES regex

I have a CoreData table with a field holding a string of a series of numbers separated by commas. I want to be able to run a fetch with a predicate that will match against a given specific number.
For example if fieldName = "12,52,66,89,2,8"
And I want to search for 2, then it should match the second to last number in the string and include that record in the results.
Using the regular expression:
^2|,2,|,2
I have found it working satisfactorily for my test cases, testing it using this site for example: https://www.regexpal.com/
However, when I pass this into an NSPredicate for a NSFetchRequest, I can't get it to match
NSNumber *val = @2;
NSString *regex = [NSString stringWithFormat:@"^%@|,%@,|,%@", val, val, val];
NSPredicate *pred = [NSPredicate predicateWithFormat:@"fieldName MATCHES %@", regex];
Replacing the MATCHES with a CONTAINS val makes it work, but of course it will also incorrectly match any occurrence of the digits.
I suspect I am missing something stupid about formatting for CoreData (or regex), but I've tried many variations, and I'm hoping a kind soul reading this will put me out of my misery :)
Disclaimer: I haven't used Objective C. This answer is based on my regex knowledge and some documentation.
MATCHES
The left hand expression equals the right hand expression using a regex-style comparison according to ICU v3 (for more details see the ICU User Guide for Regular Expressions).
That sounds like Java's "matches" method, which must match the entire string, so "^2|,2,|,2" cannot match a full list such as "12,52,66,89,2,8". regexpal differs because it searches for the pattern anywhere in the text. The regex you would need is more like
.*\b2\b.*
(the ^ and $ anchors are implied, as in Java). Another option is to split the string.
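The whole-string behavior can be checked outside Core Data by evaluating a predicate against a plain string. This is only a sketch: ListContainsNumber is a hypothetical helper, and it uses SELF instead of fieldName so it can run without a managed object.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if the comma-separated list contains the number
// as a complete entry. MATCHES has to cover the whole string, hence the
// leading and trailing ".*".
static BOOL ListContainsNumber(NSString *list, NSNumber *value) {
    NSString *regex = [NSString stringWithFormat:@".*\\b%@\\b.*", value];
    NSPredicate *pred = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regex];
    return [pred evaluateWithObject:list];
}
```

For a fetch request the same regex string would be passed with the fieldName key path, as in the question.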

Objective C - NSRange and rangeOfString

I have a little problem with NSRange and rangeOfString. Searching for a substring in a given string works fine, but only for an exact string. The problem is that I need to find a substring that always begins the same way and always ends the same way. I already tried something like this:
match = [strIn rangeOfString: @"turni/begin/*/end"];
But that doesn't work, so I need another way to do this. Here is the relevant part of the code in full:
NSRange match;
match = [strIn rangeOfString: @"turni/begin/sHjeUUej/end"];
NSRange range = NSMakeRange(match.location, match.length);
NSString *strOut = [strIn substringWithRange:range];
You see the string "turni/begin/sHjeUUej/end" will always be the same except for the part "sHjeUUej". Hope someone can help me.
Thanks in advance.
Use a regular expression with:
- (NSRange)rangeOfString:(NSString *)aString options:(NSStringCompareOptions)mask
passing the NSRegularExpressionSearch option.
See the ICU User Guide on Regular Expressions for information on creating regular expressions.
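As a sketch of that suggestion, the variable middle part can be expressed as [^/]+ (one or more characters that are not a slash). The helper name ExtractTurniPath and the choice of [^/]+ are assumptions; adjust the character class if the middle part may contain slashes.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the first "turni/begin/<anything>/end"
// substring, or nil if there is none.
static NSString *ExtractTurniPath(NSString *strIn) {
    NSRange match = [strIn rangeOfString:@"turni/begin/[^/]+/end"
                                 options:NSRegularExpressionSearch];
    if (match.location == NSNotFound) {
        return nil;  // check before calling substringWithRange:
    }
    return [strIn substringWithRange:match];
}
```

Checking for NSNotFound matters here, since substringWithRange: raises an exception for an out-of-bounds range.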
You can use the prefix and suffix:
if ([strIn hasPrefix:@"turni/begin/"] && [strIn hasSuffix:@"end"]) {
    // match
}
You can use a simpler solution if you are sure that your string always starts with turni/begin/ and ends with /end.
You can use:
NSString *strOut = [[strIn stringByReplacingOccurrencesOfString:@"turni/begin/" withString:@""] stringByReplacingOccurrencesOfString:@"/end" withString:@""];
With that, you can retrieve the string between the two markers in a single line of code and with fewer comparisons.

replace matches in NSString with template using NSRegularExpression

I'm trying to detect <br>, <Br>, < br>, and similar variants in an NSString and replace them with \n.
I use NSRegularExpression and wrote this code:
NSString *string = @"123 < br><br>1245; Ross <Br>Test 12<br>";
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"<[* ](br|BR|bR|Br|br)>" options:NSRegularExpressionCaseInsensitive error:&error];
NSString *modifiedString = [regex stringByReplacingMatchesInString:string options:0 range:NSMakeRange(0, [string length]) withTemplate:@"\n"];
NSLog(@"%@", modifiedString);
It works, but it replaces only the first match instead of all matches. Please help me detect all matches and replace them.
Thanks
You currently don't handle an arbitrary amount of white space. For good measure you should also handle white space after br, and handle the closing slash, since <br /> is also a common way of writing the line break in HTML.
You would end up with a pattern that looks like this
<\s*(br|BR|bR|Br|br)\s*\/*>
or written as a NSRegularExpression
NSError *error = NULL;
NSRegularExpression *regex =
[NSRegularExpression regularExpressionWithPattern:@"<\\s*(br|BR|bR|Br|br)\\s*\\/*>"
options:0
error:&error];
Edit
You could also make the pattern more compact by separating the two letters
<\s*([bB][rR])\s*\/*>
You're close. The pattern needs to handle any number of spaces after the initial <, including no space at all.
Using your example, the regex <\s*(br|BR|bR|Br|br)> accepts 0 to N spaces before the br. You can simplify it further by making it case insensitive with the i flag, which gives a cleaner regex that covers all the case variations of br. To do that, use (?i)<\s*br>.
For completeness you can also allow an arbitrary amount of space AFTER the br, just to handle anything that could be thrown at it. I agree with adding a catch for /> at the end of the pattern, since <br/> is valid HTML as well. It makes the regex look a little more crazy, but it boils down to adding the other 3 pieces.
(?i)<\s*br\s*\/?\s*>
It looks really scary, but breaks down very simply into a few parts:
(?i) turns on case insensitive to handle the variations on the br.
<\s* is the start of the tag directly followed by an arbitrary number of spaces.
br\s* is your br chars followed by an arbitrary number of spaces.
\/? handles 0 or 1 instances of the closing slash (to cover valid HTML tags like <br/> and <br>).
\s*> is handling an arbitrary number of spaces and then the closing >.
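Putting the pieces together, a minimal sketch of the replacement might look like this. The helper name ReplaceLineBreakTags is hypothetical, and the case-insensitive option is passed as NSRegularExpressionCaseInsensitive rather than as an inline (?i) flag, which should behave the same way for this pattern.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces every <br> variant (any case, optional
// spaces, optional closing slash) with a newline.
static NSString *ReplaceLineBreakTags(NSString *html) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<\\s*br\\s*/?\\s*>"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return html;
    }
    return [regex stringByReplacingMatchesInString:html
                                           options:0
                                             range:NSMakeRange(0, html.length)
                                      withTemplate:@"\n"];
}
```

stringByReplacingMatchesInString: already replaces every match, so no loop is needed once the pattern accepts all the variants.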

Use regex to evaluate string for repeated section

I haven't used regular expressions yet in Objective-C. What I'm trying to do right now is evaluate a string to see if it contains a 4 or 5 character repeating pattern. Any pattern will do. For instance, a string like @"testA54RqA54Rq" would return a true value from the regex, while a string like @"testA54Rq" would not. Right now I'm just generating all possible 4 and 5 character substrings and matching them against each other, but obviously this is extremely inefficient. Where can I find some resources about how to start using regular expressions in Objective-C? If anyone's been in this situation before, a small example would be nice.
-EDIT-
I would also like to have something like @"testQWEr30BKRe40" return true (a pattern of 4 letters followed by 2 numbers). I'm not sure if this is possible.
You probably want to look at:
https://developer.apple.com/library/ios/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html
The actual regex I believe would just be: (\\w{4,5})\\1
NSString *regexStr = @"(\\w{4,5})\\1";
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:regexStr options:0 error:&error];
if ((regex==nil) && (error!=nil)) {
    NSLog(@"Regex failed for: %@, error was: %@", regexStr, error);
} else {
    // regex is valid and ready to use
}
For exact patterns you will be able to do such validation with regex (.{4,5})\\1
If you want to do category pattern, such as 4 letters followed by 2 numbers, then you have to:
replace all letters with one constant letter (for example replace [a-zA-Z] with X)
replace all numbers with one constant number (for example replace \\d with 0)
validate such modified input with the same regex as shown above
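Both checks can be sketched as follows. The helper names are hypothetical, and the second function simply follows the three normalization steps listed above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if a 4 or 5 character run is immediately repeated.
static BOOL HasRepeatedSection(NSString *input) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"(\\w{4,5})\\1"
                                                  options:0
                                                    error:NULL];
    NSRange all = NSMakeRange(0, input.length);
    return [regex firstMatchInString:input options:0 range:all] != nil;
}

// Hypothetical helper: maps letters to X and digits to 0, then reuses the
// same regex on the normalized string.
static BOOL HasRepeatedCategoryPattern(NSString *input) {
    NSString *normalized =
        [input stringByReplacingOccurrencesOfString:@"[a-zA-Z]"
                                         withString:@"X"
                                            options:NSRegularExpressionSearch
                                              range:NSMakeRange(0, input.length)];
    normalized =
        [normalized stringByReplacingOccurrencesOfString:@"\\d"
                                              withString:@"0"
                                                 options:NSRegularExpressionSearch
                                                   range:NSMakeRange(0, normalized.length)];
    return HasRepeatedSection(normalized);
}
```

One caveat with the category version: after normalization any run of eight letters also counts as a repeat, so the quantifier may need adjusting to the exact category layout you care about.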

Optimizing scanning large text and matching against list of words or phrases

I'm working on an app that takes an article (simple HTML page), and a list of vocabulary terms (each may be a word, a phrase, or even a sentence), and creates a link for each term it finds. The problem is that for larger texts with more terms it takes a long time. Currently we are dealing with this by initially displaying the unmarked text, processing the links in the background, and finally reloading the web view when processing finishes. Still, it can take a while and some of our users are not happy with it.
Right now the app uses a simple loop on the terms, doing a replacement in the HTML. Basically:
for (int i=0; i<terms.count; i++){
    NSString *term = [terms objectAtIndex:i];
    NSString *replaceString = [NSString stringWithFormat:@"<a href=\"myUrl:\\%d\">%@</a>", i, term];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:term
                                                       withString:replaceString
                                                          options:NSCaseInsensitiveSearch
                                                            range:NSMakeRange(0, [htmlString length])];
}
However, we are dealing with multiple languages, so there is not just one replacement per term, but twenty! That's because we have to deal with punctuation at the beginning (upside-down question marks in Spanish) and end of each term. We have to replace "term", "term.", and "term?" with an appropriate hyperlink.
Is there a more efficient method I could use to get this HTML marked up?
I need to keep the index of the original term so that it can be retrieved later when the user clicks the link.
You could process the text as follows:
Instead of looping over the vocabulary, split the text into words and look up each word in the vocabulary.
Create some index, hash table or dictionary to make the lookup efficient.
Don't use stringByReplacingOccurrencesOfString. Each time it's called it makes a copy of the whole text and won't release the memory until the autorelease pool is drained. (Interestingly, you haven't run into memory problems yet.) Instead use an NSMutableString instance to which you append each word (and the characters between them), either as it was in the original text or decorated as a link.
What you're doing right now is this:
for each vocabulary word 'term'
search the HTML text for instances of term
replace each instance of term with an appropriate hyperlink
If you have a large text, then each search takes that much longer. Further, every time you do a replacement, you have to create a new string containing a copy of the text to do the replacement on, since stringByReplacingOccurrencesOfString:withString:options:range: returns a new string rather than modifying the existing string. Multiply that by N replacements.
A better option would be to make a single pass through the string, searching for all terms at once, and building up the resulting output string in a mutable string to avoid a Shlemiel the Painter-like runtime.
For example, you could use regular expressions like so:
// Create a regular expression that is an alternation of all of the vocabulary
// words. You only need to create this once at startup.
NSMutableString *pattern = [NSMutableString string];
[pattern appendString:@"\\b("];
BOOL isFirstTerm = YES;
for (NSString *term in vocabularyList)
{
    if (!isFirstTerm)
    {
        [pattern appendString:@"|"];
    }
    // Clear the flag outside the if, otherwise no separator is ever added
    isFirstTerm = NO;
    // Escape regex metacharacters so each term is matched literally
    [pattern appendString:[NSRegularExpression escapedPatternForString:term]];
}
[pattern appendString:@")\\b"];
// Create regular expression object
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error];
// Replace vocabulary matches with a hyperlink
NSMutableString *htmlCopy = [NSMutableString stringWithString:htmlString];
// Templates refer to capture groups as $1 (not \1); wrap $1 in the link
// markup you need
[regex replaceMatchesInString:htmlCopy
                      options:0
                        range:NSMakeRange(0, [htmlCopy length])
                 withTemplate:@"$1"];
// Now use htmlCopy
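A template cannot carry the index of the matched term, which the question needs for the link target. One possible way around that, sketched below, is to enumerate the matches and build the output by hand. The helper name LinkTerms is hypothetical, the myUrl: scheme is taken from the question, and the sketch does not skip text inside HTML tags.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: links every vocabulary term in a single pass and
// puts the term's original index into the href.
static NSString *LinkTerms(NSString *html, NSArray<NSString *> *terms) {
    if (terms.count == 0) return html;

    // Remember each term's index, keyed by its lowercase form.
    NSMutableDictionary<NSString *, NSNumber *> *indexByTerm = [NSMutableDictionary dictionary];
    [terms enumerateObjectsUsingBlock:^(NSString *term, NSUInteger idx, BOOL *stop) {
        indexByTerm[term.lowercaseString] = @(idx);
    }];

    // Longest terms first, so "New York" wins over "new" in the alternation.
    NSArray<NSString *> *longestFirst =
        [terms sortedArrayUsingComparator:^NSComparisonResult(NSString *a, NSString *b) {
            if (a.length == b.length) return NSOrderedSame;
            return a.length > b.length ? NSOrderedAscending : NSOrderedDescending;
        }];
    NSMutableArray<NSString *> *escaped = [NSMutableArray array];
    for (NSString *term in longestFirst) {
        [escaped addObject:[NSRegularExpression escapedPatternForString:term]];
    }
    NSString *pattern = [NSString stringWithFormat:@"\\b(?:%@)\\b",
                         [escaped componentsJoinedByString:@"|"]];
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    if (regex == nil) return html;

    // Copy unmatched text through unchanged and wrap each match in a link.
    NSMutableString *output = [NSMutableString stringWithCapacity:html.length];
    __block NSUInteger cursor = 0;
    [regex enumerateMatchesInString:html
                            options:0
                              range:NSMakeRange(0, html.length)
                         usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
        NSRange r = match.range;
        NSString *found = [html substringWithRange:r];
        [output appendString:[html substringWithRange:NSMakeRange(cursor, r.location - cursor)]];
        NSNumber *idx = indexByTerm[found.lowercaseString];
        if (idx != nil) {
            [output appendFormat:@"<a href=\"myUrl:%@\">%@</a>", idx, found];
        } else {
            [output appendString:found];  // no index found, leave text as is
        }
        cursor = NSMaxRange(r);
    }];
    [output appendString:[html substringFromIndex:cursor]];
    return output;
}
```

Terms that begin or end with punctuation may need something other than \b at the edges, since \b only sits between word and non-word characters.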
Since the string replace function you're calling is O(n) (it scans and replaces across n words) and you're calling it once for each of m vocabulary terms, you have an O(n·m) algorithm, which is effectively quadratic when both grow.
If you could do it in one pass, that would be optimal (O(n), where n is the number of words in the HTML). The idea of presenting the un-replaced text first is still a good one unless the delay is unnoticeable even for large docs.
How about a hash set of vocabulary words? Scan through the HTML word by word (skipping HTML markup), and if the current scanned word is in the hash set, append the linked version to the target buffer instead of the scanned word. That lets you keep at most 2x the HTML content plus one hash of vocabulary words in memory.
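A rough sketch of the hash lookup idea, under some loudly stated assumptions: it only handles single-word terms, it does not skip HTML markup, and it relies on NSStringEnumerationByWords for word splitting. The helper name LinkWords is hypothetical, and the myUrl: scheme comes from the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: walks the text word by word and links each word
// found in the vocabulary, using a dictionary for constant-time lookup.
static NSString *LinkWords(NSString *text, NSArray<NSString *> *terms) {
    NSMutableDictionary<NSString *, NSNumber *> *indexByWord = [NSMutableDictionary dictionary];
    [terms enumerateObjectsUsingBlock:^(NSString *term, NSUInteger idx, BOOL *stop) {
        indexByWord[term.lowercaseString] = @(idx);
    }];

    NSMutableString *output = [NSMutableString stringWithCapacity:text.length];
    __block NSUInteger cursor = 0;
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange,
                                       NSRange enclosingRange, BOOL *stop) {
        // Copy the characters between the previous word and this one.
        [output appendString:[text substringWithRange:
                              NSMakeRange(cursor, wordRange.location - cursor)]];
        NSNumber *idx = indexByWord[word.lowercaseString];
        if (idx != nil) {
            [output appendFormat:@"<a href=\"myUrl:%@\">%@</a>", idx, word];
        } else {
            [output appendString:word];
        }
        cursor = NSMaxRange(wordRange);
    }];
    [output appendString:[text substringFromIndex:cursor]];
    return output;
}
```

Because punctuation is never part of a word here, "term", "term." and "term?" all resolve to the same dictionary entry.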
There are two approaches.
Hash maps: if the maximum length of your phrases is bounded, for example by two words, you can iterate over all words and bigrams (two-word sequences) and look each one up in a hash map. The complexity is linear, since a hash lookup takes constant time in the ideal case.
Automaton theory
You can combine the simple automata that each match one string into a single automaton that evaluates faster (i.e. dynamic programming). For example, "John Smith"|"John Stuard" can be merged into John S(mith|tuard). This is called prefix optimization (http://code.google.com/p/graph-expression/wiki/RegexpOptimization)
A more advanced algorithm can be found here: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
I prefer this approach because it places no limit on phrase length and it allows complex regexps to be combined.