Match several times in RegexKitLite - objective-c

I'm trying to get some info out of a document. I'm trying to match the info I need with regex, which matches 3 numbers within a string. It works fine, but it only matches the first occurrence. I need it to match an unlimited number of times because I don't know how many times this string will occur.
NSString *regex = @"String containing data:(\\d+) and more data:(\\d+) and so on";
NSArray *captures = [document captureComponentsMatchedByRegex:regex];
for (NSString *match in captures) {
    NSLog(@"%@", match);
}
The above code prints out 3 strings - The entire string, the first data and the second data. All good, but now I need it to keep searching the document, because similar strings will occur n times.
How do I do this? And is there any way to group the matches into an array for each string or something like that?

Use the arrayOfCaptureComponentsMatchedByRegex: method. That will return an NSArray of NSArray objects, and each nested NSArray object will have the captures (index 0 being the string, index 1 being the first capture, etc).
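For readers without RegexKitLite at hand, the shape of that result (an array of arrays, one inner array per match) can be reproduced with Foundation's NSRegularExpression. This is a sketch of the same idea, not RegexKitLite's own implementation; the helper name AllCaptureComponents, the sample text and the shortened pattern are made up for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: one inner array per match, with index 0 holding the
// whole match and indexes 1..n holding the capture groups.
static NSArray<NSArray<NSString *> *> *AllCaptureComponents(NSString *text, NSString *pattern) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }
    NSMutableArray<NSArray<NSString *> *> *result = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        NSMutableArray<NSString *> *components = [NSMutableArray array];
        for (NSUInteger i = 0; i < match.numberOfRanges; i++) {
            NSRange range = [match rangeAtIndex:i];
            // A group that took no part in the match is stored as an empty string.
            NSString *piece = (range.location == NSNotFound) ? @"" : [text substringWithRange:range];
            [components addObject:piece];
        }
        [result addObject:components];
    }
    return result;
}

int main(void) {
    @autoreleasepool {
        NSString *document = @"data:1 and more data:2; data:30 and more data:40";
        NSString *pattern = @"data:(\\d+) and more data:(\\d+)";
        for (NSArray<NSString *> *captures in AllCaptureComponents(document, pattern)) {
            NSLog(@"whole=%@ first=%@ second=%@", captures[0], captures[1], captures[2]);
        }
    }
    return 0;
}
```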

Related

CoreData NSPredicate MATCHES regex

I have a CoreData table with a field holding a string of a series of numbers separated by commas. I want to be able to run a fetch with a predicate that will match against a given specific number.
For example if fieldName = "12,52,66,89,2,8"
And I want to search for 2, then it should match the second to last number in the string and include that record in the results.
Using the regular expression:
^2|,2,|,2
I have found it working satisfactorily for my test cases, testing it using this site for example: https://www.regexpal.com/
However, when I pass this into an NSPredicate for a NSFetchRequest, I can't get it to match
NSNumber *val = @2;
NSString *regex = [NSString stringWithFormat:@"^%@|,%@,|,%@", val, val, val];
NSPredicate *pred = [NSPredicate predicateWithFormat:@"fieldName MATCHES %@", regex];
Replacing the MATCHES with a CONTAINS val makes it work, but of course it will also incorrectly match any occurrence of the digits.
I suspect I am missing something stupid about formatting for CoreData (or regex), but I've tried many variations, and I'm hoping a kind soul reading this will put me out of my misery :)
Disclaimer: I haven't used Objective C. This answer is based on my regex knowledge and some documentation.
MATCHES
The left hand expression equals the right hand expression using a regex-style comparison according to ICU v3 (for more details see the ICU User Guide for Regular Expressions).
That sounds like how Java uses the method "matches" in which case "^2|,2,|,2" can never match the entire string. This differs from regexpal which will always search the text. The regex you would need is more like
.*\b2\b.*
(the ^$ are assumed in Java). Another option is to split the string.
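The whole-string behaviour of MATCHES can be checked without Core Data by evaluating a predicate against plain strings. The sketch below assumes in-memory evaluation with SELF standing in for fieldName; the helper name FieldContainsNumber and the sample rows are hypothetical, and whether a particular persistent store evaluates the predicate identically is not verified here.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if the comma-separated field contains `val` as a whole number.
static BOOL FieldContainsNumber(NSString *field, NSNumber *val) {
    // MATCHES must cover the entire string, so the number is wrapped in ".*".
    // \b keeps "2" from matching inside "12" or "22".
    NSString *regex = [NSString stringWithFormat:@".*\\b%@\\b.*", val];
    NSPredicate *pred = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regex];
    return [pred evaluateWithObject:field];
}

int main(void) {
    @autoreleasepool {
        NSArray<NSString *> *rows = @[ @"12,52,66,89,2,8", @"12,52,66", @"2", @"22,3" ];
        for (NSString *row in rows) {
            NSLog(@"%@ -> %@", row, FieldContainsNumber(row, @2) ? @"match" : @"no match");
        }
    }
    return 0;
}
```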

How to get MFA pattern using regex in Objective-C?

I am using the regex ((.*)?:)?(.*)\\/([0-9]+|[n])? to match pattern of type module:function/arity, where arity can be any number >= 0 or the string n.
Success cases should match:
foo:bar/1
bar/1
foo:bar/0
foo:bar/n
bar/n
This seems to work fine at https://regex101.com/r/AtI5Nw/3, but using the following code, I am getting only one match group for "mod:func/1".
+ (NSArray<NSTextCheckingResult *> *)matchesInString:(NSString *)string withExpression:(NSRegularExpression *)pattern {
return [pattern matchesInString:string options:0 range:NSMakeRange(0, [string length])];
}
I tried with "mod:func/1" string and I am getting only one match. How to get all matching groups as in the screenshot? I want to get the module, function and arity parts from the string.
It's been a while since I've done this, but...
matchesInString:... returns an array of NSTextCheckingResult objects. Each object represents a single match of the entire regex within the string.
Each NSTextCheckingResult object encapsulates a number of "ranges" (see numberOfRanges property). You then use rangeAtIndex: to extract the range of each group within that match instance.
If each target is in a separate string, you don't need matchesInString:..., simply use firstMatchInString:... to obtain the one, and only, NSTextCheckingResult for your string. You can then extract each group by getting its range, then return to the original string to extract the text of that component.
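As a rough illustration of the rangeAtIndex: approach, the sketch below runs the pattern from the question against two of the sample inputs and pulls out each group. GroupText is a hypothetical helper; a group that did not take part in the match reports a location of NSNotFound, so that case is checked before calling substringWithRange:.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: text of one capture group, or nil if the group did not participate.
static NSString *GroupText(NSTextCheckingResult *match, NSUInteger index, NSString *source) {
    NSRange range = [match rangeAtIndex:index];
    return (range.location == NSNotFound) ? nil : [source substringWithRange:range];
}

int main(void) {
    @autoreleasepool {
        NSString *pattern = @"((.*)?:)?(.*)\\/([0-9]+|[n])?";
        NSError *error = nil;
        NSRegularExpression *regex =
            [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
        if (regex == nil) {
            NSLog(@"Invalid pattern: %@", error);
            return 1;
        }
        for (NSString *input in @[ @"mod:func/1", @"bar/n" ]) {
            NSTextCheckingResult *match =
                [regex firstMatchInString:input options:0 range:NSMakeRange(0, input.length)];
            if (match == nil) {
                continue;
            }
            // Group 2 is the module, group 3 the function, group 4 the arity.
            NSLog(@"module=%@ function=%@ arity=%@",
                  GroupText(match, 2, input), GroupText(match, 3, input), GroupText(match, 4, input));
        }
    }
    return 0;
}
```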

Get invisible decimal value in NSString objc

I have a string of random names, each prefixed with an invisible decimal value equal to the name's length. I need to retrieve the names, which are obviously of different lengths. I want the names in an array, so my idea is to use stringByReplacingOccurrencesOfString:withString: to insert the word "trunk" at the beginning and end of each name. However, I am having trouble finding the index of the end of a name (from the decimal value). Here is my code:
trimmed1 = [[trimmed1 stringByReplacingOccurrencesOfString:sp withString:@"trunk"] mutableCopy];
NSString *trunk = @"trunk%d"; // add the ghost decimal at the end of prefix in order to get its value
NSRange range = [trimmed1 rangeOfString:trunk];
int ghost = [trunk characterAtIndex:5];
NSMutableString *mu = [NSMutableString stringWithString:trimmed1];
[mu insertString:@"trunk" atIndex:range.location + range.length + ghost];
I get the error [__NSCFString insertString:atIndex:]: Range or index out of bounds.
You are misunderstanding what %d means.
In a format used for creating a string it means "insert the value of an integer argument formatted as a string".
When matching one string against another it means "match the characters %d", i.e. it is not special in any way.
You are getting an error as your string does not contain the characters "trunk%d". If you check the return value of rangeOfString: you will find it is returning a failure indication - read the documentation for how to test for that value.
For the simple task of matching an arbitrary decimal number, try looking at the NSString method rangeOfCharacterFromSet:.
You can also solve this problem with the classes NSScanner and NSRegularExpression.
HTH
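To make the NSScanner suggestion concrete, here is a minimal sketch. It assumes the length prefix is written as ordinary decimal digits immediately before each name (for example "5Alice3Bob"); the question does not show the real data, so that format, the helper name NamesFromLengthPrefixed and the sample input are all assumptions. It also shows the NSNotFound check for the literal "%d" search.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper. Assumed input format: a decimal length followed by
// exactly that many characters, repeated (e.g. "5Alice3Bob").
static NSArray<NSString *> *NamesFromLengthPrefixed(NSString *input) {
    NSMutableArray<NSString *> *names = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    scanner.charactersToBeSkipped = nil;  // do not silently skip whitespace
    NSInteger length = 0;
    while (![scanner isAtEnd] && [scanner scanInteger:&length]) {
        NSUInteger start = scanner.scanLocation;
        // Stop rather than read past the end of the string.
        if (length < 0 || start + (NSUInteger)length > input.length) {
            break;
        }
        [names addObject:[input substringWithRange:NSMakeRange(start, (NSUInteger)length)]];
        scanner.scanLocation = start + (NSUInteger)length;
    }
    return names;
}

int main(void) {
    @autoreleasepool {
        NSString *input = @"5Alice3Bob7Charlie";

        // "%d" is not a wildcard when searching, so this search finds nothing.
        NSRange literal = [input rangeOfString:@"%d"];
        if (literal.location == NSNotFound) {
            NSLog(@"The characters %%d do not occur in the input");
        }

        NSLog(@"%@", NamesFromLengthPrefixed(input));
    }
    return 0;
}
```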

Use regex to evaluate string for repeated section

I haven't used regular expressions yet in objective-c. What I'm trying to do right now is evaluate a string to see if it contains a 4 or 5 character repeating pattern - any pattern, it doesn't matter. For instance, a string like @"testA54RqA54Rq" would return a true value from the regex, while a string like @"testA54Rq" would not. Right now I'm just generating all possible 4 and 5 character substrings and matching them to each other, but obviously this is extremely inefficient. Where can I find some resources about how to start using regular expressions in objective C? If anyone's been in this situation before a small example would be nice.
-EDIT-
I would also like to have something like @"testQWEr30BKRe40" return true (pattern of 4 letters followed by 2 numbers). I'm not sure if this is possible.
You probably want to look at:
https://developer.apple.com/library/ios/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html
The actual regex I believe would just be: (\\w{4,5})\\1
NSString *regexStr = @"(\\w{4,5})\\1";
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:regexStr options:0 error:&error];
if (regex == nil) {
    NSLog(@"Regex failed for: %@, error was: %@", regexStr, error);
} else {
    // `string` is the input being tested
    BOOL hasRepeat = [regex numberOfMatchesInString:string options:0 range:NSMakeRange(0, string.length)] > 0;
}
For exact patterns you will be able to do such validation with regex (.{4,5})\\1
If you want to do category pattern, such as 4 letters followed by 2 numbers, then you have to:
replace all letters with one constant letter (for example replace [a-zA-Z] with X)
replace all numbers with one constant number (for example replace \\d with 0)
validate such modified input with the same regex as shown above
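The steps above can be sketched as follows, using the regex search option on NSString. ContainsMatch and CategoryForm are hypothetical helper names. One thing to watch with the category form: once every letter is an X, any run of eight letters already looks like a repeat, so a stricter pattern tied to the shape you care about (for example (X{4}0{2})\1 for four letters plus two digits) may be a better fit.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if `pattern` matches anywhere in `text`.
static BOOL ContainsMatch(NSString *text, NSString *pattern) {
    return [text rangeOfString:pattern options:NSRegularExpressionSearch].location != NSNotFound;
}

// Hypothetical helper: collapses every ASCII letter to "X" and every digit to "0".
static NSString *CategoryForm(NSString *text) {
    NSString *letters = [text stringByReplacingOccurrencesOfString:@"[a-zA-Z]"
                                                        withString:@"X"
                                                           options:NSRegularExpressionSearch
                                                             range:NSMakeRange(0, text.length)];
    return [letters stringByReplacingOccurrencesOfString:@"\\d"
                                              withString:@"0"
                                                 options:NSRegularExpressionSearch
                                                   range:NSMakeRange(0, letters.length)];
}

int main(void) {
    @autoreleasepool {
        NSString *exactRepeat = @"(.{4,5})\\1";
        NSString *fourLettersTwoDigits = @"(X{4}0{2})\\1";

        // Exact repeats: checked on the raw input.
        NSLog(@"%d", ContainsMatch(@"testA54RqA54Rq", exactRepeat));
        NSLog(@"%d", ContainsMatch(@"testA54Rq", exactRepeat));

        // Category repeats: checked on the collapsed form.
        NSString *collapsed = CategoryForm(@"testQWEr30BKRe40");
        NSLog(@"%@ -> %d", collapsed, ContainsMatch(collapsed, fourLettersTwoDigits));
    }
    return 0;
}
```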

Optimizing scanning large text and matching against list of words or phrases

I'm working on an app that takes an article (simple HTML page), and a list of vocabulary terms (each may be a word, a phrase, or even a sentence), and creates a link for each term it finds. The problem is that for larger texts with more terms it takes a long time. Currently we are dealing with this by initially displaying the unmarked text, processing the links in the background, and finally reloading the web view when processing finishes. Still, it can take a while and some of our users are not happy with it.
Right now the app uses a simple loop on the terms, doing a replacement in the HTML. Basically:
for (int i = 0; i < terms.count; i++) {
    NSString *term = [terms objectAtIndex:i];
    NSString *replaceString = [NSString stringWithFormat:@"<a href=\"myUrl:\\%d\">%@</a>", i, term];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:term
                                                       withString:replaceString
                                                          options:NSCaseInsensitiveSearch
                                                            range:NSMakeRange(0, [htmlString length])];
}
However, we are dealing with multiple languages, so there is not just one replacement per term, but twenty! That's because we have to deal with punctuation at the beginning (upside-down question marks in Spanish) and end of each term. We have to replace "term", "term.", and "term?" with an appropriate hyperlink.
Is there a more efficient method I could use to get this HTML marked up?
I need to keep the index of the original term so that it can be retrieved later when the user clicks the link.
You could process the text as follows:
Instead of looping over the vocabulary, split the text into words and look up each word in the vocabulary.
Create some index, hash table or dictionary to make the lookup efficient.
Don't use stringByReplacingOccurrencesOfString. Each time it's called it makes a copy of the whole text and won't release the memory until the autorelease pool is drained. (Interestingly, you haven't run into memory problems yet.) Instead use an NSMutableString instance where you append each word (and the characters between them), either as it was in the original text or decorated as a link.
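A minimal sketch of that single-pass idea is below. It assumes plain text without HTML tags and single-word terms only, so phrases and markup skipping would need extra handling. LinkTerms is a hypothetical helper name, the sample sentence is invented, and the link format is modelled loosely on the myUrl scheme in the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps every word found in `terms` in a link that
// carries the term's index, copying everything else through unchanged.
static NSString *LinkTerms(NSString *text, NSArray<NSString *> *terms) {
    // Map each lowercased term to its index in the original list.
    NSMutableDictionary<NSString *, NSNumber *> *lookup = [NSMutableDictionary dictionary];
    [terms enumerateObjectsUsingBlock:^(NSString *term, NSUInteger i, BOOL *stop) {
        lookup[term.lowercaseString] = @(i);
    }];

    NSMutableString *output = [NSMutableString string];
    __block NSUInteger cursor = 0;
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange,
                                       NSRange enclosingRange, BOOL *stop) {
        // Copy the punctuation and spaces that precede this word unchanged.
        NSRange gap = NSMakeRange(cursor, wordRange.location - cursor);
        [output appendString:[text substringWithRange:gap]];

        NSNumber *termIndex = lookup[word.lowercaseString];
        if (termIndex != nil) {
            [output appendFormat:@"<a href=\"myUrl:%@\">%@</a>", termIndex, word];
        } else {
            [output appendString:word];
        }
        cursor = NSMaxRange(wordRange);
    }];
    // Copy whatever follows the last word.
    [output appendString:[text substringFromIndex:cursor]];
    return output;
}

int main(void) {
    @autoreleasepool {
        NSArray<NSString *> *terms = @[ @"gato", @"perro" ];
        NSLog(@"%@", LinkTerms(@"¿Gato? El perro y el gato.", terms));
    }
    return 0;
}
```

Because punctuation never becomes part of a word here, "term", "term." and "term?" are all handled by the same dictionary lookup rather than by separate replacements.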
What you're doing right now is this:
for each vocabulary word 'term'
search the HTML text for instances of term
replace each instance of term with an appropriate hyperlink
If you have a large text, then each search takes that much longer. Further, every time you do a replacement, you have to create a new string containing a copy of the text to do the replacement on, since stringByReplacingOccurrencesOfString:withString:options:range: returns a new string rather than modifying the existing string. Multiply that by N replacements.
A better option would be to make a single pass through the string, searching for all terms at once, and building up the resulting output string in a mutable string to avoid a Shlemiel the Painter-like runtime.
For example, you could use regular expressions like so:
// Create a regular expression that is an alternation of all of the vocabulary
// words. You only need to create this once at startup.
NSMutableString *pattern = [[[NSMutableString alloc] init] autorelease];
[pattern appendString:@"\\b("];
BOOL isFirstTerm = YES;
for (NSString *term in vocabularyList)
{
    if (!isFirstTerm)
    {
        [pattern appendString:@"|"];
    }
    isFirstTerm = NO;
    // Escape regex metacharacters so each term is matched literally
    [pattern appendString:[NSRegularExpression escapedPatternForString:term]];
}
[pattern appendString:@")\\b"];
// Create regular expression object
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error];
// Replace vocabulary matches with a hyperlink. $1 is the matched term; a
// template cannot supply the term's index, so this link is only a placeholder.
NSMutableString *htmlCopy = [[htmlString mutableCopy] autorelease];
[regex replaceMatchesInString:htmlCopy
                      options:0
                        range:NSMakeRange(0, [htmlCopy length])
                 withTemplate:@"<a href=\"myUrl:$1\">$1</a>"];
// Now use htmlCopy
Since the string replace function you're calling is order n (it scans and replaces across the n words of the text) and you're doing it for m vocabulary terms, you have an O(n × m) algorithm.
If you could do it in one pass, that would be optimal (order n - n words in html). The idea of presenting the un-replaced text first is still a good one unless it's unnoticeable even for large docs.
How about a hash set of vocabulary words: scan through the HTML word by word (skipping HTML markup), and if the current scanned word is in the hash set, append its hyperlink to the target buffer instead of the scanned word. That keeps at most 2× the HTML content plus one hash set of vocabulary words in memory.
There are two approaches.
Hash maps - if the maximum length of your phrases is limited, for example to two words, you can iterate over all words and bigrams (2-word sequences) and look them up in a hash map. The complexity is linear, since a hash lookup is constant time in the ideal case.
Automaton theory
You can combine the simple automata that match individual strings into a single one, which evaluates faster (i.e. dynamic programming). For example, given "John Smith"|"John Stuard", merging them gives John S(mith|tuard). This is called prefix optimisation (http://code.google.com/p/graph-expression/wiki/RegexpOptimization).
A more advanced algorithm can be found here: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
I like this approach more because there is no limit on phrase length and it allows combining complex regexps.