Using NSRegularExpression to extract URLs on the iPhone - objective-c

I'm using the following code in my iPhone app, taken from here, to extract all URLs from stripped .html code.
I can only extract the first URL, but I need an array containing all of them. My NSArray isn't returning an NSString for each URL, only the object descriptions.
How do I make my arrayOfAllMatches return all URLs as NSStrings?
-(NSArray *)stripOutHttp:(NSString *)httpLine {
// Setup an NSError object to catch any failures
NSError *error = NULL;
// create the NSRegularExpression object and initialize it with a pattern
// the pattern will match any http or https url, with option case insensitive
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?" options:NSRegularExpressionCaseInsensitive error:&error];
// create an NSRange object using our regex object for the first match in the string httpline
NSRange rangeOfFirstMatch = [regex rangeOfFirstMatchInString:httpLine options:0 range:NSMakeRange(0, [httpLine length])];
NSArray *arrayOfAllMatches = [regex matchesInString:httpLine options:0 range:NSMakeRange(0, [httpLine length])];
// check that our NSRange object is not equal to range of NSNotFound
if (!NSEqualRanges(rangeOfFirstMatch, NSMakeRange(NSNotFound, 0))) {
// Since we know that we found a match, get the substring from the parent string by using our NSRange object
NSString *substringForFirstMatch = [httpLine substringWithRange:rangeOfFirstMatch];
NSLog(@"Extracted URL: %@", substringForFirstMatch);
NSLog(@"All Extracted URLs: %@", arrayOfAllMatches);
// return all matching url strings
return arrayOfAllMatches;
}
return NULL;
}
Here is my NSLog output:
Extracted URL: http://example.com/myplayer
All Extracted URLs: (
"<NSExtendedRegularExpressionCheckingResult: 0x106ddb0>{728, 53}{<NSRegularExpression: 0x106bc30> http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)? 0x1}",
"<NSExtendedRegularExpressionCheckingResult: 0x106ddf0>{956, 66}{<NSRegularExpression: 0x106bc30> http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)? 0x1}",
"<NSExtendedRegularExpressionCheckingResult: 0x106de30>{1046, 63}{<NSRegularExpression: 0x106bc30> http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)? 0x1}",
"<NSExtendedRegularExpressionCheckingResult: 0x106de70>{1129, 67}{<NSRegularExpression: 0x106bc30> http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)? 0x1}"
)

The method matchesInString:options:range: returns an array of NSTextCheckingResult objects. You can use fast enumeration to iterate through the array, pull out the substring of each match from your original string, and add the substring to a new array.
// note: https? (not http?) so that both http:// and https:// URLs match
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?" options:NSRegularExpressionCaseInsensitive error:&error];
NSArray *arrayOfAllMatches = [regex matchesInString:httpLine options:0 range:NSMakeRange(0, [httpLine length])];
NSMutableArray *arrayOfURLs = [[NSMutableArray alloc] init];
for (NSTextCheckingResult *match in arrayOfAllMatches) {
NSString* substringForMatch = [httpLine substringWithRange:match.range];
NSLog(@"Extracted URL: %@", substringForMatch);
[arrayOfURLs addObject:substringForMatch];
}
// return non-mutable version of the array
return [NSArray arrayWithArray:arrayOfURLs];

Try NSDataDetector
NSDataDetector *linkDetector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:nil];
NSArray *matches = [linkDetector matchesInString:text options:0 range:NSMakeRange(0, [text length])];
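The two lines above stop at the array of match objects. A minimal sketch of turning those matches into strings follows. The function name URLStringsInText is made up for illustration, and the sketch assumes that each link result exposes its link through the URL property.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): returns every link that
// NSDataDetector finds in `text`, as an array of NSString objects.
static NSArray<NSString *> *URLStringsInText(NSString *text) {
    NSError *error = nil;
    NSDataDetector *linkDetector =
        [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:&error];
    if (linkDetector == nil) {
        NSLog(@"Could not create detector: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *urls = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [linkDetector matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        // Skip any result without a URL rather than inserting nil into the array.
        if (match.URL != nil) {
            [urls addObject:match.URL.absoluteString];
        }
    }
    return [urls copy];
}
```

Calling URLStringsInText(text) then gives you plain strings instead of NSTextCheckingResult descriptions.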

With NSDataDetector using Swift :
let types: NSTextCheckingType = .Link
var error : NSError?
let detector = NSDataDetector(types: types.rawValue, error: &error)
var matches = detector!.matchesInString(text, options: nil, range: NSMakeRange(0, count(text)))
for match in matches {
println(match.URL!)
}
Using Swift 2.0:
let text = "http://www.google.com. http://www.bla.com"
let types: NSTextCheckingType = .Link
let detector = try? NSDataDetector(types: types.rawValue)
guard let detect = detector else {
return
}
// NSRange counts UTF-16 code units, so use utf16.count rather than characters.count
let matches = detect.matchesInString(text, options: .ReportCompletion, range: NSMakeRange(0, text.utf16.count))
for match in matches {
print(match.URL!)
}
Using Swift 3.0
let text = "http://www.google.com. http://www.bla.com"
let types: NSTextCheckingResult.CheckingType = .link
let detector = try? NSDataDetector(types: types.rawValue)
// NSRange counts UTF-16 code units, so use utf16.count rather than characters.count
let matches = detector?.matches(in: text, options: .reportCompletion, range: NSMakeRange(0, text.utf16.count))
for match in matches! {
print(match.url!)
}

To get all links from a given string:
NSRegularExpression *expression = [NSRegularExpression regularExpressionWithPattern:@"(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’]))" options:NSRegularExpressionCaseInsensitive error:NULL];
NSString *someString = @"www.facebook.com/link/index.php This is a sample www.google.com of a http://abc.com/efg.php?EFAei687e3EsA sentence with a URL within it.";
// options:0 here; NSMatchingCompleted is an NSMatchingFlags value reported to enumeration blocks, not a matching option
NSArray *matches = [expression matchesInString:someString options:0 range:NSMakeRange(0, someString.length)];
for (NSTextCheckingResult *result in matches) {
NSString *url = [someString substringWithRange:result.range];
NSLog(@"found url:%@", url);
}

I found myself so nauseated by the complexity of this simple operation ("match ALL the substrings") that I made a little library I am humbly calling Unsuck which adds some sanity to NSRegularExpression in the form of from and allMatches methods. Here's how you'd use them:
NSRegularExpression *re = [NSRegularExpression from: @"(?i)\\b(https?://.*)\\b"]; // or whatever your favorite regex is; Hossam's seems pretty good
NSArray *matches = [re allMatches:httpLine];
Please check out the unsuck source code on github and tell me all the things I did wrong :-)
Note that (?i) makes it case insensitive so you don't need to specify NSRegularExpressionCaseInsensitive.

Related

Regular expression to grab usernames from string

I need to find usernames (like Twitter ones) in strings. For example, if the string is:
"Hello, @username! How are you? And @username2??"
I want to isolate/extract @username and @username2.
Do you know how to do this in Objective-C? I found this Python regex for Twitter usernames, but it does not work for me.
I tried it like this, but it is not working:
NSString *comment = @"Hello, @username! How are you? And @username2??";
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(?<=^|(?<=[^a-zA-Z0-9-\\.]))@([A-Za-z]+[A-Za-z0-9-]+)" options:0 error:&error];
NSArray *matches = [regex matchesInString:comment options:0 range:NSMakeRange(0, comment.length)];
for (NSTextCheckingResult *match in matches) {
NSRange wordRange = [match rangeAtIndex:1];
NSString *username = [comment substringWithRange:wordRange];
NSLog(@"searchUsersInComment result --> %@", username);
}
(?<=^|(?<=[^a-zA-Z0-9-\\.]))@([A-Za-z]+[A-Za-z0-9-]+) is meant to skip email addresses and grab only usernames. As your string doesn't contain any email addresses, you can just use @([A-Za-z]+[A-Za-z0-9-]+)
Your regex is wrong. You need to modify it to:
NSString *comment = @"Hello, @username! How are you? And @username2??";
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"@([A-Za-z]+[A-Za-z0-9-]+)" options:0 error:&error];
NSArray *matches = [regex matchesInString:comment options:0 range:NSMakeRange(0, comment.length)];
for (NSTextCheckingResult *match in matches) {
NSRange wordRange = [match rangeAtIndex:1];
NSString *username = [comment substringWithRange:wordRange];
NSLog(@"searchUsersInComment result --> %@", username);
}
FYI: Any subpattern inside a pair of parentheses will be captured as a group. In practice, this can be used to extract information like phone numbers or emails from all sorts of data.
Imagine for example that you had a command line tool to list all the image files you have in the cloud. You could then use a pattern such as ^(IMG\d+\.png)$ to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern ^(IMG\d+)\.png$ which only captures the part before the period.
I suggest reading about capture groups: http://regexone.com/lesson/capturing_groups
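To make the capture-group point concrete, here is a small sketch that collects group 1 of every match, so the results come back without the leading marker character. It assumes usernames are prefixed with @, and the function name UsernamesInComment is invented for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects capture group 1 of every match, i.e. the
// username without the leading "@".
static NSArray<NSString *> *UsernamesInComment(NSString *comment) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"@([A-Za-z]+[A-Za-z0-9-]+)"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *names = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:comment options:0 range:NSMakeRange(0, comment.length)];
    for (NSTextCheckingResult *match in matches) {
        // rangeAtIndex:0 (the same as match.range) covers the whole match,
        // including the "@"; rangeAtIndex:1 covers only the parenthesized part.
        [names addObject:[comment substringWithRange:[match rangeAtIndex:1]]];
    }
    return [names copy];
}
```

If you want the marker included, use match.range instead of [match rangeAtIndex:1].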

In objective-c is it possible to get the position of a regex match within the string

If I have the string "Hello World", is it possible to use NSRegularExpression with the pattern @"World" to get the position of the match, i.e. in the "Hello World" example the position/index of the match should be 6?
In PHP I'd use preg_match with the PREG_OFFSET_CAPTURE flag to achieve this. Does Objective-C support this?
You can do it the Cocoa way:
NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:@"world" options:0 error:NULL];
// omitted error checking for the sake of simplicity
NSString *str = @"Hello world!";
[regex enumerateMatchesInString:str
options:0
range:NSMakeRange(0, str.length)
usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop)
{
// NSRange fields are NSUInteger, so print them with %lu and a cast
NSLog(@"Match at [%lu, %lu]", (unsigned long)result.range.location, (unsigned long)result.range.length);
}];
[regex release]; // manual reference counting only; omit this line under ARC
Or the POSIX way (this may be convenient for you, since you want only one match, and this function/method returns the match range directly):
#include <regex.h>
- (NSRange)matchString:(NSString *)string toRegex:(NSString *)regex
{
regex_t regex_obj;
regmatch_t match;
const char *regex_str;
const char *match_str;
int error;
regex_str = [regex UTF8String];
error = regcomp(&regex_obj, regex_str, REG_EXTENDED);
if (error)
{
return NSMakeRange(NSNotFound, 0);
}
match_str = [string UTF8String];
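error = regexec(&regex_obj, match_str, 1, &match, 0);
// free the compiled pattern on both the success and the failure path
regfree(&regex_obj);
if (error)
{
return NSMakeRange(NSNotFound, 0);
}
// note: rm_so and rm_eo are byte offsets into the UTF-8 string, so this is only a valid NSString range for ASCII input
return NSMakeRange(match.rm_so, match.rm_eo - match.rm_so);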
}
This is somewhat long in Cocoa, but you can do it:
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression
regularExpressionWithPattern:@"world"
options:0 // NSRegularExpressionSearch is an NSString compare option, not a regex option
error:&error];
NSString *str = @"Hello, world!";
NSTextCheckingResult *match = [regex firstMatchInString:str
options:0
range:NSMakeRange(0, [str length])];
if (match) {
NSRange matchRange = [match range];
NSLog(@"%lu", (unsigned long)matchRange.location);
}
This prints 7.
If you're going to make a lot of use of RegEx's, I recommend looking at RegexKit or RegexKitLite.
Yes it is possible. You can use the NSRegularExpression method, rangeOfFirstMatchInString:options:range: which returns the range of the first match. You could also do this with the NSString method rangeOfString: if you don't need to use REGEX.
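As a rough sketch of that last suggestion, the function below returns the index of the first match through rangeOfFirstMatchInString:options:range:. The name LocationOfFirstMatch is invented for this example; it reports NSNotFound both when nothing matches and when the pattern fails to compile.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: index of the first match of `pattern` in `string`,
// or NSNotFound when there is no match or the pattern does not compile.
static NSUInteger LocationOfFirstMatch(NSString *pattern, NSString *string) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return NSNotFound;
    }
    // When nothing matches, the returned range is {NSNotFound, 0}.
    NSRange range = [regex rangeOfFirstMatchInString:string
                                             options:0
                                               range:NSMakeRange(0, string.length)];
    return range.location;
}
```

For a fixed substring such as "World", [@"Hello World" rangeOfString:@"World"].location gives the same index without involving a regular expression.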

How to strip down the string?

I have a really long string and I want to extract one particular substring from it. How can I do that?
For example, I have:
this is the image <img src="http://vnexpress.net/Files/Subject/3b/bd/67/6f/chungkhoan-xanhdiem2.jpg"> and it is very beautiful.
and I want to take this long string and get only http://vnexpress.net/Files/Subject/3b/bd/67/6f/chungkhoan-xanhdiem2.jpg
Please show me how I can do this.
You can use regular expressions for this:
NSRegularExpression* regex = [[NSRegularExpression alloc] initWithPattern:@"src=\"([^\"]*)\"" options:NSRegularExpressionCaseInsensitive error:nil];
NSString *text = @"this is the image <img src=\"http://vnexpress.net/Files/Subject/3b/bd/67/6f/chungkhoan-xanhdiem2.jpg\"> and it is very beautiful.";
NSArray *imgs = [regex matchesInString:text options:0 range:NSMakeRange(0, [text length])];
if (imgs.count != 0) {
NSTextCheckingResult* r = [imgs objectAtIndex:0];
NSLog(@"%@", [text substringWithRange:[r rangeAtIndex:1]]);
}
This regular expression is the heart of the solution:
src="([^"]*)"
It matches the src attribute and captures the content between the quotes (note the pair of parentheses). This capture is then retrieved with [r rangeAtIndex:1] and used to extract the part of the string that you are looking for.
You should use a regular expression, probably using the NSRegularExpression class.
Here's an example that does exactly what you want (from here):
- (NSString *)stripOutHttp:(NSString *)httpLine
{
// Setup an NSError object to catch any failures
NSError *error = NULL;
// create the NSRegularExpression object and initialize it with a pattern
// the pattern will match any http or https url, with option case insensitive
NSRegularExpression *regex = [NSRegularExpression
regularExpressionWithPattern:@"https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?"
options:NSRegularExpressionCaseInsensitive
error:&error];
// create an NSRange object using our regex object for the first match in the string httpline
NSRange rangeOfFirstMatch = [regex rangeOfFirstMatchInString:httpLine
options:0
range:NSMakeRange(0, [httpLine length])];
// check that our NSRange object is not equal to range of NSNotFound
if (!NSEqualRanges(rangeOfFirstMatch, NSMakeRange(NSNotFound, 0)))
{
// Since we know that we found a match, get the substring from the parent
// string by using our NSRange object
NSString *substringForFirstMatch = [httpLine substringWithRange:rangeOfFirstMatch];
NSLog(@"Extracted URL: %@", substringForFirstMatch);
// return the matching string
return substringForFirstMatch;
}
return NULL;
}
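You can also do this with NSScanner:
NSString *urlString = nil;
NSString *htmlString = @"this is the image <img src=\"http://vnexpress.net/Files/Subject/3b/bd/67/6f/chungkhoan-xanhdiem2.jpg\"> and it is very beautiful."; // your string
NSScanner *scanner = [NSScanner scannerWithString:htmlString];
[scanner scanUpToString:@"<img" intoString:nil];
if (![scanner isAtEnd]) {
[scanner scanUpToString:@"http" intoString:nil];
// stop at the closing quote (or the end of the tag) so the quote is not included in the URL
NSCharacterSet *charset = [NSCharacterSet characterSetWithCharactersInString:@"\">"];
[scanner scanUpToCharactersFromSet:charset intoString:&urlString];
}
NSLog(@"%@", urlString);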

Unable to extract information using NSRegularExpression

I am developing an iPhone application that uses NSRegularExpression to match a pattern in a string and extract information. Here I am trying to extract the mailto link from an email header. I have successfully retrieved the email header string, and now I am applying the search pattern with NSRegularExpression to get the email address from the header string.
This is the header text from which I want to extract the mailto address:
List-Unsubscribe: <mailto:suksh-1142-5451-d8135921c2e2d40400ab02fa31eda529@usub.mailserv.in>?subject=Unsubscribe>,<http://suksh.mailserv.in/suksh/?p=unsubscribe&mid=5451&uid=d8135921c2e2d40400ab02fa31eda529>>
This is the search pattern:
mailto:(?<address>[^\?^>]+)\??(?<params>[^>]+)?
My code is like this:
NSString *str= @"List-Unsubscribe: <mailto:suksh-1142-5451-d8135921c2e2d40400ab02fa31eda529@usub.mailserv.in>?subject=Unsubscribe>,<http://suksh.mailserv.in/suksh/?p=unsubscribe&mid=5451&uid=d8135921c2e2d40400ab02fa31eda529>>";
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"mailto:(?<address>[^\?^>]+)\??(?<params>[^>]+)?"];
NSRange rangeOfFirstMatch = [regex rangeOfFirstMatchInString:str options:0 range:NSMakeRange(0, [str length])];
if (!NSEqualRanges(rangeOfFirstMatch, NSMakeRange(NSNotFound, 0))) {
NSString *substringForFirstMatch = [str substringWithRange:rangeOfFirstMatch];
NSLog(@"Extracted URL: %@", substringForFirstMatch);
}
But when I create the NSRegularExpression object with regularExpressionWithPattern:, it returns nil.
What could the issue be?
Thanks in advance.
The pattern string will be processed twice: once by the compiler, then by NSRegularExpression. You must escape backslashes to ensure the compiler doesn't process each "\?".
At the time of writing, neither the NSRegularExpression nor the ICU documentation mentions support for named capture groups ((?<name>pattern)); that could cause parsing of the pattern, or the match itself, to fail.
Use regularExpressionWithPattern:options:error: when you create the regular expression so you can get an error object, which will tell you why construction failed.
NSError *theError = nil;
// '?\?(' is to prevent '??(' from being interpreted as a trigraph
NSString *pattern = @"mailto:(?<address>[^\\?^>]+)\\?\?(?<params>[^>]+)?";
NSRegularExpression *regex;
NSRange rangeOfFirstMatch;
regex = [NSRegularExpression regularExpressionWithPattern:pattern
options:0 error:&theError];
if (regex) {
rangeOfFirstMatch = [regex rangeOfFirstMatchInString:str
options:0 range:NSMakeRange(0, [str length])];
if (!NSEqualRanges(rangeOfFirstMatch, NSMakeRange(NSNotFound, 0))) {
NSString *substringForFirstMatch = [str substringWithRange:rangeOfFirstMatch];
NSLog(@"Extracted URL: %@", substringForFirstMatch);
}
} else {
// couldn't compile RE
NSAlert *errorAlert;
if (theError) {
errorAlert = [NSAlert alertWithError:theError];
} else {
NSString *errorMsg = @"Couldn't parse unsubscribe header because the pattern /%@/ isn't a valid regular expression.";
errorAlert = [NSAlert
alertWithMessageText:@"Invalid Pattern"
defaultButton:nil
alternateButton:nil
otherButton:nil
informativeTextWithFormat:@"%@", [NSString stringWithFormat:errorMsg, pattern]];
}
[errorAlert runModal]; // Ignore return value.
}

NSRegularExpression to extract text between two XML tags

How can I extract the value "6" between the "badgeCount" tags using NSRegularExpression? The following is the response from the server:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><badgeCount>6</badgeCount><rank>2</rank><screenName>myName</screenName>
The following is the code I tried, without success. It goes into the else branch and prints "Value of regex is nil":
NSString *responseString = [[NSString alloc] initWithBytes:[responseDataForCrntUser bytes] length:responseDataForCrntUser.length encoding:NSUTF8StringEncoding];
NSError *error;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(?<=badgeCount>)(?:[^])*?(?=</badgeCount)" options:0 error:&error];
if (regex != nil) {
NSTextCheckingResult *firstMatch = [regex firstMatchInString:responseString options:0 range:NSMakeRange(0, [responseString length])];
NSLog(@"NOT NIL");
if (firstMatch) {
NSRange accessTokenRange = [firstMatch rangeAtIndex:1];
NSString *value = [urlString substringWithRange:accessTokenRange];
NSLog(@"Value: %@", value);
}
}
else
NSLog(@"Value of regex is nil");
If you could provide sample code that would be much appreciated.
NOTE: I don't want to use NSXMLParser.
Example:
NSString *xml = @"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><badgeCount>6</badgeCount><rank>2</rank><screenName>myName</screenName>";
NSString *pattern = @"<badgeCount>(\\d+)</badgeCount>";
NSRegularExpression *regex = [NSRegularExpression
regularExpressionWithPattern:pattern
options:NSRegularExpressionCaseInsensitive
error:nil];
NSTextCheckingResult *textCheckingResult = [regex firstMatchInString:xml options:0 range:NSMakeRange(0, xml.length)];
NSRange matchRange = [textCheckingResult rangeAtIndex:1];
NSString *match = [xml substringWithRange:matchRange];
NSLog(@"Found string '%@'", match);
NSLog output:
Found string '6'
To do it in Swift (Swift 4 or later, since it uses range(at:)):
import Foundation
func getMatchingValueFrom(strXML: String, tag: String) -> String {
let pattern = "<" + tag + ">(.*?)</" + tag + ">" // original didn't work: "<"+tag+">(\\d+)</"+tag+">"
do {
let regex = try NSRegularExpression(pattern: pattern, options: .caseInsensitive)
// NSRange counts UTF-16 code units, so measure the string the same way
let fullRange = NSRange(location: 0, length: strXML.utf16.count)
// firstMatch returns nil when nothing matches; that is not a thrown error
guard let textCheckingResult = regex.firstMatch(in: strXML, options: [], range: fullRange) else {
print(pattern + "<-- not found in string -->" + strXML)
return ""
}
let matchRange = textCheckingResult.range(at: 1)
return (strXML as NSString).substring(with: matchRange)
} catch {
// only reached when the pattern itself fails to compile
print("invalid pattern: " + pattern)
return ""
}
}
P.S.: This is the Swift equivalent of @zaph's Objective-C solution.