Parse domain name from URL string - objective-c

How would I parse a domain name in Objective-C?
For example if my string value was "http://www.google.com" I would like to parse out the string "google"

I think the question is a tiny bit invalid. A host is determined by its FQDN (fully qualified domain name) which, in your example, is www.google.com. It's not the same as mail.google.com or www.google.info or google.com. To single out "google" is not trivial and does not make much sense from a URL perspective.
If you'd like to just parse the URL more-or-less intelligently, I think you can do the following:
Use NSURL's -host method to get the scheme and path/query stripped correctly.
Use NSString's -componentsSeparatedByString: method, splitting on ".", to get an array of the domain name's "components".
Ignore the last component (the top-level domain).
If there's only one component left (or it may be enough to take the second-last component), you're done.
If the first component is "www" (or a variant such as "www3"), "ftp", "mail" or something of that kind, you can ignore it too if you like. The rest may be of interest, depending on your needs.
Test your algorithm against ten thousand URLs to get a sense of futility of this task ;)
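The steps above can be sketched roughly as follows. This is a Python illustration rather than Objective-C, and the function name guess_site_name and the prefix list are assumptions made up for the example, not part of any library. It is a heuristic only: the last call shows how quickly it breaks down on multi-part suffixes.

```python
from urllib.parse import urlsplit

# Assumed, deliberately incomplete list of "uninteresting" leading labels.
COMMON_PREFIXES = ("www", "ftp", "mail")

def guess_site_name(url):
    """Heuristically pick the 'interesting' part of a URL's host name."""
    host = urlsplit(url).hostname or ""      # strips scheme, path and query
    labels = host.split(".")                 # the domain name's components
    if len(labels) > 1:
        labels = labels[:-1]                 # ignore the last component
    if len(labels) > 1 and labels[0].startswith(COMMON_PREFIXES):
        labels = labels[1:]                  # drop "www", "www3", "mail", ...
    return ".".join(labels)

print(guess_site_name("http://www.google.com"))    # google
print(guess_site_name("http://mail.google.com"))   # google
print(guess_site_name("http://www.bbc.co.uk"))     # bbc.co (the heuristic misfires)
```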

In iOS 7 and later you can use the NSURLComponents class:
NSURLComponents *components = [[NSURLComponents alloc] initWithString:@"http://stackoverflow.com/questions/2333972/objective-c-parse-domain-name-from-url-string"];
NSAssert([components.host isEqualToString:@"stackoverflow.com"], nil);

Related

Inconsistencies in URL encoding methods across Objective-C and Swift

I have the following Objective-C code:
[@"http://www.google.com" stringByAddingPercentEncodingWithAllowedCharacters:[NSCharacterSet URLPathAllowedCharacterSet]];
// http%3A//www.google.com
And yet, in Swift:
"http://www.google.com".addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)
// http://www.google.com
To what can I attribute this discrepancy?
And for extra credit: can I rely on this code to encode URL path reserved characters when passing a full URL like this?
The issue actually rests in the difference between NSString method stringByAddingPercentEncodingWithAllowedCharacters and String method addingPercentEncoding(withAllowedCharacters:). And this behavior has been changing from version to version. (It looks like the latest beta of iOS 11 now restores this behavior we used to see.)
I believe the root of the issue rests in the particulars of how paths are percent encoded. Section 3.3 of RFC 3986 says that colons are permitted in paths except in the first segment of a relative path.
The NSString method captures this notion, e.g. imagine a path whose first directory was foo: (with a colon) and a subdirectory of bar: (also with a colon):
NSString *string = @"foo:/bar:";
NSCharacterSet *cs = [NSCharacterSet URLPathAllowedCharacterSet];
NSLog(@"%@", [string stringByAddingPercentEncodingWithAllowedCharacters:cs]);
That results in:
foo%3A/bar:
The : in the first segment of the path is percent encoded, but the : in subsequent segments is not. This captures the logic of how to handle colons in relative paths per RFC 3986.
The String method addingPercentEncoding(withAllowedCharacters:), however, does not do this:
let string = "foo:/bar:"
os_log("%@", string.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)!)
Yields:
foo:/bar:
Clearly, the String method does not attempt that position-sensitive logic. This implementation is more in keeping with the name of the method (it considers solely what characters are "allowed" with no special logic that tries to guess, based upon where the allowed character appears, whether it's truly allowed or not.)
I gather that you are saddled with the code supplied in the question, but we should note that this behavior of percent escaping colons in relative paths, while interesting to explain what you experienced, is not really relevant to your immediate problem. The code you have been provided is simply incorrect. It is attempting to percent encode a URL as if it was just a path. But, it’s not a path; it’s a URL, which is a different thing with its own rules.
The deeper insight in percent encoding URLs is to acknowledge that different components of a URL allow different sets of characters, i.e. they require different percent encoding. That’s why NSCharacterSet has so many different URL-related character sets.
You really should percent encode the individual components, percent encoding each with the character set allowed for that type of component. Only when the individual components are percent encoded should they then be concatenated together to form the whole URL.
Alternatively, NSURLComponents is designed precisely for this purpose, getting you out of the weeds of percent-encoding the individual components yourself. For example:
var components = URLComponents(string: "http://httpbin.org/post")!
let foo = URLQueryItem(name: "foo", value: "bar & baz")
let qux = URLQueryItem(name: "qux", value: "42")
components.queryItems = [foo, qux]
let url = components.url!
That yields the following, with the & and the two spaces properly percent escaped within the foo value, while the & between foo and qux is correctly left alone:
http://httpbin.org/post?foo=bar%20%26%20baz&qux=42
It’s worth noting, though, that NSURLComponents has a small, yet fairly fundamental, flaw: if you have query values (NSURLQueryItem) that could contain + characters, most web services need those percent escaped, but NSURLComponents won’t do so. If your URL has query components and those query values might include + characters, I’d advise against NSURLComponents and would instead advise percent encoding the individual components of a URL yourself.
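To make the "encode each component separately, then concatenate" idea concrete, here is a minimal sketch in Python's standard library (not the Foundation API discussed above). The helper name build_url is made up for this example. It assumes every path segment and every query value should be escaped as strictly as possible, which also turns a literal + into %2B.

```python
from urllib.parse import quote, urlencode, urlunsplit

def build_url(scheme, host, path_segments, query_items):
    """Percent encode each component on its own, then assemble the URL."""
    # Each path segment is encoded separately, so a "/" inside a segment
    # is escaped instead of being mistaken for a separator.
    path = "/" + "/".join(quote(segment, safe="") for segment in path_segments)
    # quote_via=quote encodes spaces as %20 and escapes "&", "=" and "+"
    # inside names and values; the "&" between pairs is added afterwards.
    query = urlencode(query_items, quote_via=quote)
    return urlunsplit((scheme, host, path, query, ""))

print(build_url("http", "httpbin.org", ["post"],
                [("foo", "bar & baz"), ("qux", "42")]))
# http://httpbin.org/post?foo=bar%20%26%20baz&qux=42
```

A value such as "1+1" passed through the same helper comes out as 1%2B1, which is the behaviour most web services expect for a literal plus sign.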

Add path segment at last part of URL with ZnUrl

I am using Pharo 3 and I want to add a path segment as the last part of a URL. For example, given http://example.com/myapp?key1=param1&key2=param2, I want to get /myParam added to the last part. With ZnUrl I tried #addPathSegment:
(ZnUrl fromString: 'http://example.com/myapp?key1=param1&key2=param2')
addPathSegment: 'myParam'
but results in
http://example.com/myapp/myParam?key1=param1&key2=param2
How could I configure the ZnUrl to get the following instead?
http://example.com/myapp?key1=param1&key2=param2/myParam
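The thing you are describing is not a valid URL structure: the query starts at the ? and runs to the end of the URL (or to a # fragment), so anything appended after it, including /myParam, becomes part of the query rather than a path segment.
So what you are talking about is not an addition of a path segment, but rather string concatenation.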
You can consider doing:
ZnUrl fromString: 'http://example.com/myapp?key1=param1&key2=param2/myParam'
or if you get a url from somewhere else,
(self asString, '/myParam') asUrl
should work too.
You can also do more magic to get everything to work, but in the first place you should redesign your URL structure to fit the standards (if you can influence it).
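To see why the desired string is not "a URL with an extra path segment", it can help to split it with a generic URL parser. The sketch below uses Python's urllib.parse rather than ZnUrl, purely as an illustration: the appended /myParam ends up inside the value of key2.

```python
from urllib.parse import urlsplit, parse_qs

parts = urlsplit("http://example.com/myapp?key1=param1&key2=param2/myParam")

path = parts.path               # the path is still just "/myapp"
query = parse_qs(parts.query)   # "/myParam" has become part of key2's value

print(path)    # /myapp
print(query)   # {'key1': ['param1'], 'key2': ['param2/myParam']}
```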

Change Url using Regex

I have a URL, for example:
http://i.myhost.com/myimage.jpg
I want to change this URL to
http://i.myhost.com/myimageD.jpg
(add D after the image name and before the dot),
i.e. I want to add some characters after the image name and before the dot using a regex.
What is the best way to do it using a regex?
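Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D.\2 (in an NSRegularExpression template the groups are written $1 and $2, i.e. $1D.$2). The dot has to be put back in the replacement because it sits outside both capture groups. I'm assuming the extension is 3-5 letters, but you can modify it to suit. E.g. if it's just jpg images then you can put jpg instead of the [a-zA-Z]{3,5}.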
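As a quick way to experiment with that pattern outside Xcode, here is a small Python sketch. The function name add_suffix is made up for the example, and it assumes the extension is 3-5 letters, as above.

```python
import re

# Group 1: everything before the last dot that starts an extension;
# group 2: the 3-5 letter extension itself.
EXTENSION = re.compile(r"^(.*)\.([a-zA-Z]{3,5})")

def add_suffix(url, suffix="D"):
    """Insert `suffix` between the file name and its extension."""
    return EXTENSION.sub(
        lambda m: f"{m.group(1)}{suffix}.{m.group(2)}", url)

print(add_suffix("http://i.myhost.com/myimage.jpg"))
# http://i.myhost.com/myimageD.jpg
```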
This sounds like a homework question, given that the solution must use a regex. On that assumption, here is an outline to get you going.
If all you have is a URL then @mathematical.coffee's solution will suit. However, if you have a chunk of text containing one or more URLs and you have to locate and change just those, then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see @mathematical.coffee's solution for an example.
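A rough sketch of that outline, in Python rather than NSRegularExpression. The per-part patterns are simplifying assumptions: they ignore ports, query strings and many legal characters, and they only match URLs followed by whitespace or the end of the text. All names are invented for the example.

```python
import re

PROTOCOL = r"[a-z]+://"                          # "http://", "ftp://", ...
ADDRESS = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"   # a name with at least one dot
DIRECTORIES = r"(?:/[^/\s]+)*"                   # zero or more "/name" items
FILE_NAME = r"/[^/\s]+?"                         # last item, up to its extension
EXTENSION = r"\.[A-Za-z]{3,5}"                   # ".jpg", ".png", ...

# Concatenate the parts; group 1 is everything before the extension,
# group 2 is the extension. The URL must end at whitespace or end of text.
URL_WITH_FILE = re.compile(
    rf"({PROTOCOL}{ADDRESS}{DIRECTORIES}{FILE_NAME})({EXTENSION})(?=\s|$)"
)

def add_suffix_to_urls(text, suffix="D"):
    """Insert `suffix` before the extension of every file URL in `text`."""
    return URL_WITH_FILE.sub(lambda m: m.group(1) + suffix + m.group(2), text)

sample = ("See http://i.myhost.com/myimage.jpg and "
          "ftp://files.example.org/a/b/pic.png today, "
          "but not http://www.google.com/ itself.")
print(add_suffix_to_urls(sample))
```

URLs without a "name.extension" last item, such as http://www.google.com/, are left untouched because the file part of the pattern cannot match.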
The best way to learn regexes is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression's, but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.

How to compare URLs when one is in /Users/john format and other in file://localhost/Users/etc

I am enumerating through directories which returns URLs in the form:
file://localhost/Users/john/Documents/static.gif
I want to check these results against URLs in the form of:
/Users/john
Specifically, I want to know if the first URL is contained in the second.
I've been going through the various NSURL methods and can't find a method that will allow me to convert one form into the other for easy comparison, or actually do the comparison for me.
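You can use the -path method to get the path strings. For the first URL the host (localhost) is not part of the path, so -path returns @"/Users/john/Documents/static.gif"; the second is already a plain path.
You can then check whether the first path lies under the second (assuming URL2 was created with +fileURLWithPath:). Note that a plain prefix test also accepts sibling directories such as /Users/johnny.
if ([[URL1 path] hasPrefix:[URL2 path]]) {
    NSLog(@"Contained");
}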
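The same comparison can be sketched in Python to show the idea of comparing whole path components instead of raw string prefixes, which avoids treating /Users/johnny as being inside /Users/john. The helper names path_of and is_inside are assumptions made up for this illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, unquote

def path_of(location):
    """Return the file system path for a file:// URL or a plain path."""
    if location.startswith("file://"):
        # The host ("localhost") lives in netloc, not in the path.
        return unquote(urlsplit(location).path)
    return location

def is_inside(child, parent):
    """True if `child` is `parent` itself or lies somewhere beneath it."""
    child_parts = PurePosixPath(path_of(child)).parts
    parent_parts = PurePosixPath(path_of(parent)).parts
    return child_parts[:len(parent_parts)] == parent_parts

print(is_inside("file://localhost/Users/john/Documents/static.gif",
                "/Users/john"))   # True
print(is_inside("file://localhost/Users/johnny/notes.txt",
                "/Users/john"))   # False
```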

Find a url for a file in html using a regular expression

I've set myself a somewhat ambitious first task in learning regular expressions (and one which relates to a problem I'm trying to solve). I need to find any instance of a url that ends in .m4v, in a big html string.
My first attempt was this for jpg files
http.*jpg
This seems correct at first glance, but it returns matches like this:
http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg
Which does match the expression in theory. So really, I need to put something in http.*m4v that says 'only the closest instance between http and m4v'. Any ideas?
As you've noticed, an expression such as the following is greedy:
http:.*\.jpg
That means it reads as much input as possible while satisfying the expression.
It's the "*" operator that makes it greedy. There's a well-defined regex technique for making it non-greedy: put the "?" modifier after the "*".
http:.*?\.jpg
Now it will match as little as possible while still satisfying the expression (i.e. it will stop at the first occurrence of ".jpg" after the "http:" where the match started).
Of course, if you have a .jpg in the middle of a URL, like:
http://mydomain.com/some.jpg-folder/foo.jpg
It will not match the full URL.
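The difference is easy to try out. The sketch below uses Python's re module (any engine with lazy quantifiers behaves the same way) on made-up sample text.

```python
import re

text = "http://mydomain.com/a.jpg and http://mydomain.com/b.jpg"

# Greedy: runs from the first "http:" to the LAST ".jpg" on the line.
greedy = re.findall(r"http:.*\.jpg", text)
# Lazy: each match stops at the first ".jpg" after its own "http:".
lazy = re.findall(r"http:.*?\.jpg", text)
# Lazy matching cuts a URL short if ".jpg" appears in the middle of it.
truncated = re.findall(r"http:.*?\.jpg",
                       "http://mydomain.com/some.jpg-folder/foo.jpg")

print(greedy)     # ['http://mydomain.com/a.jpg and http://mydomain.com/b.jpg']
print(lazy)       # ['http://mydomain.com/a.jpg', 'http://mydomain.com/b.jpg']
print(truncated)  # ['http://mydomain.com/some.jpg']
```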
You'll want to define the end of the URL as something that can't be considered part of the URL, such as a space, or a new line, or (if the URL is nested inside parentheses) a closing parenthesis. This can't be solved with just one little regex, however, if the URL is embedded in written language, since URLs there are often ambiguous.
Take for example:
At this page, http://mysite.com/puppy.html, there's a cute little puppy dog.
The comma could technically be a part of a URL. You have to deal with a lot of ambiguities like this when looking for URLs in written text, and it's hard not to have bugs due to the ambiguities.
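For the original HTML case, one way to apply the "stop at a character that cannot be part of the URL" idea is to exclude whitespace, quotes and angle brackets from the match. This is a Python sketch under that assumption, with made-up sample markup. It is not a full URL grammar and does not address the prose ambiguities just described.

```python
import re

html = ('<a href="http://domain.com/page.html" title="Misc">'
        '<img src="http://domain.com/image.jpg"></a> '
        '<video src="http://domain.com/clips/intro.m4v"></video>')

# Whitespace, quotes and angle brackets end an attribute value, so the
# match can never run from one attribute into the next.
M4V_URL = re.compile(r"""https?://[^\s"'<>]+\.m4v""")

found = M4V_URL.findall(html)
print(found)  # ['http://domain.com/clips/intro.m4v']
```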
EDIT | Here's an example of a regex in PHP that is a quick and dirty solution, being greedy only where needed and trying to deal with the English language:
<?php
$str = "Checkout http://www.foo.com/test?items=bat,ball, for info about bats and balls";
preg_match('/https?:\/\/([a-zA-Z0-9][a-zA-Z0-9-]*)(\.[a-zA-Z0-9-]+)*((\/[^\s]*)(?=[\s\.,;!\?]))\b/i', $str, $matches);
var_dump($matches);
It outputs:
array(5) {
[0]=>
string(38) "http://www.foo.com/test?items=bat,ball"
[1]=>
string(3) "www"
[2]=>
string(4) ".com"
[3]=>
string(20) "/test?items=bat,ball"
[4]=>
string(20) "/test?items=bat,ball"
}
The explanation is in the comments.
Perl, Ruby, PHP and JavaScript should all work with this:
/(http:\/\/(?:(?:(?!http:\/\/).))+\.jpg)/
The URLs will be stored in the matched groups. Tested this out against "http://a.com/b.jpg-folder/c.jpg http://mydomain.com/some.jpg-folder/foo.jpg" and it worked correctly without being too greedy.