Am I parsing html source code in the best way? - objective-c

I want to extract the body paragraphs from a web page and store them into a string.
First, I obtain the entire source code using
NSString *sourceCode = [NSString stringWithContentsOfURL:[NSURL URLWithString:currentLink] encoding:NSUTF8StringEncoding error:&error];
The body paragraphs begin after <!-- (START) Pagination Content Wrapper --> and end before <!-- (END) Pagination Content Wrapper -->,
so I plan to split the string like this:
NSString *startingPt = @"<!-- (START) Pagination Content Wrapper -->";
NSString *endingPt = @"<!-- (END) Pagination Content Wrapper -->";
// Keep everything after the START marker...
NSString *sub = [sourceCode substringFromIndex:NSMaxRange([sourceCode rangeOfString:startingPt])];
// ...then cut that remainder off at the END marker.
sub = [sub substringToIndex:[sub rangeOfString:endingPt].location];
Then I would use stringByReplacingOccurrencesOfString:withString: to replace the remaining HTML tags with @"".
Is there a better way to achieve my goal?

You're going to have to find the HTML tags before you remove them. Unless you know for a fact that this system will only ever use a limited number of tags, you shouldn't hard-code a list of them. And -stringByReplacingOccurrencesOfString:withString: needs an exact string, including every attribute of each tag (ID, class, etc.), which makes the approach even more fragile when the page changes.
Unless you're going to use the third-party extension suggested by vishy, which looks like it does what you need, you're going to have to do something like this:
1) Find the first occurrence of "<" in the string
2) See if the "<" is escaped.
3) If not, find the next ">".
4) See if that is escaped.
5) If not, create an NSRange for the tag (from "<" to ">") and use -stringByReplacingCharactersInRange:withString: to get rid of it.
6) Repeat until you don't find any more unescaped "<".
This will leave you with de-HTMLified text, but NOT plain text. You will still see HTML escapes, and just as importantly, there is no guarantee that the whitespace (which is ignored in HTML) will make any sense once the HTML is removed.
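The steps above can be sketched roughly as follows. This is only an illustration, not a robust HTML parser: StripTags is a hypothetical helper name, it edits a mutable copy with deleteCharactersInRange: rather than calling -stringByReplacingCharactersInRange:withString: repeatedly, and it assumes that an escaped "<" appears in the source as the entity &lt; (so it is never matched) and that no literal ">" occurs inside an attribute value or comment.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes every "<...>" span from the given string.
// Entities such as &lt; are left untouched, as is all whitespace.
static NSString *StripTags(NSString *html) {
    NSMutableString *result = [html mutableCopy];
    NSRange open = [result rangeOfString:@"<"];
    while (open.location != NSNotFound) {
        // Look for the closing ">" after the current "<".
        NSRange rest = NSMakeRange(open.location, result.length - open.location);
        NSRange close = [result rangeOfString:@">" options:0 range:rest];
        if (close.location == NSNotFound) {
            break; // Unterminated tag: leave the remaining text alone.
        }
        // Delete the whole tag, from "<" through ">".
        NSRange tag = NSMakeRange(open.location, NSMaxRange(close) - open.location);
        [result deleteCharactersInRange:tag];
        // Continue searching from the position where the tag used to be.
        NSRange tail = NSMakeRange(open.location, result.length - open.location);
        open = [result rangeOfString:@"<" options:0 range:tail];
    }
    return [result copy];
}
```

Calling StripTags(sub) on the extracted substring would give the tag-free text, still containing HTML entities and the original whitespace, as noted above.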

After obtaining the substring between the START and END markers, you can simply use the NSString+HTML category to remove the HTML tags. It is a very good category for HTML encoding, decoding, and more, and its main advantage is that you can call it directly on your NSString instances; there is no need to create separate objects for the purpose.
You can find more discussion of it under "Objective C HTML escape/unescape".
These are the methods it provides, as suggested in that post, and I like them.
- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;

Related

Using Placeholders in a URL string

I have a url that retrieves data from a Web API which looks like this with the entry of "Pizza Hut":
NSString *urlString = @"https://api.nutritionix.com/v1_1/search/Pizza Hut?results=0%3A20&cal_min=0&cal_max=50000&fields=item_name%2Cbrand_name%2Citem_id%2Cbrand_id&appId=MY_APP_ID&appKey=MY_APP_KEY";
This URL will return all the menu items of Pizza Hut.
Now I want to take a step beyond hard coding values, and so I created a text box where users can enter their own restaurant, and the web api should return data.
Here is an example of that:
NSString *urlString = [NSString stringWithFormat:@"https://api.nutritionix.com/v1_1/search/%@?results=0%3A20&cal_min=0&cal_max=50000&fields=item_name%2Cbrand_name%2Citem_id%2Cbrand_id&appId=MY_APP_ID&appKey=MY_APP_KEY", searchText.text];
All I did here was change the "Pizza Hut" to "%@".
However, I get a warning from the compiler saying:
"More '%' conversions than data arguments". As you would expect, the API returns no data, since this code doesn't work.
How would I re-write this string so that I could put the placeholder in there?
You have other percent symbols that need to be escaped properly. You want:
NSString *urlString = [NSString stringWithFormat:@"https://api.nutritionix.com/v1_1/search/%@?results=0%%3A20&cal_min=0&cal_max=50000&fields=item_name%%2Cbrand_name%%2Citem_id%%2Cbrand_id&appId=MY_APP_ID&appKey=MY_APP_KEY", searchText.text];
Basically, add a 2nd % symbol before all of the % symbols that you actually want to appear in the string.
BTW - make sure you properly escape the search text so special characters (such as spaces) are properly encoded.
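One possible way to do that escaping is sketched below. SearchURLString is a hypothetical helper name, and using URLPathAllowedCharacterSet with "/" removed is an assumption about what the API expects in that path segment; check the API documentation before relying on it.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: percent-encodes the user's text, then inserts it
// into the format string (note the doubled %% for literal percent signs).
static NSString *SearchURLString(NSString *searchText) {
    // Start from the characters allowed in a URL path, but also encode "/"
    // so the user's text cannot add extra path segments.
    NSMutableCharacterSet *allowed = [[NSCharacterSet URLPathAllowedCharacterSet] mutableCopy];
    [allowed removeCharactersInString:@"/"];
    NSString *escaped = [searchText stringByAddingPercentEncodingWithAllowedCharacters:allowed];
    if (escaped == nil) {
        return nil; // The text could not be encoded.
    }
    return [NSString stringWithFormat:@"https://api.nutritionix.com/v1_1/search/%@?results=0%%3A20&cal_min=0&cal_max=50000&fields=item_name%%2Cbrand_name%%2Citem_id%%2Cbrand_id&appId=MY_APP_ID&appKey=MY_APP_KEY", escaped];
}
```

With this in place, the call site would become something like NSString *urlString = SearchURLString(searchText.text);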

Understanding urls correctly

I'm writing an RSS reader and taking article URLs from feeds, but I often get invalid URLs while parsing with NSXMLParser. Sometimes there are extra characters at the end of the URL (for example \n or \t). I have fixed that issue.
The most difficult problem is URLs with queries that contain characters that must not be percent-encoded.
A working URL for a URL request: http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa
The '#' character is replaced with "%23" by the stringByAddingPercentEscapesUsingEncoding: method, and then the URL does not work: the site says the page was not found. I believe what follows the '#' character is a query string.
Is there a way to get (encode) any URL from feeds correctly, or at least to always remove the query strings from the XML?
There are two approaches you could use to create a legal URL string: either stringByAddingPercentEncodingWithAllowedCharacters: or the Core Foundation CFURL API, which gives you a whole range of options.
Example 1 (NSCharacterSet):
NSString *nonFormattedURL = @"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
NSLog(@"%@", [nonFormattedURL stringByAddingPercentEncodingWithAllowedCharacters:[[NSCharacterSet illegalCharacterSet] invertedSet]]);
This still keeps the hash in place, by inverting the illegalCharacterSet of NSCharacterSet. If you would like more control, you can also create your own mutable set.
Example 2 (CFURL.h):
NSString *nonFormattedURL = @"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
CFAllocatorRef allocator = CFAllocatorGetDefault();
CFStringRef formattedURL = CFURLCreateStringByAddingPercentEscapes(allocator,
(__bridge CFStringRef) nonFormattedURL,
(__bridge CFStringRef) @"#", // characters to leave unescaped
(__bridge CFStringRef) @"", // legal characters to be escaped anyway, like / = # ? etc.
kCFStringEncodingUTF8); // encoding (a CFStringEncoding, not an NSStringEncoding)
NSLog(@"%@", (__bridge NSString *) formattedURL);
CFRelease(formattedURL); // Create rule: the caller owns the returned string
This does the same as the code above but with far more control: it replaces certain characters with the equivalent percent-escape sequence based on the specified encoding; see the log output for an example. Note that this function is deprecated in recent SDKs in favor of stringByAddingPercentEncodingWithAllowedCharacters:.
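As a different technique from the two examples above, NSURLComponents can cover both parts of the question: trimming the stray \n and \t characters, and optionally dropping everything after the '#' (the URL fragment). The sketch below is only one way to do it; CleanFeedURL and its dropFragment parameter are hypothetical names, and it assumes the trimmed string is already a valid URL.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: trims surrounding whitespace/newlines from a feed URL
// and, if requested, removes the fragment (the part after '#').
static NSURL *CleanFeedURL(NSString *raw, BOOL dropFragment) {
    NSString *trimmed = [raw stringByTrimmingCharactersInSet:
                         [NSCharacterSet whitespaceAndNewlineCharacterSet]];
    NSURLComponents *components = [NSURLComponents componentsWithString:trimmed];
    if (components == nil) {
        return nil; // Not parseable as a URL.
    }
    if (dropFragment) {
        components.fragment = nil;
    }
    return components.URL;
}
```

Because the '#' is parsed as a delimiter rather than percent-encoded, the resulting NSURL can be passed to a request without the "%23" problem described in the question.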

NSString and NSMutableString concatenation

I have three strings (an NSString, an NSMutableString, and another NSString) which I need to concatenate into a mutable string, in that order, to display as the source for a UIWebView. Coming from a PHP/JavaScript/HTML background, my knowledge of concatenation is pretty much this:
var concatenatedString = string1 + string2 + string3;
I presume that sort of thing won't work in Objective-C, so I'm wondering how to go about pulling them all together properly.
To give a bit of setting for this, the first string (NSString) is the header and canvas element of a web page, the second string (NSMutableString) is javascript from a text field that the user can define to manipulate the canvas element, and the third string (NSString) is the end tags of the web page.
Also, rather than initially creating the NSMutableString, should I just reference UITextView.text to get the user's text when concatenating the whole thing, or should I pull the text from the UITextView first?
NSMutableString *concatenatedString = [[NSString stringWithFormat:@"%@%@%@", string1, string2, string3] mutableCopy];
The other two answers are correct in that they answer the question as you asked it. But by your description of what you want to do there is a much easier way. Use a format.
Assuming string1 and string3 will always be the same and only string2 will change, which is what it sounds like you are doing, you can write something like this.
static NSString *formatString = @"String1Text%@String3Text";
NSString *completeString = [NSString stringWithFormat:formatString, self.myTextFieldName.text];
NSLog(@"%@", completeString);
The %@ in the format says to insert the description of the object that follows the format string. (The description of an NSString is the string itself.)
Assuming you have a UITextField named myTextFieldName that currently contains the text 'String2Text', this will be the output:
'String1TextString2TextString3Text'
In this way you only create 1 instance of an NSString format for the whole class no matter how many times you call this code.
To me it sounds like you don't need a mutable string at all. Feel free to leave a comment if I misunderstood anything.
Response to comment:
I'm not sure how you are implementing 'moves to test it out again' but, let's say you have a button named 'testJavaScript'. The IBAction method connected to that button would have the first two lines in it. So each time you pushed the button it would make a new formatted NSString filled with the current contents of the textfield. Once this string was formed it could not be changed. But it won't matter since next time it will make another.
NSString *concatenatedString = [string1 stringByAppendingFormat:@"%@%@", string2, string3];
You can make the resulting string mutable (if you really need to) by adding mutableCopy as shown in the answer by @Vinnie.
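If a mutable result really is needed, another option is to build it directly with NSMutableString's appendString:. The sketch below is only illustrative; BuildPage and its parameter names are hypothetical and stand for the header, the user's JavaScript, and the closing tags described in the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: concatenates the three pieces, in order, into one
// mutable string.
static NSMutableString *BuildPage(NSString *header, NSString *script, NSString *footer) {
    NSMutableString *page = [NSMutableString stringWithCapacity:
                             header.length + script.length + footer.length];
    [page appendString:header];
    [page appendString:script];
    [page appendString:footer];
    return page;
}
```

The script argument could be read straight from the text view at the moment of the call, for example BuildPage(string1, textView.text, string3), assuming textView.text is not nil.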

Manipulating HTML

I need to read an HTML file and search for some tags in it. Based on the results, some tags would need to be removed, others changed, and perhaps some attributes refined, and then the file written back.
Is NSXMLDocument the way to go? I don't think that a parser is really needed in this case, it could even mean more work. And I don't want to touch the entire file, all I need to do is to load the file in memory, change some things, and save it again.
Note that, I'll be dealing with HTML, and not XHTML. Could that be a problem for NSXMLDocument? Maybe some unmatched tags or un-closed ones could make it stop working.
NSXMLDocument is the way to go. That way you can use Xpath/Xquery to find the tags you want. Bad HTML might be a problem but you can set NSXMLDocumentTidyHTML and it should be OK unless it's really bad.
NSRange startRange = [string rangeOfString:@"<htmlTag>"];
NSRange endRange = [string rangeOfString:@"</htmlTag>"];
// The text between the opening and closing tag
NSString *subStr = [string substringWithRange:NSMakeRange(startRange.location+startRange.length, endRange.location-startRange.location-startRange.length)];
// The original omitted the replacement argument; @"" (which removes the text) is a placeholder for whatever you want instead
NSString *finalStr = [string stringByReplacingOccurrencesOfString:subStr withString:@""];
and then write finalStr to the file.
This is what I would do. Note that I don't know exactly what the advantages of using NSXMLDocument would be; this should do the job.
NSXMLDocument may fail because HTML pages are often not well formed, but you can try NSXMLDocumentTidyHTML/NSXMLDocumentTidyXML (you can use both to improve results) as outlined here, and also have a look at this for an approach to modifying the HTML.
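A minimal sketch of the NSXMLDocument route with NSXMLDocumentTidyHTML is shown below. It is macOS-only, RemoveElements is a hypothetical helper name, and the XPath expression is supplied by the caller. How tidy rewrites a given page (added elements, namespaces, entity handling) is not guaranteed here, so treat the output as something to inspect rather than assume.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (macOS only): parses possibly sloppy HTML with the tidy
// option, detaches every node matching the XPath, and returns the markup.
static NSString *RemoveElements(NSString *html, NSString *xpath, NSError **error) {
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
                                                          options:NSXMLDocumentTidyHTML
                                                            error:error];
    if (doc == nil) {
        return nil; // Too broken even for tidy.
    }
    NSArray<NSXMLNode *> *nodes = [doc nodesForXPath:xpath error:error];
    if (nodes == nil) {
        return nil; // Invalid XPath expression.
    }
    for (NSXMLNode *node in nodes) {
        [node detach]; // Remove the node from its parent.
    }
    return [doc XMLString];
}
```

For example, RemoveElements(html, @"//script", &error) would be expected to drop script elements; changing attributes instead of removing nodes would follow the same pattern, using NSXMLElement methods on the matched nodes.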

How do I find non-length-specified substrings in a string in Objective-C?

I'm trying, for the first time in my life, to contribute to open source software. Therefore I'm trying to help out on this ticket, as it seems to be a good "beginner ticket".
I have successfully got the string from the Twitter API: however, it's in this format:
<a href="http://twitter.com">Tweetie for Mac</a>
What I want to extract from this string is the URL (http://twitter.com) and the name of the Twitter client (Tweetie for Mac). How can I do this in Objective-C? As the URL's aren't the same I can't search for a specified index, and the same applies for the client name.
Assuming you have the HTML link already and aren't parsing an entire HTML page.
// Your HTML link
NSString *link = [urlstring text];
// Length of the HTML href link
NSUInteger length = [link length];
// Range of the first quote
NSRange firstQuote = [link rangeOfString:@"\""];
// Subrange to search for another quote in the HTML href link
NSRange nextQuote = NSMakeRange(firstQuote.location+1, length-firstQuote.location-1);
// Range of the second quote after the first
NSRange secondQuote = [link rangeOfString:@"\"" options:0 range:nextQuote];
// Extracts the http://twitter.com
NSRange urlRange = NSMakeRange(firstQuote.location+1, (secondQuote.location-1) - (firstQuote.location));
NSString *url = [link substringWithRange:urlRange];
// Gets the > right before Tweetie for Mac
NSRange firstCaret = [link rangeOfString:@">"];
// This appears at the start of the href link, we want the next one
NSRange firstClosedCaret = [link rangeOfString:@"<"];
NSRange nextClosedCaret = NSMakeRange(firstClosedCaret.location+1, length-firstClosedCaret.location-1);
// Gets the < right after Tweetie for Mac
NSRange secondClosedCaret = [link rangeOfString:@"<" options:0 range:nextClosedCaret];
// Range of the Twitter client name
NSRange rangeOfTwitterClient = NSMakeRange(firstCaret.location+1, (secondClosedCaret.location-1)-(firstCaret.location));
NSString *twitterClient = [link substringWithRange:rangeOfTwitterClient];
You know that this portion of the string will be the same:
...
so what you really want is a search to the first " and the closing > for the beginning of the a tag.
The easiest way to do this would be to find what is in the quotes (see this link for how to search NSStrings) and then get the text after the second to last > for your actual name.
You could also use an NSXMLParser as that works on XML specifically, but that may be overkill for this case.
I haven't looked at Adium source but you should check if there are any categories available that extend e.g. NSString with methods for parsing html/xml to more usable structures, like a node tree for example. Then you could simply walk the tree and search for the required attributes.
If not, you may parse it yourself by dividing the string into tokens (tag open, tag close, tag attributes, quoted strings and so on) and then looking for the required attributes. Alternatively, you could even use a regular expression if the strings always consist of a single HTML anchor element.
I know it's been discussed many times that regular expressions simply don't work for HTML parsing, but this is a specific scenario where they are actually reasonable, and better than running a full-blown, generic HTML/XML parser. That would be, as slycrel said, overkill.
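A rough sketch of the regular-expression route, using NSRegularExpression, might look like this. It assumes the input is always a single anchor element of the form <a href="...">...</a>, possibly with extra attributes; ParseAnchor, the pattern, and the dictionary keys are all illustrative choices, not anything taken from the Adium source.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: pulls the href value and the link text out of a
// single anchor element. Returns nil if the string does not match.
static NSDictionary<NSString *, NSString *> *ParseAnchor(NSString *anchor) {
    NSError *error = nil;
    // Group 1: the quoted href value. Group 2: the text between the tags.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        return nil;
    }
    NSTextCheckingResult *match = [regex firstMatchInString:anchor
                                                    options:0
                                                      range:NSMakeRange(0, anchor.length)];
    if (match == nil) {
        return nil;
    }
    return @{ @"url":  [anchor substringWithRange:[match rangeAtIndex:1]],
              @"name": [anchor substringWithRange:[match rangeAtIndex:2]] };
}
```

The caller would read the two values with result[@"url"] and result[@"name"], and should handle a nil result for strings that are not in the expected form.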