Understanding urls correctly - objective-c

I'm writing RSS reader and taking article urls from feeds, but often have invalid urls while parsing with NSXMLParser. Sometimes have extra symbols at the end of url(for example \n,\t). This issue I fixed.
Most difficult trouble is urls with queries that have characters not allowed to be url-encoded.
Working url for URL-request http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa
'#' character will replaced to "%23" by "stringByAddingPercentEscapesUsingEncoding:" method and will not work. Site will say what page not found. I believe after '#' character is a query string.
Are there a way to get(encode) any url from feeds correctly, at least always removing a query strings from xml?

There two approaches you could use to create a legal URL string by either using stringByAddingPercentEncodingWithAllowedCharacters or by using CFURL core foundation class which gives you a whole range of options.
Example 1 (NSCharacterSet):
NSString *nonFormattedURL = #"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
NSLog(#"%#", [nonFormattedURL stringByAddingPercentEncodingWithAllowedCharacters:[[NSCharacterSet illegalCharacterSet] invertedSet]]);
This still keep the hash tag in place by inverting the illegalCharacterSet in NSCharacterSet object. If you like more control you also create your own mutable set.
Example 2 (CFURL.h):
NSString *nonFormattedURL = #"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
CFAllocatorRef allocator = CFAllocatorGetDefault();
CFStringRef formattedURL = CFURLCreateStringByAddingPercentEscapes(allocator,
(__bridge CFStringRef) nonFormattedURL,
(__bridge CFStringRef) #"#", //leave unescaped
(__bridge CFStringRef) #"", // legal characters to be escaped like / = # ? etc
NSUTF8StringEncoding); // encoding
NSLog(#"%#", formattedURL);
Does the same as above code but with way more control: replacing certain characters with the equivalent percent escape sequence based on the encoding specified, see logs for example.

Related

How to discover if a c-string can be encoded to NSString with a given encoding

I am trying to implement code that converts const char * to NSString. I would like to try multiple encodings in a specified order until I find one that works. Unfortunately, all the initWith... methods on NSString say that the results are undefined if the encoding doesn't work.
In particular, (sometimes) I would like to try first to encode as NSMacOSRomanStringEncoding which never seems to fail. Instead it just encodes gobbledygook. Is there some kind of check I can perform ahead of time? (Like canBeConvertedToEncoding but in the other direction?)
Instead of trying encodings one by one until you find a match, consider asking NSString to help you out here by using +[NSString stringEncodingForData:encodingOptions:convertedString:usedLossyConversion:], which, given string data and some options, may be able to detect the encoding for you, and return it (along with the actual decoded string).
Specifically for your use-case, since you have a list of encodings you'd like to try, the encodingOptions parameter will allow you to pass those encodings in using the NSStringEncodingDetectionSuggestedEncodingsKey.
So, given a C string and some possible encoding options, you might be able to do something like:
NSString *decodeCString(const char *source, NSArray<NSNumber *> *encodings) {
NSData * const cStringData = [NSData dataWithBytesNoCopy:(void *)source length:strlen(source) freeWhenDone:NO];
NSString *result = nil;
BOOL usedLossyConversion = NO;
NSStringEncoding determinedEncoding = [NSString stringEncodingForData:cStringData
encodingOptions:#{NSStringEncodingDetectionSuggestedEncodingsKey: encodings,
NSStringEncodingDetectionUseOnlySuggestedEncodingsKey: #YES}
convertedString:&result
usedLossyConversion:&usedLossyConversion];
/* Decide whether to do anything with `usedLossyConversion` and `determinedEncoding. */
return result;
}
Example usage:
NSString *result = decodeCString("Hello, world!", #[#(NSShiftJISStringEncoding), #(NSMacOSRomanStringEncoding), #(NSASCIIStringEncoding)]);
NSLog(#"%#", result); // => "Hello, world!"
If you don't 100% care about using only the list of encodings you want to try, you can drop the NSStringEncodingDetectionUseOnlySuggestedEncodingsKey option.
One thing to note about the encoding array you pass in: although the documentation doesn't promise that the suggested encodings are attempted in order, spelunking through the disassembly of the (current) method implementation shows that the array is enumerated using fast enumeration (i.e., in order). I can imagine that this could change in the future (or have been different in the past) so if this is somehow a hard requirement for you, you could theoretically work around it by repeatedly calling +stringEncodingForData:encodingOptions:convertedString:usedLossyConversion: one encoding at a time in order, but this would likely be incredibly expensive given the complexity of this method.

Using Placeholders in a URL string

I have a url that retrieves data from a Web API which looks like this with the entry of "Pizza Hut":
NSString *urlString = #"https://api.nutritionix.com/v1_1/search/Pizza Hut?results=0%3A20&cal_min=0&cal_max=50000&fields=item_name%2Cbrand_name%2Citem_id%2Cbrand_id&appId=MY_APP_ID&appKey=MY_APP_KEY";
This URL will return all the menu items of Pizza Hut.
Now I want to take a step beyond hard coding values, and so I created a text box where users can enter their own restaurant, and the web api should return data.
Here is an example of that:
NSString *urlString = [NSString stringWithFormat:#"https://api.nutritionix.com/v1_1/search/%#?results=0%3A20&cal_min=0&cal_max=50000&fields=item_name%2Cbrand_name%2Citem_id%2Cbrand_id&appId=MY_APP_ID&appKey=MY_APP_KEY", searchText.text];
All I did here was change the "Pizza Hut" to "%#".
However, I get a warning from the compiler saying:
"More '%' conversions than data arguments. As you would expect, the API returns no data, for this code doesn't seem to be working.
How would I re-write this string so that I could put the placeholder in there?
You have other percent symbols that need to be escaped properly. You want:
NSString *urlString = [NSString stringWithFormat:#"https://api.nutritionix.com/v1_1/search/%#?results=0%%3A20&cal_min=0&cal_max=50000&fields=item_name%%2Cbrand_name%%2Citem_id%%2Cbrand_id&appId=MY_APP_ID&appKey=MY_APP_KEY", searchText.text];
Basically, add a 2nd % symbol before all of the % symbols that you actually want to appear in the string.
BTW - make sure you properly escape the search text so special characters (such as spaces) are properly encoded.

Removing unicode and backslash escapes from NSString converted from NSData

I am converting the response data from a web request to an NSString in the following manner:
NSData *data = self.responseData;
if (!data) {
return nil;
}
NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)[self.response textEncodingName]));
NSString *responseString = [[NSString alloc] initWithData:data encoding:encoding];
However the resulting string looks like this:
"birthday":"04\/01\/1990",
"email":"some.address\u0040some.domain.com"
What I would like is
"birthday":"04/01/1990",
"email":"some.address#some.domain.com"
without the backslash escapes and unicode. What is the cleanest way to do this?
The response seems to be JSON-encoded. So simply decode the response string using a JSON library (SBJson, JsonKit etc.) to get the correct form.
You can replace (or remove) characters using NSString's stringByReplacingCharactersInRange:withString: or stringByReplacingOccurrencesOfString:withString:.
To remove (convert) unicode characters, use dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES (from this answer).
I'm sorry if the following has nothing to do with your case: Personally, I would ask myself where did that back-slashes come from in the first place. For example, for JSON, I'd know that some sort of JSON serializer on the other side escapes some characters (so the slashes are really there, in the response, and that is not a some weird bug in Cocoa). That way I'd able to tell for sure which characters I have to handle and how. Or maybe I'd use some kind of library to do that for me.

Unihan: combining UTF-8 chars

I am using data that involves Chinese Unihan characters in an Objective-C app. I am using a voice recognition program (cmusphinx) that returns a phrase from my data. It returns UTF-8 characters and when returning a Chinese character (which is three bytes) it separates it into three separate characters.
Example: When I want 人 to, I see: ‰∫∫. This is the proper in coding (E4 BA BA), but my code sees the returned value as three seperate characters rather than one.
Actually, my function is receiving the phrase as an NSString, (due to a wrap around) which uses UTF-16. I tried using Objective-C's built in conversion methods (to UTF-8 and from UTF-16), but these keep my string as three characters.
How can I decode these three separate characters into the one utf-8 codepoint for the Chinese character?
Or how can I properly encode it?
This is code fragment dealing with the cstring returned from sphinx and its encoding to a NSString:
const char * hypothesis = ps_get_hyp(pocketSphinxDecoder, &recognitionScore, &utteranceID);
NSString *hypothesisString = [[NSString alloc] initWithCString:hypothesis encoding:NSMacOSRomanEncoding];
Edit: From looking at the addition to your post, you actually do have control over the string encoding. In that case, why are you creating the string with NSMacOSRomanEncoding when you're expecting utf-8? Just change that to NSUTF8StringEncoding.
It sounds like what you're saying is you're being given an NSString that contains UTF-8 data that's being interpreted as a single-byte encoding (e.g. ISO-Latin-1, MacRoman, etc). I'm assuming here that you have no control over the code that creates the NSString, because if you did then the solution is just to change the encoding it's initializing with.
In any case, what you're asking for is a way to take the data in the string and convert it back to UTF-8. You can do this by creating an NSData from the NSString using whatever encoding its was originally created with (you need to know this much, at least, or it won't work), and then you can create a new NSString from the same data using UTF-8.
From the example character you gave (人) it looks like it's being interpreted as MacRoman, so lets go with that. The following code should convert it back:
- (NSString *)fixEncodingOfString:(NSString *)input {
CFStringEncoding cfEncoding = kCFStringEncodingMacRoman;
NSStringEncoding encoding = CFStringCovnertEncodingToNSStringEncoding(cfEncoding);
NSData *data = [input dataUsingEncoding:encoding];
if (!data) {
// the string wasn't actually in MacRoman
return nil;
}
NSString *output = [[[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] autorelease];
}

Xcode - EXEC_BAD_ACCESS when concatenting a large string

I'm getting a EXEC_BAD_ACCESS when concatenting a large string.
I've read from a feed and to create my webview I build up my string like:
NSString *pageData = #"<h1>header</h1>";
pageData = [pageData stringByAppendingFormat#"<p>"];
pageData = [pageData stringByAppendingFormat#"self.bodyText"];
pageData = [pageData stringByAppendingFormat#"</p>"];
etc
The problem I've got is self.bodytext is 21,089 characters with spaces when I do a count on word.
Is there a better method for doing this?
Thanks
You would definitely want to use NSMutableString for something like this:
NSMutableString * pageData = [NSMutableString stringWithCapacity:0];
[pageData appendFormat:#"<h1>header</h1>"];
[pageData appendFormat:#"<p>"];
...
NSMutableString is designed for this kind of sequential concatenation, where the basic NSString class is really not meant to be used in this manner. Your original code would actually allocate a new NSString every time you called stringByAppendFormat:, and then procede to copy into it all of the thousands of characters you had already appended. This could easily result in an out of memory error, since the size of the temporary strings would be growing exponentially as you add more and more calls.
Using NSMutableString will not re-copy all of the string data when you call appendFormat:, since the mutable string maintains an internal buffer and simply tacks new strings on to the end of it. Depending on the size of your string, you may want to reserve a huge chunk of memory ahead of time (use a meaningful number for the ...WithCapacity: argument). But there is no need to go that route unless you actually run into performance issues.
There are a few problems with your sample code:
You should be using a NSMutableString to build up an output string by appending multiple parts. NSString is an immutable class which means that each time you call stringByAppendingFormat: you are incurring the overhead of creating an additional new NSString object which will need to be collected and released by the autorelease pool.
NSMutableString * pageData = [NSMutableString stringWithCapacity:0];
You should use appendString: on your NSMutableString to append content, instead of stringByAppendingFormat: or appendFormat:. The format methods are intended for creating new strings based on a format specifier which includes special fields as placeholders. See Formatting String Objects for more details. When you're using stringByAppendingFormat: with just a literal string like your code has, you are incurring the overhead of parsing the string for the non-existant placeholders, and more importantly, if the string happens to have a placeholder (or something that looks like one) in it, you'll end up with the EXEC_BAD_ACCESS crash that you are getting. Most likely this happening when your bodyText is appended. Thus if you simply want to append a '' to your NSMutableString do something like this:
[pageData appendString:#"<p>"];
If you want to append the contents of the self.bodyText property to the string, you shouldn't put the name of the property inside of a string literal (i.e. #"self.bodyText" is the literal string "self.bodyText", not the contents of the property. Try:
[pageData appendString:self.bodyText];
As an example, you could actually combine all three lines of your sample code by using a format specification:
pageData = [pageData stringByAppendingFormat:#"<p>%#</p>", self.bodyText];
In the format specification %# is a placeholder that means insert the result of sending the description or descriptionWithLocale: message to the object. For an NSString this is simply the contents of the string.
I doubt the length of the string is really a problem. A 50,000-character string is only about 100 KB. But you want to be very careful about using format strings. If your string contains something that looks like a formatting specifier, there had better be a corresponding argument or you'll get garbage if you're lucky and a crash if you're not. I suspect this is the error, since there is no other obvious problem from your description. Be careful about what you put in there, and avoid ever putting dynamic text in a format string — just put a %# in the format string and pass the dynamic text as an argument.
Use appendString: instead of appendFormat: when dealing with arbitrary strings.
pageData = [pageData stringByAppendingString:#"<p>"];
pageData = [pageData stringByAppendingString:#"self.bodyText"];
pageData = [pageData stringByAppendingString:#"</p>"];
or do not use an arbitrary string as the format:
pageData = [pageData stringByAppendingFormat:#"<p>%#</p>" , #"self.bodyText"];
If you are building the string up in pieces, use NSMutableString instead of several stringBy calls.
Remember that % is a special character for formatted strings and for url escapes, so if bodyText contains a url it could easily cause a crash.