Removing unicode and backslash escapes from NSString converted from NSData - objective-c

I am converting the response data from a web request to an NSString in the following manner:
NSData *data = self.responseData;
if (!data) {
return nil;
}
NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)[self.response textEncodingName]));
NSString *responseString = [[NSString alloc] initWithData:data encoding:encoding];
However the resulting string looks like this:
"birthday":"04\/01\/1990",
"email":"some.address\u0040some.domain.com"
What I would like is
"birthday":"04/01/1990",
"email":"some.address#some.domain.com"
without the backslash escapes and unicode. What is the cleanest way to do this?

The response seems to be JSON-encoded. So simply decode the response string using a JSON library (SBJson, JsonKit etc.) to get the correct form.

You can replace (or remove) characters using NSString's stringByReplacingCharactersInRange:withString: or stringByReplacingOccurrencesOfString:withString:.
To remove (convert) unicode characters, use dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES (from this answer).
I'm sorry if the following has nothing to do with your case: Personally, I would ask myself where did that back-slashes come from in the first place. For example, for JSON, I'd know that some sort of JSON serializer on the other side escapes some characters (so the slashes are really there, in the response, and that is not a some weird bug in Cocoa). That way I'd able to tell for sure which characters I have to handle and how. Or maybe I'd use some kind of library to do that for me.

Related

Understanding urls correctly

I'm writing RSS reader and taking article urls from feeds, but often have invalid urls while parsing with NSXMLParser. Sometimes have extra symbols at the end of url(for example \n,\t). This issue I fixed.
Most difficult trouble is urls with queries that have characters not allowed to be url-encoded.
Working url for URL-request http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa
'#' character will replaced to "%23" by "stringByAddingPercentEscapesUsingEncoding:" method and will not work. Site will say what page not found. I believe after '#' character is a query string.
Are there a way to get(encode) any url from feeds correctly, at least always removing a query strings from xml?
There two approaches you could use to create a legal URL string by either using stringByAddingPercentEncodingWithAllowedCharacters or by using CFURL core foundation class which gives you a whole range of options.
Example 1 (NSCharacterSet):
NSString *nonFormattedURL = #"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
NSLog(#"%#", [nonFormattedURL stringByAddingPercentEncodingWithAllowedCharacters:[[NSCharacterSet illegalCharacterSet] invertedSet]]);
This still keep the hash tag in place by inverting the illegalCharacterSet in NSCharacterSet object. If you like more control you also create your own mutable set.
Example 2 (CFURL.h):
NSString *nonFormattedURL = #"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
CFAllocatorRef allocator = CFAllocatorGetDefault();
CFStringRef formattedURL = CFURLCreateStringByAddingPercentEscapes(allocator,
(__bridge CFStringRef) nonFormattedURL,
(__bridge CFStringRef) #"#", //leave unescaped
(__bridge CFStringRef) #"", // legal characters to be escaped like / = # ? etc
NSUTF8StringEncoding); // encoding
NSLog(#"%#", formattedURL);
Does the same as above code but with way more control: replacing certain characters with the equivalent percent escape sequence based on the encoding specified, see logs for example.

NSData subStringFromIndex: Equivalent

I'm receiving a stream of NSData characters with around at least 50 characters. Usually, I would try and convert this to an NSString and use the subStringFromIndex: selector, but it seems like NSString is NULL terminating (correct me if I'm wrong) and I'd rather skip the data / string conversion. Does anyone know if there is a way to get the charecter at specific index in NSData? For example, say that the data returned is:
<12345678 9abcdefg hjiklmno>
Lets say I would like to get the 7 and the 8 out, and just those two alone. To get the 7 and 8, I've looked into trying something like this:
NSData *dataTrimmed = [data subdataWithRange:NSMakeRange(7, -19)];
Works like a charm. But the issue is, the stream is always going to be a different length. It could be 100 characters or it could be 50, but I always know that the two values I need are located at the 42nd and 43rd spot. Does anyone have an example of or know the best way to do this?
I wonder that your code with a negative length does not crash.
To get the two bytes at position 42, 43, just use
NSData *dataTrimmed = [data subdataWithRange:NSMakeRange(42, 2)];
Why do you want to skip the conversion to NSString?
The string you receive is encoded as NSData. Depending on the encoding each character will be represented as one or multiple bytes. If it is UTF8 encoded, some characters will be represented as one byte while other characters will be represented by two or more bytes.
For this reason, if you want your code be robust and handle different encodings and different string content you should first convert your NSData to a NSString and then index the string.
If your string is UTF-8 encoded you could do the following:
NSData *data = ...
NSString *str = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
NSString *subString = [str substringFromIndex:...
In my view, it only makes sense to skip converting to NSString if you receive a lot of data and you control both encoding and the contents of the string data you receive.
As the saying goes: Premature optimization is the root of all evil.

Unihan: combining UTF-8 chars

I am using data that involves Chinese Unihan characters in an Objective-C app. I am using a voice recognition program (cmusphinx) that returns a phrase from my data. It returns UTF-8 characters and when returning a Chinese character (which is three bytes) it separates it into three separate characters.
Example: When I want 人 to, I see: ‰∫∫. This is the proper in coding (E4 BA BA), but my code sees the returned value as three seperate characters rather than one.
Actually, my function is receiving the phrase as an NSString, (due to a wrap around) which uses UTF-16. I tried using Objective-C's built in conversion methods (to UTF-8 and from UTF-16), but these keep my string as three characters.
How can I decode these three separate characters into the one utf-8 codepoint for the Chinese character?
Or how can I properly encode it?
This is code fragment dealing with the cstring returned from sphinx and its encoding to a NSString:
const char * hypothesis = ps_get_hyp(pocketSphinxDecoder, &recognitionScore, &utteranceID);
NSString *hypothesisString = [[NSString alloc] initWithCString:hypothesis encoding:NSMacOSRomanEncoding];
Edit: From looking at the addition to your post, you actually do have control over the string encoding. In that case, why are you creating the string with NSMacOSRomanEncoding when you're expecting utf-8? Just change that to NSUTF8StringEncoding.
It sounds like what you're saying is you're being given an NSString that contains UTF-8 data that's being interpreted as a single-byte encoding (e.g. ISO-Latin-1, MacRoman, etc). I'm assuming here that you have no control over the code that creates the NSString, because if you did then the solution is just to change the encoding it's initializing with.
In any case, what you're asking for is a way to take the data in the string and convert it back to UTF-8. You can do this by creating an NSData from the NSString using whatever encoding its was originally created with (you need to know this much, at least, or it won't work), and then you can create a new NSString from the same data using UTF-8.
From the example character you gave (人) it looks like it's being interpreted as MacRoman, so lets go with that. The following code should convert it back:
- (NSString *)fixEncodingOfString:(NSString *)input {
CFStringEncoding cfEncoding = kCFStringEncodingMacRoman;
NSStringEncoding encoding = CFStringCovnertEncodingToNSStringEncoding(cfEncoding);
NSData *data = [input dataUsingEncoding:encoding];
if (!data) {
// the string wasn't actually in MacRoman
return nil;
}
NSString *output = [[[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] autorelease];
}

NSString writeToFile with URL encoded string

I have a Mac application that keeps it's own log file. It appends info to the file using NSString's writeToFile method. One of the things that it logs are URL's of web services that it is interacting with. To encode the URL, I'm doing this:
searchString = (NSString *)CFURLCreateStringByAddingPercentEscapes(NULL, (CFStringRef)searchString, NULL, (CFStringRef)#"!*'();:#&=+$,/?%#[]", kCFStringEncodingUTF8 );
The app then appends searchString to the rest of the URL and writes it to the log file. Now the problem is that after adding that URL encoding line, nothing seems to be getting written to the file. The program functions as expected otherwise however. Removing the line of code above results in all of the correct information being logged to the file (removing that line is not an option because searchString must be URL encoded).
Oh and I am using NSUTF8StringEncoding when writing the NSString to the file.
Thanks for any help.
EDIT: I know there's also a similar function to CFURLCreateStringByAddingPercentEscapes in NSString, but I've read that it doesn't always work. Can anyone shed some light on this if my original question cannot be answered? Thanks! (EDIT: same problem occurs when using stringByAddingPercentEscapesUsingEncoding:)
EDIT 2: Here's the code that I'm using to append messages to the log file.
+(void)logText:(NSString *)theString{
NSString *docsDirectory = [NSSearchPathForDirectoriesInDomains(NSApplicationSupportDirectory,NSUserDomainMask,YES) objectAtIndex:0];
NSString *path = [docsDirectory stringByAppendingPathComponent:#"Folder/File.log"];
NSString *fileContents = [[[NSString alloc] initWithContentsOfFile:path] autorelease];
if([fileContents lengthOfBytesUsingEncoding:NSUTF8StringEncoding] >= 204800){
fileContents = #"";
}
NSString *timeStamp = [[NSDate date] description];
timeStamp = [timeStamp stringByAppendingString:#": "];
timeStamp = [timeStamp stringByAppendingString:theString];
fileContents = [fileContents stringByAppendingString:timeStamp];
fileContents = [fileContents stringByAppendingString:#"\n"];
[fileContents writeToFile:path atomically:YES encoding:NSUTF8StringEncoding error:NULL];
}
Because after almost a whole day no one else has offered any answers, I'm going to post a wild guess here: you're not accidentally using the string you want to output (with percent characters in it) as a format string are you?
That is, making the mistake of doing:
NSLog(#"In format strings you can use %# as a placeholder for an object, and %i for a plain C integer.")
Instead of:
NSLog(#"%#", #"In format strings you can use %# as a placeholder for an object, and %i for a plain C integer.");
But I'm going to be surprised if this turns out to be the cause of your problem, as it usually causes random-looking output, rather than absolutely no output. And in some cases, Xcode also gives compiler warnings about it (when I tried NSLog(myString), I got "warning: format not a string literal and no format arguments").
So don't shoot me down if this answer doesn't help. It would be easier to answer your question if you could show us more of your logging code. As for the one line you provided, I can't detect anything wrong with it.
Edit: Oops, I kind of missed that you mentioned you're using writeToFile:atomically:encoding:error: to write the string to the file, so it's even more unlikely you're accidentally treating it as a format string somewhere. But I'm going to leave this answer up for now. Again, you should really show us more of your code though ...
Edit: Regarding your question on a method in NSString that has similar percent encoding functionality, that would be stringByAddingPercentEscapesUsingEncoding:. I'm not sure what kind of problems you're thinking of when you say you've heard it doesn't always work. But one thing is that CFURLCreateStringByAddingPercentEscapes allows you to specify extra characters that don't normally have to be escaped but which you still want to be escaped, while the method of NSString doesn't allow you to specify this.

Unicode data from NSData to NSString

So if I have NSData from an HTTP request, then I do something like this:
NSString *test = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
This will result in null if the data contains weird unicode data (title is from reddit):
{"title":"click..██me..and..then██________ ██check██_.your...██.__...██____ ██....██████████████....██____ ██████....██████....██████____ ██████████████████████____ ....██████████████████______ ........██..._recently....██________ ....██....viewed....links....██_____"},
How would I convert the data to a string?
Ideally, it would best if the string wasn't null so I could parse it as JSON, but even a lossy conversion is fine with me in these cases.
I'm not familiar with unicode (naive American I am), so any enlightenment about that would be a nice bonus :)
If I copy and paste that text into a UTF-8 text file, read it with dataWithContentsOfURL: and convert it to a string with initWithData:encoding:, it works fine. The most likely explanation is that you are not getting valid UTF-8 data.