Unihan: combining UTF-8 chars - objective-c

I am using data that involves Chinese Unihan characters in an Objective-C app. I am using a voice recognition program (cmusphinx) that returns a phrase from my data. It returns UTF-8 characters and when returning a Chinese character (which is three bytes) it separates it into three separate characters.
Example: When I want 人, I see ‰∫∫. This is the proper encoding (E4 BA BA), but my code sees the returned value as three separate characters rather than one.
Actually, my function is receiving the phrase as an NSString (due to a wrapper), which uses UTF-16. I tried using Objective-C's built-in conversion methods (to UTF-8 and from UTF-16), but these keep my string as three characters.
How can I decode these three separate characters into the one UTF-8 code point for the Chinese character?
Or how can I properly encode it?
This is the code fragment dealing with the C string returned from sphinx and its encoding to an NSString:
const char * hypothesis = ps_get_hyp(pocketSphinxDecoder, &recognitionScore, &utteranceID);
NSString *hypothesisString = [[NSString alloc] initWithCString:hypothesis encoding:NSMacOSRomanEncoding];

Edit: From looking at the addition to your post, you actually do have control over the string encoding. In that case, why are you creating the string with NSMacOSRomanEncoding when you're expecting utf-8? Just change that to NSUTF8StringEncoding.
It sounds like what you're saying is you're being given an NSString that contains UTF-8 data that's being interpreted as a single-byte encoding (e.g. ISO-Latin-1, MacRoman, etc). I'm assuming here that you have no control over the code that creates the NSString, because if you did then the solution is just to change the encoding it's initializing with.
In any case, what you're asking for is a way to take the data in the string and convert it back to UTF-8. You can do this by creating an NSData from the NSString using whatever encoding it was originally created with (you need to know this much, at least, or it won't work), and then you can create a new NSString from the same data using UTF-8.
From the example character you gave (人) it looks like it's being interpreted as MacRoman, so let's go with that. The following code should convert it back:
- (NSString *)fixEncodingOfString:(NSString *)input {
    // Re-encode the string into the bytes it was (incorrectly) decoded from.
    CFStringEncoding cfEncoding = kCFStringEncodingMacRoman;
    NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(cfEncoding);
    NSData *data = [input dataUsingEncoding:encoding];
    if (!data) {
        // the string wasn't actually in MacRoman
        return nil;
    }
    // Reinterpret those bytes as UTF-8; this yields nil if they are not valid UTF-8.
    NSString *output = [[[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] autorelease];
    return output;
}
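The round trip described above (encode back to the original bytes with the wrong encoding, then decode those bytes as UTF-8) is easy to check outside of Foundation. The following is a minimal sketch in Python rather than Objective-C, purely to show the byte-level behavior; the function name fix_encoding is hypothetical, and it assumes the mis-decoding really was MacRoman.

```python
def fix_encoding(text, wrong_encoding="mac_roman"):
    """Undo a decode that used the wrong single-byte encoding.

    Returns None if the text cannot be mapped back to bytes in
    wrong_encoding, or if those bytes are not valid UTF-8.
    """
    try:
        raw = text.encode(wrong_encoding)   # recover the original bytes
        return raw.decode("utf-8")          # reinterpret them as UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# The three MacRoman characters map back to the bytes E4 BA BA.
print("‰∫∫".encode("mac_roman").hex())  # e4baba
print(fix_encoding("‰∫∫"))              # 人
```

As in the Objective-C version, a failure at either step is reported as "no result" rather than raising, so the caller can tell that the guess about the original encoding was wrong.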

Related

NSData subStringFromIndex: Equivalent

I'm receiving a stream of NSData that is at least around 50 characters long. Usually, I would try to convert this to an NSString and use the substringFromIndex: selector, but it seems like NSString is NULL-terminated (correct me if I'm wrong) and I'd rather skip the data / string conversion. Does anyone know if there is a way to get the character at a specific index in NSData? For example, say that the data returned is:
<12345678 9abcdefg hjiklmno>
Let's say I would like to get the 7 and the 8 out, and just those two alone. To get the 7 and 8, I've looked into trying something like this:
NSData *dataTrimmed = [data subdataWithRange:NSMakeRange(7, -19)];
Works like a charm. But the issue is, the stream is always going to be a different length. It could be 100 characters or it could be 50, but I always know that the two values I need are located at the 42nd and 43rd spot. Does anyone have an example of or know the best way to do this?
I'm surprised that your code with a negative length does not crash.
To get the two bytes at position 42, 43, just use
NSData *dataTrimmed = [data subdataWithRange:NSMakeRange(42, 2)];
Why do you want to skip the conversion to NSString?
The string you receive is encoded as NSData. Depending on the encoding each character will be represented as one or multiple bytes. If it is UTF8 encoded, some characters will be represented as one byte while other characters will be represented by two or more bytes.
For this reason, if you want your code to be robust and handle different encodings and different string content, you should first convert your NSData to an NSString and then index the string.
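To make the difference between byte offsets and character offsets concrete, here is a small sketch in Python (used only for illustration, not Objective-C). The sample string and the helper is_valid_utf8 are made up for the example.

```python
def is_valid_utf8(chunk):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "a人b"                 # 3 characters
data = text.encode("utf-8")   # 5 bytes: 61 E4 BA BA 62

# Slicing the raw bytes can cut a multi-byte character in half.
byte_slice = data[1:3]        # E4 BA, only part of 人
print(is_valid_utf8(byte_slice))   # False

# Slicing after decoding always yields whole characters.
print(text[1:2])                   # 人
```

If the payload is known to be single-byte text (for example plain ASCII digits), byte offsets and character offsets coincide and slicing the raw data is safe.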
If your string is UTF-8 encoded you could do the following:
NSData *data = ...
NSString *str = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
NSString *subString = [str substringFromIndex:...
In my view, it only makes sense to skip converting to NSString if you receive a lot of data and you control both encoding and the contents of the string data you receive.
As the saying goes: Premature optimization is the root of all evil.

NSData to NSString returns Null

I have searched, but still couldn't get it.
I'm converting NSData to NSString.
When I do [data description];
it returns <00000000 31323334 35363738>
Yes, I'm receiving my string @"12345678".
How do I convert it to NSString appropriately?
I tried
NSString *b = [NSString stringWithUTF8String:[data bytes]];
NSString *a = [[NSString alloc] initWithBytes:[data bytes] length:[data length] encoding:NSUTF8StringEncoding];
Both return null.
Any idea?
Thanks
Hi all,
Thanks for all the suggestions.
It appears that there is always a null character in front.
So whenever I receive something, I just remove the leading <00000000>, and then it works fine.
This happens if the encoding is incorrect.
Try using ASCII as a test. ASCII will almost certainly succeed in retrieving some kind of string. If the data is only digits, it will probably work.
NSString *a = [[NSString alloc] initWithBytes:[data bytes] length:[data length] encoding:NSASCIIStringEncoding];
The most common encodings other than UTF-8 are:
NSASCIIStringEncoding
NSUnicodeStringEncoding
NSISOLatin1StringEncoding
NSISOLatin2StringEncoding
NSSymbolStringEncoding
Try them out and see if they work.
"I'm converting NSData to NSString. When I do [data description]; it returns <00000000 31323334 35363738>. Yes, I'm receiving my string @"12345678"."
No -- you aren't receiving that string. You are receiving a byte sequence that starts with a bunch of 0x00 values and is followed by a series of bytes that happen to correspond to the ASCII sequence "12345678".
I.e. you have raw data and are trying to convert it to a constrained type, but can't because the constrained type cannot represent the raw data.
You could try using the "lossy conversion" APIs on NSString, but that might not work and would be fragile anyway.
Best bet?
Only convert the bytes in the NSData that actually represent the string to an instance of NSString. That can be done with -initWithBytes:length:encoding:; you'll need to do the calculations to find the correct offset and length.
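The offset and length calculation can be sketched as follows. This is an illustration in Python, not Objective-C, and it assumes (as the asker later reports) that the payload is simply preceded by a run of 0x00 bytes; the function name decode_after_padding is hypothetical. Python itself decodes NUL bytes without complaint, so the sketch only shows the arithmetic, not the Foundation behavior.

```python
def decode_after_padding(data, encoding="ascii"):
    """Skip leading 0x00 bytes and decode only the remaining bytes."""
    offset = len(data) - len(data.lstrip(b"\x00"))  # number of leading NULs
    length = len(data) - offset                     # bytes that hold the text
    return data[offset:offset + length].decode(encoding)


data = bytes.fromhex("00000000 31323334 35363738")
print(decode_after_padding(data))  # 12345678
```

In Objective-C the same offset and length would be passed to -initWithBytes:length:encoding:, using a pointer advanced by the offset.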
This may be because the first bytes of your data are 00. The character 0 is the end-of-string character. When creating a string from ASCII (from an array of chars or an array of bytes, as you are doing), a character 0 encountered at the beginning produces an empty string.
I would however expect it to return an instance of NSString with 0 characters, and not null.

Removing unicode and backslash escapes from NSString converted from NSData

I am converting the response data from a web request to an NSString in the following manner:
NSData *data = self.responseData;
if (!data) {
return nil;
}
NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)[self.response textEncodingName]));
NSString *responseString = [[NSString alloc] initWithData:data encoding:encoding];
However the resulting string looks like this:
"birthday":"04\/01\/1990",
"email":"some.address\u0040some.domain.com"
What I would like is
"birthday":"04/01/1990",
"email":"some.address@some.domain.com"
without the backslash escapes and unicode. What is the cleanest way to do this?
The response seems to be JSON-encoded. So simply decode the response string using a JSON library (SBJson, JSONKit, etc.) to get the correct form.
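To see that the escapes are part of the JSON encoding and disappear once the text is parsed, here is a minimal sketch in Python (for illustration only; in an Objective-C app a JSON library would play the same role). The wrapped object is an assumption, since the question shows only a fragment of the response.

```python
import json

# Raw response text, with the fragment from the question wrapped in braces.
raw = r'{"birthday":"04\/01\/1990","email":"some.address\u0040some.domain.com"}'

parsed = json.loads(raw)

print(parsed["birthday"])  # 04/01/1990
print(parsed["email"])     # some.address@some.domain.com
```

Both `\/` and `\u0040` are legal JSON string escapes, so a conforming parser turns them back into `/` and `@` without any manual string replacement.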
You can replace (or remove) characters using NSString's stringByReplacingCharactersInRange:withString: or stringByReplacingOccurrencesOfString:withString:.
To remove (convert) unicode characters, use dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES (from this answer).
I'm sorry if the following has nothing to do with your case: Personally, I would ask myself where those backslashes came from in the first place. For example, for JSON, I'd know that some sort of JSON serializer on the other side escapes some characters (so the slashes are really there, in the response, and that is not some weird bug in Cocoa). That way I'd be able to tell for sure which characters I have to handle and how. Or maybe I'd use some kind of library to do that for me.

Convert special characters like ë,à,é,ä all to e,a,e,a? Objective C

Is there a simple way in Objective-C to convert all special characters like ë,à,é,ä to normal characters like e and a?
Yep, and it's pretty simple:
NSString *src = @"Convert special characters like ë,à,é,ä all to e,a,e,a? Objective C";
NSData *temp = [src dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *dst = [[[NSString alloc] initWithData:temp encoding:NSASCIIStringEncoding] autorelease];
NSLog(@"converted: %@", dst);
Running that on my machine produces:
EmptyFoundation[69299:a0f] converted: Convert special characters like e,a,e,a all to e,a,e,a? Objective C
Basically, we're asking the string to transform itself into an NSData (i.e., a byte array) that represents the characters in the string in the ASCII character set. Since not all of the characters in the original string are in ASCII, we tell the string that it's OK to do a "lossy" conversion. In other words, it's OK to turn "é" into "e", and so on.
Once we've got our byte array, we simply turn it back into a string, and we're done! :)
CFStringTransform
CFStringTransform is the solution when you are dealing with a specific language. It transliterates strings in ways that simplify normalization, indexing, and searching. For example, it can remove accent marks using the option kCFStringTransformStripCombiningMarks:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("Schläger"));
CFStringTransform(string, NULL, kCFStringTransformStripCombiningMarks, false);
// string is now "Schlager"
CFRelease(string);
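The effect of stripping combining marks can be approximated with plain Unicode normalization: decompose each character, then drop the combining marks. The sketch below is in Python for illustration and is only roughly analogous to the transform above; it is not a description of how CFStringTransform is implemented, and the function name strip_combining_marks is hypothetical.

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose to NFD, drop combining marks, and recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


print(strip_combining_marks("Schläger"))  # Schlager
print(strip_combining_marks("ë,à,é,ä"))   # e,a,e,a
```

Characters that have no decomposition (for example ø or ß) pass through unchanged with this approach.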
CFStringTransform is even more powerful when you are dealing with non-Latin writing systems such as Arabic or Chinese. It can convert many writing systems to Latin script, making normalization much simpler.
For example, you can convert Chinese script to Latin script like this:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("你好"));
CFStringTransform(string, NULL, kCFStringTransformToLatin, false);
// string is now "nǐ hǎo"
CFStringTransform(string, NULL, kCFStringTransformStripCombiningMarks, false);
// string is now "ni hao"
CFRelease(string);
Notice that the option is simply kCFStringTransformToLatin. The source language is not required. You can hand almost any string to this transform without having to know first what language it is in. CFStringTransform can also transliterate from Latin script to other writing systems such as Arabic, Hangul, Hebrew, and Thai.
Reference: iOS 7 Programming: Pushing the Limits

Unicode data from NSData to NSString

So if I have NSData from an HTTP request, then I do something like this:
NSString *test = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
This will result in null if the data contains weird unicode data (title is from reddit):
{"title":"click..██me..and..then██________ ██check██_.your...██.__...██____ ██....██████████████....██____ ██████....██████....██████____ ██████████████████████____ ....██████████████████______ ........██..._recently....██________ ....██....viewed....links....██_____"},
How would I convert the data to a string?
Ideally, it would be best if the string wasn't null so I could parse it as JSON, but even a lossy conversion is fine with me in these cases.
I'm not familiar with unicode (naive American I am), so any enlightenment about that would be a nice bonus :)
If I copy and paste that text into a UTF-8 text file, read it with dataWithContentsOfURL: and convert it to a string with initWithData:encoding:, it works fine. The most likely explanation is that you are not getting valid UTF-8 data.
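One way to test that explanation is to check whether the bytes are valid UTF-8 and, if they are not, fall back to a lossy decode that substitutes U+FFFD for each bad byte. The following sketch uses Python for illustration; the sample bytes and the function name decode_utf8_lossy are made up, since the actual response data is not shown.

```python
def decode_utf8_lossy(data):
    """Return (text, was_valid): strict decode if possible, lossy otherwise."""
    try:
        return data.decode("utf-8"), True
    except UnicodeDecodeError:
        # Replace each undecodable byte with U+FFFD instead of failing.
        return data.decode("utf-8", errors="replace"), False


good = '{"title":"click..██me"}'.encode("utf-8")
bad = b'{"title":"click..\xffme"}'   # 0xFF can never appear in valid UTF-8

print(decode_utf8_lossy(good))  # ('{"title":"click..██me"}', True)
print(decode_utf8_lossy(bad))   # ('{"title":"click..\ufffdme"}', False)
```

If the strict decode fails, the more useful fix is usually to find out which encoding the server actually sent rather than to rely on the lossy result.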