NSString and unichar don't match well when it comes to Unicode - objective-c

Apple's documentation states that
A string object is implemented as an array of Unicode characters
However, the unichar data type, which is likely to be unsigned short behind the scenes, is only 16 bits wide, which makes it impossible to represent every Unicode character with a single unichar. How do I reconcile these two facts in my mind?

You are correct that Apple's docs incorrectly refer to Unicode characters when they really mean UTF-16 code units.
In the early days of Unicode it was hoped that it would not exceed 16 bits, but it has. Both Apple and Microsoft (and probably others) use 16-bit integers to represent "Unicode characters", even though some characters will have to be represented by surrogate pairs.
Various methods of NSString handle this case (plus combining characters) and return a range for a given character. E.g. -rangeOfCharacterFromSet:... and -rangeOfComposedCharacterSequences....
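To make the surrogate-pair point concrete, here is a minimal sketch in plain C (which also compiles as Objective-C) of how one code point becomes one or two 16-bit units. The function name utf16_encode is hypothetical and not part of any Apple API; the arithmetic follows the UTF-16 encoding form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Apple API.
   Encodes one Unicode code point as UTF-16. Writes 1 or 2 code units to
   out and returns how many were written, or 0 if cp is not encodable
   (a surrogate value, or above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: a single code unit */
        return 1;
    }
    cp -= 0x10000;                               /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* lead (high) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* trail (low) surrogate */
    return 2;
}
```

For example, U+0041 fits in one unit, while U+1D11E (the musical G clef) needs the pair 0xD834 0xDD1E, which is why a single unichar cannot hold it.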

It's not certain that strings are represented by the unichar data type. "A string object is implemented as an array of Unicode characters" doesn't mean that in the source code it is stored as unichar *. You don't know how it is implemented, do you?
And what if unichar is not an unsigned short? What if it is a 32- or 64-bit data type?

Related

How do I convert a unicode code point range into an NSString character range?

I have an NSString and a unicode code point range that represents a specific section of the text in that NSString. Since the characters in that NSString do not correspond one-to-one with code points, I need to somehow convert my code point range into the corresponding character range. How do I do this?
I know I can use the NSString method -rangeOfComposedCharacterSequencesForRange: to convert a character range to a grapheme cluster range, but what I want to do is sort of the opposite of that, and I can't find an inverse of that method in the APIs. And even if there was such a method available, I don't think this is exactly what I'm looking for, since (if I understand this correctly) a grapheme cluster is not the same thing as a unicode code point, and can in fact be composed of more than one code point.
What you have is kind of mixed data from two different worlds. You might typically get a Unicode code point range along with a UTF-32 string (where the correspondence is one-to-one) so that extracting the substring would be trivial. You have two options:
1. Work in the UTF-32 world before you put the data into an NSString
2. Convert the Unicode code point range into a UTF-16 unit range
I assume from your question that #2 is the easiest option in your case.
As you say, characters in an NSString do not correspond one-to-one with Unicode code points, since an NSString character is a UTF-16 unit. However, a Unicode code point corresponds to exactly 1 or 2 characters in an NSString. You can fairly easily write your own range conversion routine by iterating through the NSString characters and counting Unicode code points. This is made somewhat easier by the fact that you don't even care about the endianness of the UTF-16 data, since valid BMP characters, lead surrogates, and trail surrogates are disjoint. CFString provides some functions to determine what each character is. So a sketch of your counting (where string is your NSString) would look like:
NSUInteger codePointCount = 0;
NSUInteger length = [string length];
for (NSUInteger i = 0; i < length; i++) {
    unichar character = [string characterAtIndex:i];
    // A lead (high) surrogate followed by a trail (low) surrogate is one code point.
    if (CFStringIsSurrogateHighCharacter(character) && i + 1 < length &&
        CFStringIsSurrogateLowCharacter([string characterAtIndex:i + 1]))
    {
        i++; // skip forward past the trail surrogate
    }
    codePointCount++; // one more Unicode code point stepped through
}
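The counting idea above can be extended into a full range conversion. The following is a sketch in plain C over a raw UTF-16 buffer (for instance one filled by -getCharacters:range:); the type unit_range and both function names are hypothetical, chosen only for illustration. Unpaired surrogates are counted as one code point each, which is an assumption about how you want to treat malformed input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for NSRange, measured in UTF-16 code units. */
typedef struct {
    size_t location;
    size_t length;
} unit_range;

static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Returns the code unit index at which code point number cp_index starts,
   or len if the text contains fewer code points than that. */
static size_t unit_index_for_code_point(const uint16_t *text, size_t len,
                                        size_t cp_index)
{
    size_t i = 0;
    while (i < len && cp_index > 0) {
        /* A well-formed surrogate pair is one code point spanning two units. */
        if (is_lead(text[i]) && i + 1 < len && is_trail(text[i + 1])) {
            i += 2;
        } else {
            i += 1;
        }
        cp_index--;
    }
    return i;
}

/* Converts a range counted in code points into a range counted in
   UTF-16 code units. */
unit_range unit_range_for_code_points(const uint16_t *text, size_t len,
                                      size_t cp_location, size_t cp_length)
{
    unit_range r;
    size_t start = unit_index_for_code_point(text, len, cp_location);
    size_t end = unit_index_for_code_point(text, len, cp_location + cp_length);
    r.location = start;
    r.length = end - start;
    return r;
}
```

The resulting location and length can be passed to NSMakeRange. Scanning twice from the start keeps the sketch short; a single pass would be faster on long strings.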

numerical value of a unicode character in objective c

is it possible to get a numerical value from a unicode character in objective-c?
@"A" is 0041, @"➜" is 279C, @"Ω" is 03A9, @"झ" is 091D... ?
OK, so it’s perhaps worth pointing a few things out in a separate answer here. First, the term “character” is ambiguous, so we should choose a more appropriate term depending on what we mean. (See Characters and Grapheme Clusters in the Apple developer docs, as well as the Unicode website for more detail.)
If you are asking for the UTF-16 code unit, then you can use
unichar ch = [myString characterAtIndex:ndx];
Note that this is only equivalent to a Unicode code point in the case where the code point is within the Basic Multilingual Plane (i.e. it is no greater than U+FFFF).
If you are asking for the Unicode code point, then you should be aware that UTF-16 supports characters outside of the BMP (i.e. U+10000 and above) using surrogate pairs. Thus there will be two UTF-16 code units for any code point at or above U+10000. To detect this case, you need to do something like
uint32_t codepoint = [myString characterAtIndex:ndx];
if ((codepoint & 0xfc00) == 0xd800) {
unichar ch2 = [myString characterAtIndex:ndx + 1];
codepoint = (((codepoint & 0x3ff) << 10) | (ch2 & 0x3ff)) + 0x10000;
}
Note that in production code, you should also test for and cope with the case where the surrogate pair has been truncated somehow.
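As a sketch of what coping with a truncated pair might look like, here is a plain C variant of the decoding above that works on a raw UTF-16 buffer. The function name utf16_decode_at is hypothetical, and substituting U+FFFD (the replacement character) for an unpaired surrogate is one possible policy, not the only reasonable one.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical helper. Decodes the code point that starts at text[*index]
   and advances *index past the code units it consumed. An unpaired
   surrogate decodes to U+FFFD. The caller must ensure *index < len. */
uint32_t utf16_decode_at(const uint16_t *text, size_t len, size_t *index)
{
    uint32_t c1 = text[(*index)++];
    if ((c1 & 0xFC00) == 0xD800) {
        /* Lead surrogate: only valid when a trail surrogate follows. */
        if (*index < len && (text[*index] & 0xFC00) == 0xDC00) {
            uint32_t c2 = text[(*index)++];
            return (((c1 & 0x3FF) << 10) | (c2 & 0x3FF)) + 0x10000;
        }
        return REPLACEMENT_CHARACTER;   /* truncated pair */
    }
    if ((c1 & 0xFC00) == 0xDC00) {
        return REPLACEMENT_CHARACTER;   /* trail surrogate with no lead */
    }
    return c1;                          /* ordinary BMP code unit */
}
```

Because the index only moves past a trail surrogate when one is really there, a truncated pair at the end of the buffer never causes a read past len.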
Importantly, neither UTF-16 code units nor Unicode code points necessarily correspond to anything that an end user would regard as a “character” (the Unicode consortium generally refers to this as a grapheme cluster to distinguish it from other possible meanings of “character”). There are many examples, but the simplest to understand are probably the combining diacritical marks. For instance, the character ‘Ä’ can be represented as the Unicode code point U+00C4, or as a pair of code points, U+0041 U+0308.
Sometimes people (like @DietrichEpp in the comments on his answer) will claim that you can deal with this by converting to precomposed form before dealing with your string. This is something of a red herring, because precomposed form only deals with characters that have a precomposed equivalent in Unicode. e.g. it will not help with all combining marks; it will not help with Indic or Arabic scripts; it will not help with Hangul Jamos. There are many other cases as well.
If you are trying to manipulate grapheme clusters (things the user might think of as “characters”), you should probably make use of the NSString methods -rangeOfComposedCharacterSequencesForRange: and -rangeOfComposedCharacterSequenceAtIndex:, or the CFString function CFStringGetRangeOfComposedCharactersAtIndex. Obviously you cannot hold a grapheme cluster in an integer variable and it has no inherent numerical value; rather, it is represented by a string of code points, which are represented by a string of code units. For instance:
NSRange gcRange = [myString rangeOfComposedCharacterSequenceAtIndex:ndx];
NSString *graphemeCluster = [myString substringWithRange:gcRange];
Note that graphemeCluster may be arbitrarily long(!)
Even then, we have ignored the effects of matters such as Unicode’s support for bidirectional text. That is, the order of the code points represented by the code units in your NSString may in some cases be the reverse of what you might expect. The worst cases involve things like English text embedded in Arabic or Hebrew; this is supported by the Cocoa Text system, and so you really can end up with bidirectional strings in your code.
To summarise: generally speaking one should avoid examining NSString and CFString instances unichar by unichar. If at all possible, use an appropriate NSString method or CFString function instead. If you do find yourself examining the UTF-16 code units, please familiarise yourself with the Unicode standard first (I recommend “Unicode Demystified” if you can’t stomach reading through the Unicode book itself), so that you can avoid the major pitfalls.
Cocoa strings allow you to access the UTF-16 elements using -characterAtIndex:, so the following code will convert the string to a unicode code point:
unsigned strToChar(NSString *str)
{
unsigned c1, c2;
c1 = [str characterAtIndex:0];
if ((c1 & 0xfc00) == 0xd800) {
c2 = [str characterAtIndex:1];
return (((c1 & 0x3ff) << 10) | (c2 & 0x3ff)) + 0x10000;
} else {
return c1;
}
}
I am not aware of any convenience functions for this. You can use -characterAtIndex: by itself if you are okay with your code breaking horribly when someone uses characters outside the BMP; a number of applications on OS X break horribly in this way.
The following should render as a musical "G clef", U+1D11E, but if you copy and paste it into some text editors (TextMate), they'll let you do bizarre things like delete half of the character, at which point your text file is garbage.
𝄞

How to compare two NSString efficiently

I know it is possible to use the methods compare: and isEqualToString:, and I suppose isEqualToString: is the most efficient method if you know the object is a string. But my question is: is there a more efficient way to do it, like comparing char by char or something like that?
By reading the documentation:
The comparison uses the canonical representation of strings, which for a particular string is the length of the string plus the Unicode characters that make up the string. When this method compares two strings, if the individual Unicodes are the same, then the strings are equal, regardless of the backing store. “Literal” when applied to string comparison means that various Unicode decomposition rules are not applied and Unicode characters are individually compared. So, for instance, “Ö” represented as the composed character sequence “O” and umlaut would not compare equal to “Ö” represented as one Unicode character.
and:
When you know both objects are strings, this method is a faster way to check equality than isEqual:.
it seems that it's the best method available to compare strings and that it does exactly what you need. That is: first it checks the length (if the two strings have different lengths, there is no need to check each char), then, if the lengths are the same, it compares each char. Simple and efficient!
isEqualToString: is faster if you know both objects are strings, as the documentation states.
You could try converting both string to C strings and then use strcmp. Doubt it'll actually be any quicker though.
const char *str1 = [myNSString1 UTF8String];
const char *str2 = [myNSString2 UTF8String];
BOOL isEqual = (strcmp(str1, str2) == 0); // strcmp returns 0 when the strings match
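Like the "literal" comparison described in the quoted documentation, strcmp over UTF-8 bytes applies no Unicode decomposition rules. The sketch below, in plain C, illustrates this with the two spellings of “Ö”; the function name utf8_bytes_equal and the two constants are hypothetical names used only here.

```c
#include <string.h>

/* Hypothetical helper: returns 1 when two NUL-terminated UTF-8 strings
   are byte-for-byte identical, 0 otherwise. */
int utf8_bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* "Ö" as the single precomposed code point U+00D6. */
const char composed[] = "\xC3\x96";

/* "Ö" as "O" followed by U+0308 COMBINING DIAERESIS. */
const char decomposed[] = "O\xCC\x88";
```

Here utf8_bytes_equal(composed, decomposed) is 0 even though both render as the same letter, so a byte comparison is only a drop-in replacement when literal equality is what you want.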

C char * into Objective-C NSString (or NSData?)

I'm parsing a file byte stream in C and organising the results into NSDictionarys and NSArrays in Objective-C world via callback.
The keys of an NSDictionary are all instances of NSString. I'm converting the C character strings into NSStrings with the NSNEXTSTEPStringEncoding but now and again some of the keys are nil (even when the C character strings have one or more characters).
What is the correct way to do this?
NSNEXTSTEPStringEncoding is an extremely old and rare string encoding and is probably the wrong one to use. Your text files will generally be in either Latin-1 or UTF-8, if they originated in Western Europe or in North America. Let's assume UTF-8 for now (that's NSUTF8StringEncoding.) Ideally, you'd know the encoding used when writing the files, and you'd use that when reading them.
You presumably have the rest of the code correct, since you are getting strings back. It's the strings that aren't pure 7-bit ASCII that are likely giving you trouble.
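If you want to confirm that the failing keys are the ones containing non-ASCII bytes, a small check on the C side can help while debugging. This is a sketch in plain C; the function name is_seven_bit_ascii is hypothetical.

```c
/* Hypothetical debugging helper: returns 1 when every byte of the
   NUL-terminated string s is in the 7-bit ASCII range, 0 otherwise. */
int is_seven_bit_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        /* Cast first: plain char may be signed, making high bytes negative. */
        if ((unsigned char)*s > 0x7F) {
            return 0;
        }
    }
    return 1;
}
```

Logging the keys for which this returns 0 next to the keys that come back nil should show whether the encoding is the culprit.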

Objective c doesn't like my unichars?

Xcode complains about a "multi-character character constant" when I try to do the following:
static unichar accent[] = { 'ā', 'á', 'ă', 'à' };
How do you make an array of characters, when not all of them are ascii? The following works just fine
static unichar accent[] = { 'a', 'b', 'c' };
Workaround
The closest work around I have found is to convert the special characters into hex, ie this works:
static unichar accent[] = { 0x0100, 0x0101, 0x0102 };
It's not that Objective-C doesn't like it, it's that C doesn't. The constant 'c' is for char which has 1 byte, not unichar which has 2 bytes. (see the note below for a bit more detail.)
There's no perfectly supported way to represent a unichar constant. You can use
char* s="ü";
in a UTF-8-encoded source file to get the unicode C-string, or
NSString* s=@"ü";
in a UTF-8 encoded source file to get an NSString. (This was not possible before 10.5. It's OK for iPhone.)
NSString itself is conceptually encoding-neutral; but if you want, you can get the unicode character by using -characterAtIndex:.
Finally two comments:
If you just want to remove accents from the string, you can just use the method like this, without writing the table yourself:
-(NSString*)stringWithoutAccentsFromString:(NSString*)s
{
if (!s) return nil;
NSMutableString *result = [NSMutableString stringWithString:s];
CFStringFold((CFMutableStringRef)result, kCFCompareDiacriticInsensitive, NULL);
return result;
}
See the document of CFStringFold.
If you want unicode characters for localization/internationalization, you shouldn't embed the strings in the source code. Instead you should use Localizable.strings and NSLocalizedString. See here.
Note:
For arcane historical reasons, 'a' is an int in C, see the discussions here. In C++, it's a char. But it doesn't change the fact that writing more than one byte inside '...' is implementation-defined and not recommended. For example, see ISO C Standard 6.4.4.4, paragraph 10. However, it was common in classic Mac OS to write the four-letter code enclosed in single quotes, like 'APPL'. But that's another story...
Another complication is that accented letters are not always represented by 1 byte; it depends on the encoding. In UTF-8, they are not. In ISO-8859-1, they are. And unichar should be in UTF-16. Did you save your source code in UTF-16? I think the default of Xcode is UTF-8. GCC might do some encoding conversion depending on the setup, too...
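To see why 'ā' in a UTF-8 source file ends up as more than one byte between the quotes, this plain C sketch computes how many bytes UTF-8 needs for a given code point. The function name utf8_encoded_length is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 uses to encode code point
   cp, or 0 if cp is a surrogate value or lies above U+10FFFF. */
int utf8_encoded_length(uint32_t cp)
{
    if (cp < 0x80) {
        return 1;       /* ASCII */
    }
    if (cp < 0x800) {
        return 2;       /* includes Latin letters with diacritics */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;       /* surrogates are not encodable */
    }
    if (cp < 0x10000) {
        return 3;       /* rest of the BMP */
    }
    if (cp <= 0x10FFFF) {
        return 4;       /* supplementary planes */
    }
    return 0;
}
```

With this, 'a' (U+0061) takes 1 byte while 'ā' (U+0101) and 'ü' (U+00FC) take 2, so the compiler sees two bytes inside the single quotes.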
Or you can just do it like this:
static unichar accent[] = { L'ā', L'á', L'ă', L'à' };
L is a standard C prefix (not a keyword) that marks a wide character constant of type wchar_t, which can hold a character outside the basic character set.
Works fine for Objective-C too.
Note: The compiler may give you a strange warning about too many characters put inside a unichar, but you can safely ignore that warning. Xcode just doesn't deal with the unicode characters the right way, but the compiler parses them properly and the result is OK.
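If you would rather not depend on the encoding of the source file at all, C99 universal character names are one option. This is only a sketch: it uses uint16_t as a stand-in for unichar, the array name accents is hypothetical, and it assumes the compiler stores wide character constants as Unicode code point values, which is common but not guaranteed by the C standard.

```c
#include <stdint.h>

/* uint16_t stands in for unichar here. Each \uXXXX escape names a code
   point directly, so the file can stay pure ASCII.
   Assumption: wide character constants hold Unicode code point values. */
const uint16_t accents[] = {
    L'\u0101',  /* a with macron */
    L'\u00E1',  /* a with acute  */
    L'\u0103',  /* a with breve  */
    L'\u00E0'   /* a with grave  */
};

/* Number of entries in the table above. */
const int accent_count = (int)(sizeof accents / sizeof accents[0]);
```

All four values fit in 16 bits, so nothing is lost when the wide constants are stored in the narrower element type.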
Depending on your circumstances, this may be a tidy way to do it:
NSCharacterSet* accents =
[NSCharacterSet characterSetWithCharactersInString:@"āáăà"];
And then, if you want to check if a given unichar is one of those accent characters:
if ([accents characterIsMember:someOtherUnichar])
{
}
NSString also has many methods of its own for handling NSCharacterSet objects.