C char * into Objective-C NSString (or NSData?) - objective-c

I'm parsing a file byte stream in C and organising the results into NSDictionarys and NSArrays in Objective-C world via callback.
The keys of an NSDictionary are all instances of NSString. I'm converting the C character strings into NSStrings with the NSNEXTSTEPStringEncoding but now and again some of the keys are nil (even when the C character strings have one or more characters).
What is the correct way to do this?

NSNEXTSTEPStringEncoding is an extremely old and rare string encoding and is probably the wrong one to use. Your text files will generally be in either Latin-1 or UTF-8, if they originated in Western Europe or in North America. Let's assume UTF-8 for now (that's NSUTF8StringEncoding.) Ideally, you'd know the encoding used when writing the files, and you'd use that when reading them.
You presumably have the rest of the code correct, since you are getting strings back. It's the strings that aren't pure 7-bit ASCII that are likely giving you trouble.
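To find out which keys are affected, one option is to scan each C string for bytes outside 7-bit ASCII before converting it. The following is a minimal C sketch of that check; the function name `first_non_ascii_offset` is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset of the first byte outside 7-bit ASCII,
   or -1 if every byte before the terminating NUL is plain ASCII. */
long first_non_ascii_offset(const char *s) {
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++) {
        if ((unsigned char)s[i] > 0x7F) {
            return (long)i;
        }
    }
    return -1;
}
```

A key for which this returns -1 should convert identically under ASCII, Latin-1, and UTF-8; any other key depends on choosing the right encoding.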

Related

Can I use memchr safely with an internal UTF-8 char * returned from an NSString?

I'd like to use memchr instead of strlen to find the length of a C string potentially used as the backing string of an NSString. Is this safe to do, or do I risk reading memory that I don't own, causing a crash? Let's assume that the NSString will not be released before I'm done with the internal buffer.
memchr(s, 0, XXX) and strlen(s) should pretty much behave identically, save for memchr()'s ability to terminate after XXX bytes. But strnlen() can do that, too.
And that behavior is probably exactly what you don't want.
Neither function accounts for any kind of unicode encoding. Thus, the returned length will be the length-in-bytes and not the # of characters.
Use -length on the NSString if you want the string length. Beyond that, what are you trying to do?
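As a rough illustration of the points above, here is a small C sketch. It shows a bounded length built on memchr (which gives the same result strnlen would) and a separate count of UTF-8 code points, so the difference between bytes and characters is visible. Both function names are made up for this example, and the code point counter assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length of s, examining at most max_len bytes. */
size_t bounded_byte_length(const char *s, size_t max_len) {
    assert(s != NULL);
    const char *nul = memchr(s, '\0', max_len);
    return nul != NULL ? (size_t)(nul - s) : max_len;
}

/* Number of code points in a valid, NUL-terminated UTF-8 string:
   every byte that is not a continuation byte (10xxxxxx) starts one. */
size_t utf8_code_point_count(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "café" the byte length is 5 while the code point count is 4, which is why neither memchr nor strlen can stand in for a character count.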

How do I convert a unicode code point range into an NSString character range?

I have an NSString and a unicode code point range that represents a specific section of the text in that NSString. Since the characters in that NSString do not correspond one-to-one with code points, I need to somehow convert my code point range into the corresponding character range. How do I do this?
I know I can use the NSString method -rangeOfComposedCharacterSequencesForRange: to convert a character range to a grapheme cluster range, but what I want to do is sort of the opposite of that, and I can't find an inverse of that method in the APIs. And even if there was such a method available, I don't think this is exactly what I'm looking for, since (if I understand this correctly) a grapheme cluster is not the same thing as a unicode code point, and can in fact be composed of more than one code point.
What you have is kind of mixed data from two different worlds. You might typically get a Unicode code point range along with a UTF-32 string (where the correspondence is one-to-one) so that extracting the substring would be trivial. You have two options:
1. Work in the UTF-32 world before you put the data into an NSString
2. Convert the Unicode code point range into a UTF-16 unit range
I assume from your question that #2 is the easiest option in your case.
As you say, characters in an NSString do not correspond one-to-one with Unicode code points since an NSString character is a UTF-16 unit. However, a Unicode code point corresponds to exactly 1 or 2 characters in an NSString. You can fairly easily write your own range conversion routine by iterating through the NSString characters and counting Unicode code points. This is made somewhat easier by the fact that you don't even care about the endianness of the UTF-16 data since valid BMP characters, lead surrogates, and trail surrogates are disjoint. CFString provides some functions to determine what each character is. So your counting would look something like this (string is the NSString being scanned):
NSUInteger codePoints = 0;
NSUInteger length = [string length];
for (NSUInteger i = 0; i < length; i++) {
    unichar c = [string characterAtIndex:i];
    // A lead surrogate followed by a trail surrogate encodes one code point.
    if (CFStringIsSurrogateHighCharacter(c) && i + 1 < length &&
        CFStringIsSurrogateLowCharacter([string characterAtIndex:i + 1]))
    {
        i++; // Skip forward another character in the NSString
    }
    codePoints++; // Count of Unicode code points stepped through
}
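The counting approach can be extended into a full range conversion. The sketch below does this in plain C over a buffer of UTF-16 code units (for example one filled by -getCharacters:range:). The `unit_range` type and all function names are hypothetical, and an unpaired surrogate is simply counted as one code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t location; /* index of the first UTF-16 code unit */
    size_t length;   /* number of UTF-16 code units */
} unit_range;

static int is_lead_surrogate(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail_surrogate(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Starting at code unit index `from`, step over up to `count` code points
   and return the resulting code unit index (never more than `len`). */
static size_t advance_code_points(const uint16_t *units, size_t len,
                                  size_t from, size_t count) {
    size_t i = from;
    while (i < len && count > 0) {
        if (is_lead_surrogate(units[i]) && i + 1 < len &&
            is_trail_surrogate(units[i + 1])) {
            i += 2; /* surrogate pair: one code point, two units */
        } else {
            i += 1; /* BMP character or unpaired surrogate */
        }
        count--;
    }
    return i;
}

/* Converts a range measured in code points to one measured in code units. */
unit_range code_point_range_to_unit_range(const uint16_t *units, size_t len,
                                          size_t cp_location, size_t cp_length) {
    assert(units != NULL || len == 0);
    size_t start = advance_code_points(units, len, 0, cp_location);
    size_t end = advance_code_points(units, len, start, cp_length);
    unit_range result = { start, end - start };
    return result;
}
```

The resulting location and length could then be used to build an NSRange. A range that runs past the end of the buffer is clamped rather than reported as an error, which is a design choice of this sketch.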

NSString and unichar don't match well when it comes to Unicode

Apple's documentation states that
A string object is implemented as an array of Unicode characters
However, the size of the unichar data type, which is likely to be unsigned short behind the scenes, is only 16 bits, which makes it impossible to represent every Unicode character with a single unichar. How do I reconcile these two facts in my mind?
You are correct that Apple's docs incorrectly refer to Unicode characters when they really mean UTF-16 code units.
In the early days of Unicode it was hoped that it would not exceed 16 bits, but it has. Both Apple and Microsoft (and probably others) use 16-bit integers to represent "Unicode characters", even though some characters will have to be represented by surrogate pairs.
Various methods of NSString handle this case (plus combining characters) and return a range for a given character. E.g. -rangeOfCharacterFromSet:... and -rangeOfComposedCharacterSequences....
It isn't certain that strings are represented by the unichar data type. "A string object is implemented as an array of Unicode characters" doesn't mean that in the source code it is stored as unichar *. You don't know how it is implemented, do you?
And what if unichar is not an unsigned short? What if it is a 32- or 64-bit data type?

numerical value of a unicode character in objective c

is it possible to get a numerical value from a unicode character in objective-c?
@"A" is 0041, @"➜" is 279C, @"Ω" is 03A9, @"झ" is 091D... ?
OK, so it’s perhaps worth pointing a few things out in a separate answer here. First, the term “character” is ambiguous, so we should choose a more appropriate term depending on what we mean. (See Characters and Grapheme Clusters in the Apple developer docs, as well as the Unicode website for more detail.)
If you are asking for the UTF-16 code unit, then you can use
unichar ch = [myString characterAtIndex:ndx];
Note that this is only equivalent to a Unicode code point in the case where the code point is within the Basic Multilingual Plane (i.e. it is U+FFFF or below).
If you are asking for the Unicode code point, then you should be aware that UTF-16 supports characters outside of the BMP (i.e. U+10000 and above) using surrogate pairs. Thus there will be two UTF-16 code units for any code point at or above U+10000. To detect this case, you need to do something like
uint32_t codepoint = [myString characterAtIndex:ndx];
if ((codepoint & 0xfc00) == 0xd800) {
    unichar ch2 = [myString characterAtIndex:ndx + 1];
    codepoint = (((codepoint & 0x3ff) << 10) | (ch2 & 0x3ff)) + 0x10000;
}
Note that in production code, you should also test for and cope with the case where the surrogate pair has been truncated somehow.
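One way to handle truncated or unpaired surrogates is sketched below in plain C, working on a buffer of UTF-16 code units. The function name `decode_utf16_at` is hypothetical, and substituting U+FFFD (the replacement character) for a broken pair is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Decodes the code point that starts at units[index].
   Stores the number of code units used (1 or 2) in *consumed.
   An unpaired lead or trail surrogate yields U+FFFD and consumes 1 unit. */
uint32_t decode_utf16_at(const uint16_t *units, size_t len, size_t index,
                         size_t *consumed) {
    assert(units != NULL && consumed != NULL && index < len);
    uint32_t first = units[index];
    *consumed = 1;
    if ((first & 0xFC00) == 0xD800) {          /* lead surrogate */
        if (index + 1 < len && (units[index + 1] & 0xFC00) == 0xDC00) {
            uint32_t second = units[index + 1];
            *consumed = 2;
            return (((first & 0x3FF) << 10) | (second & 0x3FF)) + 0x10000;
        }
        return REPLACEMENT_CHARACTER;          /* pair was truncated */
    }
    if ((first & 0xFC00) == 0xDC00) {
        return REPLACEMENT_CHARACTER;          /* stray trail surrogate */
    }
    return first;                              /* ordinary BMP code unit */
}
```

Returning the number of units consumed lets a caller walk the whole buffer without re-deriving whether each step was a pair.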
Importantly, neither UTF-16 code units, nor Unicode code points necessarily correspond to anything that an end-user would regard as a “character” (the Unicode consortium generally refers to this as a grapheme cluster to distinguish it from other possible meanings of “character”). There are many examples, but the simplest to understand are probably the combining diacritical marks. For instance, the character ‘Ä’ can be represented as the Unicode code point U+00C4, or as a pair of code points, U+0041 U+0308.
Sometimes people (like @DietrichEpp in the comments on his answer) will claim that you can deal with this by converting to precomposed form before dealing with your string. This is something of a red herring, because precomposed form only deals with characters that have a precomposed equivalent in Unicode. e.g. it will not help with all combining marks; it will not help with Indic or Arabic scripts; it will not help with Hangul Jamos. There are many other cases as well.
If you are trying to manipulate grapheme clusters (things the user might think of as “characters”), you should probably make use of the NSString methods -rangeOfComposedCharacterSequencesForRange: and -rangeOfComposedCharacterSequenceAtIndex:, or the CFString function CFStringGetRangeOfComposedCharactersAtIndex. Obviously you cannot hold a grapheme cluster in an integer variable and it has no inherent numerical value; rather, it is represented by a string of code points, which are represented by a string of code units. For instance:
NSRange gcRange = [myString rangeOfComposedCharacterSequenceAtIndex:ndx];
NSString *graphemeCluster = [myString substringWithRange:gcRange];
Note that graphemeCluster may be arbitrarily long(!)
Even then, we have ignored the effects of matters such as Unicode’s support for bidirectional text. That is, the order of the code points represented by the code units in your NSString may in some cases be the reverse of what you might expect. The worst cases involve things like English text embedded in Arabic or Hebrew; this is supported by the Cocoa Text system, and so you really can end up with bidirectional strings in your code.
To summarise: generally speaking one should avoid examining NSString and CFString instances unichar by unichar. If at all possible, use an appropriate NSString method or CFString function instead. If you do find yourself examining the UTF-16 code units, please familiarise yourself with the Unicode standard first (I recommend “Unicode Demystified” if you can’t stomach reading through the Unicode book itself), so that you can avoid the major pitfalls.
Cocoa strings allow you to access the UTF-16 elements using -characterAtIndex:, so the following code will convert the string to a unicode code point:
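unsigned strToChar(NSString *str)
{
    unsigned c1, c2;
    // First UTF-16 code unit of the string
    c1 = [str characterAtIndex:0];
    if ((c1 & 0xfc00) == 0xd800) {
        // Lead surrogate: combine it with the trail surrogate that follows
        c2 = [str characterAtIndex:1];
        return (((c1 & 0x3ff) << 10) | (c2 & 0x3ff)) + 0x10000;
    } else {
        // BMP character: the code unit is the code point
        return c1;
    }
}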
I am not aware of any convenience functions for this. You can use -characterAtIndex: by itself if you are okay with your code breaking horribly when someone uses characters outside the BMP; a number of applications on OS X break horribly in this way.
The following should render as a musical "G clef", U+1D11E, but if you copy and paste it into some text editors (TextMate), they'll let you do bizarre things like delete half of the character, at which point your text file is garbage.
𝄞
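To see why "half of the character" can be deleted, it helps to look at how a code point above the BMP splits into two UTF-16 code units. The C sketch below performs that split; the function name `encode_utf16` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the UTF-16 encoding of code_point into out.
   Returns the number of code units written (1 or 2), or 0 if code_point
   is not encodable (above U+10FFFF, or itself a surrogate value). */
size_t encode_utf16(uint32_t code_point, uint16_t out[2]) {
    assert(out != NULL);
    if (code_point > 0x10FFFF ||
        (code_point >= 0xD800 && code_point <= 0xDFFF)) {
        return 0;
    }
    if (code_point < 0x10000) {
        out[0] = (uint16_t)code_point;                /* BMP: one unit */
        return 1;
    }
    uint32_t offset = code_point - 0x10000;           /* 20 significant bits */
    out[0] = (uint16_t)(0xD800 | (offset >> 10));     /* lead: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (offset & 0x3FF));   /* trail: low 10 bits */
    return 2;
}
```

For U+1D11E this produces the units 0xD834 and 0xDD1E. An editor that deletes one unichar at a time removes only one of the two, leaving an unpaired surrogate behind.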

Is it better to append a CString than an ObjC String?

I'm writing a bit of code doing string manipulation. In this particular situation, appending "?partnerId=30" to a URL for iTunes Affiliate linking. This is a raw string and completely static. I was thinking, is it better to do:
urlString = [urlString stringByAppendingFormat:@"%@", @"?partnerId=30"];
Or:
urlString = [urlString stringByAppendingFormat:@"%s", "?partnerId=30"];
I would think it's better to not instantiate an entire Objective-C object, but I've never seen it done that way.
Strings declared using the @"" syntax are constant and will already exist in memory by the time your code is running, thus there is no allocation penalty from using them.
You may find they are very slightly faster, as they know their own length, whereas C strings need to be looped through to find out their length.
Though you'd gain a more tangible performance improvement (though still tiny) from not using a format string:
urlString = [urlString stringByAppendingString:@"?partnerId=30"];
Both literal C strings and literal NSStrings are expressed as constant bits of memory. Neither requires an allocation on use.
Objective-C string literals are immortal objects. They are instantiated when your binary is loaded into memory. With that knowledge, the former form does not create a temporary NSString.
I honestly don't know which is faster in general, because it also depends on external conditions. An NSString may represent strings of multiple encodings (the default is UTF-16); if urlString has an encoding conversion to perform, then it could be a performance hit for either approach. Either way, both will be quite fast. I wouldn't worry about this case unless you have many (e.g. thousands) of these to create and it is time critical, because their performance should be similar.
Since you are using the form: NSString = NSString+NSString, the NSString literal could be faster for nontrivial cases because the length is stored with the object and the encodings of both strings may already match the destination string. The C string used in your example would also be trivial to convert to another encoding, plus it is short.
C strings, as a more primitive type, could reduce your load times and/or memory usage if you need to define a lot of them.
For simple cases like this one, I'd just stick with NSString literals, unless the problem is much larger than the post would imply.
If you need a C string representation as well for a given set of literals, then you may prefer to define C string literals. Defining C string literals may also force you to create temporary NSStrings based on the C strings. In this case, you may want to define one for each flavor or use CFString's 'create CFString with external buffer' APIs. Again, this would be for very unusual cases (a micro-optimization if you are really not going through huge sets of these strings).