Objective c doesn't like my unichars? - objective-c

Xcode complaints about "multi-character character contant"'s when I try to do the following:
static unichar accent characters[] = { 'ā', 'á', 'ă', 'à' };
How do you make an array of characters, when not all of them are ascii? The following works just fine
static unichar accent[] = { 'a', 'b', 'c' };
Workaround
The closest work around I have found is to convert the special characters into hex, ie this works:
static unichar accent characters[] = { 0x0100, 0x0101, 0x0102 };

It's not that Objective-C doesn't like it, it's that C doesn't. The constant 'c' is for char which has 1 byte, not unichar which has 2 bytes. (see the note below for a bit more detail.)
There's no perfectly supported way to represent a unichar constant. You can use
char* s="ü";
in a UTF-8-encoded source file to get the unicode C-string, or
NSString* s=#"ü";
in a UTF-8 encoded source file to get an NSString. (This was not possible before 10.5. It's OK for iPhone.)
NSString itself is conceptually encoding-neutral; but if you want, you can get the unicode character by using -characterAtIndex:.
Finally two comments:
If you just want to remove accents from the string, you can just use the method like this, without writing the table yourself:
-(NSString*)stringWithoutAccentsFromString:(NSString*)s
{
if (!s) return nil;
NSMutableString *result = [NSMutableString stringWithString:s];
CFStringFold((CFMutableStringRef)result, kCFCompareDiacriticInsensitive, NULL);
return result;
}
See the document of CFStringFold.
If you want unicode characters for localization/internationalization, you shouldn't embed the strings in the source code. Instead you should use Localizable.strings and NSLocalizedString. See here.
Note:
For arcane historical reasons, 'a' is an int in C, see the discussions here. In C++, it's a char. But it doesn't change the fact that writing more than one byte inside '...' is implementation-defined and not recommended. For example, see ISO C Standard 6.4.4.10. However, it was common in classic Mac OS to write the four-letter code enclosed in single quotes, like 'APPL'. But that's another story...
Another complication is that accented letters are not always represented by 1 byte; it depends on the encoding. In UTF-8, it's not. In ISO-8859-1, it is. And unichar should be in UTF-16. Did you save your source code in UTF-16? I think the default of XCode is UTF-8. GCC might do some encoding conversion depending on the setup, too...

Or you can just do it like this:
static unichar accent characters[] = { L'ā', L'á', L'ă', L'à' };
L is a standard C keyword which says "I'm about to write a UNICODE character or character set".
Works fine for Objective-C too.
Note: The compiler may give you a strange warning about too many characters put inside a unichar, but you can safely ignore that warning. Xcode just doesn't deal with the unicode characters the right way, but the compiler parses them properly and the result is OK.

Depending on your circumstances, this may be a tidy way to do it:
NSCharacterSet* accents =
[NSCharacterSet characterSetWithCharactersInString:#"āáăà"];
And then, if you want to check if a given unichar is one of those accent characters:
if ([accents characterIsMember:someOtherUnichar])
{
}
NSString also has many methods of its own for handling NSCharacterSet objects.

Related

How to access a character in NSMutableString Objective-C

I have an instance of NSMutableString called MyMutableStr and I want access its character at index 7.
For example:
unsigned char cMy = [(NSString*) MyMutableStr characterAtIndex:7];
I think this is an ugly way; it's too much code.
My question is: Are there more simple ways in Objective-C to access the character in NSMutableString?
Like, in C language we can access a character of a string using [ ] operator:
unsigned char cMy = MyMutableStr[7];
The way of doing it is to use characterAtIndex:, but you don't need to cast it to a NSString pointer, since NSMutableString is a subclass of NSString. So it isn't that long, but if you still don't find it comfortable, I suggest to use UTF8String to obtain a C string over which you can iterate using the brackets operator:
const char* cString= [MyMutableStr UTF8String];
char first= cString[0];
But remember this (taken from NSString class reference):
The returned C string is automatically freed just as a returned object would be released; you should copy the C string if it needs to store it outside of the autorelease context in which the C string is created.
As others said characterAtIndex: but a few things you might want to consider carefully.
First you're dealing with an mutable string. You want to be careful to avoid it changing out from under you. One way is to an immutable copy and use that for the op.
Second, you're dealing with Unicode so you may want to consider normalizing your string to get a precomposed form as some visual representations may be more than one actual unichar. That's often a stumbling block for folks.

numerical value of a unicode character in objective c

is it possible to get a numerical value from a unicode character in objective-c?
#"A" is 0041, #"➜" is 279C, #"Ω" is 03A9, #"झ" is 091D... ?
OK, so it’s perhaps worth pointing a few things out in a separate answer here. First, the term “character” is ambiguous, so we should choose a more appropriate term depending on what we mean. (See Characters and Grapheme Clusters in the Apple developer docs, as well as the Unicode website for more detail.)
If you are asking for the UTF-16 code unit, then you can use
unichar ch = [myString characterAtIndex:ndx];
Note that this is only equivalent to a Unicode code-point in the case where the code point is within the Basic Multilingual Plane (i.e. it is less than U+FFFF).
If you are asking for the Unicode code point, then you should be aware that UTF-16 supports characters outside of the BMP (i.e. U+10000 and above) using surrogate pairs. Thus there will be two UTF-16 code units for any code point above U+10000. To detect this case, you need to do something like
uint32_t codepoint = [myString characterAtIndex:ndx];
if ((codepoint & 0xfc00) == 0xd800) {
unichar ch2 = [myString characterAtIndex:ndx + 1];
codepoint = (((codepoint & 0x3ff) << 10) | (ch2 & 0x3ff)) + 0x10000;
}
Note that in production code, you should also test for and cope with the case where the surrogate pair has been truncated somehow.
Importantly, neither UTF-16 code units, nor Unicode code points necessarily correspond to anything that and end-user would regard as a “character” (the Unicode consortium generally refers to this as a grapheme cluster to distinguish it from other possible meanings of “character”). There are many examples, but the simplest to understand are probably the combining diacritical marks. For instance, the character ‘Ä’ can be represented as the Unicode code point U+00C4, or as a pair of code points, U+0041 U+0308.
Sometimes people (like #DietrichEpp in the comments on his answer) will claim that you can deal with this by converting to precomposed form before dealing with your string. This is something of a red herring, because precomposed form only deals with characters that have a precomposed equivalent in Unicode. e.g. it will not help with all combining marks; it will not help with Indic or Arabic scripts; it will not help with Hangul Jamos. There are many other cases as well.
If you are trying to manipulate grapheme clusters (things the user might think of as “characters”), you should probably make use of the NSString methods -rangeOfComposedCharacterSequencesForRange:, rangeOfComposedCharacterSequenceAtIndex: or the CFString function CFStringGetRangeOfComposedCharactersAtIndex. Obviously you cannot hold a grapheme cluster in an integer variable and it has no inherent numerical value; rather, it is represented by a string of code points, which are represented by a string of code units. For instance:
NSRange gcRange = [myString rangeOfComposedCharacterSequenceAtIndex:ndx];
NSString *graphemeCluster = [myString substringWithRange:gcRange];
Note that graphemeCluster may be arbitrarily long(!)
Even then, we have ignored the effects of matters such as Unicode’s support for bidirectional text. That is, the order of the code points represented by the code units in your NSString may in some cases be the reverse of what you might expect. The worse cases involve things like English text embedded in Arabic or Hebrew; this is supported by the Cocoa Text system, and so you really can end up with bidirectional strings in your code.
To summarise: generally speaking one should avoid examining NSString and CFString instances unichar by unichar. If at all possible, use an appropriate NSString method or CFString function instead. If you do find yourself examining the UTF-16 code units, please familiarise yourself with the Unicode standard first (I recommend “Unicode Demystified” if you can’t stomach reading through the Unicode book itself), so that you can avoid the major pitfalls.
Cocoa strings allow you to access the UTF-16 elements using -characterAtIndex:, so the following code will convert the string to a unicode code point:
unsigned strToChar(NSString *str)
{
unsigned c1, c2;
c1 = [str characterAtIndex:0];
if ((c1 & 0xfc00) == 0xd800) {
c2 = [str characterAtIndex:1];
return (((c1 & 0x3ff) << 10) | (c2 & 0x3ff)) + 0x10000;
} else {
return c1;
}
}
I am not aware of any convenience functions for this. You can use -characterAtIndex: by itself if you are okay with your code breaking horribly when someone uses characters outside the BMP; a number of applications on OS X break horribly in this way.
The following should render as a musical "G clef", U+1D11E, but if you copy and paste it into some text editors (TextMate), they'll let you do bizarre things like delete half of the character, at which point your text file is garbage.
𝄞

Convert special characters like ë,à,é,ä all to e,a,e,a? Objective C

Is there a simple way in objective c to convert all special characters like ë,à,é,ä to the normal characters like e en a?
Yep, and it's pretty simple:
NSString *src = #"Convert special characters like ë,à,é,ä all to e,a,e,a? Objective C";
NSData *temp = [src dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *dst = [[[NSString alloc] initWithData:temp encoding:NSASCIIStringEncoding] autorelease];
NSLog(#"converted: %#", dst);
Running that on my machine produces:
EmptyFoundation[69299:a0f] converted: Convert special characters like e,a,e,a all to e,a,e,a? Objective C
Basically, we're asking the string to transform itself it an NSData (ie, a byte array) that represents the characters in the string in the ASCII character set. Since not all of the characters in the original string are in ASCII, we tell the string that it's OK to do a "lossy" conversion. In other words, it's OK to turn "é" into "e", and so on.
Once we've got our byte array, we simply turn it back into a string, and we're done! :)
CFStringTransform
CFStringTransform is the solution when you are dealing with a specific language. It transliterates strings in ways that simplify normalization, indexing, and searching. For example, it can remove accent marks using the option kCFStringTransformStripCombiningMarks:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("Schläger"));
CFStringTransform(string, NULL, kCFStringTransformStripCombiningMarks,
false);
... => string is now “Schlager” CFRelease(string);
CFStringTransform is even more powerful when you are dealing with non-Latin writing systems such as Arabic or Chinese. It can convert many writing systems to Latin script, making normalization much simpler.
For example, you can convert Chinese script to Latin script like this:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("你好"));
CFStringTransform(string, NULL, kCFStringTransformToLatin, false);
... => string is now “nˇı hˇao”
CFStringTransform(string, NULL, kCFStringTransformStripCombiningMarks,
false);
... => string is now “ni hao” CFRelease(string);
Notice that the option is simply kCFStringTransformToLatin.
The source language is not required. You can hand almost any string to
this transform without having to know first what language it is in.
CFStringTransform can also transliterate from Latin script to other
writing systems such as Arabic, Hangul, Hebrew, and Thai.
References: iOS 7 Programming: Pushing to the limits

Unfamiliar C syntax in Objective-C context

I am coming to Objective-C from C# without any intermediate knowledge of C. (Yes, yes, I will need to learn C at some point and I fully intend to.) In Apple's Certificate, Key, and Trust Services Programming Guide, there is the following code:
static const UInt8 publicKeyIdentifier[] = "com.apple.sample.publickey\0";
static const UInt8 privateKeyIdentifier[] = "com.apple.sample.privatekey\0";
I have an NSString that I would like to use as an identifier here and for the life of me I can't figure out how to get that into this data structure. Searching through Google has been fruitless also. I looked at the NSString Class Reference and looked at the UTF8String and getCharacters methods but I couldn't get the product into the structure.
What's the simple, easy trick I'm missing?
Those are C strings: Arrays (not NSArrays, but C arrays) of characters. The last character is a NUL, with the numeric value 0.
“UInt8” is the CoreServices name for an unsigned octet, which (on Mac OS X) is the same as an unsigned char.
static means that the array is specific to this file (if it's in file scope) or persists across function calls (if it's inside a method or function body).
const means just what you'd guess: You cannot change the characters in these arrays.
\0 is a NUL, but including it explicitly in a "" literal as shown in those examples is redundant. A "" literal (without the #) is NUL-terminated anyway.
C doesn't specify an encoding. On Mac OS X, it's generally something ASCII-compatible, usually UTF-8.
To convert an NSString to a C-string, use UTF8String or cStringUsingEncoding:. To have the NSString extract the C string into a buffer, use getCString:maxLength:encoding:.
I think some people are missing the point here. Everyone has explained the two constant arrays that are being set up for the tags, but if you want to use an NSString, you can simply add it to the attribute dictionary as-is. You don't have to convert it to anything. For example:
NSString *publicTag = #"com.apple.sample.publickey";
NSString *privateTag = #"com.apple.sample.privatekey";
The rest of the example stays exactly the same. In this case, there is no need for the C string literals at all.
Obtaining a char* (C string) from an NSString isn't the tricky part. (BTW, I'd also suggest UTF8String, it's much simpler.) The Apple-supplied code works because it's assigning a C string literal to the static const array variables. Assigning the result of a function or method call to a const will probably not work.
I recently answered an SO question about defining a constant in Objective-C, which should help your situation. You may have to compromise by getting rid of the const modifier. If it's declared static, you at least know that nobody outside the compilation unit where it's declared can reference it, so just make sure you don't let a reference to it "escape" such that other code could modify it via a pointer, etc.
However, as #Jason points out, you may not even need to convert it to a char* at all. The sample code creates an NSData object for each of these strings. You could just do something like this within the code (replacing steps 1 and 3):
NSData* publicTag = [#"com.apple.sample.publickey" dataUsingEncoding:NSUnicodeStringEncoding];
NSData* privateTag = [#"com.apple.sample.privatekey" dataUsingEncoding:NSUnicodeStringEncoding];
That sure seems easier to me than dealing with the C arrays if you already have an NSString.
try this
NSString *newString = #"This is a test string.";
char *theString;
theString = [newString cStringWithEncoding:[NSString defaultCStringEncoding]];

Raw strings like Python's in Objective-C

Does Objective-C have raw strings like Python's?
Clarification: a raw string doesn't interpret escape sequences like \n: both the slash and the "n" are separate characters in the string. From the linked Python tutorial:
>>> print 'C:\some\name' # here \n means newline!
C:\some
ame
>>> print r'C:\some\name' # note the r before the quote
C:\some\name
Objective-C is a superset of C. So, the answer is yes. You can write
char* string="hello world";
anywhere. You can then turn it into an NSString later by
NSString* nsstring=[NSString stringWithUTF8String:string];
From your link explaining what you mean by "raw string", the answer is: there is no built in method for what you are asking.
However, you can replace occurrences of one string with another string, so you can replace #"\n" with #"\\n", for example. That should get you close to what you're seeking.
You can use stringize macro.
#define MAKE_STRING(x) ##x
NSString *expendedString = MAKE_STRING(
hello world
"even quotes will be escaped"
);
The preprocess result is
NSString *expendedString = #"hello world \"even quotes will be escaped\"";
As you can see, double quotes are escaped, however new lines are ignored.
This feature is very suitable to paste some JS code in Objective-C files. Using this feature is safe if you are using C99.
source:
https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html
How, exactly, does the double-stringize trick work?
Like everyone said, raw ANSI strings are very easy. Just use simple C strings, or C++ std::string if you feel like compiling Objective C++.
However, the native string format of Cocoa is UCS-2 - fixed-width 2-byte characters. NSStrings are stored, internally, as UCS-2, i. e. as arrays of unsigned short. (Just like in Win32 and in Java, by the way.) The systemwide aliases for that datatype are unichar and UniChar. Here's where things become tricky.
GCC includes a wchar_t datatype, and lets you define a raw wide-char string constant like this:
wchar_t *ws = L"This a wide-char string.";
However, by default, this datatype is defined as 4-byte int and therefore is not the same as Cocoa's unichar! You can override that by specifying the following compiler option:
-fshort-wchar
but then you lose the wide-char C RTL functions (wcslen(), wcscpy(), etc.) - the RTL was compiled without that option and assumes 4-byte wchar_t. It's not particularly hard to reimplement these functions by hand. Your call.
Once you have a truly 2-byte wchar_t raw strings, you can trivially convert them to NSStrings and back:
wchar_t *ws = L"Hello";
NSString *s = [NSString stringWithCharacters:(const unichar*)ws length:5];
Unlike all other [stringWithXXX] methods, this one does not involve any codepage conversions.
Objective-C is a strict superset of C so you are free to use char * and char[] wherever you want (if that's what you call raw strings).
If you mean C-style strings, then yes.