Convert special characters like ë,à,é,ä all to e,a,e,a? Objective C - objective-c

Is there a simple way in objective c to convert all special characters like ë,à,é,ä to the normal characters like e en a?

Yep, and it's pretty simple:
NSString *src = #"Convert special characters like ë,à,é,ä all to e,a,e,a? Objective C";
NSData *temp = [src dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *dst = [[[NSString alloc] initWithData:temp encoding:NSASCIIStringEncoding] autorelease];
NSLog(#"converted: %#", dst);
Running that on my machine produces:
EmptyFoundation[69299:a0f] converted: Convert special characters like e,a,e,a all to e,a,e,a? Objective C
Basically, we're asking the string to transform itself it an NSData (ie, a byte array) that represents the characters in the string in the ASCII character set. Since not all of the characters in the original string are in ASCII, we tell the string that it's OK to do a "lossy" conversion. In other words, it's OK to turn "é" into "e", and so on.
Once we've got our byte array, we simply turn it back into a string, and we're done! :)

CFStringTransform
CFStringTransform is the solution when you are dealing with a specific language. It transliterates strings in ways that simplify normalization, indexing, and searching. For example, it can remove accent marks using the option kCFStringTransformStripCombiningMarks:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("Schläger"));
CFStringTransform(string, NULL, kCFStringTransformStripCombiningMarks,
false);
... => string is now “Schlager” CFRelease(string);
CFStringTransform is even more powerful when you are dealing with non-Latin writing systems such as Arabic or Chinese. It can convert many writing systems to Latin script, making normalization much simpler.
For example, you can convert Chinese script to Latin script like this:
CFMutableStringRef string = CFStringCreateMutableCopy(NULL, 0, CFSTR("你好"));
CFStringTransform(string, NULL, kCFStringTransformToLatin, false);
... => string is now “nˇı hˇao”
CFStringTransform(string, NULL, kCFStringTransformStripCombiningMarks,
false);
... => string is now “ni hao” CFRelease(string);
Notice that the option is simply kCFStringTransformToLatin.
The source language is not required. You can hand almost any string to
this transform without having to know first what language it is in.
CFStringTransform can also transliterate from Latin script to other
writing systems such as Arabic, Hangul, Hebrew, and Thai.
References: iOS 7 Programming: Pushing to the limits

Related

Inserting hebrew into NSMutableArrey

When I insert Hebrew (LTR) string into NSMutableArrey, the string is distorted somehow.
What do I do?
NSString *peace = #"שלום";
NSLog(#"peace - %#", peace);
NSMutableArray *peaceArrey = [[NSMutableArray alloc]initWithCapacity:1];
[peaceArrey addObject:peace];
NSLog(#"peaceArrey - %#",peaceArrey);
And here is the log:
peace - שלום
peaceArrey - (
"\U05e9\U05dc\U05d5\U05dd"
)
everything should be ok, try NSLog(#"%#", peaceArrey[0]);
the result you are seeing is just the way NSArrays are printed: unicode chars are represented as their codes.
Don't mistake the representation that gets logged as the actual value of the object. NSArray's description is in an old-style property list format. Among other things, that means that non-ASCII values in strings are represented as escape sequences. You're seeing the Unicode characters as a series of UTF-16 code units expressed as escape sequences.
When using the %# format specifier NSLog calls description on the argument to log the string.
In case of a plain string this method just returns the string:
NSLog(#"string: %#", #"שלום");
// prints שלום
If you, on the other hand, put the string into an array, the NSArray's description method is called which, in turn, calls descriptionWithLocale:indent:.
This method just creates a property list formatted string. It uses the NSPropertyListOpenStepFormat which is ASCII encoded. That's why it has to escape the hebrew unicode characters.

numerical value of a unicode character in objective c

is it possible to get a numerical value from a unicode character in objective-c?
#"A" is 0041, #"➜" is 279C, #"Ω" is 03A9, #"झ" is 091D... ?
OK, so it’s perhaps worth pointing a few things out in a separate answer here. First, the term “character” is ambiguous, so we should choose a more appropriate term depending on what we mean. (See Characters and Grapheme Clusters in the Apple developer docs, as well as the Unicode website for more detail.)
If you are asking for the UTF-16 code unit, then you can use
unichar ch = [myString characterAtIndex:ndx];
Note that this is only equivalent to a Unicode code-point in the case where the code point is within the Basic Multilingual Plane (i.e. it is less than U+FFFF).
If you are asking for the Unicode code point, then you should be aware that UTF-16 supports characters outside of the BMP (i.e. U+10000 and above) using surrogate pairs. Thus there will be two UTF-16 code units for any code point above U+10000. To detect this case, you need to do something like
uint32_t codepoint = [myString characterAtIndex:ndx];
if ((codepoint & 0xfc00) == 0xd800) {
unichar ch2 = [myString characterAtIndex:ndx + 1];
codepoint = (((codepoint & 0x3ff) << 10) | (ch2 & 0x3ff)) + 0x10000;
}
Note that in production code, you should also test for and cope with the case where the surrogate pair has been truncated somehow.
Importantly, neither UTF-16 code units, nor Unicode code points necessarily correspond to anything that and end-user would regard as a “character” (the Unicode consortium generally refers to this as a grapheme cluster to distinguish it from other possible meanings of “character”). There are many examples, but the simplest to understand are probably the combining diacritical marks. For instance, the character ‘Ä’ can be represented as the Unicode code point U+00C4, or as a pair of code points, U+0041 U+0308.
Sometimes people (like #DietrichEpp in the comments on his answer) will claim that you can deal with this by converting to precomposed form before dealing with your string. This is something of a red herring, because precomposed form only deals with characters that have a precomposed equivalent in Unicode. e.g. it will not help with all combining marks; it will not help with Indic or Arabic scripts; it will not help with Hangul Jamos. There are many other cases as well.
If you are trying to manipulate grapheme clusters (things the user might think of as “characters”), you should probably make use of the NSString methods -rangeOfComposedCharacterSequencesForRange:, rangeOfComposedCharacterSequenceAtIndex: or the CFString function CFStringGetRangeOfComposedCharactersAtIndex. Obviously you cannot hold a grapheme cluster in an integer variable and it has no inherent numerical value; rather, it is represented by a string of code points, which are represented by a string of code units. For instance:
NSRange gcRange = [myString rangeOfComposedCharacterSequenceAtIndex:ndx];
NSString *graphemeCluster = [myString substringWithRange:gcRange];
Note that graphemeCluster may be arbitrarily long(!)
Even then, we have ignored the effects of matters such as Unicode’s support for bidirectional text. That is, the order of the code points represented by the code units in your NSString may in some cases be the reverse of what you might expect. The worse cases involve things like English text embedded in Arabic or Hebrew; this is supported by the Cocoa Text system, and so you really can end up with bidirectional strings in your code.
To summarise: generally speaking one should avoid examining NSString and CFString instances unichar by unichar. If at all possible, use an appropriate NSString method or CFString function instead. If you do find yourself examining the UTF-16 code units, please familiarise yourself with the Unicode standard first (I recommend “Unicode Demystified” if you can’t stomach reading through the Unicode book itself), so that you can avoid the major pitfalls.
Cocoa strings allow you to access the UTF-16 elements using -characterAtIndex:, so the following code will convert the string to a unicode code point:
unsigned strToChar(NSString *str)
{
unsigned c1, c2;
c1 = [str characterAtIndex:0];
if ((c1 & 0xfc00) == 0xd800) {
c2 = [str characterAtIndex:1];
return (((c1 & 0x3ff) << 10) | (c2 & 0x3ff)) + 0x10000;
} else {
return c1;
}
}
I am not aware of any convenience functions for this. You can use -characterAtIndex: by itself if you are okay with your code breaking horribly when someone uses characters outside the BMP; a number of applications on OS X break horribly in this way.
The following should render as a musical "G clef", U+1D11E, but if you copy and paste it into some text editors (TextMate), they'll let you do bizarre things like delete half of the character, at which point your text file is garbage.
𝄞

Unihan: combining UTF-8 chars

I am using data that involves Chinese Unihan characters in an Objective-C app. I am using a voice recognition program (cmusphinx) that returns a phrase from my data. It returns UTF-8 characters and when returning a Chinese character (which is three bytes) it separates it into three separate characters.
Example: When I want 人 to, I see: ‰∫∫. This is the proper in coding (E4 BA BA), but my code sees the returned value as three seperate characters rather than one.
Actually, my function is receiving the phrase as an NSString, (due to a wrap around) which uses UTF-16. I tried using Objective-C's built in conversion methods (to UTF-8 and from UTF-16), but these keep my string as three characters.
How can I decode these three separate characters into the one utf-8 codepoint for the Chinese character?
Or how can I properly encode it?
This is code fragment dealing with the cstring returned from sphinx and its encoding to a NSString:
const char * hypothesis = ps_get_hyp(pocketSphinxDecoder, &recognitionScore, &utteranceID);
NSString *hypothesisString = [[NSString alloc] initWithCString:hypothesis encoding:NSMacOSRomanEncoding];
Edit: From looking at the addition to your post, you actually do have control over the string encoding. In that case, why are you creating the string with NSMacOSRomanEncoding when you're expecting utf-8? Just change that to NSUTF8StringEncoding.
It sounds like what you're saying is you're being given an NSString that contains UTF-8 data that's being interpreted as a single-byte encoding (e.g. ISO-Latin-1, MacRoman, etc). I'm assuming here that you have no control over the code that creates the NSString, because if you did then the solution is just to change the encoding it's initializing with.
In any case, what you're asking for is a way to take the data in the string and convert it back to UTF-8. You can do this by creating an NSData from the NSString using whatever encoding its was originally created with (you need to know this much, at least, or it won't work), and then you can create a new NSString from the same data using UTF-8.
From the example character you gave (人) it looks like it's being interpreted as MacRoman, so lets go with that. The following code should convert it back:
- (NSString *)fixEncodingOfString:(NSString *)input {
CFStringEncoding cfEncoding = kCFStringEncodingMacRoman;
NSStringEncoding encoding = CFStringCovnertEncodingToNSStringEncoding(cfEncoding);
NSData *data = [input dataUsingEncoding:encoding];
if (!data) {
// the string wasn't actually in MacRoman
return nil;
}
NSString *output = [[[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] autorelease];
}

Objective c doesn't like my unichars?

Xcode complaints about "multi-character character contant"'s when I try to do the following:
static unichar accent characters[] = { 'ā', 'á', 'ă', 'à' };
How do you make an array of characters, when not all of them are ascii? The following works just fine
static unichar accent[] = { 'a', 'b', 'c' };
Workaround
The closest work around I have found is to convert the special characters into hex, ie this works:
static unichar accent characters[] = { 0x0100, 0x0101, 0x0102 };
It's not that Objective-C doesn't like it, it's that C doesn't. The constant 'c' is for char which has 1 byte, not unichar which has 2 bytes. (see the note below for a bit more detail.)
There's no perfectly supported way to represent a unichar constant. You can use
char* s="ü";
in a UTF-8-encoded source file to get the unicode C-string, or
NSString* s=#"ü";
in a UTF-8 encoded source file to get an NSString. (This was not possible before 10.5. It's OK for iPhone.)
NSString itself is conceptually encoding-neutral; but if you want, you can get the unicode character by using -characterAtIndex:.
Finally two comments:
If you just want to remove accents from the string, you can just use the method like this, without writing the table yourself:
-(NSString*)stringWithoutAccentsFromString:(NSString*)s
{
if (!s) return nil;
NSMutableString *result = [NSMutableString stringWithString:s];
CFStringFold((CFMutableStringRef)result, kCFCompareDiacriticInsensitive, NULL);
return result;
}
See the document of CFStringFold.
If you want unicode characters for localization/internationalization, you shouldn't embed the strings in the source code. Instead you should use Localizable.strings and NSLocalizedString. See here.
Note:
For arcane historical reasons, 'a' is an int in C, see the discussions here. In C++, it's a char. But it doesn't change the fact that writing more than one byte inside '...' is implementation-defined and not recommended. For example, see ISO C Standard 6.4.4.10. However, it was common in classic Mac OS to write the four-letter code enclosed in single quotes, like 'APPL'. But that's another story...
Another complication is that accented letters are not always represented by 1 byte; it depends on the encoding. In UTF-8, it's not. In ISO-8859-1, it is. And unichar should be in UTF-16. Did you save your source code in UTF-16? I think the default of XCode is UTF-8. GCC might do some encoding conversion depending on the setup, too...
Or you can just do it like this:
static unichar accent characters[] = { L'ā', L'á', L'ă', L'à' };
L is a standard C keyword which says "I'm about to write a UNICODE character or character set".
Works fine for Objective-C too.
Note: The compiler may give you a strange warning about too many characters put inside a unichar, but you can safely ignore that warning. Xcode just doesn't deal with the unicode characters the right way, but the compiler parses them properly and the result is OK.
Depending on your circumstances, this may be a tidy way to do it:
NSCharacterSet* accents =
[NSCharacterSet characterSetWithCharactersInString:#"āáăà"];
And then, if you want to check if a given unichar is one of those accent characters:
if ([accents characterIsMember:someOtherUnichar])
{
}
NSString also has many methods of its own for handling NSCharacterSet objects.

Raw strings like Python's in Objective-C

Does Objective-C have raw strings like Python's?
Clarification: a raw string doesn't interpret escape sequences like \n: both the slash and the "n" are separate characters in the string. From the linked Python tutorial:
>>> print 'C:\some\name' # here \n means newline!
C:\some
ame
>>> print r'C:\some\name' # note the r before the quote
C:\some\name
Objective-C is a superset of C. So, the answer is yes. You can write
char* string="hello world";
anywhere. You can then turn it into an NSString later by
NSString* nsstring=[NSString stringWithUTF8String:string];
From your link explaining what you mean by "raw string", the answer is: there is no built in method for what you are asking.
However, you can replace occurrences of one string with another string, so you can replace #"\n" with #"\\n", for example. That should get you close to what you're seeking.
You can use stringize macro.
#define MAKE_STRING(x) ##x
NSString *expendedString = MAKE_STRING(
hello world
"even quotes will be escaped"
);
The preprocess result is
NSString *expendedString = #"hello world \"even quotes will be escaped\"";
As you can see, double quotes are escaped, however new lines are ignored.
This feature is very suitable to paste some JS code in Objective-C files. Using this feature is safe if you are using C99.
source:
https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html
How, exactly, does the double-stringize trick work?
Like everyone said, raw ANSI strings are very easy. Just use simple C strings, or C++ std::string if you feel like compiling Objective C++.
However, the native string format of Cocoa is UCS-2 - fixed-width 2-byte characters. NSStrings are stored, internally, as UCS-2, i. e. as arrays of unsigned short. (Just like in Win32 and in Java, by the way.) The systemwide aliases for that datatype are unichar and UniChar. Here's where things become tricky.
GCC includes a wchar_t datatype, and lets you define a raw wide-char string constant like this:
wchar_t *ws = L"This a wide-char string.";
However, by default, this datatype is defined as 4-byte int and therefore is not the same as Cocoa's unichar! You can override that by specifying the following compiler option:
-fshort-wchar
but then you lose the wide-char C RTL functions (wcslen(), wcscpy(), etc.) - the RTL was compiled without that option and assumes 4-byte wchar_t. It's not particularly hard to reimplement these functions by hand. Your call.
Once you have a truly 2-byte wchar_t raw strings, you can trivially convert them to NSStrings and back:
wchar_t *ws = L"Hello";
NSString *s = [NSString stringWithCharacters:(const unichar*)ws length:5];
Unlike all other [stringWithXXX] methods, this one does not involve any codepage conversions.
Objective-C is a strict superset of C so you are free to use char * and char[] wherever you want (if that's what you call raw strings).
If you mean C-style strings, then yes.