Objective-C - char with umlauts to NSString

I am using libical, a library that parses the iCalendar format (RFC 2445).
The problem is that the values may contain German umlauts, for example in the location field.
Now libical returns a const char * for each value like:
"K\303\203\302\274nstlerhaus in M\303\203\302\274nchen"
I tried to convert it to NSString with:
[NSString stringWithCString:icalvalue_as_ical_string_r(value) encoding:NSUTF8StringEncoding];
But what I get is:
KÃ¼nstlerhaus in MÃ¼nchen
Any suggestions? I would appreciate any help!

Seems like your string got doubly UTF-8 encoded: "KÃ¼nstlerhaus in MÃ¼nchen" actually is UTF-8, so if you UTF-8-decode it a second time you should get the correct string.
Bear in mind, though, that you shouldn't be satisfied with that result. There are combinations where a doubly UTF-8 encoded string can't simply be repaired with a double UTF-8 decode; some encoding combinations are irreversible. So in your situation I'd suggest you find out why the string got doubly UTF-8 encoded in the first place: perhaps the ical file is stored in the wrong encoding on disk, or libical uses the wrong character set to read it, or if you're getting the ical from a server, perhaps the charset the server declares for the calendar data is wrong, etc.
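If the double encoding really did go through Latin-1 (an assumption, though it matches the bytes shown above), a minimal sketch of the reversal could look like this:

// Decoding the raw bytes as UTF-8 once yields the mojibake "KÃ¼nstlerhaus ..."
const char *raw = icalvalue_as_ical_string_r(value);
NSString *once = [NSString stringWithCString:raw encoding:NSUTF8StringEncoding];
// Re-encode as Latin-1 to recover the original UTF-8 bytes, then decode again.
NSData *bytes = [once dataUsingEncoding:NSISOLatin1StringEncoding];
NSString *fixed = [[[NSString alloc] initWithData:bytes
                                         encoding:NSUTF8StringEncoding] autorelease]; // omit autorelease under ARC

But as noted above, fixing the source of the double encoding is the better cure.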

The C string does not seem to be encoded in plain UTF-8, as there are four bytes for each of the umlaut characters. For example, ü would be encoded as the two bytes \xc3\xbc (\303\274 in the octal notation of your example) in UTF-8. So the input is either already garbled when you receive it, or it uses some other encoding.

Related

JSON parsing with £ symbol returning null

I am trying to parse a JSON script from my server which contains a £ (pound) sign, but the parse returns null. I had problems before, so I temporarily switched to using the dollar or euro sign, but I need to be able to parse the pound sign and I am unsure how to rectify this issue. I created a test project that temporarily just uses the string-with-contents method; all the other JSON files work fine, but the one with the pound sign returns null.
NSString *get5 = [NSString stringWithContentsOfURL:url5 encoding:NSUTF8StringEncoding error:nil];
I tried the other NSString encodings but they don't seem to work either. Some return null, some return Chinese characters, so they are not much good.
Any help would be much appreciated!!
Edit:
Used the NSError object and got this message back
"The operation couldn’t be completed. (Cocoa error 261.)"
UserInfo=0x68294b0
{NSURL=http://myserver.com/test.jsp,
NSStringEncoding=4}
Cocoa error 261 is an encoding error. The service returning the JSON obviously isn't returning it with UTF-8 encoding. Either make the service return UTF-8 if you can, or find out which encoding it is returning and use that.
See this question for more info:
Encoding issue: Cocoa Error 261?
Can you check whether the JSON is 1) saved with CRLF (Windows) line endings, or 2) encoded as Western (Latin-1) or similar?
Make sure the encoding is UTF-8.
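A minimal sketch of that approach: let Cocoa sniff the encoding first, then fall back to Latin-1 (where £ is the single byte 0xA3) as an assumed server encoding:

NSStringEncoding usedEncoding = 0;
NSError *error = nil;
// Let Cocoa detect the encoding from the response.
NSString *get5 = [NSString stringWithContentsOfURL:url5
                                      usedEncoding:&usedEncoding
                                             error:&error];
if (get5 == nil) {
    // Assumed fallback: misconfigured servers often emit Latin-1 (£ = 0xA3).
    get5 = [NSString stringWithContentsOfURL:url5
                                    encoding:NSISOLatin1StringEncoding
                                       error:&error];
}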

iOS JSON escaping special characters

I'm working on iOS and trying to pass some content to a web server via an NSURLRequest. On the server I have a PHP script set up to accept the request string and convert it into a JSON object using the Zend_JSON framework. The issue I am having is that whenever the character "ø" appears anywhere in the request parameters, the request string is cut short by one character.
Request string before going to the server:
[{"description":"Blah blah","type":"Russebuss","name":"Roscoe Simulator","appVersion":"1.0.20","osVersion":"IOS 5.1","phone":"5555555","country":"Østfold","udid":"bed164974ea0d436a43f3cdee0e005a1"}]
Request string on the server before any parsing:
[{"description":"Blah blah","type":"Russebuss","name":"Roscoe Simulator","appVersion":"1.0.20","osVersion":"IOS 5.1","phone":"5555555","country":"Nord-Trøndelag","udid":"bed164974ea0d436a43f3cdee0e005a1"}
Everything looks exactly the same except the final closing ] is missing. I'm thinking it's having an issue when converting the string to UTF-8, but I'm not sure of the correct way to fix it.
Does anyone have any ideas why this is happening?
First of all, do not trust the Xcode console in such cases; you never know which encoding the console is actually using.
Second, escape the invalid characters before you build your JSON string. The easiest way is probably to make sure you are using the same Unicode representation, like UTF-8, all the time.
Third, if there are still invalid characters, use a JSON library with a parser (it does the encoding). Validate the output by parsing it back to, e.g., an NSString, or validate it manually with a web form like http://jsonformatter.curiousconcept.com/
The crudest way is to replace the individual characters in the string, build your JSON, and convert back. One way to do this is to replace, e.g., a German ä with its Unicode representation U+00E4 (http://www.utf8-chartable.de/).
That's the way I do it. I'm glad that I never needed to go further than step three, and that is the step you should do anyway to keep your code simple.
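As a concrete sketch of step three, NSJSONSerialization (iOS 5+) handles both the escaping and the byte length. The field values mirror the request above, request is assumed to be your NSMutableURLRequest, and the Content-Length remark is an assumption about where the one-character truncation comes from (ø is two bytes in UTF-8, so a length computed from the character count is one byte short):

NSDictionary *entry = @{ @"country": @"Østfold",
                         @"name": @"Roscoe Simulator" }; // remaining fields omitted
NSError *error = nil;
NSData *body = [NSJSONSerialization dataWithJSONObject:@[ entry ]
                                               options:0
                                                 error:&error];
[request setHTTPBody:body];
// body length counts bytes, not characters, so the multi-byte ø is counted correctly.
[request setValue:[NSString stringWithFormat:@"%lu", (unsigned long)[body length]]
 forHTTPHeaderField:@"Content-Length"];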
Please try using Zend's internal JSON encoder:
Zend_Json::$useBuiltinEncoderDecoder = true;
should fix your issue.

Objective-C: How to get a Unicode character

I want to get the Unicode code point for a given Unicode character in Objective-C. NSString is documented to use UTF-16 internally, and its documentation says:
The NSString class has two primitive methods—length and characterAtIndex:—that provide the basis for all other methods in its interface. The length method returns the total number of Unicode characters in the string. characterAtIndex: gives access to each character in the string by index, with index values starting at 0.
That seems to assume the characterAtIndex: method is Unicode aware. However, it returns unichar, which is a 16-bit unsigned integer type.
- (unichar)characterAtIndex:(NSUInteger)index
The questions are:
Q1: How does it represent Unicode code points above U+FFFF?
Q2: If Q1 makes sense, is there a method to get the Unicode code point for a given Unicode character in Objective-C?
Thanks.
The short answer to "Q1: How does it represent Unicode code points above U+FFFF?" is: you need to be UTF-16 aware and correctly handle Surrogate Code Points. The info and links below should give you pointers and example code that allow you to do this.
The NSString documentation is correct. However, while you said "NSString said it internal use UTF-16 encoding", it's more accurate to say that the public / abstract interface for NSString is UTF16 based. The difference is that this leaves the internal representation of a string a private implementation detail, but the public methods such as characterAtIndex: and length are always in UTF16.
The reason for this is it tends to strike the best balance between older ASCII-centric and Unicode aware strings, largely due to the fact that Unicode is a strict superset of ASCII (ASCII uses 7 bits, for 128 characters, which are mapped to the first 128 Unicode Code Points).
To represent Unicode Code Points that are > U+FFFF, which obviously exceeds what can be represented in a single UTF16 Code Unit, UTF16 uses special Surrogate Code Points to form a Surrogate Pair, which when combined together form a Unicode Code Point > U+FFFF. You can find details about this at:
Unicode UTF FAQ - What are surrogates?
Unicode UTF FAQ - What’s the algorithm to convert from UTF-16 to character codes?
Although the official Unicode UTF FAQ - How do I write a UTF converter? now recommends the use of International Components for Unicode, it used to recommend some code officially sanctioned and maintained by Unicode. Although no longer directly available from Unicode.org, you can still find copies of the "no longer official" example code in various open-source projects: ConvertUTF.c and ConvertUTF.h. If you need to roll your own, I'd strongly recommend examining this code first, as it is well tested.
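As a minimal sketch of handling this in Objective-C, Core Foundation ships surrogate-aware helpers (the wrapper function name here is mine):

#import <Foundation/Foundation.h>

// Return the full code point starting at UTF-16 index i, combining a
// surrogate pair into a single value above U+FFFF when one is present.
uint32_t CodePointAtIndex(NSString *s, NSUInteger i) {
    unichar c = [s characterAtIndex:i];
    if (CFStringIsSurrogateHighCharacter(c) && i + 1 < [s length]) {
        unichar low = [s characterAtIndex:i + 1];
        if (CFStringIsSurrogateLowCharacter(low)) {
            return (uint32_t)CFStringGetLongCharacterForSurrogatePair(c, low);
        }
    }
    return c;
}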
From the documentation of length:
The number returned includes the individual characters of composed character sequences, so you cannot use this method to determine if a string will be visible when printed or how long it will appear.
From this, I would infer that any characters above U+FFFF would be counted as two characters and would be encoded as a Surrogate Pair (see the relevant entry at http://unicode.org/glossary/).
If you have a UTF-32 encoded string with the character you wish to convert, you could create a new NSString with initWithBytesNoCopy:length:encoding:freeWhenDone: and use the result of that to determine how the character is encoded in UTF-16, but if you're going to be doing much heavy Unicode processing, your best bet is probably to get familiar with ICU (http://site.icu-project.org/).
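A small sketch of that round trip, using the plain initWithBytes: variant (the code point is just an example, and it assumes a little-endian host):

uint32_t codePoint = 0x1D11E; // MUSICAL SYMBOL G CLEF, an example above U+FFFF
NSString *s = [[NSString alloc] initWithBytes:&codePoint
                                       length:sizeof(codePoint)
                                     encoding:NSUTF32LittleEndianStringEncoding];
NSLog(@"UTF-16 length: %lu", (unsigned long)[s length]); // 2: a surrogate pair
NSLog(@"high: 0x%X low: 0x%X",
      [s characterAtIndex:0], [s characterAtIndex:1]);   // 0xD834 0xDD1E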

Char.ConvertFromUtf32 not available in Silverlight

I'm converting a WinForms app to Silverlight (VB.NET). What should I use instead of Char.ConvertFromUtf32 as it's not available to use in Silverlight?
UTF-32 is currently not part of Silverlight, so you have to find a way around the limitation. I think you should stop a moment and think about exactly why you need to read UTF-32-encoded text.
If you are reading such text from a database or a file on the server, I would perform the conversion server-side (if possible I would convert everything to UTF-8 and get rid of the UTF-32 data in one shot).
If you are parsing a user-provided file on the client side, I would detect the UTF-32 encoding and gently tell the user that the file encoding is not supported. UTF32 is pretty rare nowadays, so I guess it should not be a very common case (but I could be wrong not knowing your exact situation).
In order to detect the file encoding you have to look at the first few bytes (the byte order mark). If they are not present, the task becomes much harder and involves some kind of heuristic based on character frequency.
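For reference, the byte order mark patterns, sketched as a C helper (the same byte comparisons port directly to VB.NET or C#):

/* Return the encoding implied by a byte order mark. Longest marks are
   checked first so UTF-32LE is not mistaken for UTF-16LE; b must hold
   at least the first 4 bytes of the file. */
const char *EncodingFromBOM(const unsigned char *b) {
    if (b[0]==0x00 && b[1]==0x00 && b[2]==0xFE && b[3]==0xFF) return "UTF-32BE";
    if (b[0]==0xFF && b[1]==0xFE && b[2]==0x00 && b[3]==0x00) return "UTF-32LE";
    if (b[0]==0xEF && b[1]==0xBB && b[2]==0xBF)               return "UTF-8";
    if (b[0]==0xFE && b[1]==0xFF)                             return "UTF-16BE";
    if (b[0]==0xFF && b[1]==0xFE)                             return "UTF-16LE";
    return "unknown (no BOM)";
}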
From: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/types/how-to-convert-between-hexadecimal-strings-and-numeric-types
You can use a direct cast, like:
// Get the character corresponding to the integral value.
string stringValue = Char.ConvertFromUtf32(value); // not available in Silverlight
char charValue = (char)value;                      // the direct cast
A small warning: the cast only works up to 0xFFFF. It will not work for the high range of Unicode, from 0x10000 to 0x10FFFF.
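The missing piece for the high range is the surrogate-pair arithmetic from the Unicode FAQ; a sketch in C (the same arithmetic ports directly to C# or VB.NET):

#include <stdint.h>

/* Split a code point in the range 0x10000..0x10FFFF into a UTF-16 surrogate pair. */
void SurrogatePairForCodePoint(uint32_t cp, uint16_t *high, uint16_t *low) {
    cp -= 0x10000;
    *high = (uint16_t)(0xD800 + (cp >> 10));   /* lead (high) surrogate  */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* trail (low) surrogate  */
}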
Also, if you need to parse \uXXXX, try this other question: How do I convert Unicode escape sequences to Unicode characters in a .NET string?

How do I match non-ASCII characters with RegexKitLite?

I am using RegexKitLite and I'm trying to match a pattern.
The following regex patterns do not capture my word that includes an N with a tilde: ñ.
Is there a string conversion I am missing?
subjectString = #"define_añadir";
//regexString = #"^define_(.*)"; //this pattern does not match, so I assume to add the ñ
//regexString = #"^define_([.ñ]*)"; //tried this pattern first with a range
regexString = #"^define_((?:\\w|ñ)*)"; //tried second
NSString *captured= [subjectString stringByMatching:regexString capture:1L];
//I want captured == añadir
Looks like an encoding problem to me. Either you're saving the source code in an encoding that can't handle that character (like ASCII), or the compiler is using the wrong encoding to read the source files. Going back to the original regex, try creating the subject string like this:
subjectString = #"define_a\xC3\xB1adir";
or this:
subjectString = #"define_a\u00F1adir";
If that works, check the encoding of your source code files and make sure it's the same encoding the compiler expects.
EDIT: I've never worked with the iPhone technology stack, but according to this doc you should be using the stringWithUTF8String: method to create the NSString, not the @"" literal syntax. In fact, it says you should never use non-ASCII characters (that is, anything not in the range 0x00..0x7F) in your source code; that way you never have to worry about the source file's encoding. That's good advice no matter what language or toolset you're using.
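Applied to the example above, a sketch of that advice; the escapes are the UTF-8 bytes of ñ (split so the following 'a' isn't swallowed by the hex escape), and stringByMatching:capture: is RegexKitLite's method from the question:

NSString *subjectString = [NSString stringWithUTF8String:"define_a\xC3\xB1" "adir"];
NSString *regexString = @"^define_(.*)";
NSString *captured = [subjectString stringByMatching:regexString capture:1L];
// captured is now @"añadir"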