NSJSONSerialization parsing special characters - objective-c

I am parsing some data using NSJSONSerialization. After parsing, I get strings like &auml ; and %#339;, which I think has something to do with encoding. But NSJSONSerialization doesn't ask which encoding to use; I guess it detects it by itself. So my question is: how can I get proper strings instead of these weird &auml ; and %#339;?

NSJSONSerialization assumes the encoding is one of the Unicode encodings. Make sure the data you pass to it is in UTF-8 (or UTF-16). ä is C3 A4 in UTF-8 or 00 E4 in UTF-16 (big-endian).
Note that the default encoding for HTTP if none is specified is ISO-8859-1, so it may be that you are passing ISO-8859-1 data instead of UTF-8.
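To make the byte values concrete, here is a minimal sketch in Python (not Objective-C, just for illustration). It shows the bytes for ä in each encoding and decodes a payload with the charset the server declared before parsing it. The helper name parse_json_bytes and the sample payload are made up for this example.

```python
import json

# U+00E4 (ä) in the encodings mentioned above.
a_umlaut = "\u00e4"
utf8_bytes = a_umlaut.encode("utf-8")          # 2 bytes: C3 A4
utf16_bytes = a_umlaut.encode("utf-16-be")     # 2 bytes: 00 E4
latin1_bytes = a_umlaut.encode("iso-8859-1")   # 1 byte:  E4


def parse_json_bytes(raw: bytes, charset: str = "utf-8"):
    """Hypothetical helper: decode with the charset the sender declared, then parse."""
    return json.loads(raw.decode(charset))


# A payload that was sent as ISO-8859-1 rather than UTF-8 (assumed sample data).
latin1_payload = '{"name": "B\u00e4r"}'.encode("iso-8859-1")
parsed = parse_json_bytes(latin1_payload, charset="iso-8859-1")
```

Decoding the same payload as UTF-8 raises a UnicodeDecodeError, because a lone E4 byte is not a valid UTF-8 sequence. On the Objective-C side, the equivalent step would be converting the data to UTF-8 before handing it to NSJSONSerialization.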

In the options, try NSJSONReadingMutableLeaves; it should return the leaf strings as NSMutableString instances. For more, take a look at the docs.

Related

Encoding type for polish characters

I have a JSON string containing Polish characters. Examples below:
"Reno Truck Lachowski & Łuczak - NAPRAWY CHŁODNI,IZOTERM,ZABUDÓW POJAZDÓW CIĘŻAROWYCH"
or
"RENO TRUCK Lachowski & Łuczak s.c. SERWIS POJAZDÓW UZYTKOWYCH"
I need to update this value in the database.
Can anyone let me know which encoding I need to set?
I tried UTF-8 and ISO-8859-1, but neither works.
I observed that when I set ISO-8859-1, the value comes out different, as below:
"RENO TRUCK Lachowski & ?uczak s.c. SERWIS POJAZDÓW UZYTKOWYCH"
The character Ł doesn't get updated.
Can anyone help, please?
JSON values are expected to be encoded in UTF-8. The string you quoted seems to be encoded in something else. You are expected to know the encoding of the data. Note that it may not be valid JSON if it is not UTF-8. Once you know the encoding, you could use DataWeave to convert it to what your database expects. Based on the JDBC URL, it seems that the database connection is expecting ISO-8859-1.
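A possible reason for the "?" is that ISO-8859-1 simply has no code for Ł, while Ó does exist in it. The Python sketch below (illustration only, not DataWeave) shows that; the helper can_encode and the shortened sample string are made up for this example.

```python
def can_encode(text: str, charset: str) -> bool:
    """Hypothetical helper: True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


name = "\u0141uczak POJAZD\u00d3W"  # "Łuczak POJAZDÓW"

# ISO-8859-1 has Ó but not Ł, so a lossy encoder substitutes "?".
latin1_lossy = name.encode("iso-8859-1", errors="replace")
seen_in_db = latin1_lossy.decode("iso-8859-1")

# ISO-8859-2 (Central European) and UTF-8 can both represent the full string.
latin2_roundtrip = name.encode("iso-8859-2").decode("iso-8859-2")
utf8_roundtrip = name.encode("utf-8").decode("utf-8")
```

Here seen_in_db comes out as "?uczak POJAZDÓW", which matches the symptom in the question. If the database connection really is fixed to ISO-8859-1, no conversion can keep Ł; the connection or column encoding would have to change to something like UTF-8.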

Can ASCII arrays be manipulated as arrays without converting to String form?

This is a basic question, but I can't find anything on it, since I don't know what to search for; each of my tries has come up with unrelated results.
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character? Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
Dim t1() As Byte = Text.Encoding.ASCII.GetBytes("Hello ")
Dim t2() As Byte = Text.Encoding.ASCII.GetBytes("World")
Dim msg As String = Text.Encoding.ASCII.GetString(t1.Concat(t2).ToArray)
Now msg should be "Hello World".
I would like this to work as I don't want to have to convert data I receive back to Strings in order to manipulate it before it is sent again.
What if I used something other than ASCII (like UTF-8, for example)?
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character?
Yes. ASCII is a 7-bit encoding; it does not support multi-byte characters. Any Unicode codepoint above U+007F will get converted to a ? character in ASCII.
If you were to use UTF-7 instead, for instance, it can encode individual Unicode codepoints into a sequence of multiple ASCII characters.
Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
In your particular example, yes (provided you are using LINQ's Concat() method - there are other ways to concat arrays together). There is no data loss.
But for other examples, just know that you will have data loss if you convert non-ASCII characters to ASCII, or otherwise mismatch encodings between GetBytes() and GetString().
You can certainly manipulate byte arrays. Just make sure the arrays are in the same encoding if you merge them together.
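For comparison, here is roughly the same experiment sketched in Python rather than VB.NET. One assumption to note: Python's ASCII encoder raises an error by default, so errors="replace" is used to mimic the "?" substitution that .NET's Encoding.ASCII performs.

```python
# Concatenating ASCII byte arrays is safe: one byte per character.
t1 = "Hello ".encode("ascii")
t2 = "World".encode("ascii")
msg = (t1 + t2).decode("ascii")

# Non-ASCII input is lossy: é has no ASCII code and becomes "?".
lossy = "caf\u00e9".encode("ascii", errors="replace")

# UTF-8 is multi-byte: "café" is 4 characters but 5 bytes (é is C3 A9).
utf8 = "caf\u00e9".encode("utf-8")

# Cutting inside the two-byte sequence leaves pieces that do not decode alone...
head, tail = utf8[:4], utf8[4:]
try:
    head.decode("utf-8")
    head_decodes = True
except UnicodeDecodeError:
    head_decodes = False

# ...but concatenating the pieces again restores the original text.
rejoined = (head + tail).decode("utf-8")
```

So with UTF-8 the byte arrays can still be merged freely; the thing to avoid is decoding or trimming a fragment that ends in the middle of a multi-byte sequence.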
.NET strings are counted sequences of UTF-16 code units (char), one or two of which encode a Unicode codepoint (an int, as returned by Char.ConvertToUtf32). Some codepoints are "combining characters", which when applied to a preceding "base character" form a grapheme (which is then rendered by a font into a glyph).
An encoder from Unicode to an encoding of another character set should attempt to preserve graphemes. In .NET, a grapheme is called a "text element."
So, yes, you can combine encoded byte sequences as long as you haven't defeated the encoder by converting parts of a grapheme into different byte sequences. If you are breaking a string into two before encoding, see TextElementEnumerator and StringInfo class.
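A small Python sketch of the grapheme point, for illustration only (in .NET you would use StringInfo or TextElementEnumerator instead). It uses "e" followed by U+0301 COMBINING ACUTE ACCENT and ISO-8859-1 as an assumed target encoding.

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # "e" + COMBINING ACUTE ACCENT: two code points, one grapheme

# U+0301 has a non-zero combining class, i.e. it attaches to the preceding base character.
is_combining = unicodedata.combining("\u0301") != 0

# Encoding the grapheme as a unit (composed first via NFC) keeps the character.
whole = unicodedata.normalize("NFC", decomposed).encode("iso-8859-1")

# Encoding each code point on its own loses the accent: U+0301 has no ISO-8859-1 code.
pieces = b"".join(ch.encode("iso-8859-1", errors="replace") for ch in decomposed)
```

Here whole is the single byte E9, while pieces is "e?" because the string was split between the base character and its combining mark before encoding.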

Replacing wrong codification letters with SQL

I have a database with data from the internet, but some pages have the wrong encoding, and letters like ã become Ã£ and ç becomes Ã§.
What are the possibilities to fix this? I'm using PostgreSQL.
I can use replace, but do I need a replace for each case? I was thinking about translate, but I see that it only maps one character to another. Is it possible to translate two characters into one? Something like: TRANSLATE(text,'Ã£|Ã§','ã|ç').
This particular problem looks like UTF-8 encoded data being interpreted as a single-byte character set ("ç" becoming "Ã§" suggests iso-8859-1).
You can fix these up individually with a long chain of replace(...) calls. Or you can use postgresql's own character-conversion facilities:
select convert_from(convert_to('Â£20 - garÃ§on', 'iso-8859-1'), 'utf-8')
In order, this:
Converts the string back to binary using the iso-8859-1 codec (which will just change unicode codepoints back to bytes, assuming all the codepoints are under 256)
Reinterprets that binary output as UTF-8, so sequences such as {0xc2, 0xa3} are translated to '£'
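The same two steps can be tried outside the database. Below is a minimal Python sketch of the round trip, assuming the damage really was UTF-8 read as ISO-8859-1; the function name fix_mojibake is made up for this example.

```python
def fix_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 bytes that were decoded as ISO-8859-1.

    Returns the input unchanged when it does not fit that pattern.
    """
    try:
        # Step 1: map each code point (all expected to be < 256) back to one byte.
        raw = text.encode("iso-8859-1")
        # Step 2: reinterpret those bytes as UTF-8, e.g. C2 A3 becomes the pound sign.
        return raw.decode("utf-8")
    except UnicodeError:
        return text


broken = "\u00c2\u00a320 - gar\u00c3\u00a7on"   # the mojibake form of "£20 - garçon"
fixed = fix_mojibake(broken)

# Text that is already correct fails step 2 and is left alone.
untouched = fix_mojibake("gar\u00e7on")
```

The fallback matters if only some rows are damaged: a correctly stored ç is a lone E7 byte after step 1, which is not valid UTF-8, so the sketch leaves such rows as they are.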
You can fix some of the characters by replacing them, but not all. By decoding the data using the wrong encoding you have already removed some information, and that is impossible to get back.
You should find out what the correct encoding is for those pages, and use that when decoding the data.
Some pages have the encoding in the response header, e.g.
Content-Type: text/html; charset=utf-8
Some pages have the encoding in the HTML head, e.g.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
If the information is not in the header you would first have to decode the page (or at least a part of it) using the ASCII encoding (which is not a problem as the meta tag contains no special characters), find out the encoding, then decode the page using the correct encoding.
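As a rough sketch of that lookup order (header first, then the meta tag, then a fallback), something like the following Python could work. The function names, the 2048-byte scan window, and the ISO-8859-1 fallback are all assumptions for this example; a real crawler would want an HTML parser and more careful handling of unknown charset names.

```python
import re
from typing import Optional

_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)
_META_RE = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset from a Content-Type header value, if present."""
    match = _CHARSET_RE.search(header_value)
    return match.group(1).lower() if match else None


def charset_from_html_head(raw: bytes) -> Optional[str]:
    """Scan the start of the page as ASCII and look for a meta charset."""
    head = raw[:2048].decode("ascii", errors="replace")
    match = _META_RE.search(head)
    return match.group(1).lower() if match else None


def decode_page(raw: bytes, content_type: Optional[str] = None,
                fallback: str = "iso-8859-1") -> str:
    """Decode using the header charset, else the meta tag, else the fallback."""
    charset = None
    if content_type:
        charset = charset_from_content_type(content_type)
    charset = charset or charset_from_html_head(raw) or fallback
    # Note: an unrecognised charset name raises LookupError here.
    return raw.decode(charset, errors="replace")
```

For example, decode_page(body, "text/html; charset=utf-8") decodes as UTF-8 without looking at the body, while decode_page(body) falls back to scanning for the meta tag.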
PostgreSQL has a string replacement function:
replace(string text, from text, to text): Replace all occurrences in string of substring from with substring to
Example:
replace ('abcdefabcdef', 'cd', 'XX') ==> abXXefabXXef

unicode escapes in objective-c

I have a string "Artîsté". I use json_encode from PHP on it and I get "Art\u00eest\u00e9".
How do I convert that to an NSString? I have tried many things and none of them work; I always end up getting ArtÃ®stÃ©.
For Example:
[NSString stringWithUTF8String:"Art\u00c3\u00aest\u00c3\u00a9"]; // ArtÃ®stÃ©
@"Art\u00c3\u00aest\u00c3\u00a9"; // ArtÃ®stÃ©
You can use CFStringCreateFromExternalRepresentation with the kCFStringEncodingNonLossyASCII encoding to parse the \uXXXX escape sequences. Check out my answer here:
Converting escaped UTF8 characters back to their original form
The problem is your input string:
"Art\u00c3\u00aest\u00c3\u00a9"
does in fact literally mean "ArtÃ®stÃ©". \u00c3 is 'Ã', \u00ae is '®', and \u00a9 is '©'.
Whatever is producing your input string is receiving UTF-8 input but expecting something else (e.g., cp1252, ISO-8859-1, or ISO-8859-15)
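This is easy to check outside Objective-C. The Python sketch below (illustration only) parses both escaped forms with a JSON parser and then undoes the double encoding, assuming the intermediate charset was ISO-8859-1.

```python
import json

# The escaped input from the question, as a JSON string literal.
escaped = r'"Art\u00c3\u00aest\u00c3\u00a9"'

# A JSON parser decodes the escapes faithfully: U+00C3 U+00AE ... U+00C3 U+00A9.
decoded = json.loads(escaped)

# Those code points are really the UTF-8 bytes of î and é read one byte at a time.
# Turning them back into bytes and decoding as UTF-8 recovers the intended text.
repaired = decoded.encode("iso-8859-1").decode("utf-8")

# What a correctly encoded producer would have sent instead.
correct = json.loads(r'"Art\u00eest\u00e9"')
```

Here decoded is "ArtÃ®stÃ©" while repaired and correct are both "Artîsté". The repair step is only a workaround; the cleaner fix is on the producing side, so that it emits \u00ee and \u00e9 in the first place.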

how to check if a string is Unicode in vb.net

Is there any way to check whether a string is Unicode using VB.NET?
Best Regards
inchikka
You need to read the file using the Encoding that the file is written in.
It appears to be a non Unicode file that you are trying to read as Unicode, or possibly a different Unicode encoding than the default UTF-8 (could be UTF-16 for example).
StreamWriter has several constructors that take an Encoding as a parameter.
You can do it by validating each character in the string against the 128 characters in the ASCII table. If a character is not found there, then it is a non-ASCII (Unicode) character.
Is that what you mean?
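The per-character check is short in any language. Here is a minimal sketch in Python (not VB.NET); the function names are made up for this example.

```python
def is_ascii(text: str) -> bool:
    """True if every character is one of the 128 ASCII code points."""
    return all(ord(ch) < 128 for ch in text)


def non_ascii_chars(text: str) -> list:
    """Return the characters that fall outside the ASCII table, in order."""
    return [ch for ch in text if ord(ch) >= 128]


plain = is_ascii("Hello World")
accented = is_ascii("Art\u00eest\u00e9")
offenders = non_ascii_chars("Art\u00eest\u00e9")
```

For "Artîsté" the check returns False and the offenders are î and é. In VB.NET the same idea would be a loop comparing each character's code against 127.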