How to convert Unicode strings (\u00e2, etc) into NSString for display? - objective-c

I am trying to support arbitrary Unicode from a variety of international users. They have already put a bunch of data into SQLite databases on their iPhones, and now I want to capture the data into a database, then send it back to their devices. Right now I am using a PHP page that sends data back from an internet MySQL database. The data is saved in the MySQL database properly, but when it's sent back it comes out as escaped Unicode text, such as
Frank\u00e2\u0080\u0099s iPad
instead of just
Frank's iPad
where the apostrophe should really be a curly apostrophe.
The answer posted to another question indicates that there are no built-in Cocoa methods to convert the "\u00e2\u0080\u0099" portion of the Unicode string from the web server into an NSString object. Is this correct?
That seems really surprising (and scarily disappointing), since Cocoa definitely handles input in many different Unicode characters, and I need to support any arbitrary language that I have never heard of, and all of its possible characters. I save the data to and from the local SQLite database just fine now, but once I send it to a web server, then perhaps pull down different data, I want to ensure the data pulled from the web server is correctly decoded.

[...] there is no built-in Cocoa methods to convert [...]. Is this
correct?
It's not correct.
You might be interested in CFStringTransform and its capabilities. It is a full-blown ICU transformation engine, which can (among other things) perform your requested transformation.
See Using Objective C/Cocoa to unescape unicode characters, ie \u1234

All NSStrings are Unicode.
The problem with the “Frank\u00e2\u0080\u0099s iPad” data isn't that it's Unicode; it's that it has been escaped down to ASCII. Worse, the escapes here are the three UTF-8 bytes of the curly apostrophe (U+2019 is 0xE2 0x80 0x99 in UTF-8) escaped one byte at a time, so the string was UTF-8-encoded and then escaped. “Frank’s iPad” is valid Unicode in any UTF, and is what you need.
So, you need to see whether the database is returning the data escaped or the PHP layer is escaping it at some point. If either of those is the case, fix it if you can; the PHP resource should return plain UTF-8/16/32. Only if that approach fails should you seek to unescape the string on the Cocoa side.
You're correct that there is no built-in way to unescape the string in Cocoa. If you get to that point, see if you can find some open-source code to do it; if not, you'll need to do it yourself, probably using NSScanner.
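If it does come to unescaping on the client, the logic required is small. Here is a sketch of it in Python rather than Cocoa, purely to show the two steps involved; the helper name is mine, and the same logic could be ported to NSScanner/NSString:

```python
def unescape_and_repair(raw: str) -> str:
    """Undo per-byte \\uXXXX escaping of UTF-8 text (hypothetical helper)."""
    # Step 1: turn the literal "\u00e2"-style escapes into the
    # characters U+00E2, U+0080, U+0099 (the escaped payload is ASCII).
    unescaped = raw.encode("ascii").decode("unicode_escape")
    # Step 2: those characters are really the UTF-8 bytes of U+2019
    # (0xE2 0x80 0x99) escaped one at a time, so map each character
    # back to its byte value and reinterpret the result as UTF-8.
    return unescaped.encode("latin-1").decode("utf-8")

print(unescape_and_repair("Frank\\u00e2\\u0080\\u0099s iPad"))
# -> Frank’s iPad (with a real curly apostrophe, U+2019)
```

Note that step 2 only applies because the server escaped UTF-8 bytes individually, as the sample data suggests; correctly escaped data would only need step 1.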

Check that your web service response has a Content-Type header with a charset, and that the XML declares its encoding. In PHP, add the following before printing the XML:
header('Content-type: text/xml; charset=UTF-8');
print '<?xml version="1.0" encoding="UTF-8"?>';
I guess there is just no encoding specified.

Related

How to handle carriage return, linefeed within a quoted string

Multiple source systems I want to process using Azure Data Lake contain a carriage return/linefeed within a column.
This causes Extract in ADLA to fail with the following error:
E_RUNTIME_USER_EXTRACT_UNEXPECTED_ROW_DELIMITER
I am trying to find a working configuration that stops running into this issue. The native Extractor documentation on Microsoft.com describes this:
Note that the rowDelimiter character inside a quoted string will not be escaped and will be used as a row separator which will lead to incorrect or failing extractions.
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/extractor-parameters-u-sql
Unfortunately this fails to mention a good workaround.
I tried switching to another format like Orc or Parquet. However, for the time being, these seem not to be fully supported yet. As this limits the functionality of the development environment, I would rather not use these formats for now.
This issue seems highly likely to occur, yet I am unable to find a good solution. What is a good and standard solution to work around this issue while still keeping the convenience of using csv/tsv to store files?
I've accomplished this by creating a custom extractor based on a third-party CSV parser: specifically, the CsvParser class from Josh Close's fantastic CsvHelper library. Works like a charm. Don't forget to set AtomicFileProcessing = true.
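The reason a parser like CsvHelper fixes this is that it is quote-aware: it only treats a newline as a row delimiter outside quoted fields (RFC 4180 behaviour). The same behaviour can be seen with Python's standard csv module on made-up data:

```python
import csv
import io

# The second record's "comment" field contains an embedded newline
# inside quotes; a naive line-splitting extractor would treat it as
# a row delimiter, but a quote-aware parser keeps it in the field.
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

rows = list(csv.reader(io.StringIO(data)))
print(rows)
# -> [['id', 'comment'], ['1', 'line one\nline two'], ['2', 'plain']]
```

A custom U-SQL extractor has to apply the same rule to the raw input stream instead of assuming one row per line.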

WINDOWS to Linux character set issue

I have different sets of SQL files which contain French/Spanish and other language characters. In Windows we can see the language-specific characters, but when the files are transferred to Linux I see weird characters.
I understand Windows uses character sets like WINDOWS-1252, WINDOWS-1258 and ISO-8859-1.
How can we match the Windows charset on Linux, so that we don't insert the weird characters into the DB when running the queries from Linux?
Thanks in advance.
If I'm understanding the problem correctly, you have SQL scripts produced in a variety of windows encodings that include non-ASCII characters. You want to execute these scripts on Linux.
I would think you'd want to losslessly convert the files to something that your Linux SQL parser can handle, probably Unicode UTF-8. This sort of conversion can be done with iconv (a command-line utility; I believe there are libraries as well).
A challenge, though, is knowing each file's original encoding, as this cannot necessarily be detected automatically. It might be better if you can get the script files' authors to provide the scripts with a specified encoding.
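As a sketch of what that conversion does (the file names and source encoding below are placeholders; you must know the real source encoding), an iconv call such as `iconv -f WINDOWS-1252 -t UTF-8 in.sql > out.sql` is equivalent to:

```python
def convert_to_utf8(path_in: str, path_out: str,
                    src_encoding: str = "cp1252") -> None:
    """Re-encode a text file from a legacy Windows code page to UTF-8."""
    with open(path_in, "r", encoding=src_encoding) as src, \
         open(path_out, "w", encoding="utf-8") as dst:
        dst.write(src.read())

# Hypothetical usage:
# convert_to_utf8("script.sql", "script.utf8.sql", src_encoding="cp1252")
```

If the wrong source encoding is given, the conversion will typically produce mojibake rather than an error, which is why knowing the original encoding matters.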
In windows, we are able to see the specific language characters
You can open the file in Notepad++ to see which encoding it is using, and you can also convert it to UTF-8 from there.
You will want to use the Encode or utf8 Perl modules.
Normally for SQL or MySQL you will set the DB encoding to what you prefer to work with. These days most people set it to UTF-8 to support a large range of character sets.
But in this case you can play around with the encodings until you find the right pair. This could work:
use Encode qw(decode encode);
# reinterpret the bytes: decode from Latin-1, then re-encode as UTF-8
$data = encode("utf8", decode("iso-8859-1", $data));

Default text encoding on iOS

I'm creating an app based on an API server. The server currently uses UTF-8 encoding.
Rather than write a line of code every time the API is accessed to re-encode text, I thought I may as well just set the server API to have the same text encoding as the app.
Only problem is I can't find out the default encoding!
Where would I find this information, and can I change the default encoding on the app?
Cheers!
There is no default encoding. NSString can have various encodings internally.
I know that on Mac OS X, ASCII, UTF-8 and UTF-16 with host byte order are always among the possible internal representations; iOS shouldn't be different, though I'm not totally sure. I think it's safe to assume that stringWithUTF8String: will not cause any extra re-encoding.
NSString's class method will do the job:
+ (NSStringEncoding)defaultCStringEncoding;

How do I import Spanish into a SQL DB?

So I have some Spanish content saved in Excel that I am exporting into .csv format so I can import it from the Firefox SQL manager add-on into a SQL database. The problem is that when I import it, whenever there is an accent mark (or whatever the technical name for those marks is), Firefox doesn't recognize it, and accordingly produces a big black diamond with a white "?". Is there a better way to do this? Is there something I can do to have my Spanish content readable in a SQL database? Maybe a program preferable to the Firefox extension? Please let me know if you have any thoughts or ideas. Thanks!
You need to follow the chain and make sure nothing gets lost "in translation".
Specifically:
- assert which encoding is used in the CSV file; ensure that the special characters are effectively in there, and see how they are encoded (UTF-8, a particular code page, ...)
- ensure that the SQL server can a) read these characters and b) store them in an encoding which will preserve their integrity (BTW, the encoding used in the CSV can of course be remapped to some other encoding of your choosing, i.e. one that you know will be suitable for consumption by your target application)
- ensure that the database effectively stored these characters OK
- see if Firefox (or whichever "consumer" of this text) properly handles characters in this particular encoding
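For the first step, a quick (and rough) check is whether the file's bytes decode cleanly as UTF-8; if they don't, the file is likely in a legacy code page. A minimal Python sketch, where the cp1252 fallback is an assumption appropriate for Western-European exports:

```python
def sniff_encoding(raw: bytes) -> str:
    """Heuristic only: strict UTF-8 decoding either succeeds or fails.
    The fallback cannot distinguish between legacy code pages."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"  # assumed Western-European Windows export

print(sniff_encoding("añadir".encode("utf-8")))   # -> utf-8
print(sniff_encoding("añadir".encode("cp1252")))  # -> cp1252
```

Dedicated detector libraries exist, but for a controlled export pipeline it is better to fix the encoding at the source than to guess it downstream.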
It is commonplace but useful for this type of inquiry to recommend the following reading assignment:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

I need a string that won't properly convert to ANSI using several code pages

My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET supports Unicode, this makes it possible for users to pass a string to the library that doesn't properly convert to ANSI. For example, on an English machine, "デスクトップ" will turn in to "?????" when passed to the C library.
To address this, I wrote a method that detects when this will happen by comparing the original string to a string converted using the ANSI code page. I'd like to test this method, but I really need a string that's guaranteed not to be encodable. For example, we test our code on English and Japanese machines (among other languages). If I write the test to use the Japanese string above, the test will fail when the Japanese system properly encodes the string. I could write the test to check the current system's encoding, but then I have a maintenance nightmare every time we add/remove a language.
Is there a unicode character that doesn't encode with any ANSI code page? Failing that, could a string be constructed with characters from enough different code pages to guarantee failure? My first attempt was to use Chinese characters since we don't cover Chinese, but apparently Japanese can convert the Chinese characters I tried.
edit I'm going to accept the answer that proposes a Georgian string for now, but was really expecting a result with a smattering of characters from different languages. I don't know if we plan on supporting Georgian so it seems OK for now. Now I have to test it on each language. Joy!
There are quite a few Unicode-only languages. Georgian is one of them. Here's the word 'English' in Georgian: ინგლისური
You can find more in Georgian file (ka.xml) of the CLDR DB.
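To see why a Georgian word is a safe choice, a round-trip check (the same idea as the detection method described in the question, sketched here in Python rather than .NET) fails for it under common Windows code pages; the code pages below are illustrative examples:

```python
word = "ინგლისური"  # 'English' in Georgian

def survives(text: str, codepage: str) -> bool:
    """True if the text round-trips through the code page unchanged."""
    return text.encode(codepage, errors="replace").decode(codepage) == text

# Western-European, Japanese, and Cyrillic Windows code pages all
# lack Georgian letters, so the round-trip replaces them with '?':
for cp in ("cp1252", "cp932", "cp1251"):
    print(cp, survives(word, cp))
# -> False for each of the three code pages
```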
If by "ANSI" you mean Windows code pages, I am pretty sure that characters outside the BMP are not covered by any Windows code page.
For instance, try some of the Byzantine Musical Symbols.
There are Windows code pages which cover all Unicode characters (e.g. Cp1200, Cp12000, Cp65000 and Cp65001), so it's not always possible to create a string which is not convertible.
What do you mean by an 'ANSI code page'? On Windows, the code pages are Microsoft, not ANSI. ISO defines the 8859-x series of code sets; Microsoft has Windows code pages analogous to most of these.
Are you thinking of single-byte code sets? If so, you should look for Unicode characters in esoteric languages for which there is less likely to be a non-Unicode, single-byte code set.
You could look at languages such as Devanagari, Ol Chiki, Cherokee, or Ogham.