Windows to Linux character set issue - SQL

I have a set of SQL files containing French, Spanish, and other language characters. On Windows we can see the language-specific characters correctly, but after the files are transferred to Linux I see garbled characters instead.
I understand Windows uses different character sets such as WINDOWS-1252, WINDOWS-1258, and ISO-8859-1.
How can we set up a matching character set on Linux, so that we don't insert garbled characters into the DB when running the queries from Linux?
Thanks in advance.

If I'm understanding the problem correctly, you have SQL scripts produced in a variety of Windows encodings that include non-ASCII characters, and you want to execute these scripts on Linux.
I would think you'd want to losslessly convert the files to something that your Linux SQL parser can handle, probably Unicode UTF-8. This sort of conversion can be done with iconv (a command-line utility; I believe there are libraries as well).
A challenge, though, is whether you know what each file's original encoding is, as this cannot necessarily be detected automatically. It might be better if you can get the script files' authors to provide the scripts with a specified encoding.
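For example, assuming a script is known to be in WINDOWS-1252 (the encoding name here is an assumption; it has to match the actual file), the conversion is a one-liner:
iconv -f WINDOWS-1252 -t UTF-8 script.sql > script-utf8.sql
iconv stops with an error when it meets a byte sequence that is not valid in the source encoding, which is a useful sanity check when the guessed encoding is wrong.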

On Windows we can see the language-specific characters correctly
You can open the file in Notepad++ to see which encoding it is using, and you can also convert it to UTF-8 from there.

You will want to use the Encode or utf8 modules.
Normally for SQL or MySQL you will set the DB encoding to whatever you prefer to work with. These days most people set it to UTF-8 to support a large range of character sets.
But in this case you can play around with the source encoding until you find the right one. Something like this could work:
use Encode qw(decode encode);
# Decode from the source encoding (ISO-8859-1 here) and re-encode as UTF-8.
$data = encode("UTF-8", decode("iso-8859-1", $data));

Related

Pex-generated tests encoded as UCS-2 Little Endian - why, and how to change it?

Hi there,
I noticed that when I generate a Pex test solution, the default encoding of the files is UCS-2 Little Endian. This is not ideal, because all the rest of the files are normally encoded as Windows ANSI
(I'm getting this info from Notepad++), and it's confirmed by my CI breaking.
Does anyone know:
1) why is it using this encoding?
2) how to change it so by default it uses Windows ANSI like the rest of the files?
NOTE: I know this is the issue because I saved the file with Windows ANSI encoding and it all works.
I know I probably shouldn't have, but I went and posted this same question on the Pex forum:
link to the question
and this was an answer from Peli (he is heavily involved in the Pex project, AFAIK).
Copy of the answer:
1) why is it using this encoding?
There is no particular reason for this, other than that we decided to use this particular encoding. We will switch to Windows-1252 (ANSI) encoding in the future for source files. XML files will still be encoded as UTF-8.
2) how to change it so by default it uses Windows ANSI like the rest of the files
Unfortunately, this is hard-coded in Pex and you cannot change this. The next release of Pex (0.93) will use ANSI.

How to convert Unicode strings (\u00e2, etc) into NSString for display?

I am trying to support arbitrary Unicode text from a variety of international users. They have already put a bunch of data into SQLite databases on their iPhones, and now I want to capture the data into a database, then send it back to their devices. Right now I am using a PHP page that sends data back from an Internet MySQL database. The data is saved in the MySQL database properly, but when it's sent back it comes out as escaped Unicode text, such as
Frank\u00e2\u0080\u0099s iPad
instead of just
Frank's iPad
where the apostrophe should really be a curly apostrophe.
The answer posted to another question indicates that there are no built-in Cocoa methods to convert the "\u00e2\u0080\u0099" portion of the Unicode string from the web server into an NSString object. Is this correct?
That seems really surprising (and scarily disappointing), since Cocoa definitely allows input of many different Unicode characters, and I need to support arbitrary languages I have never heard of, with all of their possible characters. I save them to and from the local SQLite database just fine now, but once I send the data to a web server and later pull down different data, I want to ensure the data pulled from the web server is correctly formatted.
[...] there are no built-in Cocoa methods to convert [...]. Is this correct?
It's not correct.
You might be interested in CFStringTransform and its capabilities. It is a full-blown ICU transformation engine, which can (also) perform your requested transformation.
See Using Objective C/Cocoa to unescape unicode characters, ie \u1234
All NSStrings are Unicode.
The problem with the “Frank\u00e2\u0080\u0099s iPad” data isn't that it's Unicode; it's that it's escaped to ASCII. “Frank’s iPad” is valid Unicode in any UTF, and is what you need.
So, you need to see whether the database is returning the data escaped or the PHP layer is escaping it at some point. If either of those is the case, fix it if you can; the PHP resource should return UTF-8/16/32. Only if that approach fails should you seek to unescape the string on the Cocoa side.
You're correct that there is no built-in way to unescape the string in Cocoa. If you get to that point, see if you can find some open-source code to do it; if not, you'll need to do it yourself, probably using NSScanner.
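If you do end up unescaping on the client, note that the example string needs two repairs: the \uXXXX escapes have to be decoded, and the resulting code points (which appear to be the raw UTF-8 bytes of the curly apostrophe) have to be reinterpreted as UTF-8. A minimal sketch of that repair in Python, for illustration only since the question is about Cocoa:
escaped = "Frank\\u00e2\\u0080\\u0099s iPad"                   # literal backslash-u text as received
unescaped = escaped.encode("ascii").decode("unicode_escape")   # decode the \uXXXX escapes
fixed = unescaped.encode("latin-1").decode("utf-8")            # those code points were really UTF-8 bytes
print(fixed)                                                   # Frank’s iPad, with a curly apostrophe
The same two steps apply whatever tool performs them; the better fix, as noted above, is for the server side to return unescaped UTF-8 in the first place.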
Check that your web service response has a Content-Type header with a charset, and that the XML has its encoding specified. In PHP you need to add the following before printing the XML:
header('Content-type: text/xml; charset=UTF-8');
print '<?xml version="1.0" encoding="UTF-8"?>';
I guess there is just no encoding specified.

How do I import Spanish into a SQL DB?

So I have some Spanish content saved in Excel that I am exporting into .csv format so I can import it through the Firefox SQL manager add-on into a .sql db. The problem is that when I import it, whenever there is an accent mark (or whatever the technical name for those is), Firefox doesn't recognize it, and accordingly produces a big black diamond with a white "?". Is there a better way to do this? Is there something I can do to have my Spanish content readable in a SQL db? Maybe a program preferable to the Firefox extension? Please let me know if you have any thoughts or ideas. Thanks!
You need to follow the chain and make sure nothing gets lost "in translation".
Specifically:
ascertain which encoding is used in the CSV file; ensure that the special characters are actually in there, and see how they are encoded (UTF-8, a particular code page, ...); a quick check of this step is sketched after this list
ensure that the SQL server can
a) read these characters and
b) store them in an encoding which will preserve their integrity (BTW, the encoding used in the CSV can of course be remapped to some other encoding of your choosing, i.e. one that you know will be suitable for consumption by your target application)
ensure that the database actually stored these characters correctly.
see if Firefox (or whichever "consumer" of this text) properly handles characters in this particular encoding.
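For the first step in this chain, a quick way to confirm what the CSV actually contains and to rewrite it as UTF-8 before importing (a Python sketch; the file name and the cp1252 guess, a typical Excel default on Windows, are assumptions to adjust for your own export):
with open("spanish.csv", "rb") as f:
    raw = f.read()

text = raw.decode("cp1252")        # raises UnicodeDecodeError if the encoding guess is wrong
print(text[:200])                  # eyeball the accented characters

with open("spanish-utf8.csv", "w", encoding="utf-8") as f:
    f.write(text)                  # re-exported as UTF-8 for the import step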
It is commonplace but useful for this type of inquiry to recommend the following reading assignment:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

I need a string that won't properly convert to ANSI using several code pages

My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET supports Unicode, this makes it possible for users to pass a string to the library that doesn't properly convert to ANSI. For example, on an English machine, "デスクトップ" will turn into "?????" when passed to the C library.
To address this, I wrote a method that detects when this will happen by comparing the original string to a string converted using the ANSI code page. I'd like to test this method, but I really need a string that's guaranteed not to be encodable. For example, we test our code on English and Japanese machines (among other languages). If I write the test to use the Japanese string above, the test will fail when the Japanese system properly encodes the string. I could write the test to check the current system's encoding, but then I have a maintenance nightmare every time we add/remove a language.
Is there a Unicode character that doesn't encode with any ANSI code page? Failing that, could a string be constructed with characters from enough different code pages to guarantee failure? My first attempt was to use Chinese characters, since we don't cover Chinese, but apparently the Japanese code page can encode the Chinese characters I tried.
Edit: I'm going to accept the answer that proposes a Georgian string for now, but I was really expecting a result with a smattering of characters from different languages. I don't know if we plan on supporting Georgian, so it seems OK for now. Now I have to test it on each language. Joy!
There are quite a few Unicode-only languages. Georgian is one of them. Here's the word 'English' in Georgian: ინგლისური
You can find more in the Georgian file (ka.xml) of the CLDR DB.
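To double-check that a candidate test string really fails the round-trip comparison the question describes against a given code page, here it is sketched in Python for illustration (the code page name is just an example; the .NET version would use the corresponding Encoding API):
def survives_roundtrip(text, codepage="cp1252"):
    # Encode with replacement and decode back; lossy characters come back as "?".
    return text.encode(codepage, errors="replace").decode(codepage) == text

print(survives_roundtrip("ინგლისური"))   # False: cp1252 has no Georgian letters
print(survives_roundtrip("café"))        # True: cp1252 covers Western European text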
If by "ANSI" you mean Windows code pages, I am pretty sure the characters out of BMP are not covered by any Windows code pages.
For instance, try some of Byzantine Musical Symbols
There are Windows code pages, which cover all Unicode characters (e.g. Cp1200, Cp12000, Cp65000 and Cp65001), so it's not always possible to create a string, which is not convertable.
What do you mean by an 'ANSI code page'? On Windows, the code pages are Microsoft, not ANSI. ISO defines the 8859-x series of code sets; Microsoft has Windows code pages analogous to most of these.
Are you thinking of single-byte code sets? If so, you should look for Unicode characters in esoteric languages for which there is less likely to be a non-Unicode, single-byte code set.
You could look at scripts such as Devanagari, Ol Chiki, Cherokee, or Ogham.

sshfs EBCDIC to ASCII

What I want to do is mount, via sshfs, some files on the mainframe through USS onto my local PC. I can do that, but sshfs doesn't do the conversion from EBCDIC to ASCII/Unicode straight off. Are there any flags I can set?
Alternatively, does anybody know of a library that does EBCDIC-to-ASCII conversion that I could add to sshfs?
Cheers
Mark
Be aware though that transparent charset conversion is a very dangerous game. Are you absolutely sure that you will never read anything but EBCDIC files via SSHFS? What if there is binary data?
Some systems used transparent conversions in the past:
the infamous "ASCII mode" of FTP, which messed up many binary downloads
the vfat filesystem in Linux, which notes: "Programs that do computed lseeks won't like in-kernel text conversion. Several people have had their data ruined by this translation. Beware!"
So I'd strongly advise you to be aware of the consequences.
Why not use an editor that can handle EBCDIC? Vim, for example, can do it (if support is compiled in).
There are several libraries for character set conversion — iconv (normally part of your C library; see for example iconv_open) and GNU recode come to mind.
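If a library-based, after-the-fact conversion is acceptable, Python's standard codecs also ship several EBCDIC code pages (cp037, cp500, cp1140, among others); which one matches the mainframe is an assumption you have to confirm. A minimal sketch:
with open("mainframe.txt", "rb") as f:
    data = f.read()

text = data.decode("cp037")    # cp037 = EBCDIC (US/Canada); adjust to the host's flavour
print(text)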
I know a lot of time has passed since the original question but I'll leave the info here:
I've written a patch for sshfs that adds automatic conversion between ASCII and EBCDIC. It can be found here: https://github.com/vadimshchukin/sshfs-ebcdic
The patch adds a "-t" command-line option, which takes a regular expression selecting the files that should be converted. For example, sshfs -t".*" enables conversion for all files.
I had to "hard-code" the conversion table, since there are various "flavours" of EBCDIC and iconv didn't translate the text between ASCII and EBCDIC on my system as needed. The advantage here is that someone can easily change that translation table as needed.
By the way, I wrote the same patch for win-sshfs.