Western European character set to Turkish in SQL

I am having a serious issue with character encoding. To give some background:
I have Turkish business users who enter data on Unix screens in the Turkish language.
My database NLS parameters are set to AMERICAN, WE8ISO8859P1, and the Unix NLS_LANG to AMERICAN_AMERICA.WE8ISO8859P1.
The Turkish business users can see all the Turkish characters on the UNIX screens and in TOAD, while I cannot; I only see them rendered in the Western European character set.
At business end: ÖZER İNŞAAT TAAHHÜT VE
At our end : ÖZER ÝNÞAAT TAAHHÜT VE
Notice that the Turkish characters İ and Ş are being converted into the ISO 8859-1 character set, even though all the settings (NLS parameters in the database and on Unix) are the same at both ends: ISO 8859-1 (Western European).
After some study, I understand that the Turkish machines can display Turkish data by converting it in real time (the database NLS settings are overridden by the local NLS settings).
Now, I have an interface running against my database: PL/SQL scripts (run through a shell script) extract data from the database and spool it to a .csv file on a Unix path. That .csv file is then transferred to an external system via MFT (Managed File Transfer).
The problem is that the extract never contains any Turkish characters. Every Turkish character is converted into the Western European character set and is sent that way to the external system, which treats it as data conversion/loss, and my business users are very unhappy.
Could anyone tell me how I can retain all the Turkish characters?
P.S.: The external system's character set could be set to ISO 8859-9.
Many thanks in advance.

If you are saying that your database character set is ISO-8859-1, i.e.
SELECT parameter, value
FROM v$nls_parameters
WHERE parameter = 'NLS_CHARACTERSET'
returns a value of WE8ISO8859P1 and you are storing the data in CHAR, VARCHAR, or VARCHAR2 columns, the problem is that the database character set does not support the full set of Turkish characters. If a character is not in the ISO-8859-1 codepage layout, it cannot be stored properly in database columns governed by the database character set. If you want to store Turkish data in an ISO-8859-1 database, you could potentially use workaround characters instead (e.g. substituting S for Ş). If you want to support the full range of Turkish characters, however, you would need to move to a character set that supports all of them; either ISO-8859-9 or UTF-8 would be relatively common.
Changing the character set of your existing database is a non-trivial undertaking, however. There is a chapter in the Globalization Support Guide for whatever version of Oracle you are using that covers character set migration. If you want to move to a Unicode character set (which is generally the preferred approach rather than sticking with one of the single-byte ISO character sets), you can potentially leverage the Oracle Database Migration Assistant for Unicode.
At this point, you'll commonly see the objection that at least some applications are seeing the data "correctly", so the database must support the Turkish characters. The problem is that if you set up your NLS_LANG incorrectly, it is possible to bypass character set conversion entirely, meaning that whatever binary representation a character has on the client gets persisted to the database without modification. As long as every process that reads the data configures its NLS_LANG identically (and identically incorrectly), things may appear to work.

However, you will very quickly find some other application that cannot be configured with the same incorrect NLS_LANG. A Java application, for example, will always want to convert the data from the database into a Unicode string internally. So if you're storing the data incorrectly in the database, as it sounds like you are, there is no way to get those applications to read it correctly.

If you are simply using SQL*Plus in a shell script to generate the file, it is almost certainly possible to get your client configured incorrectly so that the data file appears to be correct. But it would be a very bad idea to let the existing misconfiguration persist. You open yourself up to much bigger problems in the future (if you're not already there): different clients inserting data in different character sets into the database, making the mess much more difficult to disentangle; tools like the Oracle export utility corrupting the data that is exported; or a tool that can't be configured incorrectly being unable to display the data. You're much better served getting the problem corrected early.
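A quick way to see which situation you are in is to look at the raw bytes with DUMP. This is only a sketch; customer_master and company_name are placeholders for wherever the Turkish text lives:
-- If conversion was bypassed, 'İ' and 'Ş' show up as the single bytes dd and de
-- (their ISO-8859-9 codes), which WE8ISO8859P1 then renders as Ý and Þ.
SELECT company_name,
       DUMP(company_name, 1016) AS stored_bytes
FROM customer_master
WHERE ROWNUM <= 10;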

Just setting your NLS_LANG parameter to AMERICAN_AMERICA.WE8ISO8859P9 is enough for the Turkish language.

Related

Postgresql searching with non-ascii character using ILIKE

I have encountered an interesting problem with character encodings in PostgreSQL. The problem occurs when searching a table using the ILIKE operator with non-ASCII characters ('ä', 'ö', ...) in the search term. The search results seem to differ depending on the encoding of the search term (as these non-ASCII characters can be encoded in several different ways).
I know for a fact that in some tables (within 'character varying' columns) I have non-ASCII characters in their correct UTF-8 form and also in some other form (maybe latin-1, I'm not sure). For example, the 'ä' character is sometimes present as the correct UTF-8 representation 'C3 A4' but sometimes like this: '61 cc 88'. This is probably because I imported some legacy data into the database that was not in UTF-8. It is not a problem most of the time, as the values are presented correctly in the web application UI anyway. Only the search is the issue: I cannot figure out a way to make the search find all the relevant entries in the database, because the results vary depending on the search term's encoding (basically, the search does not match the legacy data correctly, since its non-ASCII characters are mixed up).
Here are the facts on the application:
Postgresql 11.5 (via Amazon RDS)
Npgsql v4.1.2
.net core 3
react frontend
database encoding UTF8, Collation en_US.UTF-8, Character type en_US.UTF-8
Any ideas what to do?
Write a batch job to convert all the data to UTF-8, update the database, and somehow guard further inserts so that nothing but UTF-8 gets in?
Make PostgreSQL's ILIKE ignore the character encoding?
Edit: I found a good page on Unicode normalization in PostgreSQL: https://www.2ndquadrant.com/en/blog/unicode-normalization-in-postgresql-13/
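For reference, a minimal sketch of what the normalize() approach from that post looks like, assuming PostgreSQL 13+ (normalize() does not exist on 11.5) and a hypothetical customers table with a name column:
-- Compare both sides in the same normalization form, so the precomposed
-- spelling of 'ä' ('c3 a4') and the decomposed one ('61 cc 88') both match.
SELECT * FROM customers WHERE normalize(name, NFC) ILIKE normalize('%ä%', NFC);
-- One-off cleanup of legacy rows that are not already in NFC form:
UPDATE customers SET name = normalize(name, NFC) WHERE name IS NOT NFC NORMALIZED;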

HANA: Unknown Characters in Database column of datatype BLOB

I need help resolving characters of an unknown type from a database field into a readable format, because I need to overwrite this value at the database level with another valid value (in the exact format the application stores it in) in order to automate system copy activities.
I have a proprietary application that also allows users to configure it via the frontend. This configuration data gets stored in a table, and the values of a configuration property are stored in a column of type "BLOB". For the value in question, I provide a valid URL in the application frontend (like http://myserver:8080). However, what gets stored in the database is not readable (some square characters). I tried all sorts of HANA conversion functions (HEX, binary), both simple and cascaded (e.g. first to binary, then to varchar), to make it readable. I also tried it the other way around, to make the value I want to insert appear in the correct format (conversion to BLOB via hex or binary), but that does not work either. I copied the value to the clipboard and compared it against all sorts of character set tables (although I am not sure whether that can work at all).
My conversion attempts look something like this:
SELECT TO_ALPHANUM('') FROM DUMMY;
where the quotes would contain the characters in question; I can't even print them here.
How can one approach this and maybe find out the character set that is used by this application? I would be grateful for some more ideas.
What you have in your BLOB column is a series of bytes. As you mentioned, these bytes have been written by an application that uses an unknown character set.
In order to interpret those bytes correctly, you need to know the character set, as that is literally the mapping of bytes to characters or character identifiers (e.g. code points in Unicode).
Now, HANA doesn't come with a whole lot of options to work on LOB data in the first place and for C(haracter)LOB data most manipulations implicitly perform a conversion to a string data type.
So, what I would recommend is to write a custom application that is able to read out the BLOB bytes and perform the conversion in that custom app. Once successfully converted into a string, you can store the data in a new NCLOB field that keeps it in UTF-8 encoding.
You will have to know the character set in the first place, though. No way around that.
I assume you are on Oracle. You can convert BLOB to CLOB as described here.
http://www.dba-oracle.com/t_convert_blob_to_clob_script.htm
For your example, try this query:
select UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(<your_blob_value>)) from dual;
Obviously this only works for values below 32767 bytes.
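If the bytes turn out not to be in the database character set, UTL_I18N.RAW_TO_CHAR lets you name the source character set explicitly instead of assuming it. This is only a sketch; config_table and config_value are placeholders, and the two character set names are just guesses worth trying:
select UTL_I18N.RAW_TO_CHAR(DBMS_LOB.SUBSTR(config_value, 2000, 1), 'AL32UTF8')  as as_utf8,
       UTL_I18N.RAW_TO_CHAR(DBMS_LOB.SUBSTR(config_value, 2000, 1), 'AL16UTF16') as as_utf16
from config_table;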

"¿" (inverted question mark) character in oracle

I found "¿" character (inverted question mark) in the database tables in place of single quote (') character.
Can anyone let me know how I can remove this character from the table?
There are many rows that contain text with this character, but not all single quotes are turning into this ¿ symbol.
I am not even able to filter the rows in order to replace this character (¿) with a single quote again.
When I use LIKE '%¿%' it returns the text containing the ordinary question mark (?).
In general there are two possibilities:
Your database tables really do contain ¿ characters, caused by wrong NLS_LANG settings when the data was inserted (or by the database character set not supporting the special character). In that case the LIKE '%¿%' condition should work. However, it also means you have corrupt data in your database, and it is almost impossible to correct it because ¿ stands for any wrong character.
Your client (e.g. SQL*Plus) is not able to display the special character, either because of wrong NLS_LANG settings or because the font does not support it.
Which client do you use (SQL*Plus, TOAD, SQL Developer, etc.)?
What is your NLS_LANG Environment variable, resp. your Registry key HKLM\SOFTWARE\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG or HKLM\SOFTWARE\Wow6432Node\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG?
What do you get when you select DUMP(... , 1016) from your table?
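For example (your_table and col are placeholders for the affected table and column):
select col, DUMP(col, 1016) from your_table where rownum <= 10;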
Was it really a simple quote (ASCII 39)? If they actually were some sort of "smart quotes", those do have mappings in windows-1252, but there is no ISO-8859-1 mapping for them. So if your database character set is ISO-8859-1 and you try to insert some windows-1252 text, Oracle tries to translate from windows-1252 to ISO-8859-1 and uses chr(191) to signal an unmappable character. chr(191) happens to be the inverted question mark.
You can check that by executing this (copy the select to preserve the smart quotes):
select dump('‘’“”') from dual
This behaviour is basically "correct", as what you are asking Oracle to do cannot be done, though it is not very intuitive.
See the windows-1252, ISO-8859-1 and ISO-8859-15 charsets for comparison. Note that Windows uses the range 128-159, which is not used in the ISO charsets.
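If it is case 1, a sketch for locating the affected rows (your_table and col again being placeholders); chr(191) is the byte ¿ occupies in ISO-8859-1, so searching for it directly avoids any client-side encoding issues with the literal:
select col, DUMP(col, 1016) as bytes from your_table where instr(col, chr(191)) > 0;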

Is there a database that accepts special characters by default (without converting them)?

I am currently starting from scratch choosing a database to store data collected from a suite of web forms. Humans will be filling out these forms, and as they're susceptible to using international characters, especially those humans named José and François and أسامة and 布鲁斯, I wanted to start with a modern database platform that accepts all types (so to speak), without conversion.
Q: Do databases exist that, from the start, accept the wide diversity of characters found in modern typefaces? If so, what are the drawbacks to a database that doesn't need to convert as much data in order to store it?
// Anticipating two answers that I'm not looking for:
I found many answers on how someone could CONVERT (or encode) a special character, like é or a copyright symbol ©, into a database-legal character sequence like &copy; (for ©) so that a database can then accept it. This requires a conversion/translation layer to shuttle data into and out of the database. I know that has to happen on some level, just as the letter z is reducible to 1's and 0's, but I'm really talking about finding a human-readable database, one that doesn't need to translate.
I also see suggestions that people change the character encoding of their current database to one that accepts a wider range of characters. This is a good solution for someone who is carrying over a legacy system and wants to make it relevant to the wider range of characters that early computers, and the early web, didn't anticipate. I'm not starting with a legacy system. I'm looking for some modern database options.
Yes, there are databases that support large character sets. How to accomplish this is different from one database to another. For example:
In MS SQL Server you can use the nchar, nvarchar and ntext data types to store Unicode (UCS-2) text.
In MySQL you can choose UTF-8 as encoding for a table, so that it will be able to store Unicode text.
For any database that you consider using, you should look for Unicode support to see if it can handle large character sets.
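A couple of hedged sketches of what that looks like in practice (the table and column names are made up):
-- SQL Server: N-prefixed types store Unicode; prefix Unicode literals with N.
CREATE TABLE people (name NVARCHAR(100));
INSERT INTO people (name) VALUES (N'José'), (N'布鲁斯');
-- MySQL: utf8mb4 covers the full Unicode range (the older utf8 alias does not).
CREATE TABLE people (name VARCHAR(100)) DEFAULT CHARACTER SET utf8mb4;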

Weird character (�) in SQL Server View definition

I have generated the Create statement for a SQL Server view.
Pretty standard, although there is some replacing happening on a varchar column, such as:
select Replace(txt, '�', '-')
What the heck is '�'?
When I run that against a row that contains that character, I am seeing the literal '?' being replaced.
Any ideas? Do I need some special encoding in my editor?
Edit
If it helps, the endpoint is a Google feed.
You need to read the script in the same encoding as that in which it was written. Even then, if your editor's font doesn't include a glyph for the character, it may still not display correctly.
When the script was created, did you choose an encoding, or accept the default? If the latter, you need to find out which encoding was used. UTF-8 is likely.
However, in this case, the character may not be a mis-representation. Unicode replacement character explains that this character is used as a replacement for some other character that cannot be represented. It's possible in your case that the code you are looking at is simply saying, if we have some data that could not be represented, treat it as a hyphen instead. In other words, this may be nothing to do with the script generation/viewing process, but rather a deliberate piece of code.
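If the intent really is to strip the Unicode replacement character, one way to express that in SQL Server without depending on how the editor saved the script is NCHAR(65533), i.e. U+FFFD. This is only a sketch; dbo.my_view stands in for the view in question, and whether it matches anything depends on txt actually being an N-type column that can hold that character:
SELECT REPLACE(txt, NCHAR(65533), '-') AS cleaned_txt FROM dbo.my_view;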