Postgresql searching with non-ascii character using ILIKE - sql

I have encountered an interesting problem with character encodings in PostgreSQL. The problem occurs when searching a table with the ILIKE operator and non-ASCII characters ('ä','ö',..) in the search term. It seems that the search results differ depending on the encoding of the search term (as these non-ASCII characters can be encoded in several different ways).
I know for a fact that some tables (within 'character varying' columns) contain non-ASCII characters in UTF-8 encoding and also in some other encoding (maybe Latin-1, not sure). For example, the 'ä' character is sometimes present as the correct UTF-8 representation 'C3 A4' but sometimes like this: '61 cc 88'. This is probably caused by the fact that I imported some legacy data that was not in UTF-8 into the database. This is not a problem most of the time, as the values are displayed correctly in the web application UI anyway. Only the search is the issue, i.e. I cannot figure out a way to make the search find all the relevant entries, because the results vary depending on the search term's encoding (basically, the search does not match the legacy data correctly, as its non-ASCII characters are encoded differently).
Here are the facts on the application:
Postgresql 11.5 (via Amazon RDS)
Npgsql v4.1.2
.net core 3
react frontend
database encoding UTF8, Collation en_US.UTF-8, Character type en_US.UTF-8
Any ideas what to do?
Write a batch job to convert all data to UTF-8, update the database, and somehow guard further data inserts so that nothing but UTF-8 gets stored?
Make PostgreSQL's ILIKE ignore the character encoding?
Edit: Found a good page on Unicode normalization in PostgreSQL: https://www.2ndquadrant.com/en/blog/unicode-normalization-in-postgresql-13/
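For what it's worth, '61 cc 88' is itself valid UTF-8: it is 'a' followed by the combining diaeresis U+0308, i.e. the decomposed (NFD) form of 'ä', which is exactly what Unicode normalization deals with. Assuming an upgrade to PostgreSQL 13 or later (where the normalize() function covered in the linked post is available), a minimal sketch with placeholder table and column names might look like this:
-- Sketch only: assumes PostgreSQL 13+; "customers" and "name" are placeholders.
-- Search with both sides normalized to the same form (NFC here):
SELECT *
FROM customers
WHERE normalize(name, NFC) ILIKE '%' || normalize('ä', NFC) || '%';
-- One-off batch fix, rewriting stored values into NFC:
UPDATE customers
SET name = normalize(name, NFC)
WHERE name IS DISTINCT FROM normalize(name, NFC);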

Related

Text format in Databricks (Azure) to CSV

I have special characters and foreign-language text in my data set.
When I run a SQL query (SELECT * FROM table1), the results are fine; I get them in the required format.
E.g. output of the SQL:
1 | 请回复邮件
2 | Don’t know
When the same data is exported to a CSV on my local machine, the text changes to weird symbols.
Exported data:
1 | 请回å¤é‚®
2 | Don’t know
How do I get the same format to CSV as in SQL?
Ensure that UTF-8 encoding is used instead of ISO-8859-1/Windows-1252 encoding in the browser and editor.
It's better to utilise CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci in the future. (A revised version of the Unicode collation is in the works.)
utf8mb4 is a superset of utf8, as it can support 4-byte utf8 codes, which are required by Emoji and some Chinese characters.
"UTF-8" outside of MySQL refers to all size encodings, thus it's basically the same as MySQL's utf8mb4, not utf8.
The data can't be trusted whether viewed with a tool or with SELECT. Too many of these clients, particularly browsers, attempt to compensate for erroneous encodings by displaying proper text even if the database is messed up. So, choose a table and column with some non-English content and work on it.
SELECT col, HEX(col) FROM tbl WHERE ...
For correctly stored UTF-8, the HEX will be:
20 (in any language) for a space
4x, 5x, 6x, or 7x for English
Cxyy for accented letters in much of Western Europe
Dxyy for Cyrillic, Hebrew, and Farsi/Arabic
Exyyzz for most of Asia
F0yyzzww for Emoji and some Chinese characters
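To make that check concrete, a query along these lines (MySQL-style syntax, with placeholder table and column names) surfaces only the rows whose stored bytes go beyond plain ASCII, so there is less to eyeball:
-- Placeholder names; HEX() shows the raw stored bytes, and the REGEXP keeps
-- only values containing at least one byte >= 0x80 (i.e. non-ASCII content).
SELECT col, HEX(col)
FROM tbl
WHERE HEX(col) REGEXP '^(..)*[89A-F]'
LIMIT 20;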
There are a few possible fixes here, depending on the case; see "Fixes for various Cases".
After downloading the data from Databricks, open the CSV in Notepad, choose Save As, and select UTF-8 as the Encoding option.

SSIS error "UTF8" has no equivalant in encoding "WIN1252"

I'm using an SSIS package to extract data from a Postgres database, but I'm getting the following error for one of the tables.
Character with byte sequence 0xef 0xbf 0xbd in encoding "UTF8" has no equivalent in encoding "WIN1252"
I have no idea how to resolve it. I changed all the columns in the SQL table to NVARCHAR(MAX), but it was still no use. Please provide a solution.
The full Unicode character set (as encoded in UTF8) contains tens of thousands of different characters. WIN1252 contains 256. Your data contains characters that cannot be represented in WIN1252.
You either need to export to a more useful character encoding, remove the "awkward" characters from the source database or do some (lossy) translation with SSIS itself (I believe "character map translation" is what you want to search for).
I would recommend first, though, spending an hour or so googling around the subject of Unicode, its UTF encodings, and their relationship to the ISO and WIN character sets. That way you will understand which of the above to choose.
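As a side note, the byte sequence 0xEF 0xBF 0xBD is the UTF-8 encoding of U+FFFD, the Unicode replacement character, so the offending rows can usually be located on the Postgres side before the extract runs. A minimal sketch, with placeholder table and column names:
-- Placeholder names (orders/description); chr(65533) is U+FFFD in a UTF8 database.
SELECT id, description
FROM orders
WHERE description LIKE '%' || chr(65533) || '%';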

"¿" (inverted question mark) character in oracle

I found "¿" character (inverted question mark) in the database tables in place of single quote (') character.
Can any one let me know how i can avoid this character from the table.
There are many rows which contains the text with this character but not all single quotes are turning to this ¿ symbol.
I am not even able to filter the rows to update this character (¿) with single quote again.
When i user Like "%¿%" it filters me the text containing ordinary question mark (?)
In general there are two possibilities:
Your database tables really have ¿ characters, caused by wrong NLS_LANG settings when the data was inserted (or the database character set does not support the special character). In such a case the LIKE '%¿%' condition should work. However, this also means you have corrupt data in your database, and it is almost impossible to correct it, because ¿ stands for any wrong character.
Your client (e.g. SQL*Plus) is not able to display the special character caused by wrong NLS_LANG settings or the font does not support the special character.
Which client do you use (SQL*Plus, TOAD, SQL Developer, etc.)?
What is your NLS_LANG Environment variable, resp. your Registry key HKLM\SOFTWARE\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG or HKLM\SOFTWARE\Wow6432Node\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG?
What do you get when you select DUMP(... , 1016) from your table?
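For those checks, something along these lines may help (table and column names are placeholders). Filtering on CHR(191), which is ¿ in a WE8ISO8859P1 database character set, avoids pasting the literal ¿ through a client whose NLS conversion would silently turn it into a plain '?':
-- Placeholder names (my_table/my_col); DUMP(..., 1016) shows the character set
-- and the stored bytes in hex, so you can see what is really in the column.
SELECT my_col, DUMP(my_col, 1016)
FROM my_table
WHERE INSTR(my_col, CHR(191)) > 0;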
Was it really a simple quote (ASCII 39)? If they actually were some sort of "smart quotes", those do have mappings in windows-1252 but there is no ISO-8859 mapping for them, so if your database charset is ISO-8859-1 and you try to insert some windows-1252 text, Oracle tries to translate from windows-1252 to ISO-8859-1 and uses chr(191) to signal an unmappable character. chr(191) happens to be the inverted question mark.
You can check that by executing this (copy the select to preserve the smart quotes):
select dump('‘’“”') from dual
This behaviour is basically "correct", as what you are asking Oracle to do cannot be done, though it is not very intuitive.
See the windows-1252, ISO-8859-1 and ISO-8859-15 charsets for comparison. Note that Windows uses the range 128-159 (0x80-0x9F), which is not used in the ISO charsets.

Western European Characterset to Turkish in sql

I am having a serious issue with character encoding. To give some background:
I have turkish business users who enter some data on Unix screens in Turkish language.
My database NLS parameter is set to AMERICAN, WE8ISO8859P1 and Unix NLS_LANG to AMERICAN_AMERICA.WE8ISO8859P1.
The Turkish business users are able to see all the Turkish characters on the UNIX screens and in TOAD, while I'm not; I can only see them in the Western European character set.
At business end: ÖZER İNŞAAT TAAHHÜT VE
At our end : ÖZER ÝNÞAAT TAAHHÜT VE
As you can see, the Turkish characters İ and Ş are getting converted to the ISO 8859-1 character set. However, all the settings (NLS parameters in the DB and Unix) are the same at both ends: ISO 8859-1 (Western European).
With some study, I can understand that Turkish machines can display Turkish data by doing conversion in real time (the DB NLS settings are overridden by the local NLS settings).
Now, I have an interface running in my DB: some PL/SQL scripts (run through a shell script) that extract data from the database and spool it to a .csv file on a Unix path. Then that .csv file is transferred to an external system via MFT (Managed File Transfer).
The problem is that the extract never contains any Turkish characters. Every Turkish character is getting converted into the Western European character set and goes like this to the external system, which is treated as a case of data conversion/loss, and my business is really unhappy.
Could anyone tell me how I could retain all the Turkish characters?
P.S.: The external system's character set could be set to the ISO 8859-9 character set.
Many thanks in advance.
If you are saying that your database character set is ISO-8859-1, i.e.
SELECT parameter, value
FROM v$nls_parameters
WHERE parameter = 'NLS_CHARACTERSET'
returns a value of WE8ISO8859P1 and you are storing the data in CHAR, VARCHAR, or VARCHAR2 columns, the problem is that the database character set does not support the full set of Turkish characters. If a character is not in the ISO-8859-1 codepage layout, it cannot be stored properly in database columns governed by the database character set. If you want to store Turkish data in an ISO-8859-1 database, you could potentially use the workaround characters instead (i.e. substituting S for Ş). If you want to support the full range of Turkish characters, however, you would need to move to a character set that supported all those characters-- either ISO-8859-9 or UTF-8 would be relatively common.
Changing the character set of your existing database is a non-trivial undertaking, however. There is a chapter in the Globalization Support Guide for whatever version of Oracle you are using that covers character set migration. If you want to move to a Unicode character set (which is generally the preferred approach rather than sticking with one of the single-byte ISO character sets), you can potentially leverage the Oracle Database Migration Assistant for Unicode.
At this point, you'll commonly see the objection that at least some applications are seeing the data "correctly", so the database must support the Turkish characters. The problem is that if you set up your NLS_LANG incorrectly, it is possible to bypass character set conversion entirely, meaning that whatever binary representation a character has on the client gets persisted without modification to the database. As long as every process that reads the data configures its NLS_LANG identically and incorrectly, things may appear to work. However, you will very quickly find that some other application won't be able to configure its NLS_LANG to be identically incorrect. A Java application, for example, will always want to convert the data from the database into a Unicode string internally. So if you're storing the data incorrectly in the database, as it sounds like you are, there is no way to get those applications to read it correctly.
If you are simply using SQL*Plus in a shell script to generate the file, it is almost certainly possible to get your client configured incorrectly so that the data file appears to be correct. But it would be a very bad idea to let the existing misconfiguration persist. You open yourself up to much bigger problems in the future (if you're not already there): different clients inserting data in different character sets into the database, making it much more difficult to disentangle; tools like the Oracle export utility corrupting the data that is exported; or wanting to use a tool that can't be configured incorrectly to view the data. You're much better served getting the problem corrected early.
Just setting your NLS_LANG parameter to AMERICAN_AMERICA.WE8ISO8859P9 is enough for Turkish language.

Converting a Postgresql database from SQL_ASCII, containing mixed encoding types, to UTF-8

I have a postgresql database I would like to convert to UTF-8.
The problem is that it is currently SQL_ASCII, so hasn't been doing any kind of encoding conversion on its input, and as such has ended up with data of a mix of encoding types in the tables. One row might contain values encoded as UTF-8, another might be ISO-8859-x, or Windows-125x, etc.
This has made performing a dump of the database, and converting it to UTF-8 with the intention of importing it into a fresh UTF-8 database, difficult. If the data were all of one encoding type, I could just run the dump file through iconv, but I don't think that approach works here.
Is the problem fundamentally down to knowing how each piece of data is encoded? Here, where that is not known, can it be worked out, or even guessed? Ideally I'd love a script which would take a file, any file, and spit out valid UTF-8.
This is exactly the problem that Encoding::FixLatin was written to solve*.
If you install the Perl module then you'll also get the fix_latin command-line utility which you can use like this:
pg_restore -O dump_file | fix_latin | psql -d database
Read the 'Limitations' section of the documentation to understand how it works.
[*] Note I'm assuming that when you say ISO-8859-x you mean ISO-8859-1 and when you say CP125x you mean CP1252 - because the mix of ASCII, UTF-8, Latin-1 and WinLatin-1 is a common case. But if you really do have a mixture of eastern and western encodings then sorry but you're screwed :-(
It is impossible without some knowledge of the data first. Do you know if it is a text message or people's names or places? In some particular language?
You can try to encode a line of a dump and apply some heuristic — for example try an automatic spell checker and choose an encoding that generates the lowest number of errors or the most known words etc.
You can use for example aspell list -l en (en for English, pl for Polish, fr for French etc.) to get a list of misspelled words. Then you can choose encoding which generates the least of them. You'd need to install corresponding dictionary package, for example "aspell-en" in my Fedora 13 Linux system.
I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.
First, in my experience, the data tends to be:
99% ASCII
.9% UTF-8
.1% other, 75% of which is Windows-1252.
So let's use that. You'll want to analyze your own dataset, to see if it follows this pattern. (I am in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)
First, most every encoding out there today contains ASCII as a subset. UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets within the range [0, 0x7F] (ie, ASCII characters), then it's probably encoded in ASCII/UTF-8/ISO-8859-1/etc. If you're dealing with American English, this will probably take care of 99% of your data.
On to what's left.
UTF-8 has some nice properties, in that it will either be 1 byte ASCII characters, OR everything after the first byte will be 10xxxxxx in binary. So: attempt to run your remaining fields through a UTF-8 decoder (one that will choke if you give it garbage.) On the fields it doesn't choke on, my experience has been that they're probably valid UTF-8. (It is possible to get a false positive here: we could have a tricky ISO-8859-1 field that is also valid UTF-8.)
Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.
You might do this:
Attempt to decode as ASCII. If successful, assume ASCII.
Attempt to decode as UTF-8.
Attempt to decode as Windows-1252
For the UTF-8 and Windows-1252, output the table's PK and the "guess" decoded text to a text file (convert the Windows-1252 to UTF-8 before outputting). Have a human look over it, see if they see anything out of place. If there's not too much non-ASCII data (and like I said, ASCII tends to dominate, if you're in America...), then a human could look over the whole thing.
Also, if you have some idea about what your data looks like, you could restrict decodings to certain characters. For example, if a field decodes as valid UTF-8 text, but contains a "©", and the field is a person's name, then it was probably a false positive, and should be looked at more closely.
Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.
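If you want to automate the try-UTF-8-then-fall-back idea from the steps above inside Postgres itself, here is a minimal PL/pgSQL sketch (the function name and the Windows-1252-only fallback are my assumptions; it is meant to run in the target UTF-8 database against raw bytea values pulled from the old SQL_ASCII database, and anything that is neither UTF-8 nor Windows-1252 still needs the human review described above):
-- Sketch only: convert_from() raises an error on byte sequences that are not
-- valid in the given source encoding, which is what drives the fallback.
-- Pure ASCII input simply passes through the UTF-8 branch unchanged.
CREATE OR REPLACE FUNCTION guess_decode(raw_bytes bytea) RETURNS text AS $$
BEGIN
    RETURN convert_from(raw_bytes, 'UTF8');
EXCEPTION WHEN OTHERS THEN
    -- Not valid UTF-8: assume Windows-1252, which accepts almost any byte sequence.
    RETURN convert_from(raw_bytes, 'WIN1252');
END;
$$ LANGUAGE plpgsql;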
I resolved it using these commands:
1-) Export
pg_dump --username=postgres --encoding=ISO88591 database -f database.sql
and then
2-) Import
psql -U postgres -d database < database.sql
These commands helped me solve the problem of converting from SQL_ASCII to UTF-8.