UTF-8, Classic ASP and SQL Server

I'm having a weird problem that has me really confused.
I'm working on internationalization of a web app, implementing UTF-8 all over.
The app has a lot of legacy code in Classic ASP, which has been working fine so far.
Here is what confuses me.
From the admin side of the app, I'm saving this string to test special characters:
Á, É, Í, Ó, Ú, Ü, Ñ ± ' Z Ž
If I run SQL Server Profiler, I do not see the Ž character being inserted.
If I do a Response.Write of the query that runs the UPDATE, the character is there.
If I try to edit what was saved from the web front end, the character is there.
If I check the HTML source of the page I'm editing, the character is correctly encoded as HTML using Server.HTMLEncode.
If I run a SELECT query from SQL Server Management Studio, I do not see the character.
I have the HTML meta tag to set UTF-8:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
The files are saved with UTF-8 Encoding
How can it be that the character is not visible in SQL Server Profiler or in query results, but from the front end the character is there?
I'm using the N prefix to save Unicode in SQL Server, and the column is of type nvarchar(128).
Yet from another part of the system with the same setup, if I try the same thing, the character is visible when doing the insert.
Any ideas?
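One thing worth checking, sketched below: of the characters in the test string, only Ž (U+017D) falls outside ISO-8859-1, so any single hop in the pipeline that assumes Latin-1 would drop exactly that character. The Latin-1 assumption here is a guess about the pipeline, not a confirmed diagnosis:

```javascript
// Sketch: why 'Ž' specifically is fragile. Of the test characters,
// only 'Ž' (U+017D) lies outside ISO-8859-1 / the first 256 code
// points; Á É Í Ó Ú Ü Ñ all fit in Latin-1 and survive a Latin-1 hop.
const inLatin1 = ch => ch.codePointAt(0) < 0x100;
const results = ['Á', 'É', 'Ñ', 'Ž'].map(inLatin1);
// results is [true, true, true, false]: only 'Ž' falls outside Latin-1
```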

Related

How to not lose unicode characters in Stored Procedure parameter

Problem Statement:
We have a legacy application with a SQL Server backend. Until now, we did not face any issues passing non-Unicode values. Now we are getting Unicode characters from the user interface.
The Unicode characters are passed from the UI as shown below, and the data is inserted into a table.
Currently, we pass Unicode characters like this, and we are losing the non-English characters:
EXEC dbo.ProcedureName @IN_DELIM_VALS = '삼성~AX~Aland Islands~ALLTest1~Aland Islands~~~~'
What we tried:
If we pass the Unicode value with the N prefix, the non-English characters are inserted into the table properly.
EXEC dbo.ProcedureName @IN_DELIM_VALS = N'삼성~AX~Aland Islands~ALLTest1~Aland Islands~~~~'
But adding the N prefix requires a UI code change. As it is a legacy application, we want to avoid changing the UI; we want to handle this on the SQL Server side.
When I read about passing the parameter without the N prefix, I learned that the data is implicitly converted to the database's default code page and the Korean characters are lost. Reference:
Prefix a Unicode character string constants with the letter N to
signal UCS-2 or UTF-16 input, depending on whether an SC collation is
used or not. Without the N prefix, the string is converted to the
default code page of the database that may not recognize certain
characters. Starting with SQL Server 2019 (15.x), when a UTF-8 enabled
collation is used, the default code page is capable of storing UNICODE
UTF-8 character set.
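The lossy conversion the quoted documentation describes can be sketched as follows. This is a simplified model of SQL Server's behavior, not its actual code path, and the "representable" check is a crude stand-in for a real code-page lookup:

```javascript
// Simplified model (not SQL Server's actual code path) of what happens
// to a varchar literal without the N prefix: every character is forced
// through a single-byte code page, and anything the page cannot
// represent becomes '?'. The check below (code point < U+0100) is a
// crude stand-in for a real code-page lookup.
const representable = ch => ch.codePointAt(0) < 0x100;
const forceCodePage = str =>
  [...str].map(ch => (representable(ch) ? ch : '?')).join('');

const lossy = forceCodePage('삼성~AX~Aland Islands');
// lossy is '??~AX~Aland Islands': the Korean characters are gone
```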
Our Ask:
Is there a way to add the N prefix to the value before it is assigned to the stored procedure parameter, so that we do not lose the Unicode characters?
As it is not possible to add the N prefix after the parameter has been passed to SQL Server, we are going with the application code change below. Ideally, the application should pass the parameter with the right nvarchar datatype, which is what the N prefix signifies:
SqlCommand cmd = new SqlCommand("EXEC dbo.ProcedureName @IN_DELIM_VALS", myConnection);
cmd.Parameters.Add(new SqlParameter("@IN_DELIM_VALS", SqlDbType.NVarChar, 400)).Value
    = "삼성~AX~Aland Islands~ALLTest1~Aland Islands~~~~";
cmd.ExecuteNonQuery();

Garbage data while inserting data with special characters in SQL Server 2012 using Perl

I have an XML file with data in multiple languages (e.g. Russian, Japanese, Chinese, English). The XML is created on a Linux platform and it has passed xmllint checks.
Now I am reading the data from this XML file and inserting it into SQL Server 2012 on a Windows 7 platform (the XML is also present on Windows). But I am getting ???? as the value in some fields. This happens in some cases, such as when an entire sentence is in another language.
But if a sentence merely contains some special characters, it works fine.
I am using this function:
$row_value = decode("utf-8", $row_value);
Try this to decode the data:
use Encode;
require Encode::Detect;

my $utf8 = decode("Detect", $data);

Openfire: Offline UTF-8 encoded messages are saved wrong

We use Openfire 3.9.3. Its MySql database uses utf8_persian_ci collation and in openfire.xml we have:
...<defaultProvider>
<driver>com.mysql.jdbc.Driver</driver>
<serverURL>jdbc:mysql://localhost:3306/openfire?useUnicode=true&amp;characterEncoding=UTF-8</serverURL>
<mysql>
<useUnicode>true</useUnicode>
</mysql> ....
The problem is that offline messages which contain Persian characters (UTF-8 encoded) are saved as strings of question marks. For example, سلام (which means hello in Persian) is stored and shown as ????.
MySQL does not have proper Unicode support, which makes supporting data in non-Western languages difficult. However, the MySQL JDBC driver has a workaround which can be enabled by adding
?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8
to the URL of the JDBC driver. You can edit the conf/openfire.xml file to add this value.
Note: If the mechanism you use to configure a JDBC URL is XML-based, you will need to use the XML character entity &amp; to separate configuration parameters, as the bare ampersand is a reserved character in XML.
Also be sure that your database and tables use the utf8 encoding.
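A minimal illustration of that XML-escaping requirement, in plain JavaScript just to show the transformation (the URL is the one from the answer):

```javascript
// The JDBC URL separates parameters with '&', but inside an XML
// config file such as openfire.xml each '&' must be written as the
// entity '&amp;'.
const jdbcUrl = 'jdbc:mysql://localhost:3306/openfire'
  + '?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8';
const xmlSafe = jdbcUrl.replace(/&/g, '&amp;');
// xmlSafe contains '...useUnicode=true&amp;characterEncoding=UTF-8...'
```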

Where to trim/encode/fix characters from end to end ASP to SQL?

I have an ASP Classic app that allows people to copy and paste Word documents into a regular form field. I then post that document via jQuery Ajax to SQL Server, where the information is saved.
My problem is that the curly quotes and other Word characters turn into strange characters when they come back out.
I'm trying to filter them in my save routines (Classic ASP calling a stored procedure), but I still can't quite eliminate the problems.
The ASP pages have this header with the ISO-8859-1 charset. Characters look fine when pasted into the text input fields.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
My jQuery code builds the following JSON in the ASP Page:
var jsonToSend = { serial: serial, critiqueText: escape(critiqueText) };
The database collation is set to SQL_Latin1_General_CP1_CI_AS
I use TEXT and VARCHAR fields to hold the text (yes, I know the Text field type is not preferred, but it's what I have right now).
What must I do at each point to ensure that (1) the Word characters are stripped out, and (2) the encoding is consistent from front to back so I don't get any odd characters displaying?
Oh, and this is Classic ASP 3.0 running in 32-bit mode on Windows Server 2003 against SQL Server 2005.
A quick and dirty solution would be to use nvarchar and ntext in your backend database. The strange characters you mention are an encoding problem. For example:
İiıIÜĞ encoded as UTF-8 gives the bytes C4B069C4B149C39CC49E
the same bytes read as plain ANSI (Windows-1252) display as Ä°iÄ±IÃœÄž
Same hex values, interpreted through two different character sets.
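The mojibake above can be reproduced in a couple of lines, using the WHATWG TextEncoder/TextDecoder APIs (this assumes a runtime whose TextDecoder knows the 'windows-1252' label, as browsers and full-ICU Node builds do):

```javascript
// Encode the Turkish string as UTF-8, then (mis)read the same bytes
// as Windows-1252.
const bytes = new TextEncoder().encode('İiıIÜĞ');     // UTF-8 bytes
const hex = Array.from(bytes, b => b.toString(16).padStart(2, '0'))
  .join('').toUpperCase();                             // 'C4B069C4B149C39CC49E'
const mojibake = new TextDecoder('windows-1252').decode(bytes);
// mojibake is 'Ä°iÄ±IÃœÄž': same bytes, wrong decoder
```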
You use the ISO-8859-1 encoding in your web page. This means you are only able to save characters from the first 256 code points of Unicode. See this answer.
You use Latin1 in the database. These three character sets are approximately equal: Latin1-General, Windows-1252, and ISO/IEC 8859-1.
ISO/IEC 8859-1 is the basis for most popular 8-bit character sets, including Windows-1252 and the first block of characters in Unicode.
SQL_Latin1_General_CP1_CI_AS: Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode data; SQL Server sort order 52 on code page 1252 for non-Unicode data.
This means that whichever characters you enter into the database, the first 256 code-point values are safe, provided you know your clients' default encodings.
I suggest trying those default encodings to see if you can recover some information.
My example was Turkish: I know most clients there use Windows-1254, so I would reinterpret the values in that encoding and see if anything can be recovered.
The second part of the answer: you can safely change from varchar to nvarchar without loss of information, where "without loss" covers the first 256 code-point values. Your strange characters would remain strange, but the other characters stay intact.
This answer and the linked article give more information.
You should not use the JavaScript function escape; it uses a non-standard encoding that is a mix of standard URL encoding using ISO-8859-1 and a weird %uxxxx scheme for anything outside ISO-8859-1.
Additionally, you should not manually escape anything at all, since jQuery will apply proper escaping to your jsonToSend object anyway.
So when you do this:
var jsonToSend= { serial: serial, critiqueText: escape(critiqueText) } ;
$.post( "example.asp", jsonToSend );
And critiqueText is, say, “hello world”. First the escape will turn it into:
%u201Chello%20world%u201D
Then jQuery will apply standard URL encoding on that before sending and it will become:
%25u201Chello%2520world%25u201D
So simply change your jsonToSend to:
var jsonToSend = { serial: serial, critiqueText: critiqueText };
Which results in
%E2%80%9Chello%20world%E2%80%9D
I.e. standard URL encoding; you can even point your browser to http://en.wikipedia.org/wiki/%E2%80%9Chello%20world%E2%80%9D
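The three encodings discussed above can be compared directly in a browser or Node console:

```javascript
// Legacy escape() vs standard encodeURIComponent(), with the same
// curly-quoted string, plus the double encoding that happens when
// jQuery URL-encodes escape()'s output again.
const text = '\u201Chello world\u201D';      // “hello world”
const legacy = escape(text);                 // non-standard %uXXXX scheme
const standard = encodeURIComponent(text);   // proper UTF-8 %-encoding
const doubled = encodeURIComponent(legacy);  // what jQuery then sends
// legacy:   '%u201Chello%20world%u201D'
// standard: '%E2%80%9Chello%20world%E2%80%9D'
// doubled:  '%25u201Chello%2520world%25u201D'
```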
Note, it's likely that Classic ASP won't recognize standard URL encoding, so here's a function to apply Win1252 URL encoding:
var map = {
0x20AC: 128,
0x201A: 130,
0x0192: 131,
0x201E: 132,
0x2026: 133,
0x2020: 134,
0x2021: 135,
0x02C6: 136,
0x2030: 137,
0x0160: 138,
0x2039: 139,
0x0152: 140,
0x017D: 142,
0x2018: 145,
0x2019: 146,
0x201C: 147,
0x201D: 148,
0x2022: 149,
0x2013: 150,
0x2014: 151,
0x02DC: 152,
0x2122: 153,
0x0161: 154,
0x203A: 155,
0x0153: 156,
0x017E: 158,
0x0178: 159
};
function urlEncodeWin1252(str) {
    return escape(str.replace(/[\d\D]/g, function (m) {
        var cc = m.charCodeAt(0);
        if (cc in map) {
            return String.fromCharCode(map[cc]);
        }
        return m;
    }));
}
You still can't have jQuery double encoding the result from this, so pass it a plain string:
var jsonToSend = "serial=" + serial + "&critiqueText=" + urlEncodeWin1252(critiqueText);
Which will result in:
serial=123&critiqueText=%93hello%20world%94
You might want to rename that variable, there is no JSON anywhere.
I deal with importing crazy characters into SQL all day long, and nvarchar is the way to go. Unless the values are numbers or something of that sort, I set the columns to nvarchar(max) so I won't have to deal with it. The one exception to keep in mind: if you're going to use a column in a foreign key, set it to nvarchar(450), which keeps the key under SQL Server's 900-byte index key limit. This handles all kinds of crazy characters, spacing, and gaps in text resulting from tabs.

Replace character in SQL results

This is from an Oracle SQL query. It has these weird skinny rectangle shapes in the database in places where apostrophes should be. (I wish we could paste screenshots in here.)
It looks like this when I copy and paste the results:
spouse�s
is there a way to write a SQL SELECT statement that searches for this character in the field and replaces it with an apostrophe in the results?
Edit: I need to change only the results in a SELECT statement for reporting purposes, I can't change the Database.
I ran this
select dump('�') from dual;
which returned
Typ=96 Len=3: 239,191,189
This seems to work so far
select translate('What is your spouse�s first name?', '�', '''') from dual;
but this doesn't work
select translate(Fieldname, '�', '''') from TableName
Select FN from TN
What is your spouse�s first name?
SELECT DUMP(FN, 1016) from TN
Typ=1 Len=33 CharacterSet=US7ASCII: 57,68,61,74,20,69,73,20,79,6f,75,72,20,73,70,6f,75,73,65,92,73,20,66,69,72,73,74,20,6e,61,6d,65,3f
EDIT:
So I have established that it is the 0x92 character. I can't get the DB updated, so I'm trying this code:
SELECT REGEX_REPLACE(FN,"\0092","\0027") FROM TN
and I'm getting ORA-00904: "REGEX_REPLACE": invalid identifier
This seems to be a problem with your charset configuration. Check your NLS_LANG and other NLS_xxx environment/registry values. You have to check the Oracle server, your client, and the client of whoever inserted that data.
Try to DUMP the value. You can do it with a SELECT as simple as:
SELECT DUMP(the_column)
FROM xxx
WHERE xxx
UPDATE: Before you try to replace anything, look for the root of the problem. If this happens because of a charset issue, you can end up with big problems from bad data.
UPDATE 2: Answering the comments: the problem may not be on the database server side; it may be on the client side, a translation in the server-to-client communication caused by a bad server-client configuration. For instance, if the server is set to the UTF8 charset and your client uses US7ASCII, then all accented characters will appear as ?.
Another possibility: if the server is set to UTF8 and your client is also UTF8, but the application is not able to show UTF-8 characters, then the problem is on the application side.
UPDATE 3: On your examples:
select translate('What ... works because the � is exactly the same character: you pasted it on both sides of the call.
select translate(Fieldname ... does not work because the � is not what is stored in the database; it is the character your client receives, possibly because some translation occurs between the data table and what is shown to you.
Next step: look at the DUMP syntax and extract the codes for the mysterious character (from the table, not by pasting �!).
I would say there's a good chance the character is a single-tick "smart quote" (I hate the name). The smart quotes are characters 0x91-0x94 in the Windows encoding, or Unicode U+2018, U+2019, U+201C, and U+201D.
I'm going to propose a front-end application-based, client-side approach to the problem:
I suspect that this problem has more to do with a mismatch between the font you are using to display the word spouse�s and the character itself. That icon appears when you try to display a character whose code has no glyph in the Unicode font being used.
The Oracle database will dutifully return whatever characters were INSERTed into its columns. It is up to you and your application to interpret what the data will look like in the font you display it with, so I suggest investigating what this mysterious � character is that is replacing your apostrophes. Start by using FerranB's recommended DUMP().
Try running the following query to get the character code:
SELECT DUMP(<column with weird character>, 1016)
FROM <your table>
WHERE <column with weird character> like '%spouse%';
If that doesn't grab your actual text from the database, you'll need to modify the WHERE clause to actually grab the offending column.
Once you've found the code for the character, you could replace it using the REGEXP_REPLACE() built-in function (note the spelling: REGEXP_REPLACE, not REGEX_REPLACE, which is why you got ORA-00904), determining the raw hex code of the character and then supplying the plain ASCII apostrophe 0x27 ('), using code similar to this:
UPDATE <table>
SET <column with offending character>
    = REGEXP_REPLACE(<column with offending character>,
                     '<character code of �>',
                     '''')
WHERE REGEXP_LIKE(<column with offending character>, '<character code of �>');
If you aren't familiar with Unicode and different ways of character encoding, I recommend reading Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I wasn't until I read that article.
EDIT: If you're seeing 0x92, there's likely a charset mismatch here:
0x92 in CP-1252 (the default Windows code page) is a right single quotation mark, which looks like an apostrophe. This code isn't a valid ASCII character, and it isn't valid in ISO-8859-1 either. So probably either the database is in CP-1252 encoding (I don't find that likely), or a database connection that spoke CP-1252 inserted it, or somehow the apostrophe got converted to 0x92. The database is returning values that are valid in CP-1252 (or some other charset where 0x92 is valid), but your DB client connection isn't expecting CP-1252. Hence the weird question mark.
And FerranB is likely right. I would talk with your DBA or some other admin about this to get the issue straightened out. If you can't, I would try either doing the update above (it seems you can't), or doing this:
INSERT INTO <table> (<normal table columns>, ..., <column with offending character>)
SELECT <all normal columns>,
       REGEXP_REPLACE(<column with offending character>,
                      '\0092',
                      '\0027') -- for the ASCII/ISO-8859-1 apostrophe
FROM <table>
WHERE REGEXP_LIKE(<column with offending character>, '\0092');
DELETE FROM <table> WHERE REGEXP_LIKE(<column with offending character>, '\0092');
Before you do this, you need to understand what actually happened. It looks to me like someone inserted non-ASCII strings into the database, for example Unicode or UTF-8 data. Before you fix anything, be very sure that this is actually a bug; the apostrophe comes in many forms, not just the plain '.
TRANSLATE() is a useful function for replacing or eliminating known single-character codes.
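As a sketch of what such a single-character replacement amounts to once the data reaches a client (the \u0092 escape mirrors the 0x92 byte read as Latin-1; the Oracle-side equivalent would be something like TRANSLATE(col, CHR(146), CHR(39)), where CHR(146) is an assumption that only holds for a single-byte database character set):

```javascript
// Swap the 0x92 byte (seen as U+0092 when the column is read as
// Latin-1) for a plain ASCII apostrophe.
const fixQuote = s => s.replace(/\u0092/g, "'");
const fixed = fixQuote('What is your spouse\u0092s first name?');
// fixed is "What is your spouse's first name?"
```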