Where to trim/encode/fix characters end to end, from ASP to SQL? - sql-server-2005

I have an ASP Classic app that allows people to copy and paste Word documents into a regular form field. I then post that document via jQuery Ajax to SQL Server, where the information is saved.
My problem is that the curly quotes and other word characters turn into strange characters when they come back out.
I'm trying to filter them on my save routines (classic asp stored procedure), but I still can't quite eliminate the problems.
The ASP pages have this header with the ISO-8859-1 charset. Characters look fine when pasted into the text input fields.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
My jQuery code builds the following JSON in the ASP Page:
var jsonToSend = { serial: serial, critiqueText: escape(critiqueText) };
The database collation is set to SQL_Latin1_General_CP1_CI_AS
I use TEXT and VARCHAR fields to hold the text (yes, I know the Text field type is not preferred, but it's what I have right now).
What must I do at each point to ensure that (1) the Word characters are stripped out, and (2) the encoding is consistent from front to back so I don't get any odd characters displaying?
Oh, and this is ASP Classic 3.0 running in 32-bit mode on Windows Server 2003 against SQL Server 2005.

A quick and dirty solution would be to use nvarchar and ntext in your backend database. The strange characters you mention are an encoding problem. For example:
İiıIÜĞ is some Turkish text displayed with the right encoding.
Ä°iÄ±IÃœÄž is the same text displayed as plain ANSI (code page 1252).
Both are the same byte sequence, hex C4B069C4B149C39CC49E (the UTF-8 encoding of the Turkish string); only the interpretation differs.
You use ISO-8859-1 encoding in the web page. This means you can only represent the first 256 code points of Unicode. See this answer.
You use a Latin1 collation in the database. These three character sets are roughly equivalent: Latin1-General = Windows-1252 = ISO/IEC 8859-1.
ISO/IEC 8859-1 is the basis for most popular 8-bit character sets, including Windows-1252 and the first block of characters in Unicode.
SQL_Latin1_General_CP1_CI_AS: Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive,
width-insensitive for Unicode data, SQL Server sort order 52 on code page 1252 for non-Unicode data.
This means that whatever characters you entered into the database, the values in the first 256 code points are safe. If you know your clients' default encodings, I suggest trying those encodings to see whether you can recover some of the information. To stay with my Turkish example: I know most clients there use Windows-1254, so I would try reinterpreting the stored values in that encoding and see if anything comes back.
The second part of the answer is that you can safely change the columns from varchar to nvarchar without loss of information.
"Without loss" here means the values in the first 256 code points are preserved: your strange characters will still be strange, but everything else stays intact.
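A rough sketch of that change, assuming a hypothetical table Critiques with a varchar or text column CritiqueText (adjust the names to your schema):
-- Widen an existing VARCHAR/TEXT column to NVARCHAR(MAX).
-- Existing code-page-1252 data is preserved; new Unicode text is stored as-is.
ALTER TABLE Critiques ALTER COLUMN CritiqueText NVARCHAR(MAX);
-- Quick check that curly quotes survive once the column is Unicode:
SELECT CritiqueText
FROM Critiques
WHERE CritiqueText LIKE N'%' + NCHAR(8220) + N'%';  -- NCHAR(8220) = U+201C, left curly quote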
This answer and the linked article give more information.

You should not use the JavaScript function escape; it uses a non-standard encoding that is a mix of standard URL encoding using ISO-8859-1 and a weird %uxxxx scheme for anything not in ISO-8859-1.
Additionally, you should not manually escape anything at all, since jQuery will use proper escaping on your jsonToSend-object anyway.
So when you do this:
var jsonToSend = { serial: serial, critiqueText: escape(critiqueText) };
$.post( "example.asp", jsonToSend );
and critiqueText is, say, “hello world”, then escape will first turn it into:
%u201Chello%20world%u201D
Then jQuery will apply standard URL encoding on that before sending and it will become:
%25u201Chello%2520world%25u201D
So simply change your jsonToSend to:
var jsonToSend = { serial: serial, critiqueText: critiqueText };
Which results in
%E2%80%9Chello%20world%E2%80%9D
That is standard URL encoding; you can even point your browser to http://en.wikipedia.org/wiki/%E2%80%9Chello%20world%E2%80%9D
Note, it's likely that Classic ASP won't recognize standard URL encoding, so here's a function to apply Win1252 URL encoding:
var map = {
0x20AC: 128,
0x201A: 130,
0x0192: 131,
0x201E: 132,
0x2026: 133,
0x2020: 134,
0x2021: 135,
0x02C6: 136,
0x2030: 137,
0x0160: 138,
0x2039: 139,
0x0152: 140,
0x017D: 142,
0x2018: 145,
0x2019: 146,
0x201C: 147,
0x201D: 148,
0x2022: 149,
0x2013: 150,
0x2014: 151,
0x02DC: 152,
0x2122: 153,
0x0161: 154,
0x203A: 155,
0x0153: 156,
0x017E: 158,
0x0178: 159
};
function urlEncodeWin1252( str ) {
    return escape( str.replace( /[\d\D]/g, function(m) {
        var cc = m.charCodeAt(0);
        if( cc in map ) {
            return String.fromCharCode(map[cc]);
        }
        return m;
    }));
}
You still can't have jQuery double encoding the result from this, so pass it a plain string:
var jsonToSend= "serial=" + serial + "&critiqueText=" urlEncodeWin1252(critiqueText);
Which will result in:
serial=123&critiqueText=%93hello%20world%94
You might want to rename that variable, there is no JSON anywhere.

I deal with importing crazy characters into SQL all day long, and nvarchar is the way to go. Unless the values are numbers or something of that sort, I set the columns to nvarchar(max) so I won't have to deal with it. The one exception to keep in mind is that if you're going to use foreign keys, you'll have to set the column to nvarchar(450). This handles all kinds of crazy characters, spacing, and gaps in text caused by tabs.
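As a concrete sketch of that rule of thumb (the table and column names are made up): SQL Server limits index keys to 900 bytes, which is where nvarchar(450) comes from, while free-text columns can simply be nvarchar(max).
-- Sketch only; ImportedNames and its columns are illustrative.
CREATE TABLE ImportedNames (
    Name NVARCHAR(450) NOT NULL PRIMARY KEY,  -- keyable: 450 characters * 2 bytes = 900 bytes
    Body NVARCHAR(MAX) NULL                   -- free text; cannot be part of an index key
);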

Related

SQL: Storing Extended ASCII (128 to 255) in VARCHAR

How do you store chars 128 to 255 in VARCHAR..?
SQL Server seems to change some of these to char(63), '?'. I'm not sure whether it's something to do with collation, UTF-8, or N'..'. I've tried COLLATE Latin1_General_Bin, though I'm not sure it supports extended ASCII.
Obviously works with NVARCHAR, but in theory this should work in VARCHAR too..?
The characters stored in varchar/char columns beyond the ASCII 0-127 range are determined by the code page associated with the collation. Characters not specifically defined by the code page are either mapped to a similar character or, when there is none, to '?'.
You can list collations along with the associated code page with this query:
SELECT name, description, COLLATIONPROPERTY(name, 'CodePage') AS CodePage
FROM fn_helpcollations();
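For example, the mapping can be seen directly with a conversion like the following (just an illustration; the literals are arbitrary): characters that exist in the collation's code page survive the round trip to varchar, while the rest come back as '?'.
SELECT CONVERT(VARCHAR(20), N'“smart quotes” €Ž' COLLATE SQL_Latin1_General_CP1_CI_AS) AS in_code_page_1252,
       CONVERT(VARCHAR(20), N'Ωπ' COLLATE SQL_Latin1_General_CP1_CI_AS) AS not_in_code_page;
-- The first column keeps every character (they all exist in code page 1252); the second returns '??'.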
Dan's answer got me on the right track.
VARCHAR definitely does store Extended ASCII, but it depends on the code page associated with the collation. I'm using Latin1_General_100_BIN which uses code page 1252.
https://en.wikipedia.org/wiki/Windows-1252
According to this code page the following chars are undefined:
129, 141, 143, 144, 157
In reality it looks like SQL Server excludes most characters from 128 to 159. The easy solution was just to remove those characters.
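For the record, a sketch of that removal for an incoming value (the variable name and sample value are placeholders):
DECLARE @incomingText NVARCHAR(MAX);
SET @incomingText = N'some pasted text';  -- placeholder for the real value
-- Remove the C1 control characters whose byte positions (129, 141, 143, 144, 157)
-- code page 1252 leaves undefined.
SET @incomingText = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@incomingText,
    NCHAR(129), N''), NCHAR(141), N''), NCHAR(143), N''), NCHAR(144), N''), NCHAR(157), N'');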

UTF-8, Classic ASP and SQL Server

I'm having a weird problem that is getting me really really confused here.
I'm working on internationalization of a web app, and implementing UTF-8 all over.
The app has a lot of legacy code in Classic ASP, which is working fine so far.
What is getting me confused here is the following.
From the admin side of the APP I'm saving this string to test special characters:
Á, É, Í, Ó, Ú, Ü, Ñ ± ' Z Ž
If I run the SQL Server Profiler, I do not see the Ž character being inserted
If I do a Response.Write of the query that is running the UPDATE, the character is there
If I try to edit what was saved from the web front end, the character is there.
If I check the HTML Source code of the page I'm editing the character is correctly encoded as HTML using Server.HTMLEncode
If I run a select query from SQL Server Management Studio I do not see the character
I have the html meta tag to set UTF-8
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
The files are saved with UTF-8 Encoding
How can it be that the character is not visible in SQL Server Profiler or queries but if I do it from the front end the character is there?
I'm using the N prefix to save unicode in SQL Server and the column is of type nvarchar(128)
Then, from another part of the system with the same setup, if I try to do the same thing, the character is visible when doing the insert.
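One way to see what actually reached the table, independent of how Profiler or Management Studio render it, is to look at the raw bytes (a sketch; Translations, TranslatedText and the WHERE clause are placeholders):
SELECT TranslatedText,
       CONVERT(VARBINARY(256), TranslatedText) AS raw_utf16_bytes
FROM Translations
WHERE TranslationId = 42;
-- Ž is U+017D, so the byte pair 7D 01 should appear in the UTF-16LE output
-- if the character really was stored.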
Any ideas?

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with utf-8 text. The source code as well is stored in utf-8.
When I print a string containing non-English characters on the console, I get the right output, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info( getSomeStringFromModel(m) );        // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" );       // default enc: utf-8
w.setProperty("showXmlDeclaration", "true");  // optional
OutputStream out = new FileOutputStream(pathToFile);
w.write( m, out, "http://someurl.org/base/" );
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add the utf-8 encoding declaration, nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu gÃ©nÃ©ralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or rule that out by checking the length() of one of the offending strings: if each of the troublesome characters like é is counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
My hint/answer would be to inspect the byte sequence in 3 places:
The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 hex sequence 0xc3a9.
In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
The output file. Again, use a hex editor to confirm that the byte sequence is 0xc3a9.
This will tell you exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xc3a9.
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.

Is there a field in which PDF files specify their encoding?

I understand that it is impossible to determine the character encoding of any string-form data just by looking at the data. This is not my question.
My question is: Is there a field in a PDF file where, by convention, the encoding scheme is specified (e.g.: UTF-8)? This would be something roughly analogous to <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in HTML.
Thank you very much in advance,
Blz
A quick look at the PDF specification seems to suggest that you can have different encodings inside a PDF file. Have a look at page 86. So a PDF library with some kind of low-level access should be able to provide you with the encoding used for a string. But if you just want the text and don't care about the internal encodings used, I would suggest letting the library take care of conversions for you.
PDF uses "named" characters, in the sense that a character is a name and not a numeric code. Character "a" has name "a", character "2" has name "two" and the euro sign has name "euro", to give a few examples. PDF defines a few "standard" "base" encodings (named "WinAnsiEncoding", "MacRomanEncoding" and a few more, can't remember exactly), an encoding being a one-to-one correspondence between character names and byte values (yes, only 0 to 255). The exact, normative values for these predefined encodings are in the PDF specification. All these encodings use the ASCII values for the US-ASCII characters, but they differ in higher byte values.
A PDF file may define new encodings by taking a "base" encoding (say, WinAnsiEncoding) and redefining a few bytes, so a PDF author may, for example, define a new encoding named "MySuperbEncoding" as WinAnsiEncoding but with byte value 65 changed to mean character "ntilde" (this definition goes inside the PDF file), and then specify that some strings in the file use the encoding "MySuperbEncoding". In this case, a string containing byte values 65-66-67 would mean characters "ñBC" and not "ABC". And note that I mean characters, nothing to do with glyphs or fonts. Different strings within the PDF file may use different encodings (this provides a way of using more than 256 characters in the PDF file, even though every string is defined as a byte sequence, and one byte always corresponds to one character).
So, the answer to your question is: characters within a PDF file can well be encoded internally in an ad-hoc encoding made on the spot for that specific PDF file. PDF parsers should make the appropriate substitutions when necessary. I do not know PDFMiner but I'm surprised that it (being a PDF parser) gives incorrect values, as the specification is very clear on how this must be interpreted. It IS possible to get all the necessary information from the PDF file, but, as Mattias said, it might be a large project and I think a program named PDFMiner should do exactly this kind of job.

Replacing wrongly encoded letters with SQL

I have a database with data from the internet, but some pages have the wrong encoding, and letters like ã become Ã£ and ç becomes Ã§.
What are the possibilities to fix this? I'm using PostgreSQL.
I can use replace, but do I need to do a replace for each case? I was thinking about translate, but I see that it transforms only one char into another. Is it possible to translate two chars into one? Something like: TRANSLATE(text, 'Ã£|Ã§', 'ã|ç').
This particular problem looks like you have UTF-8 encoded text being interpreted as a single-byte character set ("ç" becoming "Ã§" suggests ISO-8859-1).
You can fix these up individually with a long chain of replace(...) calls. Or you can use postgresql's own character-conversion facilities:
select convert_from(convert_to('Â£20 - garÃ§on', 'iso-8859-1'), 'utf-8')
In order, this:
Converts the string back to binary using the iso-8859-1 codec (which will just change unicode codepoints back to bytes, assuming all the codepoints are under 256)
Reinterprets that binary output as UTF-8, so sequences such as {0xc2, 0xa3} are translated to '£'
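Applied to a whole column, that might look something like the following sketch (the table name pages, the column body and the LIKE filter are assumptions, not from the question):
-- Re-encode a mojibake column in place, assuming the bad text is UTF-8
-- that was decoded as iso-8859-1.
UPDATE pages
SET body = convert_from(convert_to(body, 'iso-8859-1'), 'utf-8')
WHERE body LIKE '%Ã%';  -- only touch rows that look like mojibake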
You can fix some of the characters by replacing them, but not all. By decoding the data using the wrong encoding you have already removed some information, and that is impossible to get back.
You should find out what the correct encoding is for those pages, and use that when decoding the data.
Some pages have the encoding in the response header, e.g.
Content-Type: text/html; charset=utf8
Some pages have the encoding in the HTML head, e.g.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
If the information is not in the header you would first have to decode the page (or at least a part of it) using the ASCII encoding (which is not a problem as the meta tag contains no special characters), find out the encoding, then decode the page using the correct encoding.
PostgreSQL has a string replacement function:
replace(string text, from text, to text): Replace all occurrences in string of substring from with substring to
Example:
replace ('abcdefabcdef', 'cd', 'XX') ==> abXXefabXXef
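For the two specific pairs in the question, the chained form would look like this (again, pages and body are placeholder names):
-- Fix the two mojibake sequences from the question with chained replaces.
UPDATE pages
SET body = replace(replace(body, 'Ã£', 'ã'), 'Ã§', 'ç');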