SQL: Storing Extended ASCII (128 to 255) in VARCHAR

How do you store characters 128 to 255 in VARCHAR?
SQL Server seems to change some of these to char(63), '?'. I'm not sure if it's something to do with collation? UTF-8? N'..'? I've tried COLLATE Latin1_General_Bin, though I'm not sure it supports extended ASCII.
It obviously works with NVARCHAR, but in theory this should work in VARCHAR too, right?

The characters stored in varchar/char columns beyond the ASCII 0-127 range are determined by the code page associated with the collation. Characters not specifically defined by the code page are either mapped to a similar character or, when there is none, to '?'.
You can list collations along with the associated code page with this query:
SELECT name, description, COLLATIONPROPERTY(name, 'CodePage') AS CodePage
FROM fn_helpcollations();
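For example, here is a minimal sketch (using the Latin1_General_100_BIN collation discussed below, whose code page is 1252) showing that a character the code page does not define is lost during conversion, while one it does define survives:
DECLARE @n NVARCHAR(2) = N'€Ā'; -- the euro sign is in code page 1252; A-macron (U+0100) is not
SELECT CAST(@n COLLATE Latin1_General_100_BIN AS VARCHAR(2)); -- returns '€?'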

Dan's answer got me on the right track.
VARCHAR definitely does store Extended ASCII, but it depends on the code page associated with the collation. I'm using Latin1_General_100_BIN which uses code page 1252.
https://en.wikipedia.org/wiki/Windows-1252
According to this code page, the following chars are undefined:
129, 141, 143, 144, 157
In reality, it looks like SQL Server excludes most characters from 128 to 159. The easy solution was just to remove those characters; a quick way to see the mapping for yourself is sketched below.
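A rough way to check this (my sketch; it assumes a database whose default collation uses a legacy code page, and relabels each byte with the 1252-based collation before converting back to Unicode):
;WITH n AS (
SELECT 128 AS i
UNION ALL
SELECT i + 1 FROM n WHERE i < 159
)
SELECT i AS ByteValue,
CHAR(i) COLLATE Latin1_General_100_BIN AS Char1252, -- the byte interpreted as code page 1252
UNICODE(CONVERT(NVARCHAR(1), CHAR(i) COLLATE Latin1_General_100_BIN)) AS MapsTo -- real code point
FROM n;
The bytes Wikipedia lists as undefined (129, 141, 143, 144, 157) come back as C1 control code points rather than printable characters.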

Related

How to search in SQL Server for text that has special characters?

I have a SQL Server table with a column of type TEXT that stores candidate resumes in different formats. RTF is the most common one, but often we get resume data from a 3rd party converter which stores the resume as special characters (maybe Unicode, or I don't know what they are).
How do I search my table to find all the rows that have these special characters? For example, the rows with id = 4, 6, 7, 9, etc. are all records with special characters.
What format are these special characters called? Unicode?
Assuming that by "special" characters you mean anything outside the set of printable ASCII and certain common whitespace characters, you can try the following:
DECLARE @SpecialPattern VARCHAR(100) =
'%[^'
+ CHAR(9) + CHAR(10) + CHAR(13) -- tab, LF, CR
+ CHAR(32) + '-' + CHAR(126) -- range from space to last printable ASCII
+ ']%'
SELECT
RESUME_TEXT,
CAST(LEFT(CAST(RESUME_TEXT AS VARCHAR(MAX)), 20) AS VARBINARY(MAX)) -- borrowed from userMT's comment
FROM RESUME
WHERE RESUME_TEXT LIKE @SpecialPattern COLLATE Latin1_General_Bin -- use exact compare
You may get some false hits against some perfectly valid extended characters, such as accented vowels, curly quotes, or em and en dashes, that may exist in the text.
My first thought was that the weird characters might be a UTF-8 BOM (hex EF, BB, BF), but the display didn't seem to match how I would expect SQL Server to render them. The inverse dot isn't present at all in the default Windows code page (1252).
We need at least some hex data (at least the first few bytes) to help further. Often, common binary file types have a recognizable signature in the first 3-5 bytes.
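If you want to test the BOM theory specifically, a hedged sketch (reusing the RESUME table and column from the question; the 0x pattern is the standard UTF-8 BOM signature):
SELECT RESUME_TEXT
FROM RESUME
WHERE CAST(LEFT(CAST(RESUME_TEXT AS VARCHAR(MAX)), 3) AS VARBINARY(3)) = 0xEFBBBF; -- UTF-8 BOM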

How to replace an accented letter in a VARCHAR2 column in Oracle

I have a VARCHAR2 column named NAME_USER. For example, the data is JUAN ROMÄN, but I want to show JUAN ROMAN, replacing Ä with A in my statement results. How can I do that? Thanks in advance.
Use the CONVERT function with the appropriate character set:
select CONVERT('JUAN ROMÄN', 'US7ASCII') from dual;
Below are the character sets which can be used in Oracle:
US7ASCII: US 7-bit ASCII character set
WE8DEC: West European 8-bit character set
WE8HP: HP West European Laserjet 8-bit character set
F7DEC: DEC French 7-bit character set
WE8EBCDIC500: IBM West European EBCDIC Code Page 500
WE8PC850: IBM PC Code Page 850
WE8ISO8859P1: ISO 8859-1 West European 8-bit character set
You could use replace, regexp_replace or translate, but they would each require you to map all possible accented characters to their unaccented versions.
Alternatively, there's a function called nlssort() which is typically used to override the default language settings used for the order by clause. It has an option for accent-insensitive sorting, which can be creatively misused to solve your problem. nlssort() returns a binary, so you have to convert back to varchar2 using utl_raw.cast_to_varchar2():
select utl_raw.cast_to_varchar2(nlssort(NAME_USER, 'nls_sort=binary_ai'))
from YOUR_TABLE;
Try this, for a list of accented characters from the extended ASCII set, together with their derived, unaccented values:
select level+192 ascii_code,
chr(level+192) accented,
utl_raw.cast_to_varchar2(nlssort(chr(level+192),'nls_sort=binary_ai')) unaccented
from dual
connect by level <= 63
order by 1;
Not really my answer - I've used this before and it seemed to work ok, but have to credit this post: https://community.oracle.com/thread/1117030
ETA: nlssort() can't do accent-insensitive without also doing case-insensitive, so this solution will always convert to lower case. Enclosing the expression above in upper() will of course get your example value back to "JUAN ROMAN". If your values can be mixed case, and you need to preserve the case of each character, and initcap() isn't flexible enough, then you'll need to write a bit of PL/SQL.
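For the example value, the upper() wrapping looks like this (a quick sketch):
select upper(utl_raw.cast_to_varchar2(nlssort('JUAN ROMÄN', 'nls_sort=binary_ai'))) as name_fixed
from dual;
-- NAME_FIXED: JUAN ROMAN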
select replace('JUAN ROMÄN','Ä','A')
from dual;
If you have more mappings to make, you could use TRANSLATE ...
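For instance, a minimal TRANSLATE sketch (the mapping lists are illustrative, not exhaustive; each character in the second argument is replaced by the character at the same position in the third):
select translate('JUAN ROMÄN', 'ÄÁÀÂÉÈÍÓÚÑ', 'AAAAEEIOUN')
from dual;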
You can use regular expressions:
SELECT regexp_replace('JUAN ROMÄNí','[[=A=]]+','A' )
FROM dual;
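Note that [[=A=]] only matches characters whose base letter is A, so the í in the example above is left alone; chaining calls covers more base letters (my sketch):
select regexp_replace(regexp_replace('JUAN ROMÄNí', '[[=A=]]', 'A'), '[[=I=]]', 'I')
from dual;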

Removing hidden character at end of SQL server field

I have a strange situation displaying a value from SQL Server. There is a value stored in a SQL Server 2008 field which is hidden when queried from the server and shown in Management Studio (see below).
Test template 2​
But when displayed on screen in an HTML editor it shows as ? (see below)
Test template 2?
When I check its ASCII value, it shows 63. I'm not sure how the user got this special value into this field in SQL Server. When I test by entering ? into the input field and displaying it, it works fine without any issues.
I don't want to blindly remove the last character from this field. I am trying to find a solution that identifies this invisible value and removes it, either while storing or displaying.
Any solution is greatly appreciated.
As the comments below suggest, this turned out to be Unicode 8203 (zero-width space).
My next question is how to replace this Unicode 8203 in one statement in T-SQL without parsing through each character?
Use REPLACE to remove the zero-width space character:
-- setup unicode string containing zero-width character
DECLARE @UnicodeReplace NVARCHAR(5) = N'Test' + NCHAR(8203);
-- check that unicode string length is 5,
-- and prove existence of zero-width space character matching unicode 8203
SELECT @UnicodeReplace AS String,
LEN(@UnicodeReplace) AS Length,
UNICODE(SUBSTRING(@UnicodeReplace, 5, 1)) AS UnicodeValue;
-- replace and prove the unicode string length is reduced to 4
SELECT REPLACE(@UnicodeReplace, NCHAR(8203), N''),
LEN(REPLACE(@UnicodeReplace, NCHAR(8203), N'')) AS Length;
Such characters cannot be replaced if the database collation is a default one like SQL_Latin1_General_CP1_CI_AS. In such cases, forcing a binary collation in the REPLACE works:
set @word = replace(@word collate Latin1_General_100_BIN2, nchar(8203), N'')
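To locate the affected rows before fixing them, a hedged sketch (the table and column names here are placeholders):
SELECT *
FROM YourTable
WHERE YourColumn COLLATE Latin1_General_100_BIN2 LIKE '%' + NCHAR(8203) + '%';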

Processing: How to convert a char datatype into its utf-8 int representation?

How can I convert a char datatype into its utf-8 int representation in Processing?
So if I had an array ['a', 'b', 'c'] I'd like to obtain another array [61, 62, 63].
After my answer I figured out a much easier and more direct way of converting to the type of numbers you wanted. What you want for 'a' is 61 instead of 97, and so forth. That is not very hard, seeing that 61 is the hexadecimal representation of the decimal 97. So all you need to do is feed your char into a specific method like so:
Integer.toHexString((int)'a');
If you have an array of chars like so:
char[] c = {'a', 'b', 'c', 'd'};
Then you can use the above thusly:
Integer.toHexString((int)c[0]);
and so on and so forth.
EDIT
As per v.k.'s example in the comments below, you can do the following in Processing:
char c = 'a';
hex(c);
The above will give you a hex representation of the character as a String.
// to save the hex representation as an int you need to parse it since hex() returns a String
int hexNum = PApplet.parseInt(hex(c));
// OR
int hexNum = int(c);
For the benefit of the OP and the commenter below: you will get 97 for 'a' even if you use my previous suggestion in the answer, because 97 is the decimal representation of hexadecimal 61. Seeing that UTF-8 matches the first 128 ASCII entries value for value, I don't see why one would expect anything different anyway. As for the UnsupportedEncodingException, a simple fix would be to wrap the statements in a try/catch block. However, that is not necessary, seeing that the above directly answers the question in a much simpler way.
What do you mean by "utf-8 int"? UTF-8 is a multi-byte encoding scheme for characters (technically, code points) represented as Unicode numbers. In your example you use trivial letters from the ASCII set, but that set has very little to do with a real Unicode/UTF-8 question.
For simple letters, you can literally just int cast:
print((int)'a') -> 97
print((int)'A') -> 65
But you can't do that with characters outside the 16 bit char range. print((int)'二') works (giving 20108, or 4E8C in hex), but print((int)'𠄢') will give a compile error because the character code for 𠄢 does not fit in 16 bits (it's supposed to be 131362, or 20122 in hex, which is encoded in UTF-8 as the four byte sequence 240+160+132+162).
So for Unicode characters with a code higher than 0xFFFF you can't use int casting, and you'll actually have to think hard about what you're decoding. If you want true Unicode code point values, you'll have to literally decode the underlying bytes, but the Processing IDE doesn't actually let you do that; it will tell you that "𠄢".length() is 1, when in real Java it's actually 2 (a surrogate pair). There is, in current Processing, no way to get the Unicode value for any character with a code higher than 0xFFFF.
update
Someone mentioned you actually wanted hex strings. If so, use the built-in hex function.
println(hex((int)'a')) -> 00000061
and if you only want 2, 4, or 6 characters, just use substring:
println(hex((int)'a').substring(4)) -> 0061

Unable to replace Char(63) by SQL query

I have some rows in a table with an unusual character. When I use ascii() or unicode() on that character, it returns 63. But when I try this -
update MyTable
set MyColumn = replace(MyColumn, char(63), '')
it does not replace; the unusual character still exists after the replace function. Char(63), incidentally, looks like a question mark.
For example, my string is 'ddd#dd ddd', where # is my unusual character, and
select unicode('#')
returns 63. But this code
declare @str nvarchar(10) = 'ddd#dd ddd'
declare @char nchar(1)
set @char = char(unicode('#'))
set @str = replace(@str, @char, '')
is working!
Any ideas how to resolve this?
Additional information:
select ascii('�') returns 63, and so does select ascii('?'). Finally select char(63) returns ? and not the diamond-question-mark.
When this character is pasted into Excel or a text editor, it looks like a space, but in an SQL Server Query window (and, apparently, here on StackOverflow as well), it looks like a diamond containing a question mark.
Not only does char(63) look like a '?', it is actually a '?'.
(As a simple test, ensure you have Num Lock on, hold down the Alt key and type '63' on the number pad - you can have all sorts of fun this way; try Alt-205, then Alt-206 and Alt-205 again: ═╬═)
It's possible that the '?' you are seeing isn't a char(63), however, and is more indicative of a character that SQL Server doesn't know how to display.
What do you get when you run:
select ascii(substring('[yourstring]',[pos],1));
--or
select unicode(substring('[yourstring]',[pos],1));
Where [yourstring] is your string and [pos] is the position of your char in the string
EDIT
From your comment it seems like it is a question mark. Have you tried:
replace(MyColumn,'?','')
EDIT2
Out of interest, what does the following do for you:
replace(replace(MyColumn,char(146),''),char(63),'')
char(63) is a question mark. It sounds like these "unusual" characters are displayed as a question mark, but are not actually characters with char code 63.
If this is the case, then removing occurrences of char(63) (aka '?') will of course have no effect on these "unusual" characters.
I believe you actually didn't have issues with literally CHAR(63), because that should be just a normal character and you should be able to work with it properly.
What I think happened is that, by mistake, a Unicode character (for example, a Cyrillic "А") was inserted into the table - and either your:
column setup,
the SQL code,
or the passed-in parameters
were not prepared for that.
In this case, the character might be visible to you as ?, and ASCII() would actually give 63, but you should really use UNICODE() to figure out its real code.
Let me give a specific example that I have hit multiple times - issues
with that Cyrillic "А", which looks identical to the Latin one, but has
a Unicode value of 1040.
If you try to use the non-Unicode ASCII() function on that 1040 character,
you would get code 63, which is not true (and is probably just
info about the first byte of a multibyte character).
Actually, run this to make the differences in my example obvious:
SELECT NCHAR(65) AS Latin_A, NCHAR(1040) Cyrilic_A, ASCII(NCHAR(1040)) Latin_A_Code, UNICODE(NCHAR(1040)) Cyrilic_A_Code;
That "empty" string which shows as '?' in SUBSTRING and gives an ASCII value of 63 can be a zero-width space, which gets appended when you copy data from a UI and insert it into the database.
To fix the data, you can use the query below:
set MyColumn = replace(MyColumn, NCHAR(8203), N'')
It's an older question, but I've run into this problem as well. I found the solution somewhere else on the internet, but I thought it would be good to share it here too. Have a good day.
Replace(YourString, nchar(65533) COLLATE Latin1_General_BIN2, '')
This should work as well:
UPDATE TABLE
SET [FieldName] = SUBSTRING([FieldName], 2, LEN([FieldName]))
WHERE ASCII([FieldName]) = 63