Inserting UTF-32 characters - sql

I'm testing UTF-32 characters (specifically emojis) with SQL Server (2008 R2, 10.5), and at this stage I'm checking whether the server supports a given code point.
For this case I'm using the :rose: emoji (U+1F339) with the following query:
SELECT '' + nchar(0x1F339) + 'test'
which comes back in Management Studio as (NULL).
How do I need to encode the character so that SQL Server does not return NULL?

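Before SQL Server 2012, the built-in functions treat NCHAR/NVARCHAR data as UCS-2, which is (almost) the same as UTF-16 restricted to the Basic Multilingual Plane: exactly 2 bytes per character. That is why NCHAR() only accepts values up to 0xFFFF and returns NULL above that.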

An idea, if I may: you can store the data in a BINARY or VARBINARY column, which doesn't care about encoding. You can then use a mapping table or an external script to parse the binary into a text field, replacing 0x1F339 with :rose: or your own custom format, for example.
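A minimal sketch of such an external script, in Python. It assumes the binary column holds UTF-32 little-endian bytes; the SHORTCODES table and the to_shortcodes name are made up for illustration and are not part of any library.

```python
# Hypothetical mapping from code points to text tokens; extend as needed.
SHORTCODES = {0x1F339: ":rose:"}

def to_shortcodes(raw, mapping=SHORTCODES):
    """Decode UTF-32 LE bytes and replace mapped code points with tokens."""
    text = raw.decode("utf-32-le")
    # Characters without an entry in the mapping are passed through unchanged.
    return "".join(mapping.get(ord(ch), ch) for ch in text)

# Bytes as they might come out of a VARBINARY column (UTF-32 LE is an assumption).
stored = "\U0001F339test".encode("utf-32-le")
converted = to_shortcodes(stored)
print(converted)
```

Any supplementary character that is missing from the mapping would still reach the text column as-is, so the table has to cover every emoji you expect.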

Since the code point is above 0xFFFF, it has to be written as two UTF-16 code units (a surrogate pair):
-- Returns: 🌹test
SELECT '' + nchar(0xD83C) + nchar(0xDF39) + 'test'
You can find this code under the "UTF-16 Hex (C Syntax)" title, following your link.
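If no lookup page is at hand, the two values can also be computed. The sketch below uses the standard UTF-16 surrogate arithmetic in Python; the function name surrogate_pair is only illustrative.

```python
def surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F339)
# Build the T-SQL expression from the two code units.
print(f"nchar(0x{high:04X}) + nchar(0x{low:04X})")
```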
I also have to recommend this article, because it was very helpful during the investigation: Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)
A couple of options for those who are looking for answers:
SQL Server technically does not have character escape sequences, but you can still create characters using either byte sequences or Code Points using the CHAR() and NCHAR() functions. We are only concerned with Unicode here, so we will only be using NCHAR().
All versions:
NCHAR(0 - 65535) for BMP Code Points (using an int/decimal value)
NCHAR(0x0 - 0xFFFF) for BMP Code Points (using a binary/hex value)
NCHAR(0 - 65535) + NCHAR(0 - 65535) for a Surrogate Pair / Two UTF-16 Code Units
NCHAR(0x0 - 0xFFFF) + NCHAR(0x0 - 0xFFFF) for a Surrogate Pair / Two UTF-16 Code Units
CONVERT(NVARCHAR(size), 0xHHHH) for one or more characters in UTF-16 Little Endian (“HHHH” is 1 or more sets of 4 hex digits)
Starting in SQL Server 2012:
If database’s default collation supports Supplementary Characters (collation name ends in _SC, or starting in SQL Server 2017 name contains 140 but does not end in _BIN*, or starting in SQL Server 2019 name ends in _UTF8 but does not contain _BIN2), then NCHAR() can be given Supplementary Character Code Points:
decimal value can go up to 1114111
hex value can go up to 0x10FFFF
Starting in SQL Server 2019:
“_UTF8” collations enable CHAR and VARCHAR data to use the UTF-8 encoding:
CONVERT(VARCHAR(size), 0xHH) for one or more characters in UTF-8 (“HH” is 1 or more sets of 2 hex digits)
NOTE: The CHAR() function does not work for this purpose. It can only produce a single byte, and UTF-8 is only a single byte for values 0 – 127 / 0x00 – 0x7F.
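For the two CONVERT() forms in the list above, the hex literal has to be produced somewhere. One way, sketched here in Python, is to encode the character and print its bytes; the size of 10 in the generated statements is an arbitrary choice for the example.

```python
rose = "\U0001F339"

# UTF-16 Little Endian bytes, for CONVERT(NVARCHAR(size), 0x...)
utf16_le_hex = "0x" + rose.encode("utf-16-le").hex().upper()
# UTF-8 bytes, for CONVERT(VARCHAR(size), 0x...) under a _UTF8 collation
utf8_hex = "0x" + rose.encode("utf-8").hex().upper()

print(f"SELECT CONVERT(NVARCHAR(10), {utf16_le_hex})")
print(f"SELECT CONVERT(VARCHAR(10), {utf8_hex})")
```

Note how the UTF-16 bytes are the surrogate pair 0xD83C 0xDF39 with each code unit byte-swapped, which is what "Little Endian" means in the list.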

Related

How to search in SQL Server for text that has special characters?

I have a SQL Server table with a column of type TEXT that stores candidate resumes in different formats. RTF is the most common one, but often we get resume data from a 3rd-party converter which stores the resume as special characters (maybe Unicode, or I don't know what they are).
How do I search my table to find all the rows that have these special characters? For example, the rows with id = 4, 6, 7, 9 etc. are all records with special characters.
What are these special characters called? Unicode?
Assuming that by "special" characters you mean anything outside the set of printable ASCII and certain common whitespace characters, you can try the following:
DECLARE @SpecialPattern VARCHAR(100) =
    '%[^'
    + CHAR(9) + CHAR(10) + CHAR(13)  -- tab, LF, CR
    + CHAR(32) + '-' + CHAR(126)     -- Range from space to last printable ASCII
    + ']%';
SELECT
    RESUME_TEXT,
    CAST(LEFT(CAST(RESUME_TEXT AS VARCHAR(MAX)), 20) AS VARBINARY(MAX)) -- Borrowed from userMT's comment
FROM RESUME
WHERE RESUME_TEXT LIKE @SpecialPattern COLLATE Latin1_General_Bin; -- Use exact compare
You may get some false hits against some perfectly valid extended characters such as accented vowels, curly quotes, or em and en dashes that may exist in the text.
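The same test can be tried outside the database on an exported value. This Python sketch uses a regular expression with the same character set as the LIKE pattern (tab, LF, CR, and space through tilde); the sample string is invented to show the false hits mentioned above.

```python
import re

# Anything that is not tab, LF, CR, or printable ASCII (space .. tilde).
SPECIAL = re.compile(r"[^\t\n\r\x20-\x7E]")

def find_special(text):
    """Return (position, code point in hex) for every 'special' character."""
    return [(m.start(), hex(ord(m.group()))) for m in SPECIAL.finditer(text)]

# Made-up sample: accented letters, curly quotes, and a trailing NUL.
sample = "R\u00e9sum\u00e9 \u201cdraft\u201d\x00"
hits = find_special(sample)
print(hits)
```

Here the accented letters and curly quotes are reported alongside the NUL byte, so the hit list still needs a human look before anything is cleaned up.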
My first thought is that the weird characters might be a UTF-8 BOM (hex EF, BB, BF), but the display doesn't seem to match how I would expect SQL Server to render them. The inverse dot isn't present at all in the default Windows code page (1252).
We need at least some hex data (at least the first few bytes) to help further. Often, common binary file types have a recognizable signature in the first 3-5 bytes.
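Once the first bytes are available, they can be compared against a few well-known signatures. The Python sketch below checks a handful of common ones; the list is deliberately short and is not a complete detector, and the sniff name is only illustrative.

```python
# A few widely documented leading byte sequences; far from exhaustive.
SIGNATURES = [
    (b"\xef\xbb\xbf", "UTF-8 BOM"),
    (b"{\\rtf", "RTF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (e.g. DOCX)"),
    (b"\xd0\xcf\x11\xe0", "OLE compound file (e.g. legacy DOC)"),
]

def sniff(head):
    """Return a label for the first matching signature, or 'unknown'."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"

print(sniff(b"{\\rtf1\\ansi"))
```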

sql studio can't see special characters in XML

For some reason Visual Studio does not show me special characters when I query an XML field. Maybe I stored them wrong? These are smart quotes.
Here's the query:
select CustomFields from TABLE where ID=422567 FOR XML PATH('')
When I copy/paste into Notepad++ I see this:
What are STS and CCH?
Strings are, as you surely know, just chains of numbers. What they mean and how they are interpreted depends on code pages, encodings, little or big endian ...
Have a look at this:
SELECT 'test' AS NormalText
--non-printable characters
--they are things like backspace, carriage return
,CHAR(0x6) AS ACK --DEC 6
,CHAR(0x7) AS BEL --DEC 7
,CHAR(0x1A) AS SUB --DEC 26
,CHAR(0x1B) AS ESC --DEC 27
--printable characters from 0x21 (DEC 33) up to 0x7E (DEC 126) - (almost) not depending on encoding
,CHAR(0x41) AS BigA --DEC 65
,CHAR(0x7E) AS Tilde --DEC 126
--extended - from 0x80 (DEC 128) - very much depending on encoding!
,CHAR(0x93) AS STS --DEC 147
,CHAR(0x94) AS CCH --DEC 148
,CHAR(0x93) + 'test' + CHAR(0x94) AS Mixed
FOR XML PATH('')
This will produce the following:
<NormalText>test</NormalText>
<ACK>&#x06;</ACK>
<BEL>&#x07;</BEL>
<SUB>&#x1A;</SUB>
<ESC>&#x1B;</ESC>
<BigA>A</BigA>
<Tilde>~</Tilde>
<STS>“</STS>
<CCH>”</CCH>
<Mixed>“test”</Mixed>
As you see, some characters must be encoded as entities because there is no printable representation for them, while others are displayed with their corresponding "picture".
With codes above DEC 127 you enter dangerous terrain. The same string can produce quite different output depending on where you read it.
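To make that concrete, the Python sketch below decodes the same three-part byte string under three different encodings. It is only an illustration outside SQL Server; which code page your database or editor actually uses is something you would have to check.

```python
import unicodedata

# The bytes behind CHAR(0x93) + 'test' + CHAR(0x94)
raw = bytes([0x93]) + b"test" + bytes([0x94])

as_cp1252 = raw.decode("cp1252")    # Windows Western European
as_latin1 = raw.decode("latin-1")   # byte value == Unicode code point
as_cp437 = raw.decode("cp437")      # original IBM PC / DOS code page

print(as_cp1252)                                  # smart quotes around 'test'
print(hex(ord(as_latin1[0])), unicodedata.category(as_latin1[0]))
print(as_cp437)
```

Under latin-1 the first character lands on code point 0x93, whose category "Cc" marks it as a control character, which is the kind of value an editor would show as a label rather than a glyph.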
The "STS" and "CCH" that Notepad++ shows you are taken from the C1 Controls and Latin-1 Supplement block: U+0093 is "Set Transmit State" and U+0094 is "Cancel Character".
This, together with the smart quotes mentioned in your example, points to an encoding mismatch: in Windows code page 1252 the bytes 0x93 and 0x94 are the opening and closing smart quotation marks, whereas the Unicode code points with the same numbers are control characters.
Finally, XML in SQL Server is always UTF-16. Have a look at feff0093 and feff0094. These are the characters UTF-16 binds to 0x93 and 0x94. My small example shows this clearly...
So the question is: why does your picture not show the “ and the ”?
I don't know... The select you put in the first line would not "produce" this XML; rather, it takes existing XML out of a column "CustomFields". I'm fairly sure that this is not a "real" XML column...

EncryptByPassPhrase returns special characters

I'm trying to encrypt with EncryptByPassPhrase in SQL Server 2012, but when I execute this function I get values like "öK{8+¨´¡¿". Can someone help me?
This is the code that I'm using:
IF (@MODE = 1)
BEGIN
    SET @RESUL = CONVERT(varchar(100), ENCRYPTBYPASSPHRASE('Prueba', '200000'))
    PRINT 'ENCRYPT' + (CAST(@RESUL AS varchar(20)))
END
Let's break it down. According to the documentation, the output of ENCRYPTBYPASSPHRASE() is varbinary. You're CONVERTing that to varchar. According to the documentation for CONVERT, if you don't provide a style, it "Translates ASCII characters to binary bytes or binary bytes to ASCII characters. Each character or byte is converted 1:1.". If you're looking for something more like 0x123abc, pass a style of 1 as the third argument to CONVERT to make it do that.
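The difference between the two conversions can be seen outside SQL Server as well. In this Python sketch the bytes are not real ciphertext: they are simply the code page 1252 bytes of the sample string from the question, chosen so that the 1:1 interpretation reproduces it.

```python
# Illustrative bytes only (assumption: code page 1252), not actual encrypted output.
sample = bytes.fromhex("F64B7B382BA8B4A1BF")

# 1:1 byte-to-character interpretation, like CONVERT without a style
as_text = sample.decode("cp1252")
# Hex representation, like CONVERT(varchar, value, 1)
as_hex = "0x" + sample.hex().upper()
# The hex form converts back to exactly the same bytes
round_trip = bytes.fromhex(as_hex[2:])

print(as_text)
print(as_hex)
```

The hex form is twice as long but survives copy and paste, whereas the text form depends on every tool along the way agreeing on the code page.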
All that said, unless you need a human to be able to transcribe the encrypted content (or otherwise interpret it), I'd leave it in its varbinary representation. Less room for error on the decryption side. Specifically:
DECLARE @resul VARBINARY(8000);
SET @resul = ENCRYPTBYPASSPHRASE('Prueba', '200000');
SELECT CAST(DECRYPTBYPASSPHRASE('Prueba', @resul) AS VARCHAR(50));

Removing hidden character at end of SQL server field

I have a strange situation displaying a value from SQL Server. There is a value stored in a SQL Server 2008 field which is hidden when queried from the server and shown in Management Studio (see below).
Test template 2​
But when displayed on screen in an HTML editor, it shows as ? (see below):
Test template 2?
When I check the ASCII value it shows 63. I'm not sure how the user got this special value into this field in SQL Server. When I test by entering ? into the input field and displaying it, it works fine without any issues.
I don't want to blindly remove the last character from this field. I am trying to find a way to identify this invisible value and remove it, either while storing or while displaying.
Any solution is greatly appreciated.
As the comments below suggest, this turned out to be Unicode 8203 (zero-width space).
My next question is how to replace this Unicode 8203 in one statement in T-SQL without parsing through each character.
Use REPLACE to remove the zero-width space character:
-- set up a Unicode string containing a zero-width space
DECLARE @UnicodeReplace NVARCHAR(5) = N'Test' + NCHAR(8203);
-- check that the Unicode string length is 5,
-- and prove the existence of a zero-width space character matching Unicode 8203
SELECT @UnicodeReplace AS String,
       LEN(@UnicodeReplace) AS Length,
       UNICODE(SUBSTRING(@UnicodeReplace, 5, 1)) AS UnicodeValue;
-- replace and prove the Unicode string length is reduced to 4
SELECT REPLACE(@UnicodeReplace, NCHAR(8203), N''),
       LEN(REPLACE(@UnicodeReplace, NCHAR(8203), N'')) AS Length;
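If the cleanup happens in application code instead (while storing or displaying), the same idea applies. This Python sketch also shows one plausible reason the ASCII check reported 63: a character with no equivalent in a single-byte code page is replaced by "?". Code page 1252 is an assumption here.

```python
ZWSP = "\u200b"  # zero-width space, decimal 8203

value = "Test template 2" + ZWSP

# Encoding to a code page that lacks the character substitutes '?' (byte 63).
shown = value.encode("cp1252", errors="replace")
# Removing it is a plain string replace, no per-character loop needed.
cleaned = value.replace(ZWSP, "")

print(shown[-1], len(value), len(cleaned))
```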
Such characters might not be replaced if the database uses a default collation such as SQL_Latin1_General_CP1_CI_AS. In such cases, forcing a binary collation could work (this example removes NCHAR(8205), the zero-width joiner; use NCHAR(8203) for the zero-width space):
set @word = replace(@word collate Latin1_General_100_BIN2, nchar(8205), N'')

Two characters with the same ASCII Code?

I'm trying to clean a recently imported SQL Server 2008 database that has too many invalid characters for my application, and I found different characters with the same ASCII code. Is that possible?
If I execute this query:
select ASCII('║'), ASCII('¦')
I get:
166 166
I need to do similar work, but with .NET code.
If I ask for these chars in .NET:
? ((int)'║').ToString() + ", " + ((int)'¦').ToString()
I get:
"9553, 166"
Can anybody explain what happens?
Instead of ASCII, use the UNICODE function.
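Neither ║ nor ¦ is an ASCII character. Without the N prefix, '║' is a non-Unicode literal, so it is first converted to the database's code page, where ║ does not exist and is apparently mapped to the similar-looking ¦. That is why ASCII returns 166 for both.
Additionally, you need to use Unicode strings when calling the UNICODE function, using the N prefix:
SELECT UNICODE(N'║'), UNICODE(N'¦')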
-- Results in: 9553, 166
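The same two values can be checked in Python, which, like .NET, works on Unicode code points directly. The sketch also tries to encode ║ in code page 1252 (assumed here as the database code page) to show that the character has no slot there; Python raises an error at that point instead of substituting a look-alike.

```python
box, bar = "\u2551", "\u00a6"   # the two characters from the question

code_points = (ord(box), ord(bar))

# The box-drawing character cannot be represented in code page 1252 at all.
try:
    box.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# The broken bar can, and its single byte is the value ASCII() reported.
bar_byte = bar.encode("cp1252")[0]

print(code_points, fits_cp1252, bar_byte)
```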