How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table? - sql

I have a table with a couple thousand rows. The description and summary fields are NTEXT, and sometimes have non-ASCII characters in them. How can I locate all of the rows with non-ASCII characters?

I have sometimes used this "cast" statement to find "strange" chars:
select
*
from
<Table>
where
<Field> != cast(<Field> as varchar(1000))

First build a string with all the characters you're not interested in (the example uses the 0x20 - 0x7F range, or 7 bits without the control characters.) Each character is prefixed with |, for use in the escape clause later.
-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)
-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
begin
-- Uses | to escape, could be any character
set @str = @str + '|' + char(@i)
set @i = @i + 1
end
The next snippet searches for any character that is not in the list. The % matches 0 or more characters. The [] matches one of the characters inside the [], for example [abc] would match either a, b or c. The ^ negates the list, for example [^abc] would match anything that's not a, b, or c.
select *
from yourtable
where yourfield like '%[^' + @str + ']%' escape '|'
The escape character is required because otherwise searching for characters like ], % or _ would mess up the LIKE expression.
Hope this is useful, and thanks to JohnFX's comment on the other answer.

Here ya go:
SELECT *
FROM Objects
WHERE
ObjectKey LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?@\[\^_`{|}~\]\\]%' ESCAPE '\'

It's probably not the best solution, but maybe a query like:
SELECT *
FROM yourTable
WHERE yourTable.yourColumn LIKE '%[^0-9a-zA-Z]%'
Replace the "0-9a-zA-Z" expression with something that captures the full ASCII set (or a subset that your data contains).
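One way to capture the printable ASCII set without listing every character is a code point range under a binary collation. This is only a minimal sketch under assumptions: the table variable @Sample and its columns are hypothetical, and Latin1_General_BIN is just one binary collation that makes the range from space to tilde compare by code point. For an NTEXT column you would probably need to CAST it to nvarchar(max) first.

```sql
-- Hypothetical sample data; @Sample, Id and Txt are illustrative names.
DECLARE @Sample TABLE (Id int, Txt nvarchar(100));
INSERT INTO @Sample (Id, Txt) SELECT 1, N'plain ASCII text';
INSERT INTO @Sample (Id, Txt) SELECT 2, N'caf' + NCHAR(233);               -- e with acute accent
INSERT INTO @Sample (Id, Txt) SELECT 3, N'wide' + NCHAR(65292) + N'comma'; -- fullwidth comma

-- Under a binary collation the range [ -~] covers code points 32 to 126,
-- so [^ -~] matches any character outside printable ASCII.
-- Tab, line feed and carriage return are added to the allowed set.
SELECT Id, Txt
FROM @Sample
WHERE Txt COLLATE Latin1_General_BIN
      LIKE N'%[^ -~' + NCHAR(9) + NCHAR(10) + NCHAR(13) + N']%';
```

With this sample data the query should return rows 2 and 3 and skip row 1.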

Technically, I believe that an NCHAR(1) is a valid ASCII character if and only if UNICODE(@NChar) < 256 and ASCII(@NChar) = UNICODE(@NChar), though that may not be exactly what you intended. Therefore this would be a correct solution:
;With cteNumbers as
(
Select ROW_NUMBER() Over(Order By c1.object_id) as N
From sys.system_columns c1, sys.system_columns c2
)
Select Distinct RowID
From YourTable t
Join cteNumbers n ON n.N <= Len(CAST(TXT As NVarchar(MAX)))
Where UNICODE(Substring(TXT, n.N, 1)) > 255
OR UNICODE(Substring(TXT, n.N, 1)) <> ASCII(Substring(TXT, n.N, 1))
This should also be very fast.

I started with @CC1960's solution but found an interesting use case that caused it to fail. It seems that SQL Server will equate certain Unicode characters to their non-Unicode approximations. For example, SQL Server considers the Unicode character "fullwidth comma" (http://www.fileformat.info/info/unicode/char/ff0c/index.htm) the same as a standard ASCII comma when compared in a WHERE clause.
To get around this, have SQL Server compare the strings as binary. But remember, nvarchar and varchar binaries don't match up (16-bit vs 8-bit), so you need to convert your varchar back up to nvarchar again before doing the binary comparison:
select *
from my_table
where CONVERT(binary(5000),my_table.my_column) != CONVERT(binary(5000),CONVERT(nvarchar(1000),CONVERT(varchar(1000),my_table.my_column)))

If you are looking for a specific unicode character, you might use something like below.
select Fieldname from
(
select Fieldname,
REPLACE(Fieldname COLLATE Latin1_General_BIN,
NCHAR(65533) COLLATE Latin1_General_BIN,
'CustomText123') replacedcol
from table
) results where results.replacedcol like '%CustomText123%'

My previous answer was confusing UNICODE/non-UNICODE data. Here is a solution that should work for all situations, although I'm still running into some anomalies. It seems like certain non-ASCII unicode characters for superscript characters are being confused with the actual number character. You might be able to play around with collations to get around that.
Hopefully you already have a numbers table in your database (they can be very useful), but just in case I've included the code to partially fill that as well.
You also might need to play around with the numeric range, since unicode characters can go beyond 255.
CREATE TABLE dbo.Numbers
(
number INT NOT NULL,
CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (number)
)
GO
DECLARE @i INT
SET @i = 0
WHILE @i < 1000
BEGIN
INSERT INTO dbo.Numbers (number) VALUES (@i)
SET @i = @i + 1
END
GO
SELECT *,
T.ID, N.number, N'%' + NCHAR(N.number) + N'%'
FROM
dbo.Numbers N
INNER JOIN dbo.My_Table T ON
T.description LIKE N'%' + NCHAR(N.number) + N'%' OR
T.summary LIKE N'%' + NCHAR(N.number) + N'%'
and t.id = 1
WHERE
N.number BETWEEN 127 AND 255
ORDER BY
T.id, N.number
GO

-- This is a very, very inefficient way of doing it but should be OK for
-- small tables. It uses an auxiliary table of numbers as per Itzik Ben-Gan and simply
-- looks for characters with bit 7 set.
SELECT *
FROM yourTable as t
WHERE EXISTS ( SELECT *
FROM msdb..Nums as NaturalNumbers
WHERE NaturalNumbers.n <= LEN(t.string_column) -- <= so the last character is checked too
AND ASCII(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)
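One caveat with the ASCII() test on Unicode data: ASCII() works on the varchar conversion of the character, so a character that has no equivalent in the code page can come back as 63 ('?') and slip under the > 127 check. A hedged variant of the same idea using UNICODE() is sketched below. The table variable @Sample and the inline Nums CTE built from sys.all_objects are assumptions for illustration, standing in for a real table and a real numbers table.

```sql
-- Hypothetical sample data standing in for the real table.
DECLARE @Sample TABLE (Id int, Txt nvarchar(100));
INSERT INTO @Sample (Id, Txt) SELECT 1, N'plain text';
INSERT INTO @Sample (Id, Txt) SELECT 2, N'caf' + NCHAR(233);        -- e with acute accent
INSERT INTO @Sample (Id, Txt) SELECT 3, N'omega ' + NCHAR(937);     -- Greek capital omega

-- Inline numbers list; assumes sys.all_objects has at least as many
-- rows as the longest string has characters.
WITH Nums AS (
    SELECT ROW_NUMBER() OVER (ORDER BY object_id) AS n
    FROM sys.all_objects
)
SELECT s.Id, s.Txt
FROM @Sample AS s
WHERE EXISTS (
    SELECT *
    FROM Nums
    WHERE Nums.n <= LEN(s.Txt)
      -- UNICODE() returns the code point itself, with no varchar conversion
      AND UNICODE(SUBSTRING(s.Txt, Nums.n, 1)) > 127
);
```

With this sample data rows 2 and 3 should be returned. It is still a character-by-character scan, so the same warning about large tables applies.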

Related

What's the Differences between These Two Strings?

I'm baffled that I cannot find the difference between these two sets of strings, which look exactly the same to me. I checked for white space in the strings, but no luck. When running the queries below in SQL Server Management Studio, only one of each pair returns results. Please help, thank you.
--return row
SELECT * FROM Vendors WHERE VendorCode = 'SRP 85072B'
--does not return row
SELECT * FROM Vendors WHERE VendorCode = 'SRP  85072B'
--return rows
SELECT * FROM Vendors WHERE VendorCode IN (
'ATT 60197S',
'GMI 98661A')
--does NOT RETURN rows
SELECT * FROM Vendors WHERE VendorCode IN (
'ATT  60197S',
'GMI  98661A')
One of the strings has two consecutive regular spaces, the other has a non breaking space (character 160 decimal 0xA0 hex) followed by a regular space (character 32 decimal 0x20 hex).
You can see this from copying and pasting the strings from here as I have done here https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=0dc6ccc48439c3dc27a227aa2dffb4d2
0x4154542020363031393753
0x415454A020363031393753
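The byte dumps above can be reproduced with a cast to varbinary. This is a small sketch that builds the two strings explicitly with CHAR() so the invisible character cannot get lost in copy and paste; it assumes a single-byte code page such as 1252, where CHAR(160) is the non-breaking space. The variable names are illustrative only.

```sql
DECLARE @plain varchar(20), @nbsp varchar(20);
SET @plain = 'ATT' + CHAR(32)  + CHAR(32) + '60197S';  -- two regular spaces
SET @nbsp  = 'ATT' + CHAR(160) + CHAR(32) + '60197S';  -- non-breaking space, then a space

-- Casting to varbinary shows the underlying bytes of each string.
SELECT CAST(@plain AS varbinary(20)) AS plain_bytes,
       CAST(@nbsp  AS varbinary(20)) AS nbsp_bytes;

-- Normalizing the non-breaking space to a regular space makes them compare equal.
SELECT CASE WHEN REPLACE(@nbsp, CHAR(160), CHAR(32)) = @plain
            THEN 'Same after normalizing'
            ELSE 'Still different'
       END AS comparison;
```

The same REPLACE(VendorCode, CHAR(160), CHAR(32)) expression could be used in an UPDATE to clean the stored values, if that is what you want.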
It could be that your string or the column has special character CR or CRLF, try removing those characters in the comparison:
SELECT *
FROM Vendors
WHERE replace(replace(ltrim(rtrim(VendorCode)), char(13), ''), char(10), '') = replace(replace(ltrim(rtrim('SRP 85072B')), char(13), ''), char(10), '')
This code uses a tally table to split apart the two strings, convert each character to its ascii value, and return the differences.
declare
@string1 nvarchar(200)='zdq%E^&$DGE%^#((',
@string2 nvarchar(200)='zdq%E^&$DGx%^#((';
select ascii(SUBSTRING(@string1, t.n, 1)), t.n from dbo.fnTally(1, len(@string1)) t
except
select ascii(SUBSTRING(@string2, t.n, 1)), t.n from dbo.fnTally(1, len(@string2)) t;
They seem similar if you run the following sql query
DECLARE @first nvarchar(max) = 'SELECT * FROM Vendors WHERE VendorCode = ''SRP 85072B''',
@second nvarchar(max) = 'SELECT * FROM Vendors WHERE VendorCode = ''SRP 85072B'''
IF (@first = @second)
SELECT 'Similar';
ELSE
SELECT 'Not Similar';
If the parameters are coming from the same source you may check using the following query
DECLARE @param nvarchar(max) = 'SRP 85072B'
SELECT * FROM Vendors WHERE VendorCode like '%'+ @param+'%'

Update or replace a special character text in my database field

I am facing an issue: I cannot update or replace some characters in my database.
Here is how this text looks in my column when I retrieve it:
As you can see, there are unknown characters between 'master' and 'degree' which I cannot even paste here.
I also tried to update and replace it with below code (I cannot paste that two vertical lines here since they are not supported in any browser and I am not sure what they are, Please see the picture above to see what is in my SQL statement).
begin transaction
update gm_desc set projdesc=replace(projdesc,'%â s%','') where projdesc like '%âs%' and proposalno = '15-149-01'
You can see the real SQL Statement here:
I tried to update or replace it but I cannot do it. The update statement runs successfully but I still see those weird special characters. I would appreciate any help.
Here's a scalar-valued function which removes all non-alphanumeric characters (preserves spaces) from a string.
Hopefully it helps!
dbfiddle
create function dbo.get_alphanumeric_str
(
@string varchar(max)
)
returns varchar(max)
as
begin
declare @ret varchar(max);
with nums as (
select 1 as n
union all select n+1 from nums
where n < 256
)
select @ret = replace(stuff(
(
select '' + substring(@string, nums.n, 1)
from nums
where patindex('%[^0-9A-Za-z ]%', substring(@string, nums.n,1)) = 0
for xml path('')
), 1, 0, ''
), '&#x20;', ' ')
option (MAXRECURSION 256)
return @ret;
end
Usage
select dbo.get_alphanumeric_str('Helloᶄ âWorld 1234⅊⅐')
Returns: Hello World 1234
How it works
The nums CTE is just to get a list of numbers (you can set the maximum of 256 to a higher value if your strings are longer; n.b. option (MAXRECURSION n) is for this CTE but has to be placed at the query)
The stuff essentially iterates through the string, using the list of numbers above, and extracts a substring of length 1; each of these characters is checked against the [^0-9A-Za-z ] pattern (a negated set: anything other than the digits 0-9, the letters A-Za-z in both cases, and a single space character)
If the character is in the allowed set, the negated pattern finds nothing and patindex() returns 0, so the character is kept.
Use replace(string, '&#x20;', ' ') for the space character as the xml path returns a special encoding, see this question.
Use a binary collation for accented characters; see this answer
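If you would rather remove only the offending character instead of stripping everything non-alphanumeric, one option is to find its code point first and then replace it by code. Note that REPLACE takes a literal search string, so the % signs in the original update are treated as ordinary characters, not wildcards. The sketch below is an illustration only: NCHAR(8214) is a stand-in for the unknown character, and the inline Nums CTE built from sys.all_objects is an assumption, not part of the original question.

```sql
DECLARE @s nvarchar(100);
SET @s = N'master' + NCHAR(8214) + N'degree';  -- NCHAR(8214) is a hypothetical stand-in

-- Step 1: list the position and code point of every non-ASCII character.
WITH Nums AS (
    SELECT ROW_NUMBER() OVER (ORDER BY object_id) AS n
    FROM sys.all_objects
)
SELECT n AS position,
       UNICODE(SUBSTRING(@s, n, 1)) AS code_point
FROM Nums
WHERE n <= LEN(@s)
  AND UNICODE(SUBSTRING(@s, n, 1)) > 127;

-- Step 2: replace that code point. The binary collation keeps SQL Server
-- from treating the character as equal to some other character.
SELECT REPLACE(@s COLLATE Latin1_General_BIN, NCHAR(8214), N' ') AS cleaned;
```

Once the real code point is known from step 1, the same REPLACE expression can go into the UPDATE statement.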

select and concatenate everything before and after a certain character

I've got a string like AAAA.BBB.CCCC.DDDD.01.A and I'm looking to manipulate this and end up with AAAA-BBB
I've achieved this by writing this debatable piece of code
declare @string varchar(100) = 'AAAA.BBB.CCCC.DDDD.01.A'
select replace(substring(@string,0,charindex('.',@string)) + substring(@string,charindex('.',@string,CHARINDEX('.',@string)),charindex('.',@string,CHARINDEX('.',@string)+1)-charindex('.',@string)),'.','-')
Is there any other way to achieve this which is more elegant and readable ?
I was looking at some string_split operations, but can't wrap my head around it.
If you are open to some JSON transformations, the following approach is an option. You need to transform the text into a valid JSON array (AAAA.BBB.CCCC.DDDD.01.A is transformed into ["AAAA","BBB","CCCC","DDDD","01","A"]) and get the required items from this array using JSON_VALUE():
Statement:
DECLARE @string varchar(100) = 'AAAA.BBB.CCCC.DDDD.01.A'
SET @string = CONCAT('["', REPLACE(@string, '.', '","'), '"]')
SELECT CONCAT(JSON_VALUE(@string, '$[0]'), '-', JSON_VALUE(@string, '$[1]'))
Result:
AAAA-BBB
Notes: With this approach you can easily access all parts from the input string by index (0-based).
I think this is a little cleaner:
declare @string varchar(100) = 'AAAA.BBB.CCCC.DDDD.01.A'
select
replace( -- replace '.' with '-' (A)
substring(@string, 1 -- in the substring of @string starting at 1
,charindex('.', @string -- and going through 1 before the index of '.'(B)
,charindex('.',@string)+1) -- that is after the first index of the first '.'
-1) -- (B)
,'.','-') -- (A)
Depending on what is in your string you might be able to abuse PARSENAME into doing it. Intended for breaking up names like adventureworks.dbo.mytable.mycolumn it works like this:
DECLARE @x as VARCHAR(100) = 'aaaa.bbb.cccc.ddddd'
SELECT CONCAT( PARSENAME(@x,4), '-', PARSENAME(@x,3) )
You could also look at a mix of STUFF to delete the first '.' and replace with '-' then LEFT the result by the index of the next '.' but it's unlikely to be neater than this or Kevin's proposal
Using string split would likely be as unwieldy:
SELECT CONCAT(MAX(CASE WHEN rn = 1 THEN v END), '-', MAX(CASE WHEN rn = 2 THEN v END))
FROM (
SELECT row_number () over (order by (select 0)) rn, value as v
FROM string_split(@x,'.')
) y WHERE rn IN (1,2)
Because the string is split to rows which then need to be numbered in order to filter and pull the parts you want. This also relies on the strings coming out of string split in the order they were in the original string, which MS do not guarantee will be the case
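For comparison, the STUFF-then-LEFT idea mentioned earlier in this answer could look roughly like the sketch below. It assumes the string always contains at least two '.' characters; the variable names @first and @second are just illustrative.

```sql
DECLARE @string varchar(100);
SET @string = 'AAAA.BBB.CCCC.DDDD.01.A';

DECLARE @first int, @second int;
SET @first  = CHARINDEX('.', @string);               -- position of the first '.'
SET @second = CHARINDEX('.', @string, @first + 1);   -- position of the second '.'

-- Swap the first '.' for '-', then keep everything before the second '.'.
SELECT LEFT(STUFF(@string, @first, 1, '-'), @second - 1) AS result;  -- AAAA-BBB
```

Splitting the positions into variables costs a few lines but avoids repeating the nested CHARINDEX calls.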

need to replace some underscores with hyphens

I have a column with value
AAA_ZZZZ_7890_10_28_2014_123456.jpg
I need to replace the middle underscores so that it displays it as date i.e.
AAA_ZZZZ_7890_10-28-2014_123456.jpg
Can someone please suggest a simple update query for this?
The number of underscores would be the same for all the values in the column, but the length will vary. For example, some can look like
AAA_q10WRQ_001_10_28_2014_12.jpg
The following should do it:
http://sqlfiddle.com/#!3/d41d8/30384/0
declare @filename varchar(64) = 'AAA_ZZZZ_7890_10_28_2014_123456.jpg'
declare @datepattern varchar(64) = '%[_][0-1][0-9][_][0-3][0-9][_][1-2][0-9][0-9][0-9][_]%'
select
filename,
substring(filename,1,datepos+2)+'-'+
substring(filename,datepos+4,2)+'-'+
substring(filename,datepos+7,1000)
from
(
select
@filename filename,
patindex(@datepattern,@filename)
as datepos
) t
;
Resulting in
AAA_ZZZZ_7890_10-28-2014_123456.jpg
Caveats to watch out for:
It is important to exactly define how you find the date. In my definition it is MM_DD_YYYY surrounded by further two underscores, and I check that the first digits of M,D,Y are 0-1,0-3,1-2 respectively (i.e. I do NOT check if month is e.g. 13.) -- of course we assume that there is only one such string in any file name.
datepos actually finds the position of the underscore before the date -- this is not an issue if taken into account in the indexing of substring.
in the 3rd substring the length cannot be NULL or infinity and I couldn't get LEN() to work in SQL Fiddle so I dirty hardcoded a large enough number (1000). Corrections to this are welcome.
Try this (assuming that the DATE portion always starts at the same character index)
declare @string varchar(64) = 'AAA_ZZZZ_7890_10_28_2014_123456.jpg'
select replace(@string, reverse(substring(reverse(@string), charindex('_', reverse(@string), 0) + 1, 10)), replace(reverse(substring(reverse(@string), charindex('_', reverse(@string), 0) + 1, 10)), '_', '-'))
If there are exactly 6 underscores, this replaces the first of the two (the fourth underscore overall):
select STUFF ( 'AAA_ZZZZ_7890_10_28_2014_123456.jpg' , CHARINDEX ( '_' ,'AAA_ZZZZ_7890_10_28_2014_123456.jpg', CHARINDEX ( '_' ,'AAA_ZZZZ_7890_10_28_2014_123456.jpg', CHARINDEX ( '_' ,'AAA_ZZZZ_7890_10_28_2014_123456.jpg', CHARINDEX ( '_' ,'AAA_ZZZZ_7890_10_28_2014_123456.jpg', 0 ) + 1 ) + 1 ) + 1 ) , 1 , '-' )
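To handle both underscores in one go, the same CHARINDEX chaining can be extended to the fifth underscore and the two STUFF calls nested. This is a sketch only, assuming every value has at least five underscores; @f and the @p variables are illustrative names.

```sql
DECLARE @f varchar(64);
SET @f = 'AAA_ZZZZ_7890_10_28_2014_123456.jpg';

DECLARE @p3 int, @p4 int, @p5 int;
-- Third underscore: chain CHARINDEX from the first and second ones.
SET @p3 = CHARINDEX('_', @f, CHARINDEX('_', @f, CHARINDEX('_', @f) + 1) + 1);
SET @p4 = CHARINDEX('_', @f, @p3 + 1);   -- fourth underscore (between month and day)
SET @p5 = CHARINDEX('_', @f, @p4 + 1);   -- fifth underscore (between day and year)

-- Each STUFF swaps one character, so the positions stay valid for the second call.
SELECT STUFF(STUFF(@f, @p4, 1, '-'), @p5, 1, '-') AS result;
-- AAA_ZZZZ_7890_10-28-2014_123456.jpg
```

Because the positions are found by counting underscores rather than by fixed offsets, values with different segment lengths such as AAA_q10WRQ_001_10_28_2014_12.jpg should work the same way.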

Most efficient method for adding leading 0's to an int in sql

I need to return two fields from a database concatenated as 'field1-field2'. The second field is an int, but needs to be returned as a fixed length of 5 with leading 0's. The method i'm using is:
SELECT Field1 + '-' + RIGHT('0000' + CAST(Field2 AS varchar),5) FROM ...
Is there a more efficient way to do this?
That is pretty much the way: Adding Leading Zeros To Integer Values
So, to save following the link, the query looks like this, where #Numbers is the table and Num is the column:
SELECT RIGHT('000000000' + CONVERT(VARCHAR(8),Num), 8) FROM #Numbers
for negative or positive values
declare @v varchar(6)
select @v = -5
SELECT case when @v < 0
then '-' else '' end + RIGHT('00000' + replace(@v,'-',''), 5)
Another way (without CAST or CONVERT):
SELECT RIGHT(REPLACE(STR(@NUM),' ','0'),5)
If you can afford/want to have a function in your database you could use something like:
CREATE FUNCTION LEFTPAD
(@SourceString VARCHAR(MAX),
@FinalLength INT,
@PadChar CHAR(1))
RETURNS VARCHAR(MAX)
AS
BEGIN
RETURN
(SELECT Replicate(@PadChar, @FinalLength - Len(@SourceString)) + @SourceString)
END
I would do it like this.
SELECT RIGHT(REPLICATE('0', 5) + CAST(Field2 AS VARCHAR(5)),5)
Not necessarily all that "Easier", or more efficient, but better to read. Could be optimized to remove the need for "RIGHT"
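One thing worth checking with any of the RIGHT-based variants is what happens when the number has more digits than the target width: RIGHT silently drops the leading digits. A small sketch of the behaviour and one possible guard follows; the variable @n and the fallback of returning the unpadded number are assumptions for illustration.

```sql
DECLARE @n int;

SET @n = 42;
SELECT RIGHT('00000' + CAST(@n AS varchar(10)), 5) AS padded;   -- 00042

SET @n = 1234567;
SELECT RIGHT('00000' + CAST(@n AS varchar(10)), 5) AS padded;   -- 34567 (leading digits lost)

-- One possible guard: only pad values that fit in five digits.
SELECT CASE WHEN @n BETWEEN 0 AND 99999
            THEN RIGHT('00000' + CAST(@n AS varchar(10)), 5)
            ELSE CAST(@n AS varchar(10))
       END AS padded_or_raw;                                    -- 1234567
```

If the column can never exceed 99999 the guard is unnecessary, but it is cheap insurance when the range is not enforced.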
If you want a consistent total length in the final result by adding a varying number of zeros, here is a slight modification (for vsql):
SELECT
CONCAT(
REPEAT('0', 9-length(TO_CHAR(var1))),
CAST(var1 AS VARCHAR(9))
) as var1
You can replace 9 by any number for your need!