mssql select all nvarchar with wrong encoding - sql

I am working with an old database where someone didn't encode the data correctly before inserting it, which results in text like
"Wrong t�xt" (in my case the '�' is a ø).
I am looking for a way to find all rows where the column contains data like this, so I can correct it.
So far I have tried a pattern like
SELECT * FROM table WHERE ([colm] not like '[a-zA-Z\s]%')
but no matter what I do, I can't find a way to select only the rows containing the '�'.
A search like
SELECT * FROM table WHERE ([colm] like '%�%')
doesn't return anything either (I tried it, just in case).
I have been searching for this on Google and here on Stack Overflow, but either no one else has this problem or I am searching for the wrong thing.
So if someone would be so kind as to help me with this, I would be really happy.
Thanks for your time.

Assuming the character in the string really is U+FFFD REPLACEMENT CHARACTER (�), and it's not displayed as a replacement character because there are actually other bytes in there that can't be decoded properly, you can find it with
SELECT * FROM table WHERE [colm] LIKE N'%�%' COLLATE Latin1_General_BIN2
Or (to avoid any further issues with encoding mangling characters)
SELECT * FROM table WHERE [colm] LIKE N'%' + NCHAR(0xfffd) + N'%' COLLATE Latin1_General_BIN2
Unicode is required because � does not exist in any single-byte collation, and a binary collation is required because the regular collations treat � as if it did not occur in strings at all.
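To turn the search into a preview of the fix, the same binary-collation comparison can drive CHARINDEX and REPLACE. This is only a sketch: the table variable @rows and its sample data are made up for illustration, and substituting ø for every replacement character is an assumption taken from the question that may not hold for other data.

```sql
-- Hypothetical sample data; @rows stands in for the real table.
DECLARE @rows TABLE (id INT IDENTITY, colm NVARCHAR(100));

INSERT INTO @rows (colm)
VALUES (N'Wrong t' + NCHAR(0xFFFD) + N'xt'),
       (N'This string is ok');

SELECT r.id,
       r.colm,
       -- Position of the first U+FFFD, compared code point by code point.
       CHARINDEX(NCHAR(0xFFFD), r.colm COLLATE Latin1_General_BIN2) AS FirstBadPos,
       -- Preview of the repaired value; assumes every U+FFFD was originally 'ø'.
       REPLACE(r.colm COLLATE Latin1_General_BIN2, NCHAR(0xFFFD), N'ø') AS Preview
FROM @rows AS r
WHERE r.colm LIKE N'%' + NCHAR(0xFFFD) + N'%' COLLATE Latin1_General_BIN2;
```

Only the first sample row should come back. Once the preview looks right, the same REPLACE expression could be reused in an UPDATE.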

Try this:
WHERE [colm] not like N'%[a-zA-Z]%'
Of course, this should return values with numbers, spaces, and punctuation.

As Jeroen mentioned, using a binary comparison seems to be the way to go. Personally I would suggest using NGrams4k here, but I built a quick tally table instead that does the job:
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) N(N)),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2, N N3, N N4)
SELECT V.Colm
FROM (VALUES(N'Wrong t�xt" (in my case the ''�'' is a ø)'),
(N'This string is ok'))V(colm)
JOIN Tally T ON LEN(V.Colm) >= T.I
CROSS APPLY (VALUES(SUBSTRING(V.Colm,T.I,1))) SS(C)
GROUP BY V.colm
HAVING COUNT(CASE CONVERT(binary(2),SS.C) WHEN 0xFDFF THEN 1 END) > 0;

You could replace occurrences of the U+FFFD REPLACEMENT CHARACTER (�) and compare the result with the original value:
SELECT *
, CASE WHEN CONVERT(VARBINARY(MAX), t.colm) = CAST(REPLACE(CONVERT(VARBINARY(MAX), t.colm), 0xFDFF, 0x) AS VARBINARY(MAX)) THEN 1 ELSE 0 END AS EncodingCorrect
FROM (
SELECT N'Wrong t�xt" (in my case the ''�'' is a ø)' AS colm
UNION ALL
SELECT 'Correct text'
UNION ALL
SELECT 'Wrong t?xt" (in my case the ''?'' is a ø)'
) t
@Jeroen Mostert's suggestion WHERE colm LIKE N'%�%' COLLATE Latin1_General_BIN2 seems like the better and more readable solution.

Related

Replace a recurring word and the character before it

I am using SQL Server trying to replace each recurring "[BACKSPACE]" in a string and the character that came before the word [BACKSPACE] to mimic what a backspace would do.
Here is my current string:
"This is a string that I would like to d[BACKSPACE]correct and see if I could make it %[BACKSPACE] cleaner by removing the word and $[BACKSPACE] character before the backspace."
Here is what I want it to say:
"This is a string that I would like to correct and see if I could make it cleaner by removing the word and character before the backspace."
Let me make this clearer. In the above example string, the $ and % signs were just used as examples of characters that would need to be removed since they are before the [BACKSPACE] word that I want to replace.
Here is another before example:
The dog likq[BACKSPACE]es it's owner
I want to edit it to read:
The dog likes it's owner
One last before example is:
I am frequesn[BACKSPACE][BACKSPACE]nlt[BACKSPACE][BACKSPACE]tly surprised
I want to edit it to read:
I am frequently surprised
Without a CLR function that provides Regex replacement the only way you'll be able to do this is with iteration in T-SQL. Note, however, that the below solution does not give you the results you ask for, but does the logic you ask. You state that you want to remove the string and the character before, but in 2 of your scenarios that isn't true. For the last 2 strings you remove ' %[BACKSPACE]' and ' $[BACKSPACE]' respectively (notice the leading whitespace).
This leading whitespace is left in this solution. I am not entertaining fixing that, as the real solution is don't use T-SQL for this, use something that supports Regex.
I also assume this string is coming from a column in a table, and said table has multiple rows (with a distinct value for the string on each).
Anyway, the solution:
WITH rCTE AS(
SELECT V.YourColumn,
STUFF(V.YourColumn,CHARINDEX('[BACKSPACE]',V.YourColumn)-1,LEN('[BACKSPACE]')+1,'') AS ReplacedColumn,
1 AS Iteration
FROM (VALUES('"This is a string that I would like to d[BACKSPACE]correct and see if I could make it %[BACKSPACE] cleaner by removing the word and $[BACKSPACE] character before the backspace."'))V(YourColumn)
UNION ALL
SELECT r.YourColumn,
STUFF(r.ReplacedColumn,CHARINDEX('[BACKSPACE]',r.ReplacedColumn)-1,LEN('[BACKSPACE]')+1,''),
r.Iteration + 1
FROM rCTE r
WHERE CHARINDEX('[BACKSPACE]',r.ReplacedColumn) > 0)
SELECT TOP (1) WITH TIES
r.YourColumn,
r.ReplacedColumn
FROM rCTE r
ORDER BY ROW_NUMBER() OVER (PARTITION BY r.YourColumn ORDER BY r.Iteration DESC);
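The same iteration can also be written as a plain WHILE loop over a single value, which may be easier to follow than the recursive CTE. This is a minimal sketch, assuming the input sits in a variable (@s and @token are illustrative names, not part of the answer above). It also guards against a token at the very start of the string, where there is no preceding character and STUFF with a start position of 0 would return NULL.

```sql
DECLARE @s     VARCHAR(MAX) = 'I am frequesn[BACKSPACE][BACKSPACE]nlt[BACKSPACE][BACKSPACE]tly surprised';
DECLARE @token VARCHAR(20)  = '[BACKSPACE]';
DECLARE @pos   INT          = CHARINDEX(@token, @s);

WHILE @pos > 0
BEGIN
    SET @s = CASE
                 -- Normal case: remove the token and the one character before it.
                 WHEN @pos > 1 THEN STUFF(@s, @pos - 1, LEN(@token) + 1, '')
                 -- Token at position 1: nothing precedes it, so remove the token only.
                 ELSE STUFF(@s, 1, LEN(@token), '')
             END;
    SET @pos = CHARINDEX(@token, @s);
END;

SELECT @s AS Cleaned;  -- I am frequently surprised
```

Because each pass removes exactly one token, consecutive [BACKSPACE] markers delete consecutive characters, as a real backspace would.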
I've had a crack at getting this to work using the traditional tally-table method without any recursion.
I think I have something that works. The recursive CTE version is definitely cleaner and probably performs better, but I'm adding this as a non-recursive alternative.
/* tally table for use below */
select top 1000 N=Identity(int, 1, 1)
into dbo.Digits
from master.dbo.syscolumns a cross join master.dbo.syscolumns;
/* assumed inputs: the original answer uses these variables without declaring them */
declare @string varchar(max) = 'The dog likq[BACKSPACE]es it''s owner',
@delimiter varchar(20) = '[BACKSPACE]';
with w as (
select seq = Row_Number() over (order by t.N),
part = Replace(Substring(@string, t.N, CharIndex(Left(@delimiter,1), @string + @delimiter, t.N) - t.N),Stuff(@delimiter,1,1,''),'')
from Digits t
where t.N <= DataLength(@string)+1 and Substring(Left(@delimiter,1) + @string, t.N, 1) = Left(@delimiter,1)
),
p as (
select seq,Iif(Iif(Lead(part) over(order by seq)='' and lag(part) over(order by seq)='',1,0 )=1 ,'', Iif( seq<Max(seq) over() and part !='',Left(part,Len(part)-1),part)) part
from w
)
select result=(
select ''+ part
from p
where part!=''
order by seq
for xml path('')
)
Here's a simple RegEx pattern that should work:
/.\[BACKSPACE\]/g
EDIT
I have no way to test this right now on my chromebook, but this seems like it should work for T-SQL in the LIKE clause
LIKE '_\[BACKSPACE]' ESCAPE '\'

Remove ASCII Extended Characters 128 onwards (SQL)

Is there a simple way to remove extended ASCII characters from a varchar(max)? I want to remove all characters from code 128 onwards, e.g. ù, ç, Ä.
I have tried this solution and it's not working; I think it's because they are still valid ASCII characters?
How do I remove extended ASCII characters from a string in T-SQL?
Thanks
The linked solution uses a loop, which is something you should avoid where possible.
My solution is completely inlineable; it's easy to create a UDF (or maybe even better, an inline TVF) from it.
The idea: create a set of running numbers (here limited by the number of rows in sys.objects, but there are plenty of examples of how to create a numbers tally on the fly). In the second CTE the strings are split into single characters. The final SELECT returns the cleaned string.
DECLARE @tbl TABLE(ID INT IDENTITY, EvilString NVARCHAR(100));
INSERT INTO @tbl(EvilString) VALUES('ËËËËeeeeËËËË'),('ËaËËbËeeeeËËËcË');
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
,SingleChars AS
(
SELECT tbl.ID,rn.Nmbr,SUBSTRING(tbl.EvilString,rn.Nmbr,1) AS Chr
FROM @tbl AS tbl
CROSS APPLY (SELECT TOP(LEN(tbl.EvilString)) Nmbr FROM RunningNumbers) AS rn
)
SELECT ID,EvilString
,(
SELECT '' + Chr
FROM SingleChars AS sc
WHERE sc.ID=tbl.ID AND ASCII(Chr)<128
ORDER BY sc.Nmbr
FOR XML PATH('')
) AS GoodString
FROM @tbl As tbl
The result
1 ËËËËeeeeËËËË eeee
2 ËaËËbËeeeeËËËcË abeeeec
Here is another answer from me where this approach is used to replace all special characters with secure characters to get plain latin
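The inline TVF mentioned above could look roughly like the following. This is a sketch, not the answer's own code: the function name dbo.StripNonAscii, the NVARCHAR(4000) limit and the use of sys.all_objects as a row source are assumptions, and it swaps ASCII() for UNICODE() so that the test does not depend on the code page of the collation. Control characters below 32 may need extra handling, because the XML conversion can reject them.

```sql
-- Hypothetical helper; name and size limit are illustrative.
CREATE FUNCTION dbo.StripNonAscii (@input NVARCHAR(4000))
RETURNS TABLE
AS
RETURN
WITH RunningNumbers AS
(
    -- One row per character position (LEN ignores trailing spaces).
    SELECT TOP (ISNULL(LEN(@input), 0))
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Nmbr
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
SELECT (
           -- Keep only characters whose code point is below 128, in original order.
           SELECT '' + SUBSTRING(@input, rn.Nmbr, 1)
           FROM RunningNumbers AS rn
           WHERE UNICODE(SUBSTRING(@input, rn.Nmbr, 1)) < 128
           ORDER BY rn.Nmbr
           FOR XML PATH(''), TYPE
       ).value('.', 'NVARCHAR(4000)') AS GoodString;
```

It would then be applied per row with CROSS APPLY, for example SELECT t.EvilString, f.GoodString FROM @tbl AS t CROSS APPLY dbo.StripNonAscii(t.EvilString) AS f. The TYPE directive with .value() keeps characters such as & and < from coming back as XML entities.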

Sort varchar datatype with numeric characters

SQL SERVER 2005
SQL Sorting :
Datatype varchar
Should sort by
1.aaaa
5.xx
11.bbbbbb
12
15.
How can I get this sort order?
A plain ORDER BY gives the wrong order:
1.aaaa
11.bbbbbb
12
15.
5.xx
On Oracle, this would work.
SELECT
*
FROM
table
ORDER BY
to_number(regexp_substr(COLUMN,'^[0-9]+')),
regexp_substr(column,'\..*');
You could do this by calculating a column based on what's on the left hand side of the period('.').
However this method will be very difficult to make robust enough to use in a production system, unless you can make a lot of assertions about the content of the strings.
Also handling strings without periods could cause some grief
with r as (
select '1.aaaa' as string
union select '5.xx'
union select '11.bbbbbb'
union select '12'
union select '15.' )
select *
from r
order by
CONVERT(int, left(r.string, case when ( CHARINDEX('.', r.string)-1 < 1)
then LEN(r.string)
else CHARINDEX('.', r.string)-1 end )),
r.string
If all the entries have this form, you could split them into two parts and sort by these, for example like this:
ORDER BY
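CONVERT(INT, SUBSTRING(fieldname, 1, CHARINDEX('.', fieldname + '.') - 1)), -- digits before the first '.'; the appended '.' covers values without one
SUBSTRING(fieldname, CHARINDEX('.', fieldname + '.') + 1, LEN(fieldname)) -- text after the first '.'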
This should do a numeric sort on the part before the . and an alphanumeric sort for the part after the ., but may need some tuning, as I haven't actually tried it.
Another way (and faster) might be to create computed columns that contain the part before the . and after the . and sort by them.
A third way (if you can't create computed columns) could be to create a view over the table that has two additional columns with the respective parts of the field and then do the select on that view.
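The computed-column variant might be sketched as follows. The table dbo.SortDemo and the column names NumPart and TextPart are hypothetical, and the sketch assumes every value starts with digits; a value that does not would make the INSERT fail, because the persisted conversion to INT runs at write time.

```sql
-- Hypothetical demo table; names are illustrative only.
CREATE TABLE dbo.SortDemo
(
    fieldname VARCHAR(50) NOT NULL,
    -- Digits before the first '.', stored as INT; the appended '.' handles values without one.
    NumPart  AS CONVERT(INT, LEFT(fieldname, CHARINDEX('.', fieldname + '.') - 1)) PERSISTED,
    -- Everything after the first '.', or '' when there is none.
    TextPart AS SUBSTRING(fieldname, CHARINDEX('.', fieldname + '.') + 1, 50) PERSISTED
);

INSERT INTO dbo.SortDemo (fieldname)
SELECT '1.aaaa'    UNION ALL
SELECT '5.xx'      UNION ALL
SELECT '11.bbbbbb' UNION ALL
SELECT '12'        UNION ALL
SELECT '15.';

-- Numeric sort on the leading number, then alphabetical on the remainder.
SELECT fieldname
FROM dbo.SortDemo
ORDER BY NumPart, TextPart;
```

With the sample rows from the question this returns 1.aaaa, 5.xx, 11.bbbbbb, 12, 15. in that order. The same two expressions could be exposed through a view instead if the table cannot be altered.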

T-SQL Substring - Last 3 Characters

Using T-SQL, how would I go about getting the last 3 characters of a varchar column?
So the column text is IDS_ENUM_Change_262147_190 and I need 190
SELECT RIGHT(column, 3)
That's all you need.
You can also do LEFT() in the same way.
Bear in mind if you are using this in a WHERE clause that the RIGHT() can't use any indexes.
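If filtering on the suffix is frequent, one possible workaround for the index limitation is to persist the suffix in a computed column and index that. A sketch under assumptions: the table dbo.ResourceKeys, the column names and the index name are invented for illustration, and whether the optimizer actually chooses the index depends on the data and the query.

```sql
-- Hypothetical table; Last3 stores the final three characters of KeyName.
CREATE TABLE dbo.ResourceKeys
(
    KeyName VARCHAR(100) NOT NULL,
    Last3   AS RIGHT(KeyName, 3) PERSISTED
);

CREATE INDEX IX_ResourceKeys_Last3 ON dbo.ResourceKeys (Last3);

INSERT INTO dbo.ResourceKeys (KeyName)
SELECT 'IDS_ENUM_Change_262147_190' UNION ALL
SELECT 'IDS_ENUM_Change_262147_191';

-- The predicate is now a plain equality on an indexed column.
SELECT KeyName
FROM dbo.ResourceKeys
WHERE Last3 = '190';
```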
You can use either way:
SELECT RIGHT(RTRIM(columnName), 3)
OR
SELECT SUBSTRING(columnName, LEN(columnName)-2, 3)
Because more ways to think about it are always good:
select reverse(substring(reverse(columnName), 1, 3))
declare @newdata varchar(30)
set @newdata='IDS_ENUM_Change_262147_190'
select REVERSE(substring(reverse(@newdata),0,charindex('_',reverse(@newdata))))
=== Explanation ===
I found it easier to read written like this:
SELECT
REVERSE( --4.
SUBSTRING( -- 3.
REVERSE(<field_name>),
0,
CHARINDEX( -- 2.
'<your char of choice>',
REVERSE(<field_name>) -- 1.
)
)
)
FROM
<table_name>
1. Reverse the text.
2. Look for the first occurrence of a specific char (i.e. the first occurrence FROM THE END of the text) and get the index of this char.
3. Look at the reversed text again and take everything from index 0 up to the index of your char. This gives the string you are looking for, but in reverse.
4. Reverse the reversed string to give you your desired substring.
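The four steps can be made visible by selecting each intermediate value for the sample string from the question. The variable name @s and the column aliases are only illustrative.

```sql
DECLARE @s VARCHAR(30);
SET @s = 'IDS_ENUM_Change_262147_190';

SELECT REVERSE(@s)                                             AS Step1_Reversed,   -- 091_741262_egnahC_MUNE_SDI
       CHARINDEX('_', REVERSE(@s))                             AS Step2_Index,      -- 4
       SUBSTRING(REVERSE(@s), 0, CHARINDEX('_', REVERSE(@s)))  AS Step3_ReversedTail, -- 091
       REVERSE(SUBSTRING(REVERSE(@s), 0, CHARINDEX('_', REVERSE(@s)))) AS Step4_Result; -- 190
```

Starting SUBSTRING at 0 rather than 1 is what drops the delimiter itself: a length of 4 from position 0 covers only the three characters before the underscore. If the delimiter does not occur at all, CHARINDEX returns 0 and the result is an empty string.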
If you specifically want to find rows whose value ends with the desired characters, this would help:
select * from tablename where col_name like '%190'

How to check if a string is a uniqueidentifier?

Is there an equivalent to IsDate or IsNumeric for uniqueidentifier (SQL Server)?
Or is there anything equivalent to (C#) TryParse?
Otherwise I'll have to write my own function, but I want to make sure I'm not reinventing the wheel.
The scenario I'm trying to cover is the following:
SELECT something FROM table WHERE IsUniqueidentifier(column) = 1
SQL Server 2012 makes this all much easier with TRY_CONVERT(UNIQUEIDENTIFIER, expression)
SELECT something
FROM your_table
WHERE TRY_CONVERT(UNIQUEIDENTIFIER, your_column) IS NOT NULL;
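A quick way to see how TRY_CONVERT treats different inputs is to run it over a few literal candidates. This is only an illustrative sketch for SQL Server 2012 or later; the derived table v and its candidate column are made-up names.

```sql
SELECT v.candidate,
       -- NULL when the string cannot be converted, the GUID otherwise.
       TRY_CONVERT(UNIQUEIDENTIFIER, v.candidate) AS AsGuid
FROM (VALUES ('5D944516-98E6-44C5-849F-9C277833C01B'),
             ('{5D944516-98E6-44C5-849F-9C277833C01B}'),
             ('fish'),
             (NULL)) AS v(candidate);
```

The 'fish' and NULL rows come back as NULL instead of raising a conversion error, which is what makes the function safe to use directly in a WHERE clause.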
For prior versions of SQL Server, the existing answers miss a few points that mean they may either not match strings that SQL Server will in fact cast to UNIQUEIDENTIFIER without complaint or may still end up causing invalid cast errors.
SQL Server accepts GUIDs either wrapped in {} or without this.
Additionally it ignores extraneous characters at the end of the string. Both SELECT CAST('{5D944516-98E6-44C5-849F-9C277833C01B}ssssssssss' as uniqueidentifier) and SELECT CAST('5D944516-98E6-44C5-849F-9C277833C01BXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' as uniqueidentifier) succeed for instance.
Under most default collations the LIKE '[a-zA-Z0-9]' will end up matching characters such as À or Ë
Finally if casting rows in a result to uniqueidentifier it is important to put the cast attempt in a case expression as the cast may occur before the rows are filtered by the WHERE.
So (borrowing @r0d30b0y's idea) a slightly more robust version might be
;WITH T(C)
AS (SELECT '5D944516-98E6-44C5-849F-9C277833C01B'
UNION ALL
SELECT '{5D944516-98E6-44C5-849F-9C277833C01B}'
UNION ALL
SELECT '5D944516-98E6-44C5-849F-9C277833C01BXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
UNION ALL
SELECT '{5D944516-98E6-44C5-849F-9C277833C01B}ssssssssss'
UNION ALL
SELECT 'ÀD944516-98E6-44C5-849F-9C277833C01B'
UNION ALL
SELECT 'fish')
SELECT CASE
WHEN C LIKE expression + '%'
OR C LIKE '{' + expression + '}%' THEN CAST(C AS UNIQUEIDENTIFIER)
END
FROM T
CROSS APPLY (SELECT REPLACE('00000000-0000-0000-0000-000000000000', '0', '[0-9a-fA-F]') COLLATE Latin1_General_BIN) C2(expression)
WHERE C LIKE expression + '%'
OR C LIKE '{' + expression + '}%'
Not mine, found this online... thought i'd share.
SELECT 1 WHERE @StringToCompare LIKE
REPLACE('00000000-0000-0000-0000-000000000000', '0', '[0-9a-fA-F]');
SELECT something
FROM table1
WHERE column1 LIKE '[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]';
UPDATE:
...but I much prefer the approach in the answer by @r0d30b0y:
SELECT something
FROM table1
WHERE column1 LIKE REPLACE('00000000-0000-0000-0000-000000000000', '0', '[0-9a-fA-F]');
I am not aware of anything that you could use "out of the box" - you'll have to write this on your own, I'm afraid.
If you can: try to write this inside a C# library and deploy it into SQL Server as a SQL-CLR assembly - then you could use things like Guid.TryParse() which is certainly much easier to use than anything in T-SQL....
A variant of r0d30b0y answer is to use PATINDEX to find within a string...
PATINDEX('%'+REPLACE('00000000-0000-0000-0000-000000000000', '0', '[0-9a-fA-F]')+'%',@StringToCompare) > 0
Had to use to find Guids within a URL string..
HTH
Dave
I like to keep it simple. A GUID has four hyphens in it, even if it is just a string:
WHERE column like '%-%-%-%-%'
Though an older post, just a thought for a quick test ...
SELECT [A].[INPUT],
CAST([A].[INPUT] AS [UNIQUEIDENTIFIER])
FROM (
SELECT '5D944516-98E6-44C5-849F-9C277833C01B' Collate Latin1_General_100_BIN AS [INPUT]
UNION ALL
SELECT '{5D944516-98E6-44C5-849F-9C277833C01B}'
UNION ALL
SELECT '5D944516-98E6-44C5-849F-9C277833C01BXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
UNION ALL
SELECT '{5D944516-98E6-44C5-849F-9C277833C01B}ssssssssss'
UNION ALL
SELECT 'ÀD944516-98E6-44C5-849F-9C277833C01B'
UNION ALL
SELECT 'fish'
) [A]
WHERE PATINDEX('[^0-9A-F-{}]%', [A].[INPUT]) = 0
This is a function based on the concept of some earlier comments. This function is very fast.
CREATE FUNCTION [dbo].[IsGuid] (@input varchar(50))
RETURNS bit AS
BEGIN
RETURN
case when @input like '[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]-[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]'
then 1 else 0 end
END
GO
/*
Usage:
select [dbo].[IsGuid]('123') -- Returns 0
select [dbo].[IsGuid]('ebd8aebd-7ea3-439d-a7bc-e009dee0eae0') -- Returns 1
select * from SomeTable where dbo.IsGuid(TableField) = 0 -- Returns table with all non convertable items!
*/
DECLARE @guid_string nvarchar(256) = 'ACE79678-61D1-46E6-93EC-893AD559CC78'
SELECT
CASE WHEN @guid_string LIKE '________-____-____-____-____________'
THEN CONVERT(uniqueidentifier, @guid_string)
ELSE NULL
END
You can write your own UDF. This is a simple approximation to avoid the use of a SQL-CLR assembly.
CREATE FUNCTION dbo.isuniqueidentifier (@ui varchar(50))
RETURNS bit AS
BEGIN
RETURN case when
substring(@ui,9,1)='-' and
substring(@ui,14,1)='-' and
substring(@ui,19,1)='-' and
substring(@ui,24,1)='-' and
len(@ui) = 36 then 1 else 0 end
END
GO
You can then improve it to check that the remaining characters are all hex digits.
I use :
ISNULL(convert(nvarchar(50), userID), 'NULL') = 'NULL'
I had some Test users that were generated with AutoFixture, which uses GUIDs by default for generated fields. My FirstName fields for the users that I need to delete are GUIDs or uniqueidentifiers. That's how I ended up here.
I was able to cobble together some of your answers into this.
SELECT UserId FROM [Membership].[UserInfo] Where TRY_CONVERT(uniqueidentifier, FirstName) is not null
Use RLIKE for MYSQL
SELECT 1 WHERE @StringToCompare
RLIKE REPLACE('00000000-0000-0000-0000-000000000000', '0', '[0-9a-fA-F]');
In the simplest scenario, when you are sure that no other value can contain four '-' signs:
SELECT * FROM City WHERE Name LIKE('%-%-%-%-%')
In BigQuery you can use
SELECT *
FROM table
WHERE
REGEXP_CONTAINS(uuid, REPLACE('^00000000-0000-0000-0000-000000000000$', '0', '[0-9a-fA-F]'))