Identify Hidden Characters - sql

In my SQL tables I have text which has hidden characters which is only visible when I copy and paste it in notepad++.
How to find those rows which has hidden characters using SQL Server queries?
I have tried comparing the lengths using datalength and len
it did not work.
DATALENGTH(name) AS BinaryLength != LEN(name)
I want the row which has hidden characters.

On the assumption that this is being caused by control characters. Some of which are invisible. But also include tabs, newlines and spaces. An example to illustrate and how to get them to appear.
--DROP TABLE #SillyTemp
DECLARE #InvisibleChar1 NCHAR(1) = NCHAR(28), #InvisibleChar2 NCHAR(1) = NCHAR(30), #NonControlChar NCHAR(1) = NCHAR(33);
DECLARE #InputString NVARCHAR(500) = N'Some |' + #InvisibleChar1 +'| random string |' + #InvisibleChar2 + '|' + '; Thank god Finally a normal character |' + #NonControlChar + '|';
SELECT #InputString AS OhNoInvisibleCharacters
DECLARE #ControlCharRange NVARCHAR(50) = N'%[' + NCHAR(1) + '-' + NCHAR(31) + ']%';
CREATE TABLE #SillyTemp
(
input nvarchar(500)
)
INSERT INTO #SillyTemp(input)
VALUES (#InputString),(N'A normal string')
SELECT #ControlCharRange;
SELECT input FROM #SillyTemp AS #SI WHERE input LIKE #ControlCharRange;
This produces 3 results. A string with invisiblechars within them like such:
Some || random string ||; Thank god Finally a normal character |!|
Note, the are actually invisible inside SQL. But stackoverflow shows them as such. The output in SQL Server is simply.
Some || random string ||; Thank god Finally a normal character |!|
But these characters still have a corresponding (N)CHAR(X) value. (N)CHAR(0) is a NULL character and is highly unlikely to be in a string, in my setup to detect them it also provides some problems in building a range. (N)CHAR(32) is the ' ' space character.
The way the [X-Y] string operator works is also based on the (N)CHAR numbers. Therefore we can make a range of [NCHAR(1)-NCHAR(31)]
The last select goes through the temporary table, one which has invisible characters. Since we're looking for any NCHARS between 1 and 31, only those with invisible characters (and often invalid characters or tabs/newlines) satisfy the where condition. Thus only they get returned. In this case only the 'faulty' string gets returned in my select statement.

Related

In SQL Server, how can I search for column with 1 or 2 whitespace characters?

So I need to filter column which contains either one, two or three whitespace character.
CREATE TABLE a
(
[col] [char](3) NULL,
)
and some inserts like
INSERT INTO a VALUES (' ',' ', ' ')
How do I get only the row with one white space?
Simply writing
SELECT *
FROM a
WHERE column = ' '
returns all rows irrespective of one or more whitespace character.
Is there a way to escape the space? Or search for specific number of whitespaces in column? Regex?
Use like clause - eg where column like '%[ ]%'
the brackets are important, like clauses provide a very limited version of regex. If its not enough, you can add a regex function written in C# to the DB and use that to check each row, but it won't be indexed and thus will be very slow.
The other alternative, if you need speed, is to look into full text search indexes.
Here is one approach you can take:
DECLARE #data table ( txt varchar(50), val varchar(50) );
INSERT INTO #data VALUES ( 'One Space', ' ' ), ( 'Two Spaces', ' ' ), ( 'Three Spaces', ' ' );
;WITH cte AS (
SELECT
txt,
DATALENGTH ( val ) - ( DATALENGTH ( REPLACE ( val, ' ', '' ) ) ) AS CharCount
FROM #data
)
SELECT * FROM cte WHERE CharCount = 1;
RETURNS
+-----------+-----------+
| txt | CharCount |
+-----------+-----------+
| One Space | 1 |
+-----------+-----------+
You need to use DATALENGTH as LEN ignores trailing blank spaces, but this is a method I have used before.
NOTE:
This example assumes the use of a varchar column.
Trailing spaces are often ignored in string comparisons in SQL Server. They are treated as significant on the LHS of the LIKE though.
To search for values that are exactly one space you can use
select *
from a
where ' ' LIKE col AND col = ' '
/*The second predicate is required in case col contains % or _ and for index seek*/
Note with your example table all the values will be padded out to three characters with trailing spaces anyway though. You would need a variable length datatype (varchar/nvarchar) to avoid this.
The advantage this has over checking value + DATALENGTH is that it is agnostic to how many bytes per character the string is using (dependant on datatype and collation)
DB Fiddle
How to get only rows with one space?
SELECT *
FROM a
WHERE col LIKE SPACE(1) AND col NOT LIKE SPACE(2)
;
Though this will only work for variable length datatypes.
Thanks guys for answering.
So I converted the char(3) column to varchar(3).
This seemed to work for me. It seems sql server has ansi padding that puts three while space in char(3) column for any empty or single space input. So any search or len or replace will take the padded value.

Update or replace a special character text in my database field

I am facing to an issue that I cannot update or replace some characters in my database.
Here is how this text look like in my column when I retrieve it:
As you can see, there is an unknown characters between 'master' and 'degree' which I cannot even paste it here.
I also tried to update and replace it with below code (I cannot paste that two vertical lines here since they are not supported in any browser and I am not sure what they are, Please see the picture above to see what is in my SQL statement).
begin transaction
update gm_desc set projdesc=replace(projdesc,'%â s%','') where projdesc like '%âs%' and proposalno = '15-149-01'
You can see the real SQL Statement here:
I tried to update, or replace it but I cannot do it. The update statement successfully works but I still see that weird special charters. I would be appreciate to help me.
Here's a scalar-valued function which removes all non-alphanumeric characters (preserves spaces) from a string.
Hopefully it helps!
dbfiddle
create function dbo.get_alphanumeric_str
(
#string varchar(max)
)
returns varchar(max)
as
begin
declare #ret varchar(max);
with nums as (
select 1 as n
union all select n+1 from nums
where n < 256
)
select #ret = replace(stuff(
(
select '' + substring(#string, nums.n, 1)
from nums
where patindex('%[^0-9A-Za-z ]%', substring(#string, nums.n,1)) = 0
for xml path('')
), 1, 0, ''
), ' ', ' ')
option (MAXRECURSION 256)
return #ret;
end
Usage
select dbo.get_alphanumeric_str('Helloᶄ âWorld 1234⅊⅐')
Returns: Hello World 1234
How it works
The nums CTE is just to get a list of numbers (you can set the maximum of 256 to a higher value if your strings are longer; n.b. option (MAXRECURSION n) is for this CTE but has to be placed at the query)
The stuff essentially iterates through the string, using the list of numbers above and extracts a substring of length 1; each of these chars are checked if they match the [^0-9A-Za-z ] regex group (0-9 all digits, A-Za-z all letters both lower and upper case, and a single space character)
If they match, patindex() should return 0; i.e. index zero.
Use replace(string, ' ', ' ') for the space character as the xml path returns a special encoding, see this question.
Use a binary collation for accented characters; see this answer

SQL - How to remove a space character from a string

I have a table where the max length of a column (varchar) is 12, someone has loaded some value with a space, so rather than 'SPACE' it's 'SPACE '
I want to remove the space using a script, I was positive RTRIM or REPLACE(myValue, ' ', '') would work but LEN(myValue) shows there is still and extra character?
As mentioned by a couple folks, it may not be a space. Grab a copy of ngrams8k and you use it to identify the issue. For example, here we have the text, " SPACE" with a preceding space and trailing CHAR(160) (HTML BR tag). CHAR(160) looks like a space in SSMS but isn't "trimable". For example consider this query:
DECLARE #string VARCHAR(100) = ' SPACE'+CHAR(160);
SELECT '"'+#string+'"'
Using ngrams8k you could do this:
DECLARE #string VARCHAR(100) = ' SPACE'+CHAR(160);
SELECT
ng.position,
ng.token,
asciival = ASCII(ng.token)
FROM dbo.ngrams8k(#string,1) AS ng;
Returns:
position token asciival
---------- ------- -----------
1 32
2 S 83
3 P 80
4 A 65
5 C 67
6 E 69
7   160
As you can see, the first character (position 1) is CHAR(32), that's a space. The last character (postion 7) is not a space.
Knowing that CHAR(160) is the issue you could fix it like so:
SET #string = REPLACE(LTRIM(#string),CHAR(160),'')
If you are using SQL Server 2017+ you can also use TRIM which does a whole lot more than just LTRIM-and-RTRIM-ing. For example, this will remove
leading and trailing tabs, spaces, carriage returns, line returns and HTML BR tags.
SET #string = SELECT TRIM(CHAR(32)+CHAR(9)+CHAR(10)+CHAR(13)+CHAR(160) FROM #string)
Odds are it is some other non-printing character, carriage return is a big one when going from between *nix and other OS. One way to tell is to use the DUMP function. So you could start with a query like:
SELECT dump(column_name)
FROM your_table
WHERE column_name LIKE 'SPACE%'
That should help you find the offending character, however, that doesn't fix your problem. Instead, I would use something like REGEXP_REPLACE:
SELECT REGEXP_REPLACE(column_name, '[^A-z]')
FROM your_table
That should take care of any non-printing characters. You may need to play with the regular expression if you expect numbers or symbols in your string. You could switch to a character class like:
SELECT REGEXP_REPLACE(column_name, '[:cntrl:]')
FROM your_table

Find and Replace credit card numbers

We have a large database with a lot of data in it. I found out recently our sales and shipping department have been using a part of the application to store clients credit card numbers in the open. We've put a stop to it, but now there are thousands of rows with the numbers.
We're trying to figure out how to scan certain columns for 16 digits in a row (or dash separation) and replace them with X's.
It's not a simple UPDATE statement because the card numbers are stored among large amounts of text. So far I've been unable to figure out if SQL Server is capable of regex (it would seem not).
All else fails i will do this through PHP since that is what i'm best at... but it'll be painful.
Sounds like you need to use PATINDEX with a WHERE LIKE clause.
Something like this. Create a stored proc with something similar, then call it with a bunch of different parameters (make #pattern & #patternlength the params) that you have identified, until you've replaced all of the instances.
declare #pattern varchar(100), #patternlength int
set #pattern = '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
set #patternlength = 19
update tableName
set fieldName =
LEFT(fieldName, patindex('%'+ #pattern + '%', fieldName)-1)
+ 'XXXX-XXXX-XXXX-XXXX'
+ SUBSTRING(fieldName, PATINDEX('%'+ #pattern + '%', fieldName)+#patternlength, LEN(fieldName))
from tableName
where fieldName like '%'+ #pattern + '%'
The trick is just finding the appropriate patterns, and setting the appropriate #patternlength value (not the length of #pattern as that won't work!)
I think you are better off doing this programatically, especially since you mentioned the data can be in a couple of different formats. Do keep in mind that not all credit card numbers are 16 digits long (Amex is 15, Visa is 13 or 16, etc).
The ability to check for various regexes and validate code will probably be best served at a cleanup job level, if possible.
Improvised Sean's answer.
The following will find all the occurrences of #maskPattern in #text and replace them with 'x'.
Example, If #maskPattern = XXXX-XXXX-XXXX-XXXX, it will find this pattern in #text and replace all occurrences with XXXX-XXXX-XXXX-XXXX. If it does not find any occurrence, it will leave the text as is.
This stored procedure can also be manipulated to only mask 3/4th of the beginning of the maskPattern. Cheers!
ALTER PROCEDURE [dbo].[SP_MaskCharacters] #text nvarchar(max),
#maskPattern nvarchar(500)
AS
BEGIN
DECLARE #numPattern nvarchar(max) = REPLACE(#maskPattern, 'x', '[0-9]')
DECLARE #patternLength int = LEN(#maskPattern)
WHILE (#text IS NOT NULL)
BEGIN
IF PATINDEX('%' + #numPattern + '%', #text) = 0 BREAK;
SET #text =
LEFT(#text, PATINDEX('%' + #numPattern + '%', #text)-1) --Get beginning chars of the input text until first occurance of pattern is found
+ #maskPattern --Append aasking pattern
+ SUBSTRING(#text, PATINDEX('%' + #numPattern + '%', #text) + #patternLength, LEN(#text)) -- Get & append rest of the text found after masking attern
END
SELECT #text
END
I faced this situation recently. Using Patindex and Stuff should help, but you would need to repeat for CC numbers with different number of digits separately.
-- For 16 digits CC numbers
UPDATE table
SET columnname = Stuff (columnname, Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname), 16, '################')
WHERE Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname) > 0
You can use patindex. It won't be pretty and there might be a more concise way to write it. But you can use sets ie [0-9]
patindex: http://msdn.microsoft.com/en-us/library/ms188395.aspx
similar question: SQL Server Regular expressions in T-SQL
For anyone finding this question who does want to use PHP, here's a function I use that takes a credit card number (all digits, with dashes, or with spaces) and replaces all but the first and last 4 digits with 'X'.
To accept credit card numbers with dashes as well, use this regex pattern instead:
$cc_regex_pattern = '/(\d{4})(-)?(\d{4})(-)?(\d{4})(-)?(\d{4})/'
and remove the preprocessing of the cc number that removes the dashes:
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
and so the replacement string becomes (because we've changed the index of patterns - note the $7):
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$7';
or if you want, simply replace the whole cc number, like in the original question:
$cc_regex_replacement = 'XXXX$2XXXX$4XXXX$6XXXX';
Here's the original function for credit card numbers with or without spaces or dashes, which obfuscates and removes any dashes:
/**
* #param integer|string $credit_card_number
* #return mixed
*/
static function obfuscate_credit_card($credit_card_number)
{
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
$cc_length = strlen($compressed_cc_number);
$cc_middle_length = $cc_length >= 9 ? $cc_length - 8 : 0;
//create middle pattern
$cc_middle_pattern = '';
for ($i = 0; $i < $cc_middle_length; $i++) {
$cc_middle_pattern .= 'X';
}
//replace cc middle digits with middle pattern
$cc_regex_pattern = '/(\d{4})(\d+)(\d{4})/';
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$3';
$obfuscated_cc = preg_replace($cc_regex_pattern, $cc_regex_replacement, $compressed_cc_number);
return $obfuscated_cc;
}

SQL Command to replace embedded spaces with another character

I have a relational database with several fixed length character fields. I need to permanently replace all the embedded spaces with another character like - so JOHN DOE would become JOHN-DOE and ITSY BISTSY SPIDER would become ITSY-BISTSY-SPIDER. I can search before hand to make sure there are no strings that would conflict. I just need to be able to print the requested files with no embedded spaces. I would do the replacement in the C code but I want to make sure that there is never a future case where there is a JANE DOE and JANE-DOE in the DB.
By the way I have already made sure that there are no strings with more than one consecutive embedded space or leading spaces only trailing spaces to fill the fixed length fields.
Edit: thanks for all the help!
It looks like when I cut & pasted my question from Word to StackOverflow the trailing spaces got lost so the meaning my question was lost a bit.
I need to replace only the embedded spaces not the trailing spaces!
Note: I am using middle dot to stand in for spaces that don't show well.
Using:
SELECT REPLACE(operator_name, ' ', '-') FROM operator_info ;
the string JOHN·DOE············ became JOHN-DOE------------.
I need JOHN-DOE············.
I am thinking I need to use aliasing and the TRIM command but not sure how.
With whatever REPLACE function is built into your particular database.
MySQL:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_replace
Oracle:
http://psoug.org/reference/translate_replace.html
SQLServer:
http://msdn.microsoft.com/en-us/library/ms186862.aspx
Edits below based on your comment.
I've done this in SQLServer syntax so please modify the example as needed. The first example really breaks down what's going on and the second one bunches it all into a single ugly query :D
#output in this case contains your final value.
DECLARE #input VARCHAR (100) = ' some test ';
DECLARE #trimmed VARCHAR (100);
DECLARE #replaced VARCHAR (100);
DECLARE #output VARCHAR (100);
-- Get just the inner text without the preceding / trailing spaces.
SET #trimmed = LTRIM (RTRIM (#input));
-- Replace the spaces *inside* the trimmed text with a dash.
SET #replaced = REPLACE (#trimmed, ' ', '-');
-- Take the original text and replace the trimmed version (with the inner spaces) with the dash version.
SET #output = REPLACE (#input, #trimmed, #replaced);
-- Show each step of the process!
SELECT #input AS INPUT,
#trimmed AS TRIMMED,
#replaced AS REPLACED,
#output AS OUTPUT;
And as a SELECT statement.
DECLARE #inputTable TABLE (Value VARCHAR (100) NOT NULL);
INSERT INTO #inputTable (Value)
VALUES (' some test '),
(' another test ');
SELECT REPLACE (Value,
LTRIM (RTRIM (Value)),
REPLACE (LTRIM (RTRIM (Value)), ' ', '-'))
FROM #inputTable;
If you are using MSSQL:
SELECT REPLACE(field_name,' ','-');
Edit: After the requirement about skipping the trailing spaces.
You can try this one-liner:
SELECT REPLACE(RTRIM(#name), ' ', '-') + SUBSTRING(#name, LEN(RTRIM(#name)) + 1, LEN(#NAME))
However I would recommend that you put it into a user defined function instead.
assuming SQL Server:
update TABLE set column = replace (column, ' ','-')
SELECT REPLACE(field_name,' ','-');
Edit: After the requirement about skipping the trailing spaces. You can try this one-liner:
SELECT REPLACE(RTRIM(#name), ' ', '-') + SUBSTRING(#name, LEN(RTRIM(#name)) + 1;