Obfuscate / Mask / Scramble personal information - sql

I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that make random social security numbers, shift birth dates, scramble emails, etc. But I've come up against a wall trying to scramble customer names. I want to keep real names so we can still use our searches, so random letter generation is out. What I have tried so far is building a temp table of all last names in the table, then updating the customer table with a random selection from that temp table. Like this:
DECLARE @Names TABLE (Id int IDENTITY(1,1), [Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT @Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM @Names WHERE ROWID=Id)
This worked well in test, but completely bogs down with larger amounts of data (>20 minutes for 40K rows).
All of that to ask, how would you scramble customer names while keeping real names and the weight of the production data?
UPDATE: Never fails, you try to put all the information in the post, and you forget something important. This data will also be used in our sales & demo environments which are publicly available. Some of the answers are what I am attempting to do, to 'switch' the names, but my question is literally, how to code in T-SQL?

I use generatedata, an open-source PHP script which can generate all sorts of dummy data.

A very simple solution would be to ROT13 the text.
A better question may be why you feel the need to scramble the data at all. If you have an encryption key, you could also consider running the text through DES or AES or similar. Those would have potential performance issues, however.
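As a rough illustration of the ROT13 idea (reversible, no key management needed), here is a sketch in Python; the same shift could be written in T-SQL with a CHAR/STUFF loop:

```python
import codecs

def rot13(s: str) -> str:
    # ROT13 shifts each letter 13 places; applying it twice restores the input.
    # Non-letters (digits, spaces, punctuation) pass through unchanged.
    return codecs.encode(s, "rot_13")

masked = rot13("Smith")        # scrambled but reversible
original = rot13(masked)       # round trip recovers the name
```

Note that ROT13 is trivially reversible by anyone who recognizes it, so it masks names against casual reading only, not against a determined viewer.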

When doing something like that I usually write a small program that first loads a lot of first and last names into two arrays, and then updates the database using a random name/surname from the arrays. It works really fast even for very big datasets (200,000+ records).

I use a method that changes characters in the name to other characters that are in the same "range" of usage frequency in English names. Apparently, the distribution of characters in names is different than it is for normal conversational English. For example, "x" and "z" each occur 0.245% of the time, so they get swapped. At the other extreme, "w" is used 5.5% of the time, "s" 6.86% and "t" 15.978%; I change "s" to "w", "t" to "s" and "w" to "t".
I keep the vowels "aeio" in a separate group so that a vowel is only replaced by another vowel. Similarly, "q", "u" and "y" are not replaced at all. My grouping and decisions are totally subjective.
I ended up with 7 different "groups" of 2-5 characters, based mostly on frequency. Characters within each group are swapped with other characters in that same group.
The net result is names that kinda look like they might be names, but from "not around here".
Original name | Morphed name
--------------+-------------
Loren         | Nimag
Juanita       | Kuogewso
Tennyson      | Saggywig
David         | Mijsm
Julie         | Kunewa
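The group-swap idea can be sketched outside SQL as well. The groups below are illustrative stand-ins built from the letters discussed above, not the author's exact seven groups:

```python
# Swap each letter with the next one in its frequency group.
# "q", "u", "y" (and anything else not listed) pass through untouched.
GROUPS = ["xz", "kjv", "gnl", "rdm", "fhpbc", "wst", "eioa"]  # illustrative

def build_swap_table(groups):
    table = {}
    for g in groups:
        for i, ch in enumerate(g):
            # rotate within the group: each char maps to the next, wrapping
            nxt = g[(i + 1) % len(g)]
            table[ch] = nxt
            table[ch.upper()] = nxt.upper()
    return table

SWAP = build_swap_table(GROUPS)

def morph(name: str) -> str:
    # vowels only ever map to vowels, consonants to similar-frequency consonants
    return "".join(SWAP.get(c, c) for c in name)
```

With these example groups, morph("David") yields "Mekom": still pronounceable, but from "not around here".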
Here's the SQL I use, which includes a "TitleCase" function. There are 2 different versions of the "morphed" name based on different frequencies of letters I found on the web.
-- from https://stackoverflow.com/a/28712621
-- Convert and return param as Title Case
CREATE FUNCTION [dbo].[fnConvert_TitleCase] (@InputString VARCHAR(4000))
RETURNS VARCHAR(4000) AS
BEGIN
DECLARE @Index INT
DECLARE @Char CHAR(1)
DECLARE @OutputString VARCHAR(255)
SET @OutputString = LOWER(@InputString)
SET @Index = 2
SET @OutputString = STUFF(@OutputString, 1, 1, UPPER(SUBSTRING(@InputString, 1, 1)))
WHILE @Index <= LEN(@InputString)
BEGIN
SET @Char = SUBSTRING(@InputString, @Index, 1)
IF @Char IN (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&', '''', '(', '{', '[', '#')
IF @Index + 1 <= LEN(@InputString)
BEGIN
IF @Char != '''' OR UPPER(SUBSTRING(@InputString, @Index + 1, 1)) != 'S'
SET @OutputString = STUFF(@OutputString, @Index + 1, 1, UPPER(SUBSTRING(@InputString, @Index + 1, 1)))
END
SET @Index = @Index + 1
END
RETURN ISNULL(@OutputString, '')
END
GO
-- 00.045 x 0.045%
-- 00.045 z 0.045%
--
-- Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z')
--
-- 00.456 k 0.456%
-- 00.511 j 0.511%
-- 00.824 v 0.824%
-- kjv
-- Replace(Replace(Replace(Replace(TS_NAME,'k','#'),'j','k'),'v','j'),'#','v')
--
-- 01.642 g 1.642%
-- 02.284 n 2.284%
-- 02.415 l 2.415%
-- gnl
-- Replace(Replace(Replace(Replace(TS_NAME,'g','#'),'n','g'),'l','n'),'#','l')
--
-- 02.826 r 2.826%
-- 03.174 d 3.174%
-- 03.826 m 3.826%
-- rdm
-- Replace(Replace(Replace(Replace(TS_NAME,'r','#'),'d','r'),'m','d'),'#','m')
--
-- 04.027 f 4.027%
-- 04.200 h 4.200%
-- 04.319 p 4.319%
-- 04.434 b 4.434%
-- 05.238 c 5.238%
-- fhpbc
-- Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c')
--
-- 05.497 w 5.497%
-- 06.686 s 6.686%
-- 15.978 t 15.978%
-- wst
-- Replace(Replace(Replace(Replace(TS_NAME,'w','#'),'s','w'),'t','s'),'#','t')
--
--
-- 02.799 e 2.799%
-- 07.294 i 7.294%
-- 07.631 o 7.631%
-- 11.682 a 11.682%
-- eioa
-- Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','#'),'i','e'),'o','i'),'a','o'),'#','a')
--
-- -- dont replace
-- 00.222 q 0.222%
-- 00.763 y 0.763%
-- 01.183 u 1.183%
-- Obfuscate a name
Select
ts_id,
Cast(ts_name as varchar(42)) as [Original Name],
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z'),'k','#'),'j','k'),'v','j'),'#','v'),'g','#'),'n','g'),'l','n'),'#','l'),'r','#'),'d','r'),'m','d'),'#','m'),'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c'),'w','#'),'s','w'),'t','s'),'#','t'),'e','#'),'i','e'),'o','i'),'a','o'),'#','a')) as VarChar(42)) As [morphed name] ,
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','t'),'~','e'),'t','~'),'a','o'),'~','a'),'o','~'),'i','n'),'~','i'),'n','~'),'s','h'),'~','s'),'h','r'),'r','~'),'d','l'),'~','d'),'l','~'),'m','w'),'~','m'),'w','f'),'f','~'),'g','y'),'~','g'),'y','p'),'p','~'),'b','v'),'~','b'),'v','k'),'k','~'),'x','~'),'j','x'),'~','j')) as VarChar(42)) As [morphed name2]
From
ts_users
;

Why not just use some sort of Random Name Generator?

Use a temporary table instead and the query is very fast. I just ran it on 60K rows in 4 seconds. I'll be using this one going forward.
CREATE TABLE #Names
(Id int IDENTITY(1,1),[Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT #Names
SELECT LastName
FROM Customer
ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID]
SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id)
DROP TABLE #Names

The following approach worked for us. Let's say we have 2 tables, Customers and Products:
CREATE FUNCTION [dbo].[GenerateDummyValues]
(
@dataType varchar(100),
@currentValue varchar(4000) = NULL
)
RETURNS varchar(4000)
AS
BEGIN
IF @dataType = 'int'
BEGIN
RETURN '0'
END
ELSE IF @dataType = 'varchar' OR @dataType = 'nvarchar' OR @dataType = 'char' OR @dataType = 'nchar'
BEGIN
RETURN 'AAAA'
END
ELSE IF @dataType = 'datetime'
BEGIN
RETURN CONVERT(varchar(2000), GETDATE())
END
-- you can add more checks, add complicated logic, etc.
RETURN 'XXX'
END
The above function helps generate different data based on the incoming data type.
Now, for each column of each table whose name does not end in "id", use the following query to generate further queries to manipulate the data:
select 'select ''update '' + TABLE_NAME + '' set '' + COLUMN_NAME + '' = '' + '''''''' + dbo.GenerateDummyValues( Data_type,'''') + '''''' where id = '' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, ' + table_name + ' where RIGHT(LOWER(COLUMN_NAME),2) <> ''id'' and TABLE_NAME = '''+ table_name + '''' + ';' from INFORMATION_SCHEMA.TABLES;
When you execute above query it will generate update queries for each table and for each column of that table, for example:
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Customers where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Customers';
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Products where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Products';
Now, when you execute the above queries you will get the final update queries, which will update the data in your tables.
You can execute this on any SQL Server database; no matter how many tables you have, it will generate the queries for you, which can then be executed.
Hope this helps.

Another site to generate shaped fake data sets, with an option for T-SQL output:
https://mockaroo.com/

Here's a way using ROT47, which is reversible, and another which is random. You can add a PK to either to link back to the "unscrambled" versions.
declare @table table (ID int, PLAIN_TEXT nvarchar(4000))
insert into @table
values
(1,N'Some Dudes name'),
(2,N'Another Person Name'),
(3,N'Yet Another Name')
--split your string into a column, and compute the decimal value (N)
if object_id('tempdb..#staging') is not null drop table #staging
select
substring(a.b, v.number+1, 1) as Val
,ascii(substring(a.b, v.number+1, 1)) as N
--,dense_rank() over (order by b) as RN
,a.ID
into #staging
from (select PLAIN_TEXT b, ID FROM @table) a
inner join
master..spt_values v on v.number < len(a.b)
where v.type = 'P'
--select * from #staging
--create a fast tally table of numbers to be used to build the ROT-47 table.
;WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--Here we put it all together with stuff and FOR XML
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,case
when 47 + N <= 126 then char(47 + N)
when 47 + N > 126 then char(N-47)
end as ENCRYPTED_TEXT
from cteTally
where N between 33 and 126) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t
--or if you want really random
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,char((select ROUND(((122 - N -1) * RAND() + N), 0))) as ENCRYPTED_TEXT
from cteTally
where (N between 65 and 122) and N not in (91,92,93,94,95,96)) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t

Encountered the same problem myself and figured out an alternative solution that may work for others.
The idea is to use MD5 on the name and then take the last 3 hex digits of it to map into a table of names. You can do this separately for first name and last name.
3 hex digits represent decimals from 0 to 4095, so we need a list of 4096 first names and 4096 last names.
So conv(right(md5(first_name), 3), 16, 10) (in MySQL syntax) would be an index from 0 to 4095 that could be joined with a table that holds 4096 first names. The same concept could be applied to last names.
Using MD5 (as opposed to a random number) guarantees a name in the original data will always be mapped to the same name in the test data.
You can get a list of names here:
https://gist.github.com/elifiner/cc90fdd387449158829515782936a9a4
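The hash-bucket idea is easy to prototype outside the database. A minimal Python sketch, using a deliberately tiny stand-in pool (the approach above calls for 4096 names so the last 3 hex digits index it directly):

```python
import hashlib

# Hypothetical small pool for illustration; production would load 4096 names.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]

def masked_first_name(real_name: str) -> str:
    # Last 3 hex digits of the MD5 give a stable value in 0..4095; the
    # modulo is only needed because this illustrative pool is smaller.
    digest = hashlib.md5(real_name.encode("utf-8")).hexdigest()
    index = int(digest[-3:], 16) % len(FIRST_NAMES)
    return FIRST_NAMES[index]
```

Because the index is derived from the name rather than from a random draw, the same production name always maps to the same test name across refreshes.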

I am working on this at my company right now -- and it turns out to be a very tricky thing. You want names that are realistic, but that must not reveal any real personal info.
My approach has been to first create a randomized "mapping" of last names to other last names, then use that mapping to change all last names. This is good if you have duplicate name records. Suppose you have 2 "John Smith" records that both represent the same real person. If you changed one record to "John Adams" and the other to "John Best", then your one "person" now has 2 different names! With a mapping, all occurrences of "Smith" get changed to "Jones", and so duplicates ( or even family members ) still end up with the same last name, keeping the data more "realistic".
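The mapping approach described above can be sketched briefly (Python here, purely illustrative; the seed and names are made up):

```python
import random

def build_name_mapping(last_names, seed=42):
    # One fixed mapping for the whole run: every occurrence of a last name
    # gets the same substitute, so duplicates (and families) stay together.
    distinct = sorted(set(last_names))
    shuffled = distinct[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(distinct, shuffled))

names = ["Smith", "Jones", "Smith", "Best", "Adams"]
mapping = build_name_mapping(names)
masked = [mapping[n] for n in names]  # both "Smith" rows map identically
```

The mapping is a permutation of the distinct names, so the overall distribution of surnames in the scrambled data matches production.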
I will also have to scramble the addresses, phone numbers, bank account numbers, etc...and I am not sure how I will approach those. Keeping the data "realistic" while scrambling is certainly a deep topic. This must have been done many times by many companies -- who has done this before? What did you learn?

Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.
Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.
I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.
If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random swap: rather than building a temp table, run swaps between each row and a random row in the table, exchanging one piece of data at a time.
The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.
It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.
Edit (2): Off-the-cuff answer in T-SQL:
declare @name varchar(50)
set @name = (SELECT lastName FROM person WHERE personID = (random id number))
UPDATE person
SET lastname = @name
WHERE personID = (person id of current row)
Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.
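For reference, the Fisher-Yates shuffle mentioned above looks like this (a minimal Python sketch; the console-app variant would read the column, shuffle, and write back):

```python
import random

def fisher_yates_shuffle(rows, rng=None):
    # Classic Fisher-Yates: walk from the end, swapping each element with a
    # random position at or before it. O(n) swaps, every ordering equally likely.
    rng = rng or random.Random()
    for i in range(len(rows) - 1, 0, -1):
        j = rng.randint(0, i)
        rows[i], rows[j] = rows[j], rows[i]
    return rows

last_names = ["Smith", "Jones", "Adams", "Best"]
shuffled = fisher_yates_shuffle(last_names[:], random.Random(0))
```

The shuffle is a pure permutation: every name survives exactly once, so value frequencies (the "weight" of the production data) are preserved.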

Related

How to count words in specific column against matching words in another table

I want to be able to:
extract specific words from column1 in Table1 - but only the words that are matched from Table2 from a column called word,
perform a(n individual) count of the number of words that have been found, and
put this information into a permanent table with a format, that looks like:
Final
Word | Count
--------+------
Test | 7
Blue | 5
Have | 2
Currently I have tried this:
INSERT INTO final (word, count)
SELECT
extext
, SUM(dbo.WordRepeatedNumTimes(extext, 'test')) AS Count
FROM [dbo].[TestSite_Info], [dbo].[table_words]
WHERE [dbo].[TestSite_Info].ExText = [dbo].[table_words].Words
GROUP BY ExText;
The function dbo.WordRepeatedNumTimes is:
ALTER FUNCTION [dbo].[WordRepeatedNumTimes]
(@SourceString varchar(8000), @TargetWord varchar(8000))
RETURNS int
AS
BEGIN
DECLARE @NumTimesRepeated int
,@CurrentStringPosition int
,@LengthOfString int
,@PatternStartsAtPosition int
,@LengthOfTargetWord int
,@NewSourceString varchar(8000)
SET @LengthOfTargetWord = LEN(@TargetWord)
SET @LengthOfString = LEN(@SourceString)
SET @NumTimesRepeated = 0
SET @CurrentStringPosition = 0
SET @PatternStartsAtPosition = 0
SET @NewSourceString = @SourceString
WHILE LEN(@NewSourceString) >= @LengthOfTargetWord
BEGIN
SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
IF @PatternStartsAtPosition <> 0
BEGIN
SET @NumTimesRepeated = @NumTimesRepeated + 1
SET @CurrentStringPosition = @CurrentStringPosition + @PatternStartsAtPosition + @LengthOfTargetWord
SET @NewSourceString = SUBSTRING(@NewSourceString, @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
END
ELSE
BEGIN
SET @NewSourceString = ''
END
END
RETURN @NumTimesRepeated
END
When I run the above INSERT statement, no record is inserted.
In the table TestSite_Info is a column called Extext. Within this column, there is random text - one of the words being 'test'.
In the other table called Table_Words, I have a column called Words and one of the words in there is 'Test'. So in theory, as the word is a match, I would pick it up, put it into the table Final, and then next to the word (in another column) the count of how many times the word has been found within TestSite_Info.Extext.
Table_Words
id|word
--+----
1 |Test
2 |Onsite
3 |Here
4 |As
TestSite_Info
ExText
-------------------------------------------------
This is a test, onsite test , test test i am here
The expected Final table has been given at the top.
-- Update
Now that I have run Abecee's block of code, it actually works in terms of bringing back a count column and the id relating to the word.
Here are the results:
id|total
--+----
169 |3
170 |0
171 |5
172 |7
173 |1
174 |3
Taken from the following text, which it is extracting from:
Test test and I went to and this was a test I'm writing rubbish hello
but I don't care about care and care seems care to be the word that you will see appear
four times as well as word word word word word, but a .!
who knows whats going on here.
So as you can see, the count for ID 172 comes out as 7 (see below for which word each ID refers to), which is incorrect: it should be 6 (it has added +1 for some reason). The same goes for ID 171, the word "care", which appears 4 times but shows a count of 5. Any ideas why this would be?
Also, what I was really after was a final table showing, as you have quite kindly done, the ID and count, but also the word it relates to, so I don't have to join back through the ID table to see what the actual word is.
Word|id
--+----
as |174
here |173
word |172
care |171
hello |170
test |169
You could work along the updated query:
WITH
Detail AS (
SELECT
W.id
, W.word
, T.extext
, (LEN(REPLACE(T.extext, ' ', ' ')) + 2
- LEN(REPLACE(' '
+ UPPER(REPLACE(REPLACE(REPLACE(REPLACE(T.extext, ' ', ' '), ':', ' '), '.', ' '), ',', ' '))
+ ' ', ' ' + UPPER(W.word) + ' ', '')) - 1
) / (LEN(W.word) + 2) count
FROM Table_Words W
JOIN TestSite_Info T
ON CHARINDEX(UPPER(W.word), UPPER(T.extext)) > 0
)
INSERT INTO Result
SELECT
id
, SUM(count) total
FROM Detail
GROUP BY id
;
(Had forgotten to count in the blanks added to the front and the end, missed a sign change, and got mixed up as for the length of the word(s) surrounded by blanks. Sorry about that. Thanks for testing it more thoroughly than I did originally!)
Tested on SQL Server 2008: Updated SQL Fiddle and 2012: Updated SQL Fiddle.
And with your test case as well.
It:
is pure SQL (no UDF required),
has room for some tuning:
Store words all lower / all upper case, unless case matters (Which would require to adjust the suggested solution.)
Store strings to check with all punctuation marks removed.
Please comment if and as further detail is required.
From what I could understand, this might do the job. However, things will be clearer if you post the schema:
create table final(
word varchar(100),
count integer
);
insert into final (word, count)
select column1, count(*)
from table1, table2
where table1.column1 = table2.words
group by column1;
Thanks for all your help.
The best approach for this solution was:
WITH
Detail AS (
SELECT
W.id
, W.word
, T.extext
, (LEN(REPLACE(T.extext, ' ', ' ')) + 2
- LEN(REPLACE(' '
+ UPPER(REPLACE(REPLACE(REPLACE(REPLACE(T.extext, ' ', ' '), ':', ' '), '.', ' '), ',', ' '))
+ ' ', ' ' + UPPER(W.word) + ' ', '')) - 1
) / (LEN(W.word) + 2) count
FROM Table_Words W
JOIN TestSite_Info T
ON CHARINDEX(UPPER(W.word), UPPER(T.extext)) > 0
)
INSERT INTO Result
SELECT
id
, SUM(count) total
FROM Detail
GROUP BY id
Re-reading the problem description, the presented SELECT seems to try to align/join full "exText" strings with individual "words" - but based on equality. Then, it has "exText" in the SELECT list whilst "Result" seems to wait for individual words. (That would not fail the INSERT, though, as long as the field is not guarded by a foreign key constraint. But as the WHERE/JOIN is not likely to let any data through, this is probably never coming up as an issue anyway.)
For an alternative to the pure declarative approach, you might want to try along
INSERT INTO final (word, count)
SELECT
word
, SUM(dbo.WordRepeatedNumTimes(extext, word)) AS Count
FROM [dbo].[TestSite_Info], [dbo].[table_words]
WHERE CHARINDEX(UPPER([dbo].[table_words].Word), UPPER([dbo].[TestSite_Info].ExText)) > 0
GROUP BY word;
You have "Word" in your "Table_Words" description - but use "[dbo].[table_words].Words" in your WHERE condition…

SQL Server : find percentage match of LIKE string

I'm trying to write a query to find the percentage match of a search string in a notes or TEXT column.
This is what I'm starting with:
SELECT *
FROM NOTES
WHERE UPPER(NARRATIVE) LIKE 'PAID CALLED RECEIVED'
Ultimately, what I want to do is:
Split the search string by spaces and search individually for all words in the string
Order the results descending based on percentage match
For example, in the above scenario, each word in the search string would constitute 33.333% of the total. A NARRATIVE with 3 matches (100%) should be at the top of the results, while a match containing 2 of the keywords (66.666%) would be lower, and a match containing 1 of the keywords (33.333%) would be even lower.
I then want to display the resulting percentage match for that row in a column, along with all the other columns from that table (*).
Hopefully, this makes sense and can be done. Any thoughts on how to proceed? This MUST all be done in SQL Server, and I would prefer not to write any CTEs.
Thank you in advance for any guidance.
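The scoring described above (equal share per keyword, substring match, highest percentage first) can be prototyped quickly. A Python sketch of the same ranking, with made-up narratives:

```python
def match_percentage(narrative: str, search: str) -> float:
    # Each keyword contributes an equal share (3 keywords -> 33.333% each).
    # Membership test mirrors a case-insensitive LIKE '%word%' per keyword.
    words = search.upper().split()
    hits = sum(1 for w in words if w in narrative.upper())
    return 100.0 * hits / len(words)

rows = ["PAID IN FULL", "CALLED AND PAID", "RECEIVED PAYMENT, CALLED, PAID"]
ranked = sorted(rows,
                key=lambda r: match_percentage(r, "PAID CALLED RECEIVED"),
                reverse=True)  # 100% match first, then 66.6%, then 33.3%
```

Note the substring semantics: like LIKE '%PAID%', a keyword also matches inside a longer word, which may or may not be what you want.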
Here is what I came up with:
DECLARE @VISIT VARCHAR(25) = '999232'
DECLARE @KEYWORD VARCHAR(100) = 'PAID,CALLED,RECEIVED'
DECLARE SPLIT_CURSOR CURSOR FOR
SELECT RTRIM(LTRIM(VALUE)) FROM Rpt_Split(@KEYWORD, ',')
IF OBJECT_ID('tempdb..#NOTES_FF_SEARCH') IS NOT NULL DROP TABLE #NOTES_FF_SEARCH
SELECT N.VISIT_NO
,N.CREATE_DATE
,N.CREATE_BY
,N.NARRATIVE
,0E8 AS PERCENTAGE
INTO #NOTES_FF_SEARCH
FROM NOTES_FF AS N
WHERE N.VISIT_NO = @VISIT
DECLARE @KEYWORD_VALUE AS VARCHAR(255)
OPEN SPLIT_CURSOR
FETCH NEXT FROM SPLIT_CURSOR INTO @KEYWORD_VALUE
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE #NOTES_FF_SEARCH
SET PERCENTAGE = PERCENTAGE + (100 / @@CURSOR_ROWS)
WHERE UPPER(NARRATIVE) LIKE '%' + UPPER(@KEYWORD_VALUE) + '%'
FETCH NEXT FROM SPLIT_CURSOR INTO @KEYWORD_VALUE
END
CLOSE SPLIT_CURSOR
DEALLOCATE SPLIT_CURSOR
SELECT * FROM #NOTES_FF_SEARCH
WHERE PERCENTAGE > 0
ORDER BY PERCENTAGE, CREATE_DATE DESC
There may be a more efficient way to do this, but every other road I started down ended in a dead end. Thanks for your help.
If you want to do a "percentage" match, you need to do two things: calculate the number of words in the string and calculate the number of words you care about. Before giving some guidance, I will say that full text search probably does everything you want and much more efficiently.
Assuming the search string has space delimited words, you can count the words with the expression:
(len(narrative) - len(replace(narrative, ' ', '')) + 1) as NumWords
You can count the matching words with successive replaces. So, for the keywords, it would be something like removing each keyword, fixing the spaces, and counting the remaining words.
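The length-difference trick is worth pinning down, since it drives the whole percentage. A one-line Python equivalent (assuming single-space-delimited text, as the SQL does):

```python
def num_words(s: str) -> int:
    # Removing the spaces shortens the string by exactly the number of
    # spaces, and for single-space-delimited text, words = spaces + 1.
    return len(s) - len(s.replace(" ", "")) + 1
```

Runs of multiple spaces or leading/trailing spaces inflate the count, which is why the narrative usually needs cleaning first.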
The overall code is best represented with subqueries. The resulting query is something like:
select n.*
from (select n.*,
             (len(narrative) - len(replace(narrative, ' ', '')) + 1.0) as NumWords,
             ltrim(rtrim(replace(replace(replace(narrative + ' ', @keyword1 + ' ', ''),
                                         @keyword2 + ' ', ''),
                                 @keyword3 + ' ', ''))) as NoKeywords
      from notes n
     ) n
order by 1 - ((len(NoKeywords) - len(replace(NoKeywords, ' ', '')) + 1.0) / NumWords) desc;
SQL Server -- as with many databases -- is not particularly good at parsing strings. You can do that outside the query and assign the @keyword variables accordingly.

SQL query--String Permutations

I am trying to create a query using a db on OpenOffice where a string is entered in the query, and all permutations of the string are searched in the database and the matches are displayed. My database has fields for a word and its definition, so if I am looking for GOOD I will get its definition as well as the definition for DOG.
You'll need a third column as well. In this column you'll have the word - but with the letters sorted in alphabetical order. For example, you'll have the word APPLE and in the next column the word AELPP.
You would sort the word you're looking for, and run some SQL code like
WHERE sorted_words = 'my_sorted_word'
for the word apple, you would get something like this:
sorted | unsorted
-------+---------
AELPP  | APPLE
AELPP  | PEPLA
AELPP  | APPEL
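The sorted-letters column amounts to a canonical key for exact anagrams. A minimal sketch (Python, with a made-up word list):

```python
# Store each word under the sorted form of its letters; look up by the
# sorted form of the search word. This finds exact anagrams only.
words = ["APPLE", "PEPLA", "APPEL", "DOG", "GOOD"]

def letter_key(word: str) -> str:
    return "".join(sorted(word))

index = {}
for w in words:
    index.setdefault(letter_key(w), []).append(w)

def anagrams(search: str):
    return index.get(letter_key(search), [])
```

Subset matches (LEAP or PEA from APPLE) need the recursive letter-removal step described next; the sorted key alone only catches full-length rearrangements.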
Now, you also wanted - correct me if I'm wrong - all the words that can be made with any combination of the letters, meaning APPLE also returns words like LEAP and PEA.
To do this, you would have to use some programming language: you would have to write a function that performed the above recursively. For example, for the word AELPP you have
ELPP
ALPP
AEPP
and so forth... (each time subtracting one letter in every combination, and then two letters in every combination possible, etc.)
Basically, you can't easily do permutations in single SQL statement. You can easily do them in another language though, for example here's how to do it in C#: http://msdn.microsoft.com/en-us/magazine/cc163513.aspx
Ok, corrected version that I think handles all situations. This will work in MS SQL Server, so you may need to adjust it for your RDBMS as far as using the local table and the REPLICATE function. It assumes a passed parameter called @search_string. Also, since it's using VARCHAR instead of NVARCHAR, if you're using extended characters be sure to change that.
One last point that I'm just thinking of now... it will allow duplication of letters. For example, "GOOD" would find "DODO" even though there is only one "D" in "GOOD". It will NOT find words of greater length than your original word though. In other words, while it would find "DODO", it wouldn't find "DODODO". Maybe this will give you a starting point to work from though depending on your exact requirements.
DECLARE @search_table TABLE (search_string VARCHAR(4000))
DECLARE @i INT
SET @i = 1
WHILE (@i <= LEN(@search_string))
BEGIN
INSERT INTO @search_table (search_string)
VALUES (REPLICATE('[' + @search_string + ']', @i))
SET @i = @i + 1
END
SELECT
word,
definition
FROM
My_Words W
INNER JOIN @search_table ST ON W.word LIKE ST.search_string
The original query before my edit, just to have it here:
SELECT
word,
definition
FROM
My_Words
WHERE
word LIKE REPLICATE('[' + @search_string + ']', LEN(@search_string))
Maybe this can help. Suppose you have an auxiliary Numbers table with integer numbers.
DECLARE @s VARCHAR(5);
SET @s = 'ABCDE';
WITH Subsets AS (
SELECT CAST(SUBSTRING(@s, Number, 1) AS VARCHAR(5)) AS Token,
CAST('.'+CAST(Number AS CHAR(1))+'.' AS VARCHAR(11)) AS Permutation,
CAST(1 AS INT) AS Iteration
FROM dbo.Numbers WHERE Number BETWEEN 1 AND 5
UNION ALL
SELECT CAST(Token+SUBSTRING(@s, Number, 1) AS VARCHAR(5)) AS Token,
CAST(Permutation+CAST(Number AS CHAR(1))+'.' AS VARCHAR(11)) AS
Permutation,
s.Iteration + 1 AS Iteration
FROM Subsets s JOIN dbo.Numbers n ON s.Permutation NOT LIKE
'%.'+CAST(Number AS CHAR(1))+'.%' AND s.Iteration < 5 AND Number
BETWEEN 1 AND 5
--AND s.Iteration = (SELECT MAX(Iteration) FROM Subsets)
)
SELECT * FROM Subsets
WHERE Iteration = 5
ORDER BY Permutation
Token Permutation Iteration
----- ----------- -----------
ABCDE .1.2.3.4.5. 5
ABCED .1.2.3.5.4. 5
ABDCE .1.2.4.3.5. 5
(snip)
EDBCA .5.4.2.3.1. 5
EDCAB .5.4.3.1.2. 5
EDCBA .5.4.3.2.1. 5
(120 row(s) affected)

How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table?

I have a table with a couple thousand rows. The description and summary fields are NTEXT, and sometimes have non-ASCII chars in them. How can I locate all of the rows with non ASCII characters?
I have sometimes been using this "cast" statement to find "strange" chars
select
*
from
<Table>
where
<Field> != cast(<Field> as varchar(1000))
First build a string with all the characters you're not interested in (the example uses the 0x20 - 0x7F range, or 7 bits without the control characters.) Each character is prefixed with |, for use in the escape clause later.
-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)
-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
begin
-- Uses | to escape, could be any character
set @str = @str + '|' + char(@i)
set @i = @i + 1
end
The next snippet searches for any character that is not in the list. The % matches 0 or more characters. The [] matches one of the characters inside the [], for example [abc] would match either a, b or c. The ^ negates the list, for example [^abc] would match anything that's not a, b, or c.
select *
from yourtable
where yourfield like '%[^' + @str + ']%' escape '|'
The escape character is required because otherwise searching for characters like ], % or _ would mess up the LIKE expression.
Hope this is useful, and thanks to JohnFX's comment on the other answer.
Here ya go:
SELECT *
FROM Objects
WHERE
ObjectKey LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?#\[\^_`{|}~\]\\]%' ESCAPE '\'
It's probably not the best solution, but maybe a query like:
SELECT *
FROM yourTable
WHERE yourTable.yourColumn LIKE '%[^0-9a-zA-Z]%'
Replace the "0-9a-zA-Z" expression with something that captures the full ASCII set (or a subset that your data contains).
Technically, I believe that an NCHAR(1) is a valid ASCII character if and only if UNICODE(@NChar) < 256 and ASCII(@NChar) = UNICODE(@NChar), though that may not be exactly what you intended. Therefore this would be a correct solution:
;With cteNumbers as
(
Select ROW_NUMBER() Over(Order By c1.object_id) as N
From sys.system_columns c1, sys.system_columns c2
)
Select Distinct RowID
From YourTable t
Join cteNumbers n ON n.N <= Len(CAST(TXT As NVarchar(MAX)))
Where UNICODE(Substring(TXT, n.N, 1)) > 255
OR UNICODE(Substring(TXT, n.N, 1)) <> ASCII(Substring(TXT, n.N, 1))
This should also be very fast.
I started with @CC1960's solution but found an interesting use case that caused it to fail. It seems that SQL Server will equate certain Unicode characters to their non-Unicode approximations. For example, SQL Server considers the Unicode character "fullwidth comma" (http://www.fileformat.info/info/unicode/char/ff0c/index.htm) the same as a standard ASCII comma when compared in a WHERE clause.
To get around this, have SQL Server compare the strings as binary. But remember, nvarchar and varchar binaries don't match up (16-bit vs 8-bit), so you need to convert your varchar back up to nvarchar again before doing the binary comparison:
select *
from my_table
where CONVERT(binary(5000),my_table.my_column) != CONVERT(binary(5000),CONVERT(nvarchar(1000),CONVERT(varchar(1000),my_table.my_column)))
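The round-trip idea generalizes: a character is "lost" exactly when converting to a single-byte representation and back changes the string. A Python sketch, using Latin-1 as a stand-in for the column's varchar code page (an assumption; the actual code page depends on collation):

```python
def lost_in_single_byte(text: str) -> bool:
    # Mimic nvarchar -> varchar -> nvarchar: encode to a one-byte codepage
    # (unmappable chars become '?') and decode back. Any difference means
    # the text contained characters varchar cannot represent faithfully.
    roundtrip = text.encode("latin-1", errors="replace").decode("latin-1")
    return roundtrip != text
```

The fullwidth comma U+FF0C, for instance, has no Latin-1 mapping, so the round trip changes it and the comparison flags the row, just as the binary comparison above does.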
If you are looking for a specific unicode character, you might use something like below.
select Fieldname from
(
select Fieldname,
REPLACE(Fieldname COLLATE Latin1_General_BIN,
NCHAR(65533) COLLATE Latin1_General_BIN,
'CustomText123') replacedcol
from table
) results where results.replacedcol like '%CustomText123%'
My previous answer was confusing Unicode and non-Unicode data. Here is a solution that should work for all situations, although I'm still running into some anomalies: certain non-ASCII Unicode superscript characters seem to be matched as if they were the plain digit characters. You might be able to play around with collations to get around that.
Hopefully you already have a numbers table in your database (they can be very useful), but just in case I've included the code to partially fill one as well.
You also might need to play around with the numeric range, since Unicode code points can go well beyond 255.
CREATE TABLE dbo.Numbers
(
number INT NOT NULL,
CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (number)
)
GO
DECLARE @i INT
SET @i = 0
WHILE @i < 1000
BEGIN
INSERT INTO dbo.Numbers (number) VALUES (@i)
SET @i = @i + 1
END
GO
SELECT
T.id, N.number, NCHAR(N.number) AS offending_char
FROM
dbo.Numbers N
INNER JOIN dbo.My_Table T ON
T.description LIKE N'%' + NCHAR(N.number) + N'%' OR
T.summary LIKE N'%' + NCHAR(N.number) + N'%'
WHERE
N.number BETWEEN 127 AND 255
ORDER BY
T.id, N.number
GO
-- This is a very, very inefficient way of doing it but should be OK for
-- small tables. It uses an auxiliary table of numbers as per Itzik Ben-Gan and simply
-- looks for characters with bit 7 set.
SELECT *
FROM yourTable as t
WHERE EXISTS ( SELECT *
FROM msdb..Nums as NaturalNumbers
WHERE NaturalNumbers.n <= LEN(t.string_column)
AND ASCII(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)
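If you don't have an auxiliary numbers table handy, a small tally can be generated inline from the system catalogs (a sketch; adjust the TOP to the longest string you expect):

```sql
-- Inline tally of 1..8000 built from a catalog cross join, used in place
-- of the msdb..Nums auxiliary table in the query above.
WITH Nums AS
(
    SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT *
FROM yourTable AS t
WHERE EXISTS ( SELECT *
               FROM Nums AS NaturalNumbers
               WHERE NaturalNumbers.n <= LEN(t.string_column)
                 AND ASCII(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)
```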

Uppercase first two characters in a column in a db table

I've got a column in a database table (SQL Server 2005) that contains data like this:
TQ7394
SZ910284
T r1534
su8472
I would like to update this column so that the first two characters are uppercase. I would also like to remove any spaces between the first two characters. So T q1234 would become TQ1234.
The solution should be able to cope with multiple spaces between the first two characters.
Is this possible in T-SQL? How about in ANSI-92? I'm always interested in seeing how this is done in other db's too, so feel free to post answers for PostgreSQL, MySQL, et al.
Here is a solution:
EDIT: Updated to support replacement of multiple spaces between the first and the second non-space characters
/* TEST TABLE */
DECLARE @T AS TABLE(code Varchar(20))
INSERT INTO @T SELECT 'ab1234x1' UNION SELECT ' ab1234x2'
UNION SELECT ' ab1234x3' UNION SELECT 'a b1234x4'
UNION SELECT 'a b1234x5' UNION SELECT 'a b1234x6'
UNION SELECT 'ab 1234x7' UNION SELECT 'ab 1234x8'
SELECT * FROM @T
/* INPUT
code
--------------------
ab1234x3
ab1234x2
a b1234x6
a b1234x5
a b1234x4
ab 1234x8
ab 1234x7
ab1234x1
*/
/* START PROCESSING SECTION */
DECLARE @s Varchar(20)
DECLARE @firstChar INT
DECLARE @secondChar INT
UPDATE @T SET
@firstChar = PATINDEX('%[^ ]%',code)
,@secondChar = @firstChar + PATINDEX('%[^ ]%', STUFF(code,1, @firstChar,'' ) )
,@s = STUFF(
code,
1,
@secondChar,
REPLACE(LEFT(code,
@secondChar
),' ','')
)
,@s = STUFF(
@s,
1,
2,
UPPER(LEFT(@s,2))
)
,code = @s
/* END PROCESSING SECTION */
SELECT * FROM @T
/* OUTPUT
code
--------------------
AB1234x3
AB1234x2
AB1234x6
AB1234x5
AB1234x4
AB 1234x8
AB 1234x7
AB1234x1
*/
UPDATE YourTable
SET YourColumn = UPPER(
SUBSTRING(
REPLACE(YourColumn, ' ', ''), 1, 2
)
)
+
SUBSTRING(YourColumn, 3, LEN(YourColumn))
UPPER isn't going to hurt any numbers, so if the examples you gave are completely representative, there's not really any harm in doing:
UPDATE tbl
SET col = REPLACE(UPPER(col), ' ', '')
The sample data only has spaces and lowercase letters at the start. If this holds true for the real data then simply:
UPPER(REPLACE(YourColumn, ' ', ''))
For a more specific answer I'd politely ask you to expand on your spec, otherwise I'd have to code around all the other possibilities (e.g. values of less than three characters) without knowing if I was overengineering my solution to handle data that wouldn't actually arise in reality :)
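Applied to the sample values from the question, that expression behaves like this (a sketch using a VALUES table constructor, which needs SQL Server 2008+, rather than a real table):

```sql
-- Uppercase everything and strip all spaces; fine as long as spaces and
-- lowercase letters only ever occur in the two-character prefix.
SELECT v.code, UPPER(REPLACE(v.code, ' ', '')) AS fixed
FROM (VALUES ('TQ7394'), ('SZ910284'), ('T r1534'), ('su8472')) AS v(code)
-- fixed: TQ7394, SZ910284, TR1534, SU8472
```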
As ever, once you've fixed the data, put in a database constraint to ensure the bad data does not reoccur e.g.
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_1_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 1, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_2_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 2, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
@huo73: yours doesn't work for me on SQL Server 2008: I get 'TRr1534' instead of 'TR1534'.
update Table set Column = case when len(rtrim(substring(Column, 1, 2))) < 2
then UPPER(substring(Column, 1, 1) + substring(Column, 3, 1)) + substring(Column, 4, len(Column))
else UPPER(substring(Column, 1, 2)) + substring(Column, 3, len(Column)) end
This relies on the fact that if there is a space, trimming that part of the string yields a length less than 2, so we split the string into three parts and apply UPPER to the 1st and 3rd characters. In all other cases we can split the string in two and uppercase the first two characters.
If you are doing an UPDATE, I would do it in 2 steps; first get rid of the space (RTRIM on a SUBSTRING), and second do the UPPER on the first 2 chars:
-- uses a fixed column length - 20-odd in this case
UPDATE FOO
SET bar = RTRIM(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
UPDATE FOO
SET bar = UPPER(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
If you need it in a SELECT (i.e. inline), then I'd be tempted to write a scalar UDF
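Such a UDF might look something like this (a sketch; the function name and the 20-character length are my assumptions, mirroring the two-step UPDATE above):

```sql
-- Hypothetical scalar UDF: removes the space from within the two-character
-- prefix, then uppercases the first two characters, leaving the rest alone.
CREATE FUNCTION dbo.FixPrefix (@s varchar(20))
RETURNS varchar(20)
AS
BEGIN
    SET @s = RTRIM(SUBSTRING(@s, 1, 2)) + SUBSTRING(@s, 3, 20)
    RETURN UPPER(SUBSTRING(@s, 1, 2)) + SUBSTRING(@s, 3, 20)
END
```

You could then use it inline, e.g. `SELECT dbo.FixPrefix(bar) FROM FOO`, though a scalar UDF in a large SELECT carries a per-row cost.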