I have a SQL query like this:
SELECT *, dbo.func(@param1, a.point) as fValue
FROM dbo.table AS a
WHERE dbo.func(@param1, a.point) < @param2
When this query is executed only once, everything is fine, but when I have an array of input @param1 values, let's say over 100 values, executing it and fetching results for every value takes a lot of time.
Is it possible to pass an array of @param1 values into the query somehow, and receive a dataset for all the input values, instead of executing it for each value?
The function func() does some math on 2 values. @param1 and a.point are of type double, and, yeah, a.point is not an ID and it is not a unique value.
I know it should be really easy, but it looks like I'm missing something.
You still need to execute that function 100 times for each row, right? I don't see any shortcuts here.
If you wanted to get them all at once, you could do
SELECT dbo.func(@param1, a.point) as fValue1,
       dbo.func(@param2, a.point) as fValue2 ...
or something like that, but looping through them just seems more efficient to me anyway.
I suppose you could use a cursor to retrieve each a.point value once, then act on it 100 times, but that's a lot of coding, and not necessarily a simpler solution.
What exactly does dbo.func() do? Is it possible that you could insert the 100 values into a table structure, and perform that operation on the set of 100 all at once, instead of 1x1 100 times?
As an example, let's say you have this function, which just turns a comma-separated list of float values into a single-column table:
CREATE FUNCTION dbo.ListFloats
(
    @List VARCHAR(MAX)
)
RETURNS TABLE
AS
RETURN
(
    SELECT i = CONVERT(FLOAT, Item)
    FROM
    (
        SELECT Item = x.i.value('(./text())[1]', 'FLOAT')
        FROM
        (
            SELECT [XML] = CONVERT(XML, '<i>'
                + REPLACE(@List, ',', '</i><i>')
                + '</i>').query('.')
        ) AS a
        CROSS APPLY
        [XML].nodes('i') AS x(i)
    ) AS y
    WHERE Item IS NOT NULL
);
GO
Now you should be able to get your floats in a set by simply saying:
SELECT i FROM dbo.ListFloats('1.5, 3.0, 2.45, 1.9');
Taking that a step further, let's say dbo.func() takes these two inputs and says something like:
RETURN (SELECT (@param1 + @param2) / @param2);
Now, I know that you've always been told that modularization and encapsulation are good, but when the function is this simple, I would suggest you avoid the function call entirely (again, you haven't explained what dbo.func() does, so I'm just guessing this will be easy) and do the math inline. So instead of calling dbo.func() - twice for each row, no less - you can just say:
DECLARE
    @Param1Array VARCHAR(MAX) = '1.5, 3.0, 2.45, 1.9',
    @Param2 FLOAT = 2.0;

WITH x AS
(
    SELECT t.point, x.i, fValue = ((x.i + t.point) / t.point)
    FROM dbo.[table] AS t
    CROSS JOIN dbo.ListFloats(@Param1Array) AS x
)
SELECT point, i, fValue FROM x
--WHERE fValue < @Param2
;
The keys are:
Avoiding processing each parameter individually.
Avoiding performing the calculation off in its own separate module.
Performing calculations as few times as possible.
If you can't change the structure this much, then at the very least, avoid calculating the function twice by writing instead:
;WITH x AS
(
    SELECT *, dbo.func(@param1, a.point) as fValue
    FROM dbo.table AS a
)
SELECT * FROM x
WHERE fValue < @param2;
If you provide details about the data types, what dbo.func() does, etc., people will be able to provide more tangible advice.
Do you have any indexes on this table? If you have an index on a.point, you will never hit it using this code, i.e., it will always table scan. This has to do with search arguments, or SARGs (you can Google this). Example:
If you have table xTable with index on column xColumn, then this:
select colA, colB from xTable where xColumn/2 >= 5
will never use the index, but this probably will:
select colA, colB from xTable where xColumn >=10
So you might need something like this:
WHERE a.point < OtherFunc(@param1, @param2)
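To make that concrete, here is a purely hypothetical illustration: if dbo.func(@param1, a.point) computed (@param1 + a.point) / a.point, and a.point and both parameters were known to be positive with @param2 > 1, the predicate could be rearranged algebraically so the indexed column stands alone:
-- (@param1 + a.point) / a.point < @param2
-- rearranges to: a.point > @param1 / (@param2 - 1)
SELECT *
FROM dbo.[table] AS a
WHERE a.point > @param1 / (@param2 - 1); -- an index on point can now be used for a seek
Whether such a rearrangement exists depends entirely on what dbo.func() actually does.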
I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that make random social security numbers, shift birth dates, scramble emails, etc. But I've come up against a wall trying to scramble customer names. I want to keep real names so we can still use our searches, so random letter generation is out. What I have tried so far is building a temp table of all the last names in the table, then updating the customer table with a random selection from that table. Like this:
DECLARE @Names TABLE (Id int IDENTITY(1,1), [Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT @Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName = (SELECT [Name] FROM @Names WHERE ROWID = Id)
This worked well in test, but completely bogs down with larger amounts of data (>20 minutes for 40K rows).
All of that to ask: how would you scramble customer names while keeping real names and the weight of the production data?
UPDATE: Never fails - you try to put all the information in the post, and you forget something important. This data will also be used in our sales & demo environments, which are publicly available. Some of the answers are what I am attempting to do, to 'switch' the names, but my question is literally: how do I code that in T-SQL?
I use generatedata. It is an open-source PHP script which can generate all sorts of dummy data.
A very simple solution would be to ROT13 the text.
A better question may be why you feel the need to scramble the data at all. If you have an encryption key, you could also consider running the text through DES or AES or similar. Those would have potential performance issues, however.
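A minimal sketch of the ROT13 idea in T-SQL, assuming SQL Server 2017+ for TRANSLATE and a hypothetical Customer table (a case-sensitive collation is forced so upper- and lowercase letters map independently):
SELECT TRANSLATE(LastName COLLATE Latin1_General_CS_AS,
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
    'nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM') AS Rot13Name
FROM Customer;
Applying the same expression to the output returns the original text, which is both the convenience and the weakness of ROT13.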
When doing something like that, I usually write a small program that first loads a lot of names and surnames into two arrays, and then just updates the database using a random name/surname from the arrays. It works really fast even for very big datasets (200,000+ records).
I use a method that changes characters in the name to other characters that are in the same "range" of usage frequency in English names. Apparently, the distribution of characters in names is different than it is for normal conversational English. For example, "x" and "z" occur 0.045% of the time, so they get swapped. At the other extreme, "w" is used 5.497% of the time, "s" 6.686% and "t" 15.978%. I change "s" to "w", "t" to "s" and "w" to "t".
I keep the vowels "aeio" in a separate group so that a vowel is only replaced by another vowel. Similarly, "q", "u" and "y" are not replaced at all. My grouping and decisions are totally subjective.
I ended up with 7 different "groups" of 2-5 characters, based mostly on frequency. Characters within each group are swapped with other chars in that same group.
The net result is names that kinda look like they might be names, but from "not around here".
Original name Morphed name
Loren Nimag
Juanita Kuogewso
Tennyson Saggywig
David Mijsm
Julie Kunewa
Here's the SQL I use, which includes a "TitleCase" function. There are 2 different versions of the "morphed" name based on different frequencies of letters I found on the web.
-- from https://stackoverflow.com/a/28712621
-- Convert and return param as Title Case
CREATE FUNCTION [dbo].[fnConvert_TitleCase] (@InputString VARCHAR(4000))
RETURNS VARCHAR(4000)
AS
BEGIN
    DECLARE @Index INT
    DECLARE @Char CHAR(1)
    DECLARE @OutputString VARCHAR(4000)
    SET @OutputString = LOWER(@InputString)
    SET @Index = 2
    SET @OutputString = STUFF(@OutputString, 1, 1, UPPER(SUBSTRING(@InputString, 1, 1)))
    WHILE @Index <= LEN(@InputString)
    BEGIN
        SET @Char = SUBSTRING(@InputString, @Index, 1)
        IF @Char IN (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&', '''', '(', '{', '[', '#')
            IF @Index + 1 <= LEN(@InputString)
            BEGIN
                IF @Char != '''' OR UPPER(SUBSTRING(@InputString, @Index + 1, 1)) != 'S'
                    SET @OutputString = STUFF(@OutputString, @Index + 1, 1, UPPER(SUBSTRING(@InputString, @Index + 1, 1)))
            END
        SET @Index = @Index + 1
    END
    RETURN ISNULL(@OutputString, '')
END
GO
-- 00.045 x 0.045%
-- 00.045 z 0.045%
--
-- Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z')
--
-- 00.456 k 0.456%
-- 00.511 j 0.511%
-- 00.824 v 0.824%
-- kjv
-- Replace(Replace(Replace(Replace(TS_NAME,'k','#'),'j','k'),'v','j'),'#','v')
--
-- 01.642 g 1.642%
-- 02.284 n 2.284%
-- 02.415 l 2.415%
-- gnl
-- Replace(Replace(Replace(Replace(TS_NAME,'g','#'),'n','g'),'l','n'),'#','l')
--
-- 02.826 r 2.826%
-- 03.174 d 3.174%
-- 03.826 m 3.826%
-- rdm
-- Replace(Replace(Replace(Replace(TS_NAME,'r','#'),'d','r'),'m','d'),'#','m')
--
-- 04.027 f 4.027%
-- 04.200 h 4.200%
-- 04.319 p 4.319%
-- 04.434 b 4.434%
-- 05.238 c 5.238%
-- fhpbc
-- Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c')
--
-- 05.497 w 5.497%
-- 06.686 s 6.686%
-- 15.978 t 15.978%
-- wst
-- Replace(Replace(Replace(Replace(TS_NAME,'w','#'),'s','w'),'t','s'),'#','t')
--
--
-- 02.799 e 2.799%
-- 07.294 i 7.294%
-- 07.631 o 7.631%
-- 11.682 a 11.682%
-- eioa
-- Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','#'),'i','e'),'o','i'),'a','o'),'#','a')
--
-- -- dont replace
-- 00.222 q 0.222%
-- 00.763 y 0.763%
-- 01.183 u 1.183%
-- Obfuscate a name
Select
ts_id,
Cast(ts_name as varchar(42)) as [Original Name],
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z'),'k','#'),'j','k'),'v','j'),'#','v'),'g','#'),'n','g'),'l','n'),'#','l'),'r','#'),'d','r'),'m','d'),'#','m'),'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c'),'w','#'),'s','w'),'t','s'),'#','t'),'e','#'),'i','e'),'o','i'),'a','o'),'#','a')) as VarChar(42)) As [morphed name],
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','t'),'~','e'),'t','~'),'a','o'),'~','a'),'o','~'),'i','n'),'~','i'),'n','~'),'s','h'),'~','s'),'h','r'),'r','~'),'d','l'),'~','d'),'l','~'),'m','w'),'~','m'),'w','f'),'f','~'),'g','y'),'~','g'),'y','p'),'p','~'),'b','v'),'~','b'),'v','k'),'k','~'),'x','~'),'j','x'),'~','j')) as VarChar(42)) As [morphed name2]
From
ts_users
;
Why not just use some sort of Random Name Generator?
Use a temporary table instead and the query is very fast. I just ran it on 60K rows in 4 seconds. I'll be using this one going forward.
CREATE TABLE #Names
(Id int IDENTITY(1,1), [Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT #Names
SELECT LastName
FROM Customer
ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID]
SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id)
DROP TABLE #Names
The following approach worked for us. Let's say we have 2 tables, Customers and Products:
CREATE FUNCTION [dbo].[GenerateDummyValues]
(
    @dataType varchar(100),
    @currentValue varchar(4000) = NULL
)
RETURNS varchar(4000)
AS
BEGIN
    IF @dataType = 'int'
    BEGIN
        RETURN '0'
    END
    ELSE IF @dataType = 'varchar' OR @dataType = 'nvarchar' OR @dataType = 'char' OR @dataType = 'nchar'
    BEGIN
        RETURN 'AAAA'
    END
    ELSE IF @dataType = 'datetime'
    BEGIN
        RETURN CONVERT(varchar(2000), GETDATE())
    END
    -- you can add more checks, add complicated logic, etc.
    RETURN 'XXX'
END
The above function will help generate different data based on the data type coming in.
Now, for each column of each table which does not have the word "id" in it, use the following query to generate further queries to manipulate the data:
select 'select ''update '' + TABLE_NAME + '' set '' + COLUMN_NAME + '' = '' + '''''''' + dbo.GenerateDummyValues( Data_type,'''') + '''''' where id = '' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, ' + table_name + ' where RIGHT(LOWER(COLUMN_NAME),2) <> ''id'' and TABLE_NAME = '''+ table_name + '''' + ';' from INFORMATION_SCHEMA.TABLES;
When you execute the above query, it will generate update queries for each table and for each column of that table, for example:
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Customers where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Customers';
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Products where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Products';
Now, when you execute the above queries, you will get the final update queries that will update the data in your tables.
You can execute this on any SQL Server database, no matter how many tables you have; it will generate queries for you that can be further executed.
Hope this helps.
Another site to generate shaped fake data sets, with an option for T-SQL output:
https://mockaroo.com/
Here's a way using ROT47, which is reversible, and another which is random. You can add a PK to either to link back to the "unscrambled" versions.
declare @table table (ID int, PLAIN_TEXT nvarchar(4000))
insert into @table
values
(1,N'Some Dudes name'),
(2,N'Another Person Name'),
(3,N'Yet Another Name')
--split your string into a column, and compute the decimal value (N)
if object_id('tempdb..#staging') is not null drop table #staging
select
substring(a.b, v.number+1, 1) as Val
,ascii(substring(a.b, v.number+1, 1)) as N
--,dense_rank() over (order by b) as RN
,a.ID
into #staging
from (select PLAIN_TEXT b, ID FROM @table) a
inner join
master..spt_values v on v.number < len(a.b)
where v.type = 'P'
--select * from #staging
--create a fast tally table of numbers to be used to build the ROT-47 table.
;WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--Here we put it all together with stuff and FOR XML
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,case
when 47 + N <= 126 then char(47 + N)
when 47 + N > 126 then char(N-47)
end as ENCRYPTED_TEXT
from cteTally
where N between 33 and 126) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t
--or if you want really random
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,char((select ROUND(((122 - N -1) * RAND() + N), 0))) as ENCRYPTED_TEXT
from cteTally
where (N between 65 and 122) and N not in (91,92,93,94,95,96)) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t
Encountered the same problem myself and figured out an alternative solution that may work for others.
The idea is to use MD5 on the name and then take the last 3 hex digits of it to map into a table of names. You can do this separately for first name and last name.
3 hex digits represent decimals from 0 to 4095, so we need a list of 4096 first names and 4096 last names.
So conv(right(md5(first_name), 3), 16, 10) (in MySQL syntax) would be an index from 0 to 4095 that could be joined with a table that holds 4096 first names. The same concept could be applied to last names.
Using MD5 (as opposed to a random number) guarantees a name in the original data will always be mapped to the same name in the test data.
You can get a list of names here:
https://gist.github.com/elifiner/cc90fdd387449158829515782936a9a4
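In T-SQL, a hedged sketch of the same idea could use HASHBYTES (the Customer and FirstNames tables here are hypothetical; FirstNames holds 4096 rows keyed 0 through 4095):
-- Take the last 3 hex digits of the MD5 of the name (a value from 0 to 4095)
-- and join to a 4096-row lookup table of replacement names.
UPDATE c
SET c.FirstName = fn.[Name]
FROM Customer AS c
JOIN FirstNames AS fn
  ON fn.Id = CONVERT(int, CONVERT(varbinary(2),
       '0' + RIGHT(CONVERT(char(32), HASHBYTES('MD5', c.FirstName), 2), 3), 2));
Because the hash is deterministic, duplicate source names always map to the same replacement name.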
I am working on this at my company right now -- and it turns out to be a very tricky thing. You want to have names that are realistic, but must not reveal any real personal info.
My approach has been to first create a randomized "mapping" of last names to other last names, then use that mapping to change all last names. This is good if you have duplicate name records. Suppose you have 2 "John Smith" records that both represent the same real person. If you changed one record to "John Adams" and the other to "John Best", then your one "person" now has 2 different names! With a mapping, all occurrences of "Smith" get changed to "Jones", and so duplicates ( or even family members ) still end up with the same last name, keeping the data more "realistic".
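A minimal sketch of building such a mapping in T-SQL, assuming a hypothetical Customer table: each distinct last name is paired with a randomly ordered copy of the same list, and the mapping is then applied in one set-based update.
-- Pair each distinct last name with another, randomly chosen one
WITH Ordered AS
(
    SELECT LastName, ROW_NUMBER() OVER (ORDER BY LastName) AS rn
    FROM (SELECT DISTINCT LastName FROM Customer) AS d
),
Shuffled AS
(
    SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM (SELECT DISTINCT LastName FROM Customer) AS d
)
SELECT o.LastName AS OriginalName, s.LastName AS MappedName
INTO #NameMap
FROM Ordered AS o
JOIN Shuffled AS s ON o.rn = s.rn;

-- Every occurrence of a given last name gets the same replacement
UPDATE c
SET c.LastName = m.MappedName
FROM Customer AS c
JOIN #NameMap AS m ON c.LastName = m.OriginalName;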
I will also have to scramble the addresses, phone numbers, bank account numbers, etc...and I am not sure how I will approach those. Keeping the data "realistic" while scrambling is certainly a deep topic. This must have been done many times by many companies -- who has done this before? What did you learn?
Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.
Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.
I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.
If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random switch. Rather than build a temp table, run switches between each location and a random row in the table, swapping one piece of data at a time.
The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.
It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.
Edit 2: Off-the-cuff answer in T-SQL:
declare @name varchar(50)
set @name = (SELECT lastName from person where personID = (random id number))

Update person
set lastname = @name
WHERE personID = (person id of current row)
Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.
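A hedged sketch of that loop, assuming a hypothetical person table with contiguous personID values from 1 up to the row count (the random index shrinks with each pass, per the Fisher-Yates constraint):
DECLARE @i int, @j int, @tmp varchar(50);
SELECT @i = COUNT(*) FROM person;

WHILE @i > 1
BEGIN
    -- random index in [1, @i]
    SET @j = 1 + ABS(CHECKSUM(NEWID())) % @i;

    -- swap the last names of rows @i and @j
    SELECT @tmp = lastName FROM person WHERE personID = @i;
    UPDATE person SET lastName = (SELECT lastName FROM person p2 WHERE p2.personID = @j)
    WHERE personID = @i;
    UPDATE person SET lastName = @tmp WHERE personID = @j;

    SET @i = @i - 1;
END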
I've got a column in a database table (SQL Server 2005) that contains data like this:
TQ7394
SZ910284
T r1534
su8472
I would like to update this column so that the first two characters are uppercase. I would also like to remove any spaces between the first two characters. So T q1234 would become TQ1234.
The solution should be able to cope with multiple spaces between the first two characters.
Is this possible in T-SQL? How about in ANSI-92? I'm always interested in seeing how this is done in other db's too, so feel free to post answers for PostgreSQL, MySQL, et al.
Here is a solution:
EDIT: Updated to support replacement of multiple spaces between the first and the second non-space characters
/* TEST TABLE */
DECLARE @T AS TABLE(code Varchar(20))
INSERT INTO @T SELECT 'ab1234x1' UNION SELECT ' ab1234x2'
UNION SELECT '  ab1234x3' UNION SELECT 'a b1234x4'
UNION SELECT 'a  b1234x5' UNION SELECT 'a   b1234x6'
UNION SELECT 'ab 1234x7' UNION SELECT 'ab  1234x8'
SELECT * FROM @T
/* INPUT
code
--------------------
ab1234x3
ab1234x2
a b1234x6
a b1234x5
a b1234x4
ab 1234x8
ab 1234x7
ab1234x1
*/
/* START PROCESSING SECTION */
DECLARE @s Varchar(20)
DECLARE @firstChar INT
DECLARE @secondChar INT
UPDATE @T SET
    @firstChar = PATINDEX('%[^ ]%', code)
   ,@secondChar = @firstChar + PATINDEX('%[^ ]%', STUFF(code, 1, @firstChar, ''))
   ,@s = STUFF(
            code,
            1,
            @secondChar,
            REPLACE(LEFT(code, @secondChar), ' ', '')
         )
   ,@s = STUFF(
            @s,
            1,
            2,
            UPPER(LEFT(@s, 2))
         )
   ,code = @s
/* END PROCESSING SECTION */
SELECT * FROM @T
/* OUTPUT
code
--------------------
AB1234x3
AB1234x2
AB1234x6
AB1234x5
AB1234x4
AB 1234x8
AB 1234x7
AB1234x1
*/
UPDATE YourTable
SET YourColumn = UPPER(
SUBSTRING(
REPLACE(YourColumn, ' ', ''), 1, 2
)
)
+
SUBSTRING(YourColumn, 3, LEN(YourColumn))
UPPER isn't going to hurt any numbers, so if the examples you gave are completely representative, there's not really any harm in doing:
UPDATE tbl
SET col = REPLACE(UPPER(col), ' ', '')
The sample data only has spaces and lowercase letters at the start. If this holds true for the real data then simply:
UPPER(REPLACE(YourColumn, ' ', ''))
For a more specific answer I'd politely ask you to expand on your spec, otherwise I'd have to code around all the other possibilities (e.g. values of less than three characters) without knowing if I was overengineering my solution to handle data that wouldn't actually arise in reality :)
As ever, once you've fixed the data, put in a database constraint to ensure the bad data does not reoccur e.g.
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_1_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 1, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_2_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 2, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
@huo73: yours doesn't work for me on SQL Server 2008: I get 'TRr1534' instead of 'TR1534'.
update Table set Column = case when len(rtrim(substring(Column, 1, 2))) < 2
    then UPPER(substring(Column, 1, 1) + substring(Column, 3, 1)) + substring(Column, 4, len(Column))
    else UPPER(substring(Column, 1, 2)) + substring(Column, 3, len(Column)) end
This works on the fact that if there is a space, the trim of that part of the string yields a length less than 2, so we split the string into three parts and apply UPPER to the 1st and 3rd characters. In all other cases we can split the string into two parts and use UPPER to make the first two chars uppercase.
If you are doing an UPDATE, I would do it in 2 steps; first get rid of the space (RTRIM on a SUBSTRING), and second do the UPPER on the first 2 chars:
-- uses a fixed column length - 20-odd in this case
UPDATE FOO
SET bar = RTRIM(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
UPDATE FOO
SET bar = UPPER(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
If you need it in a SELECT (i.e. inline), then I'd be tempted to write a scalar UDF
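A minimal sketch of such a UDF (the name dbo.FixCode and the 20-character width are assumptions; it handles a single space between the first two characters, mirroring the two-step UPDATE above):
CREATE FUNCTION dbo.FixCode (@code varchar(20))
RETURNS varchar(20)
AS
BEGIN
    -- step 1: drop the space between the first two (non-space) characters
    SET @code = RTRIM(SUBSTRING(@code, 1, 2)) + SUBSTRING(@code, 3, 20);
    -- step 2: uppercase the first two characters
    RETURN UPPER(SUBSTRING(@code, 1, 2)) + SUBSTRING(@code, 3, 20);
END
You could then call SELECT dbo.FixCode(bar) FROM FOO to get the cleaned value inline.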