SQL Query to use column as a formula to calculate value

I am working on a request where I need to calculate a value based on a formula specified in another column.
Below is my table:
I need to write a query that returns a value based on the FORMULA column, e.g. I need a result like:
As the formula could be anything built from my columns PRICE and SIZE, how do I write the query to achieve this?

A dynamic query is the (only) way to go, and it's not that complicated:
DECLARE @query NVARCHAR(MAX) = '';
SELECT @query = @query + '
UNION
SELECT ItemID, Price, Size, Formula, ' + Formula + ' AS CalcValue FROM YourTable WHERE Formula = ''' + Formula + ''' '
FROM YourTable;
SET @query = STUFF(@query,1,8,'');
PRINT @query;
EXEC (@query);
SQLFiddle DEMO
But you must be aware of how error-prone this is: if the value of the Formula column is not a valid formula, the query breaks.
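One cheap way to at least fail more gracefully (a sketch, not part of the original answer) is to wrap the EXEC in TRY/CATCH so a bad formula is reported together with the generated SQL:
BEGIN TRY
    EXEC (@query);
END TRY
BEGIN CATCH
    -- surface the failing batch so the offending Formula value can be tracked down
    PRINT 'Dynamic query failed: ' + ERROR_MESSAGE();
    PRINT @query;
END CATCH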
Edit: going with UNION instead of UNION ALL because the same formula can appear in multiple rows.
Edit 2: Plan B. Instead of running a bunch of identical SELECT queries and taking the DISTINCT of the results, it is better to take the distinct formulas at the start:
DECLARE @query NVARCHAR(MAX) = '';
WITH CTE_DistinctFormulas AS
(
SELECT DISTINCT Formula FROM YourTable
)
SELECT @query = @query + '
UNION ALL
SELECT ItemID, Price, Size, Formula, ' + Formula + ' AS CalcValue FROM YourTable WHERE Formula = ''' + Formula + ''' '
FROM CTE_DistinctFormulas;
SET @query = STUFF(@query,1,12,'');
PRINT @query;
EXEC (@query);
SQLFiddle DEMO 2 - added a few more rows

Another alternative that is relatively easy to implement is a CLR function. You can take advantage of the Compute method of DataTable, which gives you a simple one-liner in C#.
[Microsoft.SqlServer.Server.SqlFunction]
public static double Evaluate(SqlString expression)
{
return double.Parse((new DataTable()).Compute(expression.ToString(), "").ToString());
}
Then add the assembly to SQL Server and create the wrapper function:
CREATE FUNCTION [dbo].[Evaluate](@expression [nvarchar](4000))
RETURNS [float] WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [YourAssemblyName].[YourClassName].[Evaluate]
GO
Now you can call the function as part of a simple select statement:
SELECT itemid, price, size, formula,
dbo.Evaluate(REPLACE(REPLACE(formula, 'PRICE', FORMAT(price,'0.00')),
'SIZE', FORMAT(size, '0'))) as calcvalue FROM YourTable

I did something similar to this, and if you stay within a smaller pool of operations it is not that hard. I went with a setup where I had X and Y columns and an operator. Then I just did a large CASE WHEN statement identifying the operator and performing logic based on it. You may have to change your structure slightly. The core of the problem is that SQL is a result-set-based engine, so anything where you have to do an operation to determine the logic dynamically is going to be slower. E.g.:
DECLARE @Formula TABLE
(
    FormulaId INT IDENTITY
    , FormulaName VARCHAR(128)
    , Operator VARCHAR(4)
);
DECLARE @Values TABLE
(
    ValueId INT IDENTITY
    , FormulaId INT
    , Price MONEY
    , Size INT
);
INSERT INTO @Formula (FormulaName, Operator)
VALUES ('Simple Addition', '+'), ('Simple Subtraction', '-'), ('Simple Multiplication', '*'), ('Simple Division', '/'), ('Squared', '^2'), ('Grow by 20 percent then Multiply', '20%*');
INSERT INTO @Values (FormulaId, Price, Size)
VALUES (1, 10, 5), (2, 10, 5), (3, 10, 5), (4, 10, 5), (5, 10, 5), (6, 10, 5), (1, 16, 12), (6, 124, 254);
SELECT *
FROM @Values;
SELECT
    f.FormulaId
    , f.FormulaName
    , v.ValueId
    , Price
    , Operator
    , Size
    , CASE WHEN Operator = '+' THEN Price + Size
           WHEN Operator = '-' THEN Price - Size
           WHEN Operator = '*' THEN Price * Size
           WHEN Operator = '/' THEN Price / Size
           WHEN Operator = '^2' THEN Price * Price
           WHEN Operator = '20%*' THEN (Price * 1.20) * Size
      END AS Output
FROM @Values v
INNER JOIN @Formula f ON f.FormulaId = v.FormulaId;
With this method, my operation is really just a pointer reference to another table that has an operator, which is, for all intents and purposes, just a token I use for my CASE statement. You can even compound this: if you wanted to do multiple passes, you could add a 'Group' column and a 'Sequence' column and apply one operation after the other. It depends how complex your 'formulas' become. If you get into more than 3 or 4 variables that change frequently, with frequent operator changes, then you probably would want to use dynamic SQL. But if there are just a few operations, it should not be that hard. Just keep in mind the downside of this approach is that it is hard-coded at a certain level, yet flexible in that the parameters fed into it can apply the same formula over and over.
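As a rough illustration of that 'Sequence' idea (purely a sketch; the @FormulaStep table and its columns are hypothetical, not part of the structure above), each formula can be broken into ordered steps and applied one after the other:
DECLARE @FormulaStep TABLE (FormulaId INT, Sequence INT, Operator VARCHAR(4), Operand DECIMAL(18,4));
-- formula 1 = "multiply by 5, then grow by 20 percent", expressed as two ordered steps
INSERT INTO @FormulaStep VALUES (1, 1, '*', 5), (1, 2, '*', 1.20);
DECLARE @Result DECIMAL(18,4) = 10;   -- starting value, e.g. Price
DECLARE @Seq INT = 1, @Op VARCHAR(4), @Operand DECIMAL(18,4);
WHILE EXISTS (SELECT 1 FROM @FormulaStep WHERE FormulaId = 1 AND Sequence = @Seq)
BEGIN
    SELECT @Op = Operator, @Operand = Operand
    FROM @FormulaStep
    WHERE FormulaId = 1 AND Sequence = @Seq;
    SET @Result = CASE @Op
                      WHEN '+' THEN @Result + @Operand
                      WHEN '-' THEN @Result - @Operand
                      WHEN '*' THEN @Result * @Operand
                      WHEN '/' THEN @Result / @Operand
                  END;
    SET @Seq = @Seq + 1;
END
SELECT @Result AS FinalValue;   -- (10 * 5) * 1.20 = 60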


Generate sequential number in SQL - not by using Identity

I am working on a task where my query will produce a fixed width column. One of the fields in the fixed width column needs to be a sequentially generated number.
Below is my query:
select
_row_ord = 40,
_cid = t.client_num,
_segment = 'ABC',
_value =
concat(
'ABC*',
'XX**', --Hierarchical ID number-this field should be sequentially generated
'20*',
'1*','~'
)
from #temp1 t
My output:
Is there a way to declare @num as a parameter that generates a number sequentially?
PS: The fields inside the CONCAT function are all hardcoded. Only the 'XX', i.e. the sequential number, has to be dynamically generated.
Any help?!
You could create a SEQUENCE object, then call the NEXT VALUE FOR the SEQUENCE in your query.
Something along these lines:
CREATE SEQUENCE dbo.ExportValues
START WITH 1
INCREMENT BY 1 ;
GO
And then:
select
_row_ord = 40,
_cid = t.client_num,
_segment = 'ABC',
_value =
concat(
'ABC*',
RIGHT(CONCAT('000000000000000', NEXT VALUE FOR dbo.ExportValues), 15),
'**',
'20*',
'1*','~'
)
from #temp1 t
You'd have to tweak how many zeros there are for the padding and how many digits to trim it to for your requirements. If duplicate values are OK, you could have the SEQUENCE reset periodically; see the documentation for more on that. It's just another line in the CREATE statement.
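For example (a sketch, sized to match the 15-character padding above), the reset can be declared on the sequence itself:
CREATE SEQUENCE dbo.ExportValues
    START WITH 1
    INCREMENT BY 1
    MINVALUE 1
    MAXVALUE 999999999999999  -- 15 digits, to match the padding above
    CYCLE;                    -- restart at MINVALUE once MAXVALUE is reached
GO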
You can use row_number() -- made a little more complicated because you are zero-padding it:
select _row_ord = 40, _cid = t.client_num, _segment = 'ABC',
_value = concat('ABC*',
right('00' + convert(varchar(255), row_number() over (order by ?)), 2),
'**', --Hierarchical ID number - this field should be sequentially generated
'20*',
'1*','~'
)
from #temp1 t;
Note that the ? is for the column that specifies the ordering. If you don't care about the ordering of the numbers, use (select null) in place of the ?.
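So, assuming any ordering is fine, the padded expression could simply be written as:
right('00' + convert(varchar(255), row_number() over (order by (select null))), 2)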

SQL Server - Select a column that contains a query string and split the values into other columns

I need to do a select on a column that contains a query string like:
user_id=300&company_id=201503&status=WAITING OPERATION&count=1
I want to perform a select and break each value out into its own column, something like:
user_id | company_id | status | count
300 | 201503 | WAITING OPERATION | 1
How can I do it in SQL Server without using procs?
I've tried a function:
CREATE FUNCTION [xpto].[SplitGriswold]
(
@List NVARCHAR(MAX),
@Delim1 NCHAR(1),
@Delim2 NCHAR(1)
)
RETURNS TABLE
AS
RETURN
(
SELECT
Val1 = PARSENAME(Value,2),
Val2 = PARSENAME(Value,1)
FROM
(
SELECT REPLACE(Value, @Delim2, '&') FROM
(
SELECT LTRIM(RTRIM(SUBSTRING(@List, [Number],
CHARINDEX(@Delim1, @List + @Delim1, [Number]) - [Number])))
FROM (SELECT Number = ROW_NUMBER() OVER (ORDER BY name)
FROM sys.all_objects) AS x
WHERE Number <= LEN(@List)
AND SUBSTRING(@Delim1 + @List, [Number], LEN(@Delim1)) = @Delim1
) AS y(Value)
) AS z(Value)
);
GO
Execution:
select QueryString
from User.Log
CROSS APPLY notifier.SplitGriswold(REPLACE(QueryString, ' ', N'ŏ'), N'ŏ', '&') AS t;
But it returns only one column with everything inside:
QueryString
user_id=300&company_id=201503&status=WAITING OPERATION&count=1
Thanks in advance.
I've had to do this many times before, and you're in luck! Since you only have 3 delimiters per string, and that number is fixed, you can use SQL Server's PARSENAME function to do it. That's far less ugly than the best alternative (using the XML parsing stuff). Try this (untested) query (replace TABLE_NAME and COLUMN_NAME with the appropriate names):
SELECT
PARSENAME(REPLACE(COLUMN_NAME,'&','.'),4) AS 'User',
PARSENAME(REPLACE(COLUMN_NAME,'&','.'),3) AS 'Company_ID',
PARSENAME(REPLACE(COLUMN_NAME,'&','.'),2) AS 'Status',
PARSENAME(REPLACE(COLUMN_NAME,'&','.'),1) AS 'Count'
FROM TABLE_NAME
That'll get you the results in the form "user_id=300", which is far and away the hard part of what you want. I'll leave it to you to do the easy part (drop the stuff before the "=" sign).
NOTE: I can't remember if PARSENAME will freak out over the illegal name character (the "=" sign). If it does, simply nest another REPLACE in there to turn it into something else, like an underscore.
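For what it's worth, here is one hedged sketch of that 'easy part' (TABLE_NAME and COLUMN_NAME are still placeholders, and the p(u, c, s, n) aliases are made up here), stripping everything up to and including the '=' sign:
SELECT
    SUBSTRING(p.u, CHARINDEX('=', p.u) + 1, 4000) AS [User],
    SUBSTRING(p.c, CHARINDEX('=', p.c) + 1, 4000) AS Company_ID,
    SUBSTRING(p.s, CHARINDEX('=', p.s) + 1, 4000) AS [Status],
    SUBSTRING(p.n, CHARINDEX('=', p.n) + 1, 4000) AS [Count]
FROM TABLE_NAME
CROSS APPLY (SELECT
    PARSENAME(REPLACE(COLUMN_NAME,'&','.'),4),
    PARSENAME(REPLACE(COLUMN_NAME,'&','.'),3),
    PARSENAME(REPLACE(COLUMN_NAME,'&','.'),2),
    PARSENAME(REPLACE(COLUMN_NAME,'&','.'),1)) AS p(u, c, s, n);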
You need to use SQL SUBSTRING as part of your select statement. You would first need to build the first row, then use a UNION to return the second row.

SQL batch query processing (SQL query input array)

I have a SQL query like:
SELECT *, dbo.func(@param1, a.point) as fValue
FROM dbo.table AS a
WHERE dbo.func(@param1, a.point) < @param2
When this query is executed only once, everything is fine, but when I have an array of input @param1 values, let's say over 100 values, executing and fetching results for every value takes a lot of time.
Is it possible to pass an array of @param1 values into the query somehow, and receive a dataset for all the input values, instead of executing it for each value?
The function func() does some math on 2 values. @param1 and a.point are of type double. And, yeah, a.point is not an ID, and it is not a unique value.
I know, it should be really easy, but it looks like I'm missing something.
You still need to execute that function 100 times for each row, right? I don't see any shortcuts here.
If you wanted to get them all at once, you could do
SELECT dbo.func(@param1, a.point) as fValue1,
dbo.func(@param2, a.point) as fValue2 ...
or something like that, but looping through them just seems more efficient to me anyway.
I suppose you could use a cursor to retrieve each a.point value once, then act on it 100 times, but that's a lot of coding, and not necessarily a simpler solution.
What exactly does dbo.func() do? Is it possible that you could insert the 100 values into a table structure, and perform that operation on the set of 100 all at once, instead of 1x1 100 times?
As an example, let's say you have this function, which just turns a comma-separated list of float values into a single-column table:
CREATE FUNCTION dbo.ListFloats
(
@List VARCHAR(MAX)
)
RETURNS TABLE
RETURN
(
SELECT i = CONVERT(FLOAT, Item)
FROM
(
SELECT Item = x.i.value('(./text())[1]', 'FLOAT')
FROM
(
SELECT [XML] = CONVERT(XML, '<i>'
+ REPLACE(@List, ',', '</i><i>')
+ '</i>').query('.')
) AS a
CROSS APPLY
[XML].nodes('i') AS x(i)
) AS y
WHERE Item IS NOT NULL
);
GO
Now you should be able to get your floats in a set by simply saying:
SELECT i FROM dbo.ListFloats('1.5, 3.0, 2.45, 1.9');
Taking that a step further, let's say dbo.func() takes these two inputs and says something like:
RETURN (SELECT ((@param1 + @param2) / @param2));
Now, I know that you've always been told that modularization and encapsulation are good, but in the case of inline functions, I would suggest you avoid the function that gets this result (again, you haven't explained what dbo.func() does, so I'm just guessing this will be easy) and do it inline. So instead of calling dbo.func() (twice for each row, no less), you can just say:
DECLARE
@Param1Array VARCHAR(MAX) = '1.5, 3.0, 2.45, 1.9',
@Param2 FLOAT = 2.0;
WITH x AS
(
SELECT t.point, x.i, fValue = ((x.i + t.point)/t.point)
FROM dbo.[table] AS t
CROSS JOIN dbo.ListFloats(@Param1Array) AS x
)
SELECT point, i, fValue FROM x
--WHERE fValue < @Param2
;
The keys are:
Avoiding processing each parameter individually.
Avoiding doing the individual calculations off in their own separate module.
Performing calculations as few times as possible.
If you can't change the structure this much, then at the very least, avoid calculating the function twice by writing instead:
;WITH x AS
(
SELECT *, dbo.func(@param1, a.point) as fValue
FROM dbo.table AS a
)
SELECT * FROM x
WHERE fValue < @param2;
If you provide details about the data types, what dbo.func() does, etc., people will be able to provide more tangible advice.
Do you have any indexes on this table? If you have an index on a.point, then you will never hit it using this code, i.e. it will always table scan. This has to do with search arguments (you can google this). Example:
If you have table xTable with index on column xColumn, then this:
select colA, colB from xTable where xColumn/2 >= 5
will never use the index, but this probably will:
select colA, colB from xTable where xColumn >=10
So you might need something like this:
WHERE a.point < Otherfunc(@param1, @param2)

Obfuscate / Mask / Scramble personal information

I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that make random social security numbers, shift birth dates, scramble emails, etc. But I've come up against a wall trying to scramble customer names. I want to keep real names so we can still use our searches, so random letter generation is out. What I have tried so far is building a temp table of all last names in the table, then updating the customer table with a random selection from the temp table. Like this:
DECLARE @Names TABLE (Id int IDENTITY(1,1), [Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT @Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM @Names WHERE ROWID=Id)
This worked well in test, but it completely bogs down with larger amounts of data (>20 minutes for 40K rows).
All of that to ask, how would you scramble customer names while keeping real names and the weight of the production data?
UPDATE: Never fails, you try to put all the information in the post and you forget something important. This data will also be used in our sales & demo environments, which are publicly available. Some of the answers describe what I am attempting to do, i.e. 'switch' the names, but my question is literally: how do I code this in T-SQL?
I use generatedata. It is an open-source PHP script which can generate all sorts of dummy data.
A very simple solution would be to ROT13 the text.
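For instance, on SQL Server 2017+ a bare-bones ROT13 pass could use TRANSLATE (a sketch only; the table and column names are placeholders, and it lowercases first to keep the mapping simple):
SELECT TRANSLATE(LOWER(LastName),
                 'abcdefghijklmnopqrstuvwxyz',
                 'nopqrstuvwxyzabcdefghijklm') AS ScrambledLastName
FROM Customer;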
A better question may be why you feel the need to scramble the data at all. If you have an encryption key, you could also consider running the text through DES or AES or similar. Those would have potential performance issues, however.
When doing something like this, I usually write a small program that first loads a lot of names and surnames into two arrays, and then just updates the database using a random name/surname from the arrays. It works really fast even for very big datasets (200,000+ records).
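A T-SQL flavour of the same idea might look like this (purely a sketch; dbo.FakeFirstNames, dbo.FakeLastNames and CustomerId are hypothetical names):
UPDATE c
SET FirstName = fn.[Name],
    LastName  = ln.[Name]
FROM Customer AS c
CROSS APPLY (SELECT TOP (1) [Name]
             FROM dbo.FakeFirstNames
             ORDER BY CHECKSUM(NEWID(), c.CustomerId)) AS fn  -- correlated so a fresh random row is picked per customer
CROSS APPLY (SELECT TOP (1) [Name]
             FROM dbo.FakeLastNames
             ORDER BY CHECKSUM(NEWID(), c.CustomerId)) AS ln;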
I use a method that changes characters in the name to other characters that are in the same "range" of usage frequency in English names. Apparently, the distribution of characters in names is different than it is for normal conversational English. For example, "x" and "z" occur 0.245% of the time, so they get swapped. At the other extreme, "w" is used 5.5% of the time, "s" 6.86% and "t" 15.978%. I change "s" to "w", "t" to "s" and "w" to "t".
I keep the vowels "aeio" in a separate group so that a vowel is only replaced by another vowel. Similarly, "q", "u" and "y" are not replaced at all. My grouping and decisions are totally subjective.
I ended up with 7 different "groups" of 2-5 characters, based mostly on frequency. Characters within each group are swapped with other characters in that same group.
The net result is names that kinda look like they might be names, but from "not around here".
Original name    Morphed name
Loren            Nimag
Juanita          Kuogewso
Tennyson         Saggywig
David            Mijsm
Julie            Kunewa
Here's the SQL I use, which includes a "TitleCase" function. There are 2 different versions of the "morphed" name based on different frequencies of letters I found on the web.
-- from https://stackoverflow.com/a/28712621
-- Convert and return param as Title Case
CREATE FUNCTION [dbo].[fnConvert_TitleCase] (@InputString VARCHAR(4000))
RETURNS VARCHAR(4000) AS
BEGIN
DECLARE @Index INT
DECLARE @Char CHAR(1)
DECLARE @OutputString VARCHAR(255)
SET @OutputString = LOWER(@InputString)
SET @Index = 2
SET @OutputString = STUFF(@OutputString, 1, 1, UPPER(SUBSTRING(@InputString,1,1)))
WHILE @Index <= LEN(@InputString)
BEGIN
SET @Char = SUBSTRING(@InputString, @Index, 1)
IF @Char IN (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&','''','(','{','[','#')
IF @Index + 1 <= LEN(@InputString)
BEGIN
IF @Char != '''' OR UPPER(SUBSTRING(@InputString, @Index + 1, 1)) != 'S'
SET @OutputString = STUFF(@OutputString, @Index + 1, 1, UPPER(SUBSTRING(@InputString, @Index + 1, 1)))
END
SET @Index = @Index + 1
END
RETURN ISNULL(@OutputString,'')
END
Go
-- 00.045 x 0.045%
-- 00.045 z 0.045%
--
-- Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z')
--
-- 00.456 k 0.456%
-- 00.511 j 0.511%
-- 00.824 v 0.824%
-- kjv
-- Replace(Replace(Replace(Replace(TS_NAME,'k','#'),'j','k'),'v','j'),'#','v')
--
-- 01.642 g 1.642%
-- 02.284 n 2.284%
-- 02.415 l 2.415%
-- gnl
-- Replace(Replace(Replace(Replace(TS_NAME,'g','#'),'n','g'),'l','n'),'#','l')
--
-- 02.826 r 2.826%
-- 03.174 d 3.174%
-- 03.826 m 3.826%
-- rdm
-- Replace(Replace(Replace(Replace(TS_NAME,'r','#'),'d','r'),'m','d'),'#','m')
--
-- 04.027 f 4.027%
-- 04.200 h 4.200%
-- 04.319 p 4.319%
-- 04.434 b 4.434%
-- 05.238 c 5.238%
-- fhpbc
-- Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c')
--
-- 05.497 w 5.497%
-- 06.686 s 6.686%
-- 15.978 t 15.978%
-- wst
-- Replace(Replace(Replace(Replace(TS_NAME,'w','#'),'s','w'),'t','s'),'#','t')
--
--
-- 02.799 e 2.799%
-- 07.294 i 7.294%
-- 07.631 o 7.631%
-- 11.682 a 11.682%
-- eioa
-- Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')
--
-- -- dont replace
-- 00.222 q 0.222%
-- 00.763 y 0.763%
-- 01.183 u 1.183%
-- Obfuscate a name
Select
ts_id,
Cast(ts_name as varchar(42)) as [Original Name],
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z'),'k','#'),'j','k'),'v','j'),'#','v'),'g','#'),'n','g'),'l','n'),'#','l'),'r','#'),'d','r'),'m','d'),'#','m'),'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c'),'w','#'),'s','w'),'t','s'),'#','t'),'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')) as VarChar(42)) As [morphed name] ,
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','t'),'~','e'),'t','~'),'a','o'),'~','a'),'o','~'),'i','n'),'~','i'),'n','~'),'s','h'),'~','s'),'h','r'),'r','~'),'d','l'),'~','d'),'l','~'),'m','w'),'~','m'),'w','f'),'f','~'),'g','y'),'~','g'),'y','p'),'p','~'),'b','v'),'~','b'),'v','k'),'k','~'),'x','~'),'j','x'),'~','j')) as VarChar(42)) As [morphed name2]
From
ts_users
;
Why not just use some sort of Random Name Generator?
Use a temporary table instead and the query is very fast. I just ran it on 60K rows in 4 seconds. I'll be using this one going forward.
CREATE TABLE #Names
(Id int IDENTITY(1,1), [Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT #Names
SELECT LastName
FROM Customer
ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID]
SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id)
DROP TABLE #Names
The following approach worked for us; let's say we have 2 tables, Customers and Products:
CREATE FUNCTION [dbo].[GenerateDummyValues]
(
@dataType varchar(100),
@currentValue varchar(4000)=NULL
)
RETURNS varchar(4000)
AS
BEGIN
IF @dataType = 'int'
BEGIN
Return '0'
END
ELSE IF @dataType = 'varchar' OR @dataType = 'nvarchar' OR @dataType = 'char' OR @dataType = 'nchar'
BEGIN
Return 'AAAA'
END
ELSE IF @dataType = 'datetime'
BEGIN
Return Convert(varchar(2000),GetDate())
END
-- you can add more checks, add complicated logic etc
Return 'XXX'
END
The above function will help in generating different data based on the data type coming in.
Now, for each column of each table that does not have the word "id" in it, use the following query to generate further queries to manipulate the data:
select 'select ''update '' + TABLE_NAME + '' set '' + COLUMN_NAME + '' = '' + '''''''' + dbo.GenerateDummyValues( Data_type,'''') + '''''' where id = '' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, ' + table_name + ' where RIGHT(LOWER(COLUMN_NAME),2) <> ''id'' and TABLE_NAME = '''+ table_name + '''' + ';' from INFORMATION_SCHEMA.TABLES;
When you execute the above query, it will generate update queries for each table and for each column of that table, for example:
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Customers where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Customers';
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Products where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Products';
Now, when you execute the above queries, you will get the final update queries that will update the data in your tables.
You can execute this on any SQL Server database, no matter how many tables you have; it will generate queries for you that can then be executed.
Hope this helps.
Another site to generate shaped fake data sets, with an option for T-SQL output:
https://mockaroo.com/
Here's a way using ROT47, which is reversible, and another which is random. You can add a PK to either to link back to the "unscrambled" versions.
declare @table table (ID int, PLAIN_TEXT nvarchar(4000))
insert into @table
values
(1,N'Some Dudes name'),
(2,N'Another Person Name'),
(3,N'Yet Another Name')
--split your string into a column, and compute the decimal value (N)
if object_id('tempdb..#staging') is not null drop table #staging
select
substring(a.b, v.number+1, 1) as Val
,ascii(substring(a.b, v.number+1, 1)) as N
--,dense_rank() over (order by b) as RN
,a.ID
into #staging
from (select PLAIN_TEXT b, ID FROM @table) a
inner join
master..spt_values v on v.number < len(a.b)
where v.type = 'P'
--select * from #staging
--create a fast tally table of numbers to be used to build the ROT-47 table.
;WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--Here we put it all together with stuff and FOR XML
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,case
when 47 + N <= 126 then char(47 + N)
when 47 + N > 126 then char(N-47)
end as ENCRYPTED_TEXT
from cteTally
where N between 33 and 126) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t
--or if you want really random
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,char((select ROUND(((122 - N -1) * RAND() + N), 0))) as ENCRYPTED_TEXT
from cteTally
where (N between 65 and 122) and N not in (91,92,93,94,95,96)) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t
Encountered the same problem myself and figured out an alternative solution that may work for others.
The idea is to use MD5 on the name and then take the last 3 hex digits of it to map into a table of names. You can do this separately for first name and last name.
3 hex digits represent decimals from 0 to 4095, so we need a list of 4096 first names and 4096 last names.
So conv(substr(md5(first_name), -3), 16, 10) (in MySQL syntax) would be an index from 0 to 4095 that could be joined with a table that holds 4096 first names. The same concept could be applied to last names.
Using MD5 (as opposed to a random number) guarantees a name in the original data will always be mapped to the same name in the test data.
You can get a list of names here:
https://gist.github.com/elifiner/cc90fdd387449158829515782936a9a4
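A rough T-SQL equivalent of the same trick (a sketch; the FirstNameLookup table with 4096 rows keyed 0-4095 is hypothetical) takes the low 12 bits of the hash, i.e. the last 3 hex digits:
SELECT c.FirstName,
       l.FakeName
FROM Customer AS c
JOIN FirstNameLookup AS l
  ON l.NameId = CAST(SUBSTRING(HASHBYTES('MD5', c.FirstName), 15, 2) AS INT) % 4096;  -- deterministic: same real name, same fake name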
I am working on this at my company right now -- and it turns out to be a very tricky thing. You want to have names that are realistic, but must not reveal any real personal info.
My approach has been to first create a randomized "mapping" of last names to other last names, then use that mapping to change all last names. This is good if you have duplicate name records. Suppose you have two "John Smith" records that both represent the same real person. If you changed one record to "John Adams" and the other to "John Best", then your one "person" now has two different names! With a mapping, all occurrences of "Smith" get changed to "Jones", so duplicates (or even family members) still end up with the same last name, keeping the data more "realistic".
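A sketch of building such a mapping in T-SQL (the LastNameMap table and column names are mine, not part of the original description):
-- pair each distinct last name with another distinct last name chosen at random
SELECT s.LastName AS RealName,
       t.LastName AS FakeName
INTO LastNameMap
FROM (SELECT LastName, ROW_NUMBER() OVER (ORDER BY LastName) AS rn
      FROM (SELECT DISTINCT LastName FROM Customer) AS d) AS s
JOIN (SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
      FROM (SELECT DISTINCT LastName FROM Customer) AS d) AS t
  ON t.rn = s.rn;
-- apply the mapping so every "Smith" becomes the same substitute everywhere
UPDATE c
SET c.LastName = m.FakeName
FROM Customer AS c
JOIN LastNameMap AS m ON m.RealName = c.LastName;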
I will also have to scramble the addresses, phone numbers, bank account numbers, etc...and I am not sure how I will approach those. Keeping the data "realistic" while scrambling is certainly a deep topic. This must have been done many times by many companies -- who has done this before? What did you learn?
Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.
Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.
I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.
If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random switch. Rather than build a temp table, run switches between each location and a random row in the table, swapping one piece of data at a time.
The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.
It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.
Edit (2): Off-the-cuff answer in T-SQL:
declare @name varchar(50)
set @name = (SELECT lastName from person where personID = (random id number))
Update person
set lastname = @name
WHERE personID = (person id of current row)
Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.

Uppercase first two characters in a column in a db table

I've got a column in a database table (SQL Server 2005) that contains data like this:
TQ7394
SZ910284
T r1534
su8472
I would like to update this column so that the first two characters are uppercase. I would also like to remove any spaces between the first two characters. So T q1234 would become TQ1234.
The solution should be able to cope with multiple spaces between the first two characters.
Is this possible in T-SQL? How about in ANSI-92? I'm always interested in seeing how this is done in other db's too, so feel free to post answers for PostgreSQL, MySQL, et al.
Here is a solution:
EDIT: Updated to support replacement of multiple spaces between the first and the second non-space characters
/* TEST TABLE */
DECLARE @T AS TABLE(code Varchar(20))
INSERT INTO @T SELECT 'ab1234x1' UNION SELECT ' ab1234x2'
UNION SELECT ' ab1234x3' UNION SELECT 'a b1234x4'
UNION SELECT 'a b1234x5' UNION SELECT 'a b1234x6'
UNION SELECT 'ab 1234x7' UNION SELECT 'ab 1234x8'
SELECT * FROM @T
/* INPUT
code
--------------------
ab1234x3
ab1234x2
a b1234x6
a b1234x5
a b1234x4
ab 1234x8
ab 1234x7
ab1234x1
*/
/* START PROCESSING SECTION */
DECLARE @s Varchar(20)
DECLARE @firstChar INT
DECLARE @secondChar INT
UPDATE @T SET
@firstChar = PATINDEX('%[^ ]%',code)
,@secondChar = @firstChar + PATINDEX('%[^ ]%', STUFF(code,1, @firstChar,'' ) )
,@s = STUFF(
code,
1,
@secondChar,
REPLACE(LEFT(code,
@secondChar
),' ','')
)
,@s = STUFF(
@s,
1,
2,
UPPER(LEFT(@s,2))
)
,code = @s
/* END PROCESSING SECTION */
SELECT * FROM @T
/* OUTPUT
code
--------------------
AB1234x3
AB1234x2
AB1234x6
AB1234x5
AB1234x4
AB 1234x8
AB 1234x7
AB1234x1
*/
UPDATE YourTable
SET YourColumn = UPPER(
SUBSTRING(
REPLACE(YourColumn, ' ', ''), 1, 2
)
)
+
SUBSTRING(YourColumn, 3, LEN(YourColumn))
UPPER isn't going to hurt any numbers, so if the examples you gave are completely representative, there's not really any harm in doing:
UPDATE tbl
SET col = REPLACE(UPPER(col), ' ', '')
The sample data only has spaces and lowercase letters at the start. If this holds true for the real data then simply:
UPPER(REPLACE(YourColumn, ' ', ''))
For a more specific answer I'd politely ask you to expand on your spec, otherwise I'd have to code around all the other possibilities (e.g. values of less than three characters) without knowing if I was overengineering my solution to handle data that wouldn't actually arise in reality :)
As ever, once you've fixed the data, put in a database constraint to ensure the bad data does not reoccur e.g.
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_1_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 1, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_2_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 2, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
@huo73: yours doesn't work for me on SQL Server 2008: I get 'TRr1534' instead of 'TR1534'.
update Table set Column = case when len(rtrim(substring(Column, 1, 2))) < 2
then UPPER(substring(Column, 1, 1) + substring(Column, 3, 1)) + substring(Column, 4, len(Column))
else UPPER(substring(Column, 1, 2)) + substring(Column, 3, len(Column)) end
This works on the fact that if there is a space, then the trim of that part of the string yields a length of less than 2, so we split the string into three parts and use UPPER on the 1st and 3rd characters. In all other cases we can split the string into 2 parts and use UPPER to make the first two characters uppercase.
If you are doing an UPDATE, I would do it in 2 steps; first get rid of the space (RTRIM on a SUBSTRING), and second do the UPPER on the first 2 chars:
-- uses a fixed column length - 20-odd in this case
UPDATE FOO
SET bar = RTRIM(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
UPDATE FOO
SET bar = UPPER(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
If you need it in a SELECT (i.e. inline), then I'd be tempted to write a scalar UDF
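Something like this, perhaps (a sketch of the kind of scalar UDF I mean; the name is made up, and it strips all spaces just like the simpler answers above):
CREATE FUNCTION dbo.FixCode (@value varchar(50))
RETURNS varchar(50)
AS
BEGIN
    -- drop the spaces, then uppercase the first two characters
    DECLARE @clean varchar(50) = REPLACE(@value, ' ', '');
    RETURN UPPER(LEFT(@clean, 2)) + SUBSTRING(@clean, 3, 50);
END
GO
-- usage: SELECT dbo.FixCode(YourColumn) FROM YourTable;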