Filling as it is data from stored procedure to Datatable - sql

I am experiencing weird behavior while filling data from a stored procedure to a DataTable.
What I am filling is:
Output from stored procedure:
MINI COMBO
Coke Float
This is constructed with STUFF by adding CHAR(13) after MINI COMBO and some spaces before Coke Float.
What the DataTable contains after the fill:
Coke Float
MINI COMBO
This is really new to me, please help.
Thanks in advance

This is an ordering issue. The order in which rows come back from your stored procedure can change at any time, for a variety of reasons, unless you specify an ORDER BY clause.
For example:
SELECT TXNID, ItemName = STUFF( ( SELECT Case When Level > 0 Then Case When
ItemName like '%(%' Then ','+ItemName Else '( '+ ItemName+' ' End Else CHAR(13) + Space(Spaces * 5) + ItemName+' ' End
FROM #ARLines_2 x1 WHERE TXNID = x.TXNID Order By Spaces FOR XML PATH(''), TYPE).value('.[1]', 'nvarchar(max)'), 1, 2, '') FROM #ARLines_2 AS x
GROUP BY ID
ORDER BY ItemName DESC
I suspect, however, that you have some sort of hierarchy (i.e. a parent-child relationship) that you want to show in your drop-down. If that is the case, you will need to include some sort of hierarchy identifier in the ORDER BY clause.

Related

SQL Query to use column as a formula to calculate value

I am working on a request where I need to calculate a value based on a formula specified in another column.
Below is my table:
I need to write a query that returns a value based on the FORMULA column, e.g. I need a result like:
As the formula could be anything built from my columns PRICE and SIZE, how do I write the query to achieve this?
Dynamic query is the (only) way to go and it's not that complicated:
DECLARE @query NVARCHAR(MAX) = '';
SELECT @query = @query + '
UNION
SELECT ItemID, Price, Size, Formula, ' + Formula + ' AS CalcValue FROM YourTable WHERE Formula = ''' + Formula + ''' '
FROM YourTable;
SET @query = STUFF(@query,1,8,'');
PRINT @query;
EXEC (@query);
SQLFiddle DEMO
But be aware of how error-prone this is: if any value in the Formula column is not a valid expression, the whole query breaks.
edit: going with UNION instead of UNION ALL because the same formula can appear in multiple rows
edit2: Plan B - Instead of running a bunch of identical SELECT queries and de-duplicating the results, it is better to select the distinct formulas at the beginning:
DECLARE @query NVARCHAR(MAX) = '';
WITH CTE_DistinctFormulas AS
(
SELECT DISTINCT Formula FROM YourTable
)
SELECT @query = @query + '
UNION ALL
SELECT ItemID, Price, Size, Formula, ' + Formula + ' AS CalcValue FROM YourTable WHERE Formula = ''' + Formula + ''' '
FROM CTE_DistinctFormulas;
SET @query = STUFF(@query,1,12,'');
PRINT @query;
EXEC (@query);
SQLFiddle DEMO 2 - added few more rows
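The warning above (an invalid Formula value breaks the whole query) can be illustrated outside the database. The following is only a sketch in Python, not part of the answer's T-SQL: the function name evaluate_formula and the sample formulas are hypothetical, and it assumes formulas use only PRICE, SIZE, numeric literals and the four basic arithmetic operators. Anything else is rejected before it is evaluated.

```python
import ast
import operator

# Only these binary operators are accepted (an assumption for this sketch).
_ALLOWED_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula, price, size):
    """Evaluate a formula string over PRICE and SIZE, rejecting anything else."""
    names = {"PRICE": price, "SIZE": size}

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Name) and node.id.upper() in names:
            return names[node.id.upper()]
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        raise ValueError("unsupported element in formula: %r" % formula)

    # ast.parse raises SyntaxError for malformed input such as "PRICE +".
    return walk(ast.parse(formula, mode="eval"))


if __name__ == "__main__":
    print(evaluate_formula("PRICE * SIZE", 10, 5))
    print(evaluate_formula("(PRICE + 2) / SIZE", 8, 5))
```

Validating each formula this way before building the dynamic statement is one possible way to catch a bad row early instead of having the whole batch fail.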
Another alternative, which is relatively easy to set up, is a CLR function. You can take advantage of the Compute method of DataTable to do the evaluation in a single line of C#.
[Microsoft.SqlServer.Server.SqlFunction]
public static double Evaluate(SqlString expression)
{
return double.Parse((new DataTable()).Compute(expression.ToString(), "").ToString());
}
Then add the assembly to SQL Server and create the wrapper function:
CREATE FUNCTION [dbo].[Evaluate](@expression [nvarchar](4000))
RETURNS [float] WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [YourAssemblyName].[YourClassName].[Evaluate]
GO
Now you can call the function as part of a simple select statement:
SELECT itemid, price, size, formula,
dbo.Evaluate(REPLACE(REPLACE(formula, 'PRICE', FORMAT(price,'0.00')),
'SIZE', FORMAT(size, '0'))) as calcvalue FROM YourTable
I did something similar to this, and if you stay within a small pool of operations it is not that hard. I used a design with X and Y columns and an operator, then wrote one large CASE expression that identifies the operator and applies the matching logic. You may have to change your structure slightly. The core of the problem is that SQL is a set-based engine, so anything where the operation itself has to be determined dynamically for each row is going to be slower. E.g.:
DECLARE @Formula TABLE
(
FormulaId INT IDENTITY
, FormulaName VARCHAR(128)
, Operator VARCHAR(4)
);
DECLARE @Values TABLE
(
ValueId INT IDENTITY
, FormulaId INT
, Price MONEY
, Size INT
)
INSERT INTO @Formula (FormulaName, Operator)
VALUES ('Simple Addition', '+'), ( 'Simple Subtraction', '-'), ('Simple Multiplication', '*'), ('Simple Division', '/'), ('Squared', '^2'), ('Grow by 20 percent then Multiply', '20%*')
INSERT INTO @Values (FormulaId, Price, Size)
VALUES (1, 10, 5),(2, 10, 5),(3, 10, 5),(4, 10, 5),(5, 10, 5),(6, 10, 5),(1, 16, 12),(6, 124, 254);
Select *
From @Values
SELECT
f.FormulaId
, f.FormulaName
, v.ValueId
, Price
, Operator
, Size
, CASE WHEN Operator = '+' THEN Price + Size
WHEN Operator = '-' THEN Price - Size
WHEN Operator = '*' THEN Price * Size
WHEN Operator = '/' THEN Price / Size
WHEN Operator = '^2' THEN Price * Price
WHEN Operator = '20%*' THEN (Price * 1.20) * Size
END AS Output
FROM @Values v
INNER JOIN @Formula f ON f.FormulaId = v.FormulaId
With this method my operation is really just a pointer to another table holding an operator, which for all intents and purposes is just a token I use in my CASE expression. You could even compound this: if you wanted to do multiple passes, you could add a 'Group' column and a 'Sequence' column and apply one operation after the other. It depends how complicated your 'formulas' become. If you get into more than 3 or 4 variables that change frequently, with frequent operator changes, then you probably would want dynamic SQL. But if it is just a few things, this should not be that hard. Keep in mind that the downside of this approach is that the set of operations is hard-coded, although the values fed into it are flexible, so the same formula can be applied over and over.
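The operator-token idea can also be sketched as a lookup table of functions. This is an illustrative Python version, not the author's code: the names OPERATIONS and apply_operator are hypothetical, the tokens are the ones from the sample data above, and plain floats stand in for the MONEY and INT columns, so rounding will not match SQL Server exactly.

```python
# Each token maps to a function of (price, size), mirroring the CASE branches.
OPERATIONS = {
    "+": lambda price, size: price + size,
    "-": lambda price, size: price - size,
    "*": lambda price, size: price * size,
    "/": lambda price, size: price / size,
    "^2": lambda price, size: price * price,
    "20%*": lambda price, size: (price * 1.20) * size,
}


def apply_operator(token, price, size):
    """Look up the token and apply it; unknown tokens yield None,
    much like a CASE expression with no matching WHEN returns NULL."""
    func = OPERATIONS.get(token)
    return None if func is None else func(price, size)


if __name__ == "__main__":
    for token in OPERATIONS:
        print(token, apply_operator(token, 10, 5))
```

As with the CASE expression, adding a new operation means changing the code, while the values passed in stay data-driven.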

Using Upper to Capitalize the first letter of City name

I am doing some data clean-up and need to Capitalize the first letter of City names. How do I capitalize the second word in a City Like Terra Bella.
SELECT UPPER(LEFT([MAIL CITY],1))+
LOWER(SUBSTRING([MAIL CITY],2,LEN([MAILCITY])))
FROM masterfeelisting
My result is 'Terra bella' and I need 'Terra Bella'. Thanks in advance.
OK, I know I answered this before, but it bugged me that we couldn't write something efficient to handle an unknown number of 'text segments'.
So, after re-thinking it and researching, I discovered a way to change the [MAILCITY] field into XML nodes where each 'text segment' is assigned its own node within the xml field. Then those xml fields can be processed node by node, concatenated together, and then changed back to a SQL varchar. It's convoluted, but it works. :)
Here's the code:
CREATE TABLE
#masterfeelisting (
[MAILCITY] varchar(max) not null
);
INSERT INTO #masterfeelisting VALUES
('terra bellA')
,(' terrA novA ')
,('chicagO ')
,('bostoN')
,('porT dE sanTo')
,(' porT dE sanTo pallo ');
SELECT
RTRIM
(
(SELECT
UPPER([xmlField].[xmlNode].value('.', 'char(1)')) +
LOWER(STUFF([xmlField].[xmlNode].value('.', 'varchar(max)'), 1, 1, '')) + ' '
FROM [xmlNodeRecordSet].[nodeField].nodes('/N') as [xmlField]([xmlNode]) FOR
xml path(''), type
).value('.', 'varchar(max)')
) as [MAILCITY]
FROM
(SELECT
CAST('<N>' + REPLACE([MAILCITY],' ','</N><N>')+'</N>' as xml) as [nodeField]
FROM #masterfeelisting
) as [xmlNodeRecordSet];
Drop table #masterfeelisting;
First I create a table and fill it with dummy values.
Now here is the beauty of the code:
For each record in #masterfeelisting, we are going to create an xml field with a node for each 'text segment'.
ie. '<N></N><N>terrA</N><N>novA</N><N></N>'
(This is built from the varchar ' terrA novA ')
1) The way this is done is by using the REPLACE function.
The string starts with a '<N>' to designate the beginning of the node. Then:
REPLACE([MAILCITY],' ','</N><N>')
This effectively goes through the whole [MAILCITY] string and replaces each
' ' with '</N><N>'
and then the string ends with a '</N>'. Where '</N>' designates the end of each node.
So now we have a beautiful XML string with a couple of empty nodes and the 'text segments' nicely nestled in their own node. All the 'spaces' have been removed.
2) Then we have to CAST the string into xml. And we will name that field [nodeField]. Now we can use xml functions on our newly created record set. (Conveniently named [xmlNodeRecordSet].)
3) Now we can read the [xmlNodeRecordSet] into the main sub-Select by stating:
FROM [xmlNodeRecordSet].[nodeField].nodes('/N')
This tells SQL Server to shred [nodeField] into one row per node matched by the XPath expression '/N', i.e. each top-level <N> element.
This table of node fields is then parsed by stating:
as [xmlField]([xmlNode]) FOR xml path(''), type
This means each [xmlField] will be parsed for each [xmlNode] in the xml string.
4) So in the main sub-select:
Each blank node '<N></N>' is discarded. (Or not processed.)
Each node with a 'text segment' in it will be parsed. ie <N>terrA</N>
UPPER([xmlField].[xmlNode].value('.', 'char(1)')) +
This code will grab each node out of the field, take its contents '.', and only grab the first character 'char(1)'. Then it will upper-case that character. The plus sign at the end concatenates this letter with the next bit of code:
LOWER(STUFF([xmlField].[xmlNode].value('.', 'varchar(max)'), 1, 1, ''))
Now here is the beauty... STUFF is a function that will take a string, from a position, for a length, and substitute another string.
STUFF(string, start position, length, replacement string)
So our string is:
[xmlField].[xmlNode].value('.', 'varchar(max)')
Which grabs the whole string inside the current node since it is 'varchar(max)'.
The start position is 1. The length is 1. And the replacement string is ''. This effectively strips off the first character by replacing it with nothing. So the remaining string is all the other characters that we want to have lower case. So that's what we do... we use LOWER to make them all lower case. And this result is concatenated to our first letter that we already upper cased.
But wait... we are not done yet... we still have to append a + ' '. Which adds a blank space after our nicely capitalized 'text segment'. Just in case there is another 'text segment' after this node is done.
This main sub-Select will now parse each node in our [xmlField] and concatenate them all nicely together.
5) But now that we have one big happy concatenation, we still have to change it back from an xml field to a SQL varchar field. So after the main sub-select we need:
.value('.', 'varchar(max)')
This changes our [MAILCITY] back to a SQL varchar.
6) But hold on... we are still not done. Remember we put an extra space at the end of each 'text segment'? Well, the last 'text segment' still has that extra space after it. So we need to right-trim that space off by using RTRIM.
7) And don't forget to alias the final field back to [MAILCITY].
8) And that's it. This code will take an unknown number of 'text segments' and format each one of them, all using the fun of XML and its node parsers.
Hope that helps :)
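For comparison, the same split, capitalize and re-join steps can be sketched in a few lines of Python. This is only an illustration of the logic walked through above, not a replacement for the T-SQL; the function name title_case_city is hypothetical.

```python
def title_case_city(city):
    """Split on whitespace (empty segments are dropped, like the blank
    <N></N> nodes), upper-case the first letter of each segment,
    lower-case the rest, and join the segments with single spaces."""
    return " ".join(word[:1].upper() + word[1:].lower() for word in city.split())


if __name__ == "__main__":
    for raw in ["terra bellA", " terrA novA ", "chicagO ", "porT dE sanTo"]:
        print(repr(raw), "->", repr(title_case_city(raw)))
```

Because the join only places separators between segments, there is no trailing space to trim, which is the job RTRIM does in the SQL version.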
Here's one way to handle this using APPLY. Note that this solution supports up to 3 substrings (e.g. "Phoenix", "New York", "New York City") but can easily be updated to handle more.
DECLARE @string varchar(100) = 'nEW yoRk ciTY';
WITH DELIMCOUNT(String, DC) AS
(
SELECT @string, LEN(RTRIM(LTRIM(@string)))-LEN(REPLACE(RTRIM(LTRIM(@string)),' ',''))
),
CIPOS AS
(
SELECT *
FROM DELIMCOUNT
CROSS APPLY (SELECT CHARINDEX(char(32), string, 1)) CI1(CI1)
CROSS APPLY (SELECT CHARINDEX(char(32), string, CI1.CI1+1)) CI2(CI2)
)
SELECT
OldString = @string,
NewString =
CASE DC
WHEN 0 THEN UPPER(SUBSTRING(string,1,1))+LOWER(SUBSTRING(string,2,8000))
WHEN 1 THEN UPPER(SUBSTRING(string,1,1))+LOWER(SUBSTRING(string,2,CI1-1)) +
UPPER(SUBSTRING(string,CI1+1,1))+LOWER(SUBSTRING(string,CI1+2,100))
WHEN 2 THEN UPPER(SUBSTRING(string,1,1))+LOWER(SUBSTRING(string,2,CI1-1)) +
UPPER(SUBSTRING(string,CI1+1,1))+LOWER(SUBSTRING(string,CI1+2,CI2-(CI1+1))) +
UPPER(SUBSTRING(string,CI2+1,1))+LOWER(SUBSTRING(string,CI2+2,100))
END
FROM CIPOS;
Results:
OldString NewString
--------------- --------------
nEW yoRk ciTY New York City
This will only capitalize the first letter of the second word. It is a shorter but less flexible approach. Replace @str with [Mail City].
DECLARE @str AS VARCHAR(50) = 'Los angelas'
SELECT STUFF(@str, CHARINDEX(' ', @str) + 1, 1, UPPER(SUBSTRING(@str, CHARINDEX(' ', @str) + 1, 1)));
This is a way to use nested SELECTs for city names with up to three parts.
It uses CHARINDEX to find the location of your separator character. (ie a space)
I put an 'if' structure around the Select to test if you have any records with more than 3 parts to the city name. If you ever get the warning message, you could add another sub-Select to handle another city part.
Although... just to be clear... SQL is not the best language to do complicated formatting. It was written as a data retrieval engine with the idea that another program will take that data and massage it into a friendlier look and feel. It may be easier to handle the formatting in another program. But if you insist on using SQL and you need to account for city names with 5 or more parts... you may want to consider using Cursors so you can loop through the variable possibilities. (But Cursors are not a good habit to get into. So don't do that unless you've exhausted all other options.)
Anyway, the following code creates and populates a table so you can test the code and see how it works. Enjoy!
CREATE TABLE
#masterfeelisting (
[MAILCITY] varchar(30) not null
);
Insert into #masterfeelisting select 'terra bella';
Insert into #masterfeelisting select ' terrA novA ';
Insert into #masterfeelisting select 'chicagO ';
Insert into #masterfeelisting select 'bostoN';
Insert into #masterfeelisting select 'porT dE sanTo';
--Insert into #masterfeelisting select ' porT dE sanTo pallo ';
Declare @intSpaceCount as integer;
SELECT @intSpaceCount = max (len(RTRIM(LTRIM([MAILCITY]))) - len(replace([MAILCITY],' ',''))) FROM #masterfeelisting;
if @intSpaceCount > 2
SELECT 'You need to account for more than 3 city name parts ' as Warning, @intSpaceCount as SpacesFound;
else
SELECT
cThird.[MAILCITY1] + cThird.[MAILCITY2] + cThird.[MAILCITY3] as [MAILCITY]
FROM
(SELECT
bSecond.[MAILCITY1] as [MAILCITY1]
,SUBSTRING(bSecond.[MAILCITY2],1,bSecond.[intCol2]) as [MAILCITY2]
,UPPER(SUBSTRING(bSecond.[MAILCITY2],bSecond.[intCol2] + 1, 1)) +
SUBSTRING(bSecond.[MAILCITY2],bSecond.[intCol2] + 2,LEN(bSecond.[MAILCITY2]) - bSecond.[intCol2]) as [MAILCITY3]
FROM
(SELECT
SUBSTRING(aFirst.[MAILCITY],1,aFirst.[intCol1]) as [MAILCITY1]
,UPPER(SUBSTRING(aFirst.[MAILCITY],aFirst.[intCol1] + 1, 1)) +
SUBSTRING(aFirst.[MAILCITY],aFirst.[intCol1] + 2,LEN(aFirst.[MAILCITY]) - aFirst.[intCol1]) as [MAILCITY2]
,CHARINDEX ( ' ', SUBSTRING(aFirst.[MAILCITY],aFirst.[intCol1] + 1, LEN(aFirst.[MAILCITY]) - aFirst.[intCol1]) ) as intCol2
FROM
(SELECT
UPPER (LEFT(RTRIM(LTRIM(mstr.[MAILCITY])),1)) +
LOWER(SUBSTRING(RTRIM(LTRIM(mstr.[MAILCITY])),2,LEN(RTRIM(LTRIM(mstr.[MAILCITY])))-1)) as [MAILCITY]
,CHARINDEX ( ' ', RTRIM(LTRIM(mstr.[MAILCITY]))) as intCol1
FROM
#masterfeelisting as mstr -- Initial Master Table
) as aFirst -- First Select Shell
) as bSecond -- Second Select Shell
) as cThird; -- Third Select Shell
Drop table #masterfeelisting;

SQL query with subquery and concatenating variable using CRM on-premise data

I am working on a report where I need to provide a summary of notes for particular "activities/tasks".
Since the activity can accept multiple notes, I have to search for all the notes related to that activity. I then order it by date (new to old), and concatenate them with some other strings as such:
[Tom Smith wrote on 9/23/2016 1:21 pm] Client was out of office, left message. [Jane Doe wrote on 9/21/2016 3:24 pm] Client called asking about pricing.
The data comes from replicated tables of our on-premise CRM system, and I'm using SQL Server 2012. The tables I'm using are: AnnotationBase (contains the notes), ActivityPointerBase (contains the activities/tasks), and SystemUserID (to lookup usernames). Due to Data mismatch, I have to do some converting of the data types so that I can concatenate them properly, so that's why there's a lot of CAST and CONVERT. In addition, not all Activities have a NoteText associated with them, and sometimes the NoteText field is NULL, so I have to catch and filter the NULLs out (or it'll break my concatenated string).
I have written the following query:
DECLARE @Notes VarChar(Max)
SELECT
( SELECT TOP 5 @Notes = COALESCE(@Notes+ ', ', '') + '[' + CONVERT(varchar(max), ISNULL(sUB.FullName, 'N/A')) + ' wrote on ' + CONVERT(varchar(10), CAST(Anno.ModifiedOn AS DATE), 101) + RIGHT(CONVERT(varchar(32),Anno.ModifiedOn,100),8) + '] ' + CONVERT(varchar(max), ISNULL(Anno.NoteText, '')) --+ CONVERT(varchar(max), CAST(ModifiedOn AS varchar(max)), 101)--+ CAST(ModifiedOn AS varchar(max))
FROM [CRM_rsd].[dbo].[AnnotationBase] AS Anno
LEFT OUTER JOIN [CRM_rsd].[dbo].[systemUserBase] AS sUB
ON Anno.ModifiedBy = sUB.SystemUserId
WHERE Anno.ObjectId = Task.ActivityId--'0B48AB28-C08F-419A-8D98-9916BDFFDE4C'
ORDER BY Anno.ModifiedOn DESC
SELECT LEFT(@Notes,LEN(@Notes)-1)
) AS Notes
,Task.*
FROM [CRM_rsd].[dbo].[ActivityPointerBase] AS Task
WHERE Task.Subject LIKE '%Project On Hold%'
I know the above method is probably not very efficient, but the list of "Projects On Hold" is rather small (less than 500), so performance isn't a priority. What is a priority is to be able to get a consolidated and concatenated list of notes for each activity. I have been searching all over the internet for a solution, and I have tried many different methods. But I get the following errors:
Msg 102, Level 15, State 1, Line 3
Incorrect syntax near '='.
Msg 102, Level 15, State 1, Line 10
Incorrect syntax near ')'.
I envision two possible solutions to my problem:
My subquery errors are fixed or
Create a "view" of just the concatenated NotesText, grouped by ActivityId (which would work as a key), and then just query from that.
Yet even though I'm pretty sure my ideas would work, I can't seem to figure out how to concatenate a column and group at the same time.
What you are trying to do is display the records from one table (in your case ActivityPointerBase) and add a calculated column that merges information from multiple records of another table (in your case AnnotationBase) into each row.
There are multiple ways to achieve this, and they differ in their performance impact:
Option 1: You could write a scalar function that receives the id of the task as a parameter, selects the top 5 records, concatenates them in a procedural fashion and returns a varchar(max).
Option 2: You could use a subquery in combination with the FOR XML clause.
SELECT
SUBSTRING(
CAST(
(SELECT TOP 5 ', ' +
'[' + CONVERT(varchar(max), ISNULL(FullName, 'N/A')) +
' wrote on ' +
CONVERT(varchar(10), CAST(ModifiedOn AS DATE), 101) +
RIGHT(CONVERT(varchar(32),ModifiedOn,100),8) + '] ' +
CONVERT(varchar(max), ISNULL(NoteText, ''))
FROM [CRM_rsd].[dbo].[AnnotationBase] AS Anno
LEFT OUTER JOIN [CRM_rsd].[dbo].[systemUserBase] AS sUB ON Anno.ModifiedBy = sUB.SystemUserId
WHERE Anno.ObjectId = Task.ActivityId
ORDER BY Anno.ModifiedOn DESC
FOR XML PATH(''),TYPE
) AS VARCHAR(MAX)
),3,99999) AS Notes
,Task.*
FROM [CRM_rsd].[dbo].[ActivityPointerBase] AS Task
WHERE Task.Subject LIKE '%Project On Hold%'
What happens here is that the construct inside the CAST() fetches the top 5 rows and makes SQL Server produce XML with no element names, which results in a concatenation of the element values, with a comma added as separator. Then we convert the XML to varchar(max) and remove the initial separator before the first record.
I prefer option 2; it will perform much better than using a scalar function.
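To make the intended result concrete, here is a small Python sketch of the "newest five notes per activity, joined into one string" logic. It is not CRM code: the function name summarize_notes and the tuple layout (author, modified_on, text) are assumptions made for illustration, and the date format only approximates the CONVERT styles used in the query.

```python
from datetime import datetime


def summarize_notes(notes, limit=5):
    """notes: iterable of (author, modified_on, text) tuples for ONE activity.
    Returns the newest `limit` notes, newest first, joined with ', '.
    Missing authors become 'N/A' and missing text becomes '' (like ISNULL)."""
    newest = sorted(notes, key=lambda note: note[1], reverse=True)[:limit]
    return ", ".join(
        "[{} wrote on {:%m/%d/%Y %I:%M %p}] {}".format(author or "N/A", when, text or "")
        for author, when, text in newest
    )


if __name__ == "__main__":
    sample = [
        ("Jane Doe", datetime(2016, 9, 21, 15, 24), "Client called asking about pricing."),
        ("Tom Smith", datetime(2016, 9, 23, 13, 21), "Client was out of office, left message."),
    ]
    print(summarize_notes(sample))
```

Joining with a separator, rather than prefixing every item and stripping the first separator afterwards, is why no SUBSTRING-style cleanup is needed here.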

Concatenate values based on ID

I have a table called Results and the data looks like:
Response_ID Label
12147 It was not clear
12458 Did not Undersstand
12458 Was not resolved
12458 Did not communicate
12586 Spoke too fast
12587 Too slow
Now I want the output to display one row per ID, with the values from Label concatenated and separated by commas.
My output should look like:
Response_ID Label
12147 It was not clear
12458 Did not Undersstand,Was not resolved,Did not communicate
12586 Spoke too fast
12587 Too Slow
How can I do this?
You cannot be sure about the order of the concatenated strings without an ORDER BY clause in the subquery. The .value('.', 'varchar(max)') part is there to handle the case where Label contains XML-unfriendly characters like &.
declare @T table(Response_ID int, Label varchar(50))
insert into @T values
(12147, 'It was not clear'),
(12458, 'Did not Undersstand'),
(12458, 'Was not resolved'),
(12458, 'Did not communicate'),
(12586, 'Spoke too fast'),
(12587, 'Too slow')
select T1.Response_ID,
stuff((select ','+T2.Label
from @T as T2
where T1.Response_ID = T2.Response_ID
for xml path(''), type).value('.', 'varchar(max)'), 1, 1, '') as Label
from @T as T1
group by T1.Response_ID
Check the link below, it approaches your problem with many different solutions
http://www.simple-talk.com/sql/t-sql-programming/concatenating-row-values-in-transact-sql/
Given this sample data:
CREATE TABLE #Results(Response_ID int, Label varchar(80));
INSERT #Results(Response_ID, Label) VALUES
(12147, 'It was not clear'),
(12458, 'Did not Undersstand'),
(12458, 'Was not resolved'),
(12458, 'Did not communicate'),
(12586, 'Spoke too fast'),
(12587, 'Too slow');
On older versions you can use FOR XML PATH for (grouped) string aggregation:
SELECT r.Response_ID, Label = STUFF((SELECT ',' + Label
FROM #Results WHERE Response_ID = r.Response_ID
FOR XML PATH(''), TYPE).value(N'./text()[1]', N'varchar(max)'), 1, 1, '')
FROM #Results AS r
GROUP BY r.Response_ID;
If you are on SQL Server 2017 or greater, the query is much simpler:
SELECT r.Response_ID, Label = STRING_AGG(Label, ',')
FROM #Results AS r
GROUP BY r.Response_ID;
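For readers who want to check the expected output independently of SQL Server, the grouping and joining that STRING_AGG (or the FOR XML PATH workaround) performs can be sketched in Python. The function name concat_labels is hypothetical; the rows are the sample data from the question.

```python
def concat_labels(rows, separator=","):
    """Group (response_id, label) pairs by id and join the labels.
    Labels keep the order in which they appear in `rows`."""
    grouped = {}
    for response_id, label in rows:
        grouped.setdefault(response_id, []).append(label)
    return {rid: separator.join(labels) for rid, labels in grouped.items()}


if __name__ == "__main__":
    rows = [
        (12147, "It was not clear"),
        (12458, "Did not Undersstand"),
        (12458, "Was not resolved"),
        (12458, "Did not communicate"),
        (12586, "Spoke too fast"),
        (12587, "Too slow"),
    ]
    for rid, labels in concat_labels(rows).items():
        print(rid, labels)
```

Unlike this sketch, the SQL versions give no ordering guarantee within a group unless an ordering is requested explicitly.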
Consider this, it is very performant:
http://jerrytech.blogspot.com/2010/04/tsql-concatenate-strings-1-2-3-and.html
Avoid XML functions because they are not performant.
This will take some effort to implement, but millions of rows => milliseconds to run.

Obfuscate / Mask / Scramble personal information

I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that make random social security numbers, shift birth dates, scramble emails, etc. But I've come up against a wall trying to scramble customer names. I want to keep real names so we can still use our searches, so random letter generation is out. What I have tried so far is building a temp table of all last names in the table, then updating the customer table with a random selection from the temp table. Like this:
DECLARE @Names TABLE (Id int IDENTITY(1,1),[Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT @Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM @Names WHERE ROWID=Id)
This worked well in test, but completely bogs down dealing with larger amounts of data (>20 minutes for 40K rows)
All of that to ask, how would you scramble customer names while keeping real names and the weight of the production data?
UPDATE: Never fails, you try to put all the information in the post, and you forget something important. This data will also be used in our sales & demo environments which are publicly available. Some of the answers are what I am attempting to do, to 'switch' the names, but my question is literally, how to code in T-SQL?
I use generatedata. It is an open source PHP script which can generate all sorts of dummy data.
A very simple solution would be to ROT13 the text.
A better question may be why you feel the need to scramble the data? If you have an encryption key, you could also consider running the text through DES or AES or similar. Those would have potential performance issues, however.
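As an illustration of the ROT13 suggestion, here is a minimal Python sketch using the standard library's rot_13 codec. The helper name rot13 is hypothetical. Note that ROT13 is trivially reversible, so it only disguises names and does not protect them.

```python
import codecs


def rot13(text):
    """Rotate each ASCII letter by 13 places; other characters are unchanged."""
    return codecs.encode(text, "rot_13")


if __name__ == "__main__":
    scrambled = rot13("Smith")
    print(scrambled)          # scrambled form
    print(rot13(scrambled))   # applying it twice restores the original
```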
When doing something like that I usually write a small program that first loads a lot of names and surnames in two arrays, and then just updates the database using random name/surname from arrays. It works really fast even for very big datasets (200.000+ records)
I use a method that changes characters in the name to other characters that are in the same "range" of usage frequency in English names. Apparently, the distribution of characters in names is different than it is for normal conversational English. For example, "x" and "z" occur 0.245% of the time, so they get swapped. At the other extreme, "w" is used 5.5% of the time, "s" 6.86% and "t" 15.978%. I change "s" to "w", "t" to "s" and "w" to "t".
I keep the vowels "aeio" in a separate group so that a vowel is only replaced by another vowel. Similarly, "q", "u" and "y" are not replaced at all. My grouping and decisions are totally subjective.
I ended up with 7 different "groups" of 2-5 characters, based mostly on frequency. Characters within each group are swapped with other characters in that same group.
The net result is names that kinda look like they might be names, but from "not around here".
Original name Morphed name
Loren Nimag
Juanita Kuogewso
Tennyson Saggywig
David Mijsm
Julie Kunewa
Here's the SQL I use, which includes a "TitleCase" function. There are 2 different versions of the "morphed" name based on different frequencies of letters I found on the web.
-- from https://stackoverflow.com/a/28712621
-- Convert and return param as Title Case
CREATE FUNCTION [dbo].[fnConvert_TitleCase] (@InputString VARCHAR(4000) )
RETURNS VARCHAR(4000) AS
BEGIN
DECLARE @Index INT
DECLARE @Char CHAR(1)
DECLARE @OutputString VARCHAR(4000)
SET @OutputString = LOWER(@InputString)
SET @Index = 2
SET @OutputString = STUFF(@OutputString, 1, 1,UPPER(SUBSTRING(@InputString,1,1)))
WHILE @Index <= LEN(@InputString)
BEGIN
SET @Char = SUBSTRING(@InputString, @Index, 1)
IF @Char IN (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&','''','(','{','[','#')
IF @Index + 1 <= LEN(@InputString)
BEGIN
IF @Char != '''' OR UPPER(SUBSTRING(@InputString, @Index + 1, 1)) != 'S'
SET @OutputString = STUFF(@OutputString, @Index + 1, 1,UPPER(SUBSTRING(@InputString, @Index + 1, 1)))
END
SET @Index = @Index + 1
END
RETURN ISNULL(@OutputString,'')
END
Go
-- 00.045 x 0.045%
-- 00.045 z 0.045%
--
-- Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z')
--
-- 00.456 k 0.456%
-- 00.511 j 0.511%
-- 00.824 v 0.824%
-- kjv
-- Replace(Replace(Replace(Replace(TS_NAME,'k','#'),'j','k'),'v','j'),'#','v')
--
-- 01.642 g 1.642%
-- 02.284 n 2.284%
-- 02.415 l 2.415%
-- gnl
-- Replace(Replace(Replace(Replace(TS_NAME,'g','#'),'n','g'),'l','n'),'#','l')
--
-- 02.826 r 2.826%
-- 03.174 d 3.174%
-- 03.826 m 3.826%
-- rdm
-- Replace(Replace(Replace(Replace(TS_NAME,'r','#'),'d','r'),'m','d'),'#','m')
--
-- 04.027 f 4.027%
-- 04.200 h 4.200%
-- 04.319 p 4.319%
-- 04.434 b 4.434%
-- 05.238 c 5.238%
-- fhpbc
-- Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c')
--
-- 05.497 w 5.497%
-- 06.686 s 6.686%
-- 15.978 t 15.978%
-- wst
-- Replace(Replace(Replace(Replace(TS_NAME,'w','#'),'s','w'),'t','s'),'#','t')
--
--
-- 02.799 e 2.799%
-- 07.294 i 7.294%
-- 07.631 o 7.631%
-- 11.682 a 11.682%
-- eioa
-- Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')
--
-- -- dont replace
-- 00.222 q 0.222%
-- 00.763 y 0.763%
-- 01.183 u 1.183%
-- Obfuscate a name
Select
ts_id,
Cast(ts_name as varchar(42)) as [Original Name],
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z'),'k','#'),'j','k'),'v','j'),'#','v'),'g','#'),'n','g'),'l','n'),'#','l'),'r','#'),'d','r'),'m','d'),'#','m'),'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c'),'w','#'),'s','w'),'t','s'),'#','t'),'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')) as VarChar(42)) As [morphed name] ,
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','t'),'~','e'),'t','~'),'a','o'),'~','a'),'o','~'),'i','n'),'~','i'),'n','~'),'s','h'),'~','s'),'h','r'),'r','~'),'d','l'),'~','d'),'l','~'),'m','w'),'~','m'),'w','f'),'f','~'),'g','y'),'~','g'),'y','p'),'p','~'),'b','v'),'~','b'),'v','k'),'k','~'),'x','~'),'j','x'),'~','j')) as VarChar(42)) As [morphed name2]
From
ts_users
;
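The first chain of nested REPLACE calls rotates letters within each frequency group, using '#' as a temporary placeholder. The same mapping can be sketched in Python as a lookup table, which makes the rotation easier to read. This is an illustration only: the names GROUPS, build_mapping and morph_name are hypothetical, it covers the first "morphed name" variant, and it handles single-word names (the SQL version title-cases via fnConvert_TitleCase).

```python
# Consonant groups from the comments above; each letter is replaced by the
# previous letter of its group, wrapping around (k -> v, j -> k, v -> j, ...).
GROUPS = ["xz", "kjv", "gnl", "rdm", "fhpbc", "wst"]


def build_mapping():
    mapping = {}
    for group in GROUPS:
        for index, letter in enumerate(group):
            mapping[letter] = group[index - 1]
    # Vowel group as written in the SQL, including the 'i' -> 'ew' replacement.
    mapping.update({"e": "a", "i": "ew", "o": "i", "a": "o"})
    return mapping


MAPPING = build_mapping()


def morph_name(name):
    """Replace every letter via the lookup table; q, u and y stay unchanged."""
    morphed = "".join(MAPPING.get(letter, letter) for letter in name.lower())
    return morphed.capitalize()


if __name__ == "__main__":
    for original in ["Loren", "Juanita", "Tennyson", "Julie"]:
        print(original, "->", morph_name(original))
```

A single-pass lookup avoids the placeholder character entirely, since no letter is ever replaced twice.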
Why not just use some sort of Random Name Generator?
Use a temporary table instead and the query is very fast. I just ran on 60K rows in 4 seconds. I'll be using this one going forward.
CREATE TABLE #Names
(Id int IDENTITY(1,1),[Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT #Names
SELECT LastName
FROM Customer
ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID]
SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id)
DROP TABLE #Names
The following approach worked for us. Let's say we have 2 tables, Customers and Products:
CREATE FUNCTION [dbo].[GenerateDummyValues]
(
@dataType varchar(100),
@currentValue varchar(4000)=NULL
)
RETURNS varchar(4000)
AS
BEGIN
IF @dataType = 'int'
BEGIN
Return '0'
END
ELSE IF @dataType = 'varchar' OR @dataType = 'nvarchar' OR @dataType = 'char' OR @dataType = 'nchar'
BEGIN
Return 'AAAA'
END
ELSE IF @dataType = 'datetime'
BEGIN
Return Convert(varchar(2000),GetDate())
END
-- you can add more checks, add complicated logic etc
Return 'XXX'
END
The above function will help in generating different data based on the data type coming in.
Now, for each column of each table which does not have word "id" in it, use following query to generate further queries to manipulate the data:
select 'select ''update '' + TABLE_NAME + '' set '' + COLUMN_NAME + '' = '' + '''''''' + dbo.GenerateDummyValues( Data_type,'''') + '''''' where id = '' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, ' + table_name + ' where RIGHT(LOWER(COLUMN_NAME),2) <> ''id'' and TABLE_NAME = '''+ table_name + '''' + ';' from INFORMATION_SCHEMA.TABLES;
When you execute above query it will generate update queries for each table and for each column of that table, for example:
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Customers where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Customers';
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Products where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Products';
Now, when you execute above queries you will get final update queries, that will update the data of your tables.
You can execute this on any SQL Server database, no matter how many tables you have; it will generate queries for you that can then be executed.
Hope this helps.
Another site to generate shaped fake data sets, with an option for T-SQL output:
https://mockaroo.com/
Here's a way using ROT47, which is reversible, and another which is random. You can add a PK to either to link back to the "unscrambled" versions.
declare @table table (ID int, PLAIN_TEXT nvarchar(4000))
insert into @table
values
(1,N'Some Dudes name'),
(2,N'Another Person Name'),
(3,N'Yet Another Name')
--split your string into a column, and compute the decimal value (N)
if object_id('tempdb..#staging') is not null drop table #staging
select
substring(a.b, v.number+1, 1) as Val
,ascii(substring(a.b, v.number+1, 1)) as N
--,dense_rank() over (order by b) as RN
,a.ID
into #staging
from (select PLAIN_TEXT b, ID FROM #table) a
inner join
master..spt_values v on v.number < len(a.b)
where v.type = 'P'
--select * from #staging
--create a fast tally table of numbers to be used to build the ROT-47 table.
;WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--Here we put it all together with stuff and FOR XML
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,case
when 47 + N <= 126 then char(47 + N)
when 47 + N > 126 then char(N-47)
end as ENCRYPTED_TEXT
from cteTally
where N between 33 and 126) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from #table t
--or if you want really random
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,char((select ROUND(((122 - N -1) * RAND() + N), 0))) as ENCRYPTED_TEXT
from cteTally
where (N between 65 and 122) and N not in (91,92,93,94,95,96)) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from #table t
Encountered the same problem myself and figured out an alternative solution that may work for others.
The idea is to use MD5 on the name and then take the last 3 hex digits of it to map into a table of names. You can do this separately for first name and last name.
3 hex digits represent decimals from 0 to 4095, so we need a list of 4096 first names and 4096 last names.
So conv(substr(md5(first_name), -3), 16, 10) (in MySQL syntax; the negative position takes the last 3 characters) would be an index from 0 to 4095 that can be joined with a table holding 4096 first names. The same concept applies to last names.
Using MD5 (as opposed to a random number) guarantees a name in the original data will always be mapped to the same name in the test data.
You can get a list of names here:
https://gist.github.com/elifiner/cc90fdd387449158829515782936a9a4
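A minimal sketch of that join, in MySQL syntax. The table and column names (people, fake_first_names, idx, name) are hypothetical and not from the original answer; fake_first_names is assumed to hold exactly one row for each index from 0 to 4095.

```sql
-- Hypothetical schema, for illustration only:
--   people(id, first_name)
--   fake_first_names(idx, name)   -- idx covers 0..4095, one name per idx
SELECT p.id,
       f.name AS fake_first_name
FROM people AS p
JOIN fake_first_names AS f
  -- last 3 hex digits of the MD5 hash, converted from base 16 to base 10
  ON f.idx = CAST(CONV(RIGHT(MD5(p.first_name), 3), 16, 10) AS UNSIGNED);
```

Two things to keep in mind with this sketch: rows with a NULL first_name produce a NULL hash and drop out of the inner join, and with only 4096 buckets, different real names can land on the same fake name.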
I am working on this at my company right now -- and it turns out to be a very tricky thing. You want to have names that are realistic, but must not reveal any real personal info.
My approach has been to first create a randomized "mapping" of last names to other last names, then use that mapping to change all last names. This is good if you have duplicate name records. Suppose you have 2 "John Smith" records that both represent the same real person. If you changed one record to "John Adams" and the other to "John Best", then your one "person" now has 2 different names! With a mapping, all occurrences of "Smith" get changed to "Jones", and so duplicates ( or even family members ) still end up with the same last name, keeping the data more "realistic".
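One way such a mapping could be sketched in T-SQL is shown below. It assumes a hypothetical person(personID, lastName) table and, as a simplification, draws the replacement names from the existing last names by shuffling them; a separate list of fake names could be substituted for the shuffled side.

```sql
-- Assumption: person(personID, lastName); #lastNameMap is a hypothetical name.
-- Pair every distinct last name with a randomly chosen distinct last name.
WITH originals AS (
    SELECT lastName, ROW_NUMBER() OVER (ORDER BY lastName) AS rn
    FROM (SELECT DISTINCT lastName FROM person) d
),
shuffled AS (
    SELECT lastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM (SELECT DISTINCT lastName FROM person) d
)
SELECT o.lastName AS oldName, s.lastName AS newName
INTO #lastNameMap
FROM originals o
JOIN shuffled s ON s.rn = o.rn;

-- Apply the mapping: every occurrence of a name gets the same replacement.
UPDATE p
SET p.lastName = m.newName
FROM person p
JOIN #lastNameMap m ON m.oldName = p.lastName;
```

Because the update is a single set-based statement, a chain such as Smith to Jones and Jones to Best is applied in one pass rather than cascading. A name can occasionally map to itself, so the mapping table is worth checking before it is applied.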
I will also have to scramble the addresses, phone numbers, bank account numbers, etc...and I am not sure how I will approach those. Keeping the data "realistic" while scrambling is certainly a deep topic. This must have been done many times by many companies -- who has done this before? What did you learn?
Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.
Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.
I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.
If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random switch. Rather than build a temp table, run switches between each location and a random row in the table, swapping one piece of data at a time.
The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.
It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.
Edit (2): Off-the-cuff answer in T-SQL:
declare @name varchar(50)
-- <random id number> and <person id of current row> are placeholders to fill in
set @name = (SELECT lastName from person where personID = <random id number>)
Update person
set lastName = @name
WHERE personID = <person id of current row>
Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.
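If a loop is not required, a similar random rearrangement can be sketched as one set-based statement. Note that this is not Fisher-Yates itself: it swaps in a random ordering by NEWID() to hand each row a last name taken from a randomly chosen row of the same table. The person table and its personID and lastName columns come from the snippet above; the CTE names are made up, and personID is assumed to be unique.

```sql
-- Number the rows once in a fixed order and once in a random order,
-- then give the n-th row the last name of the n-th randomly ordered row.
WITH targets AS (
    SELECT personID, ROW_NUMBER() OVER (ORDER BY personID) AS rn
    FROM person
),
donors AS (
    SELECT lastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM person
)
UPDATE p
SET p.lastName = d.lastName
FROM person AS p
JOIN targets AS t ON t.personID = p.personID
JOIN donors AS d ON d.rn = t.rn;
```

Every last name is used exactly once, so the table ends up with the same set of names as before, only reassigned to different rows.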