I am fluent in SQL but new to using the SQL Geometry features. I have what is probably a very basic problem to solve, but I haven't found any good resources online that explain how to use geometry objects. (Technet is a lousy way to learn new things...)
I have a collection of 2d points on a Cartesian plane, and I am trying to find all points that are within a collection of radii.
I created and populated a table using syntax like:
Update [Things] set [Location] = geometry::Point(#X, #Y, 0)
(#X,#Y are just the x and y values, 0 is an arbitrary number shared by all objects that allows set filtering if I understand correctly)
Here is where I go off the rails...Do I try to construct some sort of polygon collection and query using that, or is there some simple way of checking for intersection of multiple radii without building a bunch of circular polygons?
Addendum: If nobody has the answer to the multiple radii question, what is the single radius solution?
UPDATE
Here are some examples I have worked up, using an imaginary star database where stars are stored on a x-y grid as points:
Selects all points in a box:
DECLARE #polygon geometry = geometry::STGeomFromText('POLYGON(('
+ CAST(#MinX AS VARCHAR(10)) + ' ' + CAST(#MinY AS VARCHAR(10)) + ','
+ CAST(#MaxX AS VARCHAR(10)) + ' ' + CAST(#MinY AS VARCHAR(10)) + ', '
+ CAST(#MaxX AS VARCHAR(10)) + ' ' + CAST(#MaxY AS VARCHAR(10)) + ','
+ CAST(#MinX AS VARCHAR(10)) + ' ' + CAST(#MaxY AS VARCHAR(10)) + ','
+ CAST(#MinX AS VARCHAR(10)) + ' ' + CAST(#MinY AS VARCHAR(10)) + '))', 0);
SELECT [Star].[Name] AS [StarName],
[Star].[StarTypeId] AS [StarTypeId],
FROM [Star]
WHERE #polygon.STContains([Star].[Location]) = 1
using this as a pattern, you can do all sorts of interesting things, such as
defining multiple polygons:
WHERE #polygon1.STContains([Star].[Location]) = 1
OR #polygon2.STContains([Star].[Location]) = 1
OR #polygon3.STContains([Star].[Location]) = 1
Or checking distance:
WHERE [Star].[Location].STDistance(#polygon1) < #SomeDistance
Sample insert statement
INSERT [Star]
(
[Name],
[StarTypeId],
[Location],
)
VALUES
(
#Name,
#StarTypeId,
GEOMETRY::Point(#LocationX, #LocationY, 0),
)
This is an incredibly late answer, but perhaps I can shed some light on a solution. The "set" number you refer to is a Spatial Reference Indentifier or SRID. For lat/long calculations you should consider setting this to 4326, which will ensure metres are used as a unit of measurement. You should also consider switching to SqlGeography rather than SqlGeometry, but we'll continue with SqlGeometry for now. To bulk set the SRID, you can update your table as follows:
UPDATE [YourTable] SET [SpatialColumn] = GEOMETRY.STPointFromText([SpatialColumn].STAsText(), 4326);
For a single radius, you need to create a radii as a spatial object. For example:
DECLARE #radiusInMeters FLOAT = 1000; -- Set to a number in meters
DECLARE #radius GEOMETRY = GEOMETRY::Point(#x, #y, 4326).STBuffer(#radiusInMeters);
STBuffer() takes the spatial point and creates a circle (now a Polygon type) from it. You can then query your data set as follows:
SELECT * FROM [YourTable] WHERE [SpatialColumn].STIntersects(#radius);
The above will now use any Spatial Index you have created on the [SpatialColumn] in its query plan.
There is also a simpler option which will work (and still use a spatial index). The STDistance method allows you to do the following:
DECLARE #radius GEOMETRY = GEOMETRY::Point(#x, #y, 4326);
DECLARE #distance FLOAT = 1000; -- A distance in metres
SELECT * FROM [YourTable] WHERE [SpatialColumn].STDistance(#radius) <= #distance;
Lastly, working with a collection of radii. You have a few options. The first is to run the above for each radii in turn, but I would consider the following to do it as one:
DECLARE #radiiCollection TABLE
(
[RadiusInMetres] FLOAT,
[Radius] GEOMETRY
)
INSERT INTO #radiiCollection ([RadiusInMetres], [Radius]) VALUES (1000, GEOMETRY::Point(#xValue, #yValue, 4326).STBuffer(1000));
-- Repeat for other radii
SELECT
X.[Id],
MIN(R.[RadiusInMetres]) AS [WithinRadiusDistance]
FROM
[YourTable] X
JOIN
#radiiCollection RC ON RC.[Radius].STIntersects(X.[SpatialColumn])
GROUP BY
X.[IdColumn],
R.[RadiusInMetres]
DROP TABLE #radiiCollection;
The final above has not been tested, but I'm 99% sure it's just about there with a small amount of tweaking being a possibility. The ideal of taking the min radius distance in the select is that if the multiple radii stem from a single location, if a point is within the first radius, it will naturally be within all of the others. You'll therefore duplicate the record, but by grouping and then selecting the min, you get only one (and the closest).
Hope it helps, albeit 4 weeks after you asked the question. Sorry I didn't see it sooner, if only there was only one spatial tag for questions!!!!
Sure, this is possible. The individual where clause should be something like:
DIM #Center AS Location
-- Initialize the location here, you probably know better how to do that than I.
Dim #Radius AS Decimal(10, 2)
SELECT * from pointTable WHERE sqrt(square(#Center.STX-Location.STX)+square(#Center.STX-Location.STX)) > #Radius
You can then pile a bunch of radii and xy points into a table variable that looks like like:
Dim #MyCircleTable AS Table(Geometry Circle)
INSERT INTO #MyCircleTable (.........)
Note: I have not put this through a compiler, but this is the bare bones of a working solution.
Other option looks to be here:
http://technet.microsoft.com/en-us/library/bb933904.aspx
And there's a demo of seemingly working syntax here:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/6e1d7af4-ecc2-4d82-b069-f2517c3276c2/slow-spatial-predicates-stcontains-stintersects-stwithin-?forum=sqlspatial
The second post implies the syntax:
SELECT Distinct pointTable.* from pointTable pt, circletable crcs
WHERE crcs.geom.STContains(b.Location) = 1
I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that make random social security numbers, shift birth dates, scramble emails, etc. But I've come up against a wall trying to scramble customer names. I want to keep real names so we can still use or searches so random letter generation is out. What I have tried so far is building a temp table of all last names in the table then updating the customer table with a random selection from the temp table. Like this:
DECLARE #Names TABLE (Id int IDENTITY(1,1),[Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT #Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id)
This worked well in test, but completely bogs down dealing with larger amounts of data (>20 minutes for 40K rows)
All of that to ask, how would you scramble customer names while keeping real names and the weight of the production data?
UPDATE: Never fails, you try to put all the information in the post, and you forget something important. This data will also be used in our sales & demo environments which are publicly available. Some of the answers are what I am attempting to do, to 'switch' the names, but my question is literally, how to code in T-SQL?
I use generatedata. It is an open source php script which can generate all sorts of dummy data.
A very simple solution would be to ROT13 the text.
A better question may be why you feel the need to scramble the data? If you have an encryption key, you could also consider running the text through DES or AES or similar. Thos would have potential performance issues, however.
When doing something like that I usually write a small program that first loads a lot of names and surnames in two arrays, and then just updates the database using random name/surname from arrays. It works really fast even for very big datasets (200.000+ records)
I use a method that changes characters in the name to other characters that are in the same "range" of usage frequency in English names. Apparently, the distribution of characters in names is different than it is for normal conversational English. For example, "x" and "z" occur 0.245% of the time, so they get swapped. The the other extreme, "w" is used 5.5% of the time, "s" 6.86% and "t", 15.978%. I change "s" to "w", "t" to "s" and "w" to "t".
I keep the vowels "aeio" in a separate group so that a vowel is only replaced by another vowel. Similarly, "q", "u" and "y" are not replaced at all. My grouping and decisions are totally subjective.
I ended up with 7 different "groups" of 2-5 characters , based mostly on frequency. characters within each group are swapped with other chars in that same group.
The net result is names that kinda look like the might be names, but from "not around here".
Original name Morphed name
Loren Nimag
Juanita Kuogewso
Tennyson Saggywig
David Mijsm
Julie Kunewa
Here's the SQL I use, which includes a "TitleCase" function. There are 2 different versions of the "morphed" name based on different frequencies of letters I found on the web.
-- from https://stackoverflow.com/a/28712621
-- Convert and return param as Title Case
CREATE FUNCTION [dbo].[fnConvert_TitleCase] (#InputString VARCHAR(4000) )
RETURNS VARCHAR(4000)AS
BEGIN
DECLARE #Index INT
DECLARE #Char CHAR(1)
DECLARE #OutputString VARCHAR(255)
SET #OutputString = LOWER(#InputString)
SET #Index = 2
SET #OutputString = STUFF(#OutputString, 1, 1,UPPER(SUBSTRING(#InputString,1,1)))
WHILE #Index <= LEN(#InputString)
BEGIN
SET #Char = SUBSTRING(#InputString, #Index, 1)
IF #Char IN (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&','''','(','{','[','#')
IF #Index + 1 <= LEN(#InputString)
BEGIN
IF #Char != '''' OR UPPER(SUBSTRING(#InputString, #Index + 1, 1)) != 'S'
SET #OutputString = STUFF(#OutputString, #Index + 1, 1,UPPER(SUBSTRING(#InputString, #Index + 1, 1)))
END
SET #Index = #Index + 1
END
RETURN ISNULL(#OutputString,'')
END
Go
-- 00.045 x 0.045%
-- 00.045 z 0.045%
--
-- Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z')
--
-- 00.456 k 0.456%
-- 00.511 j 0.511%
-- 00.824 v 0.824%
-- kjv
-- Replace(Replace(Replace(Replace(TS_NAME,'k','#'),'j','k'),'v','j'),'#','v')
--
-- 01.642 g 1.642%
-- 02.284 n 2.284%
-- 02.415 l 2.415%
-- gnl
-- Replace(Replace(Replace(Replace(TS_NAME,'g','#'),'n','g'),'l','n'),'#','l')
--
-- 02.826 r 2.826%
-- 03.174 d 3.174%
-- 03.826 m 3.826%
-- rdm
-- Replace(Replace(Replace(Replace(TS_NAME,'r','#'),'d','r'),'m','d'),'#','m')
--
-- 04.027 f 4.027%
-- 04.200 h 4.200%
-- 04.319 p 4.319%
-- 04.434 b 4.434%
-- 05.238 c 5.238%
-- fhpbc
-- Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c')
--
-- 05.497 w 5.497%
-- 06.686 s 6.686%
-- 15.978 t 15.978%
-- wst
-- Replace(Replace(Replace(Replace(TS_NAME,'w','#'),'s','w'),'t','s'),'#','t')
--
--
-- 02.799 e 2.799%
-- 07.294 i 7.294%
-- 07.631 o 7.631%
-- 11.682 a 11.682%
-- eioa
-- Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')
--
-- -- dont replace
-- 00.222 q 0.222%
-- 00.763 y 0.763%
-- 01.183 u 1.183%
-- Obfuscate a name
Select
ts_id,
Cast(ts_name as varchar(42)) as [Original Name]
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z'),'k','#'),'j','k'),'v','j'),'#','v'),'g','#'),'n','g'),'l','n'),'#','l'),'r','#'),'d','r'),'m','d'),'#','m'),'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c'),'w','#'),'s','w'),'t','s'),'#','t'),'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')) as VarChar(42)) As [morphed name] ,
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','t'),'~','e'),'t','~'),'a','o'),'~','a'),'o','~'),'i','n'),'~','i'),'n','~'),'s','h'),'~','s'),'h','r'),'r','~'),'d','l'),'~','d'),'l','~'),'m','w'),'~','m'),'w','f'),'f','~'),'g','y'),'~','g'),'y','p'),'p','~'),'b','v'),'~','b'),'v','k'),'k','~'),'x','~'),'j','x'),'~','j')) as VarChar(42)) As [morphed name2]
From
ts_users
;
Why not just use some sort of Random Name Generator?
Use a temporary table instead and the query is very fast. I just ran on 60K rows in 4 seconds. I'll be using this one going forward.
DECLARE TABLE #Names
(Id int IDENTITY(1,1),[Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT #Names
SELECT LastName
FROM Customer
ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID]
SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id)
DROP TABLE #Names
The following approach worked for us, lets say we have 2 tables Customers and Products:
CREATE FUNCTION [dbo].[GenerateDummyValues]
(
#dataType varchar(100),
#currentValue varchar(4000)=NULL
)
RETURNS varchar(4000)
AS
BEGIN
IF #dataType = 'int'
BEGIN
Return '0'
END
ELSE IF #dataType = 'varchar' OR #dataType = 'nvarchar' OR #dataType = 'char' OR #dataType = 'nchar'
BEGIN
Return 'AAAA'
END
ELSE IF #dataType = 'datetime'
BEGIN
Return Convert(varchar(2000),GetDate())
END
-- you can add more checks, add complicated logic etc
Return 'XXX'
END
The above function will help in generating different data based on the data type coming in.
Now, for each column of each table which does not have word "id" in it, use following query to generate further queries to manipulate the data:
select 'select ''update '' + TABLE_NAME + '' set '' + COLUMN_NAME + '' = '' + '''''''' + dbo.GenerateDummyValues( Data_type,'''') + '''''' where id = '' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, ' + table_name + ' where RIGHT(LOWER(COLUMN_NAME),2) <> ''id'' and TABLE_NAME = '''+ table_name + '''' + ';' from INFORMATION_SCHEMA.TABLES;
When you execute above query it will generate update queries for each table and for each column of that table, for example:
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Customers where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Customers';
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Products where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Products';
Now, when you execute above queries you will get final update queries, that will update the data of your tables.
You can execute this on any SQL server database, no matter how many tables do you have, it will generate queries for you that can be further executed.
Hope this helps.
Another site to generate shaped fake data sets, with an option for T-SQL output:
https://mockaroo.com/
Here's a way using ROT47 which is reversible, and another which is random. You can add a PK to either to link back to the "un scrambled" versions
declare #table table (ID int, PLAIN_TEXT nvarchar(4000))
insert into #table
values
(1,N'Some Dudes name'),
(2,N'Another Person Name'),
(3,N'Yet Another Name')
--split your string into a column, and compute the decimal value (N)
if object_id('tempdb..#staging') is not null drop table #staging
select
substring(a.b, v.number+1, 1) as Val
,ascii(substring(a.b, v.number+1, 1)) as N
--,dense_rank() over (order by b) as RN
,a.ID
into #staging
from (select PLAIN_TEXT b, ID FROM #table) a
inner join
master..spt_values v on v.number < len(a.b)
where v.type = 'P'
--select * from #staging
--create a fast tally table of numbers to be used to build the ROT-47 table.
;WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--Here we put it all together with stuff and FOR XML
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,case
when 47 + N <= 126 then char(47 + N)
when 47 + N > 126 then char(N-47)
end as ENCRYPTED_TEXT
from cteTally
where N between 33 and 126) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from #table t
--or if you want really random
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,char((select ROUND(((122 - N -1) * RAND() + N), 0))) as ENCRYPTED_TEXT
from cteTally
where (N between 65 and 122) and N not in (91,92,93,94,95,96)) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from #table t
Encountered the same problem myself and figured out an alternative solution that may work for others.
The idea is to use MD5 on the name and then take the last 3 hex digits of it to map into a table of names. You can do this separately for first name and last name.
3 hex digits represent decimals from 0 to 4095, so we need a list of 4096 first names and 4096 last names.
So conv(substr(md5(first_name), 3),16,10) (in MySQL syntax) would be an index from 0 to 4095 that could be joined with a table that holds 4096 first names. The same concept could be applied to last names.
Using MD5 (as opposed to a random number) guarantees a name in the original data will always be mapped to the same name in the test data.
You can get a list of names here:
https://gist.github.com/elifiner/cc90fdd387449158829515782936a9a4
I am working on this at my company right now -- and it turns out to be a very tricky thing. You want to have names that are realistic, but must not reveal any real personal info.
My approach has been to first create a randomized "mapping" of last names to other last names, then use that mapping to change all last names. This is good if you have duplicate name records. Suppose you have 2 "John Smith" records that both represent the same real person. If you changed one record to "John Adams" and the other to "John Best", then your one "person" now has 2 different names! With a mapping, all occurrences of "Smith" get changed to "Jones", and so duplicates ( or even family members ) still end up with the same last name, keeping the data more "realistic".
I will also have to scramble the addresses, phone numbers, bank account numbers, etc...and I am not sure how I will approach those. Keeping the data "realistic" while scrambling is certainly a deep topic. This must have been done many times by many companies -- who has done this before? What did you learn?
Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.
Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.
I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.
If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random switch. Rather than build a temp table, run switches between each location and a random row in the table, swapping one piece of data at a time.
The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.
It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.
Edit (2): Off-the cuff answer in T-SQL:
declare #name varchar(50)
set #name = (SELECT lastName from person where personID = (random id number)
Update person
set lastname = #name
WHERE personID = (person id of current row)
Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.