Averaging out Lat/longs in SQL Server database

I'm new to SQL Server and trying to figure out how to do the following.
I have thousands of lat/long positions pointing to the same or very close-by locations. They are all stored flat in a SQL Server table as LAT and LONG columns.
To cluster the lat/longs and pick one representative per cluster, what should I do?
I read about a method called "STCentroid":
https://msdn.microsoft.com/en-us/library/bb933847.aspx
But is it worth letting the server build a polygon from all these million rows and find the center point? That would implicitly mean a single representative for all the nearby duplicates. Might this be an inefficient or wrong way?
Only points within a few meters of each other should be considered duplicate entries.
I'm wondering how I can pick the right representative.
In other words:
If there's a group of points G1{} (GPS positions) trying to point to a physical location L1, and a group of points G2{} trying to point to a location L2, how do I derive a center point CP1 from G1{} and CP2 from G2{}, such that CP1 is very close to L1 and CP2 is very close to L2?
Note that L1 and L2 could be very near to each other, say 10 feet apart.
How should I approach this problem? Any help please?

Clustering points will be problematic. You are going to have issues if you have two potential clusters close together, and if you need precision or optimization, then you will need to do some research on your implementation. Try the Wikipedia article on cluster analysis.
However, if the point clusters are fairly far apart, then you could try a fairly simple clustering and then find the envelopes.
Something like this may work, although you would be well served to actually make a spatial column and add a spatial index.
ALTER TABLE Recordset ADD ClusterID INT -- Add a grouping ID
GO
DECLARE @i INT --Group counter
DECLARE @g GEOGRAPHY --Point from which the cluster will be made
DECLARE @Limit INT --Distance limitation, in meters for SRID 4326
SET @Limit = 10
SET @i = 0
WHILE (SELECT COUNT(*) FROM Recordset R WHERE ClusterID IS NULL) > 0 --Loop until all points are clustered
BEGIN
SET @g = (SELECT TOP 1 GEOGRAPHY::Point(LAT, LONG, 4326) FROM Recordset WHERE ClusterID IS NULL) --Point to cluster on
UPDATE Recordset SET ClusterID = @i WHERE GEOGRAPHY::Point(LAT, LONG, 4326).STDistance(@g) < @Limit AND ClusterID IS NULL --Update all points within the limit circle
SET @i = @i + 1
END
SELECT --Clustered centers
ClusterID,
GEOGRAPHY::ConvexHullAggregate(GEOGRAPHY::Point(LAT, LONG, 4326)).EnvelopeCenter().Lat AS 'LatCenter',
GEOGRAPHY::ConvexHullAggregate(GEOGRAPHY::Point(LAT, LONG, 4326)).EnvelopeCenter().Long AS 'LongCenter'
FROM
Recordset
GROUP BY
ClusterID
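To make the greedy idea easier to follow outside the database, here is a minimal Python sketch of the same loop: take the first unassigned point as a seed, assign every unassigned point within the limit to its cluster, and repeat. The function names, the haversine distance and the mean Earth radius are assumptions of this sketch, and it averages the coordinates of each cluster rather than taking the envelope center, which is only a reasonable approximation for clusters a few meters across and away from the poles and the antimeridian.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long pairs (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def greedy_cluster(points, limit_m=10.0):
    """Return one cluster id per (lat, lon) point, seeding on the first unassigned point."""
    cluster_ids = [None] * len(points)
    next_id = 0
    for i, seed in enumerate(points):
        if cluster_ids[i] is not None:
            continue  # already claimed by an earlier seed
        for j in range(i, len(points)):
            if cluster_ids[j] is None and haversine_m(*seed, *points[j]) < limit_m:
                cluster_ids[j] = next_id
        next_id += 1
    return cluster_ids


def cluster_centers(points, cluster_ids):
    """Average the latitudes and longitudes of each cluster."""
    sums = {}
    for (lat, lon), cid in zip(points, cluster_ids):
        acc = sums.setdefault(cid, [0.0, 0.0, 0])
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {cid: (acc[0] / acc[2], acc[1] / acc[2]) for cid, acc in sums.items()}
```

Like the T-SQL loop, the result depends on which point happens to be picked as the seed, so two real locations only 10 feet apart can still end up merged or split unevenly.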

Related

Convert string coordinates to geography

I have string coordinates in my table, but I want to use some geographical functions, so I first need to convert the string values to geography, like this:
geography::STGeomFromText('POINT([location])', 4326).MakeValid().STDistance(@p)
But this code doesn't work, because it needs a point here, not string coordinates.
The full code:
geography::STGeomFromText('POINT([location])', 4326).MakeValid().STDistance(@p);
DECLARE @p geography;
SET @p = geography::STGeomFromText('POINT({$Lon} {$Lat})', 4326);
Select TOP 1 id, location from branches where {$location} <= {$this->radius} order by {$location}

It's a little difficult to provide a perfect solution without seeing how the code is interpolating the variables, but SQL Server could be having issues recognizing your long/lat as strings with the STGeomFromText method.
Could you try something like this:
SELECT geography::STGeomFromText('POINT(' + CAST([$Long] AS VARCHAR(20)) + ' ' + CAST([$Lat] AS VARCHAR(20)) + ')', 4326)
Or more succinctly:
SELECT geography::Point([$Lat], [$Long], 4326)

SQL points around line between A and B

In SQL Server 2014, I have a database with Geometry points - City
Driving from City A to City B gives me a line (we take an airplane).
I need to find points in my database - which are in certain distance (10 miles) "off-track" of this line.
I know how to find the closest points around a single point, how to calculate the distance between them - but - how can I search along this line? Like POI in your Navi...
DECLARE @g geography
SELECT @g = Geo_LatLong_deg
FROM airports
WHERE iata_code = 'MyAirportCode' -- radius 100km
SELECT *
FROM airports
WHERE @g.STDistance(Geo_LatLong_deg) <= 100000

Use the STBuffer method. Assuming that you've got some way to determine your path as a geography instance, it's as simple as:
declare @distance float = 16093.44 --10 miles in meters (geography distances with SRID 4326 are in meters)
select *
from airports
where @path.STBuffer(@distance).STIntersects(Geo_LatLong_deg) = 1
By way of explanation, the STBuffer() method creates a region that is the set of points within 10 miles of your path. Then, we select all points from your table that intersect with that region with STIntersects().

Thank you for your help. I had mixed up the long/lat order in the string; now I get the results as expected.
Here is the code, in case others want to see how to combine two or more points into a line and search the area around it.
DECLARE @BuildString NVARCHAR(MAX)
SELECT @BuildString = COALESCE(@BuildString + ',', '') + CAST(longitude_deg AS NVARCHAR(50)) + ' ' + CAST(latitude_deg AS NVARCHAR(50))
FROM dbo.airports where iata_code='RLG' or iata_code='FRA'
ORDER BY ID
SET @BuildString = 'LINESTRING(' + @BuildString + ')';
DECLARE @LineFromPoints geography = geography::STLineFromText(@BuildString, 4326);
declare @distance float = 50000
select *
from airports
where @LineFromPoints.STBuffer(@distance).STIntersects(airports.GEO_LatLong_deg) = 1 and type<>'heliport'
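Since the long/lat order in the string was the actual stumbling block here, a small sketch of building the WKT outside the database may help. The helper name `linestring_wkt` and the sample coordinates are made up for illustration; the only thing it demonstrates is that WKT expects "x y", meaning longitude first and latitude second.

```python
def linestring_wkt(points):
    """Build a WKT LINESTRING from (latitude, longitude) pairs."""
    if len(points) < 2:
        raise ValueError("a LINESTRING needs at least two points")
    # WKT coordinates are written "x y": longitude first, then latitude.
    coords = ", ".join(f"{lon!r} {lat!r}" for lat, lon in points)
    return f"LINESTRING({coords})"


# Illustrative (latitude, longitude) pairs, not real airport data.
route = [(50.0333, 8.5706), (53.9182, 12.2783)]
print(linestring_wkt(route))  # LINESTRING(8.5706 50.0333, 12.2783 53.9182)
```

The resulting string is what would be passed to geography::STLineFromText with SRID 4326.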

SQL Geometry find all points in a radius

I am fluent in SQL but new to using the SQL Geometry features. I have what is probably a very basic problem to solve, but I haven't found any good resources online that explain how to use geometry objects. (Technet is a lousy way to learn new things...)
I have a collection of 2d points on a Cartesian plane, and I am trying to find all points that are within a collection of radii.
I created and populated a table using syntax like:
Update [Things] set [Location] = geometry::Point(@X, @Y, 0)
(@X and @Y are just the x and y values; 0 is an arbitrary number shared by all objects that allows set filtering, if I understand correctly)
Here is where I go off the rails...Do I try to construct some sort of polygon collection and query using that, or is there some simple way of checking for intersection of multiple radii without building a bunch of circular polygons?
Addendum: If nobody has the answer to the multiple radii question, what is the single radius solution?
UPDATE
Here are some examples I have worked up, using an imaginary star database where stars are stored on a x-y grid as points:
Selects all points in a box:
DECLARE @polygon geometry = geometry::STGeomFromText('POLYGON(('
+ CAST(@MinX AS VARCHAR(10)) + ' ' + CAST(@MinY AS VARCHAR(10)) + ','
+ CAST(@MaxX AS VARCHAR(10)) + ' ' + CAST(@MinY AS VARCHAR(10)) + ', '
+ CAST(@MaxX AS VARCHAR(10)) + ' ' + CAST(@MaxY AS VARCHAR(10)) + ','
+ CAST(@MinX AS VARCHAR(10)) + ' ' + CAST(@MaxY AS VARCHAR(10)) + ','
+ CAST(@MinX AS VARCHAR(10)) + ' ' + CAST(@MinY AS VARCHAR(10)) + '))', 0);
SELECT [Star].[Name] AS [StarName],
[Star].[StarTypeId] AS [StarTypeId]
FROM [Star]
WHERE @polygon.STContains([Star].[Location]) = 1
using this as a pattern, you can do all sorts of interesting things, such as
defining multiple polygons:
WHERE @polygon1.STContains([Star].[Location]) = 1
OR @polygon2.STContains([Star].[Location]) = 1
OR @polygon3.STContains([Star].[Location]) = 1
Or checking distance:
WHERE [Star].[Location].STDistance(@polygon1) < @SomeDistance
Sample insert statement
INSERT [Star]
(
[Name],
[StarTypeId],
[Location]
)
VALUES
(
@Name,
@StarTypeId,
GEOMETRY::Point(@LocationX, @LocationY, 0)
)

This is an incredibly late answer, but perhaps I can shed some light on a solution. The "set" number you refer to is a Spatial Reference Identifier, or SRID. For lat/long calculations you should consider setting this to 4326 and switching to SqlGeography rather than SqlGeometry: with the geography type, SRID 4326 means distances are measured in metres, whereas with SqlGeometry the SRID is only a label and distances stay in the units of your coordinates. We'll continue with SqlGeometry for now. To bulk set the SRID, you can update your table as follows:
UPDATE [YourTable] SET [SpatialColumn] = GEOMETRY::STPointFromText([SpatialColumn].STAsText(), 4326);
For a single radius, you need to create the radius as a spatial object. For example:
DECLARE @radiusInMeters FLOAT = 1000; -- Buffer distance (meters if your coordinate system is in meters)
DECLARE @radius GEOMETRY = GEOMETRY::Point(@x, @y, 4326).STBuffer(@radiusInMeters);
STBuffer() takes the spatial point and creates a circle (now a Polygon type) from it. You can then query your data set as follows:
SELECT * FROM [YourTable] WHERE [SpatialColumn].STIntersects(@radius) = 1;
The above will now use any Spatial Index you have created on the [SpatialColumn] in its query plan.
There is also a simpler option which will work (and still use a spatial index). The STDistance method allows you to do the following:
DECLARE @radius GEOMETRY = GEOMETRY::Point(@x, @y, 4326);
DECLARE @distance FLOAT = 1000; -- A distance in metres
SELECT * FROM [YourTable] WHERE [SpatialColumn].STDistance(@radius) <= @distance;
Lastly, working with a collection of radii. You have a few options. The first is to run the above for each radius in turn, but I would consider the following to do it in one go:
DECLARE @radiiCollection TABLE
(
[RadiusInMetres] FLOAT,
[Radius] GEOMETRY
)
INSERT INTO @radiiCollection ([RadiusInMetres], [Radius]) VALUES (1000, GEOMETRY::Point(@xValue, @yValue, 4326).STBuffer(1000));
-- Repeat for other radii
SELECT
X.[Id],
MIN(RC.[RadiusInMetres]) AS [WithinRadiusDistance]
FROM
[YourTable] X
JOIN
@radiiCollection RC ON RC.[Radius].STIntersects(X.[SpatialColumn]) = 1
GROUP BY
X.[Id]
-- No DROP TABLE is needed: a table variable goes out of scope at the end of the batch.
The final query above has not been tested, but I'm 99% sure it's just about there, with a small amount of tweaking being a possibility. The idea of taking the min radius distance in the select is that if the multiple radii stem from a single location and a point is within the first radius, it will naturally be within all of the others. The join therefore duplicates the record, but by grouping and then selecting the min, you get only one (and the closest).
Hope it helps, albeit 4 weeks after you asked the question. Sorry I didn't see it sooner, if only there was only one spatial tag for questions!!!!
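The grouping logic in that last query can be sketched in plain Python for Cartesian points, which is what the question describes. This is only an illustration under assumptions: the function name and the data shapes (a dict of points, a list of circles) are invented here, and a simple Euclidean distance check stands in for STBuffer/STIntersects.

```python
import math


def smallest_enclosing_radius(points, circles):
    """Map each point id to the smallest radius of any circle that contains it.

    points:  dict of id -> (x, y)
    circles: list of (center_x, center_y, radius)
    Points that fall inside no circle are left out of the result.
    """
    result = {}
    for pid, (x, y) in points.items():
        hits = [r for cx, cy, r in circles if math.hypot(x - cx, y - cy) <= r]
        if hits:
            result[pid] = min(hits)  # same role as MIN(...) with GROUP BY id
    return result


stars = {"a": (1, 1), "b": (5, 0), "c": (50, 50)}
rings = [(0, 0, 2), (0, 0, 10)]
print(smallest_enclosing_radius(stars, rings))  # {'a': 2, 'b': 10}
```

Point "a" lies inside both circles but is reported once with the smaller radius, which is the deduplication the answer describes.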

Sure, this is possible. The individual where clause should be something like:
DECLARE @Center geometry
-- Initialize the location here, you probably know better how to do that than I.
DECLARE @Radius decimal(10, 2)
SELECT * FROM pointTable WHERE sqrt(square(@Center.STX - Location.STX) + square(@Center.STY - Location.STY)) <= @Radius
You can then pile a bunch of radii and xy points into a table variable that looks like:
DECLARE @MyCircleTable TABLE (Circle geometry)
INSERT INTO @MyCircleTable (.........)
Note: I have not put this through a compiler, but this is the bare bones of a working solution.
Other option looks to be here:
http://technet.microsoft.com/en-us/library/bb933904.aspx
And there's a demo of seemingly working syntax here:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/6e1d7af4-ecc2-4d82-b069-f2517c3276c2/slow-spatial-predicates-stcontains-stintersects-stwithin-?forum=sqlspatial
The second post implies the syntax:
SELECT DISTINCT pt.* FROM pointTable pt, circletable crcs
WHERE crcs.geom.STContains(pt.Location) = 1

Query table records with points (geometry) within area

I have a table (locations) that has a field called Point (geometry). I wrote a query that passes the top and bottom latitude coordinates and the bottom and top longitude coordinates. I want to retrieve all records that are within the area of the coordinates I pass the stored procedure. When I run this it returns zero records even though I know there is a record that matches the criteria. Any ideas what I might be doing wrong?
DECLARE @categoryid AS int, @leftlong AS float, @rightlong AS float, @toplat AS float, @bottomlat AS float
DECLARE @searcharea geometry, @polygon AS varchar(500);
SET @leftlong = -85.605469
SET @toplat = 42.303468
SET @rightlong = -85.594912
SET @bottomlat = 42.297564
SET @polygon = CAST(@leftlong AS varchar(20)) + ' ' + CAST(@toplat AS varchar(20)) + ',' +
CAST(@leftlong AS varchar(20)) + ' ' + CAST(@bottomlat AS varchar(20)) + ',' +
CAST(@rightlong AS varchar(20)) + ' ' + CAST(@bottomlat AS varchar(20)) + ',' +
CAST(@rightlong AS varchar(20)) + ' ' + CAST(@toplat AS varchar(20)) + ',' +
CAST(@leftlong AS varchar(20)) + ' ' + CAST(@toplat AS varchar(20))
SET @searcharea = geometry::STGeomFromText('POLYGON ((' + @polygon + '))', 0);
SELECT *
FROM locations l
WHERE l.point.STWithin(@searcharea) = 1

There are two likely sources of the problem:
Usage of the geometry type instead of geography. When working with latitude and longitude data, you generally want to use the geography type in order to do calculations on the ellipsoid instead of on the plane. These calculations can differ considerably; there is a lot of information available on this distinction in various articles, such as this whitepaper.
Mismatching SRIDs. Is the SRID of the data in your locations table also 0 for all rows? If the SRIDs of your data and your @searcharea do not match, no results will be returned.

Obfuscate / Mask / Scramble personal information

I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that make random social security numbers, shift birth dates, scramble emails, etc. But I've come up against a wall trying to scramble customer names. I want to keep real names so we can still use our searches, so random letter generation is out. What I have tried so far is building a temp table of all last names in the table, then updating the customer table with a random selection from the temp table. Like this:
DECLARE @Names TABLE (Id int IDENTITY(1,1),[Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT @Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM @Names WHERE ROWID=Id)
This worked well in test, but completely bogs down dealing with larger amounts of data (>20 minutes for 40K rows)
All of that to ask, how would you scramble customer names while keeping real names and the weight of the production data?
UPDATE: Never fails, you try to put all the information in the post, and you forget something important. This data will also be used in our sales & demo environments which are publicly available. Some of the answers are what I am attempting to do, to 'switch' the names, but my question is literally, how to code in T-SQL?

I use generatedata. It is an open source php script which can generate all sorts of dummy data.

A very simple solution would be to ROT13 the text.
A better question may be why you feel the need to scramble the data. If you have an encryption key, you could also consider running the text through DES or AES or similar. Those would have potential performance issues, however.

When doing something like that I usually write a small program that first loads a lot of names and surnames in two arrays, and then just updates the database using random name/surname from arrays. It works really fast even for very big datasets (200.000+ records)

I use a method that changes characters in the name to other characters that are in the same "range" of usage frequency in English names. Apparently, the distribution of characters in names is different than it is for normal conversational English. For example, "x" and "z" occur 0.245% of the time, so they get swapped. At the other extreme, "w" is used 5.5% of the time, "s" 6.86% and "t" 15.978%. I change "s" to "w", "t" to "s" and "w" to "t".
I keep the vowels "aeio" in a separate group so that a vowel is only replaced by another vowel. Similarly, "q", "u" and "y" are not replaced at all. My grouping and decisions are totally subjective.
I ended up with 7 different "groups" of 2-5 characters, based mostly on frequency. Characters within each group are swapped with other characters in that same group.
The net result is names that kinda look like they might be names, but from "not around here".
Original name    Morphed name
Loren            Nimag
Juanita          Kuogewso
Tennyson         Saggywig
David            Mijsm
Julie            Kunewa
Here's the SQL I use, which includes a "TitleCase" function. There are 2 different versions of the "morphed" name based on different frequencies of letters I found on the web.
-- from https://stackoverflow.com/a/28712621
-- Convert and return param as Title Case
CREATE FUNCTION [dbo].[fnConvert_TitleCase] (@InputString VARCHAR(4000) )
RETURNS VARCHAR(4000) AS
BEGIN
DECLARE @Index INT
DECLARE @Char CHAR(1)
DECLARE @OutputString VARCHAR(4000) -- same length as the input, so long values are not truncated
SET @OutputString = LOWER(@InputString)
SET @Index = 2
SET @OutputString = STUFF(@OutputString, 1, 1,UPPER(SUBSTRING(@InputString,1,1)))
WHILE @Index <= LEN(@InputString)
BEGIN
SET @Char = SUBSTRING(@InputString, @Index, 1)
IF @Char IN (' ', ';', ':', '!', '?', ',', '.', '_', '-', '/', '&','''','(','{','[','#')
IF @Index + 1 <= LEN(@InputString)
BEGIN
IF @Char != '''' OR UPPER(SUBSTRING(@InputString, @Index + 1, 1)) != 'S'
SET @OutputString = STUFF(@OutputString, @Index + 1, 1,UPPER(SUBSTRING(@InputString, @Index + 1, 1)))
END
SET @Index = @Index + 1
END
RETURN ISNULL(@OutputString,'')
END
Go
-- 00.045 x 0.045%
-- 00.045 z 0.045%
--
-- Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z')
--
-- 00.456 k 0.456%
-- 00.511 j 0.511%
-- 00.824 v 0.824%
-- kjv
-- Replace(Replace(Replace(Replace(TS_NAME,'k','#'),'j','k'),'v','j'),'#','v')
--
-- 01.642 g 1.642%
-- 02.284 n 2.284%
-- 02.415 l 2.415%
-- gnl
-- Replace(Replace(Replace(Replace(TS_NAME,'g','#'),'n','g'),'l','n'),'#','l')
--
-- 02.826 r 2.826%
-- 03.174 d 3.174%
-- 03.826 m 3.826%
-- rdm
-- Replace(Replace(Replace(Replace(TS_NAME,'r','#'),'d','r'),'m','d'),'#','m')
--
-- 04.027 f 4.027%
-- 04.200 h 4.200%
-- 04.319 p 4.319%
-- 04.434 b 4.434%
-- 05.238 c 5.238%
-- fhpbc
-- Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c')
--
-- 05.497 w 5.497%
-- 06.686 s 6.686%
-- 15.978 t 15.978%
-- wst
-- Replace(Replace(Replace(Replace(TS_NAME,'w','#'),'s','w'),'t','s'),'#','t')
--
--
-- 02.799 e 2.799%
-- 07.294 i 7.294%
-- 07.631 o 7.631%
-- 11.682 a 11.682%
-- eioa
-- Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')
--
-- -- dont replace
-- 00.222 q 0.222%
-- 00.763 y 0.763%
-- 01.183 u 1.183%
-- Obfuscate a name
Select
ts_id,
Cast(ts_name as varchar(42)) as [Original Name],
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'x','#'),'z','x'),'#','z'),'k','#'),'j','k'),'v','j'),'#','v'),'g','#'),'n','g'),'l','n'),'#','l'),'r','#'),'d','r'),'m','d'),'#','m'),'f','#'),'h','f'),'p','h'),'b','p'),'c','b'),'#','c'),'w','#'),'s','w'),'t','s'),'#','t'),'e','#'),'i','ew'),'o','i'),'a','o'),'#','a')) as VarChar(42)) As [morphed name] ,
Cast(dbo.fnConvert_TitleCase(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(TS_NAME,'e','t'),'~','e'),'t','~'),'a','o'),'~','a'),'o','~'),'i','n'),'~','i'),'n','~'),'s','h'),'~','s'),'h','r'),'r','~'),'d','l'),'~','d'),'l','~'),'m','w'),'~','m'),'w','f'),'f','~'),'g','y'),'~','g'),'y','p'),'p','~'),'b','v'),'~','b'),'v','k'),'k','~'),'x','~'),'j','x'),'~','j')) as VarChar(42)) As [morphed name2]
From
ts_users
;
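For readers who find the nested Replace calls hard to follow, the first "morphed name" expression can be sketched with a translation table in Python. This is an illustrative sketch, not the author's code: the names `GROUPS`, `TABLE` and `morph` are made up here, and `capitalize()` only upper-cases the first letter, so it is a much simpler stand-in for the TitleCase function. Each letter in a group is replaced by the letter before it, with the first wrapping around to the last, which is what the Replace chains do; the one exception, taken from the SQL as written, is that "i" becomes "ew".

```python
# Swap groups as listed in the SQL comments (q, y and u are left alone).
GROUPS = ["xz", "kjv", "gnl", "rdm", "fhpbc", "wst", "eioa"]


def build_table(groups):
    """Map each letter to the previous letter in its group (first wraps to last)."""
    table = {}
    for group in groups:
        for k, ch in enumerate(group):
            table[ord(ch)] = group[k - 1]
    return table


TABLE = build_table(GROUPS)
TABLE[ord("i")] = "ew"  # the SQL replaces 'i' with 'ew' rather than 'e'


def morph(name):
    """Lower-case the name, swap letters within their groups, then capitalize."""
    return name.lower().translate(TABLE).capitalize()


print(morph("Loren"))     # Nimag
print(morph("Juanita"))   # Kuogewso
print(morph("Tennyson"))  # Saggywig
```

Because the mapping is a fixed substitution, it is easy to reverse, so it hides names from a casual glance rather than protecting them.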

Why not just use some sort of Random Name Generator?

Use a temporary table instead and the query is very fast. I just ran on 60K rows in 4 seconds. I'll be using this one going forward.
CREATE TABLE #Names
(Id int IDENTITY(1,1),[Name] varchar(100))
/* Scramble the last names (randomly pick another last name) */
INSERT #Names
SELECT LastName
FROM Customer
ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID]
SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id)
DROP TABLE #Names

The following approach worked for us. Let's say we have 2 tables, Customers and Products:
CREATE FUNCTION [dbo].[GenerateDummyValues]
(
@dataType varchar(100),
@currentValue varchar(4000)=NULL
)
RETURNS varchar(4000)
AS
BEGIN
IF @dataType = 'int'
BEGIN
Return '0'
END
ELSE IF @dataType = 'varchar' OR @dataType = 'nvarchar' OR @dataType = 'char' OR @dataType = 'nchar'
BEGIN
Return 'AAAA'
END
ELSE IF @dataType = 'datetime'
BEGIN
Return Convert(varchar(2000),GetDate())
END
-- you can add more checks, add complicated logic etc
Return 'XXX'
END
The above function will help in generating different data based on the data type coming in.
Now, for each column of each table which does not have word "id" in it, use following query to generate further queries to manipulate the data:
select 'select ''update '' + TABLE_NAME + '' set '' + COLUMN_NAME + '' = '' + '''''''' + dbo.GenerateDummyValues( Data_type,'''') + '''''' where id = '' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, ' + table_name + ' where RIGHT(LOWER(COLUMN_NAME),2) <> ''id'' and TABLE_NAME = '''+ table_name + '''' + ';' from INFORMATION_SCHEMA.TABLES;
When you execute above query it will generate update queries for each table and for each column of that table, for example:
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Customers where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Customers';
select 'update ' + TABLE_NAME + ' set ' + COLUMN_NAME + ' = ' + '''' + dbo.GenerateDummyValues( Data_type,'') + ''' where id = ' + Convert(varchar(10),Id) from INFORMATION_SCHEMA.COLUMNS, Products where RIGHT(LOWER(COLUMN_NAME),2) <> 'id' and TABLE_NAME = 'Products';
Now, when you execute above queries you will get final update queries, that will update the data of your tables.
You can execute this on any SQL Server database; no matter how many tables you have, it will generate queries for you that can then be executed.
Hope this helps.

Another site to generate shaped fake data sets, with an option for T-SQL output:
https://mockaroo.com/

Here's a way using ROT47 which is reversible, and another which is random. You can add a PK to either to link back to the "un scrambled" versions
declare @table table (ID int, PLAIN_TEXT nvarchar(4000))
insert into @table
values
(1,N'Some Dudes name'),
(2,N'Another Person Name'),
(3,N'Yet Another Name')
--split your string into a column, and compute the decimal value (N)
if object_id('tempdb..#staging') is not null drop table #staging
select
substring(a.b, v.number+1, 1) as Val
,ascii(substring(a.b, v.number+1, 1)) as N
--,dense_rank() over (order by b) as RN
,a.ID
into #staging
from (select PLAIN_TEXT b, ID FROM @table) a
inner join
master..spt_values v on v.number < len(a.b)
where v.type = 'P'
--select * from #staging
--create a fast tally table of numbers to be used to build the ROT-47 table.
;WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
--Here we put it all together with stuff and FOR XML
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,case
when 47 + N <= 126 then char(47 + N)
when 47 + N > 126 then char(N-47)
end as ENCRYPTED_TEXT
from cteTally
where N between 33 and 126) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
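from @table t
--or if you want really random (repeat the WITH ... cteTally definition before this statement; a CTE only applies to the single statement that follows it)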
select
PLAIN_TEXT
,ENCRYPTED_TEXT =
stuff((
select
--s.Val
--,s.N
e.ENCRYPTED_TEXT
from #staging s
left join(
select
N as DECIMAL_VALUE
,char(N) as ASCII_VALUE
,char((select ROUND(((122 - N -1) * RAND() + N), 0))) as ENCRYPTED_TEXT
from cteTally
where (N between 65 and 122) and N not in (91,92,93,94,95,96)) e on e.DECIMAL_VALUE = s.N
where s.ID = t.ID
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 0, '')
from @table t

Encountered the same problem myself and figured out an alternative solution that may work for others.
The idea is to use MD5 on the name and then take the last 3 hex digits of it to map into a table of names. You can do this separately for first name and last name.
3 hex digits represent decimals from 0 to 4095, so we need a list of 4096 first names and 4096 last names.
So conv(right(md5(first_name), 3), 16, 10) (in MySQL syntax) would be an index from 0 to 4095 that could be joined with a table that holds 4096 first names. The same concept could be applied to last names.
Using MD5 (as opposed to a random number) guarantees that a name in the original data will always be mapped to the same name in the test data.
You can get a list of names here:
https://gist.github.com/elifiner/cc90fdd387449158829515782936a9a4
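As a rough illustration of the hashing idea (not the answer's actual code), here is a Python sketch. The function names and the placeholder replacement list are assumptions; in practice the list would hold 4096 real names, and UTF-8 is assumed as the encoding of the input before hashing.

```python
import hashlib


def name_index(name):
    """Return an index from 0 to 4095 taken from the last 3 hex digits of the MD5."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest[-3:], 16)


def pseudonym(name, replacements):
    """Pick a replacement name deterministically from a list of exactly 4096 names."""
    if len(replacements) != 4096:
        raise ValueError("expected exactly 4096 replacement names")
    return replacements[name_index(name)]


# Placeholder list for the sketch; substitute a real list of 4096 names.
fake_last_names = [f"Name{i:04d}" for i in range(4096)]
print(pseudonym("Smith", fake_last_names) == pseudonym("Smith", fake_last_names))  # True
```

The mapping is stable across runs, which keeps duplicates consistent, but two different real names can land on the same replacement, and anyone holding the name list can test guesses against it.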

I am working on this at my company right now -- and it turns out to be a very tricky thing. You want to have names that are realistic, but must not reveal any real personal info.
My approach has been to first create a randomized "mapping" of last names to other last names, then use that mapping to change all last names. This is good if you have duplicate name records. Suppose you have 2 "John Smith" records that both represent the same real person. If you changed one record to "John Adams" and the other to "John Best", then your one "person" now has 2 different names! With a mapping, all occurrences of "Smith" get changed to "Jones", and so duplicates ( or even family members ) still end up with the same last name, keeping the data more "realistic".
I will also have to scramble the addresses, phone numbers, bank account numbers, etc...and I am not sure how I will approach those. Keeping the data "realistic" while scrambling is certainly a deep topic. This must have been done many times by many companies -- who has done this before? What did you learn?

Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.
Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.
I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.
If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random switch. Rather than build a temp table, run switches between each location and a random row in the table, swapping one piece of data at a time.
The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.
It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.
Edit (2): Off-the-cuff answer in T-SQL:
declare @name varchar(50)
set @name = (SELECT lastName FROM person WHERE personID = @randomId) -- @randomId: placeholder for a randomly chosen id
UPDATE person
SET lastName = @name
WHERE personID = @currentId -- @currentId: placeholder for the id of the current row
Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.
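For the console-app route mentioned above, the Fisher-Yates shuffle itself is short. The sketch below is in Python with a made-up function name; reading the last names from the database and writing them back is left out, and the optional `rng` parameter is only there so the result can be made repeatable.

```python
import random


def fisher_yates_shuffle(values, rng=None):
    """Return a shuffled copy of values using the Fisher-Yates algorithm."""
    rng = rng or random.Random()
    shuffled = list(values)
    # Walk from the last position down, swapping each item with a random
    # item at or before it; the range of the random pick shrinks every step.
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both endpoints
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


last_names = ["Smith", "Jones", "Adams", "Best", "Smith"]
scrambled = fisher_yates_shuffle(last_names)
print(sorted(scrambled) == sorted(last_names))  # True: same names, new order
```

The shuffled list keeps the same names with the same frequencies, so the weight of the production data is preserved; each row simply receives the name at its position in the shuffled list.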
