I am trying to pregenerate some alphanumeric strings and insert the result into a table. The length of each string will be 5. Example: a5r67. Basically I want to generate some readable strings for customers so they can access their orders like www.example.com/order/a5r67. Now I have a select statement:
;WITH
cte1 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte2 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte3 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte4 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t)),
cte5 AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t))
INSERT INTO ProductHandles(ID, Used)
SELECT cte1.t + cte2.t + cte3.t + cte4.t + cte5.t, 0
FROM cte1
CROSS JOIN cte2
CROSS JOIN cte3
CROSS JOIN cte4
CROSS JOIN cte5
Now the problem is I need to write something like this to get a value from the table:
SELECT TOP 1 ID
FROM ProductHandles
WHERE Used = 0
I will have an index on the Used column so it will be fast. The problem with this is that the values come back in order:
00000
00001
00002
...
I know that I can order by NEWID(), but that will be much slower. I also know that there is no guarantee of ordering unless we specify an ORDER BY clause. What I need is the opposite: guaranteed chaos, but without ordering by NEWID() each time a customer creates an order.
I am going to use it like:
WITH cte as (
SELECT TOP 1 * FROM ProductHandles WHERE Used = 0
--I don't want to order by newid() here as it will be slow
)
UPDATE cte
SET Used = 1
OUTPUT INSERTED.ID
If you add an identity column to the table and use order by newid() when inserting the records (that will be slow, but from what I understand it's a one-time thing done offline), then you can use order by on the identity column to select the records in the order they were inserted into the table.
From the Limitations and Restrictions part of the INSERT page in Microsoft Docs:
INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted.
This means that by doing this you are effectively making the identity column follow the same random order in which the rows were selected in the insert...select statement.
Also, there is no need to repeat the same cte 5 times - you are already repeating the cross join:
CREATE TABLE ProductHandles(sort int identity(1,1), ID char(5), used bit)
;WITH
cte AS(SELECT * FROM (VALUES('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('s'),('t'),('u'),('v'),('w'),('x'),('y'),('z')) AS v(t))
INSERT INTO ProductHandles(ID, Used)
SELECT a.t + b.t + c.t + d.t + e.t, 0
FROM cte a
CROSS JOIN cte b
CROSS JOIN cte c
CROSS JOIN cte d
CROSS JOIN cte e
ORDER BY NEWID()
Then the cte can have an order by clause that guarantees the same random order as the rows returned from the select statement populating this table:
WITH cte as (
SELECT TOP 1 *
FROM ProductHandles
WHERE Used = 0
ORDER BY sort
)
UPDATE cte
SET Used = 1
OUTPUT INSERTED.ID
You can see a live demo on rextester. (with only digits since it's taking too long otherwise)
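As a rough illustration of the idea outside the database, here is a small Python sketch (not T-SQL). The names `handles` and `next_handle` are made up for the example, and a 3-digit alphabet is assumed to keep it tiny. The expensive shuffle happens once at load time; handing out a value afterwards is just "take the next position", which is what ordering by the identity column does.

```python
import itertools
import random

ALPHABET = "0123456789"   # digits only, to keep the demo small (assumption)
LENGTH = 3

# One-time, offline step: build every combination, then shuffle once.
# This plays the role of INSERT ... SELECT ... ORDER BY NEWID().
handles = ["".join(p) for p in itertools.product(ALPHABET, repeat=LENGTH)]
random.shuffle(handles)

# The list index plays the role of the identity column `sort`.
next_position = 0

def next_handle():
    """Hand out the next unused handle in the pre-shuffled order."""
    global next_position
    handle = handles[next_position]
    next_position += 1
    return handle

first_batch = [next_handle() for _ in range(5)]
print(first_batch)
```

No per-request sorting is needed: the randomness was paid for once, when the list was shuffled.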
Here's a slightly different option...
Rather than trying to generate all possible values in a single sitting, you could simply generate a million or two at a time and generate more as they get used up.
Using this approach, you drastically reduce the initial creation time and eliminate the need to maintain a massive table of values, the majority of which will never be used.
CREATE TABLE dbo.ProductHandles (
rid INT NOT NULL
CONSTRAINT pk_ProductHandles
PRIMARY KEY CLUSTERED,
ID_Value CHAR(5) NOT NULL
CONSTRAINT uq_ProductHandles_IDValue
UNIQUE WITH (IGNORE_DUP_KEY = ON), -- prevents the insertion of duplicate values w/o generating any errors.
Used BIT NOT NULL
CONSTRAINT df_ProductHandles_Used
DEFAULT (0)
);
-- Create a filtered index to help facilitate fast searches
-- of unused values.
CREATE NONCLUSTERED INDEX ixf_ProductHandles_Used_rid
ON dbo.ProductHandles (Used, rid)
INCLUDE(ID_Value)
WHERE Used = 0;
--==========================================================
WHILE 1 = 1 -- The while loop will attempt to insert new rows, in 1M blocks, until required minimum of unused values are available.
BEGIN
IF (SELECT COUNT(*) FROM dbo.ProductHandles ph WHERE ph.Used = 0) > 1000000 -- the minimum num of unused ID's you want to keep on hand.
BEGIN
BREAK;
END;
ELSE
BEGIN
WITH
cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
cte_Tally (n) AS (
SELECT TOP (1000000) -- Sets the "block size" of each insert attempt.
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM
cte_n3 a CROSS JOIN cte_n3 b
)
INSERT dbo.ProductHandles (rid, ID_Value, Used)
SELECT
t.n + ISNULL((SELECT MAX(ph.rid) FROM dbo.ProductHandles ph), 0),
CONCAT(ISNULL(c1.char_1, n1.num_1), ISNULL(c2.char_2, n2.num_2), ISNULL(c3.char_3, n3.num_3), ISNULL(c4.char_4, n4.num_4), ISNULL(c5.char_5, n5.num_5)),
0
FROM
cte_Tally t
-- for each of the 5 positions, randomly generate a number between 0 and 35.
-- 0 - 9 are left as numbers.
-- 10 - 35 are converted to lower-case letters.
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n1 (num_1)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n1.num_1 > 9 THEN n1.num_1 + 87 END)) ) c1 (char_1)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n2 (num_2)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n2.num_2 > 9 THEN n2.num_2 + 87 END)) ) c2 (char_2)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n3 (num_3)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n3.num_3 > 9 THEN n3.num_3 + 87 END)) ) c3 (char_3)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n4 (num_4)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n4.num_4 > 9 THEN n4.num_4 + 87 END)) ) c4 (char_4)
CROSS APPLY ( VALUES (ABS(CHECKSUM(NEWID())) % 36) ) n5 (num_5)
CROSS APPLY ( VALUES (CHAR(CASE WHEN n5.num_5 > 9 THEN n5.num_5 + 87 END)) ) c5 (char_5);
END;
END;
After the initial creation, move the code in the WHILE loop to a stored procedure and schedule it to automatically run on a periodic basis.
If I'm understanding this right, it looks like you're attempting to separate the URL/visible data from the DB record ID, as most apps do, and provide something the user will see that is not directly related to an ID field. NEWID() always returns a full GUID, but you can convert it to a string and keep only a portion of it, which gives you a smaller field with a smaller index. Keep in mind that a truncated GUID is not guaranteed to be unique.
SELECT CONVERT(varchar(255), NEWID())
SELECT SUBSTRING(CONVERT(varchar(40), NEWID()),1,5)
You might also want to look at a checksum field, though I don't know if it's faster on indexing. You could get crazier by combining a random NEWID() with a checksum across 2 or 3 fields.
SELECT BINARY_CHECKSUM(5 ,'EP30461105',1)
Related
Suppose a table with 3 columns. each row represents a unique combination of each value:
a a a
a a b
a b a
b b a
b b c
c c a
...
however, what I want is,
aab = baa = aba
cca = cac = acc
...
Finally, I want to get these values in a CSV format as a combination for each value like the image that I attached.
Thanks for your help!
Below is the query to generate my problem, please take a look!
--=======================================
--populate test data
--=======================================
drop table if exists #t0
;
with
cte_tally as
(
select row_number() over (order by (select 1)) as n
from sys.all_columns
)
select
char(n) as alpha
into #t0
from
cte_tally
where
(n > 64 and n < 91) or
(n > 96 and n < 123);
drop table if exists #t1
select distinct upper(alpha) alpha into #t1 from #t0
drop table if exists #t2
select
a.alpha c1
, b.alpha c2
, c.alpha c3
, row_number()over(order by (select 1)) row_num
into #t2
from #t1 a
join #t1 b on 1=1
join #t1 c on 1=1
drop table if exists #t3
select *
into #t3
from (
select *
from #t2
) p
unpivot
(cvalue for c in (c1,c2,c3)
) unpvt
select
row_num
, c
, cvalue
from #t3
order by 1,2
--=======================================
--these three rows should be treated equally
--=======================================
select *
from #t2
where concat(c1,c2,c3) in ('ABA','AAB', 'BAA')
--=======================================
--what i've tried...
--row count is actually correct, but the problem is that it omits rows that contain any duplicate letters.
--=======================================
select
distinct
stuff((
select
distinct
'.' + cvalue
from #t3 a
where a.row_num = h.row_num
for xml path('')
),1,1,'') as comb
from #t3 h
As pointed out in the comments, you can unpivot the values, sort them in the right order and reaggregate them into a single row. Then you can group the original rows by those new values.
SELECT v.a, v.b, v.c
FROM #t2
CROSS APPLY (
SELECT a = MIN(val), b = MIN(CASE WHEN rn = 2 THEN val END), c = MAX(val)
FROM (
SELECT *, rn = ROW_NUMBER() OVER (ORDER BY val)
FROM (VALUES (c1),(c2),(c3) ) v3(val)
) v2
) v
GROUP BY v.a, v.b, v.c;
Really, what you should perhaps do, is ensure that the values are in the correct order in the first place:
ALTER TABLE #t2
ADD CONSTRAINT t2_ValuesOrder
CHECK (c1 <= c2 AND c2 <= c3);
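The "sort each row's values, then group by the sorted form" idea can be checked outside the database with a short Python sketch. It rebuilds the same 26 x 26 x 26 rows as #t2; the helper name `canonical` is made up for the example.

```python
from collections import defaultdict
from itertools import product
from string import ascii_uppercase

def canonical(row):
    """Order-insensitive key: the row's values sorted ascending."""
    return tuple(sorted(row))

groups = defaultdict(list)
for row in product(ascii_uppercase, repeat=3):   # same shape as #t2
    groups[canonical(row)].append("".join(row))

print(len(groups))                # number of distinct combinations
print(groups[("A", "A", "B")])    # the rows that collapse into one group
```

Rows such as AAB, ABA and BAA end up under the same key, while duplicated letters inside a row are preserved, which is the part the DISTINCT-based attempt loses.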
I'd be curious why you need this, though I'm sure you have a reason. I'd suggest a lookup table that holds all associated keys, a "Mapping Table". You might optimize some of this as you implement it. First create one table for holding the "Next/New Key" (this is where the 1, 2, 3... come from). You get a new "New Key" after each batch of records you bulk insert into your "Mapping Table". The "Mapping Table" holds the combinations of the key values, one row for each combination, along with your "New Key". You should get a table looking something like:
A, B, C, 1
A, C, B, 1
B, A, C, 1
...
X, Y, Z, 2
X, Z, Y, 2
If you can update your source table to hold a column for your "Mapping Key" (the 1, 2, 3), then you just look it up from the mapping table where (c1=a, c2=a, c3=b); the order of the values shouldn't matter for this look-up. One suggestion is to create a composite unique key on c1, c2, c3 in your mapping table. Then, to get your records, look up the "mapping key value" in the mapping table and query for records matching that value. Or, if you don't do a pre-lookup to get the mapping key, you should be able to do a self-join using the mapping key value...
If you want them in a CSV format:
select distinct v.cs
from #t2 t2 cross apply
(select string_agg(c, ',') within group (order by c desc) as cs
from (values (t2.c1), (t2.c2), (t2.c3)
) v(c)
) v;
It seems to me that what you need is some form of masking*. Take this fiddle:
http://sqlfiddle.com/#!18/fc67f/8
where I have created a mapping table that contains all of the possible values and paired each one with an increasing power of 10. Doing a cross join on that map table, concatenating the values, adding the masks and grouping on the total will yield all the unique combinations.
Here is the code from the fiddle:
CREATE TABLE maps (
val varchar(1),
num int
);
INSERT INTO maps (val, num) VALUES ('a', 1), ('b', 10), ('c', 100);
SELECT mask, max(vals) as val
FROM (
SELECT concat(m1.val, m2.val, m3.val) as vals,
m1.num + m2.num + m3.num as mask
FROM maps m1
CROSS JOIN maps m2
CROSS JOIN maps m3
) q GROUP BY mask
Using these powers of 10 ensures that the mask contains the count for each value, one per decimal place of the resulting number, and then you can group on it to get the unique(ish) strings.
I don't know what your data looks like. The base only has to be larger than the number of columns (with three columns each count is at most 3), but every distinct value needs its own power, so with more than about 10 possible values the powers of 10 no longer fit in an int and you will have to use a bigint or a smaller base; the theory still applies. I didn't write code to extract the columns from the value table into the mapping table, but I'm sure you can do that.
*actually, I think the term I was looking for was flag.
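To make the choice of base concrete, here is a Python sketch of the same mask idea for all 26 letters. It assumes base 4 (one more than the number of columns) instead of base 10; the names `weight` and `mask` are made up for the example.

```python
from itertools import product
from string import ascii_uppercase

COLUMNS = 3
BASE = COLUMNS + 1   # each per-letter count is at most 3, so base 4 never carries

# One power of BASE per letter, like the (val, num) pairs in the maps table.
weight = {ch: BASE ** i for i, ch in enumerate(ascii_uppercase)}

def mask(row):
    """Sum of the weights: encodes how many times each letter occurs."""
    return sum(weight[ch] for ch in row)

masks = {}
for row in product(ascii_uppercase, repeat=COLUMNS):
    masks.setdefault(mask(row), "".join(row))   # keep one representative per mask

print(len(masks))            # number of distinct combinations
print(max(masks) < 2 ** 63)  # largest mask still fits in a signed 64-bit integer
```

With base 10 the largest weight would be 10^25, which does not fit in a bigint, whereas base 4 tops out at 4^25.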
I am looking to create a table that has 100 rows, and the first column is organized by the letters A-Z, and repeats all the way to 100. The closest I have come is either:
having a numeric column that then uses the ASCII values to convert the number to the letter, however this involves creating the numeric column first, and then having the alphabet column dependent on this one, or
I have been able to create a single column, however when I try to print the whole table, it shows up as AAAA, BBBB, CCCC, DDDD, etc.
I need the column to be completely independent which is why solution #1 doesn't work, and I can't find a way to properly sort or organize solution #2 for it to be A, B, C instead of the way it is printing now. Screenshots for context:
Solution 1
Solution 2
I have been using this code to create the table:
WITH nums AS (
SELECT n
FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) t(n)
)
SELECT n1.n + n10.n * 10 as col
INTO dbo.table1
FROM nums n1
CROSS JOIN nums n10;
Then for solution 1, I tried this:
ALTER TABLE numbers
ADD letters AS CHAR(num % 26 + 65);
SELECT * FROM numbers
ORDER BY num;
and for solution 2, this:
ALTER TABLE table1
ALTER COLUMN col VARCHAR(3);
UPDATE table1
SET col = col % 26 + 65;
UPDATE table1
SET col = CHAR(col);
SELECT * FROM table1
ORDER BY col;
I have been at this for a few hours now, trying different things in both solutions to get the answer.
Thanks in advance.
If you want to repeatedly cycle through generating letters, you can use a recursive CTE:
with cte as (
select convert(varchar(max), 'A') as letter, 1 as n
union all
select (case when letter < 'Z' then convert(varchar(max), char(ascii(letter) + 1)) else 'A' end),
n + 1
from cte
where n < 100
)
select letter
from cte;
You can use insert or select into to put the values in a table.
If you want more than 100 rows, you'll need to add option (maxrecursion 0).
There is no certain order of rows in a table. Even if the table has a clustered index specified. Such a concept does not exist in relational databases.
To ensure a specific order of values you need to force it using ORDER BY clause in a SELECT statement.
Considering that, the following code must give you an idea how to implement your solution:
With nums as (
SELECT n
FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) t(n)
),
numbers as (
SELECT n1.n + n10.n * 10 as num
FROM nums n1
CROSS JOIN nums n10
)
select num, char(num/26 + 65) + CHAR(num % 26 + 65)
from numbers
order by num
How can I generate a unique, case-insensitive alphanumeric string of 6 characters for each of 4 million records, excluding 1's, I's, O's, and 0's?
I have tried the query below, but once I strip out the values above, the generated IDs contain duplicates.
select CAST(REPLACE(REPLACE(CHAR( ASCII('AA')+(ABS(CHECKSUM(NEWID()))%25)) , 'O', ''), 'I', '')
+ REPLACE(REPLACE(REPLACE( REPLACE(SUBSTRING(CONVERT(varchar(60), NEWID()),1, 10) , '-',''), '.' , ''), '0' , ''), '1','') AS nvarchar (6))
, employee_id
from cte
The final output should be something like:
...
How to generate 6 digit unique alphanumeric string with 6- character length, case nonsensitive for 4 million records.
There are enough hex digits to do what you want. So, one option is:
select right('ZZZZZZ' + format(row_number() over (order by newid()), 'X'), 6)
This generates a sequential number (randomly), converts it to hex, and then prepends Zs.
If you want the UIDs to appear to be random (e.g., 1st could be G5K2M5, second 23BN32, etc), I think you basically have three choices
(In a loop) randomly generating UIDs, remove those that a) already exist, and b) have duplicates in your generated list, then insert the unique UIDs. Repeat until you have none left.
Generate a table with all possible UIDs (e.g., all letters and numbers except 1, I, L, 0, o - note I've added L to the list as lowercase l looks like I or 1). That means 31 possible characters in 6 slots... 31^6 is approximately 900 million possibilities. For the UIDs to use, randomly select the number needed from the UID list, assign them as needed, then remove them from the list so you won't get doubles.
Use a formula where each number is uniquely mapped to a UID. Then just get the rownumber or other unique int identifier, and calculate the UID from it. Note that the formula could be a mathematical formula, or could just be a table (as above) where the UIDs are initially randomly sorted, and you just take the UID from the relevant rownumber.
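The third option (a formula that maps each row number to exactly one UID) can be sketched in Python as follows. This is only an illustration under stated assumptions: a 32-character alphabet without 0, 1, I and O, and a hypothetical multiplier and offset used to scramble the sequence. Because 32^6 = 2^30, multiplying by any odd number modulo 2^30 is a one-to-one mapping, so distinct row numbers always give distinct UIDs.

```python
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"   # 32 symbols, no 0, 1, I, O
LENGTH = 6
SPACE = len(ALPHABET) ** LENGTH                 # 32**6 == 2**30

def uid_from_row_number(n, multiplier=387420489, offset=12345):
    """Map a row number in [0, SPACE) to a unique 6-character UID.

    multiplier must be odd so that n -> n * multiplier + offset is a
    bijection modulo 2**30 (both constants are arbitrary, hypothetical picks).
    """
    if not 0 <= n < SPACE:
        raise ValueError("row number out of range")
    scrambled = (n * multiplier + offset) % SPACE
    chars = []
    for _ in range(LENGTH):
        scrambled, digit = divmod(scrambled, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

uids = [uid_from_row_number(n) for n in range(200_000)]
print(uids[:3])
```

The result only looks random; anyone who knows the constants can reverse it, so this suits readable identifiers rather than secrets.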
select top (100000)
cte.*,
concat
(
substring(s.random32, p.p1, 1),
substring(s.random32, p.p2, 1),
substring(s.random32, p.p3, 1),
substring(s.random32, p.p4, 1),
substring(s.random32, p.p5, 1),
substring(s.random32, p.p6, 1)
) as combo6
from
--employees
(
--4mil employees
select top (4000000)
row_number() over(order by @@spid) as empid, --this could be empid, eg. empid as n
a.name as empfirstname, a.name as emplastname, b.type_desc as emptype
from sys.all_objects as a
cross join sys.all_objects as b
) as cte
--random string
cross join
(
--one random string (excluding 1, 0, I, O)
select top (1)
(
select v.v as '*'
from
(values
('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),
('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),
('J'),('K'),('L'),('M'),('N'), ('P'),('Q'),('R'),
('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z')
) as v(v)
order by newid()
for xml path('')
) as random32
) as s
--combo6 positions in string
cross apply
(
select
/*for 32 chars = len(rand32) */
(power(32,0)+(cte.empid-1)%power(32, 1))/power(32,0) as p1,
(power(32,1)+(cte.empid-1)%power(32, 2))/power(32,1) as p2,
(power(32,2)+(cte.empid-1)%power(32, 3))/power(32,2) as p3,
(power(32,3)+(cte.empid-1)%power(32, 4))/power(32,3) as p4,
(power(32,4)+(cte.empid-1)%power(32, 5))/power(32,4) as p5,
(power(32,5)+(cte.empid-1)%power(32, 6))/power(32,5) as p6
) as p
go
....or....(?)
create or alter function dbo.[why?]()
returns char(6)
as
begin
declare @combo6 char(6);
declare @randomstring char(32) = cast(session_context(N'randomstring') as char(32));
if @randomstring is null
begin
select @randomstring =
(
select v.v as '*'
from
(values
('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),
('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),
('J'),('K'),('L'),('M'),('N'), ('P'),('Q'),('R'),
('S'),('T'),('U'),('V'),('W'),('X'),('Y'),('Z')
) as v(v)
order by checksum(@@idle, @@cpu_busy, (select max(last_request_end_time) from sys.dm_exec_sessions where session_id=@@spid), v.v)
for xml path('')
);
end
declare @randomnumber int = 1 + isnull(cast(session_context(N'randomnumber') as int), abs(checksum(@randomstring))%10000000);
select @combo6 = concat(
substring(@randomstring, p.p1, 1),
substring(@randomstring, p.p2, 1),
substring(@randomstring, p.p3, 1),
substring(@randomstring, p.p4, 1),
substring(@randomstring, p.p5, 1),
substring(@randomstring, p.p6, 1)
)
from
(
select
/*for 32 chars = len(rand32) */
(power(32,0)+(@randomnumber-1)%power(32, 1))/power(32,0) as p1,
(power(32,1)+(@randomnumber-1)%power(32, 2))/power(32,1) as p2,
(power(32,2)+(@randomnumber-1)%power(32, 3))/power(32,2) as p3,
(power(32,3)+(@randomnumber-1)%power(32, 4))/power(32,3) as p4,
(power(32,4)+(@randomnumber-1)%power(32, 5))/power(32,4) as p5,
(power(32,5)+(@randomnumber-1)%power(32, 6))/power(32,5) as p6
) as p;
exec sp_set_session_context @key=N'randomstring', @value=@randomstring;
exec sp_set_session_context @key=N'randomnumber', @value=@randomnumber;
return(@combo6);
end
go
exec sp_set_session_context @key=N'randomstring', @value=null;
exec sp_set_session_context @key=N'randomnumber', @value=null;
go
select top (100000) dbo.[why?]() as empid, a.name, b.object_id
from sys.all_objects as a
cross join sys.all_objects as b
go
--drop function dbo.[why?]
The difficulty lies in the UNIQUE requirement. Some of the solutions shown cannot guarantee the uniqueness of the generated strings.
lptr's first solution does not always give 6 characters and gives some duplicates.
My solution meets the full requirement, but it is slow:
WITH TAZ AS
(SELECT CAST('A' COLLATE Latin1_General_BIN AS CHAR(1)) AS LETTER, ASCII('A') AS CAR
UNION ALL
SELECT CAST(CHAR(CAR + 1) COLLATE Latin1_General_BIN AS CHAR(1)), CAR + 1
FROM TAZ
WHERE CHAR(CAR + 1) <= 'Z'
)
SELECT TOP 4000000
T1.LETTER + T2.LETTER + T3.LETTER + T4.LETTER + T5.LETTER + T6.LETTER AS L6
FROM TAZ AS T1
CROSS JOIN TAZ AS T2
CROSS JOIN TAZ AS T3
CROSS JOIN TAZ AS T4
CROSS JOIN TAZ AS T5
CROSS JOIN TAZ AS T6
ORDER BY NEWID()
One thing I do in such a case is to compute all the possible 6-character strings and store them in a plain table, kept compressed and on read-only storage (a tally table). The table has an ID and an extra column of bit type set to 0.
When you want to assign some 6-character values, you just pick them from the table and mark them by setting the bit to 1.
For information, this is how ticket booking references have long been issued to customers by SNCF, the French national railway company.
First, let me start by saying you already have some fine answers here; the only problem with them is that they are slow. My suggested solution is fast - even very fast in comparison.
A year ago I wrote a blog post entitled How to pre-populate a random strings pool that was based on an answer written by Martin Smith to How can I generate random strings in TSQL. I basically took the code posted in that answer and wrapped it up inside an inline table-valued function.
For this problem, I've taken that function and modified it ever so slightly to better fit your requirements - mainly the number of random strings (the original version can produce up to 1,000,000 rows only) and the case-insensitivity.
The tests I ran comparing the execution speed of Gordon's, SQLpro's and lptr's answers with my own showed that this is the fastest of the four.
So, without further ado, here's the code:
First, the function and its auxiliary view:
-- This view is needed for the function to work. Read my blog post for details.
CREATE VIEW dbo.GuidGenerator
AS
SELECT Newid() As NewGuid;
GO
-- slightly modified version to enable the generation of up to 100,000,000 rows.
CREATE FUNCTION dbo.RandomStringGenerator
(
@Length int,
@Count int -- Note: up to 100,000,000 rows
)
RETURNS TABLE
AS
RETURN
WITH E1(N) AS (SELECT N FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) V(N)), -- 10
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) FROM E4 a, E4 b) -- 100,000,000
SELECT TOP(@Count)
N As Number,
(
SELECT TOP (@Length) CHAR(
CASE Abs(Checksum(NewGuid)) % 2
WHEN 0 THEN 65 + Abs(Checksum(NewGuid)) % 26 -- Random upper case letter
ELSE 48 + Abs(Checksum(NewGuid)) % 10 -- Random digit
END
)
FROM Tally As t0
CROSS JOIN GuidGenerator
WHERE t0.n <> -t1.n
FOR XML PATH('')
) As RandomString
FROM Tally As t1
GO
Then, using distinct, top 4000000 and a simple where clause - select the random strings you want:
SELECT DISTINCT TOP 4000000 Number, RandomString
FROM dbo.RandomStringGenerator(6,100000000)
WHERE RandomString NOT LIKE '%[IiOoLl01]%' -- in case your database's default collation is case sensitive...
The reason this is the fastest solution is very simple - my solution already generates the strings randomly, so I don't need to also sort them randomly, which is the biggest bottleneck of the other suggested solutions.
If you don't need the order to be random, you can go with SQLpro's solution, just remove the order by newid() - that was the fastest solution (though it didn't filter out the unwanted chars)
Update
As requested by lptr - here's an example on how to select the random strings and another table as well:
WITH Tbl AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Em_Id) As Rn
FROM <TableNameHere>
), Rnd AS
(
SELECT DISTINCT TOP 4000000 ROW_NUMBER() OVER (ORDER BY Number) As Rn, RandomString
FROM dbo.RandomStringGenerator(6,100000000)
WHERE RandomString NOT LIKE '%[IiOoLl01]%'
)
SELECT Em_Id, RandomString
FROM Tbl
INNER JOIN rnd
ON Tbl.Rn = Rnd.Rn
Notes:
Change <TableNameHere> to the actual table name
You can use any column (or constant) for the order by of the row number, it doesn't matter because the order is irrelevant here anyway.
I need to be able to apply unique 8 character strings per row on a table that has almost 2.5 million records.
I have tried this:
UPDATE MyTable
SET [UniqueID]=SUBSTRING(CONVERT(varchar(255), NEWID()), 1, 8)
Which works, but when I check the uniqueness of the ID's, I receive duplicates
SELECT [UniqueID], COUNT([UniqueID])
FROM NicoleW_CQ_2019_Audi_CR_Always_On_2019_T1_EM
GROUP BY [UniqueID]
HAVING COUNT([UniqueID]) > 1
I really would just like to update the table, as above, with just a simple line of code, if possible.
Here's a way that uses a temporary table to assure the uniqueness
Create and fill a #temporary table with unique random 8 character codes.
The SQL below uses a FOR XML trick to generate the codes in BASE62 : [A-Za-z0-9]
Examples : 8Phs7ZYl, ugCKtPqT, U9soG39q
A GUID only uses the characters [0-9A-F].
For 8 characters that can generate 16^8 = 4294967296 combinations.
While with BASE62 there are 62^8 = 2.183401056e014 combinations.
So the odds that a duplicate is generated are significantly lower with BASE62.
The temp table should have an equal or larger number of records than the destination table.
This example only generates 100000 codes. But you get the idea.
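To put rough numbers on that comparison, the following Python sketch uses the usual birthday-style estimate for the expected number of colliding pairs. It assumes the codes are drawn uniformly at random, which is only an approximation for both methods, and takes the 2.5 million rows mentioned in the question; the function name is made up for the example.

```python
def expected_duplicate_pairs(rows, keyspace):
    """Birthday-style estimate: expected number of colliding pairs when
    `rows` values are drawn uniformly at random from `keyspace` values."""
    return rows * (rows - 1) / (2 * keyspace)

ROWS = 2_500_000        # table size mentioned in the question

hex8 = 16 ** 8          # 8 characters taken from a GUID: [0-9A-F]
base62_8 = 62 ** 8      # 8 characters from [A-Za-z0-9]

print(hex8, expected_duplicate_pairs(ROWS, hex8))
print(base62_8, expected_duplicate_pairs(ROWS, base62_8))
```

Under these assumptions the GUID prefix is expected to collide hundreds of times at this table size, while the BASE62 codes will usually not collide at all. The unique constraint is still needed, because "usually" is not a guarantee.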
IF OBJECT_ID('tempdb..#tmpRandoms') IS NOT NULL DROP TABLE #tmpRandoms;
CREATE TABLE #tmpRandoms (
ID INT PRIMARY KEY IDENTITY(1,1),
[UniqueID] varchar(8),
CONSTRAINT UC_tmpRandoms_UniqueID UNIQUE ([UniqueID])
);
WITH DIGITS AS
(
select n
from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
),
NUMS AS
(
select (d5.n*10000 + d4.n*1000 + d3.n*100 + d2.n * 10 + d1.n) as n
from DIGITS d1
cross join DIGITS d2
cross join DIGITS d3
cross join DIGITS d4
cross join DIGITS d5
)
INSERT INTO #tmpRandoms ([UniqueID])
SELECT DISTINCT LEFT(REPLACE(REPLACE((select CAST(NEWID() as varbinary(16)), n FOR XML PATH(''), BINARY BASE64),'+',''),'/',''), 8) AS [UniqueID]
FROM NUMS;
Then update your table with it
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (ORDER BY ID) AS RN, [UniqueID]
FROM YourTable
)
UPDATE t
SET t.[UniqueID] = tmp.[UniqueID]
FROM CTE t
JOIN #tmpRandoms tmp ON tmp.ID = t.RN;
A test on rextester here
Can you just use numbers and assign a randomish value?
with toupdate as (
select t.*,
row_number() over (order by newid()) as random_enough
from mytable t
)
update toupdate
set UniqueID = right(concat('00000000', random_enough), 8);
See: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/a289ed64-2038-415e-9f5d-ae84e50fe702/generate-random-string-of-length-5-az09?forum=transactsql
Adjust DECLARE @s char(5) and SELECT TOP (5) c1 to set the length you want.
The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1,3,5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ Million records...you can imagine it would be very slow (I was later informed of the number of records, the sample data I was given was much smaller)...:
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM @values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.
I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from @values v
group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from @values
From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in
the table.
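This is because the values are combined with an operation where order does not matter, just as with a sum: 1 + 2 + 3 = 3 + 2 + 1 = 6. The downside is that different sets can produce the same result (with a sum, 3 + 3 is also 6), which is why the query above compares the average as well.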
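To see why an order-independent checksum alone is not enough, here is a Python sketch over the sample data from the question. It assumes an XOR of the values as a stand-in for CHECKSUM_AGG; that is how the function is commonly described, but it is an assumption here and not something the quoted documentation states.

```python
from functools import reduce
from operator import xor

# Sample data from the question: (GroupID, SubSet) -> values
subsets = {
    (1, "a"): [1, 2, 3], (1, "b"): [1, 3, 5], (1, "c"): [1, 3, 5],
    (2, "a"): [1, 2, 3], (2, "b"): [2, 4, 6], (2, "c"): [1, 3, 6],
}

def xor_checksum(values):
    # ASSUMPTION: XOR stands in for CHECKSUM_AGG on small integers.
    return reduce(xor, values, 0)

def average(values):
    return sum(values) / len(values)

for key, values in subsets.items():
    print(key, xor_checksum(values), average(values))
```

Under that assumption, subsets 'a' and 'b' of GroupID 2 get the same checksum even though they hold different values, and only the second comparison (the average) tells them apart. A matching checksum is therefore a hint that two subsets may be equal, not proof.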
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from @Values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from @Values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from @values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from @values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a quirk performance-wise. If you find that the query engine is running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.