Random selection in SQL Server - sql

I have two sets and for each value in the first set I want to apply a number of random values from the second. The approach I have chosen uses a select from the first with a cross apply from the second. A simplified MWE is as follows:
DROP TABLE IF EXISTS #S;
CREATE TABLE #S (c CHAR(1));
INSERT INTO #S VALUES ('A'), ('B');
DROP TABLE IF EXISTS #T;
WITH idGen(id) AS (
SELECT 1
UNION ALL
SELECT id + 1 FROM idGen WHERE id < 1000
)
SELECT id INTO #T FROM idGen OPTION(MAXRECURSION 0);
DROP TABLE IF EXISTS #R;
SELECT c, id INTO #R FROM #S
CROSS APPLY (
SELECT id, ROW_NUMBER() OVER (
/*
-- this gives 100% overlap
PARTITION BY c
ORDER BY RAND(CHECKSUM(NEWID()))
*/
-- this gives the expected ~10% overlap
ORDER BY RAND(CHECKSUM(NEWID()) + CHECKSUM(c))
) AS R
FROM #T
) t
WHERE t.R <= 100;
SELECT COUNT(*) AS PercentOverlap -- ~10%
FROM #R rA JOIN #R rB
ON rB.id = rA.id AND rB.c = 'B'
WHERE rA.c = 'A';
While this solution works, I am wondering why changing to the (commented) partitioning method does not? Also, are there any caveats using this solution, seeing as it feels sort of dirty to add two checksums?
In the actual problem there is also a count in the first set containing the number of random values to select from the second set, which replaces the static 100 in the example above. However, using the fixed 100 made it easy to verify the expected overlap.

The RAND() function is a run-time constant in SQL Server, which means it is usually evaluated once per query. When you pass a value to RAND(), that value serves as the starting seed.
You need to examine the execution plan to see where the optimiser places the evaluation of the functions. In the case that doesn't produce the expected result, the optimiser has most likely optimised too aggressively and moved all the "randomness" outside the loop.
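A minimal sketch of that difference, using only built-in functions (the three-row VALUES list is made up for illustration): an unseeded RAND() typically shows the same value on every row, while NEWID() yields a new value per row.

```sql
-- Illustrative three-row set; not part of the original question.
SELECT v.n,
       RAND()  AS rand_value,   -- run-time constant: same value on every row
       NEWID() AS newid_value   -- evaluated per row: different on every row
FROM (VALUES (1), (2), (3)) AS v(n);
```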
Also, there is no point in wrapping NEWID() in CHECKSUM() and then in RAND().
A plain NEWID() is enough. Or, even better, use a function that is designed to produce random numbers, such as CRYPT_GEN_RANDOM().
Either version of your query looks a bit strange. I'd write it like this:
SELECT c, id INTO #R
FROM #S
CROSS APPLY
(
SELECT TOP(100) -- or #S.SomeField instead of 100
id
FROM #T
ORDER BY CRYPT_GEN_RANDOM(4) -- generate 4 random bytes, usually it is enough
) AS t
;
This gives 100 random rows from #T for each row from #S.
Actually, the query above is not good. The optimiser again sees that the inner query (inside the CROSS APPLY) doesn't depend on the outer query and optimises it away.
The end result is that the random rows are selected only once.
We need something that makes the optimiser run the inner query for each row from #S.
One way would be something like this:
SELECT c, id INTO #R
FROM #S
CROSS APPLY
(
SELECT TOP(100) -- or #S.SomeField instead of 100
id
FROM #T
ORDER BY CRYPT_GEN_RANDOM(4) + CHECKSUM(c)
) AS t
;
The inner query needs something that references the row from the outer query. If you put TOP(#S.SomeField) instead of the constant TOP(100), then + CHECKSUM(c) is not needed.
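A rough sketch of the correlated TOP variant. The table #S2 and its column n (the per-row sample size) are hypothetical stand-ins for #S.SomeField, and #T is assumed to exist as built in the question (ids 1 to 1000):

```sql
-- Hypothetical variant of #S with a per-row sample size n (not in the original post).
DROP TABLE IF EXISTS #S2;
CREATE TABLE #S2 (c CHAR(1), n INT);
INSERT INTO #S2 VALUES ('A', 3), ('B', 5);

SELECT s.c, t.id
FROM #S2 AS s
CROSS APPLY
(
    SELECT TOP (s.n) id            -- correlated TOP: references the outer row
    FROM #T                        -- #T as created in the question
    ORDER BY CRYPT_GEN_RANDOM(4)
) AS t;
```

It is still worth checking the execution plan to confirm that #T is scanned once per row of #S2.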
This is the plan for the first variant. You can see that #T is scanned once (1000 rows are read).
This is the plan for the second variant. You can see that #T is scanned twice (2000 rows are read).

Related

SQL Use Value from different column on same Select

I would like to know whether this is possible, as it would save me time writing extra code and limit user error. I need to use a value from a column (which has already performed some calculation) in the same select and then do an extra calculation on it.
I encounter this a lot in my job. I will highlight the problem with a small example.
I have the following table created with one row added to it:
DECLARE @info AS TABLE
(
Name VARCHAR(500),
Value_A NUMERIC(8, 2)
)
INSERT INTO @info
VALUES ('Test Name 1', 10.20)
Now the requirement is to produce a select with 2 columns. The first column needs to multiply Value_A by 10, and the second column needs to add 1 to the first column. Below is the full requirement:
SELECT (I.Value_A * 10) ,
(I.Value_A * 10) + 1
FROM @info AS I
As you can see, I just copied and pasted the first column code to second column and added one to it. Is there a way I can just reference the first column and just add + 1 instead of the copy and paste?
I can achieve this in another way, using an insert block followed by an update block: create a temp table, insert the first column into it, then update the second column. However, this means I have written extra code. I am looking for a solution that needs only one select.
The above is a small example. Normally, the problems I face involve a bigger select with more calculations or logic.
You can move the expression to the FROM clause using APPLY:
SELECT v.col1, v.col1 + 1
FROM @info I CROSS APPLY
(VALUES (I.Value_A * 10)) v(col1);
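For bigger selects with several dependent calculations, the same idea can be chained. The sketch below is only an illustration: the aliases a and b and the column names times_ten, plus_one and doubled are made up.

```sql
DECLARE @info TABLE (Name VARCHAR(500), Value_A NUMERIC(8, 2));
INSERT INTO @info VALUES ('Test Name 1', 10.20);

SELECT I.Name,
       a.times_ten,
       b.plus_one,
       b.plus_one * 2 AS doubled       -- later steps can reuse earlier ones
FROM @info AS I
CROSS APPLY (VALUES (I.Value_A * 10)) AS a(times_ten)   -- step 1
CROSS APPLY (VALUES (a.times_ten + 1)) AS b(plus_one);  -- step 2 builds on step 1
```

Each APPLY can refer to the columns defined by the ones before it, so every expression is written only once.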
For the example given I would also use Gordon's method, but it's worth knowing other techniques, e.g. a sub-query and a common table expression (very similar to a sub-query), as they may be more appropriate for specific situations.
I find that a straight sub-query helps with understanding what is happening in the other solutions.
SELECT Calc1, Calc1 + 1
FROM (
SELECT (I.Value_A * 10) Calc1
FROM @info AS I
) X;
-- OR
WITH cte AS (
SELECT (I.Value_A * 10) Calc1
FROM @info AS I
)
SELECT Calc1, Calc1 + 1
FROM cte;

SQL Server random using seed

I want to add a column to my table with a random number using seed.
If I use RAND:
select *, RAND(5) as random_id from myTable
I get an equal value(0.943597390424144 for example) for all the rows, in the random_id column. I want this value to be different for every row - and that for every time I will pass it 0.5 value(for example), it would be the same values again(as seed should work...).
How can I do this?
(
For example, in PostgreSQL I can write
SELECT setseed(0.5);
SELECT t.* , random() as random_id
FROM myTable t
And I will get different values in each row.
)
Edit:
After I saw the comments here, I have managed to work this out somehow - but it's not efficient at all.
If someone has an idea how to improve it - it will be great. If not - I will have to find another way.
I used the basic idea of the example in here.
Creating a temporary table with blank seed value:
select * into t_myTable from (
select t.*, -1.00000000000000000 as seed
from myTable t
) as temp
Adding a random number for each seed value, one row at a time(this is the bad part...):
USE CPatterns;
GO
DECLARE @seed float;
DECLARE @id int;
DECLARE VIEW_CURSOR CURSOR FOR
select id
from t_myTable t;
OPEN VIEW_CURSOR;
FETCH NEXT FROM VIEW_CURSOR
into @id;
set @seed = RAND(5);
WHILE @@FETCH_STATUS = 0
BEGIN
set @seed = RAND();
update t_myTable set seed = @seed where id = @id
FETCH NEXT FROM VIEW_CURSOR
into @id;
END;
CLOSE VIEW_CURSOR;
DEALLOCATE VIEW_CURSOR;
GO
Creating the view using the seed value and ordering by it
create view my_view AS
select row_number() OVER (ORDER BY seed, id) AS source_id ,t.*
from t_myTable t
I think the simplest way to get a repeatable random id in a table is to use row_number() or a fixed id on each row. Let me assume that you have a column called id with a different value on each row.
The idea is just to use this as a seed:
select rand(id*1) as random_id
from mytable;
Note that the seed for the id is an integer and not a floating point number. If you wanted a floating point seed, you could do something with checksum():
select rand(checksum(id*0.5)) as random_id
. . .
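One caveat with seeding RAND() directly from consecutive ids is that neighbouring seeds tend to produce very similar values. As a swapped-in alternative (not the method described above), the sketch below hashes the seed together with the id. The variable @seed, the '|' separator and the five-row VALUES list standing in for myTable are assumptions for illustration; HASHBYTES with SHA2_256 and CONCAT need SQL Server 2012 or later.

```sql
DECLARE @seed INT = 5;  -- hypothetical seed; the same seed repeats the same output

SELECT t.id,
       -- hash seed + id, fold the hash to an integer, scale into [0, 1)
       ABS(CAST(CHECKSUM(HASHBYTES('SHA2_256', CONCAT(@seed, '|', t.id))) AS BIGINT))
           % 1000000 / 1000000.0 AS random_id
FROM (VALUES (1), (2), (3), (4), (5)) AS t(id);  -- stand-in for myTable
```

The cast to BIGINT is there so that ABS() cannot overflow on the most negative INT value.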
If you are doing this for sampling (where you would say random_id < 0.1 for a 10% sample, for instance), then I often use modulo arithmetic on row_number():
with t as (
select t.*, row_number() over (order by id) as seqnum
from mytable t
)
select *
from t
where ((seqnum * 17 + 71) % 101) < 10
This returns about 10% of the numbers (okay, really 10/101). And you can adjust the sample by fiddling with the constants.
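A quick way to check the 10/101 figure is to run the filter over a generated sequence, keeping the rows whose remainder is one of 10 of the 101 possible values. The sketch below assumes sys.all_objects cross joined with itself has at least 1010 rows, which is normally the case; 1010 is chosen because it is exactly ten cycles of 101.

```sql
WITH nums AS (
    -- seqnum = 1 .. 1010
    SELECT TOP (1010) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS seqnum
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
    ORDER BY seqnum
)
SELECT COUNT(*) AS sampled_rows   -- 100 of the 1010 rows pass the filter
FROM nums
WHERE ((seqnum * 17 + 71) % 101) < 10;
```

Because 17 and 101 are coprime, every remainder occurs exactly once in each run of 101 consecutive seqnum values, so the sample is deterministic and repeatable.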
Someone suggested a similar query using newid(), but I'm giving you the solution that works for me.
There's a workaround that uses newid() instead of rand() and serves the same purpose. You can execute it on its own or as a column in a query. It produces a random value per row rather than the same value for every row in the select statement.
If you need a random number from 0 - N, just change 100 for the desired number.
SELECT TOP 10 [Flag forca]
,1+ABS(CHECKSUM(NEWID())) % 100 AS RANDOM_NEWID
,RAND() AS RANDOM_RAND
FROM PAGSEGURO_WORK.dbo.jobSTM248_tmp_leitores_iso
So, in case it helps someone someday, here's what I eventually did.
I'm generating the random seeded values in the server side(Java in my case), and then create a table with two columns: the id and the generated random_id.
Now I create the view as an inner join between the table and the original data.
The generated SQL looks something like that:
CREATE TABLE SEED_DATA(source_id INT PRIMARY KEY, random_id float NOT NULL);
select Rand(5);
insert into SEED_DATA values(1,Rand());
insert into SEED_DATA values(2, Rand());
insert into SEED_DATA values(3, Rand());
.
.
.
insert into SEED_DATA values(1000000, Rand());
and
CREATE VIEW DATA_VIEW
as
SELECT row_number() OVER (ORDER BY random_id, id) AS source_id,column1,column2,...
FROM
( select * from SEED_DATA tmp
inner join my_table i on tmp.source_id = i.id) TEMP
In addition, I create the random numbers in batches, 10,000 or so in each batch(may be higher), so it will not weigh heavily on the server side, and for each batch I insert it to the table in a separate execution.
All of that because I couldn't find a good way to do what I want purely in SQL. Updating row after row is really not efficient.
My own conclusion from this story is that SQL Server is sometimes really annoying...
You could derive a random number from the seed:
rand(row_number() over (order by ___, ___, ___))
Then cast that as a varchar, and use the last 3 characters as another seed.
That would give you a nice random value:
rand(right(cast(rand(row_number() over (order by x, y, z)) as varchar(15)), 3))

Solution to avoid non-sargable argument in where clause

In the code_list CTE in this query I have a row constructor that will eventually take any number of arguments. The column icd in the patient_codes CTE is a five-digit identifier that is more descriptive than the three-digit codes that the row constructor has. The table icd_patient has 100 million rows, so for performance's sake I would like to filter the rows on this table before I do any further work. I have
;with code_list(code_list)
as
(
select x.code_list
from (values ('70700'),('25002')) as x(code_list)
),patient_codes
as
(
select distinct icd,pat_id,id
from icd_patient
where icd in (select icd from code_list)
)
select distinct pat_id from patient_codes
The problem, however, is that in the icd_patient table all of the icd columns are five-digit and more descriptive. If I look at the execution plan of this query it's pretty streamlined. If I do
;with code_list(code_list)
as
(
select x.code_list
from (values ('70700'),('25002')) as x(code_list)
),patient_codes
as
(
select substring(icd,1,3) as icd,pat_id
from icd_patient2
where substring(icd,1,3) in (select * from code_list)
)
select * from patient_codes
this of course has a large performance impact because of the substring expression in the where clause. Does something akin to "like in" exist so I can take advantage of my indexes?
Index on icd_patient
CREATE NONCLUSTERED INDEX [ix_icd_patient] ON [dbo].[icd_patient2]
(
[pat_id] ASC
)
INCLUDE ( [id],
This much simpler query should be better than (or, at worst, the same as) your existing query.
select pat_id
FROM dbo.icd_patient
where icd LIKE '707%'
OR icd LIKE '250%'
GROUP BY pat_id;
Note that sargability only matters if there is actually an index on this column.
An alternative (since OR can sometimes give the optimizer fits):
SELECT pat_id FROM
(
SELECT pat_id
FROM dbo.icd_patient
WHERE icd LIKE '707%'
UNION ALL
SELECT pat_id
FROM dbo.icd_patient
WHERE icd LIKE '250%'
) AS x
GROUP BY pat_id;
To make this extensible beyond a handful of OR conditions, I would use a table-valued parameter (TVP).
CREATE TYPE dbo.StringPatterns AS TABLE(s VARCHAR(3) PRIMARY KEY);
Then your stored procedure could say:
CREATE PROCEDURE dbo.whatever
@sp dbo.StringPatterns READONLY
AS
BEGIN
SET NOCOUNT ON;
SELECT p.pat_id
FROM dbo.icd_patient AS p
INNER JOIN @sp AS sp
ON p.icd LIKE sp.s + '%'
GROUP BY p.pat_id;
END
Then you can pass in your set of three-character substrings from a DataTable or other collection in C#. From T-SQL just as an example:
DECLARE @p dbo.StringPatterns;
INSERT @p VALUES('707'),('250');
EXEC dbo.whatever @sp = @p;
Something like "like in" does not exist. The following is sargable:
select *
from icd_patient
where icd like '70700%' or
icd like '25002%'
Because like with a constant initial substring is a special case for SQL Server. This does not work when the strings on the right are variables.
One solution is to create an indexed view on the icd_patient table with an index on the first five characters of the icd code.
Using "IN" makes that part of a command non-sargable on both sides. End of discussion.
Saying he fixes it using substring, completely changes what it would return while it remains non sarged.
Any "fix" should exactly match results. The actual fix is to join the cte so the five characters match or put three characters in the cte and match that in a join or put 4 characters in the cte where the fourth is "%" and join matching by using LIKE
Using a "like" that starts with "%" increases the complexity of the search, but it would still use the index to find the value because parsing the index should use less reading by only getting the full table row when a search is successful.

Multiple replacements in string in single Update Statement in SQL server 2005

I've a table 'tblRandomString' with following data:
ID ItemValue
1 *Test"
2 ?Test*
I've another table 'tblSearchCharReplacement' with following data
Original Replacement
* `star`
? `quest`
" `quot`
; `semi`
Now, I want to make a replacement in the ItemValues using these replacement.
I tried this:
Update T1
SET ItemValue = select REPLACE(ItemValue,[Original],[Replacement])
FROM dbo.tblRandomString T1
JOIN
dbo.tblSpecialCharReplacement T2
ON T2.Original IN ('"',';','*','?')
But it doesn't help me because only one replacement is done per update.
One solution is I've to use as a CTE to perform multiple replacements if they exist.
Is there a simpler way?
Sample data:
declare @RandomString table (ID int not null,ItemValue varchar(500) not null)
insert into @RandomString(ID,ItemValue) values
(1,'*Test"'),
(2,'?Test*')
declare @SearchCharReplacement table (Original varchar(500) not null,Replacement varchar(500) not null)
insert into @SearchCharReplacement(Original,Replacement) values
('*','`star`'),
('?','`quest`'),
('"','`quot`'),
(';','`semi`')
And the UPDATE:
;With Replacements as (
select
ID,ItemValue,0 as RepCount
from
@RandomString
union all
select
ID,SUBSTRING(REPLACE(ItemValue,Original,Replacement),1,500),rs.RepCount+1
from
Replacements rs
inner join
@SearchCharReplacement scr
on
CHARINDEX(scr.Original,rs.ItemValue) > 0
), FinalReplacements as (
select
ID,ItemValue,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RepCount desc) as rn
from
Replacements
)
update rs
set ItemValue = fr.ItemValue
from
@RandomString rs
inner join
FinalReplacements fr
on
rs.ID = fr.ID and
rn = 1
Which produces:
select * from @RandomString
ID ItemValue
----------- -----------------------
1 `star`Test`quot`
2 `quest`Test`star`
What this does is it starts with the unaltered texts (the top select in Replacements), then it attempts to apply any valid replacements (the second select in Replacements). What it will do is to continue applying this second select, based on any results it produces, until no new rows are produced. This is called a Recursive Common Table Expression (CTE).
We then use a second CTE (a non-recursive one this time) FinalReplacements to number all of the rows produced by the first CTE, assigning lower row numbers to rows which were produced last. Logically, these are the rows which were the result of applying the last applicable transform, and so will no longer contain any of the original characters to be replaced. So we can use the row number 1 to perform the update back against the original table.
This query does do more work than strictly necessary - for small numbers of rows of replacement characters, it's not likely to be too inefficient. We could clear it up by defining a single order in which to apply the replacements.
Will skipping the join table and nesting REPLACE functions work?
Or do you need to actually get the data from the other table?
-- perform 4 replaces in a single update statement
UPDATE T1
SET ItemValue = REPLACE(
REPLACE(
REPLACE(
REPLACE(
ItemValue,'*','star'),
'?','quest'),
'"','quot'),
';','semi')
FROM dbo.tblRandomString T1
Note: I'm not sure if you need to escape any of the characters you're replacing

Random select is not always returning a single row

The intention of following (simplified) code fragment is to return one random row.
Unfortunately, when we run this fragment in the query analyzer, it returns between zero and three results.
As our input table consists of exactly 5 rows with unique IDs, and as we perform a select on this table where ID equals a random number, we are stumped that more than one row could ever be returned.
Note: among other things, we already tried casting the checksum result to an integer, to no avail.
DECLARE @Table TABLE (
ID INTEGER IDENTITY (1, 1)
, FK1 INTEGER
)
INSERT INTO @Table
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
SELECT *
FROM @Table
WHERE ID = ABS(CHECKSUM(NEWID())) % 5 + 1
Edit
Our usage scenario is as follows (please don't comment on whether it is the right thing to do or not; it's the powers that be that have decided).
Ultimately, we must create a result with realistic values where the combination of producer and weights is obfuscated by selecting existing weights at random from the table itself.
The query would then become something like this (also a reason why RAND cannot be used):
SELECT t.ID
, FK1 = (SELECT FK1 FROM @Table WHERE ID=ABS(CHECKSUM(NEWID())) % 5 + 1)
FROM @Table t
Because the inner select could return zero results, it would return a NULL value, which again is not acceptable. It is the investigation of why the inner select returns between zero and x results that this question sproused from (is this even English?).
Answer
What turned the light on for me was the simple observation that ABS(CHECKSUM(NEWID())) % 5 + 1 was re-evaluated for each row. I was under the impression that ABS(CHECKSUM(NEWID())) % 5 + 1 would get evaluated once, then matched.
Thank you all for answering and slowly but surely leading me to a better understanding.
This happens because NEWID() gives a different value for each row in the table. For each row, independently of the others, there is a one in five chance of it being returned. Consequently, as it stands, you actually have a 1 in 3125 chance of all 5 rows being returned!
To see this, run the following query. You'll see that each row gets a different ID.
SELECT * , NEWID()
FROM @Table
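The per-row behaviour can also be observed by repeating the original filter many times and tallying how many rows each run returns. This is only a sketch: the @hist table, the @i counter and the figure of 1000 runs are made up for illustration.

```sql
DECLARE @Table TABLE (ID INTEGER IDENTITY (1, 1), FK1 INTEGER);
INSERT INTO @Table
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5;

DECLARE @hist TABLE (rows_returned INT);  -- one entry per run
DECLARE @i INT = 0;

WHILE @i < 1000
BEGIN
    -- same predicate as in the question; record how many rows matched
    INSERT INTO @hist
    SELECT COUNT(*) FROM @Table WHERE ID = ABS(CHECKSUM(NEWID())) % 5 + 1;
    SET @i += 1;
END;

SELECT rows_returned, COUNT(*) AS runs
FROM @hist
GROUP BY rows_returned
ORDER BY rows_returned;
```

If each row is tested independently with a one in five chance, runs returning zero rows should be common (roughly a third of them) and runs returning all five rows should be very rare.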
This will fix your code:
DECLARE @Id int
SET @Id = ABS(CHECKSUM(NEWID())) % 5 + 1
SELECT *
FROM @Table
WHERE ID = @Id
However, I'm not sure this is the most efficient method of selecting a single random row from the table.
You might find this MSDN article useful: http://msdn.microsoft.com/en-us/library/Aa175776 (Random Sampling in T-SQL)
EDIT 1: now I think about it, this probably is the most efficient way to do it, assuming the number of rows remains fixed and the IDs are guaranteed to be contiguous.
EDIT 2: to achieve the desired result when used as a sub-query, use TOP 1 like this:
SELECT t.ID
, FK1 = (SELECT TOP 1 FK1 FROM @Table ORDER BY NEWID())
FROM @Table t
A bit of a guess, and not sure that SQL works this way, but wouldn't SQL evaluate "ABS(CHECKSUM(NEWID())) % 5 + 1" for each row in the table? If so, then each evaluation may or may not return the value of ID of the current row.
Try this instead - generating the random number explicitly first, and matching on that single value:
declare @targetRandom int
set @targetRandom = ABS(CHECKSUM(NEWID())) % 5 + 1
select * from @Table where ID = @targetRandom
Try the following, so you can see what happens:
-- Compare Number with ID on each row (a column alias cannot be referenced in WHERE)
SELECT ABS(CHECKSUM(NEWID())) % 5 + 1 AS Number, t.*
FROM @Table t
Or you could use RAND() instead of NEWID(), which is only evaluated once per query in MS SQL
If you want to use CHECKSUM to obtain a random row, this is the way to do it.
SELECT TOP 1 *
FROM @Table
ORDER BY CHECKSUM(NEWID())
what about?
SELECT t.ID
, FK1 = (SELECT TOP 1 FK1 FROM @Table ORDER BY NEWID())
FROM @Table t
This may help you understand the reasons.
Run the query multiple times. How many times does MY_FILTER = ID ?
SELECT *, ABS(CHECKSUM(NEWID())) % 5 + 1 AS MY_FILTER
FROM @Table
SELECT *, ABS(CHECKSUM(NEWID())) % 5 + 1 AS MY_FILTER
FROM @Table
SELECT *, ABS(CHECKSUM(NEWID())) % 5 + 1 AS MY_FILTER
FROM @Table
I don't know how much this will be helpful to you, but try this.. All I understood is you want one random row each time you execute the query..
select top 1 newid() as row,ID from @Table order by row
Here is the logic. Each time you execute the query a newid is being assigned to each row and all are unique and the you just order them with the new uniquely generated rowid. Then all you need to do is select the top most or whatever you want..