Giving Range to the SQL Column - sql

I have SQL table in which I have column and Probability . I want to select one row from it with randomly but I want to give more chances to the more waighted probability. I can do this by
Order By abs(checksum(newid()))
But the difference between Probabilities are too much so it gives more chance to highest probability.Like After picking 74 times that value it pick up another value for once than again around 74 times.I want to reduce this .Like I want 3-4 times to it and than others and all. I am thinking to give Range to the Probabilies.Its Like
Row[i] = Row[i-1]+Row[i]
How can I do this .Do I need to create function?Is there any there any other way to achieve this.I am neewby.Any help will be appriciated.Thank You
EDIT:
I have solution of my problem . I have one question .
if I have table as follows.
Column1 Column2
1 50
2 30
3 20
can i get?
Column1 Column2 Column3
1 50 50
2 30 80
3 20 100
Each time I want to add value with existing one.Is there any Way?
UPDATE:
Finally get the solution after 3 hours,I just take square root of my probailities that way I can narrow the difference bw them .It is like I add column with
sqrt(sqrt(sqrt(Probability)))....:-)

I'd handle it by something like
ORDER BY rand()*pow(<probability-field-name>,<n>)
for different values of n you will distort the linear probabilities into a simple polynomial. Small values of n (e.g. 0.5) will compress the probabilities to 1 and thus make less probable choices more probable, big values of n (e.g. 2) will do the opposite and further reduce probability of already inprobable values.

Since the difference in probabilities is too great, you need to add a computed field with a revised weighting that has a more even probability distribution. How you do that depends on your data and preferred distribution. One way to do it is to "normalize" the weighting to an integer between 1 and 10 so that the lowest probability is never more than ten times smaller than the highest.

Answer to your recent question:
SELECT t.Column1,
t.Column2,
(SELECT SUM(Column2)
FROM table t2
WHERE t2.Column1 <= t.Column1) Column3
FROM table t

Here is a basic example how to select one row from the table with taking into account the assigned row weights.
Suppose we have table:
CREATE TABLE TableWithWeights(
Id int NOT NULL PRIMARY KEY,
DataColumn nvarchar(50) NOT NULL,
Weight decimal(18, 6) NOT NULL -- Weight column
)
Let's fill table with sample data.
INSERT INTO TableWithWeights VALUES(1, 'Frequent', 50)
INSERT INTO TableWithWeights VALUES(2, 'Common', 30)
INSERT INTO TableWithWeights VALUES(3, 'Rare', 20)
This is the query that returns one random row with taking into account given row weights.
SELECT * FROM
(SELECT tww1.*, -- Select original table data
-- Add column with the sum of all weights of previous rows
(SELECT SUM(tww2.Weight)- tww1.Weight
FROM TableWithWeights tww2
WHERE tww2.id <= tww1.id) as SumOfWeightsOfPreviousRows
FROM TableWithWeights tww1) as tww,
-- Add column with random number within the range [0, SumOfWeights)
(SELECT RAND()* sum(weight) as rnd
FROM TableWithWeights) r
WHERE
(tww.SumOfWeightsOfPreviousRows <= r.rnd)
and ( r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)
To check query results we can run it for 100 times.
DECLARE #count as int;
SET #count = 0;
WHILE ( #count < 100)
BEGIN
-- This is the query that returns one random row with
-- taking into account given row weights
SELECT * FROM
(SELECT tww1.*, -- Select original table data
-- Add column with the sum of all weights of previous rows
(SELECT SUM(tww2.Weight)- tww1.Weight
FROM TableWithWeights tww2
WHERE tww2.id <= tww1.id) as SumOfWeightsOfPreviousRows
FROM TableWithWeights tww1) as tww,
-- Add column with random number within the range [0, SumOfWeights)
(SELECT RAND()* sum(weight) as rnd
FROM TableWithWeights) r
WHERE
(tww.SumOfWeightsOfPreviousRows <= r.rnd)
and ( r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)
-- Increase counter
SET #count += 1
END
PS The query was tested on SQL Server 2008 R2. And of course the query can be optimized (it's easy to do if you get the idea)

Related

Given a table of numbers, can I get all the rows which add up to less than or equal to a number?

Say I have a table with an incrementing id column and a random positive non zero number.
id
rand
1
12
2
5
3
99
4
87
Write a query to return the rows which add up to a given number.
A couple rules:
Rows must be "consumed" in order, even if a later row makes it a a perfect match. For example, querying for 104 would be a perfect match for rows 1, 2, and 4 but rows 1-3 would still be returned.
You can use a row partially if there is more available than is necessary to add up to whatever is leftover on the number E.g. rows 1, 2, and 3 would be returned if your max number is 50 because 12 + 5 + 33 equals 50 and 90 is a partial result.
If there are not enough rows to satisfy the amount, then return ALL the rows. E.g. in the above example a query for 1,000 would return rows 1-4. In other words, the sum of the rows should be less than or equal to the queried number.
It's possible for the answer to be "no this is not possible with SQL alone" and that's fine but I was just curious. This would be a trivial problem with a programming language but I was wondering what SQL provides out of the box to do something as a thought experiment and learning exercise.
You didn't mention which RDBMS, but assuming SQL Server:
DROP TABLE #t;
CREATE TABLE #t (id int, rand int);
INSERT INTO #t (id,rand)
VALUES (1,12),(2,5),(3,99),(4,87);
DECLARE #target int = 104;
WITH dat
AS
(
SELECT id, rand, SUM(rand) OVER (ORDER BY id) as runsum
FROM #t
),
dat2
as
(
SELECT id, rand
, runsum
, COALESCE(LAG(runsum,1) OVER (ORDER BY id),0) as prev_runsum
from dat
)
SELECT id, rand
FROM dat2
WHERE #target >= runsum
OR #target BETWEEN prev_runsum AND runsum;

Sample certain number of result rows from a postgres table based on given proportions

Let's say I have a table named population with 1000 rows like the following:
And I have another table named proportions that holds the desired proportions of different Group_Names that I want to extract:
I want to randomly sample 100 rows from population table where the proportions of the Group_Names within the sample is in line with that of the Proportion field within proportions table. So in that 100 rows sample, 50 rows should be Group-A, 30 rows should be Group-B and 20 rows should be Group-C.
I can manually sample like:
CREATE EXTENSION tsm_system_rows;
SELECT * FROM population TABLESAMPLE SYSTEM_ROWS(100);
But I do not know how to sample from population programmatically based on proportions table especially if proportions table has a lot more Group_Names than 3 as shown in the example.
The main problem that you will be facing is that TABLESAMPLE takes the sample before applying your group filter. Say that you want 20 rows from group C. The chances of getting those 20 by running
SELECT *
FROM population TABLESAMPLE system_rows(20)
WHERE group_name = 'C'
are pretty slim if group C is small relative to other groups in population.
I'd solve this by writing a stored function that receives as parameters the group name and wanted amount of rows, and samples the table until reaching the wanted amount of rows.
You should also limit the number of iterations, in case that the group is very sparse or there or not enough rows to fulfill the need.
So the function could look like so
CREATE OR REPLACE FUNCTION sample_group (p_group_name text, sample_size int, max_iterations int)
RETURNS int[]
LANGUAGE PLPGSQL AS $$
DECLARE
result int[];
i int := 0;
BEGIN
WHILE i < max_iterations AND coalesce(array_length(result, 1), 0) < sample_size LOOP
WITH sample AS (
SELECT group_name, value
FROM population TABLESAMPLE BERNOULLI (1)
LIMIT 10 * sample_size
), add_rows AS (
SELECT result || array_agg(value) arr
FROM sample
WHERE group_name = p_group_name
)
SELECT array_agg(DISTINCT value), i + 1
INTO result, i
FROM add_rows, unnest(arr) AS t(value);
END LOOP;
RETURN result[1:sample_size];
END;
$$;
I'm using BERNOULLI sampling to avoid getting the same rows over and over.
The function did most of the work for you. All that remains is to call it. In this example I'm setting an upper limit of 500 on the iterations.
SELECT group_name, unnest(sample_group(group_name, (100*proportion)::int, 500)) AS value
from proportions;
You can sample based on randomly assigned row numbers:
select *
from
(
select *
,case
when row_number()
over (partition by pop.group_name
order by random()) <= pr.proportion * 100 -- sample size
then 1
else 0
end as flag
from population as pop
join proportions as pr
on pop.group_name = pr.group_name
) as dt
where flag = 1
Edit:
If the table is large creating a SAMPLE before ROW_NUMBER might greatly reduce the number of rows processed. Of course, the SAMPLE size must be large enough to contain at least the required number of rows, i.e. way over 100 rows.

SQL Server 2012 : update a row with unique number

I have a table with 50k records. Now I want to update one column of the table with a random number. The number should be 7 digits.
I don't want to do that with procedure or loop.
PinDetailId PinNo
--------------------
783 2722692
784 9888648
785 6215578
786 7917727
I have tried this code but not able to succeed. I need 7 digit number.
SELECT
FLOOR(ABS(CHECKSUM(NEWID())) / 2147483647.0 * 3 + 1) rn,
(FLOOR(2000 + RAND() * (3000 - 2000) )) AS rn2
FROM
[GeneratePinDetail]
Random
For a random number, you can use ABS(CHECKSUM(NewId())) % range + lowerbound:
(source: How do I generate random number for each row in a TSQL Select?)
INSERT INTO ResultsTable (PinDetailId, PinNo)
SELECT PinDetailId,
(ABS(CHECKSUM(NewId())) % 1000000 + 1000000) AS `PinNo`
FROM GeneratePinDetail
ORDER BY PinDetailId ASC;
Likely Not Unique
I cannot guarantee these will be unique; but it should be evenly distributed (equal chance of any 7 digit number). If you want to check for duplicates you can run this:
SELECT PinDetailId, PinNo
FROM ResultsTable result
INNER JOIN (
SELECT PinNo
FROM ResultsTable
GROUP BY PinNo
HAVING Count(1) > 1
) test
ON result.PinNo = test.PinNo;
You can create a sequence object and update your fields - it should automatically increment per update.
https://learn.microsoft.com/en-us/sql/t-sql/functions/next-value-for-transact-sql
Updated based on comment:
After retrieving the 'next value for' in the sequence, you can do operations on it to randomize. The sequence can basically be used then to create a unique seed for your randomization function.
If you don't want to create a function yourself, SQL Server has the RAND function build in already.
https://learn.microsoft.com/en-us/sql/t-sql/functions/rand-transact-sql

To Generate Random Numbers in Ms Sql server [duplicate]

I need a different random number for each row in my table. The following seemingly obvious code uses the same random value for each row.
SELECT table_name, RAND() magic_number
FROM information_schema.tables
I'd like to get an INT or a FLOAT out of this. The rest of the story is I'm going to use this random number to create a random date offset from a known date, e.g. 1-14 days offset from a start date.
This is for Microsoft SQL Server 2000.
Take a look at SQL Server - Set based random numbers which has a very detailed explanation.
To summarize, the following code generates a random number between 0 and 13 inclusive with a uniform distribution:
ABS(CHECKSUM(NewId())) % 14
To change your range, just change the number at the end of the expression. Be extra careful if you need a range that includes both positive and negative numbers. If you do it wrong, it's possible to double-count the number 0.
A small warning for the math nuts in the room: there is a very slight bias in this code. CHECKSUM() results in numbers that are uniform across the entire range of the sql Int datatype, or at least as near so as my (the editor) testing can show. However, there will be some bias when CHECKSUM() produces a number at the very top end of that range. Any time you get a number between the maximum possible integer and the last exact multiple of the size of your desired range (14 in this case) before that maximum integer, those results are favored over the remaining portion of your range that cannot be produced from that last multiple of 14.
As an example, imagine the entire range of the Int type is only 19. 19 is the largest possible integer you can hold. When CHECKSUM() results in 14-19, these correspond to results 0-5. Those numbers would be heavily favored over 6-13, because CHECKSUM() is twice as likely to generate them. It's easier to demonstrate this visually. Below is the entire possible set of results for our imaginary integer range:
Checksum Integer: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Range Result: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5
You can see here that there are more chances to produce some numbers than others: bias. Thankfully, the actual range of the Int type is much larger... so much so that in most cases the bias is nearly undetectable. However, it is something to be aware of if you ever find yourself doing this for serious security code.
When called multiple times in a single batch, rand() returns the same number.
I'd suggest using convert(varbinary,newid()) as the seed argument:
SELECT table_name, 1.0 + floor(14 * RAND(convert(varbinary, newid()))) magic_number
FROM information_schema.tables
newid() is guaranteed to return a different value each time it's called, even within the same batch, so using it as a seed will prompt rand() to give a different value each time.
Edited to get a random whole number from 1 to 14.
RAND(CHECKSUM(NEWID()))
The above will generate a (pseudo-) random number between 0 and 1, exclusive. If used in a select, because the seed value changes for each row, it will generate a new random number for each row (it is not guaranteed to generate a unique number per row however).
Example when combined with an upper limit of 10 (produces numbers 1 - 10):
CAST(RAND(CHECKSUM(NEWID())) * 10 as INT) + 1
Transact-SQL Documentation:
CAST(): https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql
RAND(): http://msdn.microsoft.com/en-us/library/ms177610.aspx
CHECKSUM(): http://msdn.microsoft.com/en-us/library/ms189788.aspx
NEWID(): https://learn.microsoft.com/en-us/sql/t-sql/functions/newid-transact-sql
Random number generation between 1000 and 9999 inclusive:
FLOOR(RAND(CHECKSUM(NEWID()))*(9999-1000+1)+1000)
"+1" - to include upper bound values(9999 for previous example)
Answering the old question, but this answer has not been provided previously, and hopefully this will be useful for someone finding this results through a search engine.
With SQL Server 2008, a new function has been introduced, CRYPT_GEN_RANDOM(8), which uses CryptoAPI to produce a cryptographically strong random number, returned as VARBINARY(8000). Here's the documentation page: https://learn.microsoft.com/en-us/sql/t-sql/functions/crypt-gen-random-transact-sql
So to get a random number, you can simply call the function and cast it to the necessary type:
select CAST(CRYPT_GEN_RANDOM(8) AS bigint)
or to get a float between -1 and +1, you could do something like this:
select CAST(CRYPT_GEN_RANDOM(8) AS bigint) % 1000000000 / 1000000000.0
The Rand() function will generate the same random number, if used in a table SELECT query. Same applies if you use a seed to the Rand function. An alternative way to do it, is using this:
SELECT ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS [RandomNumber]
Got the information from here, which explains the problem very well.
Do you have an integer value in each row that you could pass as a seed to the RAND function?
To get an integer between 1 and 14 I believe this would work:
FLOOR( RAND(<yourseed>) * 14) + 1
If you need to preserve your seed so that it generates the "same" random data every time, you can do the following:
1. Create a view that returns select rand()
if object_id('cr_sample_randView') is not null
begin
drop view cr_sample_randView
end
go
create view cr_sample_randView
as
select rand() as random_number
go
2. Create a UDF that selects the value from the view.
if object_id('cr_sample_fnPerRowRand') is not null
begin
drop function cr_sample_fnPerRowRand
end
go
create function cr_sample_fnPerRowRand()
returns float
as
begin
declare #returnValue float
select #returnValue = random_number from cr_sample_randView
return #returnValue
end
go
3. Before selecting your data, seed the rand() function, and then use the UDF in your select statement.
select rand(200); -- see the rand() function
with cte(id) as
(select row_number() over(order by object_id) from sys.all_objects)
select
id,
dbo.cr_sample_fnPerRowRand()
from cte
where id <= 1000 -- limit the results to 1000 random numbers
select round(rand(checksum(newid()))*(10)+20,2)
Here the random number will come in between 20 and 30.
round will give two decimal place maximum.
If you want negative numbers you can do it with
select round(rand(checksum(newid()))*(10)-60,2)
Then the min value will be -60 and max will be -50.
try using a seed value in the RAND(seedInt). RAND() will only execute once per statement that is why you see the same number each time.
If you don't need it to be an integer, but any random unique identifier, you can use newid()
SELECT table_name, newid() magic_number
FROM information_schema.tables
You would need to call RAND() for each row. Here is a good example
https://web.archive.org/web/20090216200320/http://dotnet.org.za/calmyourself/archive/2007/04/13/sql-rand-trap-same-value-per-row.aspx
The problem I sometimes have with the selected "Answer" is that the distribution isn't always even. If you need a very even distribution of random 1 - 14 among lots of rows, you can do something like this (my database has 511 tables, so this works. If you have less rows than you do random number span, this does not work well):
SELECT table_name, ntile(14) over(order by newId()) randomNumber
FROM information_schema.tables
This kind of does the opposite of normal random solutions in the sense that it keeps the numbers sequenced and randomizes the other column.
Remember, I have 511 tables in my database (which is pertinent only b/c we're selecting from the information_schema). If I take the previous query and put it into a temp table #X, and then run this query on the resulting data:
select randomNumber, count(*) ct from #X
group by randomNumber
I get this result, showing me that my random number is VERY evenly distributed among the many rows:
It's as easy as:
DECLARE #rv FLOAT;
SELECT #rv = rand();
And this will put a random number between 0-99 into a table:
CREATE TABLE R
(
Number int
)
DECLARE #rv FLOAT;
SELECT #rv = rand();
INSERT INTO dbo.R
(Number)
values((#rv * 100));
SELECT * FROM R
select ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) as [Randomizer]
has always worked for me
Use newid()
select newid()
or possibly this
select binary_checksum(newid())
If you want to generate a random number between 1 and 14 inclusive.
SELECT CONVERT(int, RAND() * (14 - 1) + 1)
OR
SELECT ABS(CHECKSUM(NewId())) % (14 -1) + 1
DROP VIEW IF EXISTS vwGetNewNumber;
GO
Create View vwGetNewNumber
as
Select CAST(RAND(CHECKSUM(NEWID())) * 62 as INT) + 1 as NextID,
'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'as alpha_num;
---------------CTDE_GENERATE_PUBLIC_KEY -----------------
DROP FUNCTION IF EXISTS CTDE_GENERATE_PUBLIC_KEY;
GO
create function CTDE_GENERATE_PUBLIC_KEY()
RETURNS NVARCHAR(32)
AS
BEGIN
DECLARE #private_key NVARCHAR(32);
set #private_key = dbo.CTDE_GENERATE_32_BIT_KEY();
return #private_key;
END;
go
---------------CTDE_GENERATE_32_BIT_KEY -----------------
DROP FUNCTION IF EXISTS CTDE_GENERATE_32_BIT_KEY;
GO
CREATE function CTDE_GENERATE_32_BIT_KEY()
RETURNS NVARCHAR(32)
AS
BEGIN
DECLARE #public_key NVARCHAR(32);
DECLARE #alpha_num NVARCHAR(62);
DECLARE #start_index INT = 0;
DECLARE #i INT = 0;
select top 1 #alpha_num = alpha_num from vwGetNewNumber;
WHILE #i < 32
BEGIN
select top 1 #start_index = NextID from vwGetNewNumber;
set #public_key = concat (substring(#alpha_num,#start_index,1),#public_key);
set #i = #i + 1;
END;
return #public_key;
END;
select dbo.CTDE_GENERATE_PUBLIC_KEY() public_key;
Update my_table set my_field = CEILING((RAND(CAST(NEWID() AS varbinary)) * 10))
Number between 1 and 10.
Try this:
SELECT RAND(convert(varbinary, newid()))*(b-a)+a magic_number
Where a is the lower number and b is the upper number
If you need a specific number of random number you can use recursive CTE:
;WITH A AS (
SELECT 1 X, RAND() R
UNION ALL
SELECT X + 1, RAND(R*100000) --Change the seed
FROM A
WHERE X < 1000 --How many random numbers you need
)
SELECT
X
, RAND_BETWEEN_1_AND_14 = FLOOR(R * 14 + 1)
FROM A
OPTION (MAXRECURSION 0) --If you need more than 100 numbers

Split a query result based on the result count

I have a query based on basic criteria that will return X number of records on any given day.
I'm trying to check the result of the basic query then apply a percentage split to it based on the total of X and split it in 2 buckets. Each bucket will be a percentage of the total query result returned in X.
For example:
Query A returns 3500 records.
If the number of records returned from Query A is <= 3000, then split the 3500 records into a 40% / 60% split (1,400 / 2,100).
If the number of records returned from Query A is >=3001 and <=50,000 then split the records into a 10% / 90% split.Etc. Etc.
I want the actual records returned, and not just the math acting on the records that returns one row with a number in it (in the column).
I'm not sure how you want to display different parts of the resulting set of rows, so I've just added additional column(part) in the resulting set of rows that contains values 1 indicating that row belongs to the first part and 2 - second part.
select z.*
, case
when cnt_all <= 3000 and cnt <= 40
then 1
when (cnt_all between 3001 and 50000) and (cnt <= 10)
then 1
else 2
end part
from (select t.*
, 100*(count(col1) over(order by col1) / count(col1) over() )cnt
, count(col1) over() cnt_all
from split_rowset t
order by col1
) z
Demo #1 number of rows 3000.
Demo #2 number of rows 3500.
For better usability you can create a view using the query above and then query that view filtering by part column.
Demo #3 using of a view.