Select row with mostly higher value and rarely lower value - sql

I'm trying to select a random row from a table. The table has a column called Rate, and I want the query to return rows with a higher rate most of the time and only rarely return rows with a lower rate. Is this possible?
Table:
CREATE TABLE _Random (Code varchar(128), Rate tinyint);

So you want a random row, but weighted towards the ones with higher rates?
It would also help to know how many rows there are in the table, since sorting the whole lot is fairly expensive. You may prefer a row_number approach over sorting by N GUIDs.
So... one option is to generate a single number and then divide 100 by it. Imagine we generate a number between 0 and 1.
.25 gives us 400, .5 gives us 200, .75 gives us 133... Notice that there's a curve here, so the numbers closer to 100 come up more often (subtract 100 to make the range start at 0, or 99 to make it start at 1).
You could use RAND() for a single value between 0 and 1 (it's probably good enough), and then do the division and subtraction to get a row number. If this is higher than the count of records, then maybe repeat? But try to choose a numerator for your division that suits the size of your table.
If you need to weight it more, you could raise your RAND() value to some power, to flatten the curve out or steepen it up. Do some experimenting to see how it looks.
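A rough sketch of the divide-and-subtract idea in Python, assuming the rows are numbered from 0 in descending Rate order. The function name biased_row_index and the scale parameter are illustrative and not part of the answer above. It draws r in (0, 1], computes scale / r - scale, and draws again when the result falls past the last row:

```python
import random

def biased_row_index(row_count, scale=100.0, rng=random):
    """Return a row index in [0, row_count), favoring low indexes.

    Assumes index 0 is the row with the highest Rate.
    """
    while True:
        r = 1.0 - rng.random()          # in (0, 1], avoids division by zero
        index = int(scale / r - scale)  # 0 when r == 1, grows as r shrinks
        if index < row_count:           # past the last row: draw again
            return index
```

Before retries, the chance of getting an index below k is k / (scale + k), so a smaller scale steepens the curve and a larger one flattens it. With scale much larger than the row count, most draws are rejected, which is why the numerator should suit the table size.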

This query will fetch a random record that has an above-average rate:
SELECT TOP (1) * FROM _Random
WHERE Rate>(SELECT AVG(Rate) FROM _Random)
ORDER BY NEWID()

Related

Select Average of Top 25% of Values in SQL

I'm currently writing a stored procedure for my client to populate some tables that will be used to generate SSRS reports later on. Some of the data is based on specific stock formulas that are run on each of their clients' quarterly data (sent to them by their clients). The other part of the data is generated by comparing those results against those from other, similar sized clients. One of the things that they want tracked in their reports is the average of the top 25% of formula results for that particular comparison group.
To give a better picture of it, imagine the following fields that I have in a temp table:
FormulaID int
Value decimal (18,6)
I want to do the following: Given a specific FormulaID return the average of the top 25% of Value.
I know how to take an average in SQL, but I don't know how to do it against only the top 25% of a specific group.
How would I write this query?
I guess you can do something like this...
SELECT AVG(Q.ColA) Avg25Prec
FROM (
    SELECT TOP 25 PERCENT ColA
    FROM Table_Name
    ORDER BY SomeColumn
) Q
Here's what I did, given the table shown above:
select AVG(t.Value)
from (select top 25 percent Value
      from #TempGroupTable
      where FormulaID = @PassedInFormulaID
      order by Value desc) as t
The desc must be there, because top ... percent does not compare values itself. It simply grabs the first x records, with x being equal to 25% of the count of records it's querying. The order by Value desc therefore makes it grab the 25% of records with the highest Value, and those are then averaged.
As a side note, this also means that if you wanted to grab the bottom 25% instead, or if your formula results are like a golf score (i.e. lowest is best), all you would need to do is remove the desc and you would be good to go.
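To make the sort-then-take-the-first-quarter logic concrete, here is a small Python sketch of the same calculation. The function name and parameters are illustrative only. It rounds the row count up, which is how SQL Server's TOP ... PERCENT treats a fractional row count:

```python
import math

def average_of_top_percent(values, percent=25, highest_first=True):
    """Average the first `percent` percent of values after sorting."""
    ordered = sorted(values, reverse=highest_first)
    # Round the row count up to the next whole row.
    keep = math.ceil(len(ordered) * percent / 100)
    if keep == 0:
        return None                     # nothing to average
    top = ordered[:keep]
    return sum(top) / len(top)
```

Passing highest_first=False corresponds to removing desc from the query, which averages the bottom 25% instead.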

Select case mess in JFreeChart

I have a numeric column (cliente_x_hora) that I split into intervals, counting the rows that fall in each interval. I have 3 text fields (number of intervals, width of each interval, and initial value). When I fill in only the first two (5 intervals, width 1000), the query runs flawlessly and generates the expected bar chart.
Query (with the first two text fields filled in):
SELECT INTERVAL, COUNT(*) TOTAL FROM (
SELECT CASE WHEN CLIENTE_X_HORA>0 AND CLIENTE_X_HORA<=1000.00 THEN '0<CLIENTE_X_HORA> <=1000.00'
WHEN CLIENTE_X_HORA>1000.00 AND CLIENTE_X_HORA<=2000.00 THEN '1000.00<CLIENTE_X_HORA><=2000.00'
WHEN CLIENTE_X_HORA>2000.00 AND CLIENTE_X_HORA<=3000.00 THEN '2000.00<CLIENTE_X_HORA><=3000.00'
WHEN CLIENTE_X_HORA>3000.00 AND CLIENTE_X_HORA<=4000.00 THEN '3000.00<CLIENTE_X_HORA><=4000.00'
ELSE '4000.00<CLIENTE_X_HORA' END INTERVAL, CLIENTE_X_HORA FROM SGD_CAUSA)
GROUP BY INTERVAL ORDER BY TOTAL
The barchart is
The problem is when I also fill in the last field (initial value of, for example, 2000): my bar chart goes crazy (I believe it is adding up the discarded values below 2000):
That ELSE bar (>6000) should be much smaller than what is showing. How can I solve that?
Best Regards,
DDias
CLARIFICATION from OP:
The query is the same as above but begins in 2000:
SELECT CASE WHEN CLIENTE_X_HORA>2000 AND CLIENTE_X_HORA<=3000.00... and ends in 6000:ELSE '6000.00<CLIENTE_X_HORA' END INTERVAL, CLIENTE_X_HORA FROM SGD_CAUSA) GROUP BY INTERVAL ORDER BY TOTAL
Putting the result in table form is impractical (we are talking about over 87 thousand rows). This always happens when I give an initial value other than zero.
Your ELSE is just that. It includes everything that is not matched by the specific WHENs.
So if you do not start from zero, that last column will include everything below the lowest limit in addition to everything greater than the highest limit.
So if you do not want this behavior, do not use ELSE at all. Use WHEN CLIENTE_X_HORA > 6000.00 (or whatever your highest limit is) as the last condition.
EDIT:
In your internal query filter out (with WHERE) the values that are below the lowest limit.
Since we no longer have unneeded low range, you no longer need the HAVING clause we added and you can even go back to using ELSE.
If your lowest limit is zero, then you will be filtering everything below 0, which I assume is nothing.
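A minimal way to see the effect, sketched in Python with an in-memory SQLite database. The table and column names follow the question, but the bucket boundaries, sample values, and the helper bucket_counts are made up for illustration. Without the WHERE filter, values below 2000 land in the ELSE bucket; with it, they are dropped before bucketing:

```python
import sqlite3

BUCKET_SQL = """
SELECT bucket, COUNT(*) AS total FROM (
    SELECT CASE
        WHEN cliente_x_hora > 2000 AND cliente_x_hora <= 3000 THEN '2000-3000'
        WHEN cliente_x_hora > 3000 AND cliente_x_hora <= 4000 THEN '3000-4000'
        ELSE '>4000'
    END AS bucket
    FROM sgd_causa
    {where}
) GROUP BY bucket ORDER BY bucket
"""

def bucket_counts(values, filter_low):
    """Return {bucket label: row count}, optionally filtering out low values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sgd_causa (cliente_x_hora REAL)")
    conn.executemany("INSERT INTO sgd_causa VALUES (?)", [(v,) for v in values])
    # The filter removes everything at or below the lowest limit.
    where = "WHERE cliente_x_hora > 2000" if filter_low else ""
    rows = conn.execute(BUCKET_SQL.format(where=where)).fetchall()
    conn.close()
    return dict(rows)
```

For the sample values 500, 1500, 2500, 3500, 4500, the unfiltered query counts three rows in the '>4000' bucket, two of which are really below 2000; the filtered query counts one.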

how do I pull out a random record from an SQL table?

I have an SQL table which has two integers. Let these integers be a and b.
I want to SELECT out a random record, such that the record is selected with probability proportional to C + a/b for some constant C which I will choose.
So for example, if C = 0, and there are two records with a=1,b=2 and a=2,b=3, then for the first record C+a/b = 1/2 and for the second record C+a/b = 2/3. The weights are in the ratio 3:4, so I should choose the first record with probability 3/7 (about 0.43) and the second record with probability 4/7 (about 0.57) from that SELECT query.
I know SQL well (I thought), but I am not even sure where to begin here. I thought of doing a select for "SUM(a/b)" first, and then selecting the first record at which the running sum of C+a/b exceeds a random number between 0 and C*number_of_records + SUM(a/b). But I don't really know how to do that.
You could do something like sorting by a random number multiplied by your other stuff, and just select top 1 from that query - something like:
SELECT TOP 1 (your column names)
FROM (your table)
ORDER BY Rand() * (your calculation) DESC
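The running-sum idea described in the question can be sketched outside SQL as follows. This is only an illustration in Python: the function name pick_proportional is hypothetical, and it assumes every weight C + a/b is positive. The caller supplies a uniform random number u in [0, 1), for example from random.random():

```python
def pick_proportional(records, c, u):
    """Pick an (a, b) record with probability proportional to c + a/b.

    `u` is a uniform random number in [0, 1).
    Assumes every weight c + a/b is positive.
    """
    weights = [c + a / b for a, b in records]
    target = u * sum(weights)       # a point somewhere along the total weight
    running = 0.0
    for record, weight in zip(records, weights):
        running += weight
        if target < running:        # first record whose running total passes it
            return record
    return records[-1]              # guards against floating-point round-off
```

With C = 0 and the records (1, 2) and (2, 3), values of u below 3/7 select the first record and the rest select the second, which gives the 3:4 split.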

Biased random in SQL?

I have some entries in my database, in my case videos with a rating, a popularity, and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated so that it comes out as an integer representing how often this entry should be hit relative to the others.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2, and 7 times as many on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). The probability for ID 3 in this example should be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work, after considering the following entries in the table:
ID Boost
1 1
2 2
With a 50/50 chance, it will have either only the 2nd element or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and 0.5 go to the set (second and first), from which one is chosen randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively so that I can just do a range query over (0 to sum()), but then I would have to re-index everything that comes after an item whenever I change it, or develop some swapping algorithm... and that's really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "Please choose a random ad with a given probability". I need it for another purpose, but that should give you a final picture of what it should do.
edit:
Thanks to ken's answer I thought about the following approach:
calculate a random value from 0 to sum(distinct boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor, from all distinct boost factors, whose running total first surpasses the random value
then in our 1st example we have 1 with a 0.1, 2 with a 0.2, and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: the number of entries sharing one boost value always differs. For example, if there is only one entry with boost 1, I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( I'm trying to refine it.
I have to somehow include the count of entries with each boost factor, but I am stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
    RAND(CHECKSUM(NEWID())) * boost AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select the sum of boosts, and generate a number between 1 and that sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored in a variable; let's call it {random_number}.
Then select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:
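SET @cumulative_boost = 0;
SELECT id, cumulative_boost
FROM (
    -- running total of boost, accumulated in id order
    SELECT
        id,
        @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
    FROM
        table
    ORDER BY id
) AS running
WHERE
    cumulative_boost >= {random_number}
ORDER BY cumulative_boost
LIMIT 1;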
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, you would have a higher chance to win "the lottery".
Since I didn't trust any of the solutions I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did would, in your case, look a little bit like this:
SELECT `values`.id
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN (
    SELECT id % 100 + 1 AS counter
    FROM user
    GROUP BY counter) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum number of boost is less than the maximum returned in the number query
That the inner query returns ALL numbers between 1..100. It might not depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that a row with a boost of 2 ends up duplicated in the final result. A row with a boost of 100 ends up in the final set 100 times. In other words, if the sum of boosts is 4212, which it was in my case, you have 4212 rows in the final set.
Finally I let MySQL sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you could create a temporary table that simply holds all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost
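The ticket idea can be illustrated in a few lines of Python. This is only a sketch with hypothetical helper names, assuming boosts are non-negative integers: each id is repeated boost times, and one ticket is then drawn uniformly, which is what the join plus ORDER BY RAND() achieves.

```python
import random

def expand_tickets(rows):
    """rows: iterable of (id, boost) pairs; returns each id repeated boost times."""
    return [row_id for row_id, boost in rows for _ in range(boost)]

def draw_winner(rows, rng=random):
    """Draw one id uniformly from the expanded ticket list."""
    tickets = expand_tickets(rows)
    return rng.choice(tickets)
```

The expanded list has as many entries as the sum of all boosts, so this trades memory and sorting work for simplicity, which matches the note above about not running it often.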

How to add "weights" to a MySQL table and select random values according to these?

I want to create a table in which each row has some sort of weight. Then I want to select random rows with probability equal to (weight of that row)/(sum of the weights of all rows). For example, with 5 rows with weights 1,2,3,4,5, out of 1000 selections I'd get the first row approximately 1/15*1000=67 times, and so on.
The table is to be filled manually. Then I'll take a random value from it. But I want to be able to change the probabilities at the filling stage.
I found this nice little algorithm in Quod Libet. You could probably translate it to some procedural SQL.
function WeightedShuffle(list of items with weights):
    max_score ← the sum of every item’s weight
    choice ← random number in the range [0, max_score)
    current ← 0
    for each item (i, weight) in items:
        current ← current + weight
        if current > choice or i is the last item:
            return item i
The easiest (and maybe best/safest?) way to do this is to add each row to the table as many times as its weight. Say I want "Tree" to be found 2x more often than "Dog": I insert "Tree" 2 times into the table and "Dog" once, and then just select elements at random one by one.
If the rows are complex or big, it would be best to create a separate table (weighted_Elements or something) that just holds foreign keys to the real rows, inserted as many times as the weights dictate.
The best possible scenario (if I understand your question properly) is to set up your table as you normally would and then add two INT columns.
Column 1: Weight - This column would hold your weight value, going from -X to +X, X being the highest value you want to have as a weight (e.g. X=100 gives -100 to 100). This value gives the row an actual weight and increases or decreases the probability of it coming up.
Column 2: Count - This column would hold the count of how many times this row has come up. It is needed only if you want to use fair weighting. Fair weighting prevents one row from always showing up. (E.g. if you have one row weighted at 100 and another at 2, the row with 100 will always show up; this column allows weight 2 to become more 'valuable' as you get more weight 100 results.) This column should be incremented by 1 each time the row is pulled, but you can make the logic more advanced later so that it adds the weight etc.
Logic: It's really simple now. Your query requests all rows as you normally would, then adds an extra select expression that (you can change the logic here to whatever you want) takes the weight, subtracts the count, and orders by that value.
The end result should be a table where the higher weights appear more often up to a certain point, after which the system distributes results more evenly. If you leave out column 2, the system will always return the same weighted order unless you offset the base of the query (i.e. LIMIT [RANDOM NUMBER], [NUMBER OF ROWS TO RETURN]).
I'm not an expert in probability theory, but assuming you have a column called WEIGHT, how about
select FIELD_1, ... FIELD_N, (rand() * WEIGHT) as SCORE
from YOURTABLE
order by SCORE desc
limit 0, 10
This would give you the 10 records with the highest scores, but you can change the limit clause, of course.
This is a weighted random sampling problem, and it can be solved with weighted Reservoir Sampling (https://en.wikipedia.org/wiki/Reservoir_sampling).
The A-Res algorithm is easy to implement in SQL:
SELECT *
FROM table
ORDER BY pow(rand(), 1 / weight) DESC
LIMIT 10;
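As a rough check of how the sort key matters, the following Python sketch compares the A-Res key rand() ** (1 / weight) with the simpler key rand() * weight when picking a single row. The function names and the frequency helper are illustrative only, and weights are assumed to be positive:

```python
import random

def pick_a_res(weights, rng=random):
    """weights: dict of id -> positive weight. Largest u ** (1 / w) wins."""
    return max(weights, key=lambda i: rng.random() ** (1.0 / weights[i]))

def pick_scaled(weights, rng=random):
    """Same selection, but with the key u * w instead."""
    return max(weights, key=lambda i: rng.random() * weights[i])

def frequency(pick, weights, target, trials, rng):
    """Fraction of trials in which `pick` returns `target`."""
    hits = sum(1 for _ in range(trials) if pick(weights, rng) == target)
    return hits / trials
```

For two rows with weights 1 and 2, the A-Res key selects the weight-1 row with probability 1/3, which is proportional to its weight, while rand() * weight selects it with probability 1/4. Running frequency with a few tens of thousands of trials should land close to those values.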
I came looking for the answer to the same question and decided to come up with this:
id weight
1 5
2 1
SELECT * FROM table ORDER BY RAND()/weight
It's not exact, but it uses randomness, so I shouldn't expect it to be exact. I ran it 70 times and got number 2 first 10 times. I would have expected 1/6 but I got 1/7. I'd say that's pretty close. I'd have to run a script to do it a few thousand times to get a really good idea of whether it's working.