The table tests contains data on power tests of a certain type of rocket engine. The values in the following columns mean:
ID - test number
SN - engine serial number
T - test duration in seconds
M1, M2, M3 - power values recorded at the beginning, halfway point, and end of the test.
Your task is to display, for each engine, the test in which it achieved the highest average of its three power measurements. However, you must take into account the following limitations:
1. We are only interested in those engines that have participated in at least 5 tests.
2. Of the tests for engines that meet condition 1, we are only interested in those that lasted at least a minute and where the lowest of the three measurements was not less than 90% of the highest of the three measurements.
3. We are only interested in those engines for which the tests selected under point 2 meet the following condition: the lowest per-test average of the three measurements for a given engine is not less than 85% of the highest per-test average among those tests.
The resulting table should contain four columns:
ID
SN
MAX - containing the highest value of the three measurements in the given test
MAX_AVG - containing the highest average of measurements from all tests for a given engine (taking into account the conditions described above). Round this value to two decimal places.
Sort the results from the highest to the lowest average.
I am a student and my professor gave me this assignment. Below I am sending my idea for solving the problem.
SELECT
ID, SN, MAX(M1, M2, M3) AS MAX, ROUND(MAX((M1 + M2 + M3)/3.0),2) AS MAX_AVG
FROM
tests
WHERE
T >= 60
AND MIN(M1, M2, M3) >= MAX(M1, M2, M3) * 0.9
GROUP BY
SN
HAVING
COUNT(SN) >= 5
AND MIN((M1 + M2 + M3)/3.0) >= MAX((M1 + M2 + M3)/3.0) * 0.85
ORDER BY
MAX_AVG ASC;
You do not say what your difficulty is. But you did include your attempt, and made an effort.
However, you have a condition backwards, so that might be it.
Say that you have an engine that participated in 5 tests, but the fifth only lasted 55 seconds. Condition 1 says that you must include that engine, because it did participate in five tests:
1. We are only interested in those engines that have participated in at least 5 tests.
Of course, for that engine, you only want the first four tests:
2. Of the tests for engines that meet condition 1, we are only interested in those that lasted at least a minute.
But your WHERE excludes the fifth test from the start, so that leaves four, and the HAVING excludes that engine.
You probably want to try a sub-select (or a CTE) to handle this case.
WITH
test_level_summary AS
(
SELECT
ID,
SN,
T,
MIN(M1, M2, M3) AS MIN_M,
MAX(M1, M2, M3) AS MAX_M,
ROUND((M1 + M2 + M3)/3.0, 2) AS AVG_M,
COUNT(*) OVER (PARTITION BY SN) AS SN_TEST_COUNT
FROM
tests
),
valid_engines AS
(
SELECT
*,
MIN(AVG_M) OVER (PARTITION BY SN) AS MIN_AVG_M,
MAX(AVG_M) OVER (PARTITION BY SN) AS MAX_AVG_M
FROM
test_level_summary
WHERE
SN_TEST_COUNT >= 5
AND T >= 60
AND MIN_M >= MAX_M * 0.9
)
SELECT
ID, SN, MAX_M, AVG_M
FROM
valid_engines
WHERE
MIN_AVG_M >= MAX_AVG_M * 0.85
AND AVG_M = MAX_AVG_M
ORDER BY
AVG_M DESC;
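One portability note on the query above: the scalar MIN(M1, M2, M3) / MAX(M1, M2, M3) calls are SQLite syntax. On engines such as PostgreSQL or MySQL the per-row equivalents are LEAST and GREATEST, so the first CTE would look roughly like this (just a sketch, assuming the same tests table):
SELECT
ID,
SN,
T,
LEAST(M1, M2, M3) AS MIN_M,
GREATEST(M1, M2, M3) AS MAX_M,
ROUND((M1 + M2 + M3)/3.0, 2) AS AVG_M,
COUNT(*) OVER (PARTITION BY SN) AS SN_TEST_COUNT
FROM
tests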
I have data on five experiments, which vary slightly, and I would like to find a way of standardizing results between experiments to develop a “standard result.” This will be a value I can multiply each result by to make them comparable.
The way I have gone about this is to use the individuals who appear in more than one experimental group (assuming that being in more than one group does not affect results).
I am assuming that the results of these individuals differ between experiments only because of the slight differences between the experiments. So, by calculating for each experiment a conversion factor that makes the averages of these shared individuals' results the same, I can translate any result into a standard result.
The problem I am experiencing is that at the end of the process most of my conversion factors are less than 1. However, I was expecting roughly an even split of values greater than 1 and less than 1, since I am effectively averaging out the results.
My procedure and MS SQL code are below:
Get data by experiment and by individual, and exclude all individuals whose data is in only 1 experiment.
SELECT ExperimentID,
IndividualID
INTO Z1_IndivByExperiment
FROM Results
GROUP BY ExperimentID,
IndividualID
SELECT IndividualID,
COUNT(ExperimentID) AS ExperimentCount
INTO Z2_MultiExperIndiv
FROM Z1_IndivByExperiment
GROUP BY IndividualID
HAVING COUNT(ExperimentID) > 1
ORDER BY ExperimentCount DESC
Create two tables:
i) Summarise results by individual (an individual can have multiple results per experiment) and experiment.
ii) Summarise results by individual.
SELECT ExperimentID,
IndividualID,
SUM(Results.Result) AS Result_sum,
SUM(Results.ResultsCount) AS ResultCount_sum
INTO Z3_MultiExperIndiv_Results
FROM Results INNER JOIN
Z2_MultiExperIndiv ON Results.IndividualID =
Z2_MultiExperIndiv.IndividualID
GROUP BY ExperimentID,
IndividualID
SELECT 'Standard' AS Experiment,
IndividualID,
SUM(Result_sum) AS ResultIndiv_sum,
SUM(ResultCount_sum) AS ResultCountIndiv_sum
into Z4_MultiExperIndiv_Stand
FROM Z3_MultiExperIndiv_Results
GROUP BY IndividualID
Link the two tables created in step 2 by individual, summing results from table 1 and table 2 by experiment. I am hoping this provides two sets of results per experiment: results for individuals on the experiment in question, and the results those same individuals produced in other experiments.
SELECT Z4_MultiExperIndiv_Stand.Experiment AS ExperimentID1,
Z3_MultiExperIndiv_Results.ExperimentID AS ExperimentID2,
SUM(Z4_MultiExperIndiv_Stand.ResultIndiv_sum -
Z3_MultiExperIndiv_Results.Result_sum) AS Results1,
SUM(Z4_MultiExperIndiv_Stand.ResultCountIndiv_sum -
Z3_MultiExperIndiv_Results.ResultCount_sum) AS ResultCount1,
SUM(Z3_MultiExperIndiv_Results.Result_sum) AS Results2,
SUM(Z3_MultiExperIndiv_Results.ResultCount_sum) AS ResultCount2
into Z5_StandardConversion_data
FROM Z3_MultiExperIndiv_Results INNER JOIN
Z4_MultiExperIndiv_Stand ON Z3_MultiExperIndiv_Results.IndividualID =
Z4_MultiExperIndiv_Stand.IndividualID
GROUP BY Z4_MultiExperIndiv_Stand.Experiment,
Z3_MultiExperIndiv_Results.ExperimentID
Then I divide each set of results by its number of results to get averages, and divide one average by the other to get my conversion-to-standard number for each experiment.
SELECT ExperimentID1,
ExperimentID2,
Results1,
ResultCount1,
Results2,
ResultCount2,
Results1 / ResultCount1 AS Result1_avg,
Results2 / ResultCount2 AS Result2_avg,
(ResultCount2 * Results1) / (ResultCount1 * Results2) AS Conversion
FROM Z5_StandardConversion_data
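To make the conversion arithmetic concrete with made-up numbers (not from your data): suppose that for experiment E the shared individuals have Results1 = 500 over ResultCount1 = 10 results outside E (average 50) and Results2 = 400 over ResultCount2 = 10 results inside E (average 40). Then Conversion = (ResultCount2 * Results1) / (ResultCount1 * Results2) = (10 * 500) / (10 * 400) = 1.25, i.e. results from experiment E would be scaled up by 25% to reach the standard.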
The use case is that I have a table products and a table user_match_product. For a specific user, I want to select X random products for which that user has no match.
The naive way to do that would be something like
SELECT * FROM products WHERE id NOT IN (SELECT p_id FROM user_match_product WHERE u_id = 123) ORDER BY random() LIMIT X
but that will become a performance bottleneck when having millions of rows.
I thought of some possible solutions which I will present here now. I would love to hear about your solutions for that problem or suggestions regarding my solutions.
Solution 1: Trust the randomness
Based on the fact that the product ids are monotonically increasing, one could optimistically generate X*C random numbers R_i (i between 1 and X*C) in the range [min_id, max_id] and hope that a select like the following returns at least X elements.
SELECT * FROM products p1 WHERE p1.id IN (R_1, R_2, ..., R_XC) AND NOT EXISTS (SELECT * FROM user_match_product WHERE u_id = 123 AND p_id = p1.id) LIMIT X
Advantages
If the random number generator is good, this will probably work very well within O(1)
Old and newly added products have the same probability of being chosen
Disadvantages
If the number of matches is close to the number of products, the collision probability might be very high.
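For reference, here is a rough sketch of how Solution 1 could be expressed entirely in SQL on PostgreSQL (which the random() above suggests, but treat the dialect as an assumption); the 50 and 10 below are arbitrary placeholders for X*C and X:
WITH bounds AS (
    SELECT min(id) AS min_id, max(id) AS max_id FROM products
),
candidates AS (
    -- X*C candidate ids drawn uniformly from [min_id, max_id]; 50 stands in for X*C
    SELECT DISTINCT (min_id + floor(random() * (max_id - min_id + 1)))::int AS cand_id
    FROM bounds, generate_series(1, 50)
)
SELECT p.*
FROM products p
JOIN candidates c ON c.cand_id = p.id
WHERE NOT EXISTS (
    SELECT 1 FROM user_match_product ump
    WHERE ump.u_id = 123 AND ump.p_id = p.id
)
LIMIT 10;  -- 10 stands in for X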
Solution 2: Block-wise PRNG
One could create a permutation function permutate(seed, start, end, value) for the domain [START, END] that uses a seed for randomness. At time t0 a user A has 0 matched products and observes that E0 products exist. The first block for the user A at t0 is for the domain [1, E0]. The user remembers a counter C which initially is 0.
To select X products the user A first has to create the permutations P_i like
P_i = permutate(seed, START, END, C + i)
The following has to hold for the function.
permutate(seed, start, end, value) is element of [start, end]
value is element of [start, end]
The following query will return X non-repeating elements.
SELECT * FROM products WHERE id IN (P_1, ..., P_X)
When C reaches END, the next block is allocated by using END + 1 as the new START, the current count of products E1 as new END. The seed and C stay the same.
Advantages
No collisions possible
Guaranteed O(1)
Disadvantages
The current block has to be finished before new products can be selected
I'd go with approach #1.
You can get a first estimate of C by counting the user's rows in user_match_product (assumed unique). If the user already possesses half the possible products, selecting twice the number of random products seems a good heuristic.
You can also have a last-ditch correction that verifies that the number of extracted products is actually X. If it was, say, X/3, you'd need to run the same extraction two more times (avoiding already-generated random product IDs), and increase that user's C constant by a factor of three.
Also, knowing what the range of product IDs is, you could select random numbers in that range that do not appear in user_match_product (i.e. your first stage query is only against user_match_product) which is bound to have a (much?) lower cardinality than products. Then, those IDs that pass the test can be safely selected from products.
If you want to choose X products that the user doesn't have, the first thing that comes to mind is to enumerate the products and use order by rand() (or the equivalent, depending on the database). This is your first solution:
SELECT p.*
FROM products p
WHERE NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id and ump.u_id = 123)
ORDER BY random()
LIMIT X;
A simple way to make this more efficient is to choose an arbitrary subset. You can actually do this using random() as well, but in the where clause:
SELECT p.*
FROM products p
WHERE random() < Y AND
NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id and ump.u_id = 123)
ORDER BY random()
LIMIT X;
The question is: what is "Y"? Well, let's say the number of products is P and the user has U. Then, if we choose a random set of (X + U) products, we can definitely get X products the user does not have. This suggests that the expression random() < (X + U) / P would be sufficient. Alas, the vagaries of random numbers say that sometimes we would get enough and sometimes not enough. Let's add a factor such as 3 to be safe. This is actually really, really, really, really safe for most values of X, U, and P.
The idea is a query such as this:
SELECT p.*
FROM Products p CROSS JOIN
(SELECT COUNT(*) as p FROM Products) v1 CROSS JOIN
(SELECT COUNT(*) as u FROM User_Match_Product WHERE u_id = 123) v2
WHERE random() < 3.0 * (u + x) / p AND
NOT EXISTS (SELECT 1 FROM User_Match_Product ump WHERE ump.p_id = p.id and ump.u_id = 123)
ORDER BY random()
LIMIT X;
Note that these calculations require a small amount of time with appropriate indexes on Products and User_Match_Product.
So, say you have 1,000,000 products, a typical user has 20, and you want to recommend 10 more. Then the expression (20 + 10)*3/1000000 --> 90/1000000. This query will scan the products table, pull out roughly 90 rows at random, sort them, and choose the appropriate 10 rows. Sorting 90 rows is, essentially, constant time relative to the original operation.
For many purposes, the cost of the table scan is acceptable. It sure beats the cost of sorting all the data, for instance.
The alternative approach is to load all products for a user into the application. Then pull a random product out and compare to the list:
select p.id
from Products p cross join
(select min(id) as minid, max(id) as maxid from Products) v1
where p.id >= minid + random() * (maxid - minid)
order by p.id
limit 1;
(Note the calculation can be done outside the query so you can just plug in a constant.)
Many query optimizers will resolve this query constant time by doing an index scan. You can then check in the application whether the user has the product already. This would then run about X times for the user, providing O(1) performance. However, this has rather bad worst case performance: if there are not X available products, it will run indefinitely. Of course, additional logic can fix this problem.
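As a minimal illustration of that parenthetical note, the per-pick query with the offset precomputed in the application might simply be (4711 is a made-up constant standing in for minid + random() * (maxid - minid)):
select p.id
from Products p
where p.id >= 4711
order by p.id
limit 1;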
The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::point datatype with optimally tesselated index.
The user wants the five closest guitar teachers to their location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers: if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika, there might not be five teachers of such instruments within 50 miles of their location.
Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.* from
(
select STUDENTSANDTEACHERSINRADIUS.*,
rowpos = row_number()
over(partition by
STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress
order by DistanceInMiles)
from
(
select
SINGER.name,
SINGER.streetaddress,
SINGER.city,
SINGER.state,
SINGER.zipcode,
TEACHERS.name as TEACHERname,
TEACHERS.streetaddress as TEACHERaddress,
TEACHERS.city as TEACHERcity,
TEACHERS.state as TEACHERstate,
TEACHERS.zipcode as TEACHERzip,
TEACHERS.teacherid,
geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
/ (1.6 * 1000) as DistanceInMiles
from
SINGER left join TEACHERS
on
( TEACHERS.geolocation).STDistance( geography::Point(SINGER.lat, SINGER.lon, 4326))
< (SINGER.radius * (1.6 * 1000 ))
and TEACHERS.primaryInstrument='voice'
) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us
I think that if you just need to get the closest 5 teachers regardless of radius, you could write something like this. Each student will be duplicated up to 5 times in this query; I don't know whether that is what you want to get.
select
S.name,
S.streetaddress,
S.city,
S.state,
S.zipcode,
T.name as TEACHERname,
T.streetaddress as TEACHERaddress,
T.city as TEACHERcity,
T.state as TEACHERstate,
T.zipcode as TEACHERzip,
T.teacherid,
T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
/ (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T
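If you also want to keep the question's 100-mile cap (so a student with no voice coach within 100 miles simply gets fewer than 5 rows back), a sketch would be to add a distance predicate inside the APPLY above, reusing the same 1.6 * 1000 metres-per-mile factor used earlier:
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
and TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) < (100 * 1.6 * 1000)  -- 100-mile cap from the question
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T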
Given a table of responses with columns:
Username, LessonNumber, QuestionNumber, Response, Score, Timestamp
How would I run a query that returns which users got a score of 90 or better on their first attempt at every question in their last 5 lessons? "Last 5 lessons" is a limiting condition, rather than a requirement, so if they completed only 1 lesson but got all of their first attempts for each question right, then they should be included in the results. We just don't want to look back farther than 5 lessons.
About the data: Users may be on different lessons. Some users may have not yet completed five lessons (may only be on lesson 3 for example). Each lesson has a different number of questions. Users have different lesson paths, so they may skip some lesson numbers or even complete lessons out of sequence.
Since this seems to be a problem of transforming temporally non-uniform/discontinuous values into uniform/contiguous values per-user, I think I can solve the bulk of the problem with a couple ranking function calls. The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions completed is variable per-user.
So far...
As a starting point or hint at what may need to happen, I've transformed Timestamp into an "AttemptNumber" for each question, by using "row_number() over (partition by Username,LessonNumber,QuestionNumber order by Timestamp) as AttemptNumber".
I'm also trying to transform LessonNumber from an absolute value into a contiguous ranked value for individual users. I could use "dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank", but that assumes the order lessons are completed corresponds with the order of LessonNumber, which is unfortunately not always the case. However, let's assume that this is the case, since I do have a way of producing such a number through a couple of joins, so I can use the dense_rank transform described to select the "last 5 completed lessons" (i.e. LessonRank <= 5).
For the >= 90 condition, I think I can transform the score into an integer so that it's "1" if >= 90 and "0" if < 90. I can then introduce a clause like "group by Username having SUM(Score) = COUNT(Score)", which will select only those users with all scores equal to 1.
Any solutions or suggestions would be appreciated.
You kind of gave away the solution:
SELECT DISTINCT Username
FROM Results
WHERE Username NOT in (
SELECT DISTINCT Username
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1 and Score < 90
)
Concerning the LessonRank, I used exactly what you described, since it is not clear how to order the lessons otherwise: the timestamp of the first attempt of the first question of a lesson? Or the timestamp of the first attempt of any question of a lesson? Or simply the first (or the most recent?) timestamp of any result of any question of a lesson?
The innermost Select adds all the AttemptNumber and LessonRank as provided by you.
The next Select retains only the results that would disqualify a user from being in the final list: all first attempts with an insufficient score in the last 5 lessons. We end up with a list of users we do not want to display in the final result.
Therefore, in the outermost Select, we can select all the users which are not in the exclusion list. Basically all the other users which have answered any question.
EDIT: As so often, second try should be better...
One more EDIT:
Here's a version including your remarks in the comments.
SELECT Username
FROM
(
SELECT Username, CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1
) as ff
Group BY Username
HAVING MIN(QuestionScoredWell) = 1
I used a Having clause with a MIN expression on the calculated QuestionScoredWell value.
When comparing the execution plans for both queries, this query is actually faster. Not sure though whether this is partially due to the low number of data rows in my table.
Random suggestions:
1
The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions is variable per-user.
is equivalent to
There exists no first attempt with a score below 90 in the most recent 5 lessons
which strikes me as a little easier to grab with a NOT EXISTS subquery.
2
First attempt is the same as where timestamp = (select min(timestamp) ... )
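Putting those two suggestions together, a sketch (assuming a SQL Server-style dialect, since the other answers here use TOP and window functions) might be:
SELECT DISTINCT r.Username
FROM Results r
WHERE NOT EXISTS (
    SELECT 1
    FROM Results bad
    WHERE bad.Username = r.Username
      AND bad.Score < 90
      -- first attempt at that question (suggestion 2)
      AND bad.Timestamp = (SELECT MIN(r2.Timestamp)
                           FROM Results r2
                           WHERE r2.Username = bad.Username
                             AND r2.LessonNumber = bad.LessonNumber
                             AND r2.QuestionNumber = bad.QuestionNumber)
      -- restricted to the user's most recent 5 lessons, ordered by LessonNumber as in the answer above
      AND bad.LessonNumber IN (SELECT DISTINCT TOP 5 r3.LessonNumber
                               FROM Results r3
                               WHERE r3.Username = bad.Username
                               ORDER BY r3.LessonNumber DESC)
)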
You need to identify the top 5 lessons per user first, using the timestamp to prioritize lessons, then you can limit by score. Try:
Select t.username
from table t inner join
(select top 5 username, lessonNumber
from table
order by timestamp desc) l
on t.username = l.username and t.lessonNumber = l.lessonNumber
where t.score >= 90
I have some entries in my database, in my case videos, with a rating, a popularity, and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated in such a way that it comes out as an integer representing the percentage of how often this entry should be hit, in comparison to the others.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2, and 7 times as many on ID 3.
So every hit should be random, but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work; consider the following entries in the table:
ID Boost
1 1
2 2
With a 50/50 chance, it will have either only the 2nd element or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and the other 0.5 is split randomly between the second and first elements, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively, so I could just do a range query over (0, sum()), but then I would have to re-index everything that comes after an item whenever I change it, or develop some swapping algorithm or something... and that's really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "Please choose a random ad with the given probability." I need it for another purpose, but that should give you a final picture of what it should do.
Edit:
Thanks to Ken's answer I thought about the following approach:
calculate a random value from 0 to sum(distinct boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor at which the running total of distinct boost factors first surpasses the random value
then we have, in our first example, 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( I'm trying to refine it.
I have to somehow include the count of entries with this boost factor... but I am somewhat stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
@sample
GROUP BY
id, boost
ORDER BY
weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select the sum of boosts and generate a random number between 0 and that sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored in a variable; let's call it {random_number}.
Then select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:
SET @cumulative_boost = 0;
SELECT id
FROM (
SELECT
id,
@cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
table
ORDER BY id
) AS t
WHERE
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, you had a higher chance of winning "the lottery".
Since I didn't trust any of the results I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this:
SELECT `values`.id, `values`.boost
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
That the maximum boost value is less than the maximum number returned by the numbers query
That the inner query returns ALL numbers between 1..100. It might not, depending on your table!
Since I have all distinct numbers between 1 and 100, joining on numbers.counter <= values.boost means that a row with a boost of 2 ends up twice in the final result, and a row with a boost of 100 ends up in the final set 100 times. In other words, if the sum of boosts is 4212, which it was in my case, you get 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table which simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost
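A sketch of that last variant in MySQL, with a hypothetical helper table numbers(id) pre-filled with 1..n (n at least as large as the largest boost); the LIMIT 1 at the end is my addition to draw a single winner:
-- one-time setup of the helper table (fill up to n, not just 7)
CREATE TEMPORARY TABLE numbers (id INT PRIMARY KEY);
INSERT INTO numbers (id) VALUES (1), (2), (3), (4), (5), (6), (7);

-- each row of foo is repeated boost times, then one row is drawn at random
SELECT `values`.id, `values`.boost
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN numbers ON numbers.id <= `values`.boost
ORDER BY RAND()
LIMIT 1;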