Selectivity estimation error on a simple query

Suppose we have a simple table tt, created like this:
WITH x AS (
    SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
), t1 AS (
    SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n AS id
    FROM x ones, x tens, x hundreds, x thousands, x tenthousands, x hundredthousands
)
SELECT id,
       id % 100 AS groupby,
       row_number() OVER (PARTITION BY id % 100 ORDER BY id) AS orderby,
       row_number() OVER (PARTITION BY id % 100 ORDER BY id) / (id % 100 + 1) AS local_search
INTO tt
FROM t1
I have a simple query Q1:
select distinct g1.groupby,
(select count(*) from tt g2
where local_search = 1 and g1.groupby = g2.groupby) as orderby
from tt g1
option(maxdop 1)
I would like to know why SQL Server estimates the result size so badly for Q1 (see the screenshot). Most of the operators in the query plan are estimated precisely; however, the root Hash Match operator introduces a completely insane guess.
To make it more interesting, I have tried different rewritings of Q1. If I apply decorrelation to the subquery, I obtain an equivalent query Q2:
select main.groupby,
coalesce(sub1.orderby,0) orderby
from
(
select distinct g1.groupby
from tt g1
) main
left join
(
select groupby, count(*) orderby
from tt g2
where local_search = 1
group by groupby
) sub1 on sub1.groupby = main.groupby
option(maxdop 1)
This query is interesting in two respects: (1) the estimate is accurate (see the screenshot), and (2) it has a different query plan, which is more efficient than the query plan of Q1.
So the question is: why is the estimate for Q1 incorrect, whereas the estimate for Q2 is precise? Please do not post other rewritings of this SQL (I know it can be written even without subqueries); I'm interested only in an explanation of the selectivity estimator's behaviour. Thanks.

It doesn't recognize that the orderby value will be the same for all rows with the same groupby, so it assumes distinct groupby, orderby will have more combinations than distinct groupby alone.
It multiplies the estimate for DISTINCT orderby (for me this is 35.0367) by the estimate for DISTINCT groupby (for me this is 100) as if they were uncorrelated.
So I get an estimate of 35.0367 × 100 = 3503.67 for the root node in Q1.
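If you want to see where those distinct-value estimates come from, you can inspect the density information SQL Server keeps for each column (a sketch; it assumes auto-created statistics exist on tt for both columns):
DBCC SHOW_STATISTICS ('dbo.tt', groupby) WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('dbo.tt', orderby) WITH DENSITY_VECTOR;
-- 1 / "All density" approximates the distinct-value count the optimizer uses per column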
This rewrite avoids the issue, as it now groups only by the single groupby column.
SELECT groupby,
max(orderby) AS orderby
FROM (SELECT g1.groupby,
(SELECT count(*)
FROM tt g2
WHERE local_search = 1
AND g1.groupby = g2.groupby) AS orderby
FROM tt g1) d
GROUP BY groupby
OPTION(maxdop 1)
This is not the optimal approach to this query, though, as shown by your Q2 and the comment @GarethD makes about the inefficiency of running the correlated subquery multiple times and discarding the duplicates.


Display percentage of registered members that have not rated a Movie

I have the following three tables. See full db<>fiddle here
members

member_id | first_name | last_name
----------+------------+----------
        1 | Roby       | Dauncey
        2 | Isa        | Garfoot
        3 | Sullivan   | Carletto
        4 | Jacintha   | Beacock
        5 | Mikey      | Keat
        6 | Cindy      | Stenett
        7 | Alexina    | Deary
        8 | Perkin     | Bachmann
       10 | Suzann     | Genery
       39 | Horatius   | Baukham
       41 | Bendicty   | Willisch
movies

movie_id | movie_name                    | movie_genre
---------+-------------------------------+-----------------------
      10 | The Bloody Olive              | Comedy,Crime,Film-Noir
      56 | Attack of The Killer Tomatoes | (no genres listed)
ratings

rating_id | movie_id | member_id | rating
----------+----------+-----------+-------
       19 |       10 |        39 |      2
       10 |       56 |        41 |      1
Now the question is:
Out of the total number of registered members, how many have actually left a movie rating? Display the result as a percentage.
This is what I have tried:
SELECT CONVERT(VARCHAR,(CONVERT(FLOAT,COUNT([Number of Members])) / CONVERT(FLOAT,COUNT(*)) * 100)) + '%'
AS 'Members Percentage'
FROM (
SELECT COUNT(*) AS 'Number of Members'
FROM members
WHERE member_id IN (
SELECT member_id FROM members
EXCEPT
SELECT member_id FROM ratings
)
) MembersNORatings
And my query result displays 100%, which is obviously wrong:

Members Percentage
------------------
100%
What I figured out is that in the first line of the query, the COUNT(*) value is being treated as equivalent to the alias [Number of Members]; that's why it is showing 100%.
I thought of replacing COUNT(*) with SELECT COUNT(*) FROM members, but before I could even run the query it showed an error saying:
Incorrect syntax near 'SELECT'.
What change do I need to make in my existing query in order to get the proper percentage result?
You can use a CROSS APPLY with a subquery to determine whether a given member has left a rating or not (because you can't use a subquery in an aggregate). Then divide (ensuring you use decimal division, not integer division) to get the percentage.
select
count(*) TotalMembers
, sum(r.HasRating) TotalWithRatings
, convert(decimal(9,2), 100 * sum(r.HasRating) / (count(*) * 1.0)) PercentageWithRatings
from #members m
cross apply (
select case when exists (select 1 from #ratings r where r.member_id = m.member_id) then 1 else 0 end
) r (HasRating);
Returns:

TotalMembers | TotalWithRatings | PercentageWithRatings
-------------+------------------+----------------------
          50 |                2 |                  4.00
As mentioned in the comments, there are several ways to approach this. For example:
Option #1 - OUTER JOIN + DISTINCT
SELECT TotalMembers
, TotalMembersWithRatings
, CAST( 100.0 * TotalMembersWithRatings
/ NULLIF(TotalMembers, 0 )
AS DECIMAL(10,2)) AS MemberPercentage
FROM (
SELECT COUNT(DISTINCT m.member_id) AS TotalMembers
, COUNT(DISTINCT r.member_id) AS TotalMembersWithRatings
FROM members m LEFT JOIN ratings r ON r.member_id = m.member_id
) t
Option #2 - CTE + ROW_NUMBER()
WITH memberRatings AS (
SELECT member_id, ROW_NUMBER() OVER(
PARTITION BY member_id
ORDER BY member_id
) AS RowNum
FROM ratings
)
SELECT COUNT(m.member_id) AS TotalMembers
, COUNT(mr.member_id) AS TotalWithRatings
, CAST( 100.0 * COUNT(mr.member_id)
/ NULLIF(COUNT(m.member_id), 0 )
AS DECIMAL(10,2)) AS MemberPercentage
FROM members m LEFT JOIN memberRatings mr ON mr.member_id = m.member_id
AND mr.RowNum = 1
Option #3 - CROSS APPLY
SELECT
COUNT(*) TotalMembers
, SUM(r.HasRating) TotalWithRatings
, CONVERT(decimal(9,2), 100 * sum(r.HasRating) / (count(*) * 1.0)) PercentageWithRatings
FROM members m
CROSS APPLY (
SELECT CASE WHEN exists (select 1 from ratings r where r.member_id = m.member_id) THEN 1
ELSE 0
END
) r (HasRating);
Execution Plans - Take #1
There's a LOT more to analyzing execution plans than just comparing a single number. However, high-level plans do provide some useful indicators.
With the small data samples provided, the plans suggest options #2 (CTE) and #3 (APPLY) are likely to be the most performant (19% each), and option #1 (OUTER JOIN + DISTINCT) the least performant (63%), likely due to the COUNT(DISTINCT), which can often be slower than the alternatives.
Original Sample Size:

TableName | TotalRows
----------+----------
movies    |        50
members   |        50
ratings   |        50
Execution Plans - Take #2
However, populate the tables with more than a few sample rows of data and the same rough comparison produces a different result. Option #2 (CTE) still seems likely to be the least expensive query (9%), but option #3 (APPLY) is now the most expensive (76%). You can see that the majority of that cost is the index spool used, due to how APPLY operates:
New Sample Size

TableName | TotalRows
----------+----------
movies    |      4105
members   |     29941
ratings   |     14866
New Execution Plans
With the increased amount of data, STATISTICS IO shows option #2 has far fewer logical reads and scans, and option #3 (APPLY) has the most. And while option #1 appears to have a lower cost overall (15%), it still has a much higher number of logical reads. (Add a non-clustered index on member_id and movie_id and the numbers, while similar, change once again.) So don't just look at a single number.
New Statistics IO
While option #2 (CTE) seems likely to be the most efficient overall, there are a lot of factors involved (indexes, data volume, statistics, version, etc.), so you should examine the actual execution plans in your own environment.
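If you want to reproduce the comparison, a minimal sketch of the session setup (standard SQL Server session options):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run options #1, #2 and #3 here, then compare the logical reads
-- and scan counts reported in the Messages tab
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;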
As with most things, the answer as to which is best is: it depends.
Late to the party, but you don't need to join the tables if you only want to know how many members made a rating, not who. What you need is:
- count the entries in the members table
- count the (distinct) members in ratings
- get the quota of 'rating' members (rating members divided by total members)
- to get non-rating members, subtract the quota from 1.0
- multiply by 100 to get the percent value
This is how you could do the calculation step by step using CTEs:
with count_members as (
select count(member_id) as member_count from members
), count_raters as (
select count(distinct member_id) as rater_count from ratings
), convert_both as (
select top 1
cast(m.member_count as decimal(10,2)) as member_count,
cast(r.rater_count as decimal(10,2)) as rater_count
from count_members as m cross join count_raters as r
), calculate_quota as (
select (rater_count / member_count) as quota from convert_both
), invert_quota as (
select (1.0 - quota) as quota from calculate_quota
)
select (quota * 100) as percentage from invert_quota;
Alternatively, here's how you could roll it all into one:
select (
(1.0 - (
cast((select count(distinct member_id) from ratings) as decimal(10,2))
/
cast((select count(member_id) from members) as decimal(10,2))
) ) * 100
) as percentage;
dbfiddle here

Is there a better way to retrieve a random row from an Oracle table?

Not so long ago I needed to fetch a random row from a table in an Oracle database. The most widespread solution that I've found was this:
SELECT * FROM
( SELECT * FROM tabela WHERE warunek
ORDER BY dbms_random.value )
WHERE rownum = 1
However, this is very performance heavy for large tables, as it sorts the table in random order first, then grabs the first row.
Today, one of my colleagues suggested a different way:
SELECT * FROM (
SELECT * FROM MAIN_PRODUCT
WHERE ROWNUM <= CAST((SELECT COUNT(*) FROM MAIN_PRODUCT)*dbms_random.value AS INTEGER)
ORDER BY ROWNUM DESC
) WHERE ROWNUM = 1;
It works much faster and seems to return random values, but does it really? Could someone give an insight into whether it is really random and behaves as expected? I'm really curious why I haven't found this approach anywhere else, and, if it is indeed random and much better performance-wise, why it isn't more widespread.
This is possibly the simplest query that gets the result. But the SELECT COUNT(*) FROM MAIN_PRODUCT will do a table scan, and I doubt you can get a query which avoids that.
P.S. This query assumes no deleted records.
Query
SELECT *
FROM
MAIN_PRODUCT
WHERE
ROWNUM = FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
FLOOR((dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1)
will generate a number between 1 and the row count of the table; see the demo for how it changes each time you refresh it.
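A quick way to see those bounds for yourself (a sketch using Oracle's dual; every execution returns an integer between 1 and 10):
SELECT FLOOR(dbms_random.value * 10 + 1) AS random_pick
FROM dual;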
Oracle12c+ Query
SELECT *
FROM
MAIN_PRODUCT
WHERE
ROWNUM <= FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
ORDER BY
ROWNUM DESC
FETCH FIRST ROW ONLY
The second query you have,
SELECT * FROM (
SELECT * FROM MAIN_PRODUCT
WHERE ROWNUM <= CAST((SELECT COUNT(*) FROM MAIN_PRODUCT)*dbms_random.value AS INTEGER)
ORDER BY ROWNUM DESC
) WHERE ROWNUM = 1;
is excellent, except that it will get subsequent elements. dbms_random.value returns a real number between 0 and 1. Multiplying it by the number of rows gives you a genuinely random row position, and the bottleneck here is counting the number of rows rather than generating a random value for each row.
Proof
Consider a number x with
0 <= x < 1
If we multiply it by n, we get
0 <= n * x < n
which is exactly what you need if you want to load a single element. The reason this is not widespread is that in many cases the performance issue is not felt, because there are only a few thousand records.
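For instance, with n = 1000 rows and a draw of x = 0.731, n * x = 731: the inner ROWNUM filter keeps the first 731 rows, and the ORDER BY ROWNUM DESC plus the outer ROWNUM = 1 then returns row 731.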
EDIT
If you need k records, not just the first one, it is slightly more difficult, but still solvable. The algorithm would be something like this (I do not have Oracle installed to test it, so I only describe the algorithm):
randomize(n, k)
    randomized <- empty_set
    while (k > 0) do
        newValue <- random(n)
        n <- n - 1
        k <- k - 1
        // find out how many already-chosen elements are lower than newValue
        // increase newValue by that amount
        // check whether newValue became larger than some chosen values that were larger before the increase
        // increase newValue by that amount too
        // repeat until there is no need to increase newValue
        randomized <- randomized + {newValue}
    while end
randomize end
If you randomize k elements from n, then you will be able to use those values in your filter.
The key to improving performance is to lessen the load of the ORDER BY.
If you know about how many rows match the conditions, then you can filter before the sort. For instance, the following takes about 1% of the rows:
SELECT *
FROM (SELECT *
FROM tabela
WHERE warunek AND dbms_random.value < 0.01
ORDER BY dbms_random.value
)
WHERE rownum = 1;
A variation is to calculate the number of matching values. Then randomly select a smaller sample. The following gets about 100 matching rows and then sorts them for the random selection:
SELECT a.*
FROM (SELECT *
FROM (SELECT a.*, COUNT(*) OVER () as cnt
FROM tabela a
WHERE warunek
) a
WHERE dbms_random.value < 100 / cnt
ORDER BY dbms_random.value
) a
WHERE rownum = 1;

Max returning all values on a self join

My problem requires me to query data from the table and include a column that calculates the % increase as well. I need to pull only the record with the highest % increase using MAX. I think I'm on the right track, but for some reason it's returning all records despite the HAVING clause calling for just the max.
Select
O.Grocery_Item,
TO_CHAR(sum(g.Price_IN_2000), '$99,990.00') TOTAL_IN_2000,
TO_CHAR(sum(g.Estimated_Price_In_2025), '$99,990.00') TOTAL_IN_2025,
TO_CHAR(Round(O.MY_OUTPUT),'9,990') || '%' as My_Output
From
GROCERY_PRICES g,
(SELECT
GROCERY_ITEM,
(((sum(Estimated_Price_In_2025) -
sum(Price_IN_2000))/sum(Price_IN_2000))*100) MY_OUTPUT
FROM
GROCERY_PRICES
GROUP BY GROCERY_ITEM) O
Where
G.GROCERY_ITEM = O.GROCERY_ITEM
GROUP BY
O.GROCERY_ITEM, O.MY_OUTPUT
Having
my_output IN (select Max(O.MY_OUTPUT) from GROCERY_PRICES);
Results:
GROCERY_ITEM TOTAL_IN_2000 TOTAL_IN_2025 MY_OUTPUT
------------------------------ ------------- ------------- ---------
M_004 $2.70 $5.65 109%
B_001 $0.80 $2.64 230%
T_006 $5.70 $6.65 17%
B_002 $2.72 $7.36 171%
E_001 $0.62 $1.78 187%
R_003 $4.00 $13.20 230%
6 rows selected
You can simplify your query so you only select from the GROCERY_PRICES table once: since your My_Output column is only a function of numbers you are already producing, the self-join is not necessary. Then I've used RANK to get the top record (although if you are not concerned about ties, ROWNUM will work better):
SELECT g.Grocery_Item,
g.TOTAL_IN_2000,
g.TOTAL_IN_2025,
g.My_Output
FROM ( SELECT Grocery_Item,
TO_CHAR(TOTAL_IN_2000, '$99,990.00') TOTAL_IN_2000,
TO_CHAR(TOTAL_IN_2025, '$99,990.00') TOTAL_IN_2025,
TO_CHAR(ROUND(((TOTAL_IN_2025 / TOTAL_IN_2000) - 1) * 100), '9,990') || '%' as My_Output,
RANK() OVER(ORDER BY (TOTAL_IN_2025 / TOTAL_IN_2000) - 1 DESC) AS GroceryRank
FROM ( SELECT g.Grocery_Item,
SUM(g.Price_IN_2000) TOTAL_IN_2000,
SUM(g.Estimated_Price_In_2025) TOTAL_IN_2025
FROM GROCERY_PRICES g
GROUP BY g.Grocery_Item
) g
) g
WHERE GroceryRank = 1;
I've also simplified your percentage calculation.
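The two forms are algebraically identical: ((s2025 - s2000) / s2000) * 100 = (s2025 / s2000 - 1) * 100. For B_001, for example, (2.64 / 0.80 - 1) * 100 = 230%, matching the output above.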
Try this instead:
select *
from (Select O.Grocery_Item, TO_CHAR(sum(g.Price_IN_2000), '$99,990.00') TOTAL_IN_2000,
TO_CHAR(sum(g.Estimated_Price_In_2025), '$99,990.00') TOTAL_IN_2025,
TO_CHAR(Round(O.MY_OUTPUT),'9,990') || '%' as My_Output
From GROCERY_PRICES g join
(SELECT GROCERY_ITEM,
(((sum(Estimated_Price_In_2025) -
sum(Price_IN_2000))/sum(Price_IN_2000))*100
) MY_OUTPUT
FROM GROCERY_PRICES
GROUP BY GROCERY_ITEM
) O
on G.GROCERY_ITEM = O.GROCERY_ITEM
GROUP BY O.GROCERY_ITEM, O.MY_OUTPUT
ORDER BY my_output desc
) t
where rownum = 1
The problem is that your subquery only has outer references: the O.MY_OUTPUT inside it comes from the outer query, not from anything in the subquery's FROM clause. You are therefore comparing a value to itself, which for non-NULL values is always true.
Since you want the maximum value, the easiest way is to order the list and take the first row. You can also do this with analytic functions, but rownum is usually more efficient.

SQL select elements where sum of field is less than N

Given that I've got a table with the following, very simple content:
# select * from messages;
id | verbosity
----+-----------
1 | 20
2 | 20
3 | 20
4 | 30
5 | 100
(5 rows)
I would like to select N messages whose sum of verbosity is lower than Y (for testing purposes let's say Y should be 70; then the correct result is the messages with ids 1, 2, 3).
It's really important to me, that solution should be database independent (it should work at least on Postgres and SQLite).
I was trying with something like:
SELECT * FROM messages GROUP BY id HAVING SUM(verbosity) < 70;
However, it doesn't seem to work as expected, because it doesn't actually sum all the values from the verbosity column.
I would be very grateful for any hints/help.
SELECT m.id, sum(m1.verbosity) AS total
FROM messages m
JOIN messages m1 ON m1.id <= m.id
WHERE m.verbosity < 70 -- optional, to avoid pointless evaluation
GROUP BY m.id
HAVING SUM(m1.verbosity) < 70
ORDER BY total DESC
LIMIT 1;
This assumes a unique, ascending id like you have in your example.
In modern Postgres - or generally with modern standard SQL (but not in SQLite):
Simple CTE
WITH cte AS (
SELECT *, sum(verbosity) OVER (ORDER BY id) AS total
FROM messages
)
SELECT *
FROM cte
WHERE total < 70
ORDER BY id;
Recursive CTE
Should be faster for big tables where you only retrieve a small set.
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, verbosity, verbosity AS total
FROM messages
ORDER BY id
LIMIT 1
)
UNION ALL
SELECT c1.id, c1.verbosity, c.total + c1.verbosity
FROM cte c
JOIN LATERAL (
SELECT *
FROM messages
WHERE id > c.id
ORDER BY id
LIMIT 1
) c1 ON c1.verbosity < 70 - c.total
WHERE c.total < 70
)
SELECT *
FROM cte
ORDER BY id;
All standard SQL, except for LIMIT.
Strictly speaking, there is no such thing as "database-independent". There are various SQL-standards, but no RDBMS complies completely. LIMIT works for PostgreSQL and SQLite (and some others). Use TOP 1 for SQL Server, rownum for Oracle. Here's a comprehensive list on Wikipedia.
The SQL:2008 standard would be:
...
FETCH FIRST 1 ROWS ONLY
... which PostgreSQL supports - but hardly any other RDBMS.
The pure alternative that works with more systems would be to wrap it in a subquery and
SELECT max(total) FROM <subquery>
But that is slow and unwieldy.
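A sketch of that portable variant, built on the self-join query from the first answer (it returns only the boundary sum):
SELECT max(total) AS max_total
FROM (
    SELECT m.id, sum(m1.verbosity) AS total
    FROM messages m
    JOIN messages m1 ON m1.id <= m.id
    GROUP BY m.id
    HAVING sum(m1.verbosity) < 70
) sub;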
db<>fiddle here
Old sqlfiddle
This will work...
select *
from messages
where id<=
(
select MAX(id) from
(
select m2.id, SUM(m1.verbosity) sv
from messages m1
inner join messages m2 on m1.id <=m2.id
group by m2.id
) v
where sv<70
)
However, you should understand that SQL is designed as a set-based language rather than an iterative one, so it is designed to treat data as a set rather than row by row.

MySQL: Alternatives to ORDER BY RAND()

I've read about a few alternatives to MySQL's ORDER BY RAND() function, but most of the alternatives apply only where a single random result is needed.
Does anyone have any idea how to optimize a query that returns multiple random results, such as this:
SELECT u.id,
p.photo
FROM users u, profiles p
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY RAND()
LIMIT 18
UPDATE 2016
This solution works best using an indexed column.
Here is a simple example of an optimized query, benchmarked with 100,000 rows.
OPTIMIZED: 300ms
SELECT
g.*
FROM
table g
JOIN
(SELECT
id
FROM
table
WHERE
RAND() < (SELECT
((4 / COUNT(*)) * 10)
FROM
table)
ORDER BY RAND()
LIMIT 4) AS z ON z.id= g.id
Note about the limit amount: limit 4 and 4/count(*). The 4s need to be the same number. Changing how many you return doesn't affect the speed much; benchmarks at limit 4 and limit 1000 are the same, while limit 10,000 took it up to 600ms.
Note about the join: randomizing just the id is faster than randomizing a whole row, since otherwise the entire row has to be copied into memory before being randomized. The join can be to any table that is linked to the subquery; it's there to prevent table scans.
Note about the where clause: the count-based WHERE cuts down the number of results being randomized, so it takes a percentage of the results and sorts them rather than the whole table.
Note about the subquery: if you are doing joins or extra where-clause conditions, you need to put them in both the subquery and the sub-subquery, to keep the count accurate and pull back the correct data.
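A sketch of that last note with one hypothetical extra condition (status = 1 is illustrative, and table stands in for your real table name):
SELECT g.*
FROM table g
JOIN (SELECT id
      FROM table
      WHERE status = 1            -- extra condition in the subquery...
        AND RAND() < (SELECT ((4 / COUNT(*)) * 10)
                      FROM table
                      WHERE status = 1)  -- ...and in the sub-subquery
      ORDER BY RAND()
      LIMIT 4) AS z ON z.id = g.id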
UNOPTIMIZED: 1200ms
SELECT
g.*
FROM
table g
ORDER BY RAND()
LIMIT 4
PROS
4x faster than ORDER BY RAND(). This solution works with any table with an indexed column.
CONS
It is a bit complex with complex queries; you need to maintain the conditions in two places (the subquery and the sub-subquery).
Here's an alternative, but it is still based on using RAND():
SELECT u.id,
p.photo,
ROUND(RAND() * x.m_id) 'rand_ind'
FROM users u,
profiles p,
(SELECT MAX(t.id) 'm_id'
FROM USERS t) x
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY rand_ind
LIMIT 18
This is slightly more complex, but it gave a better distribution of rand_ind values:
SELECT u.id,
p.photo,
FLOOR(1 + RAND() * x.m_id) 'rand_ind'
FROM users u,
profiles p,
(SELECT MAX(t.id) - 1 'm_id'
FROM USERS t) x
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY rand_ind
LIMIT 18
It is not the fastest, but faster than the common ORDER BY RAND() way:
ORDER BY RAND() is not so slow when you use it on an indexed column only. You can take all your ids in one query like this:
SELECT id
FROM testTable
ORDER BY RAND();
to get a sequence of random ids, and JOIN the result to another query with other SELECT or WHERE parameters:
SELECT t.*
FROM testTable t
JOIN
(SELECT id
FROM `testTable`
ORDER BY RAND()) AS z ON z.id= t.id
WHERE t.isVisible = 1
LIMIT 100;
in your case it would be:
SELECT u.id, p.photo
FROM users u
JOIN
(SELECT id
FROM users
ORDER BY RAND()) AS z ON z.id = u.id
JOIN profiles p ON p.memberid = u.id
WHERE p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
LIMIT 18
It's a very blunt method and it may not be appropriate for very big tables, but it's still faster than the common RAND(). I got 20 times faster execution time selecting 3000 random rows from almost 400,000.
Order by rand() is very slow on large tables. I found the following workaround in a PHP script:
Select min(id) as min, max(id) as max from table;
Then do the random pick in PHP:
$rand = rand($min, $max);
Then:
'Select * from table where id>'.$rand.' limit 1';
Seems to be quite fast....
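The same idea can be folded into a single query (a sketch; like the PHP version, it assumes ids without large gaps, otherwise the distribution skews):
SELECT t.*
FROM table t
JOIN (SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id))) AS rid
      FROM table) r ON t.id >= r.rid
ORDER BY t.id
LIMIT 1;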
I ran into this today. I was trying to use DISTINCT along with JOINs, but was getting duplicates, I assume because the RAND was making each JOINed row distinct. I muddled around a bit and found a solution that works, like this:
SELECT DISTINCT t.id,
t.photo
FROM (SELECT u.id,
p.photo,
RAND() as rand
FROM users u, profiles p
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY rand) t
LIMIT 18
Create a column, or join to a select, with random numbers (generated in PHP, for example) and order by this column.
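A minimal sketch of that idea (rand_sort is a hypothetical column, filled either by the application or by RAND()):
ALTER TABLE users ADD COLUMN rand_sort DOUBLE;
UPDATE users SET rand_sort = RAND();  -- or assign numbers generated in PHP

SELECT u.id, p.photo
FROM users u
JOIN profiles p ON p.memberid = u.id
WHERE p.photo != ''
  AND (u.ownership = 1 OR u.stamp = 1)
ORDER BY u.rand_sort
LIMIT 18;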
The solution I am using is also posted in the link below:
How can i optimize MySQL's ORDER BY RAND() function?
I am assuming your users table is going to be larger than your profiles table; if not, then it's 1-to-1 cardinality.
If so, I would first do a random selection on the users table before joining with the profiles table.
First do selection:
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
Then from this pool, pick out random rows through calculated probability. If your table has M rows and you want to pick out N random rows, the probability of random selection should be N/M. Hence:
SELECT *
FROM
(
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
) as U
WHERE
rand() <= $limitCount / (SELECT count(*) FROM users WHERE users.ownership = 1 OR users.stamp = 1)
Here N is $limitCount and M is the result of the subquery that calculates the table row count. However, since we are working with probability, it is possible for FEWER than $limitCount rows to be returned. Therefore we should multiply N by a factor to increase the size of the random pool.
i.e.:
SELECT *
FROM
(
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
) as U
WHERE
rand() <= $limitCount * $factor / (SELECT count(*) FROM users WHERE users.ownership = 1 OR users.stamp = 1)
I usually set $factor = 2. You can set the factor to a lower value to further reduce the random pool size (e.g. 1.5).
At this point, we have already cut an M-row table down to a pool of roughly 2N rows. From here we can do a JOIN and then LIMIT.
SELECT *
FROM
(
SELECT *
FROM
(
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
) as U
WHERE
rand() <= $limitCount * $factor / (SELECT count(*) FROM users WHERE users.ownership = 1 OR users.stamp = 1)
) as randUser
JOIN profiles
ON randUser.id = profiles.memberid AND profiles.photo != ''
LIMIT $limitCount
On a large table, this query will outperform a normal ORDER by RAND() query.
Hope this helps!
SELECT
a.id,
mod_question AS modQuestion,
mod_answers AS modAnswers
FROM
b_ask_material AS a
INNER JOIN ( SELECT id FROM b_ask_material WHERE industry = 2 ORDER BY RAND( ) LIMIT 100 ) AS b ON a.id = b.id
I had the same issue today; I fixed it by using LIMIT and OFFSET.
You can do it by iterating 18 times over a random set of offsets.
To avoid duplicates, you can create your random set of offsets like this in Python: sample(range(1, rows_count), random_rows_count).
Then, for each offset, get the corresponding row using OFFSET and LIMIT 1 and add it to a list (see the sketch below).
rows_count can be cached, to avoid the performance cost of counting the total number of rows on every request.
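A minimal sketch of one iteration (the offset values come from the application's random sample; 1234 is illustrative):
SELECT u.id, p.photo
FROM users u
JOIN profiles p ON p.memberid = u.id
WHERE p.photo != ''
  AND (u.ownership = 1 OR u.stamp = 1)
LIMIT 1 OFFSET 1234;  -- repeat once per random offset, collecting the rows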
EDIT it's actually what this post https://stackoverflow.com/a/40398306/3045926 proposes