I have a table with user_ids that we've gathered from a streaming datasource of active accounts. Now I'm looking to go through and fill in the information about the user_ids that don't do much of anything.
Is there a SQL (postgres if it matters) way to have a query return random numbers not present in the table?
Eg something like this:
SELECT RANDOM(count, lower_bound, upper_bound) as new_id
WHERE new_id NOT IN (SELECT user_id FROM user_table) AS user_id_table
Is this possible, or would it be best to generate a bunch of random numbers with a scripted wrapper and pass those into the DB to figure out the nonexistent ones?
It is possible. If you want the IDs to be integers, try:
SELECT trunc((random() * (upper_bound - lower_bound)) + lower_bound) AS new_id
FROM generate_series(1,upper_bound)
WHERE new_id NOT IN (
SELECT user_id
FROM user_table)
The alias new_id cannot be referenced in the WHERE clause of the same SELECT, so you can wrap the query above in a subselect, i.e.
SELECT * FROM (SELECT trunc(random() * (upper - lower) + lower) AS new_id
FROM generate_series(1, count)) AS x
WHERE x.new_id NOT IN (SELECT user_id FROM user_table)
I suspect you want a random sampling. I would do something like:
SELECT s
FROM generate_series(1, (select max(user_id) from users)) s
LEFT JOIN users ON s.s = user_id
WHERE user_id IS NULL
order by random() limit 5;
I haven't tested this, but the idea should work. If you have a lot of users and not a lot of missing ids, it will perform better than the other options, but performance may be a problem no matter what you do.
My pragmatic approach would be: generate 500 random numbers and then pick one which is not in the table:
WITH fivehundredrandoms AS ( SELECT trunc(random() * (upper_bound - lower_bound) + lower_bound) AS onerandom
FROM (SELECT generate_series(1,500)) AS fivehundred )
SELECT onerandom FROM fivehundredrandoms
WHERE onerandom NOT IN (SELECT user_id FROM user_table WHERE user_id > 0) LIMIT 1;
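The scripted-wrapper variant mentioned in the question can be sketched as follows. This is only an illustration under assumptions: it uses Python's sqlite3 module with an in-memory database as a stand-in for the real server, the function name unused_ids is hypothetical, and the table and column names are taken from the question.

```python
import random
import sqlite3


def unused_ids(conn, count, lower_bound, upper_bound, rng=random):
    """Return random ids in [lower_bound, upper_bound) that are absent from user_table."""
    # Draw distinct candidates in the script rather than in SQL.
    span = range(lower_bound, upper_bound)
    candidates = rng.sample(span, min(count, len(span)))
    if not candidates:
        return []
    # Ask the database which of the candidates are already taken.
    placeholders = ",".join("?" * len(candidates))
    taken = {
        row[0]
        for row in conn.execute(
            f"SELECT user_id FROM user_table WHERE user_id IN ({placeholders})",
            candidates,
        )
    }
    return [c for c in candidates if c not in taken]


conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("CREATE TABLE user_table (user_id INTEGER PRIMARY KEY)")
# Demo data: every even id below 100 is already in use.
conn.executemany("INSERT INTO user_table VALUES (?)", [(i,) for i in range(0, 100, 2)])
print(unused_ids(conn, 10, 0, 100))
```

Because taken candidates are dropped, fewer than count ids can come back; asking for more candidates than needed, as the 500-number approach above does, makes that unlikely.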
There is a way to do what you want with recursive queries; alas, it is not nice.
Suppose that you have the following table:
CREATE TABLE test (a int)
To simplify, suppose you want to insert random numbers generated by (random() * 5)::int, i.e. integers from 0 to 5 (the cast rounds to the nearest integer), that are not in the table.
WITH RECURSIVE rand (i, r, is_new) AS (
SELECT
0,
null::int, -- typed NULL so the column type matches the recursive term
false
UNION ALL
SELECT
i + 1,
next_number.v,
NOT EXISTS (SELECT 1 FROM test WHERE test.a = next_number.v)
FROM
rand r,
(VALUES ((random() * 5)::int)) next_number(v)
-- safety check to make sure we do not go into an infinite loop
WHERE i < 500
)
SELECT * FROM rand WHERE rand.is_new LIMIT 1
I'm not super sure, but PostgreSQL should be able to stop iterating once it has one result, since it knows that the query has limit 1.
The nice thing about this query is that you can replace (random() * 5)::int with any id-generating function that you want.
Related
The task is to execute the sql query:
select * from x where user in (select user from x where id = '1')
The subquery returns about 1000 ids, so it takes a long time.
Maybe this question has been asked before, but how can I speed it up? (If it is possible to speed it up, please write it for PL/SQL and T-SQL, or at least one of them.)
I would start by rewriting the in condition to exists:
select *
from x
where exists (select 1 from x x1 where x1.user = x.user and x1.id = 1)
Then, consider an index on x(user, id) - or x(id, user) (you can try both and see if one offers a better improvement than the other).
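As a quick sanity check that the IN and EXISTS forms select the same rows, here is a small sketch using Python's sqlite3 module. The sample data and the index name x_user_id are made up for illustration, and the column is quoted as "user" because that word is reserved in some engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE x (id INTEGER, "user" TEXT)')
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [(1, "ann"), (2, "ann"), (1, "bob"), (3, "cid"), (2, "cid")],
)
# One of the two candidate indexes suggested above.
conn.execute('CREATE INDEX x_user_id ON x ("user", id)')

# Original form: IN with a subquery.
in_rows = conn.execute(
    'SELECT id, "user" FROM x '
    'WHERE "user" IN (SELECT "user" FROM x WHERE id = 1) '
    'ORDER BY "user", id'
).fetchall()

# Rewritten form: correlated EXISTS against an aliased copy of the table.
exists_rows = conn.execute(
    'SELECT id, "user" FROM x WHERE EXISTS '
    '(SELECT 1 FROM x x1 WHERE x1."user" = x."user" AND x1.id = 1) '
    'ORDER BY "user", id'
).fetchall()

print(in_rows == exists_rows)
```

Whether either form is actually faster depends on the engine and its planner, so comparing execution plans on the real data is still worthwhile.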
Another possibility is to use window functions:
select *
from (
select x.*, max(case when id = 1 then 1 else 0 end) over(partition by user) flag
from x
) x
where flag = 1
This might, or might not, perform better than the exists solution, depending on various factors.
ids are usually unique. Is it sufficient to do this?
select x.*
from x
where id in ( . . . );
You would want an index on id, if it is not already the primary key of the table.
Is there any common practice to use SELECT result as a typed value, e.g. for functions arguments?
Could it be something like this?
func((SELECT number FROM numbers WHERE user_id = 1 LIMIT 1).number::numeric)
I thought about CURSOR for such a task but I'm not really sure. Thank you for any advice!
I'm using PostgreSQL so if there is any specific solution feel free to share.
Use the FROM clause or a common table expression:
SELECT func(a.x, b.y)
FROM (SELECT ... LIMIT 1) AS a
CROSS JOIN (SELECT ... LIMIT 1) AS b;
or
WITH a AS (SELECT ... LIMIT 1),
b AS (SELECT ... LIMIT 1)
SELECT func(a.x, b.y)
FROM a CROSS JOIN b;
Doesn't this do what you want?
func( (SELECT number FROM numbers WHERE user_id = 1 LIMIT 1)::numeric )
The subquery is returning (at most) one value so it is a scalar subquery and is equivalent to a scalar reference in the query.
That said, there are many other ways to express this. For instance:
func( (SELECT number::numeric FROM numbers WHERE user_id = 1 LIMIT 1) )
or:
(SELECT func(number::numeric) FROM numbers WHERE user_id = 1 LIMIT 1)
or moving to the from clause and using a lateral join. Or calculating the result in a CTE or subquery.
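A minimal sketch of the scalar-subquery form, run through Python's sqlite3 module. Everything here is an assumption for illustration: func is registered as a placeholder that doubles its argument, the numbers table holds made-up rows, and CAST(... AS REAL) stands in for PostgreSQL's ::numeric.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (user_id INTEGER, number TEXT)")
conn.execute("INSERT INTO numbers VALUES (1, '20.5'), (2, '7')")
# Placeholder for the real function: it simply doubles its argument.
conn.create_function("func", 1, lambda value: value * 2)

# The parenthesised subquery yields a single value, which is cast and
# passed straight to the function as an ordinary argument.
(result,) = conn.execute(
    "SELECT func(CAST("
    "(SELECT number FROM numbers WHERE user_id = 1 LIMIT 1) AS REAL))"
).fetchone()
print(result)
```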
Not so long ago I needed to fetch a random row from a table in an Oracle database. The most widespread solution that I've found was this:
SELECT * FROM
( SELECT * FROM tabela WHERE warunek
ORDER BY dbms_random.value )
WHERE rownum = 1
However, this is very performance heavy for large tables, as it sorts the table in random order first, then grabs the first row.
Today, one of my colleagues suggested a different way:
SELECT * FROM (
SELECT * FROM MAIN_PRODUCT
WHERE ROWNUM <= CAST((SELECT COUNT(*) FROM MAIN_PRODUCT)*dbms_random.value AS INTEGER)
ORDER BY ROWNUM DESC
) WHERE ROWNUM = 1;
It works way faster and seems to return random values, but does it really? Could someone give some insight into whether it is really random and behaves as expected? I'm really curious why I haven't found this approach anywhere else while looking, and, if it is indeed random and much better performance-wise, why it isn't more widespread.
This is (possibly) the simplest query that gets the result.
But the SELECT COUNT(*) FROM MAIN_PRODUCT will scan the table; I doubt you can get a query which does not do that.
P.S. This query assumes no deleted records.
Query
SELECT *
FROM
MAIN_PRODUCT
WHERE
ROWNUM = FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
This generates a number between 1 and the row count of the table.
Oracle12c+ Query
SELECT *
FROM
MAIN_PRODUCT
WHERE
ROWNUM <= FLOOR(
(dbms_random.value * (SELECT COUNT(*) FROM MAIN_PRODUCT)) + 1
)
ORDER BY
ROWNUM DESC
FETCH FIRST ROW ONLY
The second query you have
SELECT * FROM (
SELECT * FROM MAIN_PRODUCT
WHERE ROWNUM <= CAST((SELECT COUNT(*) FROM MAIN_PRODUCT)*dbms_random.value AS INTEGER)
ORDER BY ROWNUM DESC
) WHERE ROWNUM = 1;
is excellent, except that it will get subsequent elements. dbms_random.value returns a real number between 0 and 1. Multiplying this by the number of rows gives you a genuinely random row number, and the bottleneck here is counting the number of rows rather than generating a random value for each row.
Proof
Consider the
0 <= x < 1
number. If we multiply it with n, we get
0 <= n * x < n
which is exactly what you need if you want to load a single element. The reason this is not widespread is that in many cases the performance issues are not felt, because there are only a few thousand records.
EDIT
If you would need k number of records, not just the first one, then it would be slightly difficult, however, still solvable. The algorithm would be something like this (I do not have Oracle installed to test it, so I only describe the algorithm):
randomize(n, k)
randomized <- empty_set
while (k > 0) do
newValue <- random(n)
n <- n - 1
k <- k - 1
//find out how many elements are lower than newValue
//increase newValue with that amount
//find out if newValue became larger than some values which were larger than new value
//increase newValue with that amount
//repeat until there is no need to increase newValue
while end
randomize end
If you randomize k elements from n, then you will be able to use those values in your filter.
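One way to read the pseudocode above is sketched below in Python. It is an interpretation, not a tested Oracle solution: the name random_subset is hypothetical, positions are zero-based, and each draw is treated as a rank among the positions not yet chosen, then shifted past every earlier pick at or below it.

```python
import bisect
import random


def random_subset(n, k, rng=random):
    """Pick k distinct positions from range(n); the result is sorted."""
    chosen = []  # kept sorted so the shifting step can stop early
    for already_picked in range(k):
        # Rank of the new pick among the positions that are still free.
        value = rng.randrange(n - already_picked)
        # Shift the rank past every earlier pick that is <= it.
        for c in chosen:
            if c <= value:
                value += 1
            else:
                break
        bisect.insort(chosen, value)
    return chosen


print(random_subset(20, 5))
```

The returned positions could then be used as the row numbers in the filter, for example as an IN list.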
The key to improving performance is to lessen the load of the ORDER BY.
If you know about how many rows match the conditions, then you can filter before the sort. For instance, the following takes about 1% of the rows:
SELECT *
FROM (SELECT *
FROM tabela
WHERE warunek AND dbms_random.value < 0.01
ORDER BY dbms_random.value
)
WHERE rownum = 1 ;
A variation is to calculate the number of matching values. Then randomly select a smaller sample. The following gets about 100 matching rows and then sorts them for the random selection:
SELECT a.*
FROM (SELECT *
FROM (SELECT a.*, COUNT(*) OVER () as cnt
FROM tabela a
WHERE warunek
) a
WHERE dbms_random.value < 100 / cnt
ORDER BY dbms_random.value
) a
WHERE rownum = 1 ;
Suppose you have a table T(A) with only positive integers allowed, like:
1,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18
In the above example, the result is 10. We always can use ORDER BY and DISTINCT to sort and remove duplicates. However, to find the lowest integer not in the list, I came up with the following SQL query:
select list.x + 1
from (select x from (select distinct a as x from T order by a)) as list, T
where list.x + 1 not in T limit 1;
My idea is to start a counter at 1 and check whether that counter is in the list: if it is not, return it; otherwise increment and look again. However, I have to start that counter at 1 and then increment. That query works in most cases, but there are some corner cases, such as 1 itself being missing. How can I accomplish that in SQL, or should I go in a completely different direction to solve this problem?
Because SQL works on sets, the intermediate SELECT DISTINCT a AS x FROM t ORDER BY a is redundant.
The basic technique of looking for a gap in a column of integers is to find where the current entry plus 1 does not exist. This requires a self-join of some sort.
Your query is not far off, but I think it can be simplified to:
SELECT MIN(a) + 1
FROM t
WHERE a + 1 NOT IN (SELECT a FROM t)
The NOT IN acts as a sort of self-join. This won't produce anything from an empty table, but should be OK otherwise.
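To see how the simplified query behaves on the example data and on the corner cases, here is a small sketch using Python's sqlite3 module. The helper name first_gap is hypothetical; the SQL is the query above, unchanged.

```python
import sqlite3

QUERY = "SELECT MIN(a) + 1 FROM t WHERE a + 1 NOT IN (SELECT a FROM t)"


def first_gap(values):
    """Load values into a fresh table t and run the gap query on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    return conn.execute(QUERY).fetchone()[0]


example = [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
print(first_gap(example))  # 10, the gap in the example list
print(first_gap([]))       # None: nothing comes back for an empty table
print(first_gap([2, 3]))   # 4: the query only looks upward from existing values
```

The last call shows the limitation: a missing 1 is not detected, because the query only considers successors of values already in the table.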
select min(y.a) as a
from
t x
right join
(
select a + 1 as a from t
union
select 1
) y on y.a = x.a
where x.a is null
It will work even in an empty table
SELECT min(t.a) - 1
FROM t
LEFT JOIN t t1 ON t1.a = t.a - 1
WHERE t1.a IS NULL
AND t.a > 1; -- exclude 0
This finds the smallest number greater than 1, where the next-smaller number is not in the same table. That missing number is returned.
This works even for a missing 1. There are multiple answers checking in the opposite direction. All of them would fail with a missing 1.
You can do the following, although you may also want to define a range - in which case you might need a couple of UNIONs
SELECT x.id+1
FROM my_table x
LEFT
JOIN my_table y
ON x.id+1 = y.id
WHERE y.id IS NULL
ORDER
BY x.id LIMIT 1;
You can always create a table with all of the numbers from 1 to X and then join that table with the table you are comparing. Then just find the TOP value in your SELECT statement that isn't present in the table you are comparing
SELECT TOP 1 table_with_all_numbers.number, table_with_missing_numbers.number
FROM table_with_all_numbers
LEFT JOIN table_with_missing_numbers
ON table_with_missing_numbers.number = table_with_all_numbers.number
WHERE table_with_missing_numbers.number IS NULL
ORDER BY table_with_all_numbers.number ASC;
In SQLite 3.8.3 or later, you can use a recursive common table expression to create a counter.
Here, we stop counting when we find a value not in the table:
WITH RECURSIVE counter(c) AS (
SELECT 1
UNION ALL
SELECT c + 1 FROM counter WHERE c IN t)
SELECT max(c) FROM counter;
(This works for an empty table or a missing 1.)
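The recursive counter can be tried directly from Python, whose sqlite3 module normally bundles a recent enough SQLite. This is a sketch with a hypothetical helper name, lowest_missing, and the query copied from above.

```python
import sqlite3

QUERY = """
WITH RECURSIVE counter(c) AS (
    SELECT 1
    UNION ALL
    SELECT c + 1 FROM counter WHERE c IN t)
SELECT max(c) FROM counter
"""


def lowest_missing(values):
    """Return the lowest positive integer that is not in values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    return conn.execute(QUERY).fetchone()[0]


example = [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
print(lowest_missing(example))  # 10
print(lowest_missing([]))       # 1
print(lowest_missing([2, 3]))   # 1
```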
This query ranks (starting from rank 1) each distinct number in ascending order and selects the lowest rank that's less than its number. If no rank is lower than its number (i.e. there are no gaps in the table) the query returns the max number + 1.
select coalesce(min(number),1) from (
select min(cnt) number
from (
select
number,
(select count(*) from (select distinct number from numbers) b where b.number <= a.number) as cnt
from (select distinct number from numbers) a
) t1 where number > cnt
union
select max(number) + 1 number from numbers
) t1
http://sqlfiddle.com/#!7/720cc/3
Just another method, using EXCEPT this time:
SELECT a + 1 AS missing FROM T
EXCEPT
SELECT a FROM T
ORDER BY missing
LIMIT 1;
I've read about a few alternatives to MySQL's ORDER BY RAND() function, but most of the alternatives apply only where a single random result is needed.
Does anyone have any idea how to optimize a query that returns multiple random results, such as this:
SELECT u.id,
p.photo
FROM users u, profiles p
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY RAND()
LIMIT 18
UPDATE 2016
This solution works best using an indexed column.
Here is a simple example of an optimized query, benchmarked with 100,000 rows.
OPTIMIZED: 300ms
SELECT
g.*
FROM
table g
JOIN
(SELECT
id
FROM
table
WHERE
RAND() < (SELECT
((4 / COUNT(*)) * 10)
FROM
table)
ORDER BY RAND()
LIMIT 4) AS z ON z.id= g.id
Note about the limit amount: LIMIT 4 and 4 / COUNT(*). The 4s need to be the same number. Changing how many rows you return doesn't affect the speed much: the benchmarks at LIMIT 4 and LIMIT 1000 are the same, and LIMIT 10,000 took it up to 600ms.
Note about the join: randomizing just the id is faster than randomizing a whole row, since otherwise the entire row has to be copied into memory and then randomized. The join can be to any table that is linked to the subquery; it is there to prevent table scans.
Note about the WHERE clause: the count-based condition limits the number of rows being randomized. It takes a percentage of the rows and sorts those rather than the whole table.
Note about the subquery: if you add joins and extra WHERE conditions, you need to put them in both the subquery and the sub-subquery to get an accurate count and pull back the correct data.
UNOPTIMIZED: 1200ms
SELECT
g.*
FROM
table g
ORDER BY RAND()
LIMIT 4
PROS
4x faster than ORDER BY RAND(). This solution can work with any table with an indexed column.
CONS
It gets a bit complex with complex queries: the joins and conditions have to be maintained in two places in the subqueries.
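The pre-filter idea can be tried out with Python's sqlite3 module. The sketch below makes several assumptions: the table is called g, SQLite has no RAND() so Python's random generator is registered under that name, and the helper sample_ids and its oversample parameter (the factor of 10 in the query above) are hypothetical.

```python
import random
import sqlite3


def sample_ids(conn, k, oversample=10, rng=random):
    """Pick k random ids from table g by sorting only a small random subset."""
    # SQLite has no RAND(), so expose a Python generator under that name.
    conn.create_function("rand", 0, rng.random)
    rows = conn.execute(
        """
        SELECT g.id
        FROM g
        JOIN (SELECT id
              FROM g
              WHERE rand() < (SELECT (? * 1.0 / COUNT(*)) * ? FROM g)
              ORDER BY rand()
              LIMIT ?) AS z ON z.id = g.id
        """,
        (k, oversample, k),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE g (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO g VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 10001)]
)
print(sample_ids(conn, 4))
```

Because the pre-filter is probabilistic, it can in principle let through fewer than k rows; the oversampling factor exists to make that very unlikely.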
Here's an alternative, but it is still based on using RAND():
SELECT u.id,
p.photo,
ROUND(RAND() * x.m_id) 'rand_ind'
FROM users u,
profiles p,
(SELECT MAX(t.id) 'm_id'
FROM USERS t) x
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY rand_ind
LIMIT 18
This is slightly more complex, but gave a better distribution of rand_ind values:
SELECT u.id,
p.photo,
FLOOR(1 + RAND() * x.m_id) 'rand_ind'
FROM users u,
profiles p,
(SELECT MAX(t.id) - 1 'm_id'
FROM USERS t) x
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY rand_ind
LIMIT 18
It is not the fastest, but it is faster than the common ORDER BY RAND() way:
ORDER BY RAND() is not so slow when you use it to select only an indexed column. You can take all your ids in one query like this:
SELECT id
FROM testTable
ORDER BY RAND();
to get a sequence of random ids, and JOIN the result to another query with other SELECT or WHERE parameters:
SELECT t.*
FROM testTable t
JOIN
(SELECT id
FROM `testTable`
ORDER BY RAND()) AS z ON z.id= t.id
WHERE t.isVisible = 1
LIMIT 100;
in your case it would be:
SELECT u.id, p.photo
FROM users u
JOIN profiles p ON p.memberid = u.id
JOIN
(SELECT id
FROM users
ORDER BY RAND()) AS z ON z.id = u.id
WHERE p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
LIMIT 18
It's a very blunt method and may not be suitable for very big tables, but it's still faster than the common RAND(). I got 20 times faster execution time fetching 3000 random rows out of almost 400000.
Order by rand() is very slow on large tables,
I found the following workaround in a php script:
Select min(id) as min, max(id) as max from table;
Then do random in php
$rand = rand($min, $max);
Then
'Select * from table where id>='.$rand.' order by id limit 1';
Seems to be quite fast....
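The same workaround can be sketched in Python with the sqlite3 module. The table name items and the helper random_row are hypothetical. The lookup uses >= together with ORDER BY id so that the row with the highest id stays reachable and the nearest following row is the one returned.

```python
import random
import sqlite3


def random_row(conn, rng=random):
    """Pick a row by drawing a random id between min(id) and max(id)."""
    lo, hi = conn.execute("SELECT min(id), max(id) FROM items").fetchone()
    if lo is None:
        return None  # empty table
    pivot = rng.randint(lo, hi)
    # First row at or after the pivot; the primary key index makes this cheap.
    return conn.execute(
        "SELECT * FROM items WHERE id >= ? ORDER BY id LIMIT 1", (pivot,)
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (5, "b"), (9, "c")])
print(random_row(conn))
```

If the ids have gaps, the result is not uniform: a row that follows a large gap is chosen more often than its neighbours.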
I ran into this today and was trying to use 'DISTINCT' along with JOINs, but was getting duplicates I assume because the RAND was making each JOINed row distinct. I muddled around a bit and found a solution that works, like this:
SELECT DISTINCT t.id,
t.photo
FROM (SELECT u.id,
p.photo,
RAND() as rand
FROM users u, profiles p
WHERE p.memberid = u.id
AND p.photo != ''
AND (u.ownership=1 OR u.stamp=1)
ORDER BY rand) t
LIMIT 18
Create a column or join to a select with random numbers (generated in for example php) and order by this column.
The solution I am using is also posted in the link below:
How can i optimize MySQL's ORDER BY RAND() function?
I am assuming your users table is going to be larger than your profiles table, if not then it's 1 to 1 cardinality.
If so, I would first do a random selection on user table before joining with profile table.
First do selection:
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
Then from this pool, pick out random rows through calculated probability. If your table has M rows and you want to pick out N random rows, the probability of random selection should be N/M. Hence:
SELECT *
FROM
(
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
) as U
WHERE
rand() <= $limitCount / (SELECT count(*) FROM users WHERE users.ownership = 1 OR users.stamp = 1)
Where N is $limitCount and M is the subquery that calculates the table row count. However, since we are working on probability, it is possible to have LESS than $limitCount of rows returned. Therefore we should multiply N by a factor to increase the random pool size.
i.e:
SELECT *
FROM
(
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
) as U
WHERE
rand() <= $limitCount * $factor / (SELECT count(*) FROM users WHERE users.ownership = 1 OR users.stamp = 1)
I usually set $factor = 2. You can set the factor to a lower value to further reduce the random pool size (e.g. 1.5).
At this point, we would have already limited a M size table down to roughly 2N size. From here we can do a JOIN then LIMIT.
SELECT *
FROM
(
SELECT *
FROM
(
SELECT *
FROM users
WHERE users.ownership = 1 OR users.stamp = 1
) as U
WHERE
rand() <= $limitCount * $factor / (SELECT count(*) FROM users WHERE users.ownership = 1 OR users.stamp = 1)
) as randUser
JOIN profiles
ON randUser.id = profiles.memberid AND profiles.photo != ''
LIMIT $limitCount
On a large table, this query will outperform a normal ORDER by RAND() query.
Hope this helps!
SELECT
a.id,
mod_question AS modQuestion,
mod_answers AS modAnswers
FROM
b_ask_material AS a
INNER JOIN ( SELECT id FROM b_ask_material WHERE industry = 2 ORDER BY RAND( ) LIMIT 100 ) AS b ON a.id = b.id
I had the same issue today and fixed it by using LIMIT and OFFSET.
You can do it by iterating 18 times over a random set of offsets.
To avoid duplicates, you can create your random set of offsets like this in Python: sample(range(0, rows_count), random_rows_count)
Then, for each offset, get the corresponding row using OFFSET and LIMIT 1 and add it to a list.
rows_count can be cached to avoid the cost of counting the total number of rows on every request.
EDIT: this is actually what this post proposes: https://stackoverflow.com/a/40398306/3045926
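A minimal sketch of the offset approach, using Python's sqlite3 module with a hypothetical items table and helper name. Offsets are zero-based here, so they are drawn from range(rows_count), and the query is ordered by id so that each offset always refers to the same row.

```python
import random
import sqlite3


def rows_at_random_offsets(conn, k, rng=random):
    """Fetch k distinct rows by drawing k distinct random offsets."""
    (rows_count,) = conn.execute("SELECT COUNT(*) FROM items").fetchone()
    # sample() never repeats an offset, so no row is returned twice.
    offsets = rng.sample(range(rows_count), min(k, rows_count))
    picked = []
    for offset in offsets:
        row = conn.execute(
            "SELECT id FROM items ORDER BY id LIMIT 1 OFFSET ?", (offset,)
        ).fetchone()
        picked.append(row[0])
    return picked


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items VALUES (?)", [(i * 10,) for i in range(1, 51)])
print(rows_at_random_offsets(conn, 18))
```

This issues one query per row, and each OFFSET lookup still has to skip the preceding rows, so it suits small result counts better than large ones.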