Optimized way to get x random rows satisfying given criteria in MySQL

I need to get x rows from a Database Table which satisfy some given criteria.
I know that we can get random rows from MySQL using ORDER BY RAND ().
SELECT * FROM `vids` WHERE `cat` = n ORDER BY RAND() LIMIT x
I am looking for the most optimized way to do the same (low usage of system resources is the main priority; query speed is the next priority). Also, in the table design, should I make `cat` an INDEX?

I'm trying to think of how to do this too. My thinking at the moment is the following three alternatives:
1) select random rows ignoring criteria, then throw out ones that do not match at the application level and select more random rows if needed. This method will be effective if your criteria matches lots of rows in your table, perhaps 20% or more (need to benchmark)
2) select rows following criteria, and choosing a row based on a random number between 1 and count(*) (random number determined in the application). This will be effective if the data matching the criteria is evenly distributed, but will fail terribly if for example you are selecting a date range, and the majority of random numbers will fall upon records outside this range.
3) my current favourite, but also the most work. For every combination of criteria you intend to use to select a random record, you insert a record into a special table for that criteria. You then select random records from the special table, and follow them back to your data. For example, you might have a table like this:
Table cat: name, age, eye_colour, fur_type
If you want to be able to select random cats with brown fur, then you need a table like this:
Table cats_with_brown_fur: id (autonumber), cat_fk
You can then select a random record from this table based on the autonumber id, and it will be fast, and will produce evenly distributed random results. But indeed, if you select from many sets of criteria, you will have some overheads on maintaining these tables.
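As a rough illustration of option 3, here is a minimal MySQL sketch for the cat example; the column types and the cat.id primary key are assumptions, and it relies on the autonumber ids in the special table staying consecutive (gaps from deletes are part of the maintenance overhead mentioned above).
CREATE TABLE cats_with_brown_fur (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, -- gap-free autonumber
    cat_fk INT UNSIGNED NOT NULL                              -- points back to cat.id
);

-- Pick a random autonumber between 1 and the row count, then follow it back to the data.
SELECT c.*
FROM cats_with_brown_fur b
JOIN cat c ON c.id = b.cat_fk
WHERE b.id = (SELECT FLOOR(1 + RAND() * COUNT(*)) FROM cats_with_brown_fur);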
That's my current take on it, anyway. Good luck

Order by Rand() is a bad idea.
Here's a better solution:
How can i optimize MySQL's ORDER BY RAND() function?
Google is your friend; a lot of people have explained it better than I ever could.
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
http://www.phpbuilder.com/board/showthread.php?t=10338930
http://www.paperplanes.de/2008/4/24/mysql_nonos_order_by_rand.html
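The technique those links describe boils down to picking a random id yourself and seeking to it instead of shuffling the whole table. A rough MySQL sketch, assuming vids has an auto-increment id column without big gaps; note that it returns x consecutive matching rows starting at a random point, which is usually acceptable (run it x times with LIMIT 1 if you need independent picks).
SELECT v.*
FROM vids v
JOIN (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM vids)) AS rand_id) r
    ON v.id >= r.rand_id
WHERE v.cat = n
ORDER BY v.id
LIMIT x;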

Related

Quickly find a record appearing more than twice in a table

The title pretty much says it all: using Oracle SQL, I'd like to get, as quickly as possible, three records that share an ID from a very large table. The rows are not duplicates; they share one ID (rID) but differ in another (mID).
One approach I know I could take (and which would be very slow) would be to load the first, say, 1000 records into a C# program, then execute a COUNT query for each ID until I hit one with 3 records, and return that ID. I know this is a terrible approach, but it should give an idea of what I want to get out of this.
I've tried using GROUP BY, and this would work but would be unacceptably slow. I don't care about the state of the rest of the table; I just need a single ID that has three records. Ideally I'd do something like a GROUP BY that stops after finding the first ID with three or more records and just returns that one. There are over a million records in the table, so efficiency is important.
What you describe translates to:
select the_id
from the_table
group by the_id
having count(*) >= 3
fetch first row only;
This should be as fast as it gets. You can help Oracle by providing an index on the id. That's about it.
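For example (the index name is only illustrative):
CREATE INDEX the_table_id_ix ON the_table (the_id);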

What's an Efficient Alternative to LIMIT?

My question basically is: How can I tell my database to do a seq scan and STOP after the first match to my WHERE condition?
Assuming I want to find the first event of a certain type, I could write the following query:
select *
from installs
where country = 'China'
order by install_date
limit 1
The problem here is that, according to the order of operations, the engine would scan the whole table and generate a dataset matching my filter, then sort this dataset (at an immense cost), and only then return the first row.
I could of course filter by specific dates, but let's assume I don't know the period to filter by - how can I optimize this type of query in Amazon Redshift (something in the where clause maybe)?
Redshift's general strategy is to do a lot of scanning, but parallelize it. Any case that involves getting a single row is not going to be ideal. That said, you can do four things:
1. Reduce scanning, to a point
If country will always be the field filtered on, set the sortkey for the table to a compound sortkey on country first.
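A sketch of what that could look like (column list abbreviated, names taken from the question); putting install_date second in the sortkey also helps the ORDER BY within each country:
CREATE TABLE installs (
    pk           BIGINT,
    country      VARCHAR(64),
    install_date TIMESTAMP
    -- ... remaining columns ...
)
COMPOUND SORTKEY (country, install_date);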
2. Eliminate the need for sorting
A more efficient way to do ORDER BY x LIMIT 1 is often MIN (or MAX, when the sort is descending).
Then try
SELECT *
FROM installs
WHERE pk = (
SELECT MIN(pk) -- or MIN(install_date), if install date is unique
FROM installs
WHERE country = 'China'
)
3. Tailor the selected columns between row oriented and columnar
Asking a columnar database like Redshift to select * incurs costs for each column. Try selecting only the columns you need.
4. Add more nodes, so each node does less scanning
(Make sure the data is not set to distribution style all)
If you remove the ORDER BY, then it can work efficiently.
The requirement to sort the results means it needs to examine all rows where the country is China, which is not efficient for returning one item.
The where country = 'China' clause is efficient if the table's SORTKEY is country, since Redshift can skip over any storage blocks that do not contain the desired value. This will be highly efficient if relatively few rows match.
If you are frequently querying for one-row results, then it might be worth storing such information in a separate table for faster lookup. The value could be calculated every day, or every hour, as necessary.
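For instance, a small pre-computed table along these lines (names are only illustrative), rebuilt by a scheduled job, turns the original query into a single-row lookup:
CREATE TABLE first_install_by_country AS
SELECT country, MIN(install_date) AS first_install_date
FROM installs
GROUP BY country;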

Should I use a unique ID for a row in a junction table?

I am using SQL Server 2008.
A while back, I asked the question "should I use RecordID in a junction table". The tables would look like this:
// Images
ImageID // pk
// Persons
PersonID // pk
// Images_Persons
RecordID // pk
ImageID // fk
PersonID // fk
I was strongly advised NOT to use RecordID because it's useless in a table where the two IDs create a unique combination, meaning there will be no duplicate records.
Now, I am trying to find a random record in the junction table to create a quiz. I want to pull the first id and see if someone can match the second id. Specifically, I grab a random image and display it with three possible choices of persons.
The following query works, but I've read quite a bit of negativity suggesting that it's very slow. My database might have 10,000 records, so I don't think that matters much. I've also read that the values generated aren't truly random.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Should I add the RecordID column or not? Is there a better way to find a random record in this case?
Previous questions for reference
Should I use "RecordID" as a column name?
SQL - What is the best table design to store people as musicians and artists?
NEWID is random enough and probably best
10k rows is peanuts
You don't need a surrogate key for a junction (link, many-many) table
Edit: in case you want to prematurely optimise...
You could ignore this and read these answers from @Mitch Wheat instead. But with just 10k rows, your development time will be longer than any saved execution time.
Efficiently select random rows from large resultset with LINQ (ala TABLESAMPLE)
Efficiently randomize (shuffle) data in Sql Server table
Personally, I don't think that having the RecordID column should be advised AGAINST. Rather I'd advise that often it is UNNECESSARY.
There are cases where having a single value to identify a row makes for simpler code. But they're at the cost of additional storage, often additional indexes, etc. The overheads realistically are small, but so are the benefits.
In terms of the selection of random records, the existence of a single unique identifier can make the task easier if the identifiers are both sequential and consecutive.
The reason I say this is because your proposed solution requires the assignment of NEWID() to every record, and the sorting of all records to find the first one. As the table size grows this operation grows, and can become relatively expensive. Whether it's expensive enough to be worth optimising depends on whatever else is happening, how often, etc.
Where there are sequential, consecutive unique identifiers, however, one can choose a random value between MIN(id) and MAX(id) and then SEEK that value out. The requirement that all values are consecutive, however, is often a constraint too far; you're never allowed to delete a value mid-table, for example...
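For completeness, a minimal sketch of that seek-based idea, assuming RecordID exists and is a consecutive, gap-free integer (which, as just noted, is the hard part to guarantee):
DECLARE @min_id INT, @max_id INT, @pick INT;

SELECT @min_id = MIN(RecordID), @max_id = MAX(RecordID)
FROM Images_Persons;

-- Random integer in [@min_id, @max_id]; RAND() returns a value in [0, 1).
SET @pick = @min_id + CAST(RAND() * (@max_id - @min_id + 1) AS INT);

SELECT *
FROM Images_Persons
WHERE RecordID = @pick;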
To overcome this, and depending on indexes, you may find the following approach useful.
DECLARE @max_id INT;

SELECT @max_id = COUNT(*)
FROM Images_Persons;

SELECT
    *
FROM
(
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS id
    FROM
        Images_Persons
)
AS data
WHERE
    data.id = CAST(@max_id * RAND() + 1 AS INT)
-- Assuming that `ImageID, PersonID` is the clustered index.
A downside here is that RAND() is notoriously poor at being truly random. Yet it is normally perfectly suitable if executed at a random time relative to any other call to RAND().
Consider what you've got.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Not truly random? Excluding the 'truly random is impossible' bit, you're probably right - I believe that there are patterns in generated uniqueidentifiers. But you should test this yourself. It'd be simple; just create a table with 1 to 100 in it, order by newid() a lot of times, and look at the results. If it's random 'enough' for you (which it probably will be, for a quiz) then it's good enough.
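A quick sketch of that test (the temp table is made up; run the final SELECT many times, or wrap it in a loop that tallies the results):
CREATE TABLE #numbers (n INT PRIMARY KEY);

DECLARE @i INT;
SET @i = 1;
WHILE @i <= 100
BEGIN
    INSERT INTO #numbers (n) VALUES (@i);
    SET @i = @i + 1;
END

-- Each execution shuffles 1..100; check how evenly each value reaches the top.
SELECT TOP (1) n FROM #numbers ORDER BY NEWID();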
Very slow? I wouldn't worry about that. I'd be very surprised if the newid() is slower than reading the record from the table. But again, test and benchmark.
I'd be happy with the solution you have, pending tests if you're concerned about it.
I've always used order by newid().

Fast Way To Estimate Rows By Criteria

I have seen a few posts detailing fast ways to "estimate" the number of rows in a given SQL table without using COUNT(*). However, none of them seem to really solve the problem if you need to estimate the number of rows which satisfy given criteria. I am trying to get a way of estimating the number of rows which satisfy given criteria, but the information for these criteria is scattered around two or three tables. Of course a SELECT COUNT(*) with the NOLOCK hint and a few joins will do, and I can afford under- or over-estimating the total records. The problem is that this kind of query will be running every 5-10 minutes or so, and since I don't need the actual number, only an estimate, I would like to trade off accuracy for speed.
The solution, if any, may be "SQL Server"-specific. In fact, it must be compatible with SQL Server 2005. Any hints?
There is no easy way to do this. You can get an estimate for the total number of rows in a table, e.g. from system catalog views.
But there's no way to do this for a given set of criteria in a WHERE clause - either you would have to keep counts for each set of criteria and the values, or you'd have to use black magic to find that out. The only place that SQL Server keeps something that would go into that direction is the statistics it keeps on the indices. Those will have certain information about what kind of values occur how frequently in an index - but I quite honestly don't have any idea if (and how) you could leverage the information in the statistics in your own queries.
If you really must know the number of rows matching a certain criteria, you need to do a count of some sort - either a SELECT COUNT(*) FROM dbo.YourTable WHERE (yourcriteria) or something else.
Something else could be something like this:
wrap your SELECT statement into a CTE (Common Table Expression)
define a ROW_NUMBER() in that CTE ordering your data by some column (or set of columns)
add a second ROW_NUMBER() to that CTE that orders your data by the same column (or columns) - but in the opposite direction (DESC vs. ASC)
Something like this:
;WITH YourDataCTE AS
(
SELECT (list of columns you need),
ROW_NUMBER() OVER(ORDER BY <your column>) AS 'RowNum',
ROW_NUMBER() OVER(ORDER BY <your column> DESC) AS 'RowNum2'
FROM
dbo.YourTable
WHERE
<your conditions here>
)
SELECT *
FROM YourDataCTE
Doing this, you would get the following effect:
your first row in your result set will contain your usual data columns
the first ROW_NUMBER() will contain the value 1
the second ROW_NUMBER() will contain the total number of row that match that criteria set
It's surprisingly good at dealing with small to mid-size result sets. I haven't yet tested how it holds up with really large result sets, but it might be something to investigate and see whether it works.
Possible solutions:
If the matching count is small compared to the total number of rows in the table, then adding indexes that cover the WHERE condition will help, and the query will be very fast.
If the result number is close to the total number of rows in the table, indexes will not help much. You could implement a trigger that maintains a 'conditional count table': whenever a row matching the condition is added you increment the value in that table, and when such a row is deleted you decrement it. You then query this small 'summary count table' instead.
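A rough sketch of that trigger idea for SQL Server 2005 (the counter table, trigger name, and the placeholder condition are all assumptions, mirroring the <your conditions here> placeholder used above):
-- One-row table holding the current count of rows matching the condition.
CREATE TABLE dbo.ConditionalCount (MatchingRows INT NOT NULL);
INSERT INTO dbo.ConditionalCount (MatchingRows) VALUES (0);
GO

CREATE TRIGGER trg_YourTable_ConditionalCount
ON dbo.YourTable
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE dbo.ConditionalCount
    SET MatchingRows = MatchingRows
        + (SELECT COUNT(*) FROM inserted WHERE <your conditions here>)
        - (SELECT COUNT(*) FROM deleted  WHERE <your conditions here>);
END;
GO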

Selecting 'highest' X rows without sorting

I've got a table with a huge amount of data - let's say 10 GB of rows, containing a bunch of crap. I need to select, for example, X rows (X is usually below 10) with the highest values in the amount column.
Is there any way to do it without sorting the whole table? Sorting this amount of data is extremely time-expensive. I'd be OK with one scan through the whole table that picks out the X highest values and leaves the rest untouched. I'm using SQL Server.
Create an index on amount; then SQL Server can select the top 10 rows from that index and do bookmark lookups to retrieve the missing columns.
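For example (the index name is made up; a descending key matches the TOP ... DESC query below):
CREATE INDEX IX_myTable_Amount ON myTable (Amount DESC);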
SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
If it is indexed, the query optimizer should use the index.
If not, I do not see how one could avoid scanning the whole thing...
Whether an index is useful or not depends on how often you run that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ratio.