The table I have is huge, about 100+ million entries, and it is ordered by default by column A. There can be many rows with the same value of A, and A increases from 0 up to some large number. I tried TABLESAMPLE, but it does not select a good number of rows from each A value; it skips some of them, or maybe I am not using it well. So I would like to select the same number of rows from each A value, and I would like the total number of selected rows to be some target, say 10 million; let's call it B.
While it's not exactly clear to me what you need to achieve, when I have needed a large sample subset that is very well distributed between Parent and/or common Attribute values, I have done it like this:
SELECT *
FROM YourTable
WHERE (YourID % 10) = 3
This also has the advantage that you can get another completely different sample just by changing the "3" to another digit. Plus you can change the sub-sample size by adjusting the "10".
You can make use of NEWID():
SELECT TOP 100 *
FROM YourTable
ORDER BY NEWID()
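If the goal from the original question is to pull roughly the same number of rows from each value of A (so the total lands near B), here is one hedged sketch building on NEWID(); the table and column names are placeholders, and the per-group limit would be roughly B divided by the number of distinct A values:
;WITH Ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY A ORDER BY NEWID()) AS rn
    FROM YourTable
)
SELECT *
FROM Ranked
WHERE rn <= 100;  -- rows per A value: roughly B / COUNT(DISTINCT A)
This caps every A value at the same count, but it does generate a NEWID() per row, so on 100+ million rows it is not cheap.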
@RBarryYoung's solution is right and generic; it works for any evenly distributed key, such as an ID sequence (or any auto-increment column). Sometimes, though, your distribution is not even, or you can run into performance issues (SQL Server has to scan all index entries to evaluate the WHERE clause).
If either of those affects your problem, consider the built-in T-SQL TOP clause, which may suit your needs:
SELECT TOP (30) PERCENT *
FROM YourTable;
We are attempting to pull a semi-random row from Oracle. (We don't need a perfectly random row that meets rigorous statistical scrutiny, but we would like something that has a chance of returning any row in the table, even if there is some degree of skew.)
We are using this approach:
SELECT PERSON_ID FROM ENCOUNTER SAMPLE(0.0001) WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020 AND ROWNUM = 1
This approach appears to be giving us just one random result each time we run it.
However, according to answers to this question, this approach gives results from the beginning of the table far more commonly.
How commonly? If that statement is true, then how much more commonly are values taken from the top of the table? Our typical table has tens of millions of rows (occasionally billions). Is there a simple heuristic or rough estimate of the skew in the distribution we can expect?
We are asking about the skew because other methods aren't fast enough for our use case. We are avoiding ORDER BY because the source tables can be so large (i.e. billions of rows) that the reporting server runs for hours or times out before we get an answer. Our constraint is therefore that we need approaches like SAMPLE that respond with little database overhead.
SAMPLE basically goes through the table in order and randomly selects rows. The issue is the ROWNUM, not the SAMPLE.
The solution is to use sample and then randomly sort:
SELECT p.*
FROM (SELECT PERSON_ID
      FROM ENCOUNTER SAMPLE(0.0001)
      WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020
      ORDER BY dbms_random.value
     ) p
WHERE ROWNUM = 1
Just for fun, here is an alternative way to select a single, uniformly distributed row out of a (uniformly distributed) "small" sample of rows from the table.
Suppose the table has millions or billions of rows, and we use the sample clause to select only a small, random (and presumably uniformly distributed) sample of rows. Let's say the sample size is 200 rows. How can we select a single row out of those 200, in such a way that the selection is not biased?
As the OP explained, if we always select the first row generated in the sample, that is very likely to be biased. Gordon Linoff has shown a perfectly valid way to fix that. Here I describe a different approach, which is even more efficient: it only generates a single random number, and it does not need to order the 200 rows. (Admittedly that is not a lot of overhead, but it may still matter if the query must be run many times.)
Namely: Given any 200 rows, generate a (hopefully uniformly distributed) single integer between 1 and 200. Also, as the 200 rows are generated, capture ROWNUM at the same time. Then it's as simple as selecting the row where ROWNUM = <the randomly generated integer>
Unfortunately, the sample clause doesn't generate a fixed number of rows, even if the table and the percentage sampled are fixed (and even if stats on the table are current). So the solution is just slightly more complicated - first I generate the sample, then I count how many rows it contains, and then I select the one row we want.
The output will include a column for the "random row number"; if that is an issue, just list the columns from the base table instead of * in the final query. I assume the name of the base table is t.
with
  p as ( select t.*, rownum as rn
         from t sample(0.0001) ),
  r as ( select trunc(dbms_random.value(1, (select count(*) from p) + 1)) as rn
         from dual )
select p.*
from p join r on p.rn = r.rn
;
It's not accurate to say "[SAMPLE] gives results from the beginning of the table far more commonly," unless you're using SAMPLE wrong. However, there are some unusual cases where earlier rows are favored if those early rows are much larger than subsequent rows.
SAMPLE Isn't That Bad
If you use a large sample size, the first rows returned do appear to come from the "first" rows of the table. (But tables are unordered, and while I observe this behavior on my machine, there is no guarantee you will always see it.)
The below query does seem to do a good job of picking random rows, but not if you only look at the first N rows returned:
select * from test1 sample(99);
SAMPLE Isn't Perfect Either
The below test case shows how the row size can skew the results. If you insert 10,000 large rows and then insert 10,000 small rows, a small SAMPLE will almost always only return large rows.
--drop table test1 purge;
create table test1(a varchar2(5), b varchar2(4000));
--Insert 10K large records.
insert into test1 select 'large', lpad('A', 4000, 'A') from dual connect by level <= 10000;
--Insert 10K small records.
insert into test1 select 'small', null from dual connect by level <= 10000;
--Select about 10 rows. Notice that they are almost always a "LARGE" row.
select * from test1 sample (0.1);
However, the skew completely disappears if you insert the small rows before the large rows.
I think these results imply that SAMPLE is based on the distribution of data in blocks (8 KB of data), and not strictly random per row. If small rows are "hidden" in a physically small part of the table, they are much less likely to show up. However, Oracle always seems to check the first part of the table, and if the small rows exist there, then the sample is evenly distributed. The rows have to be hiding very well to be missed.
The real answer depends on Oracle's implementation, which I don't have access to. Hopefully this test case will at least give you some ideas to play around and determine if SAMPLE is random enough for your needs.
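Along those lines, Oracle also exposes block sampling explicitly, and SEED makes a sample repeatable. A quick hedged sketch against the same test1 table (the percentages are arbitrary) that makes it easier to compare block-level and row-level behavior:
--Explicit block-level sampling; compare the LARGE/SMALL mix with plain row sampling.
select a, count(*) from test1 sample block (1) group by a;
--SEED gives a repeatable sample across runs.
select a, count(*) from test1 sample (1) seed (42) group by a;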
So in QlikView, I am trying to make a conditional that would only display the table if it has fewer than 50,000 rows. How would I go about doing this?
The table I am working with is used for the user to create their own reports. They are able to choose what fields they want to see and are able to see those fields next to a calculated value column. I have tried using the RowNo() and NoOfRows() functions but was not able to get anywhere with that. If you have any other ideas, I would appreciate it.
Thanks
Consider that the number of rows will be determined by the number of distinct entries for your dimension for the table. So you could use:
Count(Distinct myDimension) < 50000
Where myDimension is the dimension of your table (or some concatenation of many dimensions if you have more than one dimension in your table).
Chris J's answer should be faster than the Count(Distinct ...) above, since it does not require runtime elimination of duplicates, but depending on your data you may need to create an extra table with a resident load to hold the counter correctly.
In my experience however, users prefer to have a logical limit on their data (something like being forced to select a week) rather than having a fixed limit to the number of records.
You can enforce this kind of limit with a condition like
GetSelectedCount(myWeekField) <= 1
As part of your load script, you should add an additional field to the table that you are loading:
,1 as RecordSum
Then set a variable in the script
set vRecordSum = sum(RecordSum)
Then, on the straight table, set the Conditional show option to the formula $(vRecordSum) < 50000.
One easy way is to use this as the condition:
Sum(1) < 50000
Sum(1) should represent the number of rows.
I have seen a few posts detailing fast ways to "estimate" the number of rows in a given SQL table without using COUNT(*). However, none of them seems to solve the problem when you need to estimate the number of rows that satisfy given criteria. I am trying to estimate the number of rows that satisfy given criteria, but the information for those criteria is scattered across two or three tables. Of course a SELECT COUNT(*) with the NOLOCK hint and a few joins would do, and I can afford under- or over-estimating the total. The problem is that this query will run every 5-10 minutes or so, and since I don't need the actual number, only an estimate, I would like to trade off accuracy for speed.
The solution, if any, may be "SQL Server"-specific. In fact, it must be compatible with SQL Server 2005. Any hints?
There is no easy way to do this. You can get an estimate for the total number of rows in a table, e.g. from system catalog views.
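For the table-level estimate, a minimal sketch against the catalog views (the table name is a placeholder; sys.partitions is available from SQL Server 2005 onward, and the row counts it reports are approximate):
SELECT SUM(p.rows) AS ApproxRowCount
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.YourTable')
  AND p.index_id IN (0, 1);  -- 0 = heap, 1 = clustered index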
But there's no way to do this for an arbitrary set of criteria in a WHERE clause: either you would have to keep counts for each set of criteria and values, or you'd have to use black magic to find that out. The only place where SQL Server keeps something along those lines is the statistics it maintains on indexes. Those contain information about which values occur how frequently in an index, but I quite honestly don't know whether (and how) you could leverage that information in your own queries.
If you really must know the number of rows matching a certain criteria, you need to do a count of some sort - either a SELECT COUNT(*) FROM dbo.YourTable WHERE (yourcriteria) or something else.
Something else could be something like this:
wrap your SELECT statement into a CTE (Common Table Expression)
define a ROW_NUMBER() in that CTE ordering your data by some column (or set of columns)
add a second ROW_NUMBER() to that CTE that orders your data by the same column (or columns) - but in the opposite direction (DESC vs. ASC)
Something like this:
;WITH YourDataCTE AS
(
    SELECT
        (list of columns you need),
        ROW_NUMBER() OVER(ORDER BY <your column>) AS 'RowNum',
        ROW_NUMBER() OVER(ORDER BY <your column> DESC) AS 'RowNum2'
    FROM
        dbo.YourTable
    WHERE
        <your conditions here>
)
SELECT *
FROM YourDataCTE
Doing this, you would get the following effect:
your first row in your result set will contain your usual data columns
the first ROW_NUMBER() will contain the value 1
the second ROW_NUMBER() will contain the total number of rows that match that set of criteria
It's surprisingly good at dealing with small to mid-size result sets. I haven't yet tested how it holds up with really large result sets, but it might be worth investigating to see if it works for you.
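If only the count itself is needed, the same trick can be collapsed to a single row. A sketch reusing the placeholders above; it assumes <your column> is unique, so the two numberings are exact mirrors and RowNum + RowNum2 - 1 equals the match count on every row:
;WITH YourDataCTE AS
(
    SELECT
        ROW_NUMBER() OVER(ORDER BY <your column>) AS RowNum,
        ROW_NUMBER() OVER(ORDER BY <your column> DESC) AS RowNum2
    FROM dbo.YourTable
    WHERE <your conditions here>
)
SELECT TOP (1) RowNum + RowNum2 - 1 AS MatchingRowCount
FROM YourDataCTE;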
Possible solutions:
If the number of matching rows is small in comparison to the total number of rows in the table, then adding indexes that cover the WHERE condition will help, and the query will be very fast.
If the number of matching rows is close to the total number of rows in the table, indexes will not help much. In that case you could implement a trigger that maintains a 'conditional count table': whenever a row matching the condition is added you increment the value in that table, and when such a row is deleted you decrement it. Then you query this small summary table instead.
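A minimal sketch of that trigger idea, assuming a hypothetical dbo.Orders table where the criterion is Status = 'Open' (all names here are illustrative, not from the question):
-- One-row summary table holding the pre-computed count.
CREATE TABLE dbo.OpenOrderCount (MatchCount int NOT NULL);
INSERT INTO dbo.OpenOrderCount (MatchCount)
SELECT COUNT(*) FROM dbo.Orders WHERE Status = 'Open';
GO
CREATE TRIGGER dbo.trg_Orders_MaintainCount
ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Add rows that now match, subtract rows that matched before the change.
    UPDATE dbo.OpenOrderCount
    SET MatchCount = MatchCount
        + (SELECT COUNT(*) FROM inserted WHERE Status = 'Open')
        - (SELECT COUNT(*) FROM deleted  WHERE Status = 'Open');
END;
GO
-- The periodic check becomes a single-row read instead of a COUNT(*):
SELECT MatchCount FROM dbo.OpenOrderCount;
Since the question says the criteria span two or three tables, in practice a trigger would be needed on each table involved, which is where this approach gets costly to maintain.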
I've got a table with a huge amount of data, let's say 10 GB of rows, containing a bunch of crap. I need to select, for example, the X rows (X is usually below 10) with the highest values in the amount column.
Is there any way to do it without sorting the whole table? Sorting that amount of data is extremely time-expensive. I'd be OK with one scan through the whole table that picks out the X highest values and leaves the rest untouched. I'm using SQL Server.
Create an index on amount; then SQL Server can select the top 10 rows from that index and do bookmark lookups to retrieve the missing columns.
SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
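For example, a hedged sketch of such an index (the name and the DESC ordering are just one reasonable choice):
CREATE NONCLUSTERED INDEX IX_myTable_Amount
    ON dbo.myTable (Amount DESC);
With this in place, the TOP 10 query above can read the first ten index entries and stop, instead of sorting the whole table.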
If it is indexed, the query optimizer should use the index.
If not, I do not see how one could avoid scanning the whole thing...
Whether an index is useful or not depends on how often you run that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ratio.
I need to get x rows from a Database Table which satisfy some given criteria.
I know that we can get random rows from MySQL using ORDER BY RAND().
SELECT * FROM `vids` WHERE `cat` = n ORDER BY RAND() LIMIT x
I am looking for the most optimized way to do the same (low usage of system resources is the main priority; query speed is the next priority). Also, in the table design, should I make 'cat' an INDEX?
I'm trying to think of how to do this too. My thinking at the moment is the following three alternatives:
1) select random rows ignoring criteria, then throw out ones that do not match at the application level and select more random rows if needed. This method will be effective if your criteria matches lots of rows in your table, perhaps 20% or more (need to benchmark)
2) select rows following criteria, and choosing a row based on a random number between 1 and count(*) (random number determined in the application). This will be effective if the data matching the criteria is evenly distributed, but will fail terribly if for example you are selecting a date range, and the majority of random numbers will fall upon records outside this range.
3) my current favourite, but also the most work. For every combination of criteria you intend to use to select a random record, you insert a record into a special table for that criteria. You then select random records from the special table, and follow them back to your data. For example, you might have a table like this:
Table cat: name, age, eye_colour, fur_type
If you want to be able to select random cats with brown fur, then you need a table like this:
Table cats_with_brown_fur: id (autonumber), cat_fk
You can then select a random record from this table based on the autonumber id, and it will be fast and will produce evenly distributed random results. But indeed, if you select from many sets of criteria, you will have some overhead in maintaining these tables; a sketch of this approach follows below.
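A hedged sketch of this third approach in MySQL, assuming the cat table has an integer primary key id (not shown in the description above) and handling possible gaps in the lookup table's ids by taking the first id at or above the random value:
-- Lookup table of candidate rows for one fixed set of criteria.
CREATE TABLE cats_with_brown_fur (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    cat_fk INT NOT NULL
);
-- Populate once, then keep it in sync as cats are added or removed.
INSERT INTO cats_with_brown_fur (cat_fk)
SELECT id FROM cat WHERE fur_type = 'brown';
-- Pick a random candidate by id, then follow it back to the data.
SELECT c.*
FROM cats_with_brown_fur f
JOIN cat c ON c.id = f.cat_fk
WHERE f.id >= (SELECT FLOOR(1 + RAND() * MAX(id)) FROM cats_with_brown_fur)
ORDER BY f.id
LIMIT 1;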
That's my current take on it, anyway. Good luck
Order by Rand() is a bad idea.
Here's a better solution:
How can i optimize MySQL's ORDER BY RAND() function?
Google is your friend; a lot of people have explained it better than I ever could.
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
http://www.phpbuilder.com/board/showthread.php?t=10338930
http://www.paperplanes.de/2008/4/24/mysql_nonos_order_by_rand.html