I'm trying to figure out the best way to take a random sample of 100 records for each group in a table in BigQuery.
For example, I have a table where column A is a unique recordID and column B is the groupID to which the record belongs. For every distinct groupID, I would like to take a random sample of 100 recordIDs. Is there a simple way to accomplish this?
Something like below should work:
SELECT recordID, groupID
FROM (
  SELECT
    recordID, groupID,
    ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY RAND()) AS pos
  FROM yourTable
)
WHERE pos <= 100
ORDER BY groupID, recordID
Also read up on RAND() if you want to improve randomness.
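On newer BigQuery you can also write this more compactly with QUALIFY; a sketch, assuming the same yourTable/groupID names:
SELECT recordID, groupID
FROM yourTable
WHERE TRUE  -- BigQuery requires a WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY groupID ORDER BY RAND()) <= 100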
Had a similar need, namely cluster sampling, over 400M rows and many more columns, but hit an "Exceeded resources..." error when using ROW_NUMBER().
If you don't need RAND() because your data is unordered anyway, this performs quite well (<30s in my case):
SELECT ARRAY_AGG(x LIMIT 100)
FROM yourtable x
GROUP BY groupId
You can:
decorate with UNNEST() if your front-end cannot render nested records (a sketch follows below)
add ORDER BY groupId to find/confirm patterns more quickly
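A sketch of the UNNEST() variant, assuming the same yourtable/groupId names:
SELECT sample.*
FROM (
  SELECT ARRAY_AGG(x LIMIT 100) AS samples  -- up to 100 whole rows per group
  FROM yourtable x
  GROUP BY groupId
), UNNEST(samples) AS sample  -- flatten the arrays back into plain rows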
I need to randomly sample users from a table into 4 equal groups using SQL. To do that, I did the below:
First, randomize all users in the table using the RANDOM() function, then feed the result to the NTILE() function to divide them into 4 equal groups, like below:
WITH randomised_users AS (
SELECT *
FROM users_table
ORDER BY RANDOM()
) SELECT *,
ntile(4) OVER(ORDER BY (SELECT 1)) AS tile_nr
FROM randomised_users
Is this approach of sampling correct or is there a chance for bias in the 4 groups created from this?
What you have looks fine to me. You don't need a subquery, BTW. This will do just fine:
select *, ntile(4) over (order by random()) as tile_nr
from users_table
Snowflake doesn't guarantee that the query will reproduce the same result set even if you provide a random seed, so make sure to dump any intermediate result set into a temp table if you plan on re-using it.
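A minimal sketch of that, re-using the asker's users_table (the temp table name is illustrative):
create temporary table user_groups as
select *, ntile(4) over (order by random()) as tile_nr
from users_table;  -- materialize once, then query user_groups for stable results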
A relational database table holds insurance details, say id and amount. The table consists of millions of records. The requirement is to fetch the top 5 records with the maximum amount without using an ORDER BY clause.
A solution I could think of is to use a temp table to maintain the max 5 records and update those entries each time the main table is updated, but I would like to know if there is a better solution to this problem.
An efficient way is to put an index on amount desc and use order by. Something like:
select t.*
from t
order by t.amount desc
fetch first 5 rows only; -- or however your database does this
This should be quite efficient.
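A minimal sketch of the supporting index (the name is illustrative, and descending-index syntax varies slightly by database):
create index t_amount_desc_idx on t (amount desc);  -- lets the top-5 query read only the first few index entries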
You can try using analytic functions (example below), but you still have to order at some stage
select id,
amount
from (select id,
amount,
row_number() over (order by amount desc nulls last) as rn
from t)
where rn<=5;
I'm using SQL Server 2008 and the following query to implement paged data retrieval from our JSF application. In the code below I am retrieving 25 rows at a time, sorted by the default sort column in DESC order.
SELECT * FROM
(
SELECT TOP 25 * FROM
(
SELECT TOP 25 ...... WHERE CONDITIONS
--ORDER BY ... DESC
)AS INNERQUERY ORDER BY INNERQUERY.... ASC
)
AS OUTERQUERY
ORDER BY OUTERQUERY.... DESC
It works, but with one obvious flaw: if the user requests the last page and there are over 10 million records in the table, then the inner TOP query will first have to retrieve all 10 million records, and only then will the outer TOP pick out its 25, which will look like:
SELECT * FROM
(
SELECT TOP 25 * FROM
(
SELECT TOP 10000000 ...... WHERE CONDITIONS
--ORDER BY ... DESC
)AS INNERQUERY ORDER BY INNERQUERY.... ASC
)
AS OUTERQUERY
ORDER BY OUTERQUERY.... DESC
I looked into replacing the above with ROW_NUMBER() OVER(....), but seemingly I had the same issue: the inner statement still has to produce the entire result before you can filter with WHERE ROW_NUMBER BETWEEN x AND y.
Can you please point out my mistakes in the above approach and hints on how it can be optimized?
I'm currently using the following code to retrieve a subset of rows:
WITH PAGED_QRY AS (
  SELECT *, ROW_NUMBER() OVER(ORDER BY Y) AS ROW_NO
  FROM TABLE WHERE ....
)
SELECT * FROM PAGED_QRY WHERE ROW_NO BETWEEN #CURRENT_INDEX AND #ROWS_TO_RETRIEVE
ORDER BY ROW_NO
where #CURRENT_INDEX and #ROWS_TO_RETRIEVE (e.g. 1 and 50) are your paging variables. It's cleaner and easier to read.
I've also tried using SET ROWCOUNT #ROWS_TO_RETRIEVE, but it doesn't seem to make much difference.
Using the above query, and by carefully studying the execution plan and modifying/creating indexes and statistics, I've reached results that are sufficiently satisfactory, hence why I'm marking this as the answer. The original goal of retrieving only the required rows in the inner query doesn't seem to be possible yet; if you do find a way, please let me know.
We can improve the above query a bit more.
If I assume that #current_index is the current page number, then we can rewrite the above query as:
WITH PAGED_QRY AS (
  SELECT TOP (#current_index * #rows_to_retrieve) *,
    ROW_NUMBER() OVER(ORDER BY Y) AS ROW_NO
  FROM TABLE WHERE ....
)
SELECT TOP (#rows_to_retrieve) * FROM PAGED_QRY
ORDER BY ROW_NO DESC
In this case, our inner query will not return the whole record set. Suppose our page_index is 3 and our page_size is 50: then it will select only 150 rows (even if our table contains hundreds/thousands/millions of rows), and we can skip the WHERE clause as well.
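For reference, SQL Server 2012 and later (though not the asker's 2008 instance) can express this paging directly with OFFSET/FETCH, avoiding the nested TOP entirely; a sketch using the same placeholder variables:
SELECT *
FROM TABLE
WHERE ....
ORDER BY Y
OFFSET (#current_index - 1) * #rows_to_retrieve ROWS  -- skip the earlier pages
FETCH NEXT #rows_to_retrieve ROWS ONLY;               -- return just this page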
I have a table that I would like to be able to present "ranked X out of Y" data for. In particular, I'd like to be able to present that data for an individual row in a relatively efficient way (i.e. without selecting every row in the table). The ranking itself is quite simple, it's a straight ORDER BY on a single column in the table.
Postgres seems to present some unique challenges in this regard; AFAICT it doesn't have a RANK or ROW_NUMBER or equivalent function (at least in 8.3, which I'm stuck on for the moment). The canonical answer in the mailing list archives seems to be to create a temporary sequence and select from it:
test=> create temporary sequence tmp_seq;
CREATE SEQUENCE
test=*> select nextval('tmp_seq') as row_number, col1, col2 from foo;
It seems like this solution still won't help when I want to select just a single row from the table (and I want to select it by PK, not by rank).
I could denormalize and store the rank in a separate column, which makes presenting the data trivial, but just relocates my problem. UPDATE doesn't support ORDER BY, so I'm not sure how I'd construct an UPDATE query to set the ranks (short of selecting every row and running a separate UPDATE for each row, which seems like way too much DB activity to trigger every time the ranks need updating).
Am I missing something obvious? What's the Right Way to do this?
EDIT: Apparently I wasn't clear enough. I'm aware of OFFSET/LIMIT, but I don't see how it helps solve this problem. I'm not trying to select the Xth-ranked item, I'm trying to select an arbitrary item (by its PK, say), and then be able to display to the user something like "ranked 43rd out of 312."
If you want the rank, do something like
SELECT id,num,rank FROM (
SELECT id,num,rank() OVER (ORDER BY num) FROM foo
) AS bar WHERE id=4
Or if you actually want the row number, use
SELECT id,num,row_number FROM (
SELECT id,num,row_number() OVER (ORDER BY num) FROM foo
) AS bar WHERE id=4
They'll differ when you have equal values somewhere. There is also dense_rank() if you need that.
This requires PostgreSQL 8.4, of course.
Isn't it just this:
SELECT *
FROM mytable
ORDER BY
col1
OFFSET X LIMIT 1
Or am I missing something?
Update:
If you want to show the rank, use this:
SELECT q.*, values[1] AS rank, values[2] AS total
FROM (
  SELECT mo.*,
    (
    SELECT ARRAY[SUM(((mi.col1, mi.ctid) < (mo.col1, mo.ctid))::INTEGER), COUNT(*)]
    FROM mytable mi
    ) AS values
  FROM mytable mo
  WHERE mo.id = #myid
) q
ROW_NUMBER-style functionality in PostgreSQL can be emulated via LIMIT n OFFSET skip.
EDIT: Since you are asking for ROW_NUMBER() instead of simple ranking: row_number() was introduced in PostgreSQL 8.4, so you might consider upgrading.
Previous replies tackle the question "select all rows and get their rank", which is not what you want here:
you have a row
you want to know its rank
Just do:
SELECT count(*) FROM table WHERE score > $1
Where $1 is the score of the row you just selected (I suppose you'd like to display it, so you might as well select it...).
Or do:
SELECT a.*, (SELECT count(*) FROM table b WHERE b.score > a.score) AS rank FROM table AS a WHERE pk = ...
However, if you select a row which is ranked last, you will need to count all the rows ranked before it, so you'll have to scan the whole table, and it will be very slow.
Solution :
SELECT count(*) FROM (SELECT 1 FROM table WHERE score > $1 LIMIT 30) t
You'll get precise ranking for the 30 best scores, and it will be fast.
Who cares about the losers?
OK, if you really do care about the losers, you'll need to make a histogram:
Suppose score can go from 0 to 100, and you have 1000000 losers with score < 80 and 10 winners with score > 80.
You make a histogram of how many rows have a score of X; it's a simple, small table with 100 rows. Add a trigger to your main table to keep the histogram updated (a sketch follows below).
Now if you want to rank a loser who has score X, his rank is sum(histo) over the bins where histo_score > X.
Since your score probably isn't between 0 and 100 but (say) between 0 and 1000000000, you'll need to fudge it a bit: enlarge your histogram bins, for instance, so you only need 100 bins max, or use some log-histogram distribution function.
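A minimal sketch of the histogram plus its maintenance trigger, assuming integer scores in [0, 1000000000) and 100 fixed bins of 10,000,000 each (all names are illustrative, and the 100 bin rows are assumed to be pre-inserted):
CREATE TABLE score_histo (bin integer PRIMARY KEY, cnt bigint NOT NULL DEFAULT 0);
CREATE OR REPLACE FUNCTION score_histo_upd() RETURNS trigger AS $$
BEGIN
  IF TG_OP IN ('INSERT', 'UPDATE') THEN
    UPDATE score_histo SET cnt = cnt + 1 WHERE bin = NEW.score / 10000000;
  END IF;
  IF TG_OP IN ('DELETE', 'UPDATE') THEN
    UPDATE score_histo SET cnt = cnt - 1 WHERE bin = OLD.score / 10000000;
  END IF;
  RETURN NULL;  -- AFTER trigger, return value is ignored
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_score_histo AFTER INSERT OR UPDATE OR DELETE ON mytable
FOR EACH ROW EXECUTE PROCEDURE score_histo_upd();
-- approximate rank of a row with score X: everything in strictly higher bins
SELECT coalesce(sum(cnt), 0) FROM score_histo WHERE bin > X / 10000000;  -- X is a placeholder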
By the way, Postgres does this when you ANALYZE the table, so if you set the statistics target to 100 or 1000 on score, run ANALYZE, and then run:
EXPLAIN SELECT * FROM table WHERE score > $1
you'll get a nice rowcount estimate.
Who needs exact answers?
Problem: a table of coordinate lat/lngs. Two rows can potentially have the same coordinate. We want a query that returns a set of rows with unique coordinates (within the returned set). Note that distinct is not usable because I need to return the id column which is, by definition, distinct. This sort of works (#maxcount is the number of rows we need, intid is a unique int id column):
select top (#maxcount) max(intid)
from Documents d
group by d.geoLng, d.geoLat
Unfortunately it will always return the same row for a given coordinate, which is a bit of a shame for my use. If only we had a rand() aggregate we could use instead of max()... Note that you can't use max() with GUIDs created by newid().
Any ideas?
(there's some more background here, if you're interested: http://www.itu.dk/~friism/blog/?p=121)
UPDATE: Full solution here
You might be able to use a CTE for this with the ROW_NUMBER function across lat and long and then use rand() against that. Something like:
WITH cte AS
(
SELECT
intID,
ROW_NUMBER() OVER
(
PARTITION BY geoLat, geoLng
ORDER BY NEWID()
) AS row_num,
COUNT(intID) OVER (PARTITION BY geoLat, geoLng) AS TotalCount
FROM
dbo.Documents
)
SELECT TOP (#maxcount)
intID, RAND(intID)
FROM
cte
WHERE
row_num = 1 + FLOOR(RAND() * TotalCount)
This will always return the first sets of lat and lngs, and I haven't been able to make the order random. Maybe someone can continue on with this approach. It will give you a random row within the matching lat and lng combinations, though.
If I have more time later I'll try to get around that last obstacle (a possible simplification is sketched below).
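For what it's worth, a hedged simplification of the same idea: ORDER BY NEWID() inside the partition already picks a random row per coordinate, and a final ORDER BY NEWID() randomizes which coordinates make the cut (same table/column names assumed):
SELECT TOP (#maxcount) intID
FROM (
  SELECT intID,
    ROW_NUMBER() OVER (PARTITION BY geoLat, geoLng ORDER BY NEWID()) AS row_num
  FROM dbo.Documents
) t
WHERE row_num = 1   -- one random row per distinct coordinate
ORDER BY NEWID()    -- random choice of which coordinates are returned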
Doesn't this work for you?
select top (#maxcount) *
from
(
select max(intid) as id from Documents d group by d.geoLng, d.geoLat
) t
order by newid()
Where did you get the idea that DISTINCT only works on one column? Anyway, you could also use a GROUP BY clause.
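For example, DISTINCT already de-duplicates across the whole select list, i.e. over coordinate pairs (a sketch; it just can't also return the per-row id the asker needs):
select distinct d.geoLng, d.geoLat
from Documents d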