SQL random sampling into equal groups

I need to randomly sample users from a table into 4 equal groups using SQL. For that I did the below:
First, randomize all users in the table using the RANDOM() function, then use that result with the NTILE() function to divide them into 4 equal groups, like below:
WITH randomised_users AS (
    SELECT *
    FROM users_table
    ORDER BY RANDOM()
)
SELECT *,
       ntile(4) OVER (ORDER BY (SELECT 1)) AS tile_nr
FROM randomised_users
Is this approach of sampling correct or is there a chance for bias in the 4 groups created from this?

What you have looks fine to me. You don't need a subquery, BTW. This will do just fine:
select *, ntile(4) over (order by random())
Snowflake doesn't guarantee that the query will reproduce the same result set even if you provide a random seed, so make sure to dump any intermediate result set into a temp table if you plan on re-using it.
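For example, a minimal sketch of materialising the assignment in a temp table (assuming the same users_table; the user_groups name is just illustrative):
-- Sketch: persist the random group assignment so later queries all see the same split
CREATE TEMPORARY TABLE user_groups AS
SELECT *,
       NTILE(4) OVER (ORDER BY RANDOM()) AS tile_nr
FROM users_table;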

Related

Select random value for each row

I'm trying to select a new random value from a column in another table for each row of a table I'm updating. I'm getting a random value; however, I can't get it to change for each row. Any ideas? Here's the code:
UPDATE srs1.courseedition
SET ta_id = teacherassistant.ta_id
FROM srs1.teacherassistant
WHERE (SELECT ta_id FROM srs1.teacherassistant ORDER BY RANDOM()
LIMIT 1) = teacherassistant.ta_id
My guess is that Postgres is optimizing out the subquery because it has no dependencies on the outer query. Have you considered simply using a subquery in the SET clause?
UPDATE srs1.courseedition
SET ta_id = (SELECT ta.ta_id
FROM srs1.teacherassistant ta
ORDER BY RANDOM()
LIMIT 1
);
I don't think this will fix the problem (smart optimizers, alas). But, if you correlate to the outer query, then it should run each time. Perhaps:
UPDATE srs1.courseedition ce
SET ta_id = (SELECT ta.ta_id
FROM srs1.teacherassistant ta
WHERE ce.ta_id IS NULL -- or something like that
ORDER BY RANDOM()
LIMIT 1
);
You can replace the WHERE clause with something more nonsensical, such as WHERE COALESCE(ce.ta_id, '') IS NOT NULL.
The following solution should be faster by order(s) of magnitude than running a correlated subquery for every row: N random sorts over the whole table versus a single random sort. The result is just as random, but we get a perfectly even distribution with this method, whereas independent random picks like in Gordon's solution can (and probably will) assign some rows more often than others. There are different kinds of "random"; the actual requirements for "randomness" need to be defined carefully.
Assuming the number of rows in courseedition is bigger than in teacherassistant.
To update all rows in courseedition:
UPDATE srs1.courseedition c1
SET    ta_id = t.ta_id
FROM  (
       SELECT row_number() OVER (ORDER BY random()) - 1 AS rn   -- random order
            , count(*) OVER () AS ct                            -- total count
            , ta_id
       FROM   srs1.teacherassistant                             -- smaller table
       ) t
JOIN  (
       SELECT row_number() OVER () - 1 AS rn                    -- arbitrary order
            , courseedition_id                                   -- use actual PK of courseedition
       FROM   srs1.courseedition                                 -- bigger table
       ) c ON c.rn % t.ct = t.rn   -- row number of big table modulo count of small table
WHERE  c.courseedition_id = c1.courseedition_id;
Notes
Match the random rownumber of the bigger table modulo the count of the smaller table to the rownumber of the smaller table.
row_number() - 1 to get a 0-based index. Allows using the modulo operator % more elegantly.
Random sort for one table is enough. The smaller table is cheaper. The second can have any order (arbitrary is cheaper). The assignment after the join is random either way. Perfect randomness would only be impaired indirectly if there are regular patterns in sort order of the bigger table. In this unlikely case, apply ORDER BY random() to the bigger table to eliminate any such effect.
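To illustrate the modulo mapping with made-up numbers (a throwaway sketch using generate_series() rather than the real tables): with 3 assistants, course-edition row numbers 0-9 map to assistant row numbers 0,1,2,0,1,2,... so each assistant is assigned either 3 or 4 times.
-- Illustration only: 10 "course edition" row numbers mapped onto 3 "assistant" row numbers
SELECT g     AS courseedition_rn
     , g % 3 AS assigned_ta_rn
FROM   generate_series(0, 9) AS g;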

Taking a Random Sample From Each Group in Big Query

I'm trying to figure out the best way to take a random sample of 100 records for each group in a table in BigQuery.
For example, I have a table where column A is a unique recordID, and column B is the groupID to which the record belongs. For every distinct groupID, I would like to take a random sample of 100 recordIDs. Is there a simple way to complete this?
Something like below should work
SELECT recordID, groupID
FROM (
SELECT
recordID, groupID,
RAND() AS rnd, ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY rnd) AS pos
FROM yourTable
)
WHERE pos <= 100
ORDER BY groupID, recordID
Also check RAND() here if you want to improve randomness
Had a similar need, namely cluster sampling, over 400M rows and many columns, but hit an "Exceeded resources..." error when using ROW_NUMBER().
If you don't need RAND() because your data is unordered anyway, this performs quite well (<30s in my case):
SELECT ARRAY_AGG(x LIMIT 100)
FROM yourtable x
GROUP BY groupId
You can:
decorate with UNNEST() if your front-end cannot render nested records (see the sketch below)
add ORDER BY groupId to find/confirm patterns more quickly
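A rough sketch of that UNNEST() decoration, assuming the same yourtable/groupId names as above (untested):
-- Flatten the sampled arrays back into one row per sampled record
SELECT flattened.*
FROM (
  SELECT ARRAY_AGG(x LIMIT 100) AS sample
  FROM yourtable x
  GROUP BY groupId
) AS g, UNNEST(g.sample) AS flattened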

How to select 4 equally sized result sets from database table

I have a SQL Server database table that has a lot of rows. I'm using a program that uses that table as a data source. The program itself doesn't support multi-threading, so I have to run multiple instances of it, and for every instance I need to tell it what part of the whole base data to process.
I've been using this statement to slice my base data (data from the table) into two equally sized result sets:
SELECT TOP 50 PERCENT *
FROM MyTable
ORDER BY MyField ASC
So this will select the first 50% of the data. Then I'm using the following statement to return another half:
SELECT TOP 50 PERCENT *
FROM MyTable
ORDER BY MyField DESC
But I can't figure out how can I select, let's say, 25% chunks. I tried like this
SELECT *
FROM MyTable
WHERE MyField NOT IN (SELECT TOP 50 PERCENT *
FROM MyTable
ORDER BY MyField DESC)
AND MyField NOT IN (SELECT TOP 25 PERCENT *
FROM MyTable
ORDER BY MyField ASC)
ORDER BY MyField ASC
So it would return everything else, but not the first 25% or the last 50%. In other words, it would return the data between the 25% row and the 50% row. You get the point, I'm sure.
I tried this on my local machine (in Visual Studio, connected to my SQL database) and it worked well, but when I implemented it in a test environment I got the following error:
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
I don't really know what that means in this context. And those SELECT TOP 50 PERCENT statements work well in the test environment too.
You should look at the NTILE window function - it partitions a set of rows into chunks (any number of them - you decide) and allows you to easily pick the one you need:
WITH ChunkedData AS
(
SELECT
Chunk = NTILE(4) OVER (ORDER BY MyField ASC),
*
FROM MyTable
)
SELECT *
FROM ChunkedData
WHERE Chunk = 1
With the NTILE(4) window function, you basically get all your rows labelled with a 1, 2, 3 or 4 - 4 almost equal chunks of data. Pick the one you need - works like a charm!
And of course, if you need to, you can use a different number of chunks - NTILE(10) gives you 10 equally sized chunks - your pick.
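Since each program instance works on a different part, a minimal sketch of parameterising the chunk number (the @ChunkNumber variable is purely illustrative):
DECLARE @ChunkNumber INT = 2;   -- hypothetical parameter: 1..4, one value per program instance

WITH ChunkedData AS
(
    SELECT
        Chunk = NTILE(4) OVER (ORDER BY MyField ASC),
        *
    FROM MyTable
)
SELECT *
FROM ChunkedData
WHERE Chunk = @ChunkNumber;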
When you write a condition like column_name IN (...) or NOT IN (...), the subquery should return only one column for the comparison.
In your subqueries you are returning all columns, so the comparison can't be made. Please try the query below:
SELECT *
FROM MyTable
WHERE MyField NOT IN (SELECT TOP 50 PERCENT MyField
FROM MyTable
ORDER BY MyField DESC)
AND MyField NOT IN (SELECT TOP 25 PERCENT MyField
FROM MyTable
ORDER BY MyField ASC)
ORDER BY MyField ASC

How should I handle "ranked x out of y" data in PostgreSQL?

I have a table that I would like to be able to present "ranked X out of Y" data for. In particular, I'd like to be able to present that data for an individual row in a relatively efficient way (i.e. without selecting every row in the table). The ranking itself is quite simple, it's a straight ORDER BY on a single column in the table.
Postgres seems to present some unique challenges in this regard; AFAICT it doesn't have a RANK or ROW_NUMBER or equivalent function (at least in 8.3, which I'm stuck on for the moment). The canonical answer in the mailing list archives seems to be to create a temporary sequence and select from it:
test=> create temporary sequence tmp_seq;
CREATE SEQUENCE
test=*> select nextval('tmp_seq') as row_number, col1, col2 from foo;
It seems like this solution still won't help when I want to select just a single row from the table (and I want to select it by PK, not by rank).
I could denormalize and store the rank in a separate column, which makes presenting the data trivial, but just relocates my problem. UPDATE doesn't support ORDER BY, so I'm not sure how I'd construct an UPDATE query to set the ranks (short of selecting every row and running a separate UPDATE for each row, which seems like way too much DB activity to trigger every time the ranks need updating).
Am I missing something obvious? What's the Right Way to do this?
EDIT: Apparently I wasn't clear enough. I'm aware of OFFSET/LIMIT, but I don't see how it helps solve this problem. I'm not trying to select the Xth-ranked item, I'm trying to select an arbitrary item (by its PK, say), and then be able to display to the user something like "ranked 43rd out of 312."
If you want the rank, do something like
SELECT id,num,rank FROM (
SELECT id,num,rank() OVER (ORDER BY num) FROM foo
) AS bar WHERE id=4
Or if you actually want the row number, use
SELECT id,num,row_number FROM (
SELECT id,num,row_number() OVER (ORDER BY num) FROM foo
) AS bar WHERE id=4
They'll differ when you have equal values somewhere. There is also dense_rank() if you need that.
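For instance, a quick throwaway illustration of the difference on made-up values:
-- num = 10, 20, 20, 30: rank() gives 1,2,2,4; row_number() gives 1,2,3,4; dense_rank() gives 1,2,2,3
SELECT num,
       rank()       OVER (ORDER BY num) AS rnk,
       row_number() OVER (ORDER BY num) AS rownum,
       dense_rank() OVER (ORDER BY num) AS dense_rnk
FROM (VALUES (10), (20), (20), (30)) AS t(num);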
This requires PostgreSQL 8.4, of course.
Isn't it just this:
SELECT *
FROM mytable
ORDER BY
col1
OFFSET X LIMIT 1
Or am I missing something?
Update:
If you want to show the rank, use this:
SELECT q.*, vals[1] AS rank, vals[2] AS total
FROM (
    SELECT mo.*,
           (
           SELECT ARRAY[SUM(((mi.col1, mi.ctid) < (mo.col1, mo.ctid))::INTEGER), COUNT(*)]
           FROM mytable mi
           ) AS vals
    FROM mytable mo
    WHERE mo.id = #myid
) q
ROW_NUMBER functionality in PostgreSQL (before 8.4) can be emulated via LIMIT n OFFSET skip.
Find an overview here.
On the pitfalls of ranking see this SO question.
EDIT: Since you are asking for ROW_NUMBER() instead of simple ranking: row_number() was introduced in PostgreSQL 8.4, so you might consider upgrading. Otherwise this workaround might be helpful.
Previous replies tackle the question "select all rows and get their rank" which is not what you want...
you have a row
you want to know its rank
Just do:
SELECT count(*) FROM table WHERE score > $1
Where $1 is the score of the row you just selected (I suppose you'd like to display it so you might select it...).
Or do:
SELECT a.*, (SELECT count(*) FROM table b WHERE b.score > a.score) AS rank FROM table AS a WHERE pk = ...
However, if you select a row which is ranked last, yes you will need to count all the rows which are ranked before it, so you'll need to scan the whole table, and it will be very slow.
Solution :
SELECT count(*) FROM (SELECT 1 FROM table WHERE score > $1 LIMIT 30) AS t
You'll get precise ranking for the 30 best scores, and it will be fast.
Who cares about the losers?
OK, if you really do care about the losers, you'll need to make a histogram:
Suppose score can go from 0 to 100, and you have 1000000 losers with score < 80 and 10 winners with score > 80.
You make a histogram of how many rows have a score of X; it's a simple, small table with 100 rows. Add a trigger to your main table to keep the histogram up to date.
Now if you want to rank a loser who has score X, his rank is sum(histo) over the bins where histo_score > X.
Since your score probably isn't between 0 and 100 but (say) between 0 and 1000000000, you'll need to fudge it a bit: enlarge your histogram bins, for instance, so you only need 100 bins at most, or use some log-histogram distribution function.
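A rough sketch of that histogram approach (table name, column names, and bin width are all illustrative, and the trigger that keeps the counts current is omitted):
-- Illustrative histogram: one row per score bucket, maintained by a trigger on the main table
CREATE TABLE score_histogram (
    bin int PRIMARY KEY,           -- e.g. score / 10000000 if scores run up to 1e9
    cnt bigint NOT NULL DEFAULT 0
);

-- Approximate rank of a row whose score falls into bin $1: sum the counts of all higher bins
SELECT COALESCE(sum(cnt), 0) AS approx_rank
FROM score_histogram
WHERE bin > $1;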
By the way, Postgres does this when you ANALYZE the table, so if you set the statistics target to 100 or 1000 on score, run ANALYZE, and then run:
EXPLAIN SELECT * FROM table WHERE score > $1
you'll get a nice rowcount estimate.
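A sketch of that (the scores table name and the value 42 are just placeholders; the per-column statistics target is set with ALTER TABLE):
ALTER TABLE scores ALTER COLUMN score SET STATISTICS 1000;  -- raise the per-column statistics target
ANALYZE scores;                                             -- refresh the planner's statistics
EXPLAIN SELECT * FROM scores WHERE score > 42;              -- the estimated row count approximates the rank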
Who needs exact answers?

SQL Server rand() aggregate

Problem: a table of coordinate lat/lngs. Two rows can potentially have the same coordinate. We want a query that returns a set of rows with unique coordinates (within the returned set). Note that DISTINCT is not usable because I also need to return the id column, which is, by definition, distinct for every row. This sort of works (#maxcount is the number of rows we need, intid is a unique int id column):
select top (#maxcount) max(intid)
from Documents d
group by d.geoLng, d.geoLat
It will always return the same row for a given coordinate, unfortunately, which is a bit of a shame for my use. If only we had a rand() aggregate we could use instead of max()... Note that you can't use max() with GUIDs created by newid().
Any ideas?
(there's some more background here, if you're interested: http://www.itu.dk/~friism/blog/?p=121)
UPDATE: Full solution here
You might be able to use a CTE for this with the ROW_NUMBER function across lat and long and then use rand() against that. Something like:
WITH cte AS
(
SELECT
intID,
ROW_NUMBER() OVER
(
PARTITION BY geoLat, geoLng
ORDER BY NEWID()
) AS row_num,
COUNT(intID) OVER (PARTITION BY geoLat, geoLng) AS TotalCount
FROM
dbo.Documents
)
SELECT TOP (#maxcount)
intID, RAND(intID)
FROM
cte
WHERE
row_num = 1 + FLOOR(RAND() * TotalCount)
This will always return the first sets of lat and lngs and I haven't been able to make the order random. Maybe someone can continue on with this approach. It will give you a random row within the matching lat and lng combinations though.
If I have more time later I'll try to get around that last obstacle.
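One way to continue from here (an untested sketch): replace the final SELECT above. row_num = 1 already picks a random row within each coordinate, and an outer ORDER BY NEWID() randomises which coordinates come back:
-- Sketch: one random row per coordinate, returned in random order
SELECT TOP (#maxcount)
    intID
FROM
    cte
WHERE
    row_num = 1
ORDER BY
    NEWID()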
Doesn't this work for you?
select top (#maxcount) *
from
(
select max(intid) as id from Documents d group by d.geoLng, d.geoLat
) t
order by newid()
Where did you get the idea that DISTINCT only works on one column? Anyway, you could also use a GROUP BY clause.