How to efficiently get a range of ranked users (for a leaderboard) using PostgreSQL

I have read many posts on this topic, such as
mysql-get-rank-from-leaderboards.
However, none of the solutions are efficient at scale for getting a range of ranks from the database.
The problem is simple. Suppose we have a Postgres table with an "id" column and another INTEGER column whose values are not unique, but we have an index for this column.
e.g. table could be:
CREATE TABLE my_game_users (id serial PRIMARY KEY, rating INTEGER NOT NULL);
The goal
Define a rank for users ordering users on the "rating" column descending
Be able to query for a list of ~50 users ordered by this new "rank", centered at any particular user
For example, we might return users with ranks { 15, 16, ..., 64, 65 } where the center user has rank #40
Performance must scale, e.g. be under 80 ms for 100,000 users.
Attempt #1: row_number() window function
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank
FROM my_game_users)
SELECT *
FROM my_ranks
WHERE rank >= 4000 AND rank <= 4050
ORDER BY rank ASC;
This "works", but the queries average 550ms with 100,000 users on a fast laptop without any other real work being done.
I tried adding indexes, and re-phrasing this query to not use the "WITH" syntax, and nothing worked to speed it up.
Attempt #2 - count the number of rows with a greater rating value
I tried a query like this:
SELECT t1.*,
(SELECT COUNT(*)
FROM my_game_users t2
WHERE (t1.rating, -t1.id) <= (t2.rating, -t2.id)
) AS rank
FROM my_game_users t1
WHERE id = 2000;
This is decent: the query takes about 120ms with 100,000 users having random ratings. However, it only returns the rank for the user with a particular id (2000).
I can't see any efficient way to extend this query to get a range of ranks. Any attempt at extending this makes a very slow query.
I only know the ID of the "center" user, since the users have to be ordered by rank before we know which ones are in the range!
Attempt #3: in-memory ordered Tree
I ended up using a Java TreeSet to store the ranks. I can update the TreeSet whenever a new user is inserted into the database, or a user's rating changes.
This is super fast, around 25 ms with 100,000 users.
However, it has a serious drawback that it's only updated on the Webapp node that serviced the request. I'm using Heroku and will deploy multiple nodes for my app. So, I needed to add a scheduled task for the server to re-build this ranking tree every hour, to make sure the nodes don't get too out-of-sync!
If anyone knows of an efficient way to do this in Postgres with full solution, then I am all ears!

You can get the same results by using order by rating desc and offset and limit to get users between a certain rank.
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank FROM my_game_users)
SELECT * FROM my_ranks WHERE rank >= 4000 AND rank <= 4050 ORDER BY rank ASC;
The query above is roughly the same as the following (ranks 4000 to 4050 inclusive are 51 rows, starting after the first 3999 rows):
select * , rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 51 offset 3999
If you want to select users around rank #40 you could select ranks #15-#65 (again 51 rows, skipping the first 14):
select *, rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 51 offset 14

Thanks, @FuzzyTree!
Your solution doesn't quite give me everything I need, but it nudged me in the right direction. Here's the full solution I'm going with for now.
The only limitation with your solution is that there's no way to get a unique rank for a particular user. All users with the same rating would have the same rank (or at least it is undefined by SQL standard). If I knew the OFFSET ahead of time, then your rank would be good enough, but I have to get the rank of a particular user first.
My solution is to do the following query to get a range of ranks:
SELECT * FROM my_game_users ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?
This is basically uniquely defining the ranks by rating, then by who joined the Game first (lower id).
To make this efficient I'm creating an index on (rating DESC, id)
Then, I'm getting a particular user's rank to plug in to this query with:
SELECT COUNT(*) FROM my_game_users WHERE rating > ? OR (rating = ? AND id < ?)
I actually made this more efficient with:
SELECT (SELECT COUNT(*) FROM my_game_users WHERE rating > ?) + (SELECT COUNT(*) FROM my_game_users WHERE rating = ? AND id < ?) + 1
Now, even with these queries it takes about 78ms average and median time to get the ranks around a user. If anyone has a good idea how to speed these up I'm all ears!
For example, getting a range of ranks takes about 60ms, and explaining it yields:
EXPLAIN SELECT * FROM word_users ORDER BY rating DESC, id ASC LIMIT 50 OFFSET 50000;
"Limit (cost=6350.28..6356.63 rows=50 width=665)"
" -> Index Scan using idx_rating_desc_and_id on word_users (cost=0.29..12704.83 rows=100036 width=665)"
So, it's using the rating and id index, yet it still has this highly variable cost from 0.29...12704.83. Any ideas how to improve??
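The two-step approach described above (count the users ahead of the center user, then page with LIMIT/OFFSET) can be sketched end to end as follows. This is only a minimal illustration, not a benchmark: it uses Python's built-in sqlite3 as a stand-in for Postgres, and the helper names rank_of and ranks_around as well as the generated ratings are assumptions made up for the example.

```python
import sqlite3

def rank_of(conn, user_id):
    """1-based rank of user_id, ordering by rating DESC, then id ASC."""
    rating, = conn.execute(
        "SELECT rating FROM my_game_users WHERE id = ?", (user_id,)
    ).fetchone()
    ahead, = conn.execute(
        "SELECT (SELECT COUNT(*) FROM my_game_users WHERE rating > ?)"
        " + (SELECT COUNT(*) FROM my_game_users WHERE rating = ? AND id < ?)",
        (rating, rating, user_id),
    ).fetchone()
    return ahead + 1

def ranks_around(conn, user_id, radius=25):
    """Rows ranked within `radius` places of user_id, as (rank, id, rating)."""
    center = rank_of(conn, user_id)
    first = max(center - radius, 1)      # clamp at the top of the board
    limit = center + radius - first + 1
    rows = conn.execute(
        "SELECT id, rating FROM my_game_users"
        " ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?",
        (limit, first - 1),
    ).fetchall()
    return [(first + i, uid, rating) for i, (uid, rating) in enumerate(rows)]

# Hypothetical demo data: 1000 users with non-unique ratings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_game_users (id INTEGER PRIMARY KEY, rating INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_rating_desc_and_id ON my_game_users (rating DESC, id)")
conn.executemany(
    "INSERT INTO my_game_users (id, rating) VALUES (?, ?)",
    [(i, (i * 37) % 101) for i in range(1, 1001)],
)
window = ranks_around(conn, user_id=500, radius=25)
```

Note that the OFFSET still has to walk past all skipped index entries, so the cost of the second query grows with the rank of the center user.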

If you order it in desc order you have it in the right order. Use the row_number() function.
Select Row number in postgres
Also, you could use an in-memory cache to store the ranking, something like Redis. It's a separate application that can serve multiple instances, even remotely.

Related

How does one get the total rows for a partition in postgresql

I'm using a window function to help me paginate through a list of records in the database.
For example
I have a list of dogs and they all have a breed associated with them.
I want to show 10 dogs from each breed to my users.
So that would be
select * from dogs
join (
SELECT id, row_number() OVER (PARTITION BY breed) as row_number FROM dogs
) rn on dogs.id = rn.id
where (row_number between 1 and 10)
That will give me ~ten dogs from each breed.
What I need though is a count. Is there a way to get the count of the partitions? I want to know how many Staffies I have waiting for adoption.
I do notice that there's a percentage, and all the docs I find seem to indicate there's something called total rows. But I don't see it.
Just run the window aggregate function count() over the same partition (without adding ORDER BY!) to get the total count for the partition:
SELECT *
FROM (
SELECT *
, row_number() OVER (PARTITION BY breed ORDER BY id) AS rn
, count(*) OVER (PARTITION BY breed) AS breed_count -- !
FROM dogs
) sub
WHERE rn < 11;
Also removed the unnecessary join and simplified.
See:
Run a query with a LIMIT/OFFSET and also get the total number of rows
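To make the effect of the two window functions concrete, here is a small runnable sketch. It uses Python's built-in sqlite3 instead of Postgres (window functions need SQLite 3.25 or later), and the sample breeds and row counts are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, breed TEXT NOT NULL)")
# Hypothetical data: 12 Staffies and 3 Beagles.
conn.executemany(
    "INSERT INTO dogs (breed) VALUES (?)",
    [("Staffy",)] * 12 + [("Beagle",)] * 3,
)

rows = conn.execute(
    """
    SELECT id, breed, rn, breed_count
    FROM (
        SELECT id, breed,
               row_number() OVER (PARTITION BY breed ORDER BY id) AS rn,
               count(*)     OVER (PARTITION BY breed)             AS breed_count
        FROM dogs
    ) sub
    WHERE rn < 11
    ORDER BY breed, rn
    """
).fetchall()

# Every returned row carries the total for its breed,
# even though at most 10 rows per breed are returned.
totals = {breed: breed_count for _, breed, _, breed_count in rows}
```

The count is computed over the whole partition before the outer WHERE trims each breed to 10 rows, which is why the total survives the filter.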
And I added ORDER BY to the window definition of row_number() to get a deterministic result. Without it, Postgres is free to return any 10 arbitrary rows. Any write to the table (or VACUUM, etc.) can and will change the result without ORDER BY.
Aside, pagination with LIMIT / OFFSET does not scale well. Consider:
Optimize query with OFFSET on large table
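One commonly used alternative to OFFSET is keyset (seek) pagination: remember the last key of the previous page and continue from there. The answer above does not spell this out, so treat the following as a hedged sketch of that swapped-in technique. It uses sqlite3 as a stand-in, and next_page and the demo rows are hypothetical.

```python
import sqlite3

def next_page(conn, after_id=0, page_size=10):
    """Return the next page of dogs whose id is greater than after_id."""
    return conn.execute(
        "SELECT id, breed FROM dogs WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, breed TEXT NOT NULL)")
conn.executemany("INSERT INTO dogs (breed) VALUES (?)", [("Staffy",)] * 25)

pages = []
last_id = 0
while True:
    page = next_page(conn, after_id=last_id)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]  # remember where this page ended
```

Each page is an index range scan starting at the remembered key, so later pages cost about the same as the first one. The trade-off is that you can only step to the next page, not jump to an arbitrary page number.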

How to get the most frequent value SQL

I have a table Orders(id_trip, id_order), table Trip(id_hotel, id_bus, id_type_of_trip) and table Hotel(id_hotel, name).
I would like to get name of the most frequent hotel in table Orders.
SELECT hotel.name from Orders
JOIN Trip
on Orders.id_trip = Trip.id_hotel
JOIN hotel
on trip.id_hotel = hotel.id_hotel
FROM (SELECT hotel.name, rank() over (order by cnt desc) rnk
FROM (SELECT hotel.name, count(*) cnt
FROM Orders
GROUP BY hotel.name))
WHERE rnk = 1;
The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm
For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30 - the number of the department with the most employees. (30 is the department "name" or number, it is NOT the number of employees in that department!)
In your case:
select stats_mode(h.name) from (the rest of your query)
Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution - a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
Use FIRST for a single result:
SELECT MAX(name) KEEP (DENSE_RANK FIRST ORDER BY cnt DESC)
FROM (
SELECT hotel.name, COUNT(*) cnt
FROM orders
JOIN trip USING (id_trip)
JOIN hotel USING (id_hotel)
GROUP BY hotel.name
) t
Here is one method:
select name
from (select h.name,
row_number() over (order by count(*) desc) as seqnum -- use `rank()` if you want duplicates
from orders o join
trip t
on o.id_trip = t.id_trip join -- this seems like the right join condition
hotel h
on t.id_hotel = h.id_hotel
group by h.name -- needed so that count(*) is computed per hotel
) oth
where seqnum = 1;
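For readers who want to try the idea without an Oracle instance, here is a minimal sketch of "group, count, take the top row" using Python's built-in sqlite3. It assumes that Trip has an id_trip key (the question does not list one), and the hotel names and order rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hotel  (id_hotel INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE trip   (id_trip  INTEGER PRIMARY KEY, id_hotel INTEGER);
    CREATE TABLE orders (id_order INTEGER PRIMARY KEY, id_trip  INTEGER);
    INSERT INTO hotel  VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO trip   VALUES (10, 1), (11, 2), (12, 2);
    INSERT INTO orders VALUES (100, 10), (101, 11), (102, 12), (103, 12);
    """
)

most_frequent = conn.execute(
    """
    SELECT h.name, COUNT(*) AS cnt
    FROM orders o
    JOIN trip  t ON o.id_trip  = t.id_trip
    JOIN hotel h ON t.id_hotel = h.id_hotel
    GROUP BY h.name
    ORDER BY cnt DESC, h.name   -- the name breaks ties deterministically
    LIMIT 1
    """
).fetchone()
```

With this data, 'Beta' is reached through three orders and 'Alpha' through one, so the single returned row is the Beta hotel with its count.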
Getting the most recent statistical mode out of a data sample
I know it's more than a year, but here's my answer. I came across this question hoping to find a simpler solution than what I know, but alas, nope.
I had a similar situation where I needed to get the mode from a data sample, with the requirement to get the mode of the most recently inserted value if there were multiple modes.
In such a case neither the STATS_MODE nor the LAST aggregate functions would do (as they would tend to return the first mode found, not necessarily the mode with the most recent entries.)
In my case it was easy to use the ROWNUM pseudo-column because the tables in question were performance metric tables that only experienced inserts (not updates)
In this oversimplified example, I'm using ROWNUM - it could easily be changed to a timestamp or sequence field if you have one.
SELECT VALUE
FROM
(SELECT VALUE ,
COUNT( * ) CNT,
MAX( R ) R
FROM
( SELECT VALUE, ROWNUM R FROM FOO
)
GROUP BY VALUE
ORDER BY CNT DESC,
R DESC
)
WHERE
(
ROWNUM < 2
);
That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)
Then sort so that the ones with largest counts come first, and for those with the same count, the one with the largest ROWNUM (indicating most recent insertion in my case).
Then skim off the top row.
Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.
If this doesn't work for your specific case, you'll have to create your own custom aggregator.
Now, if you don't care which mode Oracle is going to pick (your business case just requires a mode and that's it), then STATS_MODE will do fine.
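The tie-breaking idea above (highest count first, then the most recently inserted value) can be tried out with a small sketch. It uses Python's built-in sqlite3 rather than Oracle, and an explicit seq column stands in for ROWNUM as the insertion-order marker; both the column and the sample values are assumptions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is an auto-assigned, increasing key that records insertion order.
conn.execute("CREATE TABLE foo (seq INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO foo (value) VALUES (?)",
    [("a",), ("b",), ("a",), ("b",), ("c",)],
)

row = conn.execute(
    """
    SELECT value, COUNT(*) AS cnt, MAX(seq) AS last_seen
    FROM foo
    GROUP BY value
    ORDER BY cnt DESC, last_seen DESC   -- most frequent, then most recent
    LIMIT 1
    """
).fetchone()
```

Here 'a' and 'b' both occur twice, but 'b' was inserted last, so it wins the tie.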

How do I select 8 random songs from top 50, with unique user_id?

I am trying to get top 50 downloads, and then shuffling (randomizing) 8 results. Plus, the 8 results have to be unique user_id's. I came up with this so far:
Song.select('DISTINCT songs.user_id, songs.*').where(:is_downloadable => true).order('songs.downloads_count DESC').limit(50).sort_by{rand}.slice(0,8)
My only gripe with this is, the last part .sort_by{rand}.slice(0,8) is being done via Ruby. Any way I can do all this via Active Record?
I wonder how the column user_id ended up in the table songs? That means you have one row for every combination of song and user? In a normalized schema, that would be an n:m relationship implemented with three tables:
song(song_id, ...)
usr(usr_id, ...) -- "user" is a reserved word
download (song_id, user_id, ...) -- implementing the n:m relationship
The query in your question yields incorrect results. The same user_id can pop up multiple times. DISTINCT does not do what you seem to expect it to. You need DISTINCT ON or some other method like aggregate or window functions.
You also need to use subqueries or CTEs, because this cannot be done in one step. When you use DISTINCT you cannot at the same time ORDER BY random(), because the sort order cannot disagree with the order dictated by DISTINCT. This query is certainly not trivial.
Simple case, top 50 songs
If you are happy to just pick the top 50 songs (not knowing how many duplicate user_ids are among them), this "simple" case will do:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 50
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC -- pick most popular song per user
-- ORDER BY user_id, random() -- pick random song per user
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the 50 songs with the highest downloads_count. Users can show up multiple times.
Pick 1 song per user. Randomly or the most popular one, that's not defined in your question.
Pick 8 songs with now distinct user_id randomly.
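The three steps above can also be traced on in-memory rows, which may help when checking what each CTE contributes. This is a plain-Python sketch of the same logic, not a replacement for the SQL; pick_songs and the generated song rows are hypothetical.

```python
import random

def pick_songs(songs, top_n=50, sample_size=8, rng=random):
    """songs: iterable of dicts with user_id, downloads_count, is_downloadable."""
    # Step 1: the top_n most downloaded songs; users may repeat here.
    top = sorted(
        (s for s in songs if s["is_downloadable"]),
        key=lambda s: s["downloads_count"],
        reverse=True,
    )[:top_n]
    # Step 2: keep the most popular song per user
    # (the first one seen, since `top` is sorted by downloads).
    best_per_user = {}
    for song in top:
        best_per_user.setdefault(song["user_id"], song)
    # Step 3: random sample of up to sample_size songs with distinct user_id.
    candidates = list(best_per_user.values())
    return rng.sample(candidates, min(sample_size, len(candidates)))

# Hypothetical demo data: 200 songs spread over 20 users.
songs = [
    {"id": i, "user_id": i % 20, "downloads_count": 1000 - i,
     "is_downloadable": i % 7 != 0}
    for i in range(200)
]
picked = pick_songs(songs)
```

If fewer than 8 distinct users appear among the top 50, the function simply returns fewer songs, which mirrors what the SQL does.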
You only need an index on songs.downloads_count for this to be fast:
CREATE INDEX songs_downloads_count_idx ON songs (downloads_count DESC);
Top 50 songs with unique user_id
WITH x AS (
SELECT DISTINCT ON (user_id) *
FROM songs
WHERE is_downloadable
ORDER BY user_id, downloads_count DESC
)
, y AS (
SELECT *
FROM x
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the song with the highest downloads_count per user. Every user can only show up once, so it has to be the one song with the highest downloads_count.
Pick the 50 with highest downloads_count from that.
Pick 8 songs from that randomly.
With a big table, performance will suck, because you have to find the best row for every user before you can proceed. A multi-column index will help, but it will still not be very fast:
CREATE INDEX songs_u_dc_idx ON songs (user_id, downloads_count DESC);
The same, faster
If duplicate user_ids among the top songs are predictably rare, you can use a trick. Pick just enough of the top downloads, so that the top 50 with unique user_id are certainly among them. After this step, proceed like above. This will be much faster with big tables, because the top n rows can be read from the top of an index quickly:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 100 -- adjust to your secure estimate
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC
)
, z AS (
SELECT *
FROM y
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM z
ORDER BY random()
LIMIT 8;
The index from the simple case above will suffice to make it almost as fast as the simple case.
This would fall short if fewer than 50 distinct users are among the top 100 "songs".
All queries should work with PostgreSQL 8.4 or later.
If it has to be faster, yet, create a materialized view that holds the pre-selected top 50, and rewrite that table in regular intervals or triggered by events. If you make heavy use of this and the table is big, I would go for that. Otherwise it's not worth the overhead.
Generalized, improved solution
I later formalized and improved this approach further to be applicable to a whole class of similar problems under this related question at dba.SE.
You could use PostgreSQL's RANDOM() function in the order by, making it
___.order('songs.downloads_count DESC, RANDOM()').limit(8)
This doesn't work as is, though, because with SELECT DISTINCT PostgreSQL requires the expressions used in the ORDER BY to appear in the SELECT list. You'll get an error like
ActiveRecord::StatementInvalid: PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
The only way to do what you're asking all in SQL (using PostgreSQL) is with a subquery, which may or may not be a better solution for you. If it is, your best bet is to write out the full query/subquery using find_by_sql.
I'm happy to help come up with the SQL, though now that you know about RANDOM(), it should be pretty trivial.

Find row number in a sort based on row id, then find its neighbours

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number supported in Firebird 3.0)? I am by no way an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?
Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
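To find the "row" number of a row with a given id (the numbering has to happen in a subquery, because the where clause is applied before window functions are evaluated):
select rownum
from (select id, row_number() over (partition by NULL order by name, id) as rownum
from t
) s
where id = <id>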
This assumes the id's are unique.
To solve the second problem:
select t.*
from (select id, row_number() over (partition by NULL order by name, id) as rownum
from t
) t join
(select rownum
from (select id, row_number() over (partition by NULL order by name, id) as rownum
from t
) s
where id = <id> -- filter after numbering, otherwise rownum is always 1
) tid
on t.rownum between tid.rownum - 5 and tid.rownum + 5
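A runnable sketch of the neighbour lookup follows, using Python's built-in sqlite3 in place of Firebird (window functions need SQLite 3.25 or later). Because window functions are evaluated after WHERE, the sketch numbers all rows in a CTE first and filters afterwards. The neighbours helper and the generated names are assumptions for the example.

```python
import sqlite3

def neighbours(conn, person_id, radius=5):
    """Rows within `radius` positions of person_id when sorted by (name, id)."""
    return conn.execute(
        """
        WITH numbered AS (
            SELECT id, name, row_number() OVER (ORDER BY name, id) AS rn
            FROM people
        )
        SELECT n.rn, n.id, n.name
        FROM numbered n
        JOIN numbered target ON target.id = ?
        WHERE n.rn BETWEEN target.rn - ? AND target.rn + ?
        ORDER BY n.rn
        """,
        (person_id, radius, radius),
    ).fetchall()

# Hypothetical demo data: 100 people with distinct, shuffled names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(i, f"name{(i * 7) % 100:03d}") for i in range(1, 101)],
)
rows = neighbours(conn, person_id=50)
```

This still numbers the whole table on every call, so on a few million rows it demonstrates correctness rather than speed.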
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column when a row is inserted. If your records are never deleted, this can serve as your counter, simplifying your queries.

JOIN after processing SELECT

Given the following schema:
CREATE TABLE players (
id BIGINT PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE trials (
timestamp TIMESTAMP PRIMARY KEY,
player BIGINT,
score NUMERIC
);
How would I create a SELECT that first finds the best scores from trials, then joins the name field from players? I've been able to get the scores I'm after using this query:
SELECT * FROM trials GROUP BY player ORDER BY score ASC LIMIT 10;
And my query for returning the top 10 scores looks like:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
AND trial.score = (
SELECT MAX(score)
FROM trials AS tsub
WHERE tsub.player = trial.player
)
ORDER BY trial.score DESC, trial.timestamp ASC
LIMIT 10;
But when I hit thousands of entries in the tables, the DB performance starts to crawl. I figure the subquery is killing my performance. The first query (returning only the top scores) still performs adequately, so I was wondering if there is a way to force a JOIN operation to occur after the top scores have been selected.
EDIT Note that the query will return the top 10 ranked players, not just the top 10 scores. If the same player has many high scores, he should only show up once in the top 10 list.
I'm using SQLite, so it doesn't have some of the extended features of SQL Server or MySQL.
Don't have sqlite running, hope the limit is right.
select players.name, trials.player, trials.timestamp, trials.score from
(select player, score, timestamp from
trials order by score desc, timestamp asc limit 10) trials, players
where players.id = trials.player
Regards
This is an instance of you making something harder than it needs to be. The correct code is:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
ORDER BY trial.score ASC, trial.timestamp ASC
LIMIT 10;
Basically, let the LIMIT statement do the work :)
A subquery in a WHERE can be expensive if the optimizer runs it for every row.
(Edit) Here's another way to write the query, now with an exclusive join: it says there's no row for that player with a higher score:
SELECT
(select name from players where id = cur.player) as UserName
, cur.score as MaxScore
FROM trials cur
LEFT JOIN trials higher
ON higher.player = cur.player
AND higher.timestamp <> cur.timestamp
AND higher.score > cur.score
WHERE higher.player is null
ORDER BY cur.score DESC
LIMIT 10
This would return the 10 highest scoring users. If you'd like the 10 highest scores regardless of user, check Silas' answer.
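Since the question targets SQLite, the exclusive-join idea can be checked directly with Python's built-in sqlite3. The sketch below uses the schema from the question; the sample players and scores and the index name trials_player_score_idx are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE players (id BIGINT PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE trials  (timestamp TIMESTAMP PRIMARY KEY, player BIGINT, score NUMERIC);
    -- Hypothetical index so the self-join can look up scores per player.
    CREATE INDEX trials_player_score_idx ON trials (player, score);
    INSERT INTO players VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO trials VALUES
        ('2020-01-01 10:00', 1, 50), ('2020-01-01 10:05', 1, 90),
        ('2020-01-01 10:10', 2, 70), ('2020-01-01 10:15', 3, 80),
        ('2020-01-01 10:20', 3, 60);
    """
)

top_players = conn.execute(
    """
    SELECT p.name, cur.score
    FROM trials cur
    JOIN players p ON p.id = cur.player
    LEFT JOIN trials higher
           ON higher.player = cur.player
          AND higher.score  > cur.score
    WHERE higher.player IS NULL       -- no higher score exists for this player
    ORDER BY cur.score DESC, cur.timestamp ASC
    LIMIT 10
    """
).fetchall()
```

One caveat: if a player reaches the same best score in several trials, each of those trials passes the filter, so the player would appear more than once.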
As has been mentioned, since your identifying key between players and trials is the player.id and trials.player, you should have an index on trials.player. Particularly if you relate those two tables a lot.
Also you might try making your query more like.
SELECT p.name as name, t.* FROM players as p
INNER JOIN (SELECT * FROM trials
WHERE trials.score = (SELECT MAX(score) FROM trials as tsub WHERE tsub.player = trials.player)
ORDER BY score DESC, timestamp ASC -- sort before limiting, otherwise the 10 rows are arbitrary
LIMIT 10) as t ON t.player = p.id
ORDER BY t.score DESC, t.timestamp ASC
This might even be able to be optimized a little more, but I'm no good at that without some data to throw the query at.