PostgreSQL: five-star rating, ordering objects

In my database, users can give any group a vote of 1 to 5 stars.
Then I have to display a leaderboard based on those votes.
Until now I have ordered groups by their unweighted vote average. This is not ideal, because a group with a 5.0 average from 20 votes ranks above a group with a 4.9 average from 10,000 votes.
This is my votes table:
CREATE TABLE IF NOT EXISTS votes(
    user_id BIGINT,
    group_id BIGINT,
    vote SMALLINT,
    vote_date timestamp,
    PRIMARY KEY (user_id, group_id)
);
This is how I sort them now:
SELECT
    group_id,
    COUNT(vote) AS amount,
    ROUND(AVG(vote), 1) AS average,
    RANK() OVER (PARTITION BY s.lang ORDER BY ROUND(AVG(vote), 1) DESC, COUNT(vote) DESC)
FROM votes
LEFT OUTER JOIN supergroups AS s
    USING (group_id)
GROUP BY group_id, s.lang, s.banned_until, s.bot_inside
HAVING
    (s.banned_until IS NULL OR s.banned_until < now())
    AND COUNT(vote) >= %s
    AND s.bot_inside IS TRUE;
How could I add a weight of this kind to solve the problem described above?
I read about the Bayesian approach here, but I am not sure it's the right tool: what I read was about sorting the top 'n' elements, while I need a leaderboard that includes every group.

You're going to have to fudge it somehow; perhaps this way:
ORDER BY (0.0 + SUM(vote)) / (COUNT(vote) + LOG(COUNT(vote)))
Or SQRT might work better than LOG; it depends how much weight you want the population size to have:
ORDER BY (0.0 + SUM(vote)) / (COUNT(vote) + SQRT(COUNT(vote)))
Basically, the fudge needs to be a function that increases at a slower rate than its input. You could even try a constant.
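As a sanity check, the fudge can be run through Python against the numbers from the question (20 votes averaging 5.0 vs 10,000 votes averaging 4.9). Note that Postgres's LOG() is base 10, so math.log10 is its stand-in here:

```python
import math

def weighted_score(total, count):
    """sum(vote) / (count(vote) + log(count(vote))), log base 10 as in Postgres."""
    return total / (count + math.log10(count))

# Group with twenty 5-star votes vs. group with 10,000 votes averaging 4.9.
score_small = weighted_score(5.0 * 20, 20)        # damped below its 5.0 average
score_big = weighted_score(4.9 * 10000, 10000)    # barely damped at all

# The damped score ranks the heavily-voted group first.
print(score_small < score_big)  # True
```

The damping shrinks as the vote count grows, which is exactly the "slower than its input" property asked for.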

Related

How to efficiently get a range of ranked users (for a leaderboard) using Postgresql

I have read many posts on this topic, such as
mysql-get-rank-from-leaderboards.
However, none of the solutions are efficient at scale for getting a range of ranks from the database.
The problem is simple. Suppose we have a Postgres table with an "id" column and another INTEGER column whose values are not unique, but we have an index for this column.
e.g. table could be:
CREATE TABLE my_game_users (id serial PRIMARY KEY, rating INTEGER NOT NULL);
The goal
Define a rank for users by ordering them on the "rating" column, descending
Be able to query for a list of ~50 users ordered by this new "rank", centered at any particular user
For example, we might return users with ranks { 15, 16, ..., 64, 65 } where the center user has rank #40
Performance must scale, e.g. be under 80 ms for 100,000 users.
Attempt #1: row_number() window function
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank
FROM my_game_users)
SELECT *
FROM my_ranks
WHERE rank >= 4000 AND rank <= 4050
ORDER BY rank ASC;
This "works", but the queries average 550ms with 100,000 users on a fast laptop without any other real work being done.
I tried adding indexes, and re-phrasing this query to not use the "WITH" syntax, and nothing worked to speed it up.
Attempt #2 - count the number of rows with a greater rating value
I tried a query like this:
SELECT t1.*,
(SELECT COUNT(*)
FROM my_game_users t2
WHERE (t1.rating, -t1.id) <= (t2.rating, -t2.id)
) AS rank
FROM my_game_users t1
WHERE id = 2000;
This is decent, this query takes about 120ms with 100,000 users having random ratings. However, this only returns the rank for user with a particular id (2000).
I can't see any efficient way to extend this query to get a range of ranks. Any attempt at extending this makes a very slow query.
I only know the ID of the "center" user, since the users have to be ordered by rank before we know which ones are in the range!
Attempt #3: in-memory ordered Tree
I ended up using a Java TreeSet to store the ranks. I can update the TreeSet whenever a new user is inserted into the database, or a user's rating changes.
This is super fast, around 25 ms with 100,000 users.
However, it has a serious drawback that it's only updated on the Webapp node that serviced the request. I'm using Heroku and will deploy multiple nodes for my app. So, I needed to add a scheduled task for the server to re-build this ranking tree every hour, to make sure the nodes don't get too out-of-sync!
If anyone knows of an efficient way to do this in Postgres with full solution, then I am all ears!
You can get the same results by ordering by rating desc and using OFFSET and LIMIT to get the users between two ranks.
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank FROM my_game_users)
SELECT * FROM my_ranks WHERE rank >= 4000 AND rank <= 4050 ORDER BY rank ASC;
The query above is the same as:
select *, rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 51 offset 3999
(OFFSET skips rows rather than naming a starting rank, so ranks 4000-4050 are OFFSET 3999, LIMIT 51.)
If you want to select users around rank #40 you could select ranks #15-#64:
select *, rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 50 offset 14
Thanks, @FuzzyTree!
Your solution doesn't quite give me everything I need, but it nudged me in the right direction. Here's the full solution I'm going with for now.
The only limitation with your solution is that there's no way to get a unique rank for a particular user. All users with the same rating would have the same rank (or at least their relative order is undefined by the SQL standard). If I knew the OFFSET ahead of time, then your rank would be good enough, but I have to get the rank of a particular user first.
My solution is to do the following query to get a range of ranks:
SELECT * FROM my_game_users ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?
This is basically uniquely defining the ranks by rating, then by who joined the Game first (lower id).
To make this efficient I'm creating an index on (rating DESC, id)
Then, I'm getting a particular user's rank to plug in to this query with:
SELECT COUNT(*) FROM my_game_users WHERE rating > ? OR (rating = ? AND id < ?)
I actually made this more efficient with:
SELECT (SELECT COUNT(*) FROM my_game_users WHERE rating > ?) + (SELECT COUNT(*) FROM my_game_users WHERE rating = ? AND id < ?) + 1
Now, even with these queries it takes about 78 ms (average and median) to get the ranks around a user. If anyone has a good idea how to speed these up, I'm all ears!
For example, getting a range of ranks takes about 60ms, and explaining it yields:
EXPLAIN SELECT * FROM word_users ORDER BY rating DESC, id ASC LIMIT 50 OFFSET 50000;
"Limit (cost=6350.28..6356.63 rows=50 width=665)"
" -> Index Scan using idx_rating_desc_and_id on word_users (cost=0.29..12704.83 rows=100036 width=665)"
So, it's using the rating and id index, yet it still has this highly variable cost from 0.29..12704.83. Any ideas how to improve?
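The two queries from this solution can be exercised end-to-end in SQLite (ratings below are invented sample data; the same SQL runs on Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_game_users (id INTEGER PRIMARY KEY, rating INTEGER NOT NULL)")
conn.execute("CREATE INDEX idx_rating_desc_and_id ON my_game_users (rating DESC, id)")
conn.executemany("INSERT INTO my_game_users (id, rating) VALUES (?, ?)",
                 [(1, 100), (2, 300), (3, 200), (4, 300), (5, 50)])

def rank_of(user_id):
    # A user is ahead of me if their rating is higher, or equal with a lower id.
    rating, = conn.execute("SELECT rating FROM my_game_users WHERE id = ?",
                           (user_id,)).fetchone()
    ahead, = conn.execute(
        "SELECT COUNT(*) FROM my_game_users WHERE rating > ? OR (rating = ? AND id < ?)",
        (rating, rating, user_id)).fetchone()
    return ahead + 1

def ranks_around(user_id, span=2):
    # Plug the user's rank into the LIMIT/OFFSET range query.
    offset = max(rank_of(user_id) - 1 - span, 0)
    return conn.execute(
        "SELECT id FROM my_game_users ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?",
        (2 * span + 1, offset)).fetchall()

print(rank_of(3))         # two users have rating 300, so user 3 is rank 3
print(ranks_around(3))    # the window of ids centered on user 3
```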
If you order it in desc order you have it in the right order. Use the row_number() function.
Select Row number in postgres
Also, you could use an in-memory cache such as Redis to store this. It's a separate application that can serve multiple instances, even remotely.

Add value from another table to my resultset (e.g. find username for user_id)

Some friends dragged me into writing an IRC bot that helps monitor the consumption of fluids throughout the day. Every user in our channel can submit an amount in liters every time he/she drinks something, and that value is stored in a drinks_today table, which is reset at the end of the day. The bot uses SQLite for data storage.
I am stuck finding an SQL-only way to determine the top 3 drinkers of the day.
I have the following database tables:
CREATE TABLE users(user_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, active_days INTEGER DEFAULT 0, drinks_total FLOAT DEFAULT 0);
CREATE TABLE drinks_today(user_id INTEGER, amount FLOAT, timestamp INTEGER, FOREIGN KEY(user_id) REFERENCES users(user_id));
I can find the top 3 user_ids as follows:
SELECT user_id,drinks_sum FROM ( SELECT SUM(amount) AS drinks_sum,user_id FROM drinks_today GROUP BY user_id ) ORDER BY drinks_sum DESC LIMIT 3;
The result will be:
1|9.0
4|8.5
3|6.0
Now I am looking for a way to (correctly) map the username into the result set. I tried the following statement, but the result was not correct:
SELECT u.name,drinks_sum FROM ( SELECT SUM(d.amount) AS drinks_sum FROM drinks_today d GROUP BY d.user_id) JOIN users AS u ON u.user_id=user_id ORDER BY drinks_sum DESC LIMIT 3;
The result set will contain the first three users of users table and each will be equipped with the one top score. Which is, of course, completely wrong.
How can I get the username into my result set?
I think you can do this all in one:
SELECT u.user_id, u.name, SUM(dt.amount) AS drunk
FROM users u
INNER JOIN drinks_today dt ON dt.user_id = u.user_id
GROUP BY u.user_id, u.name
ORDER BY drunk DESC -- or, if your dialect dislikes ordering by an alias, ORDER BY SUM(dt.amount) DESC
LIMIT 3
Edit
Enjoy responsibly.
Cheers.
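A quick SQLite check of the join above, using the totals from the question (the user names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users(user_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT,
                   active_days INTEGER DEFAULT 0, drinks_total FLOAT DEFAULT 0);
CREATE TABLE drinks_today(user_id INTEGER, amount FLOAT, timestamp INTEGER,
                          FOREIGN KEY(user_id) REFERENCES users(user_id));
""")
conn.executemany("INSERT INTO users (user_id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")])
# Totals match the question's result set: 1 -> 9.0, 4 -> 8.5, 3 -> 6.0
conn.executemany("INSERT INTO drinks_today (user_id, amount) VALUES (?, ?)",
                 [(1, 4.0), (1, 5.0), (4, 8.5), (3, 6.0), (2, 1.0)])

top3 = conn.execute("""
    SELECT u.user_id, u.name, SUM(dt.amount) AS drunk
    FROM users u
    INNER JOIN drinks_today dt ON dt.user_id = u.user_id
    GROUP BY u.user_id, u.name
    ORDER BY drunk DESC
    LIMIT 3
""").fetchall()
print(top3)  # [(1, 'alice', 9.0), (4, 'dave', 8.5), (3, 'carol', 6.0)]
```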

How do I select 8 random songs from top 50, with unique user_id?

I am trying to get top 50 downloads, and then shuffling (randomizing) 8 results. Plus, the 8 results have to be unique user_id's. I came up with this so far:
Song.select('DISTINCT songs.user_id, songs.*').where(:is_downloadable => true).order('songs.downloads_count DESC').limit(50).sort_by{rand}.slice(0,8)
My only gripe with this is, the last part .sort_by{rand}.slice(0,8) is being done via Ruby. Any way I can do all this via Active Record?
I wonder how the column user_id ended up in the table songs? That means you have one row for every combination of song and user? In a normalized schema, that would be an n:m relationship implemented with three tables:
song(song_id, ...)
usr(usr_id, ...) -- "user" is a reserved word
download (song_id, user_id, ...) -- implementing the n:m relationship
The query in your question yields incorrect results. The same user_id can pop up multiple times. DISTINCT does not do what you seem to expect it to. You need DISTINCT ON or some other method like aggregate or window functions.
You also need to use subqueries or CTEs, because this cannot be done in one step. When you use DISTINCT you cannot at the same time ORDER BY random(), because the sort order cannot disagree with the order dictated by DISTINCT. This query is certainly not trivial.
Simple case, top 50 songs
If you are happy to just pick the top 50 songs (not knowing how many duplicate user_ids are among them), this "simple" case will do:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 50
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC -- pick most popular song per user
-- ORDER BY user_id, random() -- pick random song per user
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the 50 songs with the highest downloads_count. Users can show up multiple times.
Pick 1 song per user. Randomly, or the most popular one; that's not defined in your question.
Pick 8 songs, now with distinct user_id, at random.
You only need an index on songs.downloads_count for this to be fast:
CREATE INDEX songs_downloads_count_idx ON songs (downloads_count DESC);
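The simple case is easy to mirror in plain Python, which also makes the three steps explicit (the song rows below are invented sample data):

```python
import random

# (song_id, user_id, downloads_count) -- invented sample rows
songs = [(1, 10, 900), (2, 10, 800), (3, 11, 700), (4, 12, 600), (5, 12, 500)]

# Step 1: top 50 by downloads_count (all 5 rows here).
top = sorted(songs, key=lambda s: s[2], reverse=True)[:50]

# Step 2: DISTINCT ON (user_id), keeping the most popular song per user.
best_per_user = {}
for song in top:  # already sorted by downloads_count DESC
    best_per_user.setdefault(song[1], song)

# Step 3: pick up to 8 of those at random.
pick = random.sample(list(best_per_user.values()), k=min(8, len(best_per_user)))

assert len({s[1] for s in pick}) == len(pick)  # user_ids are unique
```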
Top 50 songs with unique user_id
WITH x AS (
SELECT DISTINCT ON (user_id) *
FROM songs
WHERE is_downloadable
ORDER BY user_id, downloads_count DESC
)
, y AS (
SELECT *
FROM x
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the song with the highest downloads_count per user. Every user can only show up once, so it has to be their one song with the highest downloads_count.
Pick the 50 with the highest downloads_count from those.
Pick 8 songs from that set at random.
With a big table, performance will suck, because you have to find the best row for every user before you can proceed. A multi-column index will help, but it will still not be very fast:
CREATE INDEX songs_u_dc_idx ON songs (user_id, downloads_count DESC);
The same, faster
If duplicate user_ids among the top songs are predictably rare, you can use a trick. Pick just enough of the top downloads, so that the top 50 with unique user_id are certainly among them. After this step, proceed like above. This will be much faster with big tables, because the top n rows can be read from the top of an index quickly:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 100 -- adjust to your secure estimate
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC
)
, z AS (
SELECT *
FROM y
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM z
ORDER BY random()
LIMIT 8;
The index from the simple case above will suffice to make it almost as fast as the simple case.
This would fall short if fewer than 50 distinct users are among the top 100 songs.
All queries should work with PostgreSQL 8.4 or later.
If it has to be faster yet, create a materialized view that holds the pre-selected top 50, and rewrite that table at regular intervals or triggered by events. If you make heavy use of this and the table is big, I would go for that. Otherwise it's not worth the overhead.
Generalized, improved solution
I later formalized and improved this approach further to be applicable to a whole class of similar problems under this related question at dba.SE.
You could use PostgreSQL's RANDOM() function in the ORDER BY, making it
___.order('songs.downloads_count DESC, RANDOM()').limit(8)
This doesn't work, though, because PostgreSQL requires the columns used in the ORDER BY to be found in the SELECT. You'll get an error like
ActiveRecord::StatementInvalid: PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
The only way to do what you're asking all in SQL (using PostgreSQL) is with a subquery, which may or may not be a better solution for you. If it is, your best bet is to write out the full query/subquery using find_by_sql.
I'm happy to help come up with the SQL, though now that you know about RANDOM(), it should be pretty trivial.

Optimizing a category filter

This recent question had me thinking about optimizing a category filter.
Suppose we wish to create a database referencing a huge number of audio tracks, with their release date and a list of world locations from which the audio track is downloadable.
The requests we wish to optimize for are:
Give me the 10 most recent tracks downloadable from location A.
Give me the 10 most recent tracks downloadable from locations A or B.
Give me the 10 most recent tracks downloadable from locations A and B.
How would one go about structuring that database ? I have a hard time coming up with a simple solution that doesn't require reading through all the tracks for at least one location...
To optimise these queries, you need to slightly de-normalise the data.
For example, you may have a track table that contains the track's id, name and release date, and a map_location_to_track table that describes where those tracks can be downloaded from. To answer "10 most recent tracks for location A" you need to get ALL of the tracks for location A from map_location_to_track, then join them to the track table to order them by release date, and pick the top 10.
If instead all the data is in a single table, the ordering step can be avoided. For example...
CREATE TABLE map_location_to_track (
location_id INT,
track_id INT,
release_date DATETIME,
PRIMARY KEY (location_id, release_date, track_id)
)
SELECT * FROM map_location_to_track
WHERE location_id = A
ORDER BY release_date DESC LIMIT 10
Having location_id as the first column in the primary key ensures that the WHERE clause is a simple index seek. There is then no requirement to re-order the data; it's already ordered for us by the primary key, so we just pick the 10 records from the end.
You may indeed still join on to the track table to get the name, price, etc, but you now only have to do that for 10 records, not everything at that location.
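Here's the single-location case run through SQLite to show the shape of the result (track IDs and dates are invented; release_date is TEXT here since SQLite has no DATETIME type):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE map_location_to_track (
    location_id INT, track_id INT, release_date TEXT,
    PRIMARY KEY (location_id, release_date, track_id))""")

# Location 1 gets 14 tracks released on successive days; location 2 gets 4.
rows = [(1, t, f"2020-01-{t:02d}") for t in range(1, 15)] + \
       [(2, t, f"2020-02-{t:02d}") for t in range(1, 5)]
conn.executemany("INSERT INTO map_location_to_track VALUES (?, ?, ?)", rows)

# The composite key satisfies both the filter and the ordering.
top10 = conn.execute("""
    SELECT track_id FROM map_location_to_track
    WHERE location_id = 1
    ORDER BY release_date DESC LIMIT 10
""").fetchall()
print([t for (t,) in top10])  # the 10 most recent of location 1's 14 tracks
```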
To solve the same query for "locations A OR B", there are a couple of options that can perform differently depending on the RDBMS you are using.
The first is simple, though some RDBMS don't play nice with IN...
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id IN (A, B)
GROUP BY track_id, release_date
ORDER BY release_date DESC LIMIT 10
The next option is nearly identical, but still some RDBMS don't play nice with OR logic being applied to INDEXes.
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A or location_id = B
GROUP BY track_id, release_date
ORDER BY release_date DESC LIMIT 10
In either case, the algorithm used to whittle the list of records down to 10 is hidden from you. It's a matter of try it and see; the index is still available, so this CAN be performant.
An alternative is to explicitly determine part of the approach in your SQL statement...
SELECT
*
FROM
(
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A
ORDER BY release_date DESC LIMIT 10
UNION
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = B
ORDER BY release_date DESC LIMIT 10
)
AS data
ORDER BY
release_date DESC
LIMIT 10
-- NOTE: This is a UNION and not a UNION ALL
-- The same track can be available in both locations, but should only count once
-- It's in place of the GROUP BY in the previous 2 examples
It is still possible for an optimiser to realise that these two unioned data sets are ordered, and so make the external order by very quick. Even if not, however, ordering 20 items is pretty quick. More importantly, it's a fixed overhead: it doesn't matter if you have a billion tracks in each location, we're just merging two lists of 10.
The hardest to optimise is the AND condition, but even then the existence of the "TOP 10" constraint can help work wonders.
Adding a HAVING clause to the IN or OR based approaches can solve this, but, again, depending on your RDBMS, may run less than optimally.
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A or location_id = B
GROUP BY track_id, release_date
HAVING COUNT(*) = 2
ORDER BY release_date DESC LIMIT 10
The alternative is to try the "two queries" approach...
SELECT
location_a.*
FROM
(
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A
)
AS location_a
INNER JOIN
(
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = B
)
AS location_b
ON location_a.release_date = location_b.release_date
AND location_a.track_id = location_b.track_id
ORDER BY
location_a.release_date DESC
LIMIT 10
This time we can't restrict the two sub-queries to just 10 records; for all we know, the most recent 10 in location A don't appear in location B at all. The primary key rescues us again, though: the two data sets are organised by release date, so the RDBMS can just start at the top record of each set and merge the two until it has 10 records, then stop.
NOTE: Because the release_date is in the primary key, and before the track_id, one should ensure that it is used in the join.
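The merge we're hoping the optimiser performs can be sketched directly: walk both lists from the newest end, emitting matches until the limit is hit. Both lists here hold invented (release_date, track_id) pairs, sorted descending:

```python
# One date-ordered list per location, newest first (invented data).
loc_a = [(9, 5), (8, 3), (7, 1), (6, 2), (5, 4)]
loc_b = [(9, 5), (7, 1), (6, 9), (5, 4)]

def newest_common(a, b, limit=10):
    """Merge-intersect two descending-sorted lists, stopping after `limit` hits."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b) and len(out) < limit:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] > b[j]:   # a's current entry sorts newer; advance a
            i += 1
        else:
            j += 1
    return out

print(newest_common(loc_a, loc_b))  # tracks available in both locations, newest first
```

The cost is bounded by how far down the lists the 10th match sits, not by the total number of tracks, which is the whole point of the date-first key.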
Depending on the RDBMS, you don't even need the sub-queries. You may be able to just self-join the table without altering the RDBMS' plan...
SELECT
location_a.*
FROM
map_location_to_track AS location_a
INNER JOIN
map_location_to_track AS location_b
ON location_a.release_date = location_b.release_date
AND location_a.track_id = location_b.track_id
WHERE
location_a.location_id = A
AND location_b.location_id = B
ORDER BY
location_a.release_date DESC
LIMIT 10
All in all, the combination of three things makes this pretty efficient:
- Partially De-Normalising the data to ensure it's in a friendly order for our needs
- Knowing we only ever need the first 10 results
- Knowing we're only ever dealing with 2 locations at the most
There are variations that can optimise to any number of records and any number of locations, but these are significantly less performant than the problem stated in this question.
In a classic relational schema you would have a many-to-many relationship between tracks and locations in order to avoid redundancy:
CREATE TABLE tracks (
id INT,
...
release_date DATETIME,
PRIMARY KEY (id)
)
CREATE TABLE locations (
id INT,
...
PRIMARY KEY (id)
)
CREATE TABLE tracks_locations (
location_id INT,
track_id INT,
...
PRIMARY KEY (location_id, track_id)
)
SELECT tracks.* FROM tracks_locations LEFT JOIN tracks ON tracks.id = tracks_locations.track_id
WHERE tracks_locations.location_id = A
ORDER BY tracks.release_date DESC LIMIT 10
You could modify that schema using table partitions by location. Problem is that it depends on implementation issues or usage constraints. For example, AFAIK in MySQL you cannot have foreign keys in partitioned tables. To solve this you could also have a collection of tables (call it "partitioning by hand") like tracks_by_location_#, where # is the ID of a known location. These tables could store filtered results and be created/updated/deleted using triggers.

JOIN after processing SELECT

Given the following schema:
CREATE TABLE players (
id BIGINT PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE trials (
timestamp TIMESTAMP PRIMARY KEY,
player BIGINT,
score NUMERIC
);
How would I create a SELECT that first finds the best scores from trials, then joins in the name field from players? I've been able to get the scores I'm after using this query:
SELECT * FROM trials GROUP BY player ORDER BY score ASC LIMIT 10;
And my query for returning the top 10 scores looks like:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
AND trial.score = (
SELECT MAX(score)
FROM trials AS tsub
WHERE tsub.player = trial.player
)
ORDER BY trial.score DESC, trial.timestamp ASC
LIMIT 10;
But when I hit thousands of entries in the tables, the DB performance starts to crawl. I figure the subquery is killing my performance. The first query (returning only the top scores) still performs adequately, so I was wondering if there is a way to force a JOIN operation to occur after the top scores have been selected.
EDIT Note that the query will return the top 10 ranked players, not just the top 10 scores. If the same player has many high scores, he should only show up once in the top 10 list.
I'm using SQLite, so it doesn't have some of the extended features of SQL Server or MySQL.
Don't have sqlite running, hope the limit is right.
select players.name, trials.player, trials.timestamp, trials.score
from (select player, score, timestamp
      from trials
      order by score desc, timestamp asc
      limit 10) trials, players
where players.id = trials.player
Regards
This is an instance of you making something harder than it needs to be. The correct code is:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
ORDER BY trial.score DESC, trial.timestamp ASC
LIMIT 10;
Basically, let the LIMIT statement do the work :)
A subquery in a WHERE can be expensive if the optimizer runs it for every row.
(Edit) Here's another way to write the query, now with an exclusive join: it says there's no row for that player with a higher score:
SELECT
    (SELECT name FROM players WHERE id = cur.player) AS name
    , cur.score AS max_score
FROM trials cur
LEFT JOIN trials higher
    ON higher.player = cur.player
    AND higher.timestamp <> cur.timestamp
    AND higher.score > cur.score
WHERE higher.player IS NULL
ORDER BY cur.score DESC
LIMIT 10
This would return the 10 highest-scoring players. If you'd like the 10 highest scores regardless of player, check Silas' answer.
As has been mentioned, since your identifying key between players and trials is players.id and trials.player, you should have an index on trials.player, particularly if you relate those two tables a lot.
Also, you might try making your query more like:
SELECT p.name AS name, t.*
FROM players AS p
INNER JOIN (SELECT *
            FROM trials
            WHERE trials.score = (SELECT MAX(score)
                                  FROM trials AS tsub
                                  WHERE tsub.player = trials.player)
            LIMIT 10) AS t ON t.player = p.id
ORDER BY t.score DESC, t.timestamp ASC
This might even be able to be optimized a little more, but I'm no good at that without some data to throw the query at.
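As a cross-check of the "one row per player" requirement from the EDIT, here's a grouped MAX(score) variant run against the question's schema in SQLite (player names and scores invented). SQLite lets the GROUP BY do the deduplication directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (id BIGINT PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE trials (timestamp TIMESTAMP PRIMARY KEY, player BIGINT, score NUMERIC);
""")
conn.executemany("INSERT INTO players VALUES (?, ?)",
                 [(1, "ann"), (2, "ben"), (3, "cai")])
# ann has two high scores but must appear only once in the leaderboard.
conn.executemany("INSERT INTO trials VALUES (?, ?, ?)",
                 [("t1", 1, 90), ("t2", 1, 95), ("t3", 2, 80), ("t4", 3, 70)])

leaderboard = conn.execute("""
    SELECT p.name, MAX(t.score) AS best
    FROM trials t JOIN players p ON p.id = t.player
    GROUP BY t.player
    ORDER BY best DESC
    LIMIT 10
""").fetchall()
print(leaderboard)  # [('ann', 95), ('ben', 80), ('cai', 70)]
```

The join touches every trial row, so it doesn't dodge the performance concern on its own, but an index on trials(player, score) would let the MAX per player come straight off the index.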