SQL LIMIT, but from the end - sql

Can you LIMIT a query from the end of the results, rather than from the beginning? In particular, I'm looking for a solution with PostgreSQL, if that makes a difference.
Allow me to clarify with an example.
Let's say I want to return the 3 oldest people in my people table, but in ascending order of age. The best way I know how to select the 3 people returns the correct records, but in the reverse order:
SELECT * FROM people
ORDER BY age DESC
LIMIT 3

Wrap that query in a subquery and re-sort the result in the outer query:
SELECT * FROM (
SELECT *
FROM PEOPLE
ORDER BY AGE DESC
LIMIT 3 ) X
ORDER BY AGE ASC
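A minimal runnable sketch of the same subquery pattern, using Python's built-in sqlite3 as a stand-in for PostgreSQL (the SQL itself should work unchanged there). The name column, the sample rows, and the helper oldest_ascending are assumptions made up for illustration.

```python
import sqlite3

def oldest_ascending(conn, n):
    """Return the n oldest people, sorted by ascending age."""
    sql = """
        SELECT name, age
        FROM (
            SELECT name, age
            FROM people
            ORDER BY age DESC   -- inner query: pick the n oldest
            LIMIT ?
        ) AS x
        ORDER BY age ASC        -- outer query: present them youngest first
    """
    return conn.execute(sql, (n,)).fetchall()

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", 31), ("Bob", 67), ("Cy", 45), ("Dee", 82), ("Eve", 58)],
)
print(oldest_ascending(conn, 3))
```

The inner ORDER BY decides which rows survive the LIMIT; the outer ORDER BY only decides how they are displayed.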

Related

How does one get the total rows for a partition in postgresql

I'm using a window function to help me paginate through a list of records in the database.
For example
I have a list of dogs and they all have a breed associated with them.
I want to show 10 dogs from each breed to my users.
So that would be
select * from dogs
join (
SELECT id, row_number() OVER (PARTITION BY breed) as row_number FROM dogs
) rn on dogs.id = rn.id
where (row_number between 1 and 10)
That will give me ~ten dogs from each breed.
What I need though is a count. Is there a way to get the count of the partitions? I want to know how many Staffies I have waiting for adoption.
I do notice that there's a percentage and all the docs I find seem to indicate there's something called total rows. But I don't see it.
Just run the window aggregate function count() over the same partition (without adding ORDER BY!) to get the total count for the partition:
SELECT *
FROM (
SELECT *
, row_number() OVER (PARTITION BY breed ORDER BY id) AS rn
, count(*) OVER (PARTITION BY breed) AS breed_count -- !
FROM dogs
) sub
WHERE rn < 11;
Also removed the unnecessary join and simplified.
See:
Run a query with a LIMIT/OFFSET and also get the total number of rows
And I added ORDER BY to the window definition of row_number() to get a deterministic result. Without it, Postgres is free to return any 10 arbitrary rows. Any write to the table (or VACUUM, etc.) can and will change the result without ORDER BY.
Aside, pagination with LIMIT / OFFSET does not scale well. Consider:
Optimize query with OFFSET on large table
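To see the per-partition count next to the row number, here is a small sketch run through Python's sqlite3 (window functions need SQLite 3.25 or later; the same SQL should run on Postgres). The table contents and the helper dogs_page are assumptions for illustration only.

```python
import sqlite3

def dogs_page(conn, per_breed):
    """Return up to per_breed dogs of each breed, plus the breed's total count."""
    sql = """
        SELECT id, breed, rn, breed_count
        FROM (
            SELECT id, breed,
                   row_number() OVER (PARTITION BY breed ORDER BY id) AS rn,
                   count(*)     OVER (PARTITION BY breed)             AS breed_count
            FROM dogs
        ) AS sub
        WHERE rn <= ?
        ORDER BY breed, rn
    """
    return conn.execute(sql, (per_breed,)).fetchall()

# Hypothetical sample data: three Staffies (ids 1-3) and one Beagle (id 4)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, breed TEXT)")
conn.executemany(
    "INSERT INTO dogs (breed) VALUES (?)",
    [("Staffy",), ("Staffy",), ("Staffy",), ("Beagle",)],
)
for row in dogs_page(conn, 2):
    print(row)
```

Only two Staffies come back, but breed_count still reports 3, because the window aggregate is computed before the outer WHERE filters on rn.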

query which returns 10 values with a complex condition

I am writing now a pretty complex query and I am facing now a problem I am not able to solve.
I have a table called tbl with 2 columns:
movie_id, Rank
(INTEGER), (LIKE\DISLIKE\NULL)
I need to write a query that returns the top 10 movies which
have the most number of LIKES.
(If there is equality of likes, they need to ordered by Ascending movie_id)
Edge Cases:
If there are less than 10 movies which have Rank = 'LIKE'
(let's say there are only 7) then I need to return those 7 movie_id's ordered by the number of likes and another 3 movies_id which are ordered by movie_id
(it doesn't matter if there is 'DISLIKE' or NULL in the Rank value)
If there aren't 10 movies on the table then I need to return the movies that are in the table (in the same way explained before, that is, first I need to return the movies ordered by the number of'LIKES' and then the rest ordered by movie_id)
Can someone please help me with this?
Thank you!
I think this does what you describe:
select t.*
from tbl t
order by ( (ranktype = 'like')::int ) desc,
rank desc
fetch first 10 rows only;
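Note that the query above refers to ranktype and rank columns that are not in the schema as described in the question. If the table really holds one row per vote with just movie_id and Rank, an aggregate per movie is probably closer to what was asked. A sketch under that assumption, run through Python's sqlite3 (the helper top_movies and the sample votes are made up):

```python
import sqlite3

def top_movies(conn, n=10):
    """Top n movies by number of LIKE votes; ties and zero-like movies by movie_id."""
    sql = """
        SELECT movie_id,
               SUM(CASE WHEN rank = 'LIKE' THEN 1 ELSE 0 END) AS likes
        FROM tbl
        GROUP BY movie_id
        ORDER BY likes DESC, movie_id ASC
        LIMIT ?
    """
    return conn.execute(sql, (n,)).fetchall()

# Hypothetical sample data: one row per vote, rank may be LIKE, DISLIKE or NULL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (movie_id INTEGER, rank TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(1, "LIKE"), (1, "LIKE"), (2, "DISLIKE"), (3, "LIKE"),
     (3, "LIKE"), (4, None), (5, "LIKE")],
)
print(top_movies(conn))
```

Movies without any LIKE get a count of 0, so they sort after all liked movies and fall back to movie_id order, which covers both edge cases without a UNION.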

How to efficiently get a range of ranked users (for a leaderboard) using Postgresql

I have read many posts on this topic, such as
mysql-get-rank-from-leaderboards.
However, none of the solutions are efficient at scale for getting a range of ranks from the database.
The problem is simple. Suppose we have a Postgres table with an "id" column and another INTEGER column whose values are not unique, but we have an index for this column.
e.g. table could be:
CREATE TABLE my_game_users (id serial PRIMARY KEY, rating INTEGER NOT NULL);
The goal
Define a rank for users ordering users on the "rating" column descending
Be able to query for a list of ~50 users ordered by this new "rank", centered at any particular user
For example, we might return users with ranks { 15, 16, ..., 64, 65 } where the center user has rank #40
Performance must scale, e.g. be under 80 ms for 100,000 users.
Attempt #1: row_number() window function
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank
FROM my_game_users)
SELECT *
FROM my_ranks
WHERE rank >= 4000 AND rank <= 4050
ORDER BY rank ASC;
This "works", but the queries average 550ms with 100,000 users on a fast laptop without any other real work being done.
I tried adding indexes, and re-phrasing this query to not use the "WITH" syntax, and nothing worked to speed it up.
Attempt #2 - count the number of rows with a greater rating value
I tried a query like this:
SELECT t1.*,
(SELECT COUNT(*)
FROM my_game_users t2
WHERE (t1.rating, -t1.id) <= (t2.rating, -t2.id)
) AS rank
FROM my_game_users t1
WHERE id = 2000;
This is decent, this query takes about 120ms with 100,000 users having random ratings. However, this only returns the rank for user with a particular id (2000).
I can't see any efficient way to extend this query to get a range of ranks. Any attempt at extending this makes a very slow query.
I only know the ID of the "center" user, since the users have to be ordered by rank before we know which ones are in the range!
Attempt #3: in-memory ordered Tree
I ended up using a Java TreeSet to store the ranks. I can update the TreeSet whenever a new user is inserted into the database, or a user's rating changes.
This is super fast, around 25 ms with 100,000 users.
However, it has a serious drawback that it's only updated on the Webapp node that serviced the request. I'm using Heroku and will deploy multiple nodes for my app. So, I needed to add a scheduled task for the server to re-build this ranking tree every hour, to make sure the nodes don't get too out-of-sync!
If anyone knows of an efficient way to do this in Postgres with full solution, then I am all ears!
You can get the same results by using order by rating desc and offset and limit to get users between a certain rank.
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank FROM my_game_users)
SELECT * FROM my_ranks WHERE rank >= 4000 AND rank <= 4050 ORDER BY rank ASC;
The query above returns the same rows as
select * , rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 51 offset 3999
If you want to select users around rank #40 you could select ranks #15-#65
select *, rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 51 offset 14
Thanks, @FuzzyTree!
Your solution doesn't quite give me everything I need, but it nudged me in the right direction. Here's the full solution I'm going with for now.
The only limitation with your solution is that there's no way to get a unique rank for a particular user. All users with the same rating would have the same rank (or at least it is undefined by SQL standard). If I knew the OFFSET ahead of time, then your rank would be good enough, but I have to get the rank of a particular user first.
My solution is to do the following query to get a range of ranks:
SELECT * FROM my_game_users ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?
This is basically uniquely defining the ranks by rating, then by who joined the Game first (lower id).
To make this efficient I'm creating an index on (rating DESC, id)
Then, I'm getting a particular user's rank to plug in to this query with:
SELECT COUNT(*) FROM my_game_users WHERE rating > ? OR (rating = ? AND id < ?)
I actually made this more efficient with:
SELECT (SELECT COUNT(*) FROM my_game_users WHERE rating > ?) + (SELECT COUNT(*) FROM my_game_users WHERE rating = ? AND id < ?) + 1
Now, even with these queries it takes about 78ms average and median time to get the ranks around a user. If anyone has a good idea how to speed these up I'm all ears!
For example, getting a range of ranks takes about 60ms, and explaining it yields:
EXPLAIN SELECT * FROM word_users ORDER BY rating DESC, id ASC LIMIT 50 OFFSET 50000;
"Limit (cost=6350.28..6356.63 rows=50 width=665)"
" -> Index Scan using idx_rating_desc_and_id on word_users (cost=0.29..12704.83 rows=100036 width=665)"
So, it's using the rating and id index, yet it still has this highly variable cost from 0.29...12704.83. Any ideas how to improve??
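Putting the two steps together (count the users ahead to get a rank, then page with LIMIT/OFFSET), a rough sketch could look like this. It runs against Python's sqlite3 rather than Postgres, and the helpers rank_of and window_around as well as the sample ratings are assumptions for illustration.

```python
import sqlite3

def rank_of(conn, user_id):
    """1-based rank under ORDER BY rating DESC, id ASC."""
    rating, = conn.execute(
        "SELECT rating FROM my_game_users WHERE id = ?", (user_id,)
    ).fetchone()
    ahead, = conn.execute(
        "SELECT COUNT(*) FROM my_game_users "
        "WHERE rating > ? OR (rating = ? AND id < ?)",
        (rating, rating, user_id),
    ).fetchone()
    return ahead + 1

def window_around(conn, user_id, radius):
    """Return (rank, id, rating) for users within radius ranks of user_id."""
    rank = rank_of(conn, user_id)
    first = max(rank - radius, 1)          # clamp at the top of the board
    limit = rank + radius - first + 1
    rows = conn.execute(
        "SELECT id, rating FROM my_game_users "
        "ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?",
        (limit, first - 1),
    ).fetchall()
    return [(first + i, uid, rating) for i, (uid, rating) in enumerate(rows)]

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_game_users (id INTEGER PRIMARY KEY, rating INTEGER NOT NULL)")
conn.execute("CREATE INDEX idx_rating_desc_and_id ON my_game_users (rating DESC, id)")
conn.executemany(
    "INSERT INTO my_game_users VALUES (?, ?)",
    [(1, 50), (2, 90), (3, 70), (4, 90), (5, 10), (6, 70)],
)
print(window_around(conn, 6, 1))
```

The OFFSET still has to walk past the skipped index entries, so the cost grows with the rank of the center user; this sketch only shows the mechanics, not a fix for that.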
If you order it in desc order you have it in the right order. Use the row_number() function.
Select Row number in postgres
Also, you could use an in-memory cache to store the ranks, something like Redis. It's a separate application that can serve multiple instances, even remotely.

How do I select 8 random songs from top 50, with unique user_id?

I am trying to get top 50 downloads, and then shuffling (randomizing) 8 results. Plus, the 8 results have to be unique user_id's. I came up with this so far:
Song.select('DISTINCT songs.user_id, songs.*').where(:is_downloadable => true).order('songs.downloads_count DESC').limit(50).sort_by{rand}.slice(0,8)
My only gripe with this is, the last part .sort_by{rand}.slice(0,8) is being done via Ruby. Any way I can do all this via Active Record?
I wonder how the column user_id ended up in the table songs? That means you have one row for every combination of song and user? In a normalized schema, that would be an n:m relationship implemented with three tables:
song(song_id, ...)
usr(usr_id, ...) -- "user" is a reserved word
download (song_id, user_id, ...) -- implementing the n:m relationship
The query in your question yields incorrect results. The same user_id can pop up multiple times. DISTINCT does not do what you seem to expect it to. You need DISTINCT ON or some other method like aggregate or window functions.
You also need to use subqueries or CTEs, because this cannot be done in one step. When you use DISTINCT you cannot at the same time ORDER BY random(), because the sort order cannot disagree with the order dictated by DISTINCT. This query is certainly not trivial.
Simple case, top 50 songs
If you are happy to just pick the top 50 songs (not knowing how many duplicate user_ids are among them), this "simple" case will do:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 50
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC -- pick most popular song per user
-- ORDER BY user_id, random() -- pick random song per user
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the 50 songs with the highest downloads_count. Users can show up multiple times.
Pick 1 song per user. Randomly or the most popular one, that's not defined in your question.
Pick 8 songs with now distinct user_id randomly.
You only need an index on songs.downloads_count for this to be fast:
CREATE INDEX songs_downloads_count_idx ON songs (downloads_count DESC);
Top 50 songs with unique user_id
WITH x AS (
SELECT DISTINCT ON (user_id) *
FROM songs
WHERE is_downloadable
ORDER BY user_id, downloads_count DESC
)
, y AS (
SELECT *
FROM x
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the song with the highest downloads_count per user. Every user can only show up once, so it has to be the one song with the highest downloads_count.
Pick the 50 with highest downloads_count from that.
Pick 8 songs from that randomly.
With a big table, performance will suck, because you have to find the best row for every user before you can proceed. A multi-column index will help, but it will still not be very fast:
CREATE INDEX songs_u_dc_idx ON songs (user_id, downloads_count DESC);
The same, faster
If duplicate user_ids among the top songs are predictably rare, you can use a trick. Pick just enough of the top downloads, so that the top 50 with unique user_id are certainly among them. After this step, proceed like above. This will be much faster with big tables, because the top n rows can be read from the top of an index quickly:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 100 -- adjust to your secure estimate
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC
)
, z AS (
SELECT *
FROM y
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM z
ORDER BY random()
LIMIT 8;
The index from the simple case above will suffice to make it almost as fast as the simple case.
This would fall short if fewer than 50 distinct users are among the top 100 "songs".
All queries should work with PostgreSQL 8.4 or later.
If it has to be faster, yet, create a materialized view that holds the pre-selected top 50, and rewrite that table in regular intervals or triggered by events. If you make heavy use of this and the table is big, I would go for that. Otherwise it's not worth the overhead.
Generalized, improved solution
I later formalized and improved this approach further to be applicable to a whole class of similar problems under this related question at dba.SE.
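For databases without DISTINCT ON, the "Top 50 songs with unique user_id" variant can be approximated with the window-function route mentioned earlier. The following is only a sketch, run through Python's sqlite3 (SQLite 3.25 or later); the id column, the sample data, and the helper random_top_songs are assumptions, not part of the original schema.

```python
import sqlite3

def random_top_songs(conn, top_n=50, pick=8):
    """Pick `pick` random songs among the top_n best-per-user songs."""
    sql = """
        WITH best AS (          -- number each user's songs, most downloaded first
            SELECT id, user_id, downloads_count,
                   row_number() OVER (PARTITION BY user_id
                                      ORDER BY downloads_count DESC, id) AS rn
            FROM songs
            WHERE is_downloadable
        ),
        top_songs AS (          -- keep one song per user, then the top_n of those
            SELECT id, user_id, downloads_count
            FROM best
            WHERE rn = 1
            ORDER BY downloads_count DESC
            LIMIT ?
        )
        SELECT id, user_id, downloads_count
        FROM top_songs
        ORDER BY random()       -- shuffle and take a sample
        LIMIT ?
    """
    return conn.execute(sql, (top_n, pick)).fetchall()

# Hypothetical sample data: 20 users with 3 songs each
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "downloads_count INTEGER, is_downloadable INTEGER)"
)
conn.executemany(
    "INSERT INTO songs (user_id, downloads_count, is_downloadable) VALUES (?, ?, 1)",
    [(u, u * 10 + k) for u in range(1, 21) for k in range(3)],
)
for row in random_top_songs(conn, top_n=10, pick=8):
    print(row)
```

The result order is random by design, so repeated calls return different samples; the only guarantees are one song per user and that each song is its user's most downloaded one.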
You could use PostgreSQL's RANDOM() function in the order by, making it
___.order('songs.downloads_count DESC, RANDOM()').limit(8)
This doesn't work, though, because with SELECT DISTINCT PostgreSQL requires the expressions used in the ORDER BY to appear in the SELECT list. You'll get an error like
ActiveRecord::StatementInvalid: PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
The only way to do what you're asking all in SQL (using PostgreSQL) is with a subquery, which may or may not be a better solution for you. If it is, your best bet is to write out the full query/subquery using find_by_sql.
I'm happy to help come up with the SQL, though now that you know about RANDOM(), it should be pretty trivial.

JOIN after processing SELECT

Given the following schema:
CREATE TABLE players (
id BIGINT PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE trials (
timestamp TIMESTAMP PRIMARY KEY,
player BIGINT,
score NUMERIC
);
How would I create a SELECT that first finds the best scores from trials, then joins the name field from players? I've been able to get the scores I'm after using this query:
SELECT * FROM trials GROUP BY player ORDER BY score ASC LIMIT 10;
And my query for returning the top 10 scores looks like:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
AND trial.score = (
SELECT MAX(score)
FROM trials AS tsub
WHERE tsub.player = trial.player
)
ORDER BY trial.score DESC, trial.timestamp ASC
LIMIT 10;
But when I hit thousands of entries in the tables, the DB performance starts to crawl. I figure the subquery is killing my performance. The first query (returning only the top scores) still performs adequately, so I was wondering if there is a way to force a JOIN operation to occur after the top scores have been selected.
EDIT Note that the query will return the top 10 ranked players, not just the top 10 scores. If the same player has many high scores, he should only show up once in the top 10 list.
I'm using SQLite, so it doesn't have some of the extended features of SQL Server or MySQL.
Don't have sqlite running, hope the limit is right.
select players.name, trials.player, trials.timestamp, trials.score from
(select player, score, timestamp from
trials order by score desc, timestamp asc limit 10) trials, players
where players.id = trials.player
Regards
This is an instance of you making something harder than it needs to be. The correct code is:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
ORDER BY trial.score ASC, trial.timestamp ASC
LIMIT 10;
Basically, let the LIMIT statement do the work :)
A subquery in a WHERE can be expensive if the optimizer runs it for every row.
(Edit) Here's another way to write the query, now with an exclusive join: it says there's no row for that user with a higher score:
SELECT
(select name from players where id = cur.player) as UserName
, cur.score as MaxScore
FROM trials cur
LEFT JOIN trials higher
ON higher.player = cur.player
AND higher.timestamp <> cur.timestamp
AND higher.score > cur.score
WHERE higher.player is null
ORDER BY cur.score DESC
LIMIT 10
This would return the 10 highest scoring users. If you'd like the 10 highest scores regardless of user, check Silas' answer.
As has been mentioned, since your identifying key between players and trials is the player.id and trials.player, you should have an index on trials.player. Particularly if you relate those two tables a lot.
Also you might try making your query more like.
SELECT p.name as name, t.* FROM players as p
INNER JOIN (SELECT * FROM trials WHERE trials.score = (SELECT MAX(score) FROM trials as tsub WHERE tsub.player = trials.player) ORDER BY trials.score DESC, trials.timestamp ASC LIMIT 10) as t ON t.player = p.id
ORDER BY t.score DESC, t.timestamp ASC
This might even be able to be optimized a little more, but I'm no good at that without some data to throw the query at.
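One way to get the "select first, join afterwards" shape the question asks for is to aggregate the best score per player in a subquery, limit that, and only then join players. A small sketch using Python's sqlite3 with the schema from the question; the helper top_players, the index name, and the sample rows are assumptions, and "best" is taken to mean the highest score.

```python
import sqlite3

def top_players(conn, n=10):
    """Return (name, best_score) for the n players with the highest best score."""
    sql = """
        SELECT p.name, best.best_score
        FROM (
            SELECT player, MAX(score) AS best_score   -- one row per player
            FROM trials
            GROUP BY player
            ORDER BY best_score DESC
            LIMIT ?                                   -- limit before joining
        ) AS best
        JOIN players AS p ON p.id = best.player
        ORDER BY best.best_score DESC, p.name
    """
    return conn.execute(sql, (n,)).fetchall()

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id BIGINT PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE trials (timestamp TIMESTAMP PRIMARY KEY, player BIGINT, score NUMERIC);
    CREATE INDEX trials_player_score_idx ON trials (player, score);
""")
conn.executemany("INSERT INTO players VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?)",
    [("t1", 1, 10), ("t2", 1, 30), ("t3", 2, 20), ("t4", 3, 25), ("t5", 2, 5)],
)
print(top_players(conn, 2))
```

Each player appears at most once because of the GROUP BY, and the join only ever touches the n rows that survive the LIMIT.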