JOIN after processing SELECT - sql

Given the following schema:
CREATE TABLE players (
id BIGINT PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE trials (
timestamp TIMESTAMP PRIMARY KEY,
player BIGINT,
score NUMERIC
);
How would I create a SELECT that first finds the best scores from trials, then joins the name field from players? I've been able to get the scores I'm after using this query:
SELECT * FROM trials GROUP BY player ORDER BY score ASC LIMIT 10;
And my query for returning the top 10 scores looks like:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
AND trial.score = (
SELECT MAX(score)
FROM trials AS tsub
WHERE tsub.player = trial.player
)
ORDER BY trial.score DESC, trial.timestamp ASC
LIMIT 10;
But when I hit thousands of entries in the tables, the DB performance starts to crawl. I figure the subquery is killing my performance. The first query (returning only the top scores) still performs adequately, so I was wondering if there is a way to force a JOIN operation to occur after the top scores have been selected.
EDIT Note that the query will return the top 10 ranked players, not just the top 10 scores. If the same player has many high scores, he should only show up once in the top 10 list.
I'm using SQLite, so it doesn't have some of the extended features of SQL Server or MySQL.

I don't have SQLite running, so I hope the LIMIT is right.
select players.name, trials.player, trials.timestamp, trials.score
from (
    select player, score, timestamp
    from trials
    order by score desc, timestamp asc
    limit 10
) trials, players
where players.id = trials.player
Regards

This is an instance of you making something harder than it needs to be. The correct code is:
CREATE VIEW top10place AS
SELECT player.name AS name, trial.*
FROM trials AS trial, players AS player
WHERE trial.player = player.id
ORDER BY trial.score ASC, trial.timestamp ASC
LIMIT 10;
Basically, let the LIMIT statement do the work :)
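Note that this still lets the same player appear several times. Per the question's EDIT, each player should show up at most once, so here is a rough sketch of that variant (untested; it relies on SQLite's documented behaviour that bare columns in a MAX() aggregate query come from the row that supplied the maximum):
-- Sketch: collapse to each player's best trial first, then join names and take the top 10.
-- In SQLite, best.timestamp comes from the same trials row that supplied MAX(score).
SELECT player.name AS name, best.player, best.timestamp, best.score
FROM (
    SELECT player, timestamp, MAX(score) AS score
    FROM trials
    GROUP BY player
) AS best
JOIN players AS player ON player.id = best.player
ORDER BY best.score DESC, best.timestamp ASC
LIMIT 10;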

A subquery in a WHERE can be expensive if the optimizer runs it for every row.
(Edit) Here's another way to write the query, this time with an exclusion join (a LEFT JOIN that keeps only rows with no match): it says there is no row for that player with a higher score:
SELECT
  (SELECT name FROM players WHERE id = cur.player) AS PlayerName
, cur.score AS MaxScore
FROM trials cur
LEFT JOIN trials higher
    ON  higher.player = cur.player
    AND higher.timestamp <> cur.timestamp
    AND higher.score > cur.score
WHERE higher.player IS NULL
ORDER BY cur.score DESC
LIMIT 10
This would return the 10 highest scoring users. If you'd like the 10 highest scores regardless of user, check Silas' answer.

As has been mentioned, since your identifying key between players and trials is the player.id and trials.player, you should have an index on trials.player. Particularly if you relate those two tables a lot.
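For example (a minimal sketch; the index name is just illustrative):
-- Speeds up the lookup of trials rows by player when joining to players.id
CREATE INDEX idx_trials_player ON trials (player);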
Also, you might try making your query more like this:
SELECT p.name AS name, t.*
FROM players AS p
INNER JOIN (
    SELECT *
    FROM trials
    WHERE trials.score = (
        SELECT MAX(score)
        FROM trials AS tsub
        WHERE tsub.player = trials.player
    )
    LIMIT 10
) AS t ON t.player = p.id
ORDER BY t.score DESC, t.timestamp ASC
This might even be able to be optimized a little more, but I'm no good at that without some data to throw the query at.

Related

How does one get the total rows for a partition in postgresql

I'm using a window function to help me paginate through a list of records in the database.
For example
I have a list of dogs and they all have a breed associated with them.
I want to show 10 dogs from each breed to my users.
So that would be
select * from dogs
join (
    SELECT id, row_number() OVER (PARTITION BY breed) as row_number
    FROM dogs
) rn on dogs.id = rn.id
where (row_number between 1 and 10)
That will give me ~ten dogs from each breed.
What I need though is a count. Is there a way to get the count of the partitions? I want to know how many Staffies I have waiting for adoption.
I do notice that there's a percentage, and all the docs I find seem to indicate there's something called total rows, but I don't see it.
Just run the window aggregate function count(*) over the same partition (without adding ORDER BY!) to get the total count for the partition:
SELECT *
FROM (
    SELECT *
         , row_number() OVER (PARTITION BY breed ORDER BY id) AS rn
         , count(*)     OVER (PARTITION BY breed) AS breed_count -- !
    FROM dogs
) sub
WHERE rn < 11;
Also removed the unnecessary join and simplified.
See:
Run a query with a LIMIT/OFFSET and also get the total number of rows
And I added ORDER BY to the window definition of row_number() to get a deterministic result. Without it, Postgres is free to return any 10 arbitrary rows per breed. Any write to the table (or VACUUM, etc.) can and will change the result without ORDER BY.
Aside, pagination with LIMIT / OFFSET does not scale well. Consider:
Optimize query with OFFSET on large table
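The usual alternative is keyset pagination: remember the last row of the previous page and continue after it, instead of skipping rows with OFFSET. A rough sketch (the literal values stand in for the last (breed, id) returned on the previous page):
SELECT *
FROM dogs
WHERE (breed, id) > ('Staffordshire Bull Terrier', 1234)  -- placeholder values
ORDER BY breed, id
LIMIT 10;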

How to efficiently get a range of ranked users (for a leaderboard) using Postgresql

I have read many posts on this topic, such as
mysql-get-rank-from-leaderboards.
However, none of the solutions are efficient at scale for getting a range of ranks from the database.
The problem is simple. Suppose we have a Postgres table with an "id" column and another INTEGER column whose values are not unique, but we have an index for this column.
e.g. table could be:
CREATE TABLE my_game_users (id serial PRIMARY KEY, rating INTEGER NOT NULL);
The goal
Define a rank for users ordering users on the "rating" column descending
Be able to query for a list of ~50 users ordered by this new "rank", centered at any particular user
For example, we might return users with ranks { 15, 16, ..., 64, 65 } where the center user has rank #40
Performance must scale, e.g. be under 80 ms for 100,000 users.
Attempt #1: row_number() window function
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank
FROM my_game_users)
SELECT *
FROM my_ranks
WHERE rank >= 4000 AND rank <= 4050
ORDER BY rank ASC;
This "works", but the queries average 550ms with 100,000 users on a fast laptop without any other real work being done.
I tried adding indexes, and re-phrasing this query to not use the "WITH" syntax, and nothing worked to speed it up.
Attempt #2 - count the number of rows with a greater rating value
I tried a query like this:
SELECT t1.*,
(SELECT COUNT(*)
FROM my_game_users t2
WHERE (t1.rating, -t1.id) <= (t2.rating, -t2.id)
) AS rank
FROM my_game_users t1
WHERE id = 2000;
This is decent, this query takes about 120ms with 100,000 users having random ratings. However, this only returns the rank for user with a particular id (2000).
I can't see any efficient way to extend this query to get a range of ranks. Any attempt at extending this makes a very slow query.
I only know the ID of the "center" user, since the users have to be ordered by rank before we know which ones are in the range!
Attempt #3: in-memory ordered Tree
I ended up using a Java TreeSet to store the ranks. I can update the TreeSet whenever a new user is inserted into the database, or a user's rating changes.
This is super fast, around 25 ms with 100,000 users.
However, it has a serious drawback that it's only updated on the Webapp node that serviced the request. I'm using Heroku and will deploy multiple nodes for my app. So, I needed to add a scheduled task for the server to re-build this ranking tree every hour, to make sure the nodes don't get too out-of-sync!
If anyone knows of an efficient way to do this in Postgres with full solution, then I am all ears!
You can get the same results by using ORDER BY rating DESC with OFFSET and LIMIT to get the users between certain ranks.
WITH my_ranks AS
(SELECT my_game_users.*, row_number() OVER (ORDER BY rating DESC) AS rank FROM my_game_users)
SELECT * FROM my_ranks WHERE rank >= 4000 AND rank <= 4050 ORDER BY rank ASC;
The query above is the same as
select * , rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 50 offset 4000
If you want to select users around rank #40, you could select ranks #15-#65:
select *, rank() over (order by rating desc) rank
from my_game_users
order by rating desc
limit 50 offset 15
Thanks, @FuzzyTree!
Your solution doesn't quite give me everything I need, but it nudged me in the right direction. Here's the full solution I'm going with for now.
The only limitation with your solution is that there's no way to get a unique rank for a particular user. All users with the same rating would have the same rank (or at least the order among them is undefined by the SQL standard). If I knew the OFFSET ahead of time, then your rank would be good enough, but I have to get the rank of a particular user first.
My solution is to do the following query to get a range of ranks:
SELECT * FROM my_game_users ORDER BY rating DESC, id ASC LIMIT ? OFFSET ?
This is basically uniquely defining the ranks by rating, then by who joined the game first (lower id).
To make this efficient I'm creating an index on (rating DESC, id).
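In Postgres that is something like (the index name is illustrative):
CREATE INDEX idx_rating_desc_and_id ON my_game_users (rating DESC, id);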
Then, I'm getting a particular user's rank to plug in to this query with:
SELECT COUNT(*) FROM my_game_users WHERE rating > ? OR (rating = ? AND id < ?)
I actually made this more efficient with:
SELECT (SELECT COUNT(*) FROM my_game_users WHERE rating > ?) + (SELECT COUNT(*) FROM my_game_users WHERE rating = ? AND id < ?) + 1
Now, even with these queries it takes about 78 ms (average and median) to get the ranks around a user. If anyone has a good idea how to speed these up I'm all ears!
For example, getting a range of ranks takes about 60ms, and explaining it yields:
EXPLAIN SELECT * FROM word_users ORDER BY rating DESC, id ASC LIMIT 50 OFFSET 50000;
"Limit (cost=6350.28..6356.63 rows=50 width=665)"
" -> Index Scan using idx_rating_desc_and_id on word_users (cost=0.29..12704.83 rows=100036 width=665)"
So, it's using the rating and id index, yet it still has this highly variable cost from 0.29...12704.83. Any ideas how to improve??
If you order by rating in descending order, the rows are already in rank order. Use the row_number() function.
Select Row number in postgres
Also, you could use an in-memory cache such as Redis. It's a separate application that can serve multiple instances, even remotely.

Add value from another table to my resultset (e.g. find username for user_id)

Some friends dragged me into writing an IRC bot that helps monitor the consumption of fluids throughout the day. Every user in our channel can submit an amount in liters every time he/she drinks something, and that value is stored in a drinks_today table which is reset at the end of the day. The bot uses SQLite for data storage.
I am stuck trying to find an SQL-only way to get the top 3 drinkers of the day.
I have the following database tables:
CREATE TABLE users(user_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, active_days INTEGER DEFAULT 0, drinks_total FLOAT DEFAULT 0);
CREATE TABLE drinks_today(user_id INTEGER, amount FLOAT, timestamp INTEGER, FOREIGN KEY(user_id) REFERENCES users(user_id));
I can find the top 3 user_ids as follows:
SELECT user_id, drinks_sum
FROM (
    SELECT SUM(amount) AS drinks_sum, user_id
    FROM drinks_today
    GROUP BY user_id
)
ORDER BY drinks_sum DESC
LIMIT 3;
The result will be:
1|9.0
4|8.5
3|6.0
Now I am looking for a way to (correctly) map the username into the result set. I tried the following statement, but the result was not correct:
SELECT u.name, drinks_sum
FROM (
    SELECT SUM(d.amount) AS drinks_sum
    FROM drinks_today d
    GROUP BY d.user_id
) JOIN users AS u ON u.user_id = user_id
ORDER BY drinks_sum DESC
LIMIT 3;
The result set will contain the first three users from the users table, each paired with the single top sum, which is, of course, completely wrong.
How can I get the username into my result set?
I think you can do this all in one query:
SELECT u.user_id, u.name, SUM(dt.amount) AS drunk
FROM users u
INNER JOIN drinks_today dt ON dt.user_id = u.user_id
GROUP BY u.user_id, u.name
ORDER BY drunk DESC -- or maybe ORDER BY SUM(dt.amount) DESC
LIMIT 3
Edit
Enjoy responsibly.
Cheers.

How do I select 8 random songs from top 50, with unique user_id?

I am trying to get the top 50 downloads and then shuffle (randomize) 8 of the results. Plus, the 8 results have to have unique user_ids. I came up with this so far:
Song.select('DISTINCT songs.user_id, songs.*').where(:is_downloadable => true).order('songs.downloads_count DESC').limit(50).sort_by{rand}.slice(0,8)
My only gripe with this is, the last part .sort_by{rand}.slice(0,8) is being done via Ruby. Any way I can do all this via Active Record?
I wonder how the column user_id ended up in the table songs? That means you have one row for every combination of song and user? In a normalized schema, that would be an n:m relationship implemented with three tables:
song(song_id, ...)
usr(usr_id, ...) -- "user" is a reserved word
download (song_id, user_id, ...) -- implementing the n:m relationship
The query in your question yields incorrect results. The same user_id can pop up multiple times. DISTINCT does not do what you seem to expect it to. You need DISTINCT ON or some other method like aggregate or window functions.
You also need to use subqueries or CTEs, because this cannot be done in one step. When you use DISTINCT you cannot at the same time ORDER BY random(), because the sort order cannot disagree with the order dictated by DISTINCT. This query is certainly not trivial.
Simple case, top 50 songs
If you are happy to just pick the top 50 songs (not knowing how many duplicate user_ids are among them), this "simple" case will do:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 50
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC -- pick most popular song per user
-- ORDER BY user_id, random() -- pick random song per user
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the 50 songs with the highest downloads_count. Users can show up multiple times.
Pick 1 song per user. Randomly or the most popular one, that's not defined in your question.
Now pick 8 songs with distinct user_id randomly.
You only need an index on songs.downloads_count for this to be fast:
CREATE INDEX songs_downloads_count_idx ON songs (downloads_count DESC);
Top 50 songs with unique user_id
WITH x AS (
SELECT DISTINCT ON (user_id) *
FROM songs
WHERE is_downloadable
ORDER BY user_id, downloads_count DESC
)
, y AS (
SELECT *
FROM x
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the song with the highest downloads_count per user. Every user can only show up once, so it has to be the one song with the highest downloads_count.
Pick the 50 with highest downloads_count from that.
Pick 8 songs from that randomly.
With a big table, performance will suck, because you have to find the best row for every user before you can proceed. A multi-column index will help, but it will still not be very fast:
CREATE INDEX songs_u_dc_idx ON songs (user_id, downloads_count DESC);
The same, faster
If duplicate user_ids among the top songs are predictably rare, you can use a trick. Pick just enough of the top downloads, so that the top 50 with unique user_id are certainly among them. After this step, proceed like above. This will be much faster with big tables, because the top n rows can be read from the top of an index quickly:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 100 -- adjust to your secure estimate
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC
)
, z AS (
SELECT *
FROM y
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM z
ORDER BY random()
LIMIT 8;
The index from the simple case above will suffice to make it almost as fast as the simple case.
This would fall short if fewer than 50 distinct users are among the top 100 "songs".
All queries should work with PostgreSQL 8.4 or later.
If it has to be faster yet, create a materialized view that holds the pre-selected top 50, and rewrite that table at regular intervals or triggered by events. If you make heavy use of this and the table is big, I would go for that. Otherwise it's not worth the overhead.
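A rough sketch of that variant, using a plain table that is refreshed periodically (so it also works on versions without CREATE MATERIALIZED VIEW; names are illustrative):
-- Build (or rebuild) the pre-selected top 50, one row per user
CREATE TABLE top50_songs AS
SELECT *
FROM (
    SELECT DISTINCT ON (user_id) *
    FROM songs
    WHERE is_downloadable
    ORDER BY user_id, downloads_count DESC
) d
ORDER BY downloads_count DESC
LIMIT 50;

-- The random pick of 8 then only has to touch 50 rows:
SELECT * FROM top50_songs ORDER BY random() LIMIT 8;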
Generalized, improved solution
I later formalized and improved this approach further to be applicable to a whole class of similar problems under this related question at dba.SE.
You could use PostgreSQL's RANDOM() function in the order by, making it
___.order('songs.downloads_count DESC, RANDOM()').limit(8)
This doesn't work, though, because PostgreSQL requires that ORDER BY expressions appear in the SELECT list when you use SELECT DISTINCT. You'll get an error like:
ActiveRecord::StatementInvalid: PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
The only way to do what you're asking all in SQL (using PostgreSQL) is with a subquery, which may or may not be a better solution for you. If it is, your best bet is to write out the full query/subquery using find_by_sql.
I'm happy to help come up with the SQL, though now that you know about RANDOM(), it should be pretty trivial.

Fetch data with single and fast SQL query

I have the following data:
ExamEntry  Student_ID  Grade
11         1           80
12         2           70
13         3           20
14         3           68
15         4           75
I want to find all the students that passed the exam. If a student attended several exams, I need to use only the last result.
So, in this case I'd get that all students passed.
Can I find it with one fast query? I do it this way:
Find the list of entries by
select max(ExamEntry) from data group by Student_ID
Find the results:
select ExamEntry from data where ExamEntry in ( ).
But this is VERY slow - I get around 1000 entries, and this 2 step process takes 10 seconds.
Is there a better way?
Thanks.
If your query is very slow with 1000 records in your table, there is something wrong.
For a modern database system, a table containing 1000 entries is considered very, very small.
Most likely, you did not provide a (primary) key for your table?
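For example (a sketch, assuming ExamEntry uniquely identifies a row; the index name is made up):
ALTER TABLE data ADD PRIMARY KEY (ExamEntry);
CREATE INDEX idx_data_student_grade ON data (Student_ID, Grade);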
Assuming that a student would pass if at least one of the grades is above the minimum needed, the appropriate query would be:
SELECT
Student_ID
, MAX(Grade) AS maxGrade
FROM table_name
GROUP BY Student_ID
HAVING maxGrade > MINIMUM_GRADE_NEEDED
If you really need the latest grade to be above the minimum:
SELECT
  Student_ID
, Grade
FROM table_name
WHERE ExamEntry IN (
    SELECT MAX(ExamEntry)
    FROM table_name
    GROUP BY Student_ID
)
AND Grade > MINIMUM_GRADE_NEEDED
SELECT student_id, MAX(ExamEntry)
FROM data
WHERE Grade > :threshold
GROUP BY student_id
Like this?
I'll assume that you have a student table and a test table, and that the table you are showing us is the test_result table... (if you don't have a similar structure, you should revisit your schema)
select s.id, s.name, t.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
left outer join test t on r.test_id = t.id
group by s.id, s.name, t.name
All the fields with id in them should be indexed.
If you really only have a single test (type) in your domain... then the query would be
select s.id, s.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
group by s.id, s.name
I've used the hints given here, and here is the query I found that runs almost 3 orders of magnitude faster than my first one (0.03 sec instead of 10 sec):
SELECT ExamEntry, Student_ID, Grade
FROM data,
     (SELECT MAX(ExamEntry) AS ExId FROM data GROUP BY Student_ID) AS newdata
WHERE `data`.`ExamEntry` = `newdata`.`ExId` AND Grade > 60;
Thanks All!
As mentioned, indexing is a powerful tool for speeding up queries. The order of the index, however, is fundamentally important.
An index in order of (ExamEntry) then (Student_ID) then (Grade) would be next to useless for finding exams where the student passed.
An index in the opposite order would fit perfectly, if all you wanted was to find what exams had been passed. This would enable the query engine to quickly identify rows for exams that have been passed, and just process those.
In MS SQL Server this can be done with...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Grade],
[Student_ID],
[ExamEntry]
)
ON [PRIMARY]
(I recommend reading more about indexes to see what other options there are, such as clustered indexes, etc.)
With that index, the following query would be able to ignore the 'failed' exams very quickly, and just display the students who ever passed the exam...
(This assumes that if you ever get over 60, you're counted as a pass, even if you subsequently take the exam again and get 27.)
SELECT
Student_ID
FROM
[results]
WHERE
Grade >= 60
GROUP BY
Student_ID
Should you definitely need the most recent value, then you need to change the order of the index to something like...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Student_ID],
[ExamEntry],
[Grade]
)
ON [PRIMARY]
This is because the first thing we are interested in is the most recent ExamEntry for any given student. Which can be achieved using the following query...
SELECT
*
FROM
[results]
WHERE
[results].ExamEntry = (
SELECT
MAX([student_results].ExamEntry)
FROM
[results] AS [student_results]
WHERE
[student_results].Student_ID = [results].student_id
)
AND [results].Grade > 60
Having a sub query like this can appear slow, especially since it appears to be executed for every row in [results].
This, however, is not the case...
- Both main and sub query reference the same table
- The query engine scans through the Index for every unique Student_ID
- The sub query is executed, for that Student_ID
- The query engine is already in that part of the index
- So a new Index Lookup is not needed
EDIT:
A comment was made that at 1000 records indexes are not relevant. It should be noted that the question states that there are 1000 records returned, not that the table contains 1000 records. For a basic query to take as long as stated, I'd wager there are many more than 1000 records in the table. Maybe this can be clarified?
EDIT:
I have just investigated 3 queries, with 999 records in each (3 exam results for each of 333 students)
Method 1: WHERE a.ExamEntry = (SELECT MAX(b.ExamEntry) FROM results b WHERE b.Student_ID = a.Student_ID)
Method 2: WHERE a.ExamEntry IN (SELECT MAX(ExamEntry) FROM results GROUP BY Student_ID)
Method 3: using an INNER JOIN instead of the IN clause
The following relative query costs were found:
Method   QueryCost (No Index)   QueryCost (With Index)
1        23%                    9%
2        38%                    46%
3        38%                    46%
So, Method 1 is faster regardless of indexes, and the index makes it substantially faster still.
The reason for this is that indexes allow lookups, where otherwise you need a scan. The difference between a linear law and a square law.
Thanks for the answers!!
I think that Dems is probably closest to what I need, but I will elaborate a bit on the issue.
Only the latest grade counts. If the student passed the first time, attended again and failed, he failed overall. He/she could have attended 3 or 4 exams, but still only the last one counts.
I use MySQL Server. I experience the problem in both Linux and Windows installations.
My data set is around 2K entries now and grows at a rate of ~1K per new exam.
The query for a specific exam also returns ~1K entries, where ~1K is the number of students who attended (obtained via SELECT DISTINCT Student_ID FROM results;), so almost all passed and some failed.
I perform the following query in my code:
SELECT ExamEntry, Student_ID FROM exams WHERE ExamEntry IN (SELECT MAX(ExamEntry) FROM exams GROUP BY Student_ID). Since the subquery returns ~1K entries, it appears that the main query scans them in a loop, making the whole query run for a very long time at 50% server load (100% on Windows).
I feel that there is a better way :-), just can't find it yet.
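One way to express the clarified requirement directly is to derive each student's latest ExamEntry first and only then filter on the grade; a sketch in MySQL syntax (untested):
SELECT d.ExamEntry, d.Student_ID, d.Grade
FROM data AS d
JOIN (
    SELECT Student_ID, MAX(ExamEntry) AS LastEntry
    FROM data
    GROUP BY Student_ID
) AS latest
  ON  latest.Student_ID = d.Student_ID
  AND latest.LastEntry  = d.ExamEntry
WHERE d.Grade > 60;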
select examentry,student_id,grade
from data
where examentry in
(select max(examentry)
from data
where grade > 60
group by student_id)
don't use
where grade > 60
but
where grade between 60 and 100
that should go faster