SQL left join query runs VERY slow - sql

Basically I'm trying to pull a random poll question that a user has not yet responded to from a database. This query takes about 10-20 seconds to execute, which is obviously no good! The responses table is about 30K rows and the database also has about 300 questions.
SELECT questions.id
FROM questions
LEFT JOIN responses ON ( questions.id = responses.questionID
AND responses.username = 'someuser' )
WHERE
responses.username IS NULL
ORDER BY RAND() ASC
LIMIT 1
PK for questions and reponses tables is 'id' if that matters.
Any advice would be greatly appreciated.

You most likely need an index on
responses.questionID
responses.username
Without the index searching through 30k rows will always be slow.

Here's a different approach to the query which might be faster:
SELECT q.id
FROM questions q
WHERE q.id NOT IN (
SELECT r.questionID
FROM responses r
WHERE r.username = 'someuser'
)
Make sure there is an index on r.username and that should be pretty quick.
The above will return all the unanswered questios. To choose the random one, you could go with the inefficient (but easy) ORDER BY RAND() LIMIT 1, or use the method suggested by Tom Leys.

The problem is probably not the join, it's almost certainly sorting 30k rows by order rand()

See: Do not order by rand
He suggests (replace quotes in this example with your query)
SELECT COUNT(*) AS cnt FROM quotes
-- generate random number between 0 and cnt-1 in your programming language and run
-- the query:
SELECT quote FROM quotes LIMIT $generated_number, 1
Of course you could probably make the first statement a subselect inside the second.

Is OP even sure the original query returns the correct result set?
I assume the "AND responses.username = 'someuser'" clause was added to join specification with intention that join will then generate null rightside columns for only the id's that someuser has not answered.
My question: won't that join generate null rightside columns for every question.id that has not been answered by all users? The left join works such that, "If any row from the target table does not match the join expression, then NULL values are generated for all column references to the target table in the SELECT column list."
In any case, nickf's suggestion looks good to me.

Related

Adding ORDER BY on SQLite takes huge amount of time

I've written the following query:
WITH m2 AS (
SELECT m.id, m.original_title, m.votes, l.name as lang
FROM movies m
JOIN movie_languages ml ON m.id = ml.movie_id
JOIN languages l ON l.id = ml.language_id
)
SELECT m.original_title
FROM movies m
WHERE NOT EXISTS (
SELECT 1
FROM m2
WHERE m.id = m2.id AND m2.lang <> 'English'
)
The results appear after 1.5 seconds.
After adding the following line at the end of the query, it takes at least 5 minutes to run it:
ORDER BY votes DESC;
It's not the size of the data, as ORDER BY on the entire table return results in notime.
What am I doing wrong?
Why is the ORDER BY adds so much time? (The query SELECT * FROM movies ORDER BY votes DESC returns immediately).
The order by in the CTE is irrelevant. But I would suggest aggregation for this purpose:
SELECT m.original_title
FROM movies m JOIN
movie_languages ml
ON m.id = ml.movie_id JOIN
languages l
ON l.id = ml.language_id
GROUP BY m.original_title, m.id
HAVING SUM(lang = 'English') = 0;
In order to examine your queries you may turn on the timer by entering .time on at the SQLite prompt. More importantly utilize the EXPLAIN function to see details on your query.
The query initially written does seem to be rather more complex than necessary as already pointed out above. It does not seem apparent what the necessity is for 'movie_languages' and 'languages' tables in general, but especially in this particular query. That would require more explanation on your part but I believe at least one could be removed thus speeding up your query.
The ORDER BY clause in SQLite is handled as described below.
SQLite attempts to use an index to satisfy the ORDER BY clause of a query when possible. When faced with the choice of using an index to satisfy WHERE clause constraints or satisfying an ORDER BY clause, SQLite does the same cost analysis described above and chooses the index that it believes will result in the fastest answer.
SQLite will also attempt to use indices to help satisfy GROUP BY clauses and the DISTINCT keyword. If the nested loops of the join can be arranged such that rows that are equivalent for the GROUP BY or for the DISTINCT are consecutive, then the GROUP BY or DISTINCT logic can determine if the current row is part of the same group or if the current row is distinct simply by comparing the current row to the previous row. This can be much faster than the alternative of comparing each row to all prior rows.
Since there is no index or type on votes stated and the above logic may be followed thus choosing 'the index that it believes will result in the fastest answer'. With the over-complicated query and no index on votes which is being used as ORDER BY then there is much more for it to figure out than necessary. Since the simple query with ORDER BY executes then the complexity of the query causing SQLite much more to compute than necessary.
Additionally the type of the column, most likely INTEGER, is important when sorting (and joining). Attempting to sort on a character type will not only get you wrong results in this case if votes end up above single digits it would be the wrong type to use (I'm not assuming you are just mentioning it).
So simplify the query, ensure your PRIMARY KEYS are properly set, and test it. If it is still not returning in time try an index on votes. This will give you much better insight into what is going on and how different changes affect your queries.
SQLite Documentation - check all and note 6. Sorting, Grouping and Compound SELECTs
SQLite Documentation - check 10. ORDER BY optimizations
You can do it with NOT EXISTS, without joins and aggregation (assuming that there is always at least 1 row for each movie in the table movie_languages):
SELECT m.*
FROM movies m
WHERE NOT EXISTS (
SELECT 1 FROM movie_languages ml
WHERE m.id = ml.movie_id
AND ml.language_id <> (SELECT l.id FROM languages l WHERE l.lang = 'English')
)
ORDER BY m.votes DESC
or with a LEFT join to languages to get the unmatched rows:
SELECT m.*
FROM movies m
INNER JOIN movie_languages ml ON m.id = ml.movie_id
LEFT JOIN languages l ON l.id = ml.language_id AND l.lang <> 'English'
WHERE l.id IS NULL
ORDER BY m.votes DESC
Refer to this link for more information:
here
In a nutshell, When you include an order by clause, the database builds a list of the rows in the correct order and then returns the data in that order.
The creation of the list mentioned above takes a lot of extra processing, translating into a longer execution time.

Is there a better way to optimize this SQL query?

Assume my table has millions of records.
The following columns have indexes:
type
date
unique_id
Here is my query:
SELECT TOP (1000) T.TIME, T.TYPE, F.NAME,
B.NAME, T.MESSAGE
FROM MY_TABLE T
LEFT OUTER JOIN FOO F ON F.ID = T.FID
LEFT OUTER JOIN BAR B ON B.ID = T.BID
WHERE T.TYPE IN ('success', 'failure')
AND T.DATE BETWEEN 1592585183437 AND 1594232320525
AND T.UNIQUE_ID = "my unique ID"
ORDER BY T.DATE DESC
My question is am I causing myself any trouble with this query if I have tons of records in my table? Can this be optimized further?
My question is am I causing myself any trouble with this query if I have tons of records
in my table?
Wrong question. As long as you are only asking for data you need, you NEED it. Any trouble means you STIL need it.
The query looks as good as it gets. I somehow doubt the TOP 10000 (that is a LOT of data unless you compress it somehow).
The question is more whether you have the proper indices. Without the proper indices. I would also not use strings like this for TYPE, but then the question is about the query, not possible failures in table design.
Check indices.

postgresql query to determine recommendation records that haven't been seen by a user

I have 3 tables: user, recommendation (post_id, user_id), post
When a user votes on a post, a new recommendation record gets created with the post_id, user_id, and vote value.
I want to have a query that shows a random post that a user hasn't seen/voted on yet.
My thought on this is that it needs to join all recommendations of a user to the post table... and then select the records that don't have a joined recommendation. Not sure how to do this though...
What I have so far that definitely doesn't work:
SELECT "posts".*
FROM "posts"
INNER JOIN "recommendations" ON "recommendations"."post_id" = "posts"."id"
ORDER BY RANDOM()
LIMIT 1
You can do this with a left outer join:
SELECT p.*
FROM posts p LEFT OUTER JOIN
recommendations r
ON r.post_id = p.id and r.userid = YOURUSERID
WHERE r.post_id IS NULL
ORDER BY RANDOM()
LIMIT 1;
Note that I simplified the query by removing the double quotes (not needed for your identifier names) and adding table aliases. These changes make the query easier to write and to read.
There are several good ways to exclude rows that already have a recommendation from a given user:
Select rows which are not present in other table
The important question is: arbitrary or random?
For an arbitrary pick (any qualifying row is good enough), this should be cheapest:
SELECT *
FROM posts p
WHERE NOT EXISTS (
SELECT 1
FROM recommendations
WHERE post_id = p.id
AND user_id = $my_user_id
)
LIMIT 1;
The sort step might be expensive (and unnecessary) for lots of posts. In such a use case most of the posts will typically have no recommendation from the user at hand, yet. You'd have to order all those rows by random() every time.
If any post without recommendation is good enough, dropping ORDER BY will make it considerably faster. Postgres can just return the first qualifying post it finds.
so you need a set of all posts EXCEPT posts already recommended.
SELECT p.id FROM posts p
EXCEPT
SELECT r.post_id FROM recommendations r WHERE r.user_id = X
...

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=1)likes,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=0)dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added a likes and dislikes column to the COMMENTS table and created a trigger to automatically increment / decrement these columns as likes get inserted / deleted / updated to the LIKER table then the SELECT statement would be more simple and more efficient than it is now. I am asking, is it more efficient to have this complex query with the COUNTS or to have the extra columns and triggers?
And to generalise, is it more efficient to COUNT or to have an extra column for counting when being queried on a regular basis?
Your query is very inefficient. You can easily eliminate those sub queries, which will dramatically increase performance:
Your two sub queries can be replaced by simply:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers ON comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;

mysql statement with nested SELECT - how to improve performance

This statement appears inefficient because only one one out of 10 records are selected and only 1 of 100 entries contain comments.
What can I do to improve it ?
$query = "SELECT
A,B,C,
(SELECT COUNT(*)
FROM comments
WHERE comments.nid = header_file.nid)
as my_comment_count
FROM header_file
Where A = 'admin' "
edit: I want header records even if no comments are found.
You can add index on a A and nid column.
I am using an inner join here because it sounds like you only want header_file records that contain comments. If this is not the case, change it to a left outer join:
select h.a, h.b, h.c, c.Count
from header_file h
inner join (
select nid, count(*) as Count
from comments
group by nid
) c on c.nid = h.nid
where h.a = 'admin'
From question:
This statement appears inefficient ...
How do you know it is inefficient?
Do you have execution plan?
Did you measure execution times?
Are you sure it uses index?
You commented Peter Lang's answer: ... not sure if any performance gain here - is based on what?
Basic thing you should know about query execution:
Most modern RDBMS-s have query optimizer that analyzes your SQL and determines optimal execution plan.
Your feeling that some query is "bad" doesn't mean anything. You need to check execution plan and then you see if there is anything you can do to improve performance.
For MySql, see article: 7.2.1. Optimizing Queries with EXPLAIN
You could also run SQL from answers and compare execution plans to see if any of proposed solution gives better performance.
You can try to use a Left Join, this could allow better optimization:
Select a, b, c, Count(*) As my_comment_count
From header_file h
Left Outer Join comments c On ( c.nid = h.nid )
Group By a, b, c
Where A = 'admin'
Insert the comment count number directly in the table, count(*) isn't very efficient.
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/