Query with JOIN or WHERE (SELECT COUNT(*) ...) >= 1?

I have a database schema which contains about 20 tables. For the sake of this question, I'll simplify it to just 3 tables:
* posts
id
title
...
* posts_users
post_id
user_id
status (draft, published, etc)
...
* users
id
username
...
For reasons that are outside the scope of this question, Posts and Users have a many-to-many relationship, and the status field is part of posts_users (it could have been in the posts table).
I'd like to get the published posts. I'm hesitating between two kinds of query:
SELECT posts.*
FROM posts
INNER JOIN posts_users ON posts_users.post_id = posts.id
WHERE status = 'published'
or
SELECT posts.*
FROM posts
WHERE (
SELECT COUNT(*)
FROM posts_users
WHERE post_id = posts.id
AND status = 'published'
) >= 1
(I have simplified my question, but in reality posts are linked to far more other data that needs filtering.)
My DB is SQLite. My questions are:
What is the difference?
Which way of querying is best in terms of performance?

These queries have different semantics: the first query returns a post multiple times if more than one user has published it (if that is even possible).
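If each post should appear only once, one straightforward fix is to deduplicate the join result with DISTINCT:
SELECT DISTINCT posts.*
FROM posts
INNER JOIN posts_users ON posts_users.post_id = posts.id
WHERE status = 'published';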
The SQLite query optimizer usually cannot rewrite very much, so what you write is likely to be how it is executed. Your second query will therefore count all matching posts_users entries, which is not necessary if you only want to find out whether there is at least one; you would do better to use EXISTS for that.
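For example, the second query expressed with EXISTS would look like this:
SELECT posts.*
FROM posts
WHERE EXISTS (
    SELECT 1
    FROM posts_users
    WHERE post_id = posts.id
    AND status = 'published'
);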
An even simpler way to write the second query would be:
SELECT *
FROM posts
WHERE id IN (SELECT post_id
FROM posts_users
WHERE status = 'published');
(This is one case where SQLite will rewrite it as a correlated subquery, if it estimates it to be more efficient.)
Ultimately, all these queries have to look up the same rows and will have similar performance; what matters most is that you have proper indexes. (But in this case, if most posts are published, an index on status would not help.)
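For example, a composite index such as the following (the name is illustrative) would serve the correlated COUNT/EXISTS lookups; for the IN form, (status, post_id) would be the better column order:
CREATE INDEX idx_posts_users_post_status ON posts_users (post_id, status);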

The performance of these queries depends on the sizes of your tables.
Query 1 performs a join. In the worst case (no selective join condition) the intermediate result is:
output rows = tableA.rows × tableB.rows
output columns = tableA.columns + tableB.columns
Query 2 uses a correlated subquery, so each table is scanned separately:
output rows ≈ tableA.rows + tableB.rows
I recommend query 2 for better performance.

Related

Getting duplicate rows on subquery

So this has been bothering me for some time because I feel like in MSSQL this query would run just fine but at my new job I am forced to use Oracle. I have a subselect in a query where I want to find all of the people not assigned to a survey. My query is as follows:
Select distinct * From GetAllUsers, getperms
Where id not in (getperms.users) and Survey_ID = '1'
If there are three users in the getperms table, I get three rows for each person in the GetAllUsers table.
I guess I could do some kind of join and that's no problem; it's just really bothering me that this doesn't work when I think that it should.
"I feel like in MSSQL this query would run just fine"
It would not. In both Oracle and MS-SQL, an IN clause needs to be a static list of items or a subquery that returns one column, so you'd need something like:
Select distinct *
From GetAllUsers
Where id not in (SELECT users FROM getperms)
and Survey_ID = '1'
Note that I took getperms out of the FROM since it produces a cross-join, which is why you get every combination of records from both tables.
Your query looks just horrible. First of all, you are using a join syntax that became outdated more than twenty years ago. Then you are using NOT IN where it doesn't make sense (your set contains just one value). I can also hardly imagine that DISTINCT makes sense in your query, unless one of the two tables contains records that are exact duplicates, which it shouldn't. Is Survey_ID really a string? And as you are working with two tables, you should qualify your columns to indicate which table they reside in, e.g. GetAllUsers.id. I assume that Survey_ID resides in getperms.
Your query properly written looks thus:
Select *
From GetAllUsers gau
Join getperms gp On gp.Survey_ID = '1' And gp.users <> gau.id;
But of course, whether you write it this way or the other, it does exactly the same, and there is no difference between Oracle and SQL Server here. What it does is: get all combinations of GetAllUsers and survey-1 getperms, except for the matches. So every user will be in the results, each combined with almost every getperms record.
You say: "I want to find all of the people not assigned to a survey". That would rather be:
Select *
From GetAllUsers
Where id Not In (Select users From getperms Where Survey_ID = '1');
You can use any of the following three methods if you want to exclude the getperms users from GetAllUsers.
SELECT DISTINCT a.*
FROM GetAllUsers a
LEFT JOIN getperms b
ON a.id = b.users
AND b.Survey_ID = '1'
WHERE b.users IS NULL
OR
SELECT DISTINCT a.*
FROM GetAllUsers a
WHERE NOT EXISTS (SELECT 1
                  FROM getperms b
                  WHERE a.id = b.users
                  AND b.Survey_ID = '1')
OR
SELECT DISTINCT *
FROM GetAllUsers
WHERE id NOT IN (SELECT users
                 FROM getperms
                 WHERE Survey_ID = '1')
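One general caveat with the third method: NOT IN returns no rows at all if the subquery yields any NULL, so if getperms.users is nullable it is safer to exclude NULLs explicitly:
SELECT DISTINCT *
FROM GetAllUsers
WHERE id NOT IN (SELECT users
                 FROM getperms
                 WHERE Survey_ID = '1'
                 AND users IS NOT NULL)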

MySQL performance of repeated COUNT

I need to fetch, for each blog article, the number of comments; I currently use this SQL:
select
id as article_id,
title,
content,
pic,
(select count(id) as comments from article_comments where
article_comments.article_parent_id = article_id group by article_id) as comments
from articles limit 1000;
This query has some significant delay compared to query without the count(id) subquery. The delay is about roughly 2 - 4 seconds for 1000 selected articles. Is there a way to improve performance of this query?
Using COUNT on big data creates an ever-increasing delay. To speed up getting the number of comments for an article, create a column in the articles table called comment_count, and every time someone enters a comment, increase the number by 1 in the corresponding article record. That way, when you retrieve the article, you don't have to count the comments on every page load; it is just an attribute.
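A minimal sketch of that approach, assuming MySQL and the column names used in the question (comment_count and the trigger name are hypothetical):
-- add the denormalized counter to the articles table
ALTER TABLE articles ADD COLUMN comment_count INT NOT NULL DEFAULT 0;

-- keep it up to date as comments are inserted
CREATE TRIGGER article_comments_after_insert
AFTER INSERT ON article_comments
FOR EACH ROW
    UPDATE articles
    SET comment_count = comment_count + 1
    WHERE id = NEW.article_parent_id;
A matching AFTER DELETE trigger would decrement the counter when a comment is removed.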
This is your query:
select id as article_id, title, content, pic,
(select count(id) as comments
from article_comments
where article_comments.article_parent_id = articles.article_id
group by article_id
) as comments
from articles
limit 1000;
First, the group by is unnecessary. Second, the index article_comments(article_parent_id) should help. The final query might look like this:
select a.id as article_id, a.title, a.content, a.pic,
(select count(*) as comments
from article_comments ac
where ac.article_parent_id = a.id
) as comments
from articles a
limit 1000;
Note that this also introduces table aliases. Those make the query easier to write and read.
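Assuming the table names above, the suggested index would be created like this (the index name is arbitrary):
CREATE INDEX idx_article_comments_parent ON article_comments (article_parent_id);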
I discovered that, if circumstances allow it, it is much faster to make a first SQL query, extract the required ids from it, and make a second SQL query with the IN () operator, instead of joining tables / nesting queries.
select id as article_id, title, content, pic from articles limit 1000
At this point we need to declare a string variable that is going to contain the set of ids that will go into the IN () operator in the next query.
<?php $in = '1, 2, 3, 4,...,1000'; ?>
Now we select the comment counts for the set of previously fetched article ids.
select article_parent_id, count(*) from article_comments where article_parent_id in ($in) group by article_parent_id
This method is slightly messier in terms of PHP code, because at this point we need an $articles array containing the article data and a $comments['article_id'] array containing the count of comments for each article. It also makes it impossible to filter on values in the second (or any further) table, so it is only applicable when performance is key and no other operations are required.

PostgreSQL query to determine recommendation records that haven't been seen by a user

I have 3 tables: user, recommendation (post_id, user_id), post
When a user votes on a post, a new recommendation record gets created with the post_id, user_id, and vote value.
I want to have a query that shows a random post that a user hasn't seen/voted on yet.
My thought on this is that it needs to join all recommendations of a user to the post table... and then select the records that don't have a joined recommendation. Not sure how to do this though...
What I have so far that definitely doesn't work:
SELECT "posts".*
FROM "posts"
INNER JOIN "recommendations" ON "recommendations"."post_id" = "posts"."id"
ORDER BY RANDOM()
LIMIT 1
You can do this with a left outer join:
SELECT p.*
FROM posts p LEFT OUTER JOIN
recommendations r
ON r.post_id = p.id and r.user_id = YOURUSERID
WHERE r.post_id IS NULL
ORDER BY RANDOM()
LIMIT 1;
Note that I simplified the query by removing the double quotes (not needed for your identifier names) and adding table aliases. These changes make the query easier to write and to read.
There are several good ways to exclude rows that already have a recommendation from a given user:
Select rows which are not present in other table
The important question is: arbitrary or random?
For an arbitrary pick (any qualifying row is good enough), this should be cheapest:
SELECT *
FROM posts p
WHERE NOT EXISTS (
SELECT 1
FROM recommendations
WHERE post_id = p.id
AND user_id = $my_user_id
)
LIMIT 1;
The sort step might be expensive (and unnecessary) with lots of posts. In such a use case, most posts will typically not yet have a recommendation from the user at hand, so you'd have to order all those rows by random() every time.
If any post without a recommendation is good enough, dropping the ORDER BY makes it considerably faster: Postgres can just return the first qualifying post it finds.
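If a genuinely random pick is required despite the cost, the same NOT EXISTS query works with the sort step added back:
SELECT *
FROM posts p
WHERE NOT EXISTS (
   SELECT 1
   FROM recommendations
   WHERE post_id = p.id
   AND user_id = $my_user_id
   )
ORDER BY random()
LIMIT 1;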
So you need the set of all posts EXCEPT the posts already recommended:
SELECT p.id FROM posts p
EXCEPT
SELECT r.post_id FROM recommendations r WHERE r.user_id = X
...
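One hypothetical way to complete that, wrapping the set difference to pick a random post:
SELECT *
FROM posts
WHERE id IN (
    SELECT id FROM posts
    EXCEPT
    SELECT post_id FROM recommendations WHERE user_id = X
)
ORDER BY random()
LIMIT 1;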

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
    (SELECT COUNT(*) FROM comment_likers WHERE comment_id = comments.comment_id AND liker = 1) AS likes,
    (SELECT COUNT(*) FROM comment_likers WHERE comment_id = comments.comment_id AND liker = 0) AS dislikes,
    comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added likes and dislikes columns to the COMMENTS table and created a trigger to automatically increment / decrement these columns as likes get inserted / deleted / updated in the LIKER table, then the SELECT statement would be simpler and more efficient than it is now. So I am asking: is it more efficient to have this complex query with the COUNTs, or to have the extra columns and triggers?
And to generalise, is it more efficient to COUNT or to have an extra column for counting when being queried on a regular basis?
Your query is very inefficient. You can eliminate those sub queries, which should dramatically increase performance:
Your two sub queries can be replaced by simply:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
For these sums to count everyone's votes, the join to comment_likers must no longer be restricted to $usrID, the rows must be grouped per comment, and the current user's own vote can be picked out with a conditional aggregate. Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
    SUM(liker) AS likes,
    SUM(ABS(liker - 1)) AS dislikes,
    MAX(CASE WHEN comment_likers.usr_id = $usrID THEN comment_likers.liker END) AS liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers ON comments.comment_id = comment_likers.comment_id
WHERE comments.topic_id = $tpcID
GROUP BY comments.comment_id, comments.descr, comments.created, usrs.usr_name
ORDER BY comments.created DESC;

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to grow large rather quickly.
Here's the current query (the following table joins users to feeders, aka, users and groups)
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
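A minimal sketch of what the remodeled table might look like, assuming PostgreSQL (the column types are guesses, not from the original post):
CREATE TABLE feed_items (
    id         bigserial PRIMARY KEY,
    type       char(1) NOT NULL CHECK (type IN ('A', 'S')), -- Actor or Subject
    subtype    text NOT NULL,                               -- replaces actor_type / subject_type
    created_at timestamptz NOT NULL DEFAULT now()
);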
EXPLAIN ANALYZE and time the query to see if there is a problem.
Also, you could try expressing the query as a UNION:
SELECT x.* FROM
(
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.subject_id
        AND followings.feeder_type = feed_items.subject_type
    WHERE followings.follower_id = 42
    UNION
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.actor_id
        AND followings.feeder_type = feed_items.actor_type
    WHERE followings.follower_id = 42
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again, EXPLAIN ANALYZE and benchmark.
To find out whether there is a performance problem, measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying; if you identify a performance problem then you may need to revise your indexes.
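For reference, prefixing the original query with EXPLAIN ANALYZE shows the actual plan and row counts:
EXPLAIN ANALYZE
SELECT DISTINCT feed_items.*
FROM feed_items
INNER JOIN followings
    ON (followings.feeder_id = feed_items.subject_id
        AND followings.feeder_type = feed_items.subject_type)
    OR (followings.feeder_id = feed_items.actor_id
        AND followings.feeder_type = feed_items.actor_type)
WHERE followings.follower_id = 42
ORDER BY feed_items.created_at DESC
LIMIT 30;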