MySQL performance of repeated COUNT - sql

I need to fetch, for each blog article, the number of comments, and I currently use this SQL:
select
    id as article_id,
    title,
    content,
    pic,
    (select count(id) as comments
     from article_comments
     where article_comments.article_parent_id = article_id
     group by article_id) as comments
from articles
limit 1000;
This query is significantly slower than the same query without the count(id) subquery: roughly 2-4 seconds for 1000 selected articles. Is there a way to improve its performance?

COUNT over a growing table gets slower and slower as the data grows. To speed up getting the number of comments for an article, create a comment_count column in the articles table, and every time someone posts a comment, increment that number by 1 in the corresponding article record. That way, when you retrieve an article you don't have to count its comments on every page load; the count is just an attribute.
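A minimal sketch of this counter-column approach, using SQLite triggers as a stand-in for MySQL (MySQL trigger syntax differs slightly; the table and column names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    title TEXT,
    comment_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE article_comments (
    id INTEGER PRIMARY KEY,
    article_parent_id INTEGER,
    body TEXT
);

-- Keep the denormalized counter in sync on insert and delete.
CREATE TRIGGER comment_added AFTER INSERT ON article_comments
BEGIN
    UPDATE articles SET comment_count = comment_count + 1
    WHERE id = NEW.article_parent_id;
END;
CREATE TRIGGER comment_removed AFTER DELETE ON article_comments
BEGIN
    UPDATE articles SET comment_count = comment_count - 1
    WHERE id = OLD.article_parent_id;
END;
""")

conn.execute("INSERT INTO articles (id, title) VALUES (1, 'Hello')")
conn.execute("INSERT INTO article_comments (article_parent_id, body) VALUES (1, 'first')")
conn.execute("INSERT INTO article_comments (article_parent_id, body) VALUES (1, 'second')")

# Reading the count is now a plain column fetch, no COUNT() subquery.
count = conn.execute("SELECT comment_count FROM articles WHERE id = 1").fetchone()[0]
print(count)  # 2
```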

This is your query:
select id as article_id, title, content, pic,
       (select count(id) as comments
        from article_comments
        where article_comments.article_parent_id = articles.article_id
        group by article_id
       ) as comments
from articles
limit 1000;
First, the group by is unnecessary. Second, an index on article_comments(article_parent_id) should help. The final query might look like this:
select a.id as article_id, a.title, a.content, a.pic,
       (select count(*)
        from article_comments ac
        where ac.article_parent_id = a.id
       ) as comments
from articles a
limit 1000;
Note that this also introduces table aliases. Those make the query easier to write and read.

I discovered that, if circumstances allow it, it is much faster to run a first SQL query, extract the required ids from its result, and run a second SQL query with an IN() operator, instead of joining tables or nesting queries.
select id as article_id, title, content, pic from articles limit 1000
At this point we declare a string variable that will contain the set of ids going into the IN() operator of the next query.
<?php $in = '1, 2, 3, 4,...,1000'; ?>
Now we select the comment count for the set of previously fetched article ids (note the GROUP BY, which is needed to get one count per article):
select article_id, count(*) from article_comments where article_id in ($in) group by article_id
This method is slightly messier in terms of PHP code, because at this point we need an $articles array containing the article data and a $comments array, keyed by article_id, containing the comment count for each article.
Despite the performance improvement, this method makes the PHP code messier and makes it impossible to filter on values from the second (or any further) table within the first query.
It is hence only applicable when performance is key and no such filtering is required.
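The two-query approach can be sketched as follows (SQLite stand-in for MySQL; in real code use placeholders rather than interpolating $in, to avoid SQL injection; the schema here is an illustrative assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE article_comments (id INTEGER PRIMARY KEY, article_id INTEGER);
INSERT INTO articles VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO article_comments (article_id) VALUES (1), (1), (3);
""")

# Query 1: fetch the articles.
articles = conn.execute("SELECT id, title FROM articles LIMIT 1000").fetchall()

# Query 2: count comments only for those ids, one row per article.
ids = [article_id for article_id, _title in articles]
placeholders = ", ".join("?" * len(ids))
counts = dict(conn.execute(
    f"SELECT article_id, COUNT(*) FROM article_comments "
    f"WHERE article_id IN ({placeholders}) GROUP BY article_id", ids))

# Articles with no comments are absent from `counts`, hence the default 0.
comments = {article_id: counts.get(article_id, 0) for article_id, _title in articles}
print(comments)  # {1: 2, 2: 0, 3: 1}
```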

Related

Query with JOIN or WHERE (SELECT COUNT(*) ...) >= 1?

I have a database schema which contains about 20 tables. For the sake of my question, I will simplify it to only 3 tables:
* posts
    id
    title
    ...
* posts_users
    post_id
    user_id
    status (draft, published, etc)
    ...
* users
    id
    username
    ...
For reasons which are out of scope here, posts and users have a many-to-many relationship, and the status field is part of posts_users (it could have been in the posts table).
I'd like to get the published posts. I hesitate between two kinds of query:
SELECT posts.*
FROM posts
INNER JOIN posts_users ON posts_users.post_id = posts.id
WHERE status = 'published'
or
SELECT posts.*
FROM posts
WHERE (
    SELECT COUNT(*)
    FROM posts_users
    WHERE post_id = posts.id
      AND status = 'published'
) >= 1
(I have simplified my question; in reality, posts are linked to far more data used for filtering.)
My DB is SQLite. My questions are:
What is the difference?
Which way of querying is best in terms of performance?
These queries have different semantics: the first query returns multiple rows if more than one user has published a post (if that is even possible).
The SQLite query optimizer usually cannot rewrite very much, so what you write is likely to be how it is executed. Your second query will therefore count all matching posts_users entries, which is unnecessary if you only want to know whether at least one exists. You would be better off using EXISTS for that.
An even simpler way to write the second query would be:
SELECT *
FROM posts
WHERE id IN (SELECT post_id
FROM posts_users
WHERE status = 'published');
(This is one case where SQLite will rewrite it as a correlated subquery, if it estimates it to be more efficient.)
Ultimately, all these queries have to look up the same rows and will have similar performance; what matters most is that you have proper indexes. (But in this case, if most posts are published, an index on status would not help.)
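A runnable sketch of the EXISTS variant mentioned above, against SQLite (the sample data is an assumption for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE posts_users (post_id INTEGER, user_id INTEGER, status TEXT);
INSERT INTO posts VALUES (1, 'draft post'), (2, 'live post');
INSERT INTO posts_users VALUES (1, 10, 'draft'), (2, 10, 'published'), (2, 11, 'published');
""")

# EXISTS stops at the first matching posts_users row and, unlike the plain
# JOIN, returns each post at most once even when several users published it.
rows = conn.execute("""
    SELECT posts.*
    FROM posts
    WHERE EXISTS (SELECT 1
                  FROM posts_users
                  WHERE post_id = posts.id AND status = 'published')
""").fetchall()
print(rows)  # [(2, 'live post')]
```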
The performance of these queries depends on the shape of your tables.
Query 1 performs a join: in the worst case the intermediate result can have up to tableA.rows * tableB.rows rows, and its width is the columns of both tables combined (tableA.columns + tableB.columns).
Query 2 uses a correlated COUNT subquery: each table is processed separately, so the work grows roughly with tableA.rows + tableB.rows, and no joined rows are materialized.
I recommend query 2 for better performance.

How exactly is the value of count(*) determined in BigQuery?

I am joining a table of about 70,000 rows with a slightly bigger second table through INNER JOIN EACH. Now COUNT(a.business_column) and COUNT(*) give different results: the former correctly reports back ~70,000, while the latter gives ~200,000. But this only happens when I select COUNT(*) alone; when I select them together they give the same result (~70,000). How is this possible?
select
count(*)
/*,count(a.business_column)*/
from table_a a
inner join each table_b b
on b.key_column = a.business_column
UPDATE: For a step by step explanation on how this works, see BigQuery flattens when using field with same name as repeated field instead.
To answer the title question: COUNT(*) in BigQuery is always accurate.
The caveat is that in SQL COUNT(*) and COUNT(column) have semantically different meanings - and the sample query can be interpreted in different ways.
See: http://www.xaprb.com/blog/2009/04/08/the-dangerous-subtleties-of-left-join-and-count-in-sql/
There they have this sample query:
select user.userid, count(email.subject)
from user
inner join email on user.userid = email.userid
group by user.userid;
That query turns out to be ambiguous, and the article author changes it to a more explicit one, adding this comment:
But what if that’s not what the author of the query meant? There’s no
way to really know. There are several possible intended meanings for
the query, and there are several different ways to write the query to
express those meanings more clearly. But the original query is
ambiguous, for a few reasons. And everyone who reads this query
afterwards will end up guessing what the original author meant. “I
think I can safely change this to…”
COUNT(*) counts the most repeated field in your query; if you want to count full records, use COUNT(0).

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users' comments and another table to log likes/dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, a complex query with joins and subqueries is required to count all the likes/dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
       (SELECT COUNT(*) FROM comment_likers
        WHERE comment_id = comments.comment_id AND liker = 1) AS likes,
       (SELECT COUNT(*) FROM comment_likers
        WHERE comment_id = comments.comment_id AND liker = 0) AS dislikes,
       comment_likers.liker
FROM comments
INNER JOIN usrs ON (comments.usr_id = usrs.usr_id)
LEFT JOIN comment_likers ON (comments.comment_id = comment_likers.comment_id
                             AND comment_likers.usr_id = $usrID)
WHERE comments.topic_id = $tpcID
ORDER BY comments.created DESC;
However, if I added likes and dislikes columns to the COMMENTS table and created a trigger to automatically increment/decrement these columns as likes are inserted/deleted/updated in the LIKER table, then the SELECT statement would be simpler and more efficient than it is now. So I am asking: is it more efficient to have this complex query with the COUNTs, or to have the extra columns and triggers?
And to generalise: when querying on a regular basis, is it more efficient to COUNT, or to maintain an extra counter column?
Your query is very inefficient. You can eliminate those correlated subqueries, which will dramatically increase performance.
The two subqueries can be replaced by conditional aggregates:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
Two things are needed for this to work, though: for the sums to cover all users' votes, the LEFT JOIN must no longer be restricted to $usrID, and the query needs a GROUP BY; the current user's own vote can then be picked out with a conditional MAX. Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
       sum(comment_likers.liker) AS likes,
       sum(abs(comment_likers.liker - 1)) AS dislikes,
       max(case when comment_likers.usr_id = $usrID
                then comment_likers.liker end) AS liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers ON comments.comment_id = comment_likers.comment_id
WHERE comments.topic_id = $tpcID
GROUP BY comments.comment_id
ORDER BY comments.created DESC;
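A runnable sketch of the conditional-aggregation idea (SQLite stand-in; the schema and sample rows are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, descr TEXT);
CREATE TABLE comment_likers (comment_id INTEGER, usr_id INTEGER, liker INTEGER);
INSERT INTO comments VALUES (1, 'nice'), (2, 'meh');
INSERT INTO comment_likers VALUES (1, 10, 1), (1, 11, 1), (1, 12, 0), (2, 10, 0);
""")

# One pass over comment_likers with conditional sums, instead of two
# correlated COUNT(*) subqueries per comment; GROUP BY is required.
rows = conn.execute("""
    SELECT c.comment_id,
           COALESCE(SUM(l.liker = 1), 0) AS likes,
           COALESCE(SUM(l.liker = 0), 0) AS dislikes
    FROM comments c
    LEFT JOIN comment_likers l ON l.comment_id = c.comment_id
    GROUP BY c.comment_id
    ORDER BY c.comment_id
""").fetchall()
print(rows)  # [(1, 2, 1), (2, 0, 1)]
```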

Sorting out the duplicates in an SQL table

I give an analogy of my real problem below:
Imagine a website showing articles, where every article has comments associated with it. I want to get the articles that have comments posted after a certain date, say 2011-02-02. For each such article I also want the comment nearest in time to 2011-02-02. Don't forget that every article can have more than one comment associated with it. I want this to happen in one single SQL query.
I found it hard to explain my problem so I give the SQL code:
SELECT articles.*, comments.date AS date
FROM articles, comments
WHERE comments.commentId in (SELECT commentId
FROM comments
WHERE date > 2011-02-02
ORDER BY date asc
LIMIT 1)
ORDER BY comments.date desc
The problem lies in the subquery part of the SQL statement, because it only returns one single row. I want this to happen for each article.
Use a subquery. Unfortunately your question doesn't give me much schema... so I'll invent as I go. Let's say you have a table 'article' with article_id as its PK, and your other table is 'comment' (linked by article_id). I'm assuming article_id + date makes a comment unique.
select article.article_id, comment.comment_text, comment.comment_date
from article
inner join (select article_id, min(comment_date) as comment_date
            from comment
            where comment_date > '2011-02-02'
            group by article_id) c
        on c.article_id = article.article_id
inner join comment on comment.article_id = c.article_id
                  and comment.comment_date = c.comment_date
You can use subqueries as tables within joins. Use the subquery to isolate the single comment you want per article, then join back to the comment table to get the comment text. Hopefully this makes sense. I don't have a MySQL database to test this on, but I think the syntax should work (it does on MSSQL at least).
Edited for formatting. You can also include a WHERE clause at the bottom of this query to filter which articles you want to see.
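The derived-table technique above, sketched against SQLite with invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (article_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE comment (article_id INTEGER, comment_date TEXT, comment_text TEXT);
INSERT INTO article VALUES (1, 'a'), (2, 'b');
INSERT INTO comment VALUES
    (1, '2011-02-10', 'early'), (1, '2011-03-01', 'late'),
    (2, '2011-01-01', 'too old'), (2, '2011-02-05', 'first after cutoff');
""")

# The derived table picks each article's earliest comment after the cutoff,
# then the join back to comment recovers the matching comment text.
rows = conn.execute("""
    SELECT a.article_id, c.comment_text, c.comment_date
    FROM article a
    JOIN (SELECT article_id, MIN(comment_date) AS comment_date
          FROM comment
          WHERE comment_date > '2011-02-02'
          GROUP BY article_id) m ON m.article_id = a.article_id
    JOIN comment c ON c.article_id = m.article_id
                  AND c.comment_date = m.comment_date
    ORDER BY a.article_id
""").fetchall()
print(rows)  # [(1, 'early', '2011-02-10'), (2, 'first after cutoff', '2011-02-05')]
```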
You just have to query the comments newer than your date, returning the article ID (you do have a normalised structure, right? It's hard to tell without any detail).
To find the comment closest to your date, order the data by comment date in ascending order and take the first.
select a.* from articles a
inner join comments c on c.articleid = a.id
where c.date > '2011-02-02'
order by c.date asc
limit 1
That should do it, although I'm not super familiar with MySQL.

SQL left join query runs VERY slow

Basically I'm trying to pull a random poll question that a user has not yet responded to from a database. This query takes about 10-20 seconds to execute, which is obviously no good! The responses table has about 30K rows and the database also has about 300 questions.
SELECT questions.id
FROM questions
LEFT JOIN responses ON ( questions.id = responses.questionID
AND responses.username = 'someuser' )
WHERE
responses.username IS NULL
ORDER BY RAND() ASC
LIMIT 1
The PK for the questions and responses tables is 'id', if that matters.
Any advice would be greatly appreciated.
You most likely need a composite index on
responses (questionID, username)
Without an index, searching through 30k rows will always be slow.
Here's a different approach to the query which might be faster:
SELECT q.id
FROM questions q
WHERE q.id NOT IN (
    SELECT r.questionID
    FROM responses r
    WHERE r.username = 'someuser'
)
Make sure there is an index on r.username and that should be pretty quick.
The above will return all the unanswered questions. To choose a random one, you could go with the inefficient (but easy) ORDER BY RAND() LIMIT 1, or use the method suggested by Tom Leys.
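The NOT IN approach can be sketched as follows (SQLite stand-in for MySQL; the sample schema and data are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (id INTEGER PRIMARY KEY);
CREATE TABLE responses (id INTEGER PRIMARY KEY, questionID INTEGER, username TEXT);
CREATE INDEX idx_responses_user ON responses (username, questionID);
INSERT INTO questions (id) VALUES (1), (2), (3);
INSERT INTO responses (questionID, username)
    VALUES (1, 'someuser'), (3, 'someuser'), (2, 'otheruser');
""")

# Unanswered questions for 'someuser': the subquery can be satisfied from
# the (username, questionID) index instead of scanning all responses.
rows = conn.execute("""
    SELECT q.id
    FROM questions q
    WHERE q.id NOT IN (SELECT r.questionID
                       FROM responses r
                       WHERE r.username = 'someuser')
""").fetchall()
print(rows)  # [(2,)]
```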
The problem is probably not the join; it's almost certainly sorting the rows with ORDER BY RAND().
See: Do not order by rand
He suggests (replace 'quotes' in this example with your tables):
SELECT COUNT(*) AS cnt FROM quotes
-- generate random number between 0 and cnt-1 in your programming language and run
-- the query:
SELECT quote FROM quotes LIMIT $generated_number, 1
Of course you could probably make the first statement a subselect inside the second.
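The count-then-offset technique can be sketched as follows (SQLite stand-in; note the two statements are not atomic, so the count could change between them on a busy table):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quotes (id INTEGER PRIMARY KEY, quote TEXT);
INSERT INTO quotes (quote) VALUES ('a'), ('b'), ('c'), ('d');
""")

# Count once, pick the random offset in application code, then fetch a
# single row: no full-table ORDER BY RAND() sort is needed.
cnt = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
offset = random.randrange(cnt)
quote = conn.execute("SELECT quote FROM quotes LIMIT 1 OFFSET ?",
                     (offset,)).fetchone()[0]
print(quote)  # one of 'a', 'b', 'c', 'd'
```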
Is the OP even sure the original query returns the correct result set?
I assume the AND responses.username = 'someuser' clause was added to the join specification with the intention that the join would then generate NULL right-side columns only for the ids that 'someuser' has not answered.
My question: won't that join generate NULL right-side columns for every question.id that has not been answered by all users? The LEFT JOIN works such that, "If any row from the target table does not match the join expression, then NULL values are generated for all column references to the target table in the SELECT column list."
In any case, nickf's suggestion looks good to me.