MySQL performance using IN predicate

If I run the following queries each one returns quickly (0.01 sec) and gives me my desired result.
SELECT tagId FROM tag WHERE name='programming'
SELECT COUNT(DISTINCT workcode) FROM worktag WHERE tagId=123 OR tagId=124
(assume the two tagId numbers were the results from the first query)
I would like to combine these queries so I only have to run a single query:
SELECT COUNT(DISTINCT workcode) FROM worktag WHERE tagId IN (SELECT tagId FROM tag WHERE name='programming')
However this query completes in about 1 min and 20 sec. I have indexes on worktag.workcode, worktag.tagId, tag.tagId, and tag.name.
If I run DESCRIBE on the queries, the first two use the indexes, and the combined query uses the index for the subquery (on the tag table) but doesn't use any indexes on the worktag table.
Does anyone know why this might be?
NOTE: the worktag table has over 18 million records in it.

Why don't you use a join instead of a subquery?
SELECT COUNT(DISTINCT workcode)
FROM worktag
LEFT JOIN tag
ON worktag.tagId = tag.tagId
WHERE tag.name = 'programming'
P.S.: This seems to have been reported as a bug.

A database admin recently told me that the syntax WHERE x IN (...) is a pain for the database. A join is almost always better:
SELECT COUNT(DISTINCT wt.workcode)
FROM worktag wt, tag t
WHERE wt.tagId = t.tagId
AND t.name='programming'

SELECT COUNT(DISTINCT workcode)
FROM worktag
INNER JOIN tag ON worktag.tagId = tag.tagId
WHERE tag.name='programming'

MySQL generally doesn't do so well with subqueries, even independent ones. The posters who discussed joins are right: if you've got a choice, use a join. If you can't easily use a join (i.e., foo.x IN (SELECT y FROM bar WHERE y = xxx LIMIT 10)), you're better off materializing the subquery into a temporary MEMORY table and joining against it.
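A minimal sketch of that workaround against the tables from this question (the temporary table name is made up):
-- Materialize the subquery result into a small in-memory table...
CREATE TEMPORARY TABLE tmp_tags (tagId INT PRIMARY KEY) ENGINE=MEMORY;
INSERT INTO tmp_tags
SELECT tagId FROM tag WHERE name = 'programming';
-- ...then join against it, which should let MySQL use the worktag.tagId index:
SELECT COUNT(DISTINCT wt.workcode)
FROM worktag wt
INNER JOIN tmp_tags t ON t.tagId = wt.tagId;
DROP TEMPORARY TABLE tmp_tags;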
If you're using MySQL a lot, use EXPLAIN and you'll see how it's using your indexes and such things.

Have you tried:
SELECT COUNT(DISTINCT workcode) FROM worktag WHERE tagId IN (123, 124)
?
I'm not a MySQL expert, but it looks to me like you might be looking at a significant failure of the query optimizer.
On the other hand, good for MySQL that it optimizes the OR in the second statement. I know databases that will successfully optimize IN (), but not the OR version of the same logical request.

My guess is that the optimizer makes a bad estimate here. Replacing the subquery with an inner join might help.


SELECT MAX() too slow - any alternatives?

I've inherited a SQL Server based application, and it has a stored procedure that contains the following, but it hits a timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives, such as ROW_NUMBER() OVER (PARTITION BY...
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query. The compiler already knows how to rewrite it anyway. The problem is always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2, you need an index on BItems.ICID.
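For what it's worth, those two indexes might be created like this (the index names are made up):
CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);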
The query could probably also be expressed as a correlated APPLY, because that seems to be what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the OP's query would return multiple rows on a StatusTime collision. My guess, though, is that this is what is desired ('the most recent BData for this BItem').
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
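A hedged sketch of such a covering index (the index name and the INCLUDE list are hypothetical; INCLUDE would name whichever BData columns the query actually selects):
CREATE INDEX IX_BData_Covering
ON dbo.BData (BID, StatusTime DESC)
INCLUDE (SomeColumn1, SomeColumn2); -- hypothetical stand-ins for the columns you really select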
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick: avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue where a simple query involving max() is taking more than 1 hour to execute.
After looking at the execution plan, it seems that in order to perform the max() function, every record meeting the WHERE clause condition must be fetched; in your case, every record in the table needs to be fetched before the max() is computed. Also, indexing BData.StatusTime alone will not speed up the query: an index is useful for looking up a particular record, but it will not help with performing this comparison.
In my case I didn't have the GROUP BY, so all I did was use an ORDER BY ... DESC clause with SELECT TOP 1. The query went from over 1 hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully your query speeds up.
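A minimal sketch of that approach, assuming a single BID of interest (@BID is a hypothetical parameter):
DECLARE @BID int = 42; -- hypothetical value
-- No GROUP BY needed when only one BID is involved:
SELECT TOP (1) *
FROM dbo.BData
WHERE BID = @BID
ORDER BY StatusTime DESC;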
Cheers!
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE b.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
Depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery. But more than likely it won't yield any faster results.
The best option would probably be to add an index on BID with an INCLUDE containing StatusTime, and, if possible, filtering that to InternalIDs matching BItems.ICID = 2.
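Expressed as DDL, that might look like the following (the index name is made up; only the INCLUDE form is sketched, since whether the filtered variant is feasible depends on the schema):
CREATE INDEX IX_BData_BID_IncStatus
ON dbo.BData (BID)
INCLUDE (StatusTime);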
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers and suggestions. Unfortunately I couldn't get any further with this, so I've given up trying for now.
It looks like the best solution is to re-write the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
Thanks again for the suggestions.

How do I exclude or negate two queries?

I am new to SQL, so this is probably very simple, however, I wasn't able to find the solution.
Basically my query is as follows:
SELECT UserID
FROM Users
NOT UNION
SELECT UserID
FROM User_Groups
WHERE GroupID = '$_[0]'
However, I am not sure what the syntax is to exclude one query from another.
What I am trying to say is give me all the user ID's except for those that are in group X.
SELECT UserID FROM Users
WHERE UserID NOT IN (SELECT UserID FROM User_Groups WHERE GroupID = ?)
P.S. Don't interpolate variables into your queries as this can lead to SQL injection vulnerabilities in your code. Use placeholders instead.
SELECT Users.UserID
FROM Users
LEFT JOIN User_Groups
ON Users.UserID = User_Groups.UserID
AND User_Groups.GroupID = '$_[0]'
WHERE User_Groups.UserID IS NULL
You can left join to the other table and then put an IS NULL check on it in your WHERE clause as I've shown. (The GroupID condition belongs in the ON clause; if it were in the WHERE clause it would throw away the unmatched rows the IS NULL check is looking for.)
You could use EXCEPT as well:
SELECT UserID
FROM Users
EXCEPT
SELECT UserID
FROM User_Groups
WHERE GroupID = '$_[0]'
EXCEPT is SQL's version of set subtraction. Which of the various approaches (EXCEPT, NOT IN, ...) you should use depends, as usual, on your specific circumstances, what your database supports, and which one works best for you.
And eugene y has already mentioned the SQL injection issue with your code so I'll just consider that covered.
I linked to the PostgreSQL documentation even though this isn't a PostgreSQL question because the PostgreSQL documentation is quite good. SQLite does support EXCEPT:
The EXCEPT operator returns the subset of rows returned by the left SELECT that are not also returned by the right-hand SELECT. Duplicate rows are removed from the results of INTERSECT and EXCEPT operators before the result set is returned.
NOT IN() - Negating IN()
SELECT UserID FROM User_Groups WHERE GroupID NOT IN ('1', '2')
The IN() parameter can also be a sub-query.
Are you looking for a solution to be used with a Postgres or a MySQL database, or are you looking for a plain SQL solution?
With Postgres, a subquery with WHERE NOT EXISTS might work, along these lines (rewritten against the question's tables so the subquery is correlated):
SELECT u.UserID
FROM Users u
WHERE NOT EXISTS
(SELECT 1
FROM User_Groups ug
WHERE ug.UserID = u.UserID
AND ug.GroupID = '$_[0]')

Is this SQL select code following good practice?

I am using SQLite and will port to MySQL (5) later.
I wanted to know if I am doing something I shouldn't be doing. I purposely designed it so I compare to 0 instead of 1 (I changed hasApproved to NotApproved to do this; not a big deal, and I haven't written any code yet). I was told I'd never need to write a subquery, but I do here. My Votes table is just id, ip, postid (I don't think I can write that subquery as a join instead?), and that's pretty much all that is on my mind.
Naming conventions I don't really care about, since the tables are created via reflection and the naming is all over the place.
select
id,
name,
body,
upvotes,
downvotes,
(select 1 from UpVotes where IPAddr=? AND post=Post.id) as myup,
(select 1 from DownVotes where IPAddr=#0 AND post=Post.id) as mydown
from Post
where
flag = '0'
limit ?, ?"
Since you're asking about good practices... the "upvotes" and "downvotes" appearing in your Posts table looks like you're duplicating data in your database. That's a problem, because now you always have to worry whether or not the data is in sync and correct. If you want to know the number of upvotes then count them, don't also store them in the Post table. I'm not positive that is what you're doing, but it's a guess.
Onto your query... You will probably get better performance using a JOINed subquery instead of how you have it. With scalar subqueries as columns, they have to be run once for every row that is returned. That could be a pretty big performance hit if you're returning a bunch of rows. Instead, try:
SELECT
P.id,
P.name,
P.body,
P.upvotes,
P.downvotes,
COALESCE(UV.cnt, 0) AS upvotes2,
COALESCE(DV.cnt, 0) AS downvotes2
FROM
dbo.Posts P
LEFT OUTER JOIN (SELECT post_id, COUNT(*) cnt FROM dbo.UpVotes GROUP BY post_id) AS UV ON UV.post_id = P.id
LEFT OUTER JOIN (SELECT post_id, COUNT(*) cnt FROM dbo.DownVotes GROUP BY post_id) AS DV ON DV.post_id = P.id
Compare it to your own query and see if it gives you better performance.
EDIT: A couple of other posters have advocated a single table for up/down votes. They are absolutely correct. That makes the query even easier and also probably much faster:
SELECT
P.id,
P.name,
P.body,
P.upvotes,
P.downvotes,
SUM(CASE WHEN V.vote_type = 'UP' THEN 1 ELSE 0 END) AS upvotes2,
SUM(CASE WHEN V.vote_type = 'DOWN' THEN 1 ELSE 0 END) AS downvotes2
FROM
dbo.Posts P
LEFT OUTER JOIN Votes V ON
V.post_id = P.id
GROUP BY
P.id,
P.name,
P.body,
P.upvotes,
P.downvotes
I'm guessing that you're trying to ensure that a user only votes once on each post here.
I wouldn't - I don't - use separate tables for up votes and down votes. Add vote type to your votes table and you won't need correlated subqueries.
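A hedged sketch of that single-table design (names are made up; the UNIQUE constraint enforces one vote per IP per post):
CREATE TABLE Votes (
    id INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES Post (id),
    ip VARCHAR(45) NOT NULL, -- long enough for an IPv6 address in text form
    vote_type CHAR(4) NOT NULL CHECK (vote_type IN ('UP', 'DOWN')),
    UNIQUE (post_id, ip)
);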
Here are my opinions:
It seems that the tables "UpVotes" and "DownVotes" have the same structure and can be merged into one table.
The relation between the "Post" table and the votes table can be constrained by a foreign key.
Although I am not sure about the performance difference, I think it would be better to use a join rather than nesting two select statements inside a select statement.
You can use joins to achieve the same thing, and I would expect joins to work a lot more efficiently than embedded selects.

Two left joins and one query to MySQL performance problem

I'm designing a project for quizzes and quizz results. So I have two tables: quizz_result and quizz. quizz has a primary key on ID, and quizz_result has a foreign key QUIZZ_ID referencing quizz.ID.
The query below is designed to take public quizzes ordered by date, with associated information: whether the current user (683735) took the quizz and has a valid result (>0), and how many people have filled in the quizz up to this point in time.
So I did this simple query with two left joins:
select
a.*,
COUNT(countt.QUIZZ_ID) SUMFILL
from
quizz a
left join quizz_result countt
on countt.QUIZZ_ID = a.ID
group by
a.ID
And added indexes on these columns:
Quizz:
ID, (ID, DATE), PUBLIC, (PUBLIC, DATE)
And on quizz_result:
ID, (QUIZZ_ID, USER_ID), QUIZZ_ID, USER_ID, (QUIZZ_ID, QUIZZ_RESULT_ID)
But still, when I run the query it takes about one minute, and I have only 34k rows in QUIZZ_RESULTS and 120 rows in the QUIZZ table.
When I do EXPLAIN on this query I get this:
SELECT TYPE: simple; possible keys: IDX_PUBLIC, DATE; rows: 34; extra: Using where; Using temporary; Using filesort
SELECT TYPE: simple; possible keys: IDX_QUIZZ_USER, IDX_QUIZZ_RES_RES_QUIZ, IDX_USERID, I...; rows: 1; extra: (nothing)
SELECT TYPE: simple; possible keys: IDX_QUIZZ_USER, IDX_QUIZ_RES_RES_QUIZZ, ID_RESULT_ID; rows: 752; extra: Using index
And I don't know what to do to optimise this query. I see this:
Using where; Using temporary; Using filesort
But still I don't know how to make this better. Or maybe the number of rows in the last select is too high? 752?
How can I optimise this query?
EDIT: I've updated the query to this one, with only one left join, because it has the same long execution time.
EDIT2: I removed everything else and that's it: this simple select with one join takes 1 second to execute. How can it be optimised?
Try taking some of those additional conditions out of your joins.
Moving them to the where clause can sometimes help. Also, consider putting the core joins into their own subquery and then limiting that with a where clause.
What about an index on (USER_ID, QUIZZ_ID, QUIZZ_RESULT_ID), since they're all AND'd together?
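That index might be created like so (the index name is made up):
CREATE INDEX idx_qr_user_quizz_result
ON quizz_result (USER_ID, QUIZZ_ID, QUIZZ_RESULT_ID);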
I've changed it to this:
select
a.*,
COUNT(a.ID) SUMFILL
from
quizz a
left join quizz_result countt
on countt.QUIZZ_ID = a.ID
group by
a.ID
And it's good now.
Try this:
SELECT q.*,
(
SELECT COUNT(*)
FROM quizz_result qr
WHERE qr.quizz_id = q.id
) AS total_played,
(
SELECT result
FROM quizz_result qr2
WHERE qr2.quizz_id = q.id
AND qr2.user_id = 683735
) AS current_user_won
FROM quizz q

optimize SQL query

What more can I do to optimize this query?
SELECT * FROM
(SELECT `item`.itemID, COUNT(`votes`.itemID) AS `votes`,
`item`.title, `item`.itemTypeID, `item`.submitDate,
`item`.deleted, `item`.ItemCat,
`item`.counter, `item`.userID, `users`.name,
TIMESTAMPDIFF(minute,`submitDate`,NOW()) AS 'timeMin' ,
`myItems`.userID as userIDFav, `myItems`.deleted as myDeleted
FROM (votes `votes` RIGHT OUTER JOIN item `item`
ON (`votes`.itemID = `item`.itemID))
INNER JOIN
users `users`
ON (`users`.userID = `item`.userID)
LEFT OUTER JOIN
myItems `myItems`
ON (`myItems`.itemID = `item`.itemID)
WHERE (`item`.deleted = 0)
GROUP BY `item`.itemID,
`votes`.itemID,
`item`.title,
`item`.itemTypeID,
`item`.submitDate,
`item`.deleted,
`item`.ItemCat,
`item`.counter,
`item`.userID,
`users`.name,
`myItems`.deleted,
`myItems`.userID
ORDER BY `item`.itemID DESC) as myTable
where myTable.userIDFav = 3 or myTable.userIDFav is null
limit 0, 20
I'm using MySQL
Thanks
What does the analyzer say for this query? Without knowing how many rows there are in the tables, you can't suggest any optimisation. So run the analyzer and you'll see which parts cost what.
Of course, as @theomega said, look at the execution plan.
But I'd also suggest trying to "clean up" your statement. (I don't know which one is faster; that depends on your table sizes.) Usually I'd start with a clean statement and optimize from there. Typically, a clean statement makes it easier for the optimizer to come up with a good execution plan.
So here are some observations about your statement that might make things slow:
a couple of outer joins (makes it hard for the optimizer to figure out an index to use)
a group by
a lot of columns to group by
As far as I understand your SQL, this statement should do most of what yours is doing:
SELECT `item`.itemID, `item`.title, `item`.itemTypeID, `item`.submitDate,
`item`.deleted, `item`.ItemCat,
`item`.counter, `item`.userID, `users`.name,
TIMESTAMPDIFF(minute, `submitDate`, NOW()) AS 'timeMin'
FROM (item `item` INNER JOIN users `users`
ON (`users`.userID = `item`.userID))
WHERE `item`.deleted = 0
Of course, this misses the info from the tables you outer joined; I'd suggest adding the required columns via a subselect:
SELECT `item`.itemID,
(SELECT COUNT(itemID)
FROM votes v
WHERE v.itemID = `item`.itemID) AS `votes`, <etc.>
This way, you can get rid of one outer join and the group by. The outer join is replaced by the subselect, so there is a trade-off which may be bad for the "cleaner" statement.
Depending on the cardinality between item and myItems, you can do the same or you'd have to stick with the outer join (but no need to reintroduce the group by).
Hope this helps.
Some quick semi-random thoughts:
Are your itemID and userID columns indexed?
What happens if you add "EXPLAIN " to the start of the query and run it? Does it use indexes? Are they sensible?
Do you need to run the whole inner query and filter on it, or could you move the where myTable.userIDFav = 3 or myTable.userIDFav is null part into the inner query?
You do seem to have too many fields in the GROUP BY list; since one of them is itemID, I suspect you could use an inner SELECT to perform the grouping and an outer SELECT to return the set of fields desired, as sketched below.
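A hedged sketch of that shape, reusing the tables from the query (aggregate votes on itemID alone, then join back for the remaining columns):
SELECT i.*, u.name, COALESCE(v.votes, 0) AS votes
FROM item i
INNER JOIN users u ON u.userID = i.userID
LEFT JOIN
    (SELECT itemID, COUNT(*) AS votes
     FROM votes
     GROUP BY itemID) v
ON v.itemID = i.itemID
WHERE i.deleted = 0;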
Can't you add the where clause myTable.userIDFav = 3 or myTable.userIDFav is null to WHERE (item.deleted = 0)?
Regards
Lieven
Look at the way your query is built: you join a lot of stuff, then limit the output to 20 rows. Since your conditions only apply to item and myItems, you should outer join those two tables, limit the output to the first 20 rows, and only then join and aggregate the rest; see the sketch below. As it stands, you are performing a lot of work that is then discarded.
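A sketch of that order of operations, reusing the question's tables and conditions: filter and LIMIT item/myItems first, then join users and the vote counts onto just those 20 rows.
SELECT t.*, u.name, COALESCE(v.votes, 0) AS votes
FROM
    (SELECT i.*, mi.userID AS userIDFav, mi.deleted AS myDeleted
     FROM item i
     LEFT JOIN myItems mi ON mi.itemID = i.itemID
     WHERE i.deleted = 0
       AND (mi.userID = 3 OR mi.userID IS NULL)
     ORDER BY i.itemID DESC
     LIMIT 0, 20) t
INNER JOIN users u ON u.userID = t.userID
LEFT JOIN
    (SELECT itemID, COUNT(*) AS votes
     FROM votes
     GROUP BY itemID) v
ON v.itemID = t.itemID
ORDER BY t.itemID DESC;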