This statement appears inefficient because only one out of 10 records is selected, and only 1 in 100 entries contains comments.
What can I do to improve it?
$query = "SELECT
A,B,C,
(SELECT COUNT(*)
FROM comments
WHERE comments.nid = header_file.nid)
as my_comment_count
FROM header_file
Where A = 'admin' "
edit: I want header records even if no comments are found.
You can add an index on the A and nid columns.
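For example (index names here are illustrative):

-- speeds up the correlated COUNT(*) lookup
CREATE INDEX idx_comments_nid ON comments (nid);
-- speeds up the WHERE A = 'admin' filter
CREATE INDEX idx_header_file_a ON header_file (A);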
I am using an inner join here because it sounds like you only want header_file records that contain comments. If this is not the case, change it to a left outer join:
select h.a, h.b, h.c, c.Count
from header_file h
inner join (
select nid, count(*) as Count
from comments
group by nid
) c on c.nid = h.nid
where h.a = 'admin'
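If you do want headers without comments (per the edit above), the left outer join version of the same rewrite just needs a COALESCE so missing counts come back as 0:

select h.a, h.b, h.c, coalesce(c.Count, 0) as my_comment_count
from header_file h
left outer join (
    select nid, count(*) as Count
    from comments
    group by nid
) c on c.nid = h.nid
where h.a = 'admin'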
From question:
This statement appears inefficient ...
How do you know it is inefficient?
Do you have execution plan?
Did you measure execution times?
Are you sure it uses index?
You commented on Peter Lang's answer: ... not sure if any performance gain here - what is that based on?
A basic thing you should know about query execution:
Most modern RDBMSs have a query optimizer that analyzes your SQL and determines an optimal execution plan.
Your feeling that some query is "bad" doesn't mean anything. You need to check the execution plan, and then you will see whether there is anything you can do to improve performance.
For MySQL, see the manual section 7.2.1, Optimizing Queries with EXPLAIN.
You could also run the SQL from the answers and compare execution plans to see if any of the proposed solutions gives better performance.
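For example, prefixing the query from the question with EXPLAIN shows which indexes MySQL would use and how many rows it expects to examine:

EXPLAIN SELECT A, B, C,
    (SELECT COUNT(*)
     FROM comments
     WHERE comments.nid = header_file.nid) AS my_comment_count
FROM header_file
WHERE A = 'admin';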
You can try to use a Left Join; this could allow better optimization. (Count(c.nid) instead of Count(*) keeps headers without comments at a count of 0.)
Select a, b, c, Count(c.nid) As my_comment_count
From header_file h
Left Outer Join comments c On ( c.nid = h.nid )
Where a = 'admin'
Group By a, b, c
Store the comment count directly in the table; COUNT(*) isn't very efficient on InnoDB.
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
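A minimal sketch of that denormalization, assuming a hypothetical comment_count column on header_file that the application (or a trigger) keeps current:

-- hypothetical comment_count column, backfilled once:
ALTER TABLE header_file ADD COLUMN comment_count INT NOT NULL DEFAULT 0;

UPDATE header_file h
SET h.comment_count = (SELECT COUNT(*) FROM comments c WHERE c.nid = h.nid);

-- then increment on every new comment:
UPDATE header_file SET comment_count = comment_count + 1 WHERE nid = ?;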
I'm working on an Oracle query that does a select on a huge table; however, the joins with other tables seem to cost a lot in processing time.
I'm looking for tips on how to improve the performance of this query.
I'm attaching a version of the query and its explain plan.
Query
SELECT
l.gl_date,
l.REST_OF_TABLES,
(
SELECT
MAX(tt.task_id)
FROM
bbb.jeg_pa_tasks tt
WHERE
l.project_id = tt.project_id
AND l.task_number = tt.task_number
) task_id
FROM
aaa.jeg_labor_history l,
bbb.jeg_pa_projects_all p
WHERE
p.org_id = 2165
AND l.project_id = p.project_id
AND p.project_status_code = '1000'
Something to mention:
This query takes data from Oracle to send it to a SQL Server database, so I need it to be this big; I can't narrow the scope of the query.
The purpose is to set it up as a SQL Server job with SSIS so it runs periodically.
One obvious suggestion is not to use a subquery in the SELECT clause.
Instead, you can try to join the tables:
SELECT
l.gl_date,
l.REST_OF_TABLES,
t.task_id
FROM
aaa.jeg_labor_history l
Join bbb.jeg_pa_projects_all p
On (l.project_id = p.project_id)
Left join (SELECT
tt.project_id,
tt.task_number,
MAX(tt.task_id) task_id
FROM
bbb.jeg_pa_tasks tt
Group by tt.project_id, tt.task_number) t
On (l.project_id = t.project_id
AND l.task_number = t.task_number)
WHERE
p.org_id = 2165
AND p.project_status_code = '1000';
Cheers!!
As I don't know exactly how many rows this query returns or how many rows this table/view has, I can only offer a few simple tips that might help you get better query performance:
Check Indexes. There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement (see the sketch after this list).
Limit the size of your working data set.
Only select columns you need.
Remove unnecessary tables.
Remove calculated columns in JOIN and WHERE clauses.
Use an inner join instead of an outer join if possible.
Your view contains a lot of data, so you can also break it down and select only the information you need from it.
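As a sketch of the first tip (the table and column names here are hypothetical, purely for illustration):

-- index the column used in the JOIN
CREATE INDEX ix_orders_customer_id ON orders (customer_id);
-- index the column used in the WHERE clause
CREATE INDEX ix_orders_status ON orders (status);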
Given:
Table y
  id int clustered index
  name nvarchar(25)
Table anothertable
  id int clustered index
  name nvarchar(25)
Function someFunction
  does some math then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON anotherTable.id = dbo.SomeFunction(y.id)
Question:
While timing these two queries, I found that on large data sets the first query using IN is much faster than the second query using an INNER JOIN. I don't understand why; can someone help explain, please?
Execution Plan
Generally speaking, IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan, though, it can be seen that in this case the two queries are semantically the same:
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
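To make that concrete, here is a minimal standalone illustration (T-SQL, made-up values): duplicating a row leaves MAX unchanged but doubles SUM.

-- MAX over a duplicated value is unaffected:
SELECT MAX(v) AS max_v FROM (VALUES (5), (5)) AS dup(v);  -- 5
-- SUM double-counts it:
SELECT SUM(v) AS sum_v FROM (VALUES (5), (5)) AS dup(v);  -- 10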
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM, so quite possibly it expands out all the duplicates and then aggregates them later, simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs. 28271800 for JOIN, but these estimates won't necessarily be very reliable, as the join predicate is unsargable.
Your second query is a bit funny - can you try this one instead?
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data, etc.) and your system (RAM, disk, etc.), it's really hard to give a "globally" valid statement.
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function, it will be better.
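For illustration, a sketch of that rewrite - the function body here is hypothetical, only the shape matters:

-- Scalar UDF: invoked row by row, hard for the optimizer to inline
CREATE FUNCTION dbo.SomeFunction (@id INT)
RETURNS INT
AS
BEGIN
    RETURN @id * 2;  -- hypothetical math
END;
GO

-- Inline table-valued equivalent: the optimizer can expand it into the plan
CREATE FUNCTION dbo.SomeFunctionTVF (@id INT)
RETURNS TABLE
AS
RETURN (SELECT @id * 2 AS value);
GO

-- Used via CROSS APPLY instead of the scalar call:
SELECT y.name
FROM dbo.y
CROSS APPLY dbo.SomeFunctionTVF(y.id) AS f
INNER JOIN dbo.AnotherTable a ON a.id = f.value;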
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.
Is the following the most efficient in SQL to achieve its result:
SELECT *
FROM Customers
WHERE Customer_ID NOT IN (SELECT Cust_ID FROM SUBSCRIBERS)
Could some use of joins be better and achieve the same result?
Any mature enough SQL database should be able to execute that just as effectively as the equivalent JOIN. Use whatever is more readable to you.
One reason why you might prefer to use a JOIN rather than NOT IN is that if the values in the NOT IN clause contain any NULLs, you will always get back no results. If you do use NOT IN, remember to always consider whether the subquery might bring back a NULL value!
RE: Question in Comments
'x' NOT IN (NULL, 'a', 'b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x' <> 'b'
≡ Unknown and True and True
≡ Unknown
Maybe try this
Select cust.*
From dbo.Customers cust
Left Join dbo.Subscribers subs on cust.Customer_ID = subs.Cust_ID
Where subs.Cust_ID Is Null
SELECT Customers.*
FROM Customers
WHERE NOT EXISTS (
SELECT *
FROM SUBSCRIBERS AS s
WHERE s.Cust_ID = Customers.Customer_ID)
When using “NOT IN”, the query may perform nested full table scans, whereas with “NOT EXISTS” the query can use an index within the sub-query.
If you want to know which is more effective, you should try looking at the estimated query plans, or the actual query plans after execution. It'll tell you the costs of the queries (I find CPU and IO cost to be interesting). I wouldn't be surprised much if there's little to no difference, but you never know. I've seen certain queries use multiple cores on our database server, while a rewritten version of that same query would only use one core (needless to say, the query that used all 4 cores was a good 3 times faster). Never really quite put my finger on why that is, but if you're working with large result sets, such differences can occur without your knowing about it.
Basically I'm trying to pull a random poll question that a user has not yet responded to from a database. This query takes about 10-20 seconds to execute, which is obviously no good! The responses table is about 30K rows and the database also has about 300 questions.
SELECT questions.id
FROM questions
LEFT JOIN responses ON ( questions.id = responses.questionID
AND responses.username = 'someuser' )
WHERE
responses.username IS NULL
ORDER BY RAND() ASC
LIMIT 1
The PK for the questions and responses tables is 'id', if that matters.
Any advice would be greatly appreciated.
You most likely need an index on
responses.questionID
responses.username
Without an index, searching through 30k rows will always be slow.
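For example (the index name is illustrative), one composite index covers both the username filter and the questionID join in a single lookup:

CREATE INDEX idx_responses_user_question
    ON responses (username, questionID);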
Here's a different approach to the query which might be faster:
SELECT q.id
FROM questions q
WHERE q.id NOT IN (
SELECT r.questionID
FROM responses r
WHERE r.username = 'someuser'
)
Make sure there is an index on r.username and that should be pretty quick.
The above will return all the unanswered questions. To choose a random one, you could go with the inefficient (but easy) ORDER BY RAND() LIMIT 1, or use the method suggested by Tom Leys.
The problem is probably not the join; it's almost certainly sorting 30k rows with ORDER BY RAND().
See: Do not order by rand
He suggests (replace the quotes table in this example with your own):
SELECT COUNT(*) AS cnt FROM quotes
-- generate random number between 0 and cnt-1 in your programming language and run
-- the query:
SELECT quote FROM quotes LIMIT $generated_number, 1
Of course you could probably make the first statement a subselect inside the second.
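MySQL won't accept a subquery directly inside LIMIT, but a sketch of the two-step approach in pure SQL could use a prepared statement (LIMIT does accept a placeholder there):

-- count once, pick a random offset, fetch one row at that offset
SELECT COUNT(*) INTO @cnt FROM quotes;
SET @offset = CAST(FLOOR(RAND() * @cnt) AS UNSIGNED);
PREPARE stmt FROM 'SELECT quote FROM quotes LIMIT ?, 1';
EXECUTE stmt USING @offset;
DEALLOCATE PREPARE stmt;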
Is the OP even sure the original query returns the correct result set?
I assume the "AND responses.username = 'someuser'" clause was added to the join specification with the intention that the join would then generate NULL right-side columns for only the ids that someuser has not answered.
My question: won't that join generate NULL right-side columns for every question.id that has not been answered by all users? The left join works such that, "If any row from the target table does not match the join expression, then NULL values are generated for all column references to the target table in the SELECT column list."
In any case, nickf's suggestion looks good to me.
What more can I do to optimize this query?
SELECT * FROM
(SELECT `item`.itemID, COUNT(`votes`.itemID) AS `votes`,
`item`.title, `item`.itemTypeID, `item`.submitDate,
`item`.deleted, `item`.ItemCat,
`item`.counter, `item`.userID, `users`.name,
TIMESTAMPDIFF(minute,`submitDate`,NOW()) AS 'timeMin' ,
`myItems`.userID as userIDFav, `myItems`.deleted as myDeleted
FROM (votes `votes` RIGHT OUTER JOIN item `item`
ON (`votes`.itemID = `item`.itemID))
INNER JOIN
users `users`
ON (`users`.userID = `item`.userID)
LEFT OUTER JOIN
myItems `myItems`
ON (`myItems`.itemID = `item`.itemID)
WHERE (`item`.deleted = 0)
GROUP BY `item`.itemID,
`votes`.itemID,
`item`.title,
`item`.itemTypeID,
`item`.submitDate,
`item`.deleted,
`item`.ItemCat,
`item`.counter,
`item`.userID,
`users`.name,
`myItems`.deleted,
`myItems`.userID
ORDER BY `item`.itemID DESC) as myTable
where myTable.userIDFav = 3 or myTable.userIDFav is null
limit 0, 20
I'm using MySQL
Thanks
What does the analyzer say about this query? Without knowing how many rows there are in the tables, you can't tell what to optimize. So run the analyzer, and you'll see which parts cost what.
Of course, as #theomega said, look at the execution plan.
But I'd also suggest trying to "clean up" your statement. (I don't know which one is faster - that depends on your table sizes.) Usually, I'd try to start with a clean statement and optimize from there. Typically, a clean statement makes it easier for the optimizer to come up with a good execution plan.
So here are some observations about your statement that might make things slow:
a couple of outer joins (makes it hard for the optimizer to figure out an index to use)
a group by
a lot of columns to group by
As far as I understand your SQL, this statement should do most of what yours is doing:
SELECT `item`.itemID, `item`.title, `item`.itemTypeID, `item`.submitDate,
       `item`.deleted, `item`.ItemCat, `item`.counter, `item`.userID,
       `users`.name,
       TIMESTAMPDIFF(minute, `submitDate`, NOW()) AS 'timeMin'
FROM item `item`
INNER JOIN users `users` ON (`users`.userID = `item`.userID)
WHERE `item`.deleted = 0
Of course, this misses the info from the tables you outer joined; I'd suggest trying to add the required columns via a subselect:
SELECT `item`.itemID,
    (SELECT COUNT(itemID)
     FROM votes v
     WHERE v.itemID = `item`.itemID) AS `votes`, <etc.>
This way, you can get rid of one outer join and the group by. The outer join is replaced by the subselect, so there is a trade-off which may be bad for the "cleaner" statement.
Depending on the cardinality between item and myItems, you can do the same or you'd have to stick with the outer join (but no need to reintroduce the group by).
Hope this helps.
Some quick semi-random thoughts:
Are your itemID and userID columns indexed?
What happens if you add "EXPLAIN " to the start of the query and run it? Does it use indexes? Are they sensible?
Do you need to run the whole inner query and filter on it, or could you move the where myTable.userIDFav = 3 or myTable.userIDFav is null part into the inner query? (See the sketch after this list.)
You do seem to have too many fields in the GROUP BY list; since one of them is itemID, I suspect that you could use an inner SELECT to perform the grouping and an outer SELECT to return the set of fields desired.
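A sketch of that restructuring - the userIDFav filter from the outer query moved into the inner WHERE (equivalent here because myItems.userID is part of the grouping):

SELECT `item`.itemID, COUNT(`votes`.itemID) AS `votes`,
       `item`.title, `item`.itemTypeID, `item`.submitDate, `item`.deleted,
       `item`.ItemCat, `item`.counter, `item`.userID, `users`.name,
       TIMESTAMPDIFF(minute, `submitDate`, NOW()) AS 'timeMin',
       `myItems`.userID AS userIDFav, `myItems`.deleted AS myDeleted
FROM votes `votes`
RIGHT OUTER JOIN item `item` ON (`votes`.itemID = `item`.itemID)
INNER JOIN users `users` ON (`users`.userID = `item`.userID)
LEFT OUTER JOIN myItems `myItems` ON (`myItems`.itemID = `item`.itemID)
WHERE `item`.deleted = 0
  AND (`myItems`.userID = 3 OR `myItems`.userID IS NULL)  -- filter moved inward
GROUP BY `item`.itemID, `votes`.itemID, `item`.title, `item`.itemTypeID,
         `item`.submitDate, `item`.deleted, `item`.ItemCat, `item`.counter,
         `item`.userID, `users`.name, `myItems`.deleted, `myItems`.userID
ORDER BY `item`.itemID DESC
LIMIT 0, 20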
Can't you add the where clause myTable.userIDFav = 3 or myTable.userIDFav is null to WHERE (item.deleted = 0)?
Regards
Lieven
Look at the way your query is built: you join a lot of tables, then limit the output to 20 rows. Since your conditions only apply to item and myItems, you should outer join those two tables, limit the output to the first 20 rows, and only then join and aggregate the rest. As written, you are performing a lot of work that is then discarded.