I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have query in query? Is it "inner query"? "Sub-query"? Does it counts as three queries (my example)? If its bad to do so... how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but as you have it performance can vary, especially depending on your RDBMS.
It varies from database to database, especially if the columns compared are
indexed or not
nullable or not
..., but generally if your query is not using columns from the table joined to -- you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id )
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong in using them, and can make your life easier. However, performance can often be improved by rewriting queries w/ subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the mysql docs on the topic.
Related
I am confused about when to use HAVING and when to use WHERE. I need to
Find all of the bugs on software Debugger that pertain to the /main.html
This is my query
select Tickets.TicketID, b.Data
from Bugs b
Inner Join Tickets
On b.TicketID = Tickets.TicketID
Inner Join Softwares
on Software.SoftwareId = Tickets.SoftwareID
where Software.URL = 'http://debugger.com' and Tickets.Title = '/main.html'
NOTE: THIS GIVES ME DESIRED RESULT
But I want to make sure I am not missing anything important here. Maybe should I use HAVING somewhere here?
Also in order to make the query perform better on a large dataset, I have created an index on foreign keys
create nonclustered index IX_Tickets_SoftwareId
on [dbo].[Tickets] ([SoftwareId])
go
create nonclustered index IX_Bugs_TicketsId
on [dbo].[Bugs] ([TicketsId])
Am doing allright?
Your query is fine. You want to filter individual records, which is what the WHERE clause does.
The HAVING clause comes into play in aggregate queries - queries that use GROUP BY, and its purpose is to filter groups of records, using aggregate functions (such as SUM(), MAX() or the-like). It makes no sense for your query, that does not use aggregation.
Incidently, I note that your are not returning anything from the softwares table, so that join is used for filtering only. In such situation, I find that exists is more appropriate, because it is explicit about its purpose:
select t.ticketid, b.data
from bugs b
inner join tickets t on b.ticketid = t.ticketid
where t.title = '/main.html' and exists (
select 1
from softwares s
where s.softwareid = t.softwareid and s.url = 'http://debugger.com'
)
For performance, consider an index on softwares(softwareid, url), so the subquery execute efficiently. An index on tickets(ticketid, title) might also help.
WHERE is used to filter records before any groupings take place. HAVING is used to filter values after they have been groups. Only columns or expressions in the group can be included in the HAVING clause's
What is the difference between semi-joins and a subquery? I am currently taking a course on this on DataCamp and i'm having a hard time making a distinction between the two.
Thanks in advance.
A join or a semi join is required whenever you want to combine two or more entities records based on some common conditional attributes.
Unlike, Subquery is required whenever you want to have a lookup or a reference on same table or other tables
In short, when your requirement is to get additional reference columns added to existing tables attributes then go for join else when you want to have a lookup on records from the same table or other tables but keeping the same existing columns as o/p go for subquery
Also, In case of semi join it can act/used as a subquery because most of the times we dont actually join the right table instead we mantain a check via subquery to limit records in the existing hence semijoin but just that it isnt a subquery by itself
I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select * -- this is often called the "outer" query
from (
select columnA -- this is the subquery inside the parentheses
from mytable
where columnB = 'Y'
)
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join classes c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from class
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is typically preferred:
select FirstName, LastName, DOB
from people p
where exists (select pID
from class c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.
I have a SQL query that works in Access 2016:
SELECT
Count(*) AS total_tests,
Sum(IIf(score>=securing_threshold And score<mastering_threshold,1,0)) AS total_securing,
Sum(IIf(score>=mastering_threshold,1,0)) AS total_mastering,
total_securing/Count(*) AS percent_securing,
total_mastering/Count(*) AS percent_mastering,
(Count(*)-total_securing-total_mastering)/Count(*) AS percent_below,
subjects.subject,
students.year_entered,
IIf(Month(Date())<9,Year(Date())-students.year_entered+6,Year(Date())-students.year_entered+7) AS current_form,
groups.group
FROM
((subjects
INNER JOIN tests ON subjects.ID = tests.subject)
INNER JOIN (students
INNER JOIN test_results ON students.ID = test_results.student) ON tests.ID = test_results.test)
LEFT JOIN
(SELECT * FROM group_membership LEFT JOIN groups ON group_membership.group = groups.ID) As g
ON students.ID = g.student
GROUP BY subjects.subject, students.year_entered, groups.group;
However, I wish to filter out irrelevant groups before joining them to my table. The table groups has a column subject which is a foreign key.
When I try changing ON students.ID = g.student to ON students.ID = g.student And subjects.ID = g.subject I get the error 'JOIN expression not supported'.
Alternatively, when I try adding WHERE subjects.ID = groups.subject to the subquery, it asks me for the parameter value of subjects.ID, although it is a column in the parent query.
Googling reveals many similar errors but they were all resolved by changing the brackets. That didn't help.
Just in case the table relationships help:
Thank you.
EDIT: Sample database at https://www.dropbox.com/s/yh80oooem6gsni7/student%20tracker.ACCDB?dl=0
MS Access queries with many joins are difficult to update by SQL alone as parenetheses pairings are required unlike other RDBMS's and these pairings must follow an order. Moreover, some pairings can even be nested. Hence, for beginners it is advised to build queries with complex and many joins using the query design GUI in the MS Office application and let it build out the SQL.
For a simple filter on the g derived table, you could filter subject on the derived table, g, but likely you want want all subjects:
...
(SELECT * FROM group_membership
LEFT JOIN groups ON group_membership.group = groups.ID
WHERE groups.subject='Earth Science') As g
...
So for all subjects, consider re-building query from scratch in GUI that nearly mirrors your table relationships which actually auto-links joins in the GUI. Then, drop unneeded tables.
Usually you want to begin with the join table or set like groups and group_membership or tests and test_results. In fact, consider saving the g derived table as its own query.
Then add the distinct record primary source tables like students and subjects.
You may even need to play around with order in FROM and JOIN clauses to attain desired results, and maybe even add the same table in query. And be careful with adding join tables like group_membership (two one-to-many links), to GROUP BY queries as it leads to the duplicate record aggregation. So you may need to join aggregates queries by subject.
Unless you can post content of all tables, from our perspective it is difficult to help from here.
Your subquery g uses a LEFT JOIN, but there is a enforced 1:n relation between the two tables, so there will always be a matching group. Use a INNER JOIN instead.
With g.subject you are trying to join on a column that is on the right side of a left join, that cannot really work.
Also you shouldn't use SELECT * on a join of tables with identical column names. Include only the qualified column names that you need.
LEFT JOIN
(SELECT group_membership.student, groups.group, groups.subject
FROM group_membership INNER JOIN groups
ON group_membership.group = groups.ID) As g
ON (students.ID = g.student AND subjects.ID = g.subject)
I would call the columns in group_membership group_ID and student_ID to avoid confusion.
I don't have the database to test, but I would use subject table as subquery:
(SELECT * FROM subject WHERE filter out what you don't need) Subj
Then INNER JOIN this new Subj Table in your query which would exclude irrelevant groups.
Also I would never create join in WHERE clause (WHERE subjects.ID = groups.subject), what this does it creates cartesian product (table with all the possible combinations of subjects.ID and groups.subject) then it filters out records to satisfy your join. When dealing with huge data it might take forever or crash.
Error related to "Join expression may not be supported"; do datatypes match in those fields?
I solved it by (a lot of trial and error and) taking the advice here to make queries in the GUI and joining them. The end result is 4 queries deep! If the database was bigger, performance would be awful... but now its working I can tweak it eventually.
Thank you everybody!
I have heard joins should be preferred over nested queries. Is it true in general? Or there might be scenarios where one would be faster than other:
for e.g. which is more efficient way to write a query?:
Select emp.salary
from employee emp
where emp.id = (select s.id from sap s where s.id = 111)
OR
Select emp.salary
from employee emp
INNER JOIN sap s ON emp.id = s.id
WHERE s.id = 111
I have heard joins should be preferred over nested queries. Is it true in general?
It depends on the requirements, and the data.
Using a JOIN risks duplicating the information in the resultset for the parent table if there are more than one child records related to it, because a JOIN returns the rows that match. Which means if you want unique values from the parent table while using JOINs, you need to look at using either DISTINCT or a GROUP BY clause. But none of this is a concern if a subquery is used.
Also, subqueries are not all the same. There's the straight evaluation, like your example:
where emp.id = (select s.id from sap s where s.id = 111)
...and the IN clause:
where emp.id IN (select s.id from sap s where s.id = 111)
...which will match any of the value(s) returned by the subquery when the straight evaluation will throw an error if s.id returns more than one value. But there's also the EXISTS clause...
WHERE EXISTS(SELECT NULL
FROM SAP s
WHERE emp.id = s.id
AND s.id = 111)
The EXISTS is different in that:
the SELECT clause doesn't get evaluated - you can change it to SELECT 1/0, which should trigger a divide-by-zero error but won't
it returns true/false; true based on the first instance the criteria is satisfied so it's faster when dealing with duplicates.
unlike the IN clause, EXISTS supports comparing two or more column comparisons at the same time, but some databases do support tuple comparison with the IN.
it's more readable
If the queries are logically equivalent, then the query optimizer should be able to make the same (best) execution plan from each one. In that case, query style should support what can be understood the best (that's subqueries for me).
It is much faster (and easier to write) to join two tables on an index than to run two separate queries (even a subquery).
What more can I do to optimize this query?
SELECT * FROM
(SELECT `item`.itemID, COUNT(`votes`.itemID) AS `votes`,
`item`.title, `item`.itemTypeID, `item`.
submitDate, `item`.deleted, `item`.ItemCat,
`item`.counter, `item`.userID, `users`.name,
TIMESTAMPDIFF(minute,`submitDate`,NOW()) AS 'timeMin' ,
`myItems`.userID as userIDFav, `myItems`.deleted as myDeleted
FROM (votes `votes` RIGHT OUTER JOIN item `item`
ON (`votes`.itemID = `item`.itemID))
INNER JOIN
users `users`
ON (`users`.userID = `item`.userID)
LEFT OUTER JOIN
myItems `myItems`
ON (`myItems`.itemID = `item`.itemID)
WHERE (`item`.deleted = 0)
GROUP BY `item`.itemID,
`votes`.itemID,
`item`.title,
`item`.itemTypeID,
`item`.submitDate,
`item`.deleted,
`item`.ItemCat,
`item`.counter,
`item`.userID,
`users`.name,
`myItems`.deleted,
`myItems`.userID
ORDER BY `item`.itemID DESC) as myTable
where myTable.userIDFav = 3 or myTable.userIDFav is null
limit 0, 20
I'm using MySQL
Thanks
What does the analyzer say for this query? Without knowledge about how many rows there are in the table you cant tell any optimization. So run the analyzer and you'll see what parts costs what.
Of course, as #theomega said, look at the execution plan.
But I'd also suggest to try and "clean up" your statement. (I don't know which one is faster - that depends on your table sizes.) Usually, I'd try to start with a clean statement and start optimizing from there. But typically, a clean statement makes it easier for the optimizer to come up with a good execution plan.
So here are some observations about your statement that might make things slow:
a couple of outer joins (makes it hard for the optimzer to figure out an index to use)
a group by
a lot of columns to group by
As far as I understand your SQL, this statement should do most of what yours is doing:
SELECT `item`.itemID, `item`.title, `item`.itemTypeID, `item`.
submitDate, `item`.deleted, `item`.ItemCat,
`item`.counter, `item`.userID, `users`.name,
TIMESTAMPDIFF(minute,`submitDate`,NOW()) AS 'timeMin'
FROM (item `item` INNER JOIN users `users`
ON (`users`.userID = `item`.userID)
WHERE
Of course, this misses the info from the tables you outer joined, I'd suggest to try to add the required columns via a subselect:
SELECT `item`.itemID,
(SELECT count (itemID)
FROM votes v
WHERE v.itemID = 'item'.itemID) as 'votes', <etc.>
This way, you can get rid of one outer join and the group by. The outer join is replaced by the subselect, so there is a trade-off which may be bad for the "cleaner" statement.
Depending on the cardinality between item and myItems, you can do the same or you'd have to stick with the outer join (but no need to reintroduce the group by).
Hope this helps.
Some quick semi-random thoughts:
Are your itemID and userID columns indexed?
What happens if you add "EXPLAIN " to the start of the query and run it? Does it use indexes? Are they sensible?
DO you need to run the whole inner query and filter on it, or could you put move the where myTable.userIDFav = 3 or myTable.userIDFav is null part into the inner query?
You do seem to have too many fields in the Group By list, since one of them is itemID, I suspect that you could use an inner SELECT to preform the grouping and an outer SELECT to return the set of fields desired.
Can't you add the where clause myTable.userIDFav = 3 or myTable.userIDFav is null to WHERE (item.deleted = 0)?
Regards
Lieven
Look at the way your query is built. You join a lot of stuff, then limit the output to 20 rows. You should have the outer join on items and myitems, since your conditions only apply to these two tables, limit the output to the first 20 rows, then join and aggregate. Here you are performing a lot of work that is going to be discarded.