Subqueries vs Inner joins - Which one executes faster? - sql

I hope it's ok to make a posting like this.
I have been using SQL for quite some time and people at work have been using 2 different ways to return the same number or rows in a database.
For example:
SELECT Name
FROM
Employees
WHERE
DepartmentID IN (SELECT DepartmentID
FROM
Departments
WHERE
Department LIKE '%Engineering')
SELECT Employees.Name
FROM
Departments
INNER JOIN Employees
ON Departments.DepartmentID = Employees.DepartmentID
WHERE
Departments.Department LIKE '%Engineering'
Both return the same data. People have been telling me that using subqueries is the best way to do it.
My question is this:
Which of these 2 will execute faster? My guess would be the one with the inner join but I may be wrong.
Thanks.

In this case, both are equivalent. However, for the one with the INNER JOIN, if there is more than 1 department selected with the same DepartmentID, employees with that DepartmentID will be returned multiple times.
SQL is a declarative language, which means that the language is not supposed to say how the query should be performed, only what result should be found. It is up to the DMBS to work out how to perform it.
A decent SQL database will probably optimize them to both do the same or similar things.
To check if they are doing the same thing, run EXPLAIN on the query.
If they have the same steps, obviously they will take the same amount of time to run. Otherwise, you will see what ways the database is treating the differently. How the DBMS optimizes it is implementation dependent. So the best way is to use EXPLAIN.
Note: EXPLAIN is an SQL command, just like SELECT etc. See http://www.sql.org/sql-database/postgresql/manual/sql-explain.html.
Assuming it executes with the same steps, the way you write it will be the way you think appears clearest what the intention is.

These two queries are not equivalent. The equivalent to the "IN" query is the following:
SELECT Name
FROM Employees e join
(SELECT distinct DepartmentID
FROM Departments
WHERE Department LIKE '%Engineering'
) d
on e.DepartmentID = d.DepartmentId
In other words, the "IN" version can never return duplicate rows. The join version can return duplicate rows. In this case, I don't think it would because departmentid is probably unique in the table.
The next question is about optimization. SQL engines really should optimize the two versions equivalently. That means that the "IN" version is easier to read.
However, there is one big exception to this. MySQL is definitely suboptimal when it comes to IN. When using that database engine, it is much better to use the join version.
By the way, there are other ways to do this, such as:
select name
from employees e
where exists (select 1 from Departments d
where e.Departmentid = d.DepartmentId and
d.Department LIKE '%Engineering'
)
Which works best . . . you need to measure on whatever database you are using.

The query optimizers in most modern RDBMS will probably internally work with them the same way, so there won't be a performance difference.
Older RDBMS might probably show differences, but this depends on the RDBMS itself.

Related

SQL Sub Query within a join

Is there a difference between the results of the two sets of code below?
If there isn't, I don't understand why my teachers keep teaching sub queries. When would they be useful in basic SQL commands?
Select soh.Total, c.*
From SalesLT.Customer As c
Inner join (select oh.CustomerID Sum(oh.TotalDue) As Total
From SalesLT.SalesOrderHeader As oh Group by oh.CustomerID
Having Sum(oh.totaldue) > 90000) As soh on c.CustomerID = soh.CustomerID
VS
Select A.*, C.*
From Sales as A inner join Customer as C on A.customerID=C.customerID
Group by A.CustomerID
Having Sum(C.totaldue) > 90000
Is there a difference? Well, obviously. The two are constructed differently.
Do they produce the same result? Obviously not. In fact, the second one will produce an error in almost all databases, because the columns from A are not aggregated.
In addition, the number of columns is likely to differ between the two queries, unless Customer has exactly two columns.
I would suggest that you study SQL a bit harder. If your teachers are suggesting that you need to understand subqueries, then that is probably because they are an important part of the language.
Homework: Write a reasonable second query that doesn't use subqueries.
Sub queries always took more time in term of performance and return results.
Where as inner joins provide faster way to fetch results and process queries.
So this is always good to user inner joins and avoid sub queries as much as possible, it effect execution time. To test more, try to add Execution Plan before running query in query panel.
This will show you difference of results and time took to execute.

Using "From Multiple Tables" or "Join" performance difference [duplicate]

Most SQL dialects accept both the following queries:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x = b.x
SELECT a.foo, b.foo
FROM a
LEFT JOIN b ON a.x = b.x
Now obviously when you need an outer join, the second syntax is required. But when doing an inner join why should I prefer the second syntax to the first (or vice versa)?
The old syntax, with just listing the tables, and using the WHERE clause to specify the join criteria, is being deprecated in most modern databases.
It's not just for show, the old syntax has the possibility of being ambiguous when you use both INNER and OUTER joins in the same query.
Let me give you an example.
Let's suppose you have 3 tables in your system:
Company
Department
Employee
Each table contain numerous rows, linked together. You got multiple companies, and each company can have multiple departments, and each department can have multiple employees.
Ok, so now you want to do the following:
List all the companies, and include all their departments, and all their employees. Note that some companies don't have any departments yet, but make sure you include them as well. Make sure you only retrieve departments that have employees, but always list all companies.
So you do this:
SELECT * -- for simplicity
FROM Company, Department, Employee
WHERE Company.ID *= Department.CompanyID
AND Department.ID = Employee.DepartmentID
Note that the last one there is an inner join, in order to fulfill the criteria that you only want departments with people.
Ok, so what happens now. Well, the problem is, it depends on the database engine, the query optimizer, indexes, and table statistics. Let me explain.
If the query optimizer determines that the way to do this is to first take a company, then find the departments, and then do an inner join with employees, you're not going to get any companies that don't have departments.
The reason for this is that the WHERE clause determines which rows end up in the final result, not individual parts of the rows.
And in this case, due to the left join, the Department.ID column will be NULL, and thus when it comes to the INNER JOIN to Employee, there's no way to fulfill that constraint for the Employee row, and so it won't appear.
On the other hand, if the query optimizer decides to tackle the department-employee join first, and then do a left join with the companies, you will see them.
So the old syntax is ambiguous. There's no way to specify what you want, without dealing with query hints, and some databases have no way at all.
Enter the new syntax, with this you can choose.
For instance, if you want all companies, as the problem description stated, this is what you would write:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID
Here you specify that you want the department-employee join to be done as one join, and then left join the results of that with the companies.
Additionally, let's say you only want departments that contains the letter X in their name. Again, with old style joins, you risk losing the company as well, if it doesn't have any departments with an X in its name, but with the new syntax, you can do this:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID AND Department.Name LIKE '%X%'
This extra clause is used for the joining, but is not a filter for the entire row. So the row might appear with company information, but might have NULLs in all the department and employee columns for that row, because there is no department with an X in its name for that company. This is hard with the old syntax.
This is why, amongst other vendors, Microsoft has deprecated the old outer join syntax, but not the old inner join syntax, since SQL Server 2005 and upwards. The only way to talk to a database running on Microsoft SQL Server 2005 or 2008, using the old style outer join syntax, is to set that database in 8.0 compatibility mode (aka SQL Server 2000).
Additionally, the old way, by throwing a bunch of tables at the query optimizer, with a bunch of WHERE clauses, was akin to saying "here you are, do the best you can". With the new syntax, the query optimizer has less work to do in order to figure out what parts goes together.
So there you have it.
LEFT and INNER JOIN is the wave of the future.
The JOIN syntax keeps conditions near the table they apply to. This is especially useful when you join a large amount of tables.
By the way, you can do an outer join with the first syntax too:
WHERE a.x = b.x(+)
Or
WHERE a.x *= b.x
Or
WHERE a.x = b.x or a.x not in (select x from b)
Basically, when your FROM clause lists tables like so:
SELECT * FROM
tableA, tableB, tableC
the result is a cross product of all the rows in tables A, B, C. Then you apply the restriction WHERE tableA.id = tableB.a_id which will throw away a huge number of rows, then further ... AND tableB.id = tableC.b_id and you should then get only those rows you are really interested in.
DBMSs know how to optimise this SQL so that the performance difference to writing this using JOINs is negligible (if any). Using the JOIN notation makes the SQL statement more readable (IMHO, not using joins turns the statement into a mess). Using the cross product, you need to provide join criteria in the WHERE clause, and that's the problem with the notation. You are crowding your WHERE clause with stuff like
tableA.id = tableB.a_id
AND tableB.id = tableC.b_id
which is only used to restrict the cross product. WHERE clause should only contain RESTRICTIONS to the resultset. If you mix table join criteria with resultset restrictions, you (and others) will find your query harder to read. You should definitely use JOINs and keep the FROM clause a FROM clause, and the WHERE clause a WHERE clause.
The first way is the older standard. The second method was introduced in SQL-92, http://en.wikipedia.org/wiki/SQL. The complete standard can be viewed at http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt .
It took many years before database companies adopted the SQL-92 standard.
So the reason why the second method is preferred, it is the SQL standard according the ANSI and ISO standards committee.
The second is preferred because it is far less likely to result in an accidental cross join by forgetting to put inthe where clause. A join with no on clause will fail the syntax check, an old style join with no where clause will not fail, it will do a cross join.
Additionally when you later have to a left join, it is helpful for maintenance that they all be in the same structure. And the old syntax has been out of date since 1992, it is well past time to stop using it.
Plus I have found that many people who exclusively use the first syntax don't really understand joins and understanding joins is critical to getting correct results when querying.
I think there are some good reasons on this page to adopt the second method -using explicit JOINs. The clincher though is that when the JOIN criteria are removed from the WHERE clause it becomes much easier to see the remaining selection criteria in the WHERE clause.
In really complex SELECT statements it becomes much easier for a reader to understand what is going on.
The SELECT * FROM table1, table2, ... syntax is ok for a couple of tables, but it becomes exponentially (not necessarily a mathematically accurate statement) harder and harder to read as the number of tables increases.
The JOIN syntax is harder to write (at the beginning), but it makes it explicit what criteria affects which tables. This makes it much harder to make a mistake.
Also, if all the joins are INNER, then both versions are equivalent. However, the moment you have an OUTER join anywhere in the statement, things get much more complicated and it's virtually guarantee that what you write won't be querying what you think you wrote.
When you need an outer join the second syntax is not always required:
Oracle:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x = b.x(+)
MSSQLServer (although it's been deprecated in 2000 version)/Sybase:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x *= b.x
But returning to your question. I don't know the answer, but it is probably related to the fact that a join is more natural (syntactically, at least) than adding an expression to a where clause when you are doing exactly that: joining.
I hear a lot of people complain the first one is too difficult to understand and that it is unclear. I don't see a problem with it, but after having that discussion, I use the second one even on INNER JOINS for clarity.
To the database, they end up being the same. For you, though, you'll have to use that second syntax in some situations. For the sake of editing queries that end up having to use it (finding out you needed a left join where you had a straight join), and for consistency, I'd pattern only on the 2nd method. It'll make reading queries easier.
Well the first and second queries may yield different results because a LEFT JOIN includes all records from the first table, even if there are no corresponding records in the right table.
If both are inner joins there is no difference in the semantics or the execution of the SQL or performance. Both are ANSI Standard SQL It is purely a matter of preference, of coding standards within your work group.
Over the last 25 years, I've developed the habit that if I have a fairly complicated SQL I will use the INNER JOIN syntax because it is easier for the reader to pick out the structure of the query at a glance. It also gives more clarity by singling out the join conditions from the residual conditions, which can save time (and mistakes) if you ever come back to your query months later.
However for outer joins, for the purpose of clarity I would not under any circumstances use non-ansi extensions.

SQL Implicit Join joining 3 tables

I understand implicit JOINS are outdated but it's for a homework assignment. All his examples for implicit joins only join 2 tables, and I can't find an example anywhere that joins 3.
SELECT Name
FROM Employee
JOIN EmployeeSkills
ON (EmployeeID=E.ID)
JOIN Skill ON (SkillID=S.ID)
WHERE Title=’DBA’;
Here is the explicit version of what I want. How would I write this implicitly?
Thanks
Here's how I'd write it:
SELECT E.Name FROM Employee E,EmployeeSkills ES,Skill S
WHERE E.ID = ES.EmployeeID
AND ES.SkillID = S.ID
AND S.Title=’DBA’;
Pretty much the same answer as the first you received, but, for clarity, once I'd start using aliases I'd use them throughout, and make sure you define them. In the above example, both the employee and the skill could have a Name, and the Title could be an employee's title or the name of a skill. Using the table name (or alias) makes it easy to see what's going on even if you're not familiar with the schema.
Also, might have been the database I was using years ago, but it had a performance hit (very slight, but big difference on big data) if you didn't write the joins in the order you introduced the tables.
It's pretty much the same as the two table examples:
SELECT a.Name
FROM Employee a,EmployeeSkills b ,Skill c
WHERE a.EmployeeID = b.ID
AND b.SkillID = c.ID
AND Title=’DBA’;
Edit: Good point made by Everett Warren to alias, best practice is not using implicit joins as they were deprecated long ago.

Queries within queries: Is there a better way?

As I build bigger, more advanced web applications, I'm finding myself writing extremely long and complex queries. I tend to write queries within queries a lot because I feel making one call to the database from PHP is better than making several and correlating the data.
However, anyone who knows anything about SQL knows about JOINs. Personally, I've used a JOIN or two before, but quickly stopped when I discovered using subqueries because it felt easier and quicker for me to write and maintain.
Commonly, I'll do subqueries that may contain one or more subqueries from relative tables.
Consider this example:
SELECT
(SELECT username FROM users WHERE records.user_id = user_id) AS username,
(SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
in_timestamp,
out_timestamp
FROM records
ORDER BY in_timestamp
Rarely, I'll do subqueries after the WHERE clause.
Consider this example:
SELECT
user_id,
(SELECT name FROM organizations WHERE (SELECT organization FROM locations WHERE records.location = location_id) = organization_id) AS organization_name
FROM records
ORDER BY in_timestamp
In these two cases, would I see any sort of improvement if I decided to rewrite the queries using a JOIN?
As more of a blanket question, what are the advantages/disadvantages of using subqueries or a JOIN? Is one way more correct or accepted than the other?
In simple cases, the query optimiser should be able to produce identical plans for a simple join versus a simple sub-select.
But in general (and where appropriate), you should favour joins over sub-selects.
Plus, you should avoid correlated subqueries (a query in which the inner expression refer to the outer), as they are effectively a for loop within a for loop). In most cases a correlated subquery can be written as a join.
JOINs are preferable to separate [sub]queries.
If the subselect (AKA subquery) is not correlated to the outer query, it's very likely the optimizer will scan the table(s) in the subselect once because the value isn't likely to change. When you have correlation, like in the example provided, the likelihood of single pass optimization becomes very unlikely. In the past, it's been believed that correlated subqueries execute, RBAR -- Row By Agonizing Row. With a JOIN, the same result can be achieved while ensuring a single pass over the table.
This is a proper re-write of the query provided:
SELECT u.username,
u.last_name||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
LEFT JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
...because the subselect can return NULL if the user_id doesn't exist in the USERS table. Otherwise, you could use an INNER JOIN:
SELECT u.username,
u.last_name ||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
Derived tables/inline views are also possible using JOIN syntax.
a) I'd start by pointing out that the two are not necessarily interchangable. Nesting as you have requires there to be 0 or 1 matching value otherwise you will get an error. A join puts no such requirement and may exclude the record or introduce more depending on your data and type of join.
b) In terms of performance, you will need to check the query plans but your nested examples are unlikely to be more efficient than a table join. Typically sub-queries are executed once per row but that very much depends on your database, unique constraints, foriegn keys, not null etc. Maybe the DB can rewrite more efficiently but joins can use a wider variety of techniques, drive the data from different tables etc because they do different things (though you may not observe any difference in your output depending on your data).
c) Most DB aware programmers I know would look at your nested queries and rewrite using joins, subject to the data being suitably 'clean'.
d) Regarding "correctness" - I would favour joins backed up with proper constraints on your data where necessary (e.g. a unique user ID). You as a human may make certain assumptions but the DB engine cannot unless you tell it. The more it knows, the better job it (and you) can do.
Joins in most cases will be much more faster.
Lets take this with an example.
Lets use your first query:
SELECT
(SELECT username FROM users WHERE records.user_id = user_id) AS username,
(SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
in_timestamp,
out_timestamp
FROM records
ORDER BY in_timestamp
Now consider we have 100 records in records and 100 records in user.(Assuming we dont have index on user_id)
So if we understand your algorithm it says:
For each record
Scan all 100 records in users to find out username
Scan all 100 records in users to find out last name and first name
So its like we scanned users table 100*100*2 time. Is it really worth. If we consider index on user_id it will make thing better, but is it still worth.
Now consider a join (nested loop will almost produce same result as above, but consider a hash join):
Its like.
Make a hash map of user.
For each record
Find a mapping record in Hashmap. Which will be certainly much more faster then looping and finding a record.
So clearly, joins should be favorable.
NOTE: Example used of 100 record may produce identical plan, but the idea is to analyze how it can effect the performance.

Avoiding a problematic nested query in MySQL

I have this SQL query which due to my own lack of knowledge and problem with mysql handling nested queries, is really slow to process. The query is...
SELECT DISTINCT PrintJobs.UserName
FROM PrintJobs
LEFT JOIN Printers
ON PrintJobs.PrinterName = Printers.PrinterName
WHERE Printers.PrinterGroup
IN (
SELECT DISTINCT Printers.PrinterGroup
FROM PrintJobs
LEFT JOIN Printers
ON PrintJobs.PrinterName = Printers.PrinterName
WHERE PrintJobs.UserName='<username/>'
);
I would like to avoid splitting this into two queries and inserting the values of the subquery into the main query progamatically.
This is probably not exactly what you are looking for however, i will contribute my 2 cents. First off you should show us your schema and exactly what you are trying to accomplish with that query. However from the looks of it you are not using numeric IDs in the table and are instead using varchar fields to join tables, this is not really a good idea performance wise. Also i am not sure why you are doing:
(select PrinterName, UserName
from PrintJobs) AS Table1
instead of just joining on PrintJobs? Similar stuff for this one:
(select
PrinterName,
PrinterGroup
from Printers) as Table1
Maybe i am just not seeing it right. I would recommend that you simplify the query as much as possible and try it. Also tell us what exactly you are hoping to accomplish with the query and give us some schema to work with.
Removed the bad query from the answer.
This query you have is pretty messed up, not sure if this will handle everything you need but simplifying like this kills all the nested queries and it way faster. You can also use the EXPLAIN command to know how mysql will fetch your query.
SELECT DISTINCT PrintJobs.UserName
FROM PrintJobs
LEFT JOIN Printers ON PrintJobs.PrinterName = Printers.PrinterName
AND Printers.Username = '<username/>'
;