Natural Join Explanation - sql

I am a student trying to learn Microsoft SQL and I am frustrated that I cannot understand how natural join works. I have a problem with a solution but I cannot understand how this solution was created through natural join. My friend sent my his old solution but cant remember how he got the result. I really want to figure out how this natural join works. Can someone please explain how this answer was achieved?
The question taken from "Database Design, Application, and Administration 6th Ed.:
Show the result of a natural join that combines the Customer and OrderTbl tables.
Edit: The book doesn't give a query statement on how to get the results for the natural join in the chapters I have read so far. Which makes things much more confusing for me as I cannot simply add the table and send a query.
Edit2: The reason why some entries in the order table have null values is because the question is implying that some orders were processed through the internet. A friend of mine told me that was the reason why he got the question right opposed to some people who argued against it. :S

If I read your question correctly, you are trying to learn two concepts at the same time. The first is INNER JOIN (sometimes called equijoin). The second is natural join.
It's easier to explain natural join, assuming you already know how INNER JOIN works. A natural join is simply an equijoin where the column names indicate to you what the join condition ought to be. In your case, the fact that CustNo appears in both tables is the only clue you need in order to devise the correct join condition. You also include the join field only once in the result.
Column names are actually quite arbitrary, and could have been made very different in this case. for example, if the column Customer.CustNo had been named Customer.ID instead, you wouldn't be able to do a natural join.
for a correct solution in your case, see the answer provided by JamieC.

If you simply want the query which will result in the final table in your question here it is,
SELECT
o.OrdNo,
o.OrdDate,
o.EmpNo,
o.CustNo,
c.CustFirstName
c.CustLastName,
c.CustCity,
c.CustSatate,
c.CustZip,
c.CustBal
FROM OrderTbl o
INNER JOIN Customer c
ON o.CustNo = c.CustNo
ORDER BY c.CustNo
So, by way of explanation; this query selects all data from Customer and OrderTbl joining the two using CustNo which is the primary key (presumably) in Customer and a foreign key in OrderTbl. The ordering of the result is a little more tricky, and based almost purely on guesswork, I suspect the result is ordered by CustNo as well.
The Employee table does not feature at all in the result, however as the OrderTbl table has some blanks for EmpNo, you would almost certainly want a LEFT JOIN/RIGHT JOIN (as appropriate) if you wanted to retrieve any information about the employee from the orders table.

MS SQL Server doesn't support NATURAL JOIN. However, if you were using a platform that would support it, a simple:
SELECT * FROM Customer NATURAL JOIN OrderTbl;
should do the trick.
https://en.wikipedia.org/wiki/Join_(SQL)#Natural_join is quite good.

Related

Using "From Multiple Tables" or "Join" performance difference [duplicate]

Most SQL dialects accept both the following queries:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x = b.x
SELECT a.foo, b.foo
FROM a
LEFT JOIN b ON a.x = b.x
Now obviously when you need an outer join, the second syntax is required. But when doing an inner join why should I prefer the second syntax to the first (or vice versa)?
The old syntax, with just listing the tables, and using the WHERE clause to specify the join criteria, is being deprecated in most modern databases.
It's not just for show, the old syntax has the possibility of being ambiguous when you use both INNER and OUTER joins in the same query.
Let me give you an example.
Let's suppose you have 3 tables in your system:
Company
Department
Employee
Each table contain numerous rows, linked together. You got multiple companies, and each company can have multiple departments, and each department can have multiple employees.
Ok, so now you want to do the following:
List all the companies, and include all their departments, and all their employees. Note that some companies don't have any departments yet, but make sure you include them as well. Make sure you only retrieve departments that have employees, but always list all companies.
So you do this:
SELECT * -- for simplicity
FROM Company, Department, Employee
WHERE Company.ID *= Department.CompanyID
AND Department.ID = Employee.DepartmentID
Note that the last one there is an inner join, in order to fulfill the criteria that you only want departments with people.
Ok, so what happens now. Well, the problem is, it depends on the database engine, the query optimizer, indexes, and table statistics. Let me explain.
If the query optimizer determines that the way to do this is to first take a company, then find the departments, and then do an inner join with employees, you're not going to get any companies that don't have departments.
The reason for this is that the WHERE clause determines which rows end up in the final result, not individual parts of the rows.
And in this case, due to the left join, the Department.ID column will be NULL, and thus when it comes to the INNER JOIN to Employee, there's no way to fulfill that constraint for the Employee row, and so it won't appear.
On the other hand, if the query optimizer decides to tackle the department-employee join first, and then do a left join with the companies, you will see them.
So the old syntax is ambiguous. There's no way to specify what you want, without dealing with query hints, and some databases have no way at all.
Enter the new syntax, with this you can choose.
For instance, if you want all companies, as the problem description stated, this is what you would write:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID
Here you specify that you want the department-employee join to be done as one join, and then left join the results of that with the companies.
Additionally, let's say you only want departments that contains the letter X in their name. Again, with old style joins, you risk losing the company as well, if it doesn't have any departments with an X in its name, but with the new syntax, you can do this:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID AND Department.Name LIKE '%X%'
This extra clause is used for the joining, but is not a filter for the entire row. So the row might appear with company information, but might have NULLs in all the department and employee columns for that row, because there is no department with an X in its name for that company. This is hard with the old syntax.
This is why, amongst other vendors, Microsoft has deprecated the old outer join syntax, but not the old inner join syntax, since SQL Server 2005 and upwards. The only way to talk to a database running on Microsoft SQL Server 2005 or 2008, using the old style outer join syntax, is to set that database in 8.0 compatibility mode (aka SQL Server 2000).
Additionally, the old way, by throwing a bunch of tables at the query optimizer, with a bunch of WHERE clauses, was akin to saying "here you are, do the best you can". With the new syntax, the query optimizer has less work to do in order to figure out what parts goes together.
So there you have it.
LEFT and INNER JOIN is the wave of the future.
The JOIN syntax keeps conditions near the table they apply to. This is especially useful when you join a large amount of tables.
By the way, you can do an outer join with the first syntax too:
WHERE a.x = b.x(+)
Or
WHERE a.x *= b.x
Or
WHERE a.x = b.x or a.x not in (select x from b)
Basically, when your FROM clause lists tables like so:
SELECT * FROM
tableA, tableB, tableC
the result is a cross product of all the rows in tables A, B, C. Then you apply the restriction WHERE tableA.id = tableB.a_id which will throw away a huge number of rows, then further ... AND tableB.id = tableC.b_id and you should then get only those rows you are really interested in.
DBMSs know how to optimise this SQL so that the performance difference to writing this using JOINs is negligible (if any). Using the JOIN notation makes the SQL statement more readable (IMHO, not using joins turns the statement into a mess). Using the cross product, you need to provide join criteria in the WHERE clause, and that's the problem with the notation. You are crowding your WHERE clause with stuff like
tableA.id = tableB.a_id
AND tableB.id = tableC.b_id
which is only used to restrict the cross product. WHERE clause should only contain RESTRICTIONS to the resultset. If you mix table join criteria with resultset restrictions, you (and others) will find your query harder to read. You should definitely use JOINs and keep the FROM clause a FROM clause, and the WHERE clause a WHERE clause.
The first way is the older standard. The second method was introduced in SQL-92, http://en.wikipedia.org/wiki/SQL. The complete standard can be viewed at http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt .
It took many years before database companies adopted the SQL-92 standard.
So the reason why the second method is preferred, it is the SQL standard according the ANSI and ISO standards committee.
The second is preferred because it is far less likely to result in an accidental cross join by forgetting to put inthe where clause. A join with no on clause will fail the syntax check, an old style join with no where clause will not fail, it will do a cross join.
Additionally when you later have to a left join, it is helpful for maintenance that they all be in the same structure. And the old syntax has been out of date since 1992, it is well past time to stop using it.
Plus I have found that many people who exclusively use the first syntax don't really understand joins and understanding joins is critical to getting correct results when querying.
I think there are some good reasons on this page to adopt the second method -using explicit JOINs. The clincher though is that when the JOIN criteria are removed from the WHERE clause it becomes much easier to see the remaining selection criteria in the WHERE clause.
In really complex SELECT statements it becomes much easier for a reader to understand what is going on.
The SELECT * FROM table1, table2, ... syntax is ok for a couple of tables, but it becomes exponentially (not necessarily a mathematically accurate statement) harder and harder to read as the number of tables increases.
The JOIN syntax is harder to write (at the beginning), but it makes it explicit what criteria affects which tables. This makes it much harder to make a mistake.
Also, if all the joins are INNER, then both versions are equivalent. However, the moment you have an OUTER join anywhere in the statement, things get much more complicated and it's virtually guarantee that what you write won't be querying what you think you wrote.
When you need an outer join the second syntax is not always required:
Oracle:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x = b.x(+)
MSSQLServer (although it's been deprecated in 2000 version)/Sybase:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x *= b.x
But returning to your question. I don't know the answer, but it is probably related to the fact that a join is more natural (syntactically, at least) than adding an expression to a where clause when you are doing exactly that: joining.
I hear a lot of people complain the first one is too difficult to understand and that it is unclear. I don't see a problem with it, but after having that discussion, I use the second one even on INNER JOINS for clarity.
To the database, they end up being the same. For you, though, you'll have to use that second syntax in some situations. For the sake of editing queries that end up having to use it (finding out you needed a left join where you had a straight join), and for consistency, I'd pattern only on the 2nd method. It'll make reading queries easier.
Well the first and second queries may yield different results because a LEFT JOIN includes all records from the first table, even if there are no corresponding records in the right table.
If both are inner joins there is no difference in the semantics or the execution of the SQL or performance. Both are ANSI Standard SQL It is purely a matter of preference, of coding standards within your work group.
Over the last 25 years, I've developed the habit that if I have a fairly complicated SQL I will use the INNER JOIN syntax because it is easier for the reader to pick out the structure of the query at a glance. It also gives more clarity by singling out the join conditions from the residual conditions, which can save time (and mistakes) if you ever come back to your query months later.
However for outer joins, for the purpose of clarity I would not under any circumstances use non-ansi extensions.

Do subselects do an implicit join?

I have a sql query that seems to work but I dont really understand why. Therefore I would very much appreciate if someone could help explain whats going on:
THE QUERY RETURNS: All organisations that dont have any comments that were not created by the consultant who created the organisation record.
SELECT \"organisations\".*
FROM \"organisations\"
WHERE \"organisations\".\"id\" NOT IN
(SELECT \"comments\".\"commentable_id\"
FROM \"comments\"
WHERE \"comments\".\"commentable_type\" = 'Organisation'
AND (comments.author_id != organisations.consultant_id)
ORDER BY \"comments\".\"created_at\" ASC
)
It seems to do so correctly.
The part I dont understand is why (comments.author_id != organisations.consultant_id) is working!? I dont understand how postgres even knows what "organisations" is inside that subselect? It is not defined in here.
If this was written as a join where I had joined comments to organisations then I would totally understand how you could do something like this but in this case its a subselect. How does it know how to map the comments and organisations table and exclude the ones where (comments.author_id != organisations.consultant_id)
That subselect happens in a row so it can see all columns of that row. You will probably get better performance with this
select organisations.*
from organisations
where not exists (
select 1
from comments
where
commentable_type = 'organisation' and
author_id != organisations.consultant_id
)
Notice that it is not necessary to qualify commentable_type since the one in comments has priority over any other outside the subselect. And if comments does not have a consultant_id column then it would be possible to take its qualifier out, although not recommended for better legibility.
The order by in your query buys you nothing, just added cost.
You are running a correlated subquery. http://technet.microsoft.com/en-us/library/ms187638(v=sql.105).aspx
This is commonly used in all databases. A subquery in the WHERE clause can refer to tables used in the parent query, and they often do.
That being said, your current query could likely be written better.
Here is one way, using an outer join with comments, where no matches are found based on your criteria -
select o.*
from organizations o
left join comments c
on c.commentable_type <> 'Organisation'
and c.author_id = o.consultant_id
where c.commentable_id is null

SQL and implicit join

w3schools.com has the following SQL statement example:
SELECT
Product_Orders.OrderID, Persons.LastName, Persons.FirstName
FROM
Persons,
Product_Orders
WHERE
Persons.LastName='Hansen' AND Persons.FirstName='Ola'
What is not clear to me is how the product order table is joined with the persons table. Is this SQL proper, and what does it imply about the result set sent back?
It's technically proper, though I would avoid using that type of syntax for a couple reasons which I won't go into right here.
That syntax is colloquially called the "old style" join syntax and is equivalent to:
SELECT Product_Orders.OrderID, Persons.LastName, Persons.FirstName
FROM Persons
CROSS JOIN Product_Orders
WHERE Persons.LastName='Hansen' AND Persons.FirstName='Ola'
The answer of how they "join" together is that they don't. The result of the query is the cross product of the entire Product_Orders table with the "Ola Hanson" row(s) in the Persons table. It doesn't make much sense in most of the use cases I can think of, to be quite honest with you.
The really strange thing about the results you'll get here is that the OrderID from the Product_Orders table will not necessarily align with the person on the order. It looks to be a mistake of omission on the page -- though to be fair the example intends to demonstrate aliasing, not joins.
W3Schools doesn't have a good reputation around here.
Firs of all you have to understand how Joins works. From you query:
A cross product is done between the two tables supplied in the SQL statement, you can see this from this FROM Persons, Product_Orders.
For every record in Persons table is joined to each record in the Product_Orders table. This will give you a huge temp table if both tables are big.
Then your filter i.e. the qualifying condition will be applied to this temp table to give you your result.
I hope this helps.

INNER JOIN vs multiple table names in "FROM" [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
INNER JOIN versus WHERE clause — any difference?
What is the difference between an INNER JOIN query and an implicit join query (i.e. listing multiple tables after the FROM keyword)?
For example, given the following two tables:
CREATE TABLE Statuses(
id INT PRIMARY KEY,
description VARCHAR(50)
);
INSERT INTO Statuses VALUES (1, 'status');
CREATE TABLE Documents(
id INT PRIMARY KEY,
statusId INT REFERENCES Statuses(id)
);
INSERT INTO Documents VALUES (9, 1);
What is the difference between the below two SQL queries?
From the testing I've done, they return the same result. Do they do the same thing? Are there situations where they will return different result sets?
-- Using implicit join (listing multiple tables)
SELECT s.description
FROM Documents d, Statuses s
WHERE d.statusId = s.id
AND d.id = 9;
-- Using INNER JOIN
SELECT s.description
FROM Documents d
INNER JOIN Statuses s ON d.statusId = s.id
WHERE d.id = 9;
There is no reason to ever use an implicit join (the one with the commas). Yes for inner joins it will return the same results. However, it is subject to inadvertent cross joins especially in complex queries and it is harder for maintenance because the left/right outer join syntax (deprecated in SQL Server, where it doesn't work correctly right now anyway) differs from vendor to vendor. Since you shouldn't mix implicit and explict joins in the same query (you can get wrong results), needing to change something to a left join means rewriting the entire query.
If you do it the first way, people under the age of 30 will probably chuckle at you, but as long as you're doing an inner join, they produce the same result and the optimizer will generate the same execution plan (at least as far as I've ever been able to tell).
This does of course presume that the where clause in the first query is how you would be joining in the second query.
This will probably get closed as a duplicate, btw.
The nice part of the second method is that it helps separates the join condition (on ...) from the filter condition (where ...). This can help make the intent of the query more readable.
The join condition will typically be more descriptive of the structure of the database and the relation between the tables. e.g., the salary table is related to the employee table by the EmployeeID column, and queries involving those two tables will probably always join on that column.
The filter condition is more descriptive of the specific task being performed by the query. If the query is FindRichPeople, the where clause might be "where salaries.Salary > 1000000"... thats describing the task at hand, not the database structure.
Note that the SQL compiler doesn't see it that way... if it decides that it will be faster to cross join and then filter the results, it will cross join and filter the results. It doesn't care what is in the ON clause and whats in the WHERE clause. But, that typically wont happen if the on clause matches a foreign key or joins to a primary key or indexed column. As far as operating correctly, they are identical; as far as writing readable, maintainable code, the second way is probably a little better.
there is no difference as far as I know is the second one with the inner join the new way to write such statements and the first one the old method.
The first one does a Cartesian product on all record within those two tables then filters by the where clause.
The second only joins on records that meet the requirements of your ON clause.
EDIT: As others have indicated, the optimization engine will take care of an attempt on a Cartesian product and will result in the same query more or less.
A bit same. Can help you out.
Left join vs multiple tables in SQL (a)
Left join vs multiple tables in SQL (b)
In the example you've given, the queries are equivalent; if you're using SQL Server, run the query and display the actual exection plan to see what the server's doing internally.

What's the optimal solution for tag/keyword matching?

I'm looking for the optimal solution for keyword matching between different records in the database. It's a classic problem, I've found similar questions, but nothing concretely.
I've done it with full text searches, joins and subqueries, temp tables, ... so i'd really like to see how you guys are solving such a common problem.
So, let's say I have two tables; Products and Keywords and they are linked with the third table, Products_Keywords in a classic many-to-many relationship.
If I show one Product record on the page and would like to show top n related products, what would be the best option?
We should take into account that records might share several keywords and this fact should determine the ordering of the top related product.
I'm open for other ideas as well, but T-SQL would be preferable solution due to the performance reasons.
My first shot would be something like:
SELECT
P.product_id,
COUNT(*)
FROM
Product_Keywords PK1
INNER JOIN Product_Keywords PK2 ON
PK2.keyword_id = PK1.keyword_id
INNER JOIN Products P ON
P.product_id = PK.product_id
WHERE
PK1.product_id = #product_id
GROUP BY
P.product_id
ORDER BY
COUNT(*) DESC
The join of Product_Keywords to Product_Keywords (PK2 to PK1) might be rough, so I can't speak to performance. This is where I would start though and then look at optimization.
One thing to consider, as a follow-up to Assaf's comment, is that you could add a "weight" to the Product_Keywords and SUM(PK1.weight) + SUM(PK2.weight) for ranking. Just a thought.
EDIT: To elaborate on the weighting... you may decide that you want to allow keywords to be weighted. The actual method used to determine the weighting would be a business decision though, so I can't really give you too much guidance there.
As an example though, this question is about "programming", "keyword matching", and "SQL". Programming is pretty generic, so if two questions had that in common it still might not mean that they are that related so maybe you only weight it as 1. SQL is a little more specific, so that you might weight as a 5. Keyword matching is both the main focus of the question AND it's pretty specific, so you might weight that with a 10.
This is just an example of course and as I said, the exact determination of the weights as well as how you score it are dependent on the specific business. You might decide that matching the number of keywords is more important than the weights so maybe the weighting is only used as a tie-breaker, etc. HTH.
Well maybe something like the follwing:
select p.productId, p.name, r.rank
from products p inner join (
/* this inner select should bring in only products that have at least one keyword
=> shared with the requested product, and will count the actual number shared (for ranking)*/
select related.productId, count(related.productId) as rank
from
products_keywords related inner join
products_keywords pk ON (pk.productId = #productId AND related.keywordId = pk.keywordId)
where related.productId <> #productId
group by related.productId
) r on p.productId = r.productId
order by r.rank DESC /* added DESC (not in orignal solution, but needed to put higher ranked on top)*/
Now I seriously doubt that's an optimal sql statement, but it should get the job done. I can't verify it though since I just wrote it from scratch with no actual backing tables, or data to test against.