Clarification on PostgreSQL update join statement? - sql

I can't seem to find any documentation that clearly elaborate how certain update-join statements work in PostgreSQL. Suppose there are three tables in a database: professors, classes, and classrooms. In the professors table, one of the attributes is a class_id, a foreign key referencing the classes table. In the classes table, there is a classroom_id, referencing the classrooms table. This is the command I'm interested in:
UPDATE classes c SET year = 2
FROM classes cl
JOIN professors on cl.class_id = professors.class_id
JOIN classrooms on cl.classroom_id = classrooms.classroom_id
WHERE cl.class_id = c.class_id
This seems to calculate X = inner join(inner join(classes, professors), classrooms) and updates year = 2 for every class that is included in X. Is this correct? Furthermore, I don't understand how the WHERE clause accomplishes this. Why does this work, and why don't I have to use the IN keyword to accomplish this task?
I would appreciate a simple and systematic explanation of what an UPDATE... FROM... JOIN... WHERE statement does in PostgreSQL. Thank you so much!

I don't get your data model, because it seems that a professor could teach more than one course.
That said, your query is
UPDATE classes c
SET year = 2
FROM classes c2 JOIN
professors p
ON c2.class_id = p.class_id JOIN
classrooms cr
ON c2.classroom_id = cr.classroom_id
WHERE c2.class_id = c.class_id;
The FROM clause is doing a JOIN, just as you would expect. You could check the results by using SELECT:
SELECT *
FROM . . . <the FROM clause here>
What else is happening? Well, the UPDATE is telling Postgres to update the table classes. Ignore the fact that classes is in the FROM clause (that is a different reference).
How does it know which records to update? Well, the WHERE clause is saying to update the rows that match the FROM. In this case, the JOINs are doing the filtering.
You can also express this logic with IN or EXISTS:
UPDATE classes c
SET year = 2
WHERE EXISTS (SELECT 1 FROM professors p WHERE c.class_id = p.class_id) AND
EXISTS (SELECT 1 FROM classrooms cr WHERE c.classroom_id = cr.classroom_id);
Personally I think the intention is clearer with this version.

I'm not sure this explanation is really systematic, but here goes.
When you UPDATE classes, it already knows what table you are dealing with. In practice I don't generally reference the table again after FROM but I'm sure there are cases where it might be preferable to do a join than use WHERE.
The WHERE clause in your example simply makes sure that both instances of the classes table are dealing with the same rows. Otherwise I guess it would be an outer join.
In your example, there seems no point to the other JOINS. Often there is, when you might want to constrain the update to certain professors, as in WHERE professors.professor='Einstein'.
Finally, as I mentioned, I don't usually use the updated table a second time, rather I'd use.
UPDATE classes c SET year = 2
FROM professors, classrooms
WHERE on c.class_id = professors.class_id
AND c.classroom_id = classrooms.classroom_id
Works fine for me, but there may be cases where the other syntax is preferable, especially if you want a left join or some such.

Related

Using "From Multiple Tables" or "Join" performance difference [duplicate]

Most SQL dialects accept both the following queries:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x = b.x
SELECT a.foo, b.foo
FROM a
LEFT JOIN b ON a.x = b.x
Now obviously when you need an outer join, the second syntax is required. But when doing an inner join why should I prefer the second syntax to the first (or vice versa)?
The old syntax, with just listing the tables, and using the WHERE clause to specify the join criteria, is being deprecated in most modern databases.
It's not just for show, the old syntax has the possibility of being ambiguous when you use both INNER and OUTER joins in the same query.
Let me give you an example.
Let's suppose you have 3 tables in your system:
Company
Department
Employee
Each table contain numerous rows, linked together. You got multiple companies, and each company can have multiple departments, and each department can have multiple employees.
Ok, so now you want to do the following:
List all the companies, and include all their departments, and all their employees. Note that some companies don't have any departments yet, but make sure you include them as well. Make sure you only retrieve departments that have employees, but always list all companies.
So you do this:
SELECT * -- for simplicity
FROM Company, Department, Employee
WHERE Company.ID *= Department.CompanyID
AND Department.ID = Employee.DepartmentID
Note that the last one there is an inner join, in order to fulfill the criteria that you only want departments with people.
Ok, so what happens now. Well, the problem is, it depends on the database engine, the query optimizer, indexes, and table statistics. Let me explain.
If the query optimizer determines that the way to do this is to first take a company, then find the departments, and then do an inner join with employees, you're not going to get any companies that don't have departments.
The reason for this is that the WHERE clause determines which rows end up in the final result, not individual parts of the rows.
And in this case, due to the left join, the Department.ID column will be NULL, and thus when it comes to the INNER JOIN to Employee, there's no way to fulfill that constraint for the Employee row, and so it won't appear.
On the other hand, if the query optimizer decides to tackle the department-employee join first, and then do a left join with the companies, you will see them.
So the old syntax is ambiguous. There's no way to specify what you want, without dealing with query hints, and some databases have no way at all.
Enter the new syntax, with this you can choose.
For instance, if you want all companies, as the problem description stated, this is what you would write:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID
Here you specify that you want the department-employee join to be done as one join, and then left join the results of that with the companies.
Additionally, let's say you only want departments that contains the letter X in their name. Again, with old style joins, you risk losing the company as well, if it doesn't have any departments with an X in its name, but with the new syntax, you can do this:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID AND Department.Name LIKE '%X%'
This extra clause is used for the joining, but is not a filter for the entire row. So the row might appear with company information, but might have NULLs in all the department and employee columns for that row, because there is no department with an X in its name for that company. This is hard with the old syntax.
This is why, amongst other vendors, Microsoft has deprecated the old outer join syntax, but not the old inner join syntax, since SQL Server 2005 and upwards. The only way to talk to a database running on Microsoft SQL Server 2005 or 2008, using the old style outer join syntax, is to set that database in 8.0 compatibility mode (aka SQL Server 2000).
Additionally, the old way, by throwing a bunch of tables at the query optimizer, with a bunch of WHERE clauses, was akin to saying "here you are, do the best you can". With the new syntax, the query optimizer has less work to do in order to figure out what parts goes together.
So there you have it.
LEFT and INNER JOIN is the wave of the future.
The JOIN syntax keeps conditions near the table they apply to. This is especially useful when you join a large amount of tables.
By the way, you can do an outer join with the first syntax too:
WHERE a.x = b.x(+)
Or
WHERE a.x *= b.x
Or
WHERE a.x = b.x or a.x not in (select x from b)
Basically, when your FROM clause lists tables like so:
SELECT * FROM
tableA, tableB, tableC
the result is a cross product of all the rows in tables A, B, C. Then you apply the restriction WHERE tableA.id = tableB.a_id which will throw away a huge number of rows, then further ... AND tableB.id = tableC.b_id and you should then get only those rows you are really interested in.
DBMSs know how to optimise this SQL so that the performance difference to writing this using JOINs is negligible (if any). Using the JOIN notation makes the SQL statement more readable (IMHO, not using joins turns the statement into a mess). Using the cross product, you need to provide join criteria in the WHERE clause, and that's the problem with the notation. You are crowding your WHERE clause with stuff like
tableA.id = tableB.a_id
AND tableB.id = tableC.b_id
which is only used to restrict the cross product. WHERE clause should only contain RESTRICTIONS to the resultset. If you mix table join criteria with resultset restrictions, you (and others) will find your query harder to read. You should definitely use JOINs and keep the FROM clause a FROM clause, and the WHERE clause a WHERE clause.
The first way is the older standard. The second method was introduced in SQL-92, http://en.wikipedia.org/wiki/SQL. The complete standard can be viewed at http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt .
It took many years before database companies adopted the SQL-92 standard.
So the reason why the second method is preferred, it is the SQL standard according the ANSI and ISO standards committee.
The second is preferred because it is far less likely to result in an accidental cross join by forgetting to put inthe where clause. A join with no on clause will fail the syntax check, an old style join with no where clause will not fail, it will do a cross join.
Additionally when you later have to a left join, it is helpful for maintenance that they all be in the same structure. And the old syntax has been out of date since 1992, it is well past time to stop using it.
Plus I have found that many people who exclusively use the first syntax don't really understand joins and understanding joins is critical to getting correct results when querying.
I think there are some good reasons on this page to adopt the second method -using explicit JOINs. The clincher though is that when the JOIN criteria are removed from the WHERE clause it becomes much easier to see the remaining selection criteria in the WHERE clause.
In really complex SELECT statements it becomes much easier for a reader to understand what is going on.
The SELECT * FROM table1, table2, ... syntax is ok for a couple of tables, but it becomes exponentially (not necessarily a mathematically accurate statement) harder and harder to read as the number of tables increases.
The JOIN syntax is harder to write (at the beginning), but it makes it explicit what criteria affects which tables. This makes it much harder to make a mistake.
Also, if all the joins are INNER, then both versions are equivalent. However, the moment you have an OUTER join anywhere in the statement, things get much more complicated and it's virtually guarantee that what you write won't be querying what you think you wrote.
When you need an outer join the second syntax is not always required:
Oracle:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x = b.x(+)
MSSQLServer (although it's been deprecated in 2000 version)/Sybase:
SELECT a.foo, b.foo
FROM a, b
WHERE a.x *= b.x
But returning to your question. I don't know the answer, but it is probably related to the fact that a join is more natural (syntactically, at least) than adding an expression to a where clause when you are doing exactly that: joining.
I hear a lot of people complain the first one is too difficult to understand and that it is unclear. I don't see a problem with it, but after having that discussion, I use the second one even on INNER JOINS for clarity.
To the database, they end up being the same. For you, though, you'll have to use that second syntax in some situations. For the sake of editing queries that end up having to use it (finding out you needed a left join where you had a straight join), and for consistency, I'd pattern only on the 2nd method. It'll make reading queries easier.
Well the first and second queries may yield different results because a LEFT JOIN includes all records from the first table, even if there are no corresponding records in the right table.
If both are inner joins there is no difference in the semantics or the execution of the SQL or performance. Both are ANSI Standard SQL It is purely a matter of preference, of coding standards within your work group.
Over the last 25 years, I've developed the habit that if I have a fairly complicated SQL I will use the INNER JOIN syntax because it is easier for the reader to pick out the structure of the query at a glance. It also gives more clarity by singling out the join conditions from the residual conditions, which can save time (and mistakes) if you ever come back to your query months later.
However for outer joins, for the purpose of clarity I would not under any circumstances use non-ansi extensions.

SQL Implicit Join joining 3 tables

I understand implicit JOINS are outdated but it's for a homework assignment. All his examples for implicit joins only join 2 tables, and I can't find an example anywhere that joins 3.
SELECT Name
FROM Employee
JOIN EmployeeSkills
ON (EmployeeID=E.ID)
JOIN Skill ON (SkillID=S.ID)
WHERE Title=’DBA’;
Here is the explicit version of what I want. How would I write this implicitly?
Thanks
Here's how I'd write it:
SELECT E.Name FROM Employee E,EmployeeSkills ES,Skill S
WHERE E.ID = ES.EmployeeID
AND ES.SkillID = S.ID
AND S.Title=’DBA’;
Pretty much the same answer as the first you received, but, for clarity, once I'd start using aliases I'd use them throughout, and make sure you define them. In the above example, both the employee and the skill could have a Name, and the Title could be an employee's title or the name of a skill. Using the table name (or alias) makes it easy to see what's going on even if you're not familiar with the schema.
Also, might have been the database I was using years ago, but it had a performance hit (very slight, but big difference on big data) if you didn't write the joins in the order you introduced the tables.
It's pretty much the same as the two table examples:
SELECT a.Name
FROM Employee a,EmployeeSkills b ,Skill c
WHERE a.EmployeeID = b.ID
AND b.SkillID = c.ID
AND Title=’DBA’;
Edit: Good point made by Everett Warren to alias, best practice is not using implicit joins as they were deprecated long ago.

Translating Oracle SQL to Access Jet SQL, Left Join

There must be something I'm missing here. I have this nice, pretty Oracle SQL statement in Toad that gives me back a list of all active personnel with the IDs that I want:
SELECT PERSONNEL.PERSON_ID,
PERSONNEL.NAME_LAST_KEY,
PERSONNEL.NAME_FIRST_KEY,
PA_EID.ALIAS EID,
PA_IDTWO.ALIAS IDTWO,
PA_LIC.ALIAS LICENSENO
FROM PERSONNEL
LEFT JOIN PERSONNEL_ALIAS PA_EID
ON PERSONNEL.PERSON_ID = PA_EID.PERSON_ID
AND PA_EID.PERSONNEL_ALIAS_TYPE_CD = 1086
AND PA_EID.ALIAS_POOL_CD = 3796547
AND PERSONNEL.ACTIVE_IND = 1
LEFT JOIN PERSONNEL_ALIAS PA_IDTWO
ON PERSONNEL.PERSON_ID = PA_IDTWO.PERSON_ID
AND PA_IDTWO.PERSONNEL_ALIAS_TYPE_CD = 3839085
AND PA_IDTWO.ACTIVE_IND = 1
LEFT JOIN PERSONNEL_ALIAS PA_LIC
ON PERSONNEL.PERSON_ID = PA_LIC.PERSON_ID
AND PA_LIC.PERSONNEL_ALIAS_TYPE_CD = 1087
AND PA_LIC.ALIAS_POOL_CD = 683988
AND PA_LIC.ACTIVE_IND = 1
WHERE PERSONNEL.ACTIVE_IND = 1 AND PERSONNEL.PHYSICIAN_IND = 1;
This works very nicely. Where I run into problems is when I put it into Access. I know, I know, Access Sucks. Sometimes one needs to use it, especially if one has multiple database types that they just want to store a few queries in, and especially if one's boss only knows Access. Anyway, I was having trouble with the ANDs inside the FROM, so I moved those to the WHERE, but for some odd reason, Access isn't doing the LEFT JOINs, returning only those personnel with EID, IDTWO, and LICENSENO's. Not everybody has all three of these.
Best shot in Access so far is:
SELECT PERSONNEL.PERSON_ID,
PERSONNEL.NAME_LAST_KEY,
PERSONNEL.NAME_FIRST_KEY,
PA_EID.ALIAS AS EID,
PA_IDTWO.ALIAS AS ID2,
PA_LIC.ALIAS AS LICENSENO
FROM ((PERSONNEL
LEFT JOIN PERSONNEL_ALIAS AS PA_EID ON PERSONNEL.PERSON_ID=PA_EID.PERSON_ID)
LEFT JOIN PERSONNEL_ALIAS AS PA_IDTWO ON PERSONNEL.PERSON_ID=PA_IDTWO.PERSON_ID)
LEFT JOIN PERSONNEL_ALIAS AS PA_LIC ON PERSONNEL.PERSON_ID=PA_LIC.PERSON_ID
WHERE (((PERSONNEL.ACTIVE_IND)=1)
AND ((PERSONNEL.PHYSICIAN_IND)=1)
AND ((PA_EID.PRSNL_ALIAS_TYPE_CD)=1086)
AND ((PA_EID.ALIAS_POOL_CD)=3796547)
AND ((PA_IDTWO.PRSNL_ALIAS_TYPE_CD)=3839085)
AND ((PA_IDTWO.ACTIVE_IND)=1)
AND ((PA_LIC.PRSNL_ALIAS_TYPE_CD)=1087)
AND ((PA_LIC.ALIAS_POOL_CD)=683988)
AND ((PA_LIC.ACTIVE_IND)=1));
I think that part of the problem could be that I'm using the same alias (lookup) table for all three joins. Maybe there's a more efficient way of doing this? Still new to SQL land, so any tips as far as that goes would be great. I feel like these should be equivalent, but the Toad query gives me back many many tens of thousands of imperfect rows, and Access gives me fewer than 500. I need to find everybody so that nobody is left out. It's almost as if the LEFT JOINs aren't working at all in Access.
To understand what you are doing, let's look at simplified version of your query:
SELECT PERSONNEL.PERSON_ID,
PA_EID.ALIAS AS EID
FROM PERSONNEL
LEFT JOIN PERSONNEL_ALIAS AS PA_EID ON PERSONNEL.PERSON_ID=PA_EID.PERSON_ID
WHERE PERSONNEL.ACTIVE_IND=1
AND PERSONNEL.PHYSICIAN_IND=1
AND PA_EID.PRSNL_ALIAS_TYPE_CD=1086
AND PA_EID.ALIAS_POOL_CD=3796547
If the LEFT JOIN finds match, your row might look like this:
Person_ID EID
12345 JDB
If it doesn't find a match, (disregard the WHERE clause for a second), it could look like:
Person_ID EID
12345 NULL
When you add the WHERE clauses above, you are telling it to only find records in the PERSONNEL_ALIAS table that meet the condition, but if no records are found, then the values are considered NULL, so they will never satisfy the WHERE condition and no records will come back...
As Joe Stefanelli said in his comment, adding a WHERE clause to a LEFT JOIN'ed table make it act as an INNER JOIN instead...
Further to #Sparky's answer, to get the equivalent of what you're doing in Oracle, you need to filter rows from the tables on the "outer" side of the joins before you join them. One way to do this might be:
For each table on the "outer" side of a join that you need to filter rows from (that is, the three instances of PERSONNEL_ALIAS), create a query that filters the rows you want. For example, the first query (say, named PA_EID) might look something like this:SELECT PERSONNEL_ALIAS.* FROM PERSONNEL_ALIAS WHERE PERSONNEL_ALIAS.PERSONNEL_ALIAS_TYPE_CD = 1086 AND PERSONNEL_ALIAS.ALIAS_POOL_CD = 3796547
In your "best shot in Access so far" query in the original post: a) replace each instance of PERSONNEL_ALIAS with the corresponding query created in Step 1, and, b) remove the corresponding conditions (on PA_EID, PA_IDTWO, and PA_LIC) from the WHERE clause.

INNER JOIN vs multiple table names in "FROM" [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
INNER JOIN versus WHERE clause — any difference?
What is the difference between an INNER JOIN query and an implicit join query (i.e. listing multiple tables after the FROM keyword)?
For example, given the following two tables:
CREATE TABLE Statuses(
id INT PRIMARY KEY,
description VARCHAR(50)
);
INSERT INTO Statuses VALUES (1, 'status');
CREATE TABLE Documents(
id INT PRIMARY KEY,
statusId INT REFERENCES Statuses(id)
);
INSERT INTO Documents VALUES (9, 1);
What is the difference between the below two SQL queries?
From the testing I've done, they return the same result. Do they do the same thing? Are there situations where they will return different result sets?
-- Using implicit join (listing multiple tables)
SELECT s.description
FROM Documents d, Statuses s
WHERE d.statusId = s.id
AND d.id = 9;
-- Using INNER JOIN
SELECT s.description
FROM Documents d
INNER JOIN Statuses s ON d.statusId = s.id
WHERE d.id = 9;
There is no reason to ever use an implicit join (the one with the commas). Yes for inner joins it will return the same results. However, it is subject to inadvertent cross joins especially in complex queries and it is harder for maintenance because the left/right outer join syntax (deprecated in SQL Server, where it doesn't work correctly right now anyway) differs from vendor to vendor. Since you shouldn't mix implicit and explict joins in the same query (you can get wrong results), needing to change something to a left join means rewriting the entire query.
If you do it the first way, people under the age of 30 will probably chuckle at you, but as long as you're doing an inner join, they produce the same result and the optimizer will generate the same execution plan (at least as far as I've ever been able to tell).
This does of course presume that the where clause in the first query is how you would be joining in the second query.
This will probably get closed as a duplicate, btw.
The nice part of the second method is that it helps separates the join condition (on ...) from the filter condition (where ...). This can help make the intent of the query more readable.
The join condition will typically be more descriptive of the structure of the database and the relation between the tables. e.g., the salary table is related to the employee table by the EmployeeID column, and queries involving those two tables will probably always join on that column.
The filter condition is more descriptive of the specific task being performed by the query. If the query is FindRichPeople, the where clause might be "where salaries.Salary > 1000000"... thats describing the task at hand, not the database structure.
Note that the SQL compiler doesn't see it that way... if it decides that it will be faster to cross join and then filter the results, it will cross join and filter the results. It doesn't care what is in the ON clause and whats in the WHERE clause. But, that typically wont happen if the on clause matches a foreign key or joins to a primary key or indexed column. As far as operating correctly, they are identical; as far as writing readable, maintainable code, the second way is probably a little better.
there is no difference as far as I know is the second one with the inner join the new way to write such statements and the first one the old method.
The first one does a Cartesian product on all record within those two tables then filters by the where clause.
The second only joins on records that meet the requirements of your ON clause.
EDIT: As others have indicated, the optimization engine will take care of an attempt on a Cartesian product and will result in the same query more or less.
A bit same. Can help you out.
Left join vs multiple tables in SQL (a)
Left join vs multiple tables in SQL (b)
In the example you've given, the queries are equivalent; if you're using SQL Server, run the query and display the actual exection plan to see what the server's doing internally.

SQL join syntax, when do you use "AS"?

I am working on an SQL statement and just wanted to clarify my understanding of join syntax in MS SQL.
Say I have the SQL...
select t.year from HOUSE AS h
LEFT OUTER JOIN SUBJECT K
ON H.Name = K.Name
INNER JOIN TESTCASE AS T
ON k.year IS NOT NULL
Sorry for the confusing example but basically - why am I able to use LEFT OUTER JOIN SUBJECT K without the keyword AS whereas with an INNER JOIN, I need to use a keyword of AS for INNER JOIN TESTCASE AS T?
'AS' is not required in either of these cases, but I prefer it personally from the point of view of readability, as it conveys exactly what you were meaning to do.
As far as I know the AS is only for easy coding. You can create a smaller or more clear name for the table. So House as H and further in the query you can use H.Name instead of typing House.Name
I believe 'AS' used to be a required keyword, but is no longer needed, only for readability. I find 'As' is often used when creating a name for a data item in the SELECT, but not when giving a name to a table in a JOIN. For example:
SELECT t.ID as TestID FROM [House] h JOIN [TestCase] t ON t.ID=h.TestID
I think this is probably because a set of data items displayed with no 'As' could become confusing as to how many items there are, and how many are just names applied to them.
interesting though where the AS keyword is useful... lets say two tables have name and you get this back from a query in code then it is cool to change the column name to something unique.