What is the difference between implicit/explicit joins? [duplicate] - sql

I understand that lots of people will shout at me now. But from my understanding
Say I have two tables
STUDENTS
student_id
firstname
surname
COURSES
course_id
name
student_id
So the courses table has a foreign key STUDENT_ID meaning that ONE student can have MANY courses yes?
OKAY.
From my understanding, if I want to select all the courses associated with ONE student I could do either of these:
SELECT *
FROM courses AS c, students AS s
WHERE c.student_id = s.student_id
AND s.student_id = 1;
OR
SELECT *
FROM courses AS c
JOIN students AS s ON c.student_id = s.student_id AND s.student_id = 1;
So what's the point of the JOIN when it's essentially EXACTLY the same as the WHERE?
I know my understanding is WRONG but I cannot find a simple answer.
Please enlighten me!

A FOREIGN KEY makes your life simpler when you insert data, because it checks the integrity of that data. It was never meant to help you while retrieving data from the relations.
E.g. with your tables, if you try to insert a course with student_id = 10 when no such student exists, the foreign key constraint will not allow that row.
For an inner join like yours, JOIN ... ON gives exactly the same result as using WHERE. Have a look at this question.
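A minimal sketch of that integrity check, using Python's built-in sqlite3 module and the tables from the question (courses.student_id referencing students). The sample names and ids are made up, and SQLite is only one possible engine; note that SQLite enforces foreign keys only after the pragma shown below is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        firstname  TEXT,
        surname    TEXT
    );
    CREATE TABLE courses (
        course_id  INTEGER PRIMARY KEY,
        name       TEXT,
        student_id INTEGER REFERENCES students(student_id)
    );
    INSERT INTO students VALUES (1, 'Ada', 'Lovelace');
""")

# A course that points at an existing student is accepted.
conn.execute("INSERT INTO courses VALUES (100, 'Algebra', 1)")

# A course that points at a missing student is rejected by the constraint.
try:
    conn.execute("INSERT INTO courses VALUES (101, 'Logic', 10)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

course_count = conn.execute("SELECT COUNT(*) FROM courses").fetchone()[0]
print("second insert rejected:", rejected, "| rows in courses:", course_count)
```

The constraint only guards writes; the SELECT statements in the question behave the same with or without it.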

In short: there is no difference.
Longer explanation: the relational model is based on the "cartesian product". In the query
SELECT a.x , b.y
FROM table_a a, table_b b
;
every possible combination of rows from table_a and table_b is produced. If a contains 10 rows and b contains 100 rows, you would get 1000 rows. Everything you add to the WHERE clause restricts these results to only the pairs of rows that satisfy the WHERE clause. So in
SELECT a.x , b.y, ...
FROM table_a a, table_b b
WHERE a.x = b.y
;
you would get only the pairs of rows for which a.x = b.y is true.
In practice, there are two kinds of WHERE-clause elements: those that relate two tables, and those that compare a column-expression to a constant. The JOIN-clause is a way to specify the first kind of restrictions.
There are some minor differences and complications (NULLs, outer joins), but for the time being the two constructs are equivalent.
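The points above can be checked with a small sketch. It uses Python's sqlite3 module and made-up rows for the question's two tables (the names and ids are assumptions for illustration). It shows the unrestricted product, the two equivalent inner-join spellings, and one of the "minor differences": with an outer join, a filter in ON and the same filter in WHERE are no longer interchangeable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, firstname TEXT, surname TEXT);
    CREATE TABLE courses  (course_id INTEGER PRIMARY KEY, name TEXT, student_id INTEGER);
    INSERT INTO students VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing'), (3, 'Grace', 'Hopper');
    INSERT INTO courses  VALUES (10, 'Algebra', 1), (11, 'Logic', 1), (12, 'Compilers', 2);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# No condition at all: every course is paired with every student (3 * 3 rows).
product = run("SELECT c.course_id, s.student_id FROM courses AS c, students AS s")

# Implicit join: comma in FROM, join condition in WHERE.
implicit = run("""
    SELECT c.course_id, c.name, s.firstname
    FROM courses AS c, students AS s
    WHERE c.student_id = s.student_id AND s.student_id = 1
    ORDER BY c.course_id
""")

# Explicit join: the same condition written in ON.
explicit = run("""
    SELECT c.course_id, c.name, s.firstname
    FROM courses AS c
    JOIN students AS s ON c.student_id = s.student_id AND s.student_id = 1
    ORDER BY c.course_id
""")

# Outer join, filter in ON: students without a matching course are kept (NULL course).
filter_in_on = run("""
    SELECT s.student_id, c.course_id
    FROM students AS s
    LEFT JOIN courses AS c ON c.student_id = s.student_id AND c.name = 'Logic'
    ORDER BY s.student_id
""")

# Outer join, same filter in WHERE: the NULL-extended rows are thrown away again.
filter_in_where = run("""
    SELECT s.student_id, c.course_id
    FROM students AS s
    LEFT JOIN courses AS c ON c.student_id = s.student_id
    WHERE c.name = 'Logic'
    ORDER BY s.student_id
""")

print(len(product), implicit == explicit, filter_in_on, filter_in_where)
```

So for inner joins the choice is about readability, while for outer joins the placement of a condition changes the result.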


'for all' Queries in sql

In class, my professor said that the SQL language does not provide a 'for all' operator.
In order to express 'for all' you have to use 'not exists (X except Y)'.
At this point, I can't figure out why 'for all' has the same meaning as 'not exists (X except Y)'.
Here is an example schema:
course (cID,title,deptName,credit),
teaches (pID,cID,semester,year,classroom),
student (sID,name,gender,deptName)
Q: Find all student names who have taken all courses offered in 'CS' department
The answer is:
Select distinct
S.sid, S.name
from
student as S
where
not exists (
(select cID from course where deptName = 'CS')
except
(select T.cID from takes as T where S.sID = T.sID)
);
Can you give me a specific explanation of that?
P.S. Sorry for my English.
Your professor is right. SQL has no direct way to query all records that have all possible relations of a certain type.
It's easy to query which relations of a certain type a record has. Just INNER JOIN the two tables and you are done.
But in an M:N relationship like "students" to "taken courses" it's not that simple.
To answer the question "which student has taken all possible courses" you must find out which relations could possibly exist and then make sure that all of them do actually exist.
select distinct
S.sid, S.name
from
student as S
where
not exists (
(select cID from course where deptName = 'CS')
except
(select T.cID from takes as T where S.sID = T.sID)
);
can be translated as
give me all students SELECT
for whom it is true: WHERE
that the following set is empty NOT EXISTS
(any course in 'CS') "all relations that can possibly exist"
minus EXCEPT
(all courses the student has taken) "the ones that do actually exist"
In other words: Of all possible relations there is no relation that does not exist.
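A small sketch of that translation, run through Python's sqlite3 module. The takes(sID, cID) table and all sample rows are assumptions for illustration (the question's schema does not list takes). The first loop computes the "possible minus actual" set per student, and the final query keeps the students for whom that set is empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course  (cID TEXT PRIMARY KEY, title TEXT, deptName TEXT);
    CREATE TABLE student (sID INTEGER PRIMARY KEY, name TEXT);
    -- Assumed table: takes(sID, cID) records which student took which course.
    CREATE TABLE takes   (sID INTEGER, cID TEXT, PRIMARY KEY (sID, cID));

    INSERT INTO course  VALUES ('c1', 'Databases', 'CS'), ('c2', 'Compilers', 'CS'),
                               ('c3', 'Algebra', 'Math');
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO takes   VALUES (1, 'c1'), (1, 'c2'), (1, 'c3'), (2, 'c1'), (3, 'c3');
""")

# For each student: the CS courses that are still missing (the X EXCEPT Y set).
missing = {}
for sid, name in conn.execute("SELECT sID, name FROM student ORDER BY sID").fetchall():
    rows = conn.execute("""
        SELECT cID FROM course WHERE deptName = 'CS'
        EXCEPT
        SELECT cID FROM takes WHERE sID = ?
    """, (sid,)).fetchall()
    missing[name] = sorted(r[0] for r in rows)

# "For all CS courses, the student took it" == "the missing set is empty".
# SQLite does not accept parentheses around the operands of EXCEPT, so they are dropped.
took_all_cs = conn.execute("""
    SELECT DISTINCT S.sID, S.name
    FROM student AS S
    WHERE NOT EXISTS (
        SELECT cID FROM course WHERE deptName = 'CS'
        EXCEPT
        SELECT T.cID FROM takes AS T WHERE T.sID = S.sID
    )
""").fetchall()

print(missing)
print(took_all_cs)
```

One consequence worth knowing: if the department offers no courses at all, the missing set is empty for everybody, so every student qualifies.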
There are other ways of expressing the same thought that can be used in database systems without support for EXCEPT.
For example
select
S.sid,
S.name
from
student as S
inner join takes as T on T.sID = S.sID
inner join course as C on C.cID = T.cID
where
C.deptName = 'CS'
group by
S.sid,
S.name
having
count(*) = (select count(*) from course where deptName = 'CS');
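One caveat with the counting approach, sketched with Python's sqlite3 module: count(*) counts rows, so it silently assumes that takes holds at most one row per student and course. The takes table, its semester column and the sample rows below are assumptions for illustration. If a retake can add a second row, counting distinct course ids is the safer comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course  (cID TEXT PRIMARY KEY, deptName TEXT);
    CREATE TABLE student (sID INTEGER PRIMARY KEY, name TEXT);
    -- Assumed table; no uniqueness on (sID, cID), so a retake adds a second row.
    CREATE TABLE takes   (sID INTEGER, cID TEXT, semester TEXT);

    INSERT INTO course  VALUES ('c1', 'CS'), ('c2', 'CS');
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
    -- Ann took both CS courses; Bob took c1 twice and never took c2.
    INSERT INTO takes   VALUES (1, 'c1', 'fall'), (1, 'c2', 'spring'),
                               (2, 'c1', 'fall'), (2, 'c1', 'spring');
""")

QUERY = """
    SELECT S.sID, S.name
    FROM student AS S
    INNER JOIN takes  AS T ON T.sID = S.sID
    INNER JOIN course AS C ON C.cID = T.cID
    WHERE C.deptName = 'CS'
    GROUP BY S.sID, S.name
    HAVING {counter} = (SELECT COUNT(*) FROM course WHERE deptName = 'CS')
    ORDER BY S.sID
"""

# Row counting: Bob's two rows for c1 look like "two CS courses".
with_count_star = conn.execute(QUERY.format(counter="COUNT(*)")).fetchall()

# Distinct counting: Bob has only one distinct CS course and is dropped.
with_count_distinct = conn.execute(QUERY.format(counter="COUNT(DISTINCT C.cID)")).fetchall()

print(with_count_star, with_count_distinct)
```

With a uniqueness constraint on (sID, cID) the two variants agree, so which one you need depends on how takes is defined.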
From your table definitions and requirement, it's not clear what the teaches table is used for. You want the names of the students who have taken all courses offered by the 'CS' department. For this, the student and course tables are enough.
SELECT name
FROM
(
SELECT B.name, A.cid
FROM course A
INNER JOIN student B ON A.deptName = B.deptName
WHERE A.deptName = 'CS'
GROUP BY A.cid, B.name
) A
GROUP BY name
HAVING COUNT(name) >= (SELECT COUNT(cid) FROM course WHERE deptName = 'CS')
The inner query just selects all students who have taken any course offered by the 'CS' department, and with the GROUP BY I make sure that a student who took the same course twice is counted as one row. The outer query then selects the students who have taken all courses offered by the 'CS' department.
I think there is a gap in how you understand your requirement. It does not specify any relation with the teaches table.
Q: Find all student names who have taken all courses offered in 'CS' department
NOT EXISTS returns true if the query passed to it returns 0 rows.
In this case, the subquery inside NOT EXISTS selects all the courses offered in 'CS' and subtracts from that result set all the courses taken by the specific student.
If the student has taken all the courses, then EXCEPT removes all of them and the subquery returns 0 rows, which together with NOT EXISTS gives true for that student, so the student is displayed in the final result set.
Brief history: Codd invented the Relational Model (RM), some people created a DBMS loosely based on RM to prove a RM product could be performant, and the SQL language emerged based on that DBMS (i.e. not directly based on the RM).
Codd came up with a set of primitive operators to define a database as being relationally complete. His algebra included product, where two relations are 'multiplied' together to give a combined relation; this made it into SQL as CROSS JOIN. [Side note: people refer to this operator as 'Cartesian product', which results in a set of ordered pairs. However, product in RM results in a relation (as do all relational operators), and CROSS JOIN results in a table expression (loosely speaking).]
Codd's algebra also included a division operator. I guess the thinking is that we should be able to take the result of product and one of the relations and use an operator to get back the other relation. But it does have some practical use too, of course. It is commonly expressed as 'the suppliers who supply all products', after Chris Date's parts and suppliers database found in his books. SQL lacks an explicit division operator, so we need to use other operators to get the desired result.
Note there are two flavours of division, being exact division ("suppliers who supply all the parts we are interested in and no more") and division with remainder ("suppliers who supply at least all the parts we are interested in and possibly more"). I tend to be wary of the answers here that do not mention either the name 'division' or that you need to decide whether you need to deal with remainders.
The thinking behind your professor's answer is that of a double negative (in mathematics and in English), i.e. if the statement "there is no part I don't supply" is true for a given supplier, then that supplier will be in the result.
Note there are operators that Codd omitted (e.g. rename and summarize) that can now be found in SQL, so it's a shame we are still waiting for division!
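To make the two flavours and the double negative concrete, here is a sketch using Python's sqlite3 module. The tables supplies(supplier, part) and wanted(part) and their rows are hypothetical stand-ins for the parts and suppliers example, not Date's actual schema. "There is no wanted part I don't supply" becomes two nested NOT EXISTS; exact division adds the mirror-image condition "there is no part I supply that is not wanted".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for illustration.
    CREATE TABLE supplies (supplier TEXT, part TEXT, PRIMARY KEY (supplier, part));
    CREATE TABLE wanted   (part TEXT PRIMARY KEY);
    INSERT INTO supplies VALUES ('S1', 'P1'), ('S1', 'P2'),
                                ('S2', 'P1'), ('S2', 'P2'), ('S2', 'P3'),
                                ('S3', 'P1');
    INSERT INTO wanted VALUES ('P1'), ('P2');
""")

# Division with remainder: "there is no wanted part that I don't supply".
with_remainder = conn.execute("""
    SELECT DISTINCT s.supplier
    FROM supplies AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM wanted AS w
        WHERE NOT EXISTS (
            SELECT 1 FROM supplies AS s2
            WHERE s2.supplier = s.supplier AND s2.part = w.part
        )
    )
    ORDER BY s.supplier
""").fetchall()

# Exact division: additionally, "there is no part I supply that is not wanted".
exact = conn.execute("""
    SELECT DISTINCT s.supplier
    FROM supplies AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM wanted AS w
        WHERE NOT EXISTS (
            SELECT 1 FROM supplies AS s2
            WHERE s2.supplier = s.supplier AND s2.part = w.part
        )
    )
    AND NOT EXISTS (
        SELECT 1 FROM supplies AS s3
        WHERE s3.supplier = s.supplier
          AND NOT EXISTS (SELECT 1 FROM wanted AS w2 WHERE w2.part = s3.part)
    )
    ORDER BY s.supplier
""").fetchall()

print(with_remainder, exact)
```

In this data S2 supplies everything wanted plus P3, so it passes division with remainder but not exact division, while S3 fails both.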

SQL difference between IN and JOIN

First I need to say that it is safe to assume that I have no formal education in SQL although I have education in relational algebra.
I am investigating what would be the best approach to the following problem.
Our database is holding texts and keywords for every text.
Articles
id | text
Keywords
id | word
Articles_keywords
id_article | id_keyword
For the sake of this question, whoever answers can assume that the tables are indexed however they want.
So the problem is getting all articles that have a specific keyword.
I have talked with 2 groups of people that solve this in 2 ways, and they both claim that the approach of other group is wrong.
First solution using the IN operator:
SELECT * FROM Articles AS a WHERE a.id IN
(SELECT id_article FROM Articles_Keywords AS ak WHERE ak.id_keyword IN
(SELECT id FROM keywords AS k WHERE k.word = 'xyz'));
Other solution is using JOIN operator of course:
SELECT * FROM Articles as a
JOIN Articles_Keywords as ak
ON a.id = ak.id_article
JOIN Keywords as k
ON k.id = ak.id_keyword
WHERE k.word = 'xyz';
Which approach is better and, above all, why?
Edit
In the Articles table the id column is unique and, just for the sake of this question, we can assume that there are no duplicate texts.
The same goes for the Keywords table.
In the Articles_keywords table the ordered pair (id_article, id_keyword) is unique.
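For comparing the two solutions, a small sketch with Python's sqlite3 module and made-up rows may help. Under the uniqueness assumptions stated in the edit, both queries select the same articles; the visible difference is that SELECT * over the join also returns the columns of the two joined tables. This says nothing about speed, which depends on the engine and the indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
    CREATE TABLE Keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE Articles_keywords (
        id_article INTEGER,
        id_keyword INTEGER,
        PRIMARY KEY (id_article, id_keyword)
    );
    -- Made-up sample rows.
    INSERT INTO Articles VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO Keywords VALUES (1, 'xyz'), (2, 'abc');
    INSERT INTO Articles_keywords VALUES (1, 1), (1, 2), (2, 2), (3, 1);
""")

in_cursor = conn.execute("""
    SELECT * FROM Articles AS a WHERE a.id IN
      (SELECT id_article FROM Articles_keywords AS ak WHERE ak.id_keyword IN
        (SELECT id FROM Keywords AS k WHERE k.word = 'xyz'))
    ORDER BY a.id
""")
in_columns = [d[0] for d in in_cursor.description]
in_rows = in_cursor.fetchall()

join_cursor = conn.execute("""
    SELECT * FROM Articles AS a
    JOIN Articles_keywords AS ak ON a.id = ak.id_article
    JOIN Keywords AS k ON k.id = ak.id_keyword
    WHERE k.word = 'xyz'
    ORDER BY a.id
""")
join_columns = [d[0] for d in join_cursor.description]
join_rows = join_cursor.fetchall()

# Same articles either way; the join just carries extra columns along.
same_articles = [row[:2] for row in join_rows] == in_rows
print(len(in_columns), len(join_columns), same_articles)
```

Writing SELECT a.* in the join version would make the two result sets identical here. If the uniqueness assumptions were dropped, the join could return an article more than once, whereas IN never duplicates rows of the outer table.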

SQLZOO #12 -- confused about multiple select & join statements

I am attempting to answer question #12 on sqlzoo.net
(http://sqlzoo.net/wiki/More_JOIN_operations). I couldn't figure out the answer on my own but I did manage to find the answer online.
12: Which were the busiest years for 'John Travolta', show the year and the number of movies he made each year for any year in which he made more than 2 movies.
Answer:
SELECT yr,COUNT(title) FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr
HAVING COUNT(title)=(SELECT MAX(c) FROM
(SELECT yr,COUNT(title) AS c FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr) AS t)
One of the parts that I do not fully understand is the multiple joins:
FROM movie
JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
Is Actor being joined only with Movie, or is actor being joined with Movie JOIN Casting?
I am trying to find a website that explains complex join statements, as my attempted answer was far from correct (missing many sections). Subselects combined with multiple joins are a bit confusing to me at the moment, and I could not find a good website that breaks the information up to help me form my own queries.
The other part I don't fully understand is this:
(SELECT yr,COUNT(title) AS c FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr) AS t)
What is the above code trying to find?
Ok, glad you are not afraid to ask, and I'll do my best to help clarify what is going on... Please excuse my re-formatting of the query to my mindset of writing queries. It better shows the relationships of where things are coming from (my perspective), and may help you too.
A few other things about my rewrite. I also like to use alias references to the tables so every column is qualified with the table (or alias) it originates from. It prevents ambiguity, especially for someone who does not know your table structures and relationships between tables. (m = alias to movie, c = alias for casting, a = alias for actor tables). For the sub query, and to keep alias confusion clear, I suffixed them with 2, such as m2, c2, a2.
SELECT
m.yr,
COUNT(m.title)
FROM
movie m
JOIN casting c
ON m.id = c.movieid
JOIN actor a
ON c.actorid = a.id
WHERE
a.name = 'John Travolta'
GROUP BY
m.yr
HAVING
COUNT(m.title) = ( SELECT MAX(t.movieCount)
FROM
( SELECT m2.yr,
COUNT(m2.title) AS movieCount
FROM
movie m2
JOIN casting c2
ON m2.id = c2.movieid
JOIN actor a2
ON c2.actorid = a2.id
WHERE
a2.name='John Travolta'
GROUP BY
m2.yr ) AS t
)
First, notice that the outermost query (aliases m, c, a) and the innermost query (aliases m2, c2, a2) are virtually identical.
The query has to be read from the deepest query first... in this case the m2, c2, a2 query. Look at it and see what IT is going to deliver. If you ran that, you would get every year he had a movie and the number of movies... with their sample data the result goes from 1976 all the way to 2010. So far, nothing complex in itself (about 20 rows). Now, just as each table may have an alias, each subquery used in a FROM clause (such as this one) MUST have an alias in most engines, so that is why the "AS t" is there. So, there is no true table; it is wrapping the entire query's result set and assigning THAT the alias of "t".
So now, go one level up in the query also wrapped in parens...
SELECT MAX(t.movieCount)
FROM (EntireSubquery as t)
Although abbreviated, this is what the engine is doing: looking at the subquery result given the alias "t" and finding the maximum "movieCount" value, which is the highest number of movies done in any one year. In this case, the actual number is 3 and we are almost done.
Now, to the outermost query... again, this is virtually identical to the innermost query. The only difference is the HAVING clause. This is applied after all the grouping per year is performed. Then it compares ITS count per year to the value 3 returned by the SELECT MAX( t.movieCount )...
So, all the years that had only 1 or 2 movies are excluded from the result, and only the one year that had 3 movies is included.
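The three levels can be run one at a time to see what each delivers. This is a sketch with Python's sqlite3 module; the table layout follows the queries in this thread, but the rows are made up and are not the sqlzoo data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie   (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
    CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER);
    -- Made-up rows, not the sqlzoo data.
    INSERT INTO actor VALUES (1, 'John Travolta'), (2, 'Someone Else');
    INSERT INTO movie VALUES (1, 'A', 1976), (2, 'B', 1977), (3, 'C', 1977),
                             (4, 'D', 1998), (5, 'E', 1998), (6, 'F', 1998),
                             (7, 'G', 1998);
    INSERT INTO casting VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2);
""")

PER_YEAR = """
    SELECT m2.yr, COUNT(m2.title) AS movieCount
    FROM movie m2
    JOIN casting c2 ON m2.id = c2.movieid
    JOIN actor a2 ON c2.actorid = a2.id
    WHERE a2.name = 'John Travolta'
    GROUP BY m2.yr
"""

# Step 1: the innermost query, one row per year with that year's movie count.
per_year = conn.execute(PER_YEAR + " ORDER BY m2.yr").fetchall()

# Step 2: wrap it as derived table t and take the largest count.
max_count = conn.execute(
    f"SELECT MAX(t.movieCount) FROM ({PER_YEAR}) AS t"
).fetchone()[0]

# Step 3: the outer query keeps only the years whose count equals that maximum.
busiest = conn.execute(f"""
    SELECT m.yr, COUNT(m.title)
    FROM movie m
    JOIN casting c ON m.id = c.movieid
    JOIN actor a ON c.actorid = a.id
    WHERE a.name = 'John Travolta'
    GROUP BY m.yr
    HAVING COUNT(m.title) = (SELECT MAX(t.movieCount) FROM ({PER_YEAR}) AS t)
""").fetchall()

print(per_year, max_count, busiest)
```

Movie 'G' belongs to the other actor, so it is filtered out by the WHERE clause before any counting happens.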
Now, to clarify the JOINs. Each table should have a relationship with one or more other tables (some are linking tables, such as the casting table, which refers to both a movie and an actor/actress). So, think of the join as: how do I put the tables in order so that each one touches the next until I have them all chained together? In this case
Movie -> Casting linked by the movie ID, then Casting -> Actor by the actor ID, so that is how I lay it out visually, hierarchically... I am starting FROM the Movie table, JOINing to the Casting table based ON Movie ID = Casting Movie ID. Then the Casting table is joined to the Actor table based on the common Actor ID field.
FROM
movie m
JOIN casting c
ON m.id = c.movieid
JOIN actor a
ON c.actorid = a.id
Now, this is a simple relationship, but you COULD have one primary table with multiple child-level tables. You could join multiple tables based on the respective data. Very simple sample to clarify the point. You have a student table going to a school. A student has a degree major, an ethnicity, an address state (assuming an online school and students can be from any state). If you had lookup tables for degrees, ethnicity and states, you might come up with something like...
select
s.firstname,
s.lastname,
d.DegreeDescription,
e.ethnicityDescription,
st.stateName
from
students s
join degrees d
on s.degreemajor = d.degreeID
join ethnicity e
on s.ethnicityID = e.id
join states st
on s.homeState = st.stateID
Notice the hierarchical representation that each table is directly associated under that of the student. Not all tables need to be one deeper than the last.
So, there are many sites out there, such as the w3schools as offered by Mark, but learn to dissect small pieces at a time... what are the bare minimum tables to get from point A to point Z, and draw the relationships. THEN, pare down based on the requirement criteria you are looking for.
The correct answer would be:
SELECT yr, COUNT(title)
FROM movie m
JOIN casting c ON m.id=c.movieid JOIN actor a ON c.actorid=a.id
WHERE name='John Travolta'
GROUP BY yr
HAVING COUNT(title) > 2;
The answer you found (which seems to be a mistake on the sqlzoo site) is looking for any year that has a count equal to the year with the highest count.
I used table aliases in the query above to clear up how the tables are joined. Movie is joined to casting and casting is joined to actor.
The subquery that confuses you is listing each year and a count of movies for that year that star John Travolta. It's not needed if you're answering the question as written.
As for learning resources, make sure you have the basics down. Understand everything at http://w3schools.com/sql. Try searching for "sql joining multiple tables" in your favorite search engine when you're ready for more.
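The difference between the two HAVING clauses only shows up when more than one year has over 2 movies. A sketch with Python's sqlite3 module and made-up rows (not the sqlzoo data) where 1998 has 3 movies and 2004 has 4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie   (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
    CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER);
    -- Made-up rows, not the sqlzoo data: 1 film in 1977, 3 in 1998, 4 in 2004.
    INSERT INTO actor VALUES (1, 'John Travolta');
    INSERT INTO movie VALUES (1, 'A', 1977),
                             (2, 'B', 1998), (3, 'C', 1998), (4, 'D', 1998),
                             (5, 'E', 2004), (6, 'F', 2004), (7, 'G', 2004), (8, 'H', 2004);
    -- Cast actor 1 in every movie.
    INSERT INTO casting SELECT id, 1 FROM movie;
""")

GROUPED = """
    SELECT yr, COUNT(title)
    FROM movie m
    JOIN casting c ON m.id = c.movieid
    JOIN actor a ON c.actorid = a.id
    WHERE name = 'John Travolta'
    GROUP BY yr
"""

# The question as written: every year with more than 2 movies.
more_than_two = conn.execute(GROUPED + " HAVING COUNT(title) > 2 ORDER BY yr").fetchall()

# The answer found online: only the year(s) that tie for the highest count.
equal_to_max = conn.execute(GROUPED + """
    HAVING COUNT(title) = (
        SELECT MAX(t.n) FROM (
            SELECT COUNT(m2.title) AS n
            FROM movie m2
            JOIN casting c2 ON m2.id = c2.movieid
            JOIN actor a2 ON c2.actorid = a2.id
            WHERE a2.name = 'John Travolta'
            GROUP BY m2.yr
        ) AS t
    )
    ORDER BY yr
""").fetchall()

print(more_than_two, equal_to_max)
```

When only one year exceeds 2 movies, as described for the site's data above, both versions happen to return the same row.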

Cannot find correct number of values in a table that are not in another table, though I can do otherwise

I want to retrieve the course_id in table course that is not in the table takes. Table takes only contains course_id of courses taken by students. The problem is that if I have:
select count (distinct course.course_id)
from course, takes
where course.course_id = (takes.course_id);
the result is 85 which is smaller than the total number of course_id in table course, which is 200. The result is correct.
But I want to find the number of course_id that are not in the table takes, and I have:
select count (distinct course.course_id)
from course, takes
where course.course_id != (takes.course_id);
The result is 200, which is equal the number of course_id in table course. What is wrong with my code?
This SQL will give you the count of course_id in table course that aren't in the table takes:
select count (*)
from course c
where not exists (select *
from takes t
where c.course_id = t.course_id);
You didn't specify your DBMS, however, this SQL is pretty standard so it should work in the popular DBMSs.
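As for why the != version returns every course, here is a sketch with Python's sqlite3 module and three made-up courses (the student_id column in takes is an assumption). The comma join pairs each course with every row of takes, and != keeps a course as soon as ONE takes row has a different course_id, which is true for practically every course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course (course_id TEXT PRIMARY KEY);
    CREATE TABLE takes  (student_id INTEGER, course_id TEXT);
    -- c1 and c2 have been taken, c3 has not.
    INSERT INTO course VALUES ('c1'), ('c2'), ('c3');
    INSERT INTO takes  VALUES (1, 'c1'), (2, 'c1'), (2, 'c2');
""")

# Every (course, takes row) pair whose ids differ. Taken courses show up too,
# because they differ from the takes rows of OTHER courses.
pairs = conn.execute("""
    SELECT course.course_id, takes.course_id
    FROM course, takes
    WHERE course.course_id != takes.course_id
    ORDER BY 1, 2
""").fetchall()

wrong_count = conn.execute("""
    SELECT COUNT(DISTINCT course.course_id)
    FROM course, takes
    WHERE course.course_id != takes.course_id
""").fetchone()[0]

# NOT EXISTS asks the intended question: is there NO takes row for this course?
right_count = conn.execute("""
    SELECT COUNT(*)
    FROM course c
    WHERE NOT EXISTS (SELECT * FROM takes t WHERE c.course_id = t.course_id)
""").fetchone()[0]

print(pairs, wrong_count, right_count)
```

In this data c1 is counted by the != query because it differs from the takes row for c2, even though c1 has been taken.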
There are a few different ways to accomplish what you're looking for. My personal favorite is the LEFT JOIN condition. Let me walk you through it:
Fact One: You want to return a list of courses
Fact Two: You want to filter that list to not include anything in the Takes table.
I'd go about this by first mentally selecting a list of courses:
SELECT c.Course_ID
FROM Course c
and then filtering out the ones I don't want. One way to do this is to use a LEFT JOIN to get all the rows from the first table, along with any that happen to match in the second table, and then filter out the rows that actually do match, like so:
SELECT c.Course_ID
FROM
Course c
LEFT JOIN -- note the syntax: 'comma joins' are a bad idea.
Takes t ON
c.Course_ID = t.Course_ID -- at this point, you have all courses
WHERE t.Course_ID IS NULL -- this phrase means that none of the matching records will be returned.
Another note: as mentioned above, comma joins should be avoided. Instead, use the syntax I demonstrated above (INNER JOIN or LEFT JOIN, followed by the table name and an ON condition).
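To see the two stages of the LEFT JOIN approach separately, here is a sketch with Python's sqlite3 module and made-up rows (the Student_ID column is an assumption). The first query shows what the LEFT JOIN produces before filtering; the second adds the IS NULL filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Course (Course_ID TEXT PRIMARY KEY);
    CREATE TABLE Takes  (Student_ID INTEGER, Course_ID TEXT);
    -- c1 was taken twice, c2 once, c3 never.
    INSERT INTO Course VALUES ('c1'), ('c2'), ('c3');
    INSERT INTO Takes  VALUES (1, 'c1'), (2, 'c1'), (2, 'c2');
""")

# Stage 1: all courses, each with its matching Takes rows, or NULL when there is none.
before_filter = conn.execute("""
    SELECT c.Course_ID, t.Course_ID
    FROM Course c
    LEFT JOIN Takes t ON c.Course_ID = t.Course_ID
    ORDER BY c.Course_ID
""").fetchall()

# Stage 2: keep only the rows where nothing matched.
not_taken = conn.execute("""
    SELECT c.Course_ID
    FROM Course c
    LEFT JOIN Takes t ON c.Course_ID = t.Course_ID
    WHERE t.Course_ID IS NULL
    ORDER BY c.Course_ID
""").fetchall()

print(before_filter, not_taken)
```

A course that was taken several times appears several times in stage 1, but an untaken course always appears exactly once with NULL on the right, so no DISTINCT is needed in stage 2.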

Efficient way to select records missing in another table

I have 3 tables. Below is the structure:
student (id int, name varchar(20))
course (course_id int, subject varchar(10))
student_course (st_id int, course_id int) -> contains name of students who enrolled for a course
Now, I want to write a query to find out students who did not enroll for any course. As I could figure out there are multiple ways to fetching this information. Could you please let me know which one of these is the most efficient and also, why. Also, if there could be any other better way of executing same, please let me know.
db2 => select distinct name from student inner join student_course on id not in (select st_id from student_course)
db2 => select name from student minus (select name from student inner join student_course on id=st_id)
db2 => select name from student where id not in (select st_id from student_course)
Thanks in advance!!
The subqueries you use, whether it is not in, minus or whatever, are generally inefficient. A common way to do this is a left join:
select name
from student
left join student_course on id = st_id
where st_id is NULL
Using a join is the "normal" and preferred solution.
The canonical (maybe even synoptic) idiom is (IMHO) to use NOT EXISTS :
SELECT *
FROM student st
WHERE NOT EXISTS (
SELECT *
FROM student_course nx
WHERE st.id = nx.st_id
);
Advantages:
NOT EXISTS(...) is very old and most optimisers will know how to handle it, thus it will probably be present on all platforms
the nx. correlation name is not leaked into the outer query: the select * in the outer query will only yield fields from the student table, and not the (null) rows from the student_course table, like in the LEFT JOIN ... WHERE ... IS NULL case. This is especially useful in queries with a large number of range table entries.
(NOT) IN is error prone (NULLs), and it might perform badly on some implementations (duplicates and NULLs have to be removed from the result of the uncorrelated subquery)
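The NULL problem with NOT IN is easy to reproduce. A sketch with Python's sqlite3 module and made-up rows, assuming st_id is nullable and one enrollment row has lost its student id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_course (st_id INTEGER, course_id INTEGER);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- One enrollment row has no student id (NULL).
    INSERT INTO student_course VALUES (1, 10), (NULL, 11);
""")

# "2 NOT IN (1, NULL)" evaluates to NULL rather than true, so no row survives.
not_in = conn.execute("""
    SELECT name FROM student
    WHERE id NOT IN (SELECT st_id FROM student_course)
    ORDER BY id
""").fetchall()

# NOT EXISTS only asks whether a matching row exists; the NULL row never matches.
not_exists = conn.execute("""
    SELECT st.name
    FROM student st
    WHERE NOT EXISTS (
        SELECT * FROM student_course nx WHERE nx.st_id = st.id
    )
    ORDER BY st.id
""").fetchall()

print(not_in, not_exists)
```

If st_id is declared NOT NULL, the two queries agree; the trap only opens when the subquery can return a NULL.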
Using "not in" is generally slow. That makes your second query the most efficient. You probably don't need the brackets though.
Just as a comment: I would suggest selecting the student id (which is unique) and not the name.
As another query option, you might want to left join the two tables, group by the student id, and keep the groups having count(course_id) = 0.
Also, I agree that indexes will be more important.
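The grouping variant can be sketched as follows, using Python's sqlite3 module and made-up rows. It assumes an outer (left) join, since an inner join would never produce a group with a count of 0. The LEFT JOIN ... IS NULL form is included for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_course (st_id INTEGER, course_id INTEGER);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO student_course VALUES (1, 10), (1, 11), (3, 10);
""")

# LEFT JOIN, then keep the students for whom nothing matched.
left_join_null = conn.execute("""
    SELECT id, name
    FROM student
    LEFT JOIN student_course ON id = st_id
    WHERE st_id IS NULL
    ORDER BY id
""").fetchall()

# LEFT JOIN, group per student, keep groups with no enrollments.
# COUNT(sc.course_id) ignores the NULL produced for unmatched students.
group_count_zero = conn.execute("""
    SELECT s.id, s.name
    FROM student s
    LEFT JOIN student_course sc ON sc.st_id = s.id
    GROUP BY s.id, s.name
    HAVING COUNT(sc.course_id) = 0
    ORDER BY s.id
""").fetchall()

print(left_join_null, group_count_zero)
```

Both return the same students here. Which one runs faster depends on the engine, the data and the indexes, so it is worth checking the query plan on the real tables.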