Don't understand why inner join is necessary for filtering in SQL

I have the following tables:
Basically I have a many2many relation between students and courses using the junction table students_courses
Here is some data populated into the tables:
students:
courses
students_courses:
So basically I would like to select the full_name and c_id for a given student. For example, for the student with id=3 I would get Aurica 5 and Aurica 6.
My first approach was to write:
select s.full_name,sc.c_id from students s, students_courses sc
where sc.s_id=3
But i obtain this:
Aurica 5
Aurica 6
Aurica 5
Aurica 6
Aurica 5
Aurica 6
So it is duplicated by the number of rows of the students_courses table. Now I'm not sure why this happens.
If I would be an SQL parser, I would parse it like this:
"take the c_id from students_courses, full_name from students, and display them if the students_course row respects the where filter"
Now it works using a join, but I don't really understand why the inner join is necessary.
select s.full_name, sc.c_id from students s
inner join students_courses sc
on sc.s_id=s.id and s.id=3;
Please explain how the first query is interpreted by the SQL parser and why it works with the join.
Thanks,

When you select from two tables without a join condition, the database builds the cross product of all their records and then keeps only the combined records that satisfy the WHERE clause. You have 3 records in the students table:
id | full_name
---+----------
3 | Aurica
4 | Aurica
5 | Aurica
And 6 records in the students_courses table:
s_id | c_id
-----+-----
3 | 5
3 | 6
4 | 7
4 | 8
5 | 9
5 | 10
So before your WHERE clause is applied, there are 3 x 6 = 18 combined records. To make this easy to see, I will include all of the columns:
s.id | s.full_name | sc.s_id | sc.c_id
-----+-------------+---------+--------
3 | Aurica | 3 | 5
3 | Aurica | 3 | 6
3 | Aurica | 4 | 7
3 | Aurica | 4 | 8
3 | Aurica | 5 | 9
3 | Aurica | 5 | 10
4 | Aurica | 3 | 5
4 | Aurica | 3 | 6
4 | Aurica | 4 | 7
4 | Aurica | 4 | 8
4 | Aurica | 5 | 9
4 | Aurica | 5 | 10
5 | Aurica | 3 | 5
5 | Aurica | 3 | 6
5 | Aurica | 4 | 7
5 | Aurica | 4 | 8
5 | Aurica | 5 | 9
5 | Aurica | 5 | 10
From there it only displays the ones where sc.s_id=3:
s.full_name | sc.c_id
------------+--------
Aurica | 5
Aurica | 6
Aurica | 5
Aurica | 6
Aurica | 5
Aurica | 6
The second query compares sc.s_id with s.id and only keeps the rows where those values are the same and where s.id=3.
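The row counts above can be checked with a small sketch. It uses Python's sqlite3 module with an in-memory database as a stand-in for whatever engine the question uses; the table contents are the ones listed in this answer.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE students_courses (s_id INTEGER, c_id INTEGER);
    INSERT INTO students VALUES (3, 'Aurica'), (4, 'Aurica'), (5, 'Aurica');
    INSERT INTO students_courses VALUES (3, 5), (3, 6), (4, 7), (4, 8), (5, 9), (5, 10);
""")

# No join condition at all: every student paired with every students_courses row.
cross = conn.execute(
    "SELECT s.id, sc.s_id, sc.c_id FROM students s, students_courses sc"
).fetchall()

# The first query from the question: the filter alone does not link the tables.
filtered = conn.execute(
    "SELECT s.full_name, sc.c_id FROM students s, students_courses sc "
    "WHERE sc.s_id = 3"
).fetchall()

# The second query from the question: the ON clause links the tables.
joined = conn.execute(
    "SELECT s.full_name, sc.c_id FROM students s "
    "INNER JOIN students_courses sc ON sc.s_id = s.id AND s.id = 3 "
    "ORDER BY sc.c_id"
).fetchall()

print(len(cross), len(filtered), joined)
```

The three counts correspond to the 18-row cross product, the 6 duplicated rows, and the 2 rows that were actually wanted.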

The SQL parser doesn't try to guess how your two tables are related. It might seem like the database engine has enough information to figure this out itself by following constraints, but SQL intentionally doesn't use the FK relationships to decide how to join your tables; you might want to remove constraints at a future date for some reason (such as to improve performance), and you wouldn't want dropping a constraint to alter how joins are made. The DBA needs the freedom to change indexes and constraints without having to worry about changing what results are returned by queries.
Since it can't count on having complete information to go on, the SQL engine is not in the business of deducing/guessing relationships. It's up to the person writing the SQL to specify what they are joining on. If you don't give it any instructions telling it how to hook up the tables (using a JOIN ON clause or WHERE clause) then it creates a cross join, which gives you the duplicated results.

First of all, SQL is a set-based language, you operate on sets of data, not on single (rows of) data.
If I would be an SQL parser, I would parse it like this: "take the
c_id from students_courses, full_name from students, and display them
if the students_course row respects the where filter"
Here, you're overlooking the sets students_courses and students and thinking only about each row of data, as in "if this row respects the filter, give me all the information".
The JOIN doesn't filter data (that's what WHERE does), but instead it puts it together.
When you SELECT from table A, you ask for the set of rows in A, all of them.
When you SELECT from table A WHERE some condition, you ask for the set of rows in A that respect the condition (so the SQL engine discards rows from A that do not belong to the set you described with your query).
When you JOIN table_a and table_b, you ask to join the set of rows in a with the set of rows in b, obtaining a new set whose rows are the "concatenation" (let me use that term) of the columns from a row in A and the columns from a row in B; this, without giving any other information about how to join the rows, simply results in each row of table_a joined with each row of table_b.
That's why you don't get what you expect.
Finally, from a conceptual point of view, I'd like to point out that the SQL engine doesn't pick the columns you request from one table or another. Instead, after (1) having joined the rows of all the tables you requested and (2) having filtered out any row that doesn't match the WHERE condition, it just returns the columns you requested from the rows of the resulting set.
In real life, an RDBMS may reorder these operations and apply any optimization it finds possible based on indexes and other information it has about the query and the tables.
This should give you a rough idea of what's going on. But as @GordonLinoff suggested, I think you should get a stronger grounding in SQL and relational databases before you go any further, or it will get harder than this.
As a side note, what you had in your FROM clause, is a sort of implicit join, a former join syntax in which the FROM clause specifies the tables involved, and the WHERE clause the join predicate (the columns whose values should match to join the data).

If you would have done something like
select s.full_name,sc.c_id
from students s, students_courses sc
where sc.s_id = s.id --<-- you left this out
AND sc.s_id=3
You would have got the same results as with the INNER JOIN. An inner join is not strictly necessary for this statement, but it is good practice to use the newer INNER JOIN syntax to retrieve data.
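As a quick check of that equivalence, here is a minimal sketch using Python's sqlite3 module with an in-memory database (an assumption; the question does not name its engine). The sample rows are the ones from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, not the asker's database
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE students_courses (s_id INTEGER, c_id INTEGER);
    INSERT INTO students VALUES (3, 'Aurica'), (4, 'Aurica'), (5, 'Aurica');
    INSERT INTO students_courses VALUES (3, 5), (3, 6), (4, 7), (4, 8), (5, 9), (5, 10);
""")

# Old-style (implicit) join: the join predicate lives in the WHERE clause.
implicit = conn.execute("""
    SELECT s.full_name, sc.c_id
    FROM students s, students_courses sc
    WHERE sc.s_id = s.id   -- join predicate
      AND sc.s_id = 3      -- filter
    ORDER BY sc.c_id
""").fetchall()

# Explicit join: the join predicate lives in the ON clause.
explicit = conn.execute("""
    SELECT s.full_name, sc.c_id
    FROM students s
    INNER JOIN students_courses sc ON sc.s_id = s.id
    WHERE s.id = 3
    ORDER BY sc.c_id
""").fetchall()

print(implicit == explicit, implicit)
```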

Both of your queries are in fact joins; in your first example the word "join" just doesn't appear (but it is there, trust me).
However, that's an old-style join and it's no longer recommended. In short, the join condition has to go into the WHERE clause, where it is easy to forget. Your first query leaves it out, so every student is paired with every students_courses row, and that's why you get the wrong result.
For more details see here.

Flatten tree structure represented in SQL [duplicate]

I'm using an engineering calculation package and trying to extract some information from it with a built-in reporting tool that allows SQL queries.
An abbreviated example of the SQL table is as follows:
Id | Description | Ref
---|---------------------
1 | system 1 |
3 | block 4 | 6
3 | block 4 | 1
5 | formula1 | 3
6 | f |
7 | something | 1
9 | cheese | 5
The "Ref" column identifies rows that are subrecords of other items.
What I want to do is run a query that will produce a list showing all items that appear on each page. As you can see from the table above, "Id" is not a unique key; each item can appear in multiple locations within the table. In the example above:
ID 5 is a subitem of ID3
ID 3 is a subitem of ID 1 AND ID 6
ID 1 and ID 6 aren't subitems of anything
So effectively it is representing a tree structure:
ID 1
+---- ID 7
+---- ID 3
      +---- ID 5
            +---- ID 9
ID 6
+---- ID 3
      +---- ID 5
            +---- ID 9
What I'm hoping to do is work out which items appear under each top-level item (so the end result should be a table where only top-level items appear in the "Ref" column):
Id | Description | Ref
---|---------------------
1 | system 1 |
3 | block 4 | 6
3 | block 4 | 1
5 | formula1 | 1
5 | formula1 | 6
6 | f |
9 | cheese | 1
9 | cheese | 6
7 | something | 1
The tree structure can be a total of 5 levels deep
I've been trying to use left joins to build up a list of page references, but I think I'm also going to need to union results tables (because obviously rows like ID=9, ID=5, and ID = 6 have to be duplicated in the final results set). It starts to get a bit messy!
WITH A
AS (SELECT *
    FROM [RbdBlocks]),
B
AS (SELECT [x].[Id],
           [x].[Description],
           [x].[Page] AS Page1,
           [y].[Page] AS Page2 -- no trailing comma before FROM
    FROM A AS x
    LEFT OUTER JOIN
         A AS y
    ON y.Id = x.Page)
SELECT *
FROM B
The above gives me some of the nested references, but I'm not sure if there's a better way to get this data together, and to manage the recursion rather than just duplicating the set of queries 4 times?
Have a look at Recursive Common Table Expressions (CTEs). They should be able to accomplish exactly what you need.
Have a look at Example D on the SQL Docs page.
Basically what you'd do in your case is:
In the "anchor member" of the CTE, select all top-level items
In the "recursive member" of the CTE, join all of the nested children to the top-level item
Recursive CTEs are not really trivial to understand, so be sure to read the docs carefully.
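A minimal sketch of such a recursive CTE, run through Python's sqlite3 module against the abbreviated example table. The table name items and the lower-case column names are assumptions made for illustration. SQLite requires the keyword WITH RECURSIVE, whereas SQL Server writes plain WITH for the same construct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
    CREATE TABLE items (id INTEGER, description TEXT, ref INTEGER);
    INSERT INTO items VALUES
        (1, 'system 1', NULL),
        (3, 'block 4', 6),
        (3, 'block 4', 1),
        (5, 'formula1', 3),
        (6, 'f', NULL),
        (7, 'something', 1),
        (9, 'cheese', 5);
""")

rows = conn.execute("""
    WITH RECURSIVE tree(id, description, top) AS (
        -- anchor member: top-level items are their own root
        SELECT id, description, id FROM items WHERE ref IS NULL
        UNION ALL
        -- recursive member: each child inherits the root of its parent
        SELECT i.id, i.description, t.top
        FROM items i
        JOIN tree t ON i.ref = t.id
    )
    SELECT DISTINCT id, description,
           CASE WHEN id = top THEN NULL ELSE top END AS top_ref
    FROM tree
    ORDER BY id, top_ref
""").fetchall()

for row in rows:
    print(row)
```

The result has one row per (item, top-level ancestor) pair, which matches the desired table in the question: items 5 and 9 appear once under 1 and once under 6.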

Collect ID with specific counts and complaints

I have two tables, one has complaints(case_dtl) and the other has the products and its different versions(install_dtl). user_id is a column that can be used to join these two tables.
I'm required to calculate the number of users(count) that are on specific version of the product and the total number of complaints for that version of the product.
I can calculate the count for different versions by a simple group by but I'm struggling to "concatenate" user id's with this count and then join these user id's with the user id's in case_dtl table to collect the number of complaints for that specific version of the product.
I am trying to write this query in Teradata SQL.
Here's a sample(I am terribly sorry for doing such a pathetic job in creating a table. I tried and would love any help in that too):
case_dtl table(complaints):
User_ID |Complaint
1 |Yes
2 |Yes
3 |Yes
7 |Yes
install_dtl table(software versions table):
User_ID | Version
1 | 10
2 | 11
3 | 10
4 | 11
5 | 11
6 | 10
7 | 10
8 | 10
9 | 10
10|10
And, I need output like this:
Output:
Version |Complaint Count | User Count
10 |3|7
11 |1|3
You just need an outer join:
select
t1.version,
count(t2.user_id) as complaint_count, -- only users with a matching complaint row
count(t1.user_id) as user_count
from
install_dtl t1
left join case_dtl t2
on t1.user_id = t2.user_id
group by t1.version
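Here is a runnable sketch of the outer-join approach, grouped by version, on the sample data from the question. It uses Python's sqlite3 module as a stand-in for Teradata (an assumption), and it assumes at most one case_dtl row per user, as in the sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Teradata (assumption)
conn.executescript("""
    CREATE TABLE case_dtl (user_id INTEGER, complaint TEXT);
    CREATE TABLE install_dtl (user_id INTEGER, version INTEGER);
    INSERT INTO case_dtl VALUES (1, 'Yes'), (2, 'Yes'), (3, 'Yes'), (7, 'Yes');
    INSERT INTO install_dtl VALUES
        (1, 10), (2, 11), (3, 10), (4, 11), (5, 11),
        (6, 10), (7, 10), (8, 10), (9, 10), (10, 10);
""")

result = conn.execute("""
    SELECT t1.version,
           COUNT(t2.user_id) AS complaint_count,  -- NULLs from unmatched rows are not counted
           COUNT(t1.user_id) AS user_count
    FROM install_dtl t1
    LEFT JOIN case_dtl t2 ON t1.user_id = t2.user_id
    GROUP BY t1.version
    ORDER BY t1.version
""").fetchall()

print(result)
```

If a user could file several complaints, the join would repeat that user's install row, so the user count would need COUNT(DISTINCT t1.user_id) instead.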

TSQL change in query to and query

I have one to many relationship table
ReviewId EffectId
1 | 2
1 | 5
1 | 8
2 | 2
2 | 5
2 | 9
2 | 3
3 | 3
3 | 2
3 | 9
On the site, the user selects effects, and I get all the relevant reviews.
I do this with an IN query.
For example, if the user selects effects 2 and 5,
my query is:
select reviewid from table_name where effectid in (2,5)
Now I need to get all the reviews that contain both effects:
all reviews that have effect 2 and effect 5.
What is the best query for this?
It is important to me that the query runs as quickly as possible.
For this I can also change the table schema (if needed), for example by adding a cached field that contains all the effects separated by commas, like:
ReviewId cachedEffects
1 | ,2,5,8
2 | ,2,5,9,3,
3 | ,3,2,9
You can do it this way:
select reviewid
from
tbl
where effectid in (2,5)
group by reviewid
having count(distinct effectid) > 1
Demo
count(distinct effectid) ensures that the results contain only those reviewIDs that have rows with more than one distinct effectID. Because the where clause has already restricted the rows to effectID 2 or 5, more than one distinct value means that both 2 and 5 are present.
The key thing to note is that we group by reviewID and then check the count of distinct effectID values per group. Without the having clause, the query would return every review that has effectID equal to either 2 or 5.
For improving performance, you could create an index on reviewID.
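The same idea generalizes to any number of selected effects by comparing the distinct count with the number of requested values. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server; the table name review_effects and the helper reviews_with_all are hypothetical names introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for SQL Server (assumption)
conn.executescript("""
    CREATE TABLE review_effects (reviewid INTEGER, effectid INTEGER);
    INSERT INTO review_effects VALUES
        (1, 2), (1, 5), (1, 8),
        (2, 2), (2, 5), (2, 9), (2, 3),
        (3, 3), (3, 2), (3, 9);
""")

def reviews_with_all(conn, effect_ids):
    """Return the reviews that have every effect in effect_ids (hypothetical helper)."""
    wanted = sorted(set(effect_ids))
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    sql = (
        "SELECT reviewid FROM review_effects "
        f"WHERE effectid IN ({placeholders}) "
        "GROUP BY reviewid "
        "HAVING COUNT(DISTINCT effectid) = ? "  # all requested effects present
        "ORDER BY reviewid"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

print(reviews_with_all(conn, [2, 5]))
print(reviews_with_all(conn, [2, 3, 9]))
```

Only the placeholders are interpolated into the SQL text; the effect ids themselves are passed as bound parameters.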

SQL query to find list of primary keys not used

I am trying to make a drop down picker in an Access database to display all the primary keys not used, in this case a date that is limited to the first of the month.
I have 2 tables that are for this use
tblReport
pk date | Data for this record |
05/01/13 | stuff
06/01/13 | stuff
07/01/13 | stuff
08/01/13 | stuff
and
tblFutureDates
pk date | an index
05/01/13 | 1
06/01/13 | 2
07/01/13 | 3
08/01/13 | 4
09/01/13 | 5
10/01/13 | 6
11/01/13 | 7
12/01/13 | 8
I want a query that looks at these two tables and returns the dates that are in the second table that aren't in the first one. I have tried some joins but cannot figure it out. This is what I have thus far:
SELECT tblFutureDates.FutureDate
FROM tblFutureDates RIGHT JOIN tblReport
ON tblFutureDates.FutureDate = tblReport.ReportMonth;
and that returns:
05/01/13
06/01/13
07/01/13
08/01/13
Thanks
This selects dates from tblFutureDates that are NOT IN tblReport
SELECT tblFutureDates.FutureDate
FROM tblFutureDates
WHERE tblFutureDates.FutureDate
NOT IN (SELECT tblReport.ReportMonth FROM tblReport)
You can also use LEFT JOIN ... WHERE ... IS NULL or NOT EXISTS. For more information about all 3, see this post.
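For comparison, here is a sketch of all three anti-join forms on the sample data, using Python's sqlite3 module as a stand-in for Access (an assumption). Dates are stored as ISO text strings, which is also an assumption made to keep the example simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Access (assumption)
conn.executescript("""
    CREATE TABLE tblReport (ReportMonth TEXT PRIMARY KEY, Data TEXT);
    CREATE TABLE tblFutureDates (FutureDate TEXT PRIMARY KEY, Idx INTEGER);
    INSERT INTO tblReport VALUES
        ('2013-05-01', 'stuff'), ('2013-06-01', 'stuff'),
        ('2013-07-01', 'stuff'), ('2013-08-01', 'stuff');
    INSERT INTO tblFutureDates VALUES
        ('2013-05-01', 1), ('2013-06-01', 2), ('2013-07-01', 3), ('2013-08-01', 4),
        ('2013-09-01', 5), ('2013-10-01', 6), ('2013-11-01', 7), ('2013-12-01', 8);
""")

def dates(sql):
    # Flatten single-column result rows into a list of strings.
    return [row[0] for row in conn.execute(sql)]

not_in = dates("""
    SELECT f.FutureDate FROM tblFutureDates f
    WHERE f.FutureDate NOT IN (SELECT r.ReportMonth FROM tblReport r)
    ORDER BY f.FutureDate
""")

left_join = dates("""
    SELECT f.FutureDate FROM tblFutureDates f
    LEFT JOIN tblReport r ON f.FutureDate = r.ReportMonth
    WHERE r.ReportMonth IS NULL   -- keep only dates with no matching report
    ORDER BY f.FutureDate
""")

not_exists = dates("""
    SELECT f.FutureDate FROM tblFutureDates f
    WHERE NOT EXISTS (SELECT 1 FROM tblReport r WHERE r.ReportMonth = f.FutureDate)
    ORDER BY f.FutureDate
""")

print(not_in)
print(not_in == left_join == not_exists)
```

One difference worth knowing: if the subquery column could contain NULL, NOT IN would return no rows at all, while the other two forms would be unaffected. Here ReportMonth is a primary key, so the three agree.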

maximum and minimum number of tuples in natural join

I came across a question that states
Consider the following relation schema pertaining to a students
database: Student (rollno, name, address)
Enroll (rollno, courseno, coursename)
where the primary keys are shown underlined. The number of tuples in the
Student and Enroll tables are 120 and 8 respectively. What are the maximum
and minimum number of tuples that can be present in (Student * Enroll),
where '*' denotes natural join ?
I have seen several solutions on Internet like this or this
As per my understanding, the maximum should be 8 and the minimum should be 8 as well, since for each (rollno, courseno) there should be a rollno in Student. Can anyone help in this regard?
I hope you understand what a natural join exactly is. You can review it here.
If the tables R and S contain a common attribute and the value of that attribute is the same in every tuple of both tables, then the natural join will produce n*m tuples, as it returns all combinations of tuples.
Consider following two tables
Table R (With attributes A and C)
A | C
----+----
1 | 2
3 | 2
Table S (With attributes B and C)
B | C
----+----
4 | 2
5 | 2
6 | 2
Result of natural join R * S (If domain of attribute C in the two tables are same )
A | B | C
---+---+----
1 | 4 | 2
1 | 5 | 2
1 | 6 | 2
3 | 4 | 2
3 | 5 | 2
3 | 6 | 2
You can see both R and S contain the attribute C whose value is 2 in each and every tuple. Table R contains 2 tuples, Table S contains 3 tuples, where Result table contains 2*3=6 tuples.
Moreover, while performing a natural join, if there were no common attributes between the two relations, Natural join will behave as Cartesian Product. In that case, you'll obviously have m x n as maximum number of tuples.
Consider following two tables
Table R (With attributes A and B)
A | B
----+----
1 | 2
3 | 2
Table S (With attributes C and D)
C | D
----+----
4 | 2
5 | 2
Result of natural join R * S
A | B | C | D
---+---+----+----
1 | 2 | 4 | 2
1 | 2 | 5 | 2
3 | 2 | 4 | 2
3 | 2 | 5 | 2
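Both cases above can be reproduced with a short sketch using Python's sqlite3 module, whose NATURAL JOIN joins on all identically named columns and applies no condition when there are none. The second pair of tables is named R2 and S2 here only so that both pairs can live in one database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Case 1: common attribute C with the same value in every tuple
    CREATE TABLE R (A INTEGER, C INTEGER);
    CREATE TABLE S (B INTEGER, C INTEGER);
    INSERT INTO R VALUES (1, 2), (3, 2);
    INSERT INTO S VALUES (4, 2), (5, 2), (6, 2);

    -- Case 2: no common attribute (R2/S2 are renamed copies of the second example)
    CREATE TABLE R2 (A INTEGER, B INTEGER);
    CREATE TABLE S2 (C INTEGER, D INTEGER);
    INSERT INTO R2 VALUES (1, 2), (3, 2);
    INSERT INTO S2 VALUES (4, 2), (5, 2);
""")

common = conn.execute(
    "SELECT A, B, C FROM R NATURAL JOIN S ORDER BY A, B"
).fetchall()

no_common = conn.execute(
    "SELECT A, B, C, D FROM R2 NATURAL JOIN S2 ORDER BY A, C"
).fetchall()

print(len(common), len(no_common))
```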
Hope this helps.
If there was a referential constraint in place ensuring that every rollno in Enroll must also appear in Student then your answer of 8 for both minimum and maximum would be correct. The question doesn't actually mention any such constraint however. There's no need to assume that the RI constraint exists just because the rollno attribute appears in both tables. So the best answer is 0 minimum and 8 maximum. If it's a multiple-choice question and 0,8 isn't one of the given answers then answer 8,8 instead - and tell your teacher that the question is unclear.
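The two extremes can be illustrated with a sketch in Python's sqlite3 module. The helper natural_join_size is hypothetical, the generated names and addresses are filler data, and the keys (rollno for Student, (rollno, courseno) for Enroll) are an assumption, since the underlining from the original question is not visible here. No foreign key is declared, matching the question as stated.

```python
import sqlite3

def natural_join_size(enroll_rollnos):
    """Build 120 students and one Enroll row per given rollno; return |Student * Enroll|."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Student (rollno INTEGER PRIMARY KEY, name TEXT, address TEXT)"
    )
    # Assumed key: (rollno, courseno). No foreign key to Student is declared.
    conn.execute(
        "CREATE TABLE Enroll (rollno INTEGER, courseno INTEGER, coursename TEXT, "
        "PRIMARY KEY (rollno, courseno))"
    )
    conn.executemany(
        "INSERT INTO Student VALUES (?, ?, ?)",
        [(r, f"student{r}", "addr") for r in range(1, 121)],
    )
    conn.executemany(
        "INSERT INTO Enroll VALUES (?, ?, ?)",
        [(r, i, f"course{i}") for i, r in enumerate(enroll_rollnos)],
    )
    # rollno is the only column name the two tables share.
    return conn.execute("SELECT COUNT(*) FROM Student NATURAL JOIN Enroll").fetchone()[0]

# Every Enroll row refers to an existing student: each row matches exactly one student.
max_size = natural_join_size([1, 1, 2, 3, 4, 5, 6, 7])

# No Enroll row refers to an existing student (possible without a foreign key).
min_size = natural_join_size(range(1001, 1009))

print(max_size, min_size)
```

Because rollno is unique in Student, an Enroll row can match at most one student, so the join can never exceed 8 rows.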
If you are asking about the maximum number of tuples that could appear in the natural join of R and S in general,
then it is the size of the Cartesian product of the two relations.
Yes, the answer should be 8,8.
This is because rollno is the key of the Student table and (rollno, courseno) is the compound key of Enroll.
The relationship between the Student and Enroll tables is 1:M,
so the maximum number of tuples is the same as the size of the many side, i.e. 8.
The minimum number of tuples is 8 if a foreign key exists, otherwise 0.
So the answer is 8,8, assuming the foreign key exists.