SQL Table Transformation Course on Code Academy

This question is from the SQL table transformation course from Code Academy. I am curious to know the difference between the following 2 queries and why they returned different result sets:

The issue here is that airports.code may have duplicates. In this case, joining the flights table to the airports table could produce duplicated rows, as one record from flights could match multiple records in airports.
If the field airports.code were distinct, i.e. there were no duplicates in that column, then both queries would have returned the same number of results. Consider the following sample data:
flights:
origin
1
2
3
airports:
code
1
1
2
3
It should be clear that the WHERE IN query (the first one) would return only three records, one for each origin value. But the second query with the join would actually return four records, since origin=1 would match twice to code=1.
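The sample data above can be checked with a small script. This is only a sketch: the two queries from the question are not shown here, so the WHERE IN and JOIN statements below are assumed shapes, and SQLite (through Python's sqlite3 module) stands in for whatever database the course uses.

```python
import sqlite3

# In-memory database holding the sample data from the answer above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flights (origin INTEGER);
    CREATE TABLE airports (code INTEGER);
    INSERT INTO flights (origin) VALUES (1), (2), (3);
    INSERT INTO airports (code) VALUES (1), (1), (2), (3);
""")

# Assumed shape of the first query: filter flights with WHERE ... IN.
in_rows = conn.execute(
    "SELECT origin FROM flights "
    "WHERE origin IN (SELECT code FROM airports) ORDER BY origin"
).fetchall()

# Assumed shape of the second query: join flights to airports.
join_rows = conn.execute(
    "SELECT f.origin FROM flights f "
    "JOIN airports a ON f.origin = a.code ORDER BY f.origin"
).fetchall()

print(len(in_rows), in_rows)      # 3 rows, one per origin
print(len(join_rows), join_rows)  # 4 rows, origin 1 appears twice
```

The IN subquery only asks whether a match exists, so duplicates in airports.code cannot multiply the flights rows, while the join emits one row per matching pair.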

Related

SQL 2 JOINS USING SINGLE REFERENCE TABLE

I'm trying to achieve 2 joins. If I run the 1st join alone it pulls 4 lots of results, which is correct. However, when I add the 2nd join, which queries the same reference table using the results from the select statement, it pulls in additional results. Please see attached. The squared section should not be returned.
So I removed the 2nd join to try and explain better. See pic2. I'm trying to get another column which looks up InvolvedInternalID against the initial reference table IRIS.Practice.idvClient.
Your database is simply doing what you tell it. When you add the second join (confusingly aliased as tb1 in a three-table query), the database finds every row that satisfies the predicate in the ON part of the join.
If you don't want those rows in the result, then one of two things must be the case:
1) The predicate you specified in the ON clause is faulty. For example, SELECT * FROM person INNER JOIN shoes ON person.age = shoes.size is faulty: two people aged 13 and two shoes of size 13 will produce 4 results, and shoe size has nothing to do with age anyway.
2) The joined table contains rows that don't apply to the results you are looking for, and you forgot to filter them out with a WHERE clause (or an additional restriction in the ON clause). For example, a table holds all historical data as well as current data, and the current record is the one with a NULL in the DeletedOn column. If you forget to say WHERE deletedon IS NULL, your data will multiply as all the past rows that don't apply to your query are brought in.
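Both cases can be reproduced on toy data. The sketch below uses SQLite via Python's sqlite3 module. The person and shoes tables follow the example in the answer, while the customer and address tables, their columns, and all the row values are hypothetical, invented only to illustrate the DeletedOn case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (name TEXT, age INTEGER);
    CREATE TABLE shoes (model TEXT, size INTEGER);
    INSERT INTO person VALUES ('Ann', 13), ('Bob', 13);
    INSERT INTO shoes VALUES ('boot', 13), ('sandal', 13);

    -- Hypothetical history table: the current row has deletedon IS NULL.
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE address (customer_id INTEGER, city TEXT, deletedon TEXT);
    INSERT INTO customer VALUES (1, 'Ann');
    INSERT INTO address VALUES
        (1, 'Leeds', '2019-01-01'),
        (1, 'York', '2021-06-30'),
        (1, 'Bath', NULL);
""")

# Case 1: a faulty ON predicate pairs every 13-year-old with every size-13 shoe.
faulty = conn.execute(
    "SELECT * FROM person INNER JOIN shoes ON person.age = shoes.size"
).fetchall()

# Case 2: without a filter, historical rows multiply the result.
unfiltered = conn.execute(
    "SELECT c.name, a.city FROM customer c "
    "JOIN address a ON a.customer_id = c.id"
).fetchall()
filtered = conn.execute(
    "SELECT c.name, a.city FROM customer c "
    "JOIN address a ON a.customer_id = c.id "
    "WHERE a.deletedon IS NULL"
).fetchall()

print(len(faulty))      # 4
print(len(unfiltered))  # 3
print(filtered)         # [('Ann', 'Bath')]
```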
Don't alias tables with tbX, tbY, etc. Make the names meaningful! Aliases like tbX have no relation to the original table name, so when you encounter tbX you have to search the rest of the query to find where it's declared before you can say "ah, it's the addresses table". Worse, in this case you join idvclient in twice but give the two copies unhelpful aliases like tb1 and tb3, when really you should have aliased them with something that describes their relationship to the rest of the tables in the query.
For example, ParentClient and SubClient, or OriginatingClient and HandlingClient, would be better names if these tables are in some such relationship with each other.
Whatever the purpose of joining this table in twice is, alias it in relation to that purpose. It may make what you've done wrong easier to spot, for example "oh, of course, I'm missing a WHERE parentclient.type = 'parent'" (or WHERE handlingclient.handlingdate IS NOT NULL, etc.).
The first step to wisdom is calling things by their proper names.
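As an illustration of the aliasing advice, here is a minimal sketch of one reference table joined twice under role-based aliases. The real schema of IRIS.Practice.idvClient is not shown in the question, so the client and involvement tables, their columns, and the alias names owning_client and involved_client are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the reference table and the table
    -- that points at it twice.
    CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE involvement (client_id INTEGER, involved_client_id INTEGER);
    INSERT INTO client VALUES (1, 'Acme Ltd'), (2, 'J. Smith');
    INSERT INTO involvement VALUES (1, 2);
""")

# Each alias states the role the reference table plays in that join.
rows = conn.execute("""
    SELECT owning_client.name, involved_client.name
    FROM involvement
    JOIN client AS owning_client
      ON owning_client.id = involvement.client_id
    JOIN client AS involved_client
      ON involved_client.id = involvement.involved_client_id
""").fetchall()

print(rows)  # [('Acme Ltd', 'J. Smith')]
```

With aliases like these, a wrong ON clause (for example, joining involved_client on client_id) tends to stand out when the query is read back.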

SQL JOIN OPTIMIZATION

I am working on a generalized problem where I am given only the schema definitions of multiple tables.
I have to retrieve certain columns by joining tables such that the number of joins is minimized.
Example: suppose I have 3 tables with the following columns.
Table 1: (1,2,3,4,5),
Table 2: (5,6,7),
Table 3: (5,6,7,8)
Now suppose I have a query in which I want all the columns 1,2,3,4,5,6,7,8.
I can join either table 1, table 2 and table 3, or
table 1 and table 3. I would get the required information in both cases, but joining table 1 and table 3 requires only 1 join rather than the 2 joins needed in the other case.
I was trying a greedy algorithm: first pick the table that has the maximum number of required columns, then remove the columns it covers from both the query's required set and the remaining tables, then repeat with the updated required columns and tables, and so on. But I guess it would be slow.
So is there a generalized algorithm, or can anyone give me a hint in this direction?
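The greedy idea described in the question can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive answer: the function name greedy_table_cover is made up, columns are modelled as plain sets, and the sketch ignores whether the chosen tables actually share a column to join on. Picking the fewest tables that cover a set of columns is an instance of the set cover problem, which is NP-hard in general, so the greedy choice is a heuristic that is not guaranteed to find the minimum.

```python
def greedy_table_cover(tables, required):
    """Pick tables greedily until every required column is covered.

    tables: dict mapping table name -> set of column names
    required: set of column names the query needs
    Returns the chosen table names in pick order.
    """
    remaining = set(required)
    chosen = []
    while remaining:
        # Pick the table that covers the most still-missing columns.
        best = max(tables, key=lambda name: len(tables[name] & remaining))
        gained = tables[best] & remaining
        if not gained:
            raise ValueError(f"no table provides columns {sorted(remaining)}")
        chosen.append(best)
        remaining -= gained
    return chosen


# The example from the question.
tables = {
    "table1": {1, 2, 3, 4, 5},
    "table2": {5, 6, 7},
    "table3": {5, 6, 7, 8},
}
picked = greedy_table_cover(tables, {1, 2, 3, 4, 5, 6, 7, 8})
print(picked)  # ['table1', 'table3']
```

Each round scans every table once, so the loop runs at most once per chosen table, which is cheap for schema-sized inputs.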
First of all, I have to mention that this is not a "join" but a "union".
Also, if you want to use the greedy algorithm, you should first join the two shortest tables: joining a table twice is O(n) each time, so you have about 2n operations to do, and it is better for n to be as small as possible.
Besides these points, the following link may be useful to you:
Merging 3 tables/queries using MS Access Union Query

Relations between 3 levels and a result table

I have 3 tables that work, let's say, as levels for this purpose. Each of them has 2 columns, id and name. Combined, they produce possibilities that match the rows of table 4.
How can I create the relationships between the first set (3 tables) and the last one with the possible results of the combination?
I did this in the past with just 2 tables; back then I created a third one with 2 fields, both FKs against the original tables. But this time I have a set of 3 tables to match with a fourth, and that's what makes me wonder.
Should I simply create a 5th table with 4 fields holding 4 FKs, or is there another way?
Use a 5th table as an assignment table; then, using a nested query with joins, you can access the data in your results table.
Note: the 5th table would have 3 FK columns linked to the 3 other tables, and a 4th column for a row id.
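One possible layout for such an assignment table is sketched below with SQLite via Python's sqlite3 module. All table and column names (level1, level2, level3, result, assignment) are hypothetical. The answer does not say how a row of the assignment table is tied to the results table, so the result_id column here is an assumption borrowed from the question's "4 FKs" idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical names for the three level tables and the results table.
    CREATE TABLE level1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE level2 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE level3 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE result (id INTEGER PRIMARY KEY, name TEXT);

    -- Assignment table: a row id plus one FK per level.
    -- result_id is an assumption taken from the question's "4 FKs" idea.
    CREATE TABLE assignment (
        id INTEGER PRIMARY KEY,
        level1_id INTEGER NOT NULL REFERENCES level1(id),
        level2_id INTEGER NOT NULL REFERENCES level2(id),
        level3_id INTEGER NOT NULL REFERENCES level3(id),
        result_id INTEGER NOT NULL REFERENCES result(id),
        UNIQUE (level1_id, level2_id, level3_id)
    );

    INSERT INTO level1 VALUES (1, 'A');
    INSERT INTO level2 VALUES (1, 'B');
    INSERT INTO level3 VALUES (1, 'C');
    INSERT INTO result VALUES (1, 'outcome ABC');
    INSERT INTO assignment VALUES (1, 1, 1, 1, 1);
""")

# Resolve one combination of levels to its result through the assignment table.
row = conn.execute("""
    SELECT l1.name, l2.name, l3.name, r.name
    FROM assignment a
    JOIN level1 l1 ON l1.id = a.level1_id
    JOIN level2 l2 ON l2.id = a.level2_id
    JOIN level3 l3 ON l3.id = a.level3_id
    JOIN result r ON r.id = a.result_id
""").fetchone()

print(row)  # ('A', 'B', 'C', 'outcome ABC')
```

The UNIQUE constraint is a design choice in this sketch: it ensures each combination of the three levels maps to at most one result.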

Why am I losing table data from adding an additional join statement in SQL Server?

So I have table X with let's say 50 rows of data in it and it is being joined to another table with only 2 rows in it. When I "normal" JOIN them together only the overlapping data will show, i.e. the two rows of data that are found in the larger collection of 50. In my case I want all 50 rows to persist so I use a LEFT JOIN and I am returned all 50 rows like planned.
Now I want to start joining in other tables to get additional data about these rows and have it displayed together; whenever there is no data, I am fine with getting NULL. The new JOIN I'm adding is only going to find data for those two rows and not for the other 48, which is perfectly fine, and I would like NULL to be displayed where no match is found. My problem is that rather than displaying NULL, it removes the 48 rows with partial NULL data in their columns entirely and only shows the two that match. Why?
I can provide code if needed, I thought this may be easier to understand.
If you inner join the third table, there are only 2 rows that can match (I assume you don't match on nulls). Only 2 rows from the second table have non-null values in them, so only those 2 can produce a result if you do an inner join with the third table.
You will need another left join in this case. This will return all the rows from the second table, including those that are all null (as the result of the first left join).
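The behaviour described above can be reproduced on a small scale. The sketch below uses SQLite via Python's sqlite3 module rather than SQL Server, and the tables x, y and z with their columns are hypothetical, since the original code was not posted. It uses 5 rows instead of 50 to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: x has many rows, y and z only match two of them.
    CREATE TABLE x (id INTEGER PRIMARY KEY);
    CREATE TABLE y (id INTEGER PRIMARY KEY, x_id INTEGER);
    CREATE TABLE z (y_id INTEGER, info TEXT);
    INSERT INTO x VALUES (1), (2), (3), (4), (5);
    INSERT INTO y VALUES (10, 1), (20, 2);
    INSERT INTO z VALUES (10, 'first'), (20, 'second');
""")

# LEFT JOIN followed by an inner JOIN: rows where y.id is NULL cannot match z,
# so the inner join drops them again.
inner_rows = conn.execute("""
    SELECT x.id, y.id, z.info
    FROM x
    LEFT JOIN y ON y.x_id = x.id
    JOIN z ON z.y_id = y.id
""").fetchall()

# LEFT JOIN all the way down keeps every row of x.
left_rows = conn.execute("""
    SELECT x.id, y.id, z.info
    FROM x
    LEFT JOIN y ON y.x_id = x.id
    LEFT JOIN z ON z.y_id = y.id
    ORDER BY x.id
""").fetchall()

print(len(inner_rows))  # 2
print(len(left_rows))   # 5
print(left_rows[-1])    # (5, None, None)
```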

Is there any reason this simple SQL query should be so slow?

This query takes about a minute to give results:
SELECT MAX(d.docket_id), MAX(cus.docket_id) FROM docket d, Cashup_Sessions cus
Yet this one:
SELECT MAX(d.docket_id) FROM docket d UNION SELECT MAX(cus.docket_id) FROM Cashup_Sessions cus
gives its results instantly. I can't see what the first one is doing that would take so much longer - I mean they both simply check the same two lists of numbers for the greatest one and return them. What else could it be doing that I can't see?
I'm using jet SQL on an MS Access database via Java.
The first one is doing a cross join between the 2 tables, while the second one is not.
That's all there is to it.
The first one uses a Cartesian product to form its source data, which means that every row from the first table is paired with every row from the second one. After that, it scans that source to find the max values of the columns.
The second doesn't join the tables. It just finds the max from the first table and the max from the second table, and then returns two rows.
The first query makes a cross join between the tables before getting the maximums, which means that each record in one table is joined with every record in the other table.
If you have two tables with 1000 items each, you get a result with 1,000,000 items to go through to find the maximums.
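The size of the Cartesian product is easy to see on generated data. This sketch uses SQLite via Python's sqlite3 module, not Jet SQL, so timings will not carry over, but the row counts illustrate the same point. The table and column names come from the question, while the row counts (200 and 300) are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docket (docket_id INTEGER)")
conn.execute("CREATE TABLE Cashup_Sessions (docket_id INTEGER)")
conn.executemany("INSERT INTO docket VALUES (?)", [(i,) for i in range(1, 201)])
conn.executemany(
    "INSERT INTO Cashup_Sessions VALUES (?)", [(i,) for i in range(1, 301)]
)

# The comma in FROM is a cross join: 200 * 300 row pairs.
paired = conn.execute(
    "SELECT COUNT(*) FROM docket d, Cashup_Sessions cus"
).fetchone()[0]

# Same answer as the separate MAX queries, but computed over the row pairs.
cross_max = conn.execute(
    "SELECT MAX(d.docket_id), MAX(cus.docket_id) "
    "FROM docket d, Cashup_Sessions cus"
).fetchone()

# Each table is aggregated on its own; no row pairs are formed.
union_max = conn.execute(
    "SELECT MAX(d.docket_id) FROM docket d "
    "UNION SELECT MAX(cus.docket_id) FROM Cashup_Sessions cus"
).fetchall()

print(paired)     # 60000
print(cross_max)  # (200, 300)
print(union_max)  # two rows, one maximum per table
```

Note that UNION removes duplicates, so if both tables happened to have the same maximum the second form would return a single row.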