SQL join basic questions

SQL join basic questions - sql

When I have to select a number of fields from different tables:
do I always need to join tables?
which tables do I need to join?
which fields do I have to use for the join/s?
do the joins effects reflect on fields specified in select clause or on where conditions?
Thanks in advance.

Think about joins as a way of creating a new table (just for the purposes of running the query) with data from several different sources. Absent a specific example to work with, let's imagine we have a database of cars which includes these two tables:
CREATE TABLE car (plate_number CHAR(8),
state_code CHAR(2),
make VARCHAR(128),
model VARCHAR(128),);
CREATE TABLE state (state_code CHAR(2),
state_name VARCHAR(128));
If you wanted, say, to get a list of the license plates of all the Hondas in the database, that information is already contained in the car table. You can simply SELECT * FROM car WHERE make='Honda';
Similarly, if you wanted a list of all the states beginning with "A" you can SELECT * FROM state WHERE state_name LIKE 'A%';
In either case, since you already have a table with the information you want, there's no need for a join.
You may even want a list of cars with Colorado plates, but if you already know that "CO" is the state code for Colorado you can SELECT * FROM car WHERE state_code='CO'; Once again, the information you need is all in one place, so there is no need for a join.
But suppose you want a list of Hondas including the name of the state where they're registered. That information is not already contained within a table in your database. You will need to "create" one via a join:
car INNER JOIN state ON (car.state_code = state.state_code)
Note that I've said absolutely nothing about what we're SELECTing. That's a separate question entirely. We also haven't applied a WHERE clause limiting what rows are included in the results. That too is a separate question. The only thing we're addressing with the join is getting data from two tables together. We now, in effect, have a new table called car INNER JOIN state with each row from car joined to each row in state that has the same state_code.
Now, from this new "table" we can apply some conditions and select some specific fields:
SELECT plate_number, make, model, state_name
FROM car
INNER JOIN state ON (car.state_code = state.state_code)
WHERE make = 'Honda'
So, to answer your questions more directly, do you always need to join tables? Yes, if you intend to select data from both of them. You cannot select fields from car that are not in the car table. You must first join in the other tables you need.
Which tables do you need to join? Whichever tables contain the data you're interested in querying.
Which fields do you have to use? Whichever fields are relevant. In this case, the relationship between cars and states is through the state_code field in both table. I could just as easily have written
car INNER JOIN state ON (state.state_code = car.plate_number)
This would, for each car, show any states whose abbreviations happen to match the car's license plate number. This is, of course, nonsensical and likely to find no results, but as far as your database is concerned it's perfectly valid. Only you know that state_code is what's relevant.
And does the join affect SELECTed fields or WHERE conditions? Not really. You can still select whatever fields you want and you can still limit the results to whichever rows you want. There are two caveats.
First, if you have the same column name in both tables (e.g., state_code) you cannot select it without clarifying which table you want it from. In this case I might write SELECT car.state_code ...
Second, when you're using an INNER JOIN (or on many database engines just a JOIN), only rows where your join conditions are met will be returned. So in my nonsensical example of looking for a state code that matches a car's license plate, there probably won't be any states that match. No rows will be returned. So while you can still use the WHERE clause however you'd like, if you have an INNER JOIN your results may already be limited by that condition.

Very broad question, i would suggest doing some reading on it first but in summary:
1. joins can make life much easier and queries faster, in a nut shell try to
2. the ones with the data you are looking for
3. a field that is in both tables and generally is unique in at least one
4. yes, essentially you are createing one larger table with joins. if there are two fields with the same name, you will need to reference them by table name.columnname

do I always need to join tables?
No - you could perform multiple selects if you wished
which tables do I need to join?
Any that you want the data from (and need to be related to each other)
which fields do I have to use for the
join/s?
Any that are the same in any tables within the join (usually primary key)
do the joins effects reflect on fields specified in select clause or on where conditions?
No, however outerjoins can cause problems

(1) what else but tables would you want to join in mySQL?
(2) those from which you want to correlate and retrieve fields (=data)
(3) best use indexed fields (unique identifiers) to join as this is fast. e.g. join
retrieve user-email and all the users comments in a 2 table db
(with tables: tableA=user_settings, tableB=comments) and both having the column uid to indetify the user by
select * from user_settings as uset join comments as c on uset.uid = c.uid where uset.email = "test#stackoverflow.com";
(4) both...

Related

How to understand this query?

SELECT DISTINCT
...
...
...
FROM Reviews Rev
INNER JOIN Reviews SubRev ON Subrev.W_ID=Rev.ID
WHERE Rev.Status='Approved'
This is a small part of a long query that I've been trying to understand for a day now. What is happening with the join? Reviews table appears to be joined with itself, under different aliases. Why is this done? What does it achieve? Also, ID field of the Reviews table is null for the entries that are nevertheless selected and returned. This is correct, but I don't understand how that can happen if the W_ID field is not null.

It allows you to join one row from the table to a different row in the table.
I've both seen this done, and used it myself, in cases where you maybe have a relationship between those rows.
Real-world examples:
An old version of a record and a newer version
Some sort of hierarchical relationship (e.g. if the table contains records of people, you can record that someone is a parent of someone else). There are probably plenty of other possible use cases, too.
SQL allows you to create a foreign key which relates between two different columns in the same table.

SQL 2 JOINS USING SINGLE REFERENCE TABLE

I'm trying to achieve 2 joins. If I run the 1st join alone it pulls 4 lots of results, which is correct. However when I add the 2nd join which queries the same reference table using the results from the select statement it pulls in additional results. Please see attached. The squared section should not be being returned
So I removed the 2nd join to try and explain better. See pic2. I'm trying to get another column which looks up InvolvedInternalID against the initial reference table IRIS.Practice.idvClient.

Your database is simply doing as you tell it. When you add in the second join (confusingly aliased as tb1 in a 3 table query) the database is finding matching rows that obey the predicate/truth statement in the ON part of the join
If you don't want those rows in there then one of two things must be the case:
1) The truth you specified in the ON clause is faulty; for example saying SELECT * FROM person INNER JOIN shoes ON person.age = shoes.size is faulty - two people with age 13 and two shoes with size 13 will produce 4 results, and shoe size has nothing to do with age anyway
2) There were rows in the table joined in that didn't apply to the results you were looking for, but you forgot to filter them out by putting some WHERE (or additional restriction in the ON) clause. Example, a table holds all historical data as well as current, and the current record is the one with a NULL in the DeletedOn column. If you forget to say WHERE deletedon IS NULL then your data will multiply as all the past rows that don't apply to your query are brought in
Don't alias tables with tbX, tbY etc.. Make the names meaningful! Not only do aliases like tbX have no relation to the original table name (so you encounter tbX, and then have to go searching the rest of the query to find where it's declared so you can say "ah, it's the addresses table") but in this case you join idvclient in twice, but give them unhelpful aliases like tb1, tb3 when really you should have aliased them with something that describes the relationship between them and the rest of the query tables
For example, ParentClient and SubClient or OriginatingClient/HandlingClient would be better names, if these tables are in some relationship with each other.
Whatever the purpose of joining this table in twice is, alias it in relation to the purpose. It may make what you've done wriong easier to spot, for example "oh, of course.. i'm missing a WHERE parentclient.type = 'parent'" (or WHERE handlingclient.handlingdate is not null etc..)
The first step to wisdom is by calling things their proper names

Reproduce certain rows in PostgreSQL

I have a table with IDs that represent Individuals (call it id_table) and another table with characteristics an individual can have (call it char_table).
Now I want to add a column to the id_table containing certain values from the char_table. The difficulty, for me, is, that some ID shall get only one characteristic (simple case) and some ID shall get several characteristics. For that reason the rows with these ID that get several characteristics must be reproduced so often until I have matched every characteristic to that ID.
For Example:
ID '001' shall get the characteristics 'a', 'b', and 'c'. So the row ID='001' must be reproduced 2 times (to get the same row 3 times) and each of these rows shall get one of these 3 characteristics.
I hope I explained intelligible enough.
Anyone an idea how to do this?
Thanks.

In almost all cases, you want to do this with an association or junction table. This table would like like:
create table IndividualCharacteristicsr (
individualId int not null,
characterId int not null
);
(It might also have a unique id itself.)
Each characteristic an individual has would be on a separate row. So, three characteristics mean three rows. A typical query using this information would join all three tables:
from id_table i left outer join
IndividualCharacteristicsr ic
on i.individualId = ic.individualId left outer join
char_table c
on c.characteristicId = ic.characteristicId
(The left outer join includes individuals with no characteristics.)
In Postgres, you could store the characteristics in a string or an array. Neither is very convenient for joining back to id_char. The better approach is an additional table.

When is it required to give a table name an alias in SQL?

I noticed when doing a query with multiple JOINs that my query didn't work unless I gave one of the table names an alias.
Here's a simple example to explain the point:
This doesn't work:
SELECT subject
from items
join purchases on items.folder_id=purchases.item_id
join purchases on items.date=purchases.purchase_date
group by folder_id
This does:
SELECT subject
from items
join purchases on items.folder_id=purchases.item_id
join purchases as p on items.date=p.purchase_date
group by folder_id
Can someone explain this?

You are using the same table Purchases twice in the query. You need to differentiate them by giving a different name.
You need to give an alias:
When the same table name is referenced multiple times
Imagine two people having the exact same John Doe. If you call John, both will respond to your call. You can't give the same name to two people and assume that they will know who you are calling. Similarly, when you give the same resultset named exactly the same, SQL cannot identify which one to take values from. You need to give different names to distinguish the result sets so SQL engine doesn't get confused.
Script 1: t1 and t2 are the alias names here
SELECT t1.col2
FROM table1 t1
INNER JOIN table1 t2
ON t1.col1 = t2.col1
When there is a derived table/sub query output
If a person doesn't have a name, you call them and since you can't call that person, they won't respond to you. Similarly, when you generate a derived table output or sub query output, it is something unknown to the SQL engine and it won't what to call. So, you need to give a name to the derived output so that SQL engine can appropriately deal with that derived output.
Script 2: t1 is the alias name here.
SELECT col1
FROM
(
SELECT col1
FROM table1
) t1

The only time it is REQUIRED to provide an alias is when you reference the table multiple times and when you have derived outputs (sub-queries acting as tables) (thanks for catching that out Siva). This is so that you can get rid of ambiguities between which table reference to use in the rest of your query.
To elaborate further, in your example:
SELECT subject
from items
join purchases on items.folder_id=purchases.item_id
join purchases on items.date=purchases.purchase_date
group by folder_id
My assumption is that you feel that each join and its corresponding on will use the correlating table, however you can use whichever table reference you want. So, what happens is that when you say on items.date=purchases.purchase_date, the SQL engine gets confused as to whether you mean the first purchases table, or the second one.
By adding the alias, you now get rid of the ambiguities by being more explicit. The SQL engine can now say with 100% certainty which version of purchases that you want to use. If it has to guess between two equal choices, then it will always throw an error asking for you to be more explicit.

It is required to give them a name when the same table is used twice in a query. In your case, the query wouldn't know what table to choose purchases.purchase_date from.

In this case it's simply that you've specified purchases twice and the SQL engine needs to be able to refer to each dataset in the join in a unique way, hence the alias is needed.
As a side point, do you really need to join into purchases twice? Would this not work:
SELECT
subject
from
items
join purchases
on items.folder_id=purchases.item_id
and items.date=purchases.purchase_date
group by folder_id

The alias are necessary to disambiguate the table from which to get a column.
So, if the column's name is unique in the list of all possible columns available in the tables in the from list, then you can use the coulmn name directly.
If the column's name is repeated in several of the tables available in the from list, then the DB server has no way to guess which is the right table to get the column.
In your sample query all the columns names are duplicated because you're getting "two instances" of the same table (purchases), so the server needs to know from which of the instance to take the column. SO you must specify it.
In fact, I'd recommend you to always use an alias, unless there's a single table. This way you'll avoid lots of problems, and make the query much more clear to understand.

You can't use the same table name in the same query UNLESS it is aliased as something else to prevent an ambiguous join condition. That's why its not allowed. I should note, it's also better to use always qualify table.field or alias.field so other developers behind you don't have to guess which columns are coming from which tables.
When writing a query, YOU know what you are working with, but how about the person behind you in development. If someone is not used to what columns come from what table, it can be ambiguous to follow, especially out here at S/O. By always qualifying by using the table reference and field, or alias reference and field, its much easier to follow.
select
SomeField,
AnotherField
from
OneOfMyTables
Join SecondTable
on SomeID = SecondID
compare that to
select
T1.SomeField,
T2.AnotherField
from
OneOfMyTables T1
JOIN SecondTable T2
on T1.SomeID = T2.SecondID
In these two scenarios, which would you prefer reading... Notice, I've simplified the query using shorter aliases "T1" and "T2", but they could be anything, even an acronym or abbreviated alias of the table names... "oomt" (one of my tables) and "st" (second table). Or, as something super long as has been in other posts...
Select * from ContractPurchaseOffice_AgencyLookupTable
vs
Select * from ContractPurchaseOffice_AgencyLookupTable AgencyLkup
If you had to keep qualifying joins, or field columns, which would you prefer looking at.
Hope this clarifies your question.

When is a good situation to use a full outer join?

I'm always discouraged from using one, but is there a circumstance when it's the best approach?

It's rare, but I have a few cases where it's used. Typically in exception reports or ETL or other very peculiar situations where both sides have data you are trying to combine.
The alternative is to use an INNER JOIN, a LEFT JOIN (with right side IS NULL) and a RIGHT JOIN (with left side IS NULL) and do a UNION - sometimes this approach is better because you can customize each individual join more obviously (and add a derived column to indicate which side is found or whether it's found in both and which one is going to win).

I noticed that the wikipedia page provides an example.
For example, this allows us to see
each employee who is in a department
and each department that has an
employee, but also see each employee
who is not part of a department and
each department which doesn't have an
employee.
Note that I never encountered the need of a full outer join in practice...

I've used full outer joins when attempting to find mismatched, orphaned data, from both of my tables and wanted all of my result set, not just matches.

Just today I had to use Full Outer Join. It is handy in situations where you're comparing two tables. For example, the two tables I was comparing were from different systems so I wanted to get following information:
Table A has any rows that are not in Table B
Table B has any rows that are not in Table A
Duplicates in either Table A or Table B
For matching rows whether values are different (Example: The table A and Table B both have Acct# 12345, LoanID abc123, but Interest Rate or Loan Amount is different
In addition, I created an additional field in SELECT statement that uses a CASE statement to 'comment' why I am flagging this row. Example: Interest Rate does not match / The Acct doesn't exist in System A, etc.
Then saved it as a view. Now, I can use this view to either create a report and send it to users for data correction/entry or use it to pull specific population by 'comment' field I created using a CASE statement (example: all records with non-matching interest rates) in my stored procedure and automate correction, etc.
If you want to see an example, let me know.

The rare times i have used it has been around testing for NULLs on both sides of the join in case i think data is missing from the initial INNER JOIN used in the SQL i'm testing on.

They're handy for finding orphaned data but I rarely use then in production code. I wouldn't be "always discouraged from using one" but I think in the real world they are less frequently the best solution compared to inners and left/right outers.

In the rare times that I used Full Outer Join it was for data analysis and comparison purpose such as when comparing two customers tables from different databases to find out duplicates in each table or to compare the two tables structures, or to find out null values in one table compared to the other, or finding missing information in one tables compared to the other.

For example, suppose you have two tables: one containing customer data and another containing order data. A full outer join would allow you to see all customers and all orders, even if some customers have no orders or some orders have no corresponding customer. This can help you identify any gaps in the data and ensure that all relevant information is included in the result set.
It's important to note that a full outer join can produce a huge result set since it includes all rows from both tables. This can be inefficient in terms of performance, so it's best to use a full outer join only when it is necessary to include all rows from both tables.
SELECT *
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;
This will return all rows from both table1 and table2, filling in NULL values for missing matches on either side.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SQL join basic questions - sql

When I have to select a number of fields from different tables: do I always need to join tables? which tables do I need to join? which fields do I have to use for the join/s? do the joins effects reflect on fields specified in select clause or on where conditions? Thanks in advance.

Related

How to understand this query?

SQL 2 JOINS USING SINGLE REFERENCE TABLE

Reproduce certain rows in PostgreSQL

When is it required to give a table name an alias in SQL?

When is a good situation to use a full outer join?

Categories

Resources