SQL: Ambiguity on key fields - sql

I don't understand why the interpreter cannot handle the following:
SELECT id
FROM a
INNER JOIN b ON a.id = b.id
This query wil result in an error: Ambiguous column name 'id'
Which makes sense because the column in defined in multiple tables in my query. However, I clearly stated to only return the rows where the id's of both table are the same. So it wouldn't matter what table the id is from.
So just out of curiosity: Is there a reason why the interpreter demands a table for the field?
(My example is from SQLServer, not sure if other interpreters CAN handle this?)

Let's be clear about a few things. First, it is always a good idea to include table aliases when referring to columns. This makes the SQL easier to understand.
Second, you are assuming that because of the = in the on condition, the two fields are the same. This is not true. The values are the same.
For instance, one field could be int and the other float (I do not recommend using float for join keys, but it is allowed). What is the type of id? SQL wants to assign a type to all columns, and it is not clear what type to assign.
More common examples abound. One id might be a primary key and defined NOT NULL. The other might be a foreign keys and quite nullable. What is the nullability of just id?
In other words, SQL is doing the right thing. This is not about whether SQL can recognize something obvious, which sometimes it does. This is about a column being genuinely ambiguous and the SQL compiler not knowing how to define the result in the SELECT clause.

How do you exepect the interpreter to know which column to use ?
Since it doesn't have a real brain (sadly..!), you need to explicitly specify the table where you want the id from.
In this example it could be :
SELECT a.id
FROM a
INNER JOIN b ON a.id=b.id
Even if the id values are the same, the column still has to come from one of the tables which the interpreter cannot choose for you ;-)

The SELECT id, should be SELECT a.id since id is in both tables it does not know "which one" you referring to.

Related

How does the dot means in sql when the where clause is different

i express me
i saw recently this piece of code
SELECT a.id a.name FROM department
the issue is that 'departement' is a table name.
and 'a' is also a table name which contains id and name fileds.
why we use 2 differnt table name ? it is not the same thing... it's difficult to get brain for me, so explain me.
lower in the code there is a left join, maybe it can have a rapport ?
Like others have said, "a" is a table alias. It doesn't appear that "a" is the alias for the table: "departments". Aliases are assigned in the FROM statements and JOIN statements. If you can show us all of the code, we can tell you exactly what "a" represents.
In the meantime, I would suggest giving this a read: https://www.w3schools.com/sql/sql_alias.asp
No. It wasn't an alias. it was a foreign key in the table department. You lied.

Table field naming convention and SQL statements

I have a practical question regarding naming table fields in a database. For example, I have two tables:
student (id int; name varchar(30))
teacher (id int, s_id int; name varchar(30))
There are both 'id' and "name" in two tables. In SQL statement, it will be ambiguous for the two if no table names are prefixed. Two options:
use Table name as prefix of a field in SQL 'where' clause
use prefixed field names in tables so that no prefix will be used in 'where' clause.
Which one is better?
Without a doubt, go with option 1. This is valid sql in any type of database and considered the proper and most readable format. It's good habit to prefix the table name to a column, and very necessary when doing a join. The only exception to this I've most often seen is prefixing the id column with the table name, but I still wouldn't do that.
If you go with option 2, seasoned DBA's will probably point and laugh at you.
For further proof, see #2 here: https://www.periscopedata.com/blog/better-sql-schema.html
And here. Rule 1b - http://www.isbe.net/ILDS/pdf/SQL_server_standards.pdf
As TT mentions, you'll make your life much easier if you learn how to use an alias for the table name. It's as simple as using SomeTableNameThatsWayTooLong as long_table in your query, such as:
SELECT LT.Id FROM SomeTableNameThatsWayTooLong AS LT
For queries that aren't ad-hoc, you should always prefix every field with either the table name or table alias, even if the field name isn't ambiguous. This prevents the query from breaking later if someone adds a new column to one of the tables that introduces ambiguity.
So that would make "id" and "name" unambiguous. But I still recommend naming the primary key with something more specific than "id". In your example, I would use student_id and teacher_id. This helps prevent mistakes in joins. You will need more specific names anyway when you run into tables with more than one unique key, or multi-part keys.
It's worth thinking these things through, but in the end consistency may be the more important factor. I can deal with tables built around id instead of student_id, but I'm currently working with an inconsistent schema that uses all of the following: id, sid, systemid and specific names like taskid. That's the worst of both worlds.
I would use aliases rather than table names.
You can assign an alias to a table in a query, one that is shorter than the table name. That makes the query a lot more readable. Example:
SELECT
t.name AS teacher_name,
s.name AS student_name
FROM
teacher AS t
INNER JOIN student AS s ON
s.id=t.s_id;
You can of course use the table name if you don't use aliases, and that would be preferred over your option 2.
If it doesn't get too long, I prefer prefixing in the table themselves, e.g. teacher.teacher_id, student.student_name. That way, you are always sure which name or id your are talking about, even if you for get to prefix the table name.

How to select all fields in SQL joins without getting duplicate columns names?

Suppose I have one table A, with 10 fields. And Table B, with 5 fields.
B links to A via a column named "key", that exists both in A, and in B, with the same name ("key").
I am generating a generic piece of SQL, that queries from a main table A, and receives a table name parameter to join to, and select all A fields + B.
In this case, I will get all the 15 fields I want, or more precisely - 16, because I get "key" twice, once from A and once from B.
What I want is to get only 15 fields (all fields from the main table + the ones existing in the generic table), without getting "key" twice.
Of course I can explicit the fields I want in the SELECT itself, but that thwarts my very objective of building a generic SQL.
It really depends on which RDBMS you're using it against, and how you're assembling your dynamic SQL. For instance, if you're using Oracle and it's a PL/SQL procedure putting together your SQL, you're presumably querying USER_TAB_COLS or something like that. In that case, you could get your final list of columns names like
SELECT DISTINCT(column_name)
FROM user_tab_cols
WHERE table_name IN ('tableA', 'tableB');
but basically, we're going to need to know a lot more about how you're building your dynamic SQL.
Re-thinking about what I asked makes me conclude that this is not plausible. Selecting columns in a SELECT statement picks the columns we are interested in from the list of tables provided. In cases where the same column name exists in more than one of the tables involved, which are the cases my question is addressing, it would, ideally, be nice if the DB engine could return a unique list of fields - BUT - for that it would have to decide itself which column (and from which table) to choose, from all the matches - which is something that the DB cannot do, because it is solely dependent in the user's choice.

When is it required to give a table name an alias in SQL?

I noticed when doing a query with multiple JOINs that my query didn't work unless I gave one of the table names an alias.
Here's a simple example to explain the point:
This doesn't work:
SELECT subject
from items
join purchases on items.folder_id=purchases.item_id
join purchases on items.date=purchases.purchase_date
group by folder_id
This does:
SELECT subject
from items
join purchases on items.folder_id=purchases.item_id
join purchases as p on items.date=p.purchase_date
group by folder_id
Can someone explain this?
You are using the same table Purchases twice in the query. You need to differentiate them by giving a different name.
You need to give an alias:
When the same table name is referenced multiple times
Imagine two people having the exact same John Doe. If you call John, both will respond to your call. You can't give the same name to two people and assume that they will know who you are calling. Similarly, when you give the same resultset named exactly the same, SQL cannot identify which one to take values from. You need to give different names to distinguish the result sets so SQL engine doesn't get confused.
Script 1: t1 and t2 are the alias names here
SELECT t1.col2
FROM table1 t1
INNER JOIN table1 t2
ON t1.col1 = t2.col1
When there is a derived table/sub query output
If a person doesn't have a name, you call them and since you can't call that person, they won't respond to you. Similarly, when you generate a derived table output or sub query output, it is something unknown to the SQL engine and it won't what to call. So, you need to give a name to the derived output so that SQL engine can appropriately deal with that derived output.
Script 2: t1 is the alias name here.
SELECT col1
FROM
(
SELECT col1
FROM table1
) t1
The only time it is REQUIRED to provide an alias is when you reference the table multiple times and when you have derived outputs (sub-queries acting as tables) (thanks for catching that out Siva). This is so that you can get rid of ambiguities between which table reference to use in the rest of your query.
To elaborate further, in your example:
SELECT subject
from items
join purchases on items.folder_id=purchases.item_id
join purchases on items.date=purchases.purchase_date
group by folder_id
My assumption is that you feel that each join and its corresponding on will use the correlating table, however you can use whichever table reference you want. So, what happens is that when you say on items.date=purchases.purchase_date, the SQL engine gets confused as to whether you mean the first purchases table, or the second one.
By adding the alias, you now get rid of the ambiguities by being more explicit. The SQL engine can now say with 100% certainty which version of purchases that you want to use. If it has to guess between two equal choices, then it will always throw an error asking for you to be more explicit.
It is required to give them a name when the same table is used twice in a query. In your case, the query wouldn't know what table to choose purchases.purchase_date from.
In this case it's simply that you've specified purchases twice and the SQL engine needs to be able to refer to each dataset in the join in a unique way, hence the alias is needed.
As a side point, do you really need to join into purchases twice? Would this not work:
SELECT
subject
from
items
join purchases
on items.folder_id=purchases.item_id
and items.date=purchases.purchase_date
group by folder_id
The alias are necessary to disambiguate the table from which to get a column.
So, if the column's name is unique in the list of all possible columns available in the tables in the from list, then you can use the coulmn name directly.
If the column's name is repeated in several of the tables available in the from list, then the DB server has no way to guess which is the right table to get the column.
In your sample query all the columns names are duplicated because you're getting "two instances" of the same table (purchases), so the server needs to know from which of the instance to take the column. SO you must specify it.
In fact, I'd recommend you to always use an alias, unless there's a single table. This way you'll avoid lots of problems, and make the query much more clear to understand.
You can't use the same table name in the same query UNLESS it is aliased as something else to prevent an ambiguous join condition. That's why its not allowed. I should note, it's also better to use always qualify table.field or alias.field so other developers behind you don't have to guess which columns are coming from which tables.
When writing a query, YOU know what you are working with, but how about the person behind you in development. If someone is not used to what columns come from what table, it can be ambiguous to follow, especially out here at S/O. By always qualifying by using the table reference and field, or alias reference and field, its much easier to follow.
select
SomeField,
AnotherField
from
OneOfMyTables
Join SecondTable
on SomeID = SecondID
compare that to
select
T1.SomeField,
T2.AnotherField
from
OneOfMyTables T1
JOIN SecondTable T2
on T1.SomeID = T2.SecondID
In these two scenarios, which would you prefer reading... Notice, I've simplified the query using shorter aliases "T1" and "T2", but they could be anything, even an acronym or abbreviated alias of the table names... "oomt" (one of my tables) and "st" (second table). Or, as something super long as has been in other posts...
Select * from ContractPurchaseOffice_AgencyLookupTable
vs
Select * from ContractPurchaseOffice_AgencyLookupTable AgencyLkup
If you had to keep qualifying joins, or field columns, which would you prefer looking at.
Hope this clarifies your question.

Is using Null to represent a Value Bad Practice?

If I use null as a representation of everything in a database table is that bad practice ?
i.e.
I have the tables: myTable(ID) and myRelatedTable(ID,myTableID)
myRelatedTable.myTableID is a FK of myTable.ID
What I want to accomplish is: if myRelatedTable.myTableID is null then my business logic will interpret that as being linked to all myTable rows.
The reason I want to do this is because I have an uknown amount of rows that could be inserted into myTable after the myRelatedTable row is created and some rows in the myRelatedTable need to reference all existing rows in myTable.
I think you might agree that it would be bad to use the number 3 to represent a value other an 3.
By the same reasoning it is therefore a bad idea to use NULL to represent anything other than the absence of a value.
If you disagree and twist NULL to some other purpose, the maintenance programmers that come after you will not be grateful.
Not a good idea, because then you cannot use the "related to all entries" fact in SQL queries at all. At some point, you'll probably want/need to do this.
Ideally there should be no nulls at all. There should be another table to represent the relation.
If you are going to assign special meanings however NULL should only ever mean "not assigned" - ie no relationship exists, use negative numbers, ie -1 if you want to trigger some business layer trickery. It should be obvious to any developers that come across this in the future that -1 is an extraordinary value that should not be treated as normal.
I don't think NULL is the best way to do it but you might use a separate tinyInt column to indicate that the row in MyRelatedTable is related to everything in MyTable, e.g. MyRelatedTable.RelatedAll. That would make it more explicit for other that have to maintain it. Then you could do some sort of Union query e.g.
SELECT M.ID, R.ID AS RelatedTableID,....
FROM MyTable M INNER JOIN MyRelated Table R ON R.myTableId = M.Id
UNION
SELECT M.ID, R.ID AS RelatedTableID,....
FROM MyTable M, MyRelatedTable R
WHERE R.RelatedAll = 1
Yes, for the simple reason that NULL represents no value. Not a special value; not a blank value, but nothing.
If the foreign key is just a simple integer, and it's generated automatically, then you could use 0 to represent the "magic" value.
What you posted, namely that a NULL in a foreign key asserts a relationship with all the rows in the referenced table, is very non standard. Off the top of my head, I think it's fraught with dangers.
What most people who use NULLs in FKs mean by it is that it asserts a relationship to NONE of the rows in the referenced table. This is common in the case of optional relationships, ones that can occur zero times.
Example: We have an HR database, with a table called "EMPLOYEES". We have two columns, called "EmpID" and "SupervisorID". (Many people call the first column simply "ID"). Every employee in the table has an entry under SupervisorID with the sole exception of the CEO of the company. THe CEO has a NULL in the SupervisorID column, meaning that the CEO has no supervisor. The CEO is accountable to the BOD, but that isn't represented in SupervisorID.
What you might mean by a relationship with ALL the rows in the refernced table is this: There's a POSSIBLE relationship between the row in question and ANY ONE of the rows in the reference table. When you start to get into the questions of the facts that are true in the real world but unknown to the database you open a whole big can of worms.