Why can we join on a nonconstraint? - sql

I'm working with a poorly designed database. Most tables have a PK, but no FKs exist, which makes it difficult to visualize the relationships between the tables. This leads me to wonder: why does SQL let us join tables with no constraint? If I am going to join Table A to Table B on EmployeeId, shouldn't the knee-jerk reaction be to set up an FK?

Because constraints are optional.
I agree that not using constraints is usually a bad idea, but in general, one doesn't have to have constraints for a valid database design to be created.
Some applications require that no constraints are used, say for a hard delete of a "current" record that still has an audit trail table (which some reports will join on).
In such a scenario, not being able to join would hamper flexibility and the ability to use the DB.
This of course leaves aside the fact that many databases are created by people (business or developers) who don't know about or care about referential integrity.

In some cases you might join on dates and the like, where you don't want to set up an FK or where it might be impossible. It's your decision whether to define an FK or not.

The philosophy of SQL is that the person writing the query just needs to know the tables and columns; they don't need to know any of the indexes or table relationships. The database is supposed to be smart enough to figure out how to perform the query efficiently.
For ad-hoc queries this is very nice.

Foreign keys affect performance, since each time you insert a record the database needs to ensure that the constraint is not being violated. In high-performance systems, the database would often have FKs in the development and QA environments but not in production, to speed up the inserts. The difference in performance can be huge if a lot of inserts are being performed and the database tables are large.
This is only one of many reasons SQL allows you to join without constraints.

You may join tables that are not directly related, for example:
Numbers table
Cross joins
Table to grandparent, bypassing the parent table
...

GASP! You can even join on nothing!
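To illustrate the numbers-table case from the list above, here is a minimal sketch. It assumes SQLite driven through Python's sqlite3 module, and the `numbers` and `shifts` tables are made up for the example. No FK exists between them, and the join condition is a range rather than a key match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical helper table holding the integers 1..5.
    CREATE TABLE numbers (n INTEGER PRIMARY KEY);
    INSERT INTO numbers VALUES (1), (2), (3), (4), (5);

    -- Hypothetical table with no declared relationship to numbers.
    CREATE TABLE shifts (employee TEXT, first_day INTEGER, last_day INTEGER);
    INSERT INTO shifts VALUES ('ann', 1, 3), ('ben', 4, 5);
""")

# Expand each shift into one row per day by joining on a range condition.
days = conn.execute("""
    SELECT s.employee, n.n AS day
    FROM shifts AS s
    INNER JOIN numbers AS n ON n.n BETWEEN s.first_day AND s.last_day
    ORDER BY s.employee, n.n
""").fetchall()

print(days)
```

A foreign key could not express this relationship at all, which is one reason the join syntax is independent of constraints.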

Related

Why is it necessary to explicitly specify the foreign keys when joining tables in SQL?

SQL Server is aware of table dependencies based on foreign keys, so why is it necessary to explicitly specify a JOIN ON the foreign keys?
Real world working example (This query works):
SELECT * FROM users
INNER JOIN roles ON users.role_id=roles.id
Implicit example (This query doesn't work):
SELECT * FROM users
INNER JOIN roles
Shouldn't SQL implicitly and correctly assume that if no ON keyword is specified joining should be done on the foreign keys?
I understand that the benefit of this may be trivial, but after leveraging this feature in SQL APIs such as Java Hibernate's query language I can't see why this wouldn't be built in to SQL.
EDIT
Thanks for the answers so far. Although they are interesting, none of them answer the original question regarding SQL Server.
SQL does sort-of support this notion. The standard includes natural join, which SQL Server has happily not implemented. This allows you to do:
SELECT *
FROM users u NATURAL JOIN
roles r;
A natural join has no on clause.
Alas, it does something slightly different from what you suggest. Instead of using foreign keys, it simply uses columns with the same name. I consider this an abomination, because SQL does have explicit foreign key declarations and this would be the right place to use them.
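A small sketch of that difference, assuming SQLite (which does implement NATURAL JOIN) run through Python's sqlite3 module. The `login` and `title` columns and the sample rows are invented for the example. The only column name the two tables share is `id`, so the natural join matches users.id to roles.id and ignores the declared foreign key on role_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE roles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE users (
        id      INTEGER PRIMARY KEY,
        login   TEXT,
        role_id INTEGER REFERENCES roles(id)   -- the declared FK
    );
    INSERT INTO roles VALUES (1, 'admin'), (2, 'guest');
    INSERT INTO users VALUES (1, 'alice', 2), (2, 'bob', 2), (3, 'carol', 1);
""")

# Explicit join on the foreign key: the result the asker expects.
explicit = conn.execute("""
    SELECT users.login, roles.title
    FROM users
    INNER JOIN roles ON users.role_id = roles.id
    ORDER BY users.login
""").fetchall()

# Natural join: matches on the shared column name "id", not on the FK.
natural = conn.execute("""
    SELECT login, title
    FROM users NATURAL JOIN roles
    ORDER BY login
""").fetchall()

print(explicit)
print(natural)
```

With this data the natural join silently gives alice the wrong role and drops carol, even though the foreign key is declared.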
It's quite normal for tables to have multiple foreign key relationships between them.
In this case, what would you expect the database to do when there are two FKs between two tables? Pick one at random?
Typical example:
A table CLIENT.
A table PURCHASE. Since a purchase happens between two clients it needs two FKs to CLIENT: seller_id, and buyer_id, both pointing to CLIENT.
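A sketch of the CLIENT/PURCHASE example, assuming SQLite via Python's sqlite3 module. The `name` and `item` columns and the sample rows are hypothetical. With two FKs pointing at the same table, only an explicit ON clause can say which one each join should use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (
        id        INTEGER PRIMARY KEY,
        seller_id INTEGER REFERENCES client(id),
        buyer_id  INTEGER REFERENCES client(id),
        item      TEXT                      -- hypothetical column
    );
    INSERT INTO client VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO purchase VALUES (10, 1, 2, 'bike');
""")

# client is joined twice, once per foreign key, under different aliases.
rows = conn.execute("""
    SELECT p.item, s.name AS seller, b.name AS buyer
    FROM purchase AS p
    INNER JOIN client AS s ON p.seller_id = s.id
    INNER JOIN client AS b ON p.buyer_id  = b.id
""").fetchall()

print(rows)
```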
Every time there are implicit operations happening, it becomes more difficult to do exactly what you want. Having a natural join seems fast at first, but if you have to debug it then you will have to go figure out exactly how the join worked. Was it using foreign keys? Was it just columns with the same name? I'm using NHibernate right now, and it seems everyone on my team has a much harder time getting it to do exactly what is needed.

What is happening under the hood when a relationship is established between tables?

This question is not limited to Power BI, but it will help me explain my problem.
If you have more than one table in Power BI, you can establish a relationship between them by dragging a column from one table to the other like this:
And you can edit that relationship by clicking the line that appears:
And by the way, here are the structures of the two tables:
# Table1
A,B
1,abc
2,def
3,ghi
4,jkl
# Table2
A,C
1,abc
1,def
2,ghi
3,ghit
This works fine since column A in Table1 consists of unique values and can work as a primary key. Now you can head over to the Report tab, set up two tables, and slice and dice to your heart's desire, either by clicking directly under A in Table1 or by introducing a slicer:
But the thing is that you can do that without having established a relationship between the tables. Delete the relationship under Relationships, go back to Report, and select Home > Manage Relationships to see what I mean:
As the dialog box says, 'There are no relationships defined yet.' But you can still subset one table by making selections in the other just like before (EDIT: This statement has been proven wrong in the answer from RADO). I do know that you can highlight the slicer and select Format > Edit Interactions and deselect the tables associated with the slicer. But I'm still puzzled by the whole thing.
So is there something happening under the hood here that I'm not aware of? Or is the relationship between tables really defined by the very contents of the tables, in the sense that the existence of related values across tables, together with a potential primary key (be it natural or synthetic), makes it possible to query them using SQL, dplyr verbs, or any other querying technique? And that you really do not need an explicitly defined relationship?
Or put in another way, does the establishment of a Power BI table relationship have a SQL equivalent? Perhaps like the following:
CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
PRIMARY KEY (ID)
);
I'm sorry if I'm rambling a bit here, but I'm just very confused. And googling has so far only added to the confusion. So thank you for any insights!
Your statement "But you can still subset one table by making selections in the other just like before" is not correct. This is a key issue here.
Relations enable propagation of filter context in Power BI. That's a very loaded phrase, and you will have to learn what it means if you plan to use Power BI. It's the most important concept to understand.
To see what I mean, you will need to write DAX measures and try to manipulate them using your tables. You will immediately see the difference when you have or don't have relations.
How the whole system works (simplified):
PowerBI contains a language called "DAX". You will create measures in DAX, and PowerBI will then translate them into its internal language called xmSQL, which is a special flavor of SQL. In xmSQL, a regular relationship is translated into a LEFT OUTER JOIN, like this:
SELECT SUM(Sales.Amount)
FROM Sales
LEFT OUTER JOIN Customer
ON Sales.Customer_Key = Customer.Customer_Key
Bidirectional relations are a bit more complex, but conceptually similar.
Overall, when you create relations between tables, you are telling PowerBI engine how to join the tables. The engine then also adds some optimizations to speed up the queries.
Every time you execute a DAX measure, click a slicer or a visual, PowerBI generates multiple xmSQL statements in the background, executes them, and then renders their results as visuals. You can see these SQL queries with some tools such as DAX Studio.
Note that it's not strictly necessary to establish relations between tables in PowerBI. You can imitate the same behavior using DAX (programmatically), but such "virtual" relations are more complex and can be substantially slower.
In the RM (relational model) & ERM (entity-relationship model) tables represent relation(ship)s/association. Hence, relational in "RM" & relationship in "ERM".
FKs (foreign keys) get erroneously called "relationships" in pseudo-ERM methods. A SQL FK constraint says subrows appear elsewhere as PK (primary key) or UNIQUE. A DBMS uses them to disallow invalid updates & to optimize queries.
Power BI "relationships" are not FKs. They are instructions on how to build queries.
When there is a FK we do often want to join on it. So we often want a Power BI relationship when there is a FK.
Create and manage relationships in Power BI Desktop
(See also its Download PDF link for Developer.)
PS We do not need constraints to hold or be declared or be known to query. The constraints (Including PKs, FKs, UNIQUE & cardinalities) are determined by the table meanings--(characteristic) predicates--& what business situations can arise. If constraints hold then we just sometimes get fewer rows than otherwise & some query pairs always return the same results when otherwise they wouldn't.
Foreign keys are not needed to join tables!
Is there any rule of thumb to construct SQL query from a human-readable description?
PS Cross join is inner join with a TRUE condition (or no condition in some DBMSs), period. Whether there is a "relationship" aka FK is irrelevant. If the condition is FK=PK or anything else other than TRUE then it's not a cross join; otherwise it is a cross join whether or not there is a FK between the tables. It's just that we frequently want PK=FK in a condition & tools can & do use the presence of a FK towards a default condition.
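A quick check of that claim, assuming SQLite via Python's sqlite3 module, with two throwaway tables `a` and `b` that have no keys or constraints at all. The always-true condition is written as `1 = 1` for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical tables with no keys or constraints.
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (10), (20), (30);
""")

cross = conn.execute(
    "SELECT x, y FROM a CROSS JOIN b ORDER BY x, y"
).fetchall()

# An inner join whose condition is always true pairs every row with every row.
inner_true = conn.execute(
    "SELECT x, y FROM a INNER JOIN b ON 1 = 1 ORDER BY x, y"
).fetchall()

print(cross == inner_true, len(cross))
```

Both queries return the same 2 x 3 = 6 rows, and declaring an FK between the tables would not change that.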
CROSS JOIN vs INNER JOIN in SQL Server 2008
You asked "What is happening under the hood?"
The simple answer is "Statements about relationships."
Many well-meaning people draw ER diagrams and seem to either forget or be unaware of the fact that their ER diagrams are really "pictures of statements in language."
The problem is ambiguity.
Many well-meaning people jump straight to ER diagrams without also expressing the logical statements on which their ER diagrams are based. In effect, this means that the person who draws the ER diagram seems to expect that the "reader" of the ER diagram will be able to reconstruct the statements from which the ER diagram was drawn.
Here is an example to illustrate what I mean. My purpose is to show the linguistic basis of the "under the covers" relationship between Students and their Addresses.
So, what's under the covers is language!
[Figure: a simple diagram, followed by the statements from which the diagram is derived]
[Figure: a more complex diagram, followed by the statements from which the diagram is derived]

SQL and storing foreign keys on multiple tables vs joining through tables for queries

What's best practice for querying multiple tables and foreign keys?
For instance:
User has many Tasks has many Reminders
I would assume best practice would be to keep foreign keys to their direct relatives, ie:
tasks table:
id, user_id
reminders table:
id, task_id
...to avoid situations where you have a bunch of foreign keys connecting tables to all possible relatives.
Let's say I want to retrieve all of a user's reminders.
SELECT reminders.* FROM users
LEFT JOIN tasks ON users.id = tasks.user_id
LEFT JOIN reminders ON tasks.id = reminders.task_id
WHERE tasks.user_id = {#YourUserIdVariable}
Is there a point at which doing joins to retrieve a record outweighs the integrity of the db (assuming that storing reminders.user_id would be a bad practice in the first place)?
5 tables? 10?
The standard is not to store unnecessarily redundant information, including redundant foreign key information that could be derived through relations/joins.
In practice this means storing only the FKs of direct relationships, not secondary or implicit FK info. For performance reasons you may at some point decide to try "short-circuiting" a relation, but technically that's a form of denormalization. You can do it, but you should only do it when the need is apparent. In other words, proper normalization should always be the default; anything else needs to significantly justify itself.
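As a rough sketch of the normalized layout, assuming SQLite via Python's sqlite3 module: each table stores only the FK to its direct parent, and a user's reminders are reached through tasks. The `name`, `title`, and `note` columns, the sample rows, and the helper `reminders_for` are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users     (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tasks     (id INTEGER PRIMARY KEY,
                            user_id INTEGER REFERENCES users(id), title TEXT);
    CREATE TABLE reminders (id INTEGER PRIMARY KEY,
                            task_id INTEGER REFERENCES tasks(id), note TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO tasks VALUES (10, 1, 'taxes'), (11, 2, 'laundry');
    INSERT INTO reminders VALUES
        (100, 10, 'find receipts'), (101, 10, 'file form'), (102, 11, 'buy soap');
""")

def reminders_for(user_id):
    """Return the notes of all reminders that belong to one user's tasks."""
    return conn.execute("""
        SELECT reminders.note
        FROM reminders
        INNER JOIN tasks ON tasks.id = reminders.task_id
        WHERE tasks.user_id = ?
        ORDER BY reminders.id
    """, (user_id,)).fetchall()

print(reminders_for(1))
```

Because tasks already carries user_id, this particular query needs only one join and no reminders.user_id column.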
As for "Is there a point at which doing joins to retrieve a record outweighs the integrity of the db?": Possibly, but this can only be determined on a case-by-case basis. It's not possible to reduce a business trade-off like this to a data-only technical rule.
How much data do you expect to store in your database? If there is a relatively small number of rows in each joined table, the difference in execution time between a normalized and a non-normalized database will be negligible.
So instead just go with a design that makes sense to you and you can normalize and de-normalize tables where it makes sense.
Here's a good read on the subject: http://www.codinghorror.com/blog/2008/07/maybe-normalizing-isnt-normal.html

DB: advantages of relations

I always thought that relations between tables were needed to perform cross-table operations, such as joins. But I noticed that I can inner join two tables that are not linked at all (they have no foreign keys).
So, my questions:
Are there any differences (such as speed) between joining linked and non-linked tables?
What are the advantages/disadvantages of using relations between tables?
Thank you in advance.
The primary advantage is that foreign key constraints ensure the relational integrity of the data, i.e. they stop you from deleting something that has a related entry in another table.
You only get a performance advantage if you create an index on your FK.
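A minimal sketch of that protection, assuming SQLite via Python's sqlite3 module. SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, which is specific to SQLite. The `departments` and `employees` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign key enforcement is off unless switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id            INTEGER PRIMARY KEY,
        name          TEXT,
        department_id INTEGER NOT NULL REFERENCES departments(id)
    );
    INSERT INTO departments VALUES (1, 'sales');
    INSERT INTO employees   VALUES (1, 'ann', 1);
""")

# Deleting a department that an employee still references must fail.
try:
    conn.execute("DELETE FROM departments WHERE id = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

remaining = conn.execute("SELECT COUNT(*) FROM departments").fetchone()[0]
print(delete_blocked, remaining)
```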
The FK/PK relationship is a logical feature of the data that would exist even if it were not declared in a given database. You include FKs in a table precisely to establish these logical relationships and to make them visible in a way that makes useful inner joins possible. Declaring an FK as referencing a given PK has the advantage, as said in other answers, of preventing orphaned references: rows that reference a nonexistent PK.
Indexes can speed up joins. In a complicated query, the optimizer may have a lot of strategies to evaluate, and most of these will not use every available index. Good database systems have good optimizers. In most database systems, declaring a PK will create an index behind the scenes. Sometimes, but not always, creating an index on the FK with the same structure as the index on the PK will enable the optimizer to use a strategy called a merge-join. In certain circumstances a merge-join can be much faster than the alternatives.
When you join tables that are apparently unrelated, there are several cases.
One case is where you end up matching every row from table A with every row from table B. This is called a Cartesian join. It takes a long time, and nearly always produces unintended results. One time in ten years I did an intentional Cartesian join.
Another case is where both tables contain the same FK, and you match along those two FK. An example might be matching by ZIPCODE. Zipcodes are really FKs to some master zipcode table somewhere out there in post office land, even though most people who use zipcodes never realize that fact.
A third case is where there is a third table, a junction table, containing FKs that reference each of the two tables in question. This implements a many-to-many relationship. In this case, what you probably want to be doing is a three-way join with two inner joins, each of which has an FK/PK matchup as the join condition.
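The junction-table case might look like the following sketch, assuming SQLite via Python's sqlite3 module. The `students`, `courses`, and `enrollments` tables are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
    -- Junction table: one FK to each side of the many-to-many relationship.
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(id),
        course_id  INTEGER REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO courses  VALUES (1, 'algebra'), (2, 'biology');
    INSERT INTO enrollments VALUES (1, 1), (1, 2), (2, 2);
""")

# Three-way join: two inner joins, each matching an FK to a PK.
pairs = conn.execute("""
    SELECT s.name, c.title
    FROM students AS s
    INNER JOIN enrollments AS e ON e.student_id = s.id
    INNER JOIN courses     AS c ON c.id = e.course_id
    ORDER BY s.name, c.title
""").fetchall()

print(pairs)
```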
Either I'm telling a lot that you already know, or you would benefit by going through a basic tutorial on relational databases.
In relational database terms a relation is (more or less) the data structure you have called a table; it is not something that exists "between" tables. An important advantage of the relational model is that there are no predefined links or other navigational structures that limit the way data can be joined or otherwise combined. You are free to join relations (tables) in a query however you like.
What you are asking about is actually called a foreign key constraint. A foreign key is a type of constraint that helps ensure data integrity by preventing inconsistent values being populated in the database.

Why is a primary-foreign key relation required when we can join without it?

If we can get data from two tables without a primary and foreign key relation, then why do we need this rule? Can you please explain it clearly, with a suitable example?
It's a test database, don't mind the bad structure.
Tables' structure:
table - 'test1'
columns - id, lname, fname, dob
no primary or foreign key, and not unique (no constraints at all)
table - 'test2'
columns - id, native_city
again, no relations and no constraints
I can still join these tables on the shared column 'id',
so if no primary-foreign key relation is needed, what is the use of one?
The main reason for primary and foreign keys is to enforce data consistency.
A primary key enforces the consistency of uniqueness of values over one or more columns. If an ID column has a primary key then it is impossible to have two rows with the same ID value. Without that primary key, many rows could have the same ID value and you wouldn't be able to distinguish between them based on the ID value alone.
A foreign key enforces the consistency of data that points elsewhere. It ensures that the data which is pointed to actually exists. In a typical parent-child relationship, a foreign key ensures that every child always points at a parent and that the parent actually exists. Without the foreign key you could have "orphaned" children that point at a parent that doesn't exist.
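Both guarantees can be seen in a small sketch, assuming SQLite via Python's sqlite3 module. Foreign key enforcement has to be switched on with a PRAGMA in SQLite. The `parent` and `child` tables and the helper `rejected` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign key enforcement is off unless switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id)
    );
    INSERT INTO parent VALUES (1);
""")

def rejected(sql):
    """Run one statement; return True if a constraint refused it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# The primary key refuses a second row with the same id.
duplicate_pk_rejected = rejected("INSERT INTO parent VALUES (1)")
# The foreign key refuses a child whose parent does not exist.
orphan_rejected = rejected("INSERT INTO child VALUES (1, 99)")
# A child pointing at an existing parent is accepted.
valid_child_rejected = rejected("INSERT INTO child VALUES (2, 1)")

print(duplicate_pk_rejected, orphan_rejected, valid_child_rejected)
```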
You need two columns of the same type, one on each table, to JOIN on. Whether they're primary and foreign keys or not doesn't matter.
You don't need a FK, you can join arbitrary columns.
But having a foreign key ensures that the join will actually succeed in finding something.
Foreign keys give you certain guarantees that would be extremely difficult and error-prone to implement otherwise.
For example, if you don't have a foreign key, you might insert a detail record in the system and, just after you checked that the matching master record is present, somebody else deletes it. So in order to prevent this you need to lock the master table whenever you modify the detail table (and vice versa). If you don't need/want that guarantee, screw the FKs.
Depending on your RDBMS, a foreign key might also improve the performance of selects (but it also degrades the performance of updates, inserts, and deletes).
I know it's late to post, but I use the site for my own reference, so I wanted to put an answer here for myself to reference in the future too. I hope you (and others) find it helpful.
Let's pretend a bunch of super Einstein experts designed our database. Our super perfect database has 3 tables, and the following relationships defined between them:
TblA 1:M TblB
TblB 1:M TblC
Notice there is no relationship between TblA and TblC
In most scenarios such a simple database is easy to navigate but in commercial databases it is usually impossible to be able to tell at the design stage all the possible uses and combination of uses for data, tables, and even whole databases, especially as systems get built upon and other systems get integrated or switched around or out. This simple fact has spawned a whole industry built on top of databases called Business Intelligence. But I digress...
In the above case, the structure is so simple to understand that it's easy to see you can join from TblA, through to B, and through to C and vice versa to get at what you need. It also very vaguely highlights some of the problems with doing it. Now expand this simple chain to 10 or 20 or 50 relationships long. Now all of a sudden you start to envision a need for exactly your scenario. In simple terms, a join from A to C or vice versa, or A to F, or B to Z, or whatever as our system grows.
There are many ways this can indeed be done. The one mentioned above is the most popular, that is, driving through all the links. The major problem is that it's very slow. And it gets progressively slower the more tables you add to the chain, the more those tables grow, and the further you want to go through it.
Solution 1: Look for a common link. It must be there if you thought of a reason to join A to C. If it is not obvious, create a relationship and then join on it, i.e. to join A through B through C there must be some commonality, or your join would produce either zero results or a massive number of results (Cartesian product). If you know this commonality, simply add the needed columns to A and C and link them directly.
The rule for relationships is that they simply must have a reason to exist. Nothing more. If you can find a good reason to link from A to C then do it. But you must ensure your reason is not redundant (i.e. it's already handled in some other way).
Now a word of warning. There are some pitfalls. But I don't do a good job of explaining them so I will refer you to my source instead of talking about it here. But remember, this is getting into some heavy stuff, so this video about fan and chasm traps is really only a starting point. You can join without relationships. But I advise watching this video first as this goes beyond what most people learn in college and well into the territory of the BI and SAP guys. These guys, while they can program, their day job is to specialise in exactly this kind of thing. How to get massive amounts of data to talk to each other and make sense.
This video is one of the better videos I have come across on the subject. And it's worth looking over some of his other videos. I learned a lot from him.
A primary key is not required. A foreign key is not required either. You can construct a query joining two tables on any column you wish as long as the datatypes either match or are converted to match. No relationship needs to explicitly exist.
For example, with an outer join:
select tablea.code, tablea.name, tableb.location
from tablea
left outer join tableb on tablea.code = tableb.code
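To see this run, here is a sketch assuming SQLite via Python's sqlite3 module. Neither table has a primary key, foreign key, or any other constraint, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No primary keys, no foreign keys, no constraints of any kind.
    CREATE TABLE tablea (code TEXT, name TEXT);
    CREATE TABLE tableb (code TEXT, location TEXT);
    INSERT INTO tablea VALUES ('A1', 'anvil'), ('B2', 'bolt');
    INSERT INTO tableb VALUES ('A1', 'north'), ('A1', 'south');
""")

rows = conn.execute("""
    SELECT tablea.code, tablea.name, tableb.location
    FROM tablea
    LEFT OUTER JOIN tableb ON tablea.code = tableb.code
    ORDER BY tablea.code, tableb.location
""").fetchall()

print(rows)
```

The join works, but nothing stops 'A1' from matching twice or 'B2' from matching nothing (its location comes back as NULL). Those are exactly the situations that keys are there to rule out.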