What is happening under the hood when a relationship is established between tables? - sql

This question is not limited to Power BI, but it will help me explain my problem.
If you have more than one table in Power BI, you can establish a relationship between them by dragging a column from one table to the other like this:
And you can edit that relationship by clicking the occuring line:
And by the way, here are the structures of the two tables:
# Table1
A,B
1,abc
2,def
3,ghi
4,jkl
# Table2
A,C
1,abc
1,def
2,ghi
3,ghit
This works fine since column A in Table1 consists of unique values and can work as a primary key. And now you can head over to the Report tab, set up two tables, and slice and dice at your hearts desire either by clicking directly under A in Table1, or by introducing a slicer:
But the thing is that you can do that without having established a relationship between the tables. Delete the relationshiop under Relationships and go back to Report and select Home > Manage Relationships to see what I mean:
As the dialog box says 'There are no relationships defined yet.' But you can still subset one table by making selections in the other just like before (EDIT: This statement has been proven wrong in the answer from RADO) . I do know that you can highlight the slicer and select Format > Edit Interactions and deselect the tables associated with the slicer. But I'm still puzzled by the whole thing.
So is there something happening under the hood here that I'm not aware of? Or is the relationship between tables really defined by the very contents of the tables - in the sence that the existence of related values accross tables with the existence of a potential primary key (be it natural or synthetic) makes it possible to query them using SQL, dplyr verbs or any other form of querying techniques. And that you really do not need an explicitly defined relationship?
Or put in another way, does the establishment of a Power BI table relationship have a SQL equivalent? Perhaps like the following:
CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
PRIMARY KEY (ID)
);
I'm sorry If I'm rambling a bit here, but I'm just very confused. And googling has so far only added to the confusion. So thank you for any insights!

Your statement "But you can still subset one table by making selections in the other just like before" is not correct. This is a key issue here.
Relations enable propagation of filter context in Power BI. That's a very loaded phrase, and you will have to learn what it means if you plan to use Power BI. It's the most important concept to understand.
To see what I mean, you will need to write DAX measures and try to manipulate them using your tables. You will immediately see the difference when you have or don't have relations.
How the whole system works (simplified):
PowerBI contains a language called "DAX". You will create measures in DAX, and PowerBI will then translate them into its internal language called xmSQL, which is a special flavor of SQL. In xmSQL, regular connection is translated into LEFT OUTER JOIN, like this:
SELECT SUM(Sales.Amount)
FROM Sales
LEFT OUTER JOIN Customer
ON Sales.Customer_Key = Customer.Customer_Key
By-directional relations are a bit more complex, but conceptually similar.
Overall, when you create relations between tables, you are telling PowerBI engine how to join the tables. The engine then also adds some optimizations to speed up the queries.
Every time you execute a DAX measure, click a slicer or a visual, PowerBI generates multiple xmSQL statements in the background, executes them, and then renders their results as visuals. You can see these SQL queries with some tools such as DAX Studio.
Note that it's not strictly necessary to establish relations between tables in PowerBI. You can imitate the same behavior using DAX (programmatically), but such "virtual" relations are more complex and can be substantially slower.

In the RM (relational model) & ERM (entity-relationship model) tables represent relation(ship)s/association. Hence, relational in "RM" & relationship in "ERM".
FKs (foreign keys) get erroneously called "relationships" in pseudo-ERM methods. A SQL FK constraint says subrows appear elsewhere as PK (primary key) or UNIQUE. A DBMS uses them to disallow invalid updates & to optimize queries.
Power BI "relationships" are not FKs. They are instructions on how to build queries.
When there is a FK we do often want to join on it. So we often want a Power BI relationship when there is a FK.
Create and manage relationships in Power BI Desktop
(See also its Download PDF link for Developer.)
PS We do not need constraints to hold or be declared or be known to query. The constraints (Including PKs, FKs, UNIQUE & cardinalities) are determined by the table meanings--(characteristic) predicates--& what business situations can arise. If constraints hold then we just sometimes get fewer rows than otherwise & some query pairs always return the same results when otherwise they wouldn't.
Foreign keys are not needed to join tables!
Is there any rule of thumb to construct SQL query from a human-readable description?
PS Cross join is inner join with a TRUE condition (or no condition in some DBMSs), period. Whether there is a "relationship" aka FK is irrelevant. If the condition is FK=PK or anything else other than TRUE then it's not a cross join; otherwise it is a cross join whether or not there is a FK between the tables. It's just that we frequently want PK=FK in a condition & tools can & do use the presence of a FK towards a default condition.
CROSS JOIN vs INNER JOIN in SQL Server 2008

You asked "What is happening under the hood?"
The simple answer is "Statements about relationships."
Many well meaning people draw ER diagrams and seem to either forget or be unaware of the fact that their ER diagrams are really "pictures of statements in language."
The problem is ambiguity.
Many well meaning people jump straight to ER diagrams without also expressing the logical statements on which their ER diagrams are based. In effect, this means that the person who draws the ER diagram seems to expect that the "reader" of the ER diagram will be able reconstruct the statements from which the ER diagram was drawn.
Here is an example to illustrate what I mean. My purpose is to show the linguistic basis of the "under the covers" relationship between Students and their Addresses.
So, what's under the covers is language!
A simple diagram
The statements from which the diagram is derived.
A more complex diagram
The statements from which the diagram is derived.

Related

Is a "join table" the result of a SQL JOIN, or the table between a many-to-many

This is a question about correctly "naming things". Specifically, how do you distinguish between:
the "between" table in a many-to-many relationship (e.g. users, users_questions, questions)
the (temporary) table that is created during a SQL JOIN (e.g. 'SELECT * FROM users INNER JOIN users_questions.user_id ON users.id WHERE users_question.question_id=37016694;`)
Lots of database designers use the term join table in your first sense: to implement a many-to-many relationship between entities. It's also called a junction table, association table, and other things. More info: https://en.wikipedia.org/wiki/Associative_entity
I've never heard the second sense used. (But, hey, I don't get out much. :-) If you're writing documentation, or teaching, I suggest you reserve the word table to mean an actual, physical, table. Avoid using the word table for a resultset unless you qualify it by saying virtual table or some such phrase. That way your readers and students won't waste time trying to find the definitions of these not-really-tables in your schema.
From relational point of view, a JOIN table (a table which resolves many-to-many relationship) is real physical table. Else it wouldn't survive between requests. In my opinion, the term "join table" was coined by MVC "Code First" developers who ignore physical entities in the DB realm, especially if they are not shown in DbContext.
In my opinion, again, we should honor relational realities.
So help me Codd.

DB: advantages of relations

I always think that the relations between tables are needed to perform cross-table operations, such as join. But I noticed that I can inner join two tables that are not linked at all (hasn't any foreign keys).
So, my questions:
Are some differences (such as speed) in joining linked and not-linked tables?
What are the advantages/disadvantages of using relations bwtween tables?
Thank you in advance.
The primary advantage is that foreign key constraints ensure the relational integrity of the data.. ie it stops you from deleting something that has a related entry in another table
You only get a performance advantage if you create an index on your FK
The FK/PK relationship is a logical feature of the data that would exist even if it were not declared in a given database. You include FKs in a table precisely to establish these logical relationships and to make them visible in a way that makes useful inner joins possible. Declaring an FK as referencing a given PK has the advantage, as said in other answers, of preventing orphaned references, rows that reference a non existent PK.
Indexes can speed up joins. In a complicated query, the optimizer may have a lot of strategies to evaluate, and most of these will not use every available index. Good database systems have good optimizers. In most database systems, declaring a PK will create an index behind the scenes. Sometimes, but not always, creating an index on the FK with the same structure as the index n the PK will enable the optimizer to use a strategy called a merge-join. In certain circumstances a merge-join can be much faster than the alternatives.
When you join tables that are apprently unrelated, there are several cases.
One case is where you end up matching every row from table A with every row from table B. This is called a cartesian join. It takes a long time, and nearly always produces unintended results. One time in ten years I did an intentional cartesian join.
Another case is where both tables contain the same FK, and you match along those two FK. An example might be matching by ZIPCODE. Zipcodes are really FKs to some master zipcode table somewhere out there in post office land, even though most people who use zipcodes never realize that fact.
A third case is where there is a third table, a junction table, containing FKs that reference each of the two tables in question. This implements a many-to-many relationship. In this case, what you probably want to be doing is a three way join with two inner joins each of which has an FK/PK matchup as the join condition.
Either I'm telling a lot that you already know, or you would benefit by going through a basic tutorial on relational databases.
In relational database terms a relation is (more or less) the data structure you have called a table - it is not something that exists "between" tables. A important advantage of the relational model is that there are no predefined links or other navigational structures that limit the way data can be joined or otherwise combined. You are free to join relations (tables) in a query however you like.
What you are asking about is actually called a foreign key constraint. A foreign key is a type of constraint that helps ensure data integrity by preventing inconsistent values being populated in the database.

How to best explain on what fields should a user join on?

I need to explain to somebody how they can determine what fields from multiple tables/views they should join on. Any suggestions? I know how to do it but am having difficulty trying to explain it.
One of the issues they have is they will take two fields from two tables that are the same (zip code) and join on those, when in reality they should be joining on ID columns. When they choose the wrong column to join on it increases records they receive in return.
Should I work in PK and FK somewhere?
While it is indeed typical to join a PK to an FK any conversation about JOIN clauses that only revolve around PK's and FK's is fairly limited
For example I had this FROM clause in a recent SQL answer I gave
FROM
YourTable firstNames
LEFT JOIN YourTable lastNames
ON firstnames.Name = lastNames.Name
AND lastNames.NameType =2
and firstnames.FrequencyPercent < lastNames.FrequencyPercent
The table referenced on each side of the table is the same table (a self join) and it includes three condidtions one of which is an inequality. Furthermore there would never be an FK here because its looking to join on a field, that is by design, not a Candidate Key.
Also you don't have even have to join one table to another. You can join inline queries to each other which of course can't possibly have a Key.
So in order to properly understand JOIN you just need to understand that it combines the records from two relations (tables, views, inline queries) where some conditions evaluate to true. This means you need to understand boolean logic and the database and the data in the database.
If your user is having a problem with a specific JOIN ask them to SELECT some rows from one table and also the other and then ask them under what conditions would you want to combine the rows.
You don't need to talk in terms of a primary key of a table but you should point to it and explain that it uniquely identifies a given row and that you must join to related tables using it or you could get duplicated results.
Give them examples of joining with it and joining without it.
An ER diagram showing all of the tables they use and their key relationships would help ensure that they always use the correct keys.
It sounds to me like neither you, nor the person you are trying to help understands how this particular database is constructed and perhaps don't really even understand basic database fundamentals, like PK's and FK's. Most often a PK from one table is joined to a FK to another table.
Assuming the database has the proper PK's and FK's in place, it would probably help a great deal to generate an ER diagram. That would make the joining concept much easier to grasp.
Another approach you could take is to find someone who does understand these things and create some views for this person to use. This way he doesn't need to understand how to join the tables together.
A user shouldn't typically be doing joins. A user should have an interface that lets them get the data that they need in the way that they need it. If you don't have the developer resources to do that then you're going to be stuck with this problem of having to teach a user technical details. You also need to be very careful about what kind of damage the user can do. Do they have update rights on the data? I hope they don't accidentally do a DELETE FROM Table with no WHERE clause. Even if you restrict their permissions, a poorly written query can crush the database server or block resources causing problems for other users (and more work for you).
If you have no choice, then I think that you need to certainly teach them about primary and foreign keys, even if you don't call them that. Point out that the id on your table (or whatever your PK is) identifies a row. Then explain how the id appears in other tables to show the relationship. For example, "See, in the address table we have a person_id which tells us who that address belongs to."
After that, expect to spend a large portion of your time with that user as they make mistakes or come up with other things that they want to get from the database, but which they can't figure out how to get.
From theory, and ideally, you should define primary keys on all tables, and join tables using a primary key to the matching field or fields (foreign key) in the other table.
Even if you don't define or if they're not defined as primary keys, you need to make sure the fields uniquely identify the records in the table, and that they should be properly indexed.
For example, let's say the 'person' table has a SSN and a driver's license field. The SSN could be considered and flagged as the 'primary key', but if you join that table to a 'drivers' table which might not have the SSN, but does have the driver's license #, you could join them by the driver's license field (even if it's not flagged as primary key), but you need to make sure that the field is properly indexed in both tables.
...explain to somebody how they can determine what fields from multiple tables/views they should join on.
Simply put, look for the columns with values that match between the tables/views. Preferably, match exactly but some massaging might be necessary.
The existence of foreign key constraints would help to know what matches to what, but the constraint might not be directly to the table/view that is to be joined.
The existence of a primary key doesn't mean it is the criteria that is necessary for the query, so I would overlook this detail (depending on the audience).
I would recommend attacking the desired result set by starting with the columns desired, and working back from there. If there's more than one table's columns in the result set, focus on the table whose columns should be returning distinct results first and then gradually add joins, checking the result set between each JOIN addition to confirm the results are still the same. Otherwise, need to review the JOIN or if a JOIN is actually necessary vs IN or EXISTS.
I did this when I first started out, it comes from thinking of joins as just linking tables together, so I linked at all possible points.
Once you think of joins as a way to combine AND filter the data it becomes easier to understand them.
Writing out your request as a sentence is helpful too, "I want to see all the times Table A interacted with Table B". Then build a query from that using only the ID, noting that if you wanted to know "All the times Table A was in the same zip code as Table B" then you would join by zip code.

Why is a primary-foreign key relation required when we can join without it?

If we can get data from two tables without having primary and foreign key relation, then why we need this rule? Can you please explain me clearly, with suitable example?
It's a test database, don't mind the bad structure.
Tables' structure:
**
table - 'test1'
columns - id,lname,fname,dob
no primary and foreign key and also not unique(without any constraints)
**
**table - 'test2'
columns- id,native_city
again, no relations and no constraints**
I can still join these tables with same columns 'id',
so if there's no primary-foreign key, then what is the use of that?
The main reason for primary and foreign keys is to enforce data consistency.
A primary key enforces the consistency of uniqueness of values over one or more columns. If an ID column has a primary key then it is impossible to have two rows with the same ID value. Without that primary key, many rows could have the same ID value and you wouldn't be able to distinguish between them based on the ID value alone.
A foreign key enforces the consistency of data that points elsewhere. It ensures that the data which is pointed to actually exists. In a typical parent-child relationship, a foreign key ensures that every child always points at a parent and that the parent actually exists. Without the foreign key you could have "orphaned" children that point at a parent that doesn't exist.
You need two columns of the same type, one on each table, to JOIN on. Whether they're primary and foreign keys or not doesn't matter.
You don't need a FK, you can join arbitrary columns.
But having a foreign key ensures that the join will actually succeed in finding something.
Foreign key give you certain guarantees that would be extremely difficult and error prone to implement otherwise.
For example, if you don't have a foreign key, you might insert a detail record in the system and just after you checked that the matching master record is present somebody else deletes it. So in order to prevent this you need to lock the master table, when ever you modify the detail table (and vice versa). If you don't need/want that guarantee, screw the FKs.
Depending on your RDBMS a foreign key also might improve performance of select (but also degrades performance of updates, inserts and deletes)
I know its late to post, but I use the site for my own reference and so I wanted to put an answer here for myself to reference in the future too. I hope you (and others) find it helpful.
Lets pretend a bunch of super Einstein experts designed our database. Our super perfect database has 3 tables, and the following relationships defined between them:
TblA 1:M TblB
TblB 1:M TblC
Notice there is no relationship between TblA and TblC
In most scenarios such a simple database is easy to navigate but in commercial databases it is usually impossible to be able to tell at the design stage all the possible uses and combination of uses for data, tables, and even whole databases, especially as systems get built upon and other systems get integrated or switched around or out. This simple fact has spawned a whole industry built on top of databases called Business Intelligence. But I digress...
In the above case, the structure is so simple to understand that its easy to see you can join from TblA, through to B, and through to C and vice versa to get at what you need. It also very vaguely highlights some of the problems with doing it. Now expand this simple chain to 10 or 20 or 50 relationships long. Now all of a sudden you start to envision a need for exactly your scenario. In simple terms, a join from A to C or vice versa or A to F or B to Z or whatever as our system grows.
There are many ways this can indeed be done. The one mentioned above being the most popular, that is driving through all the links. The major problem is that its very slow. And gets progressively slower the more tables you add to the chain, the more those tables grow, and the further you want to go through it.
Solution 1: Look for a common link. It must be there if you taught of a reason to join A to C. If it is not obvious, create a relationship and then join on it. i.e. To join A through B through C there must be some commonality or your join would either produce zero results or a massive number or results (Cartesian product). If you know this commonality, simply add the needed columns to A and C and link them directly.
The rule for relationships is that they simply must have a reason to exist. Nothing more. If you can find a good reason to link from A to C then do it. But you must ensure your reason is not redundant (i.e. its already handled in some other way).
Now a word of warning. There are some pitfalls. But I don't do a good job of explaining them so I will refer you to my source instead of talking about it here. But remember, this is getting into some heavy stuff, so this video about fan and chasm traps is really only a starting point. You can join without relationships. But I advise watching this video first as this goes beyond what most people learn in college and well into the territory of the BI and SAP guys. These guys, while they can program, their day job is to specialise in exactly this kind of thing. How to get massive amounts of data to talk to each other and make sense.
This video is one of the better videos I have come across on the subject. And it's worth looking over some of his other videos. I learned a lot from him.
A primary key is not required. A foreign key is not required either. You can construct a query joining two tables on any column you wish as long as the datatypes either match or are converted to match. No relationship needs to explicitly exist.
To do this you use an outer join:
select tablea.code, tablea.name, tableb.location from tablea left outer join
tableb on tablea.code = tableb.code
join with out relation
SQL join

General question on SQL - JOINS

If I was asked to query JOIN of more than three tables, what is the best way I go about understanding the relationship of the tables before I code. Should I use Database Diagram in SQL Server, or would I be given the necessary information? What would you recommend?
Thanks in advance for your time.
You could use the diagramming tools in SQL Server Management Studio to discover any Foreign Key relationships between those tables. It might be quicker than using the GUI to inspect each table in Design mode, and viewing its Relationships dialog.
Consider creating an ad-hoc View with those 3 tables. This will help you produce the SQL statement that you'd need. If any relationships exist on those 3 tables, you'll have the JOIN statement created for you by the tool.
Right click Views -> New View.
Pick the tables you need, click Add
all relationships are displayed in the Diagram Pane, and the SQL Pane will the SELECT statement with the required JOINS.
Depending on the convention they use for creating foreign keys, it shouldn't be to hard to find the relationships between tables.
The convention we use is
dbo.TableA(ID PK)
dbo.TableB(ID PK, TableAID FK)
dbo.TableC(ID PK, TableBID FK)
...
If they don't use any convention at all or didn't even create Foreign Key constraints, you can take that as an opportunity to educate them about the importance of conventions aka the lost time and money by not using them.