Natural Join -- Relational theory and SQL

This question comes from my reading of C. J. Date's SQL and Relational Theory: How to Write Accurate SQL Code and from looking up joins on the internet (which included coming across multiple posts here on NATURAL JOINs, and on SQL Server's lack of support for them).
So here is my problem...
On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred).
On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction).
Is the reconciliation of these that:
Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS.
and / or
Good/Better table design should remove/minimise the problems that natural join creates.
?

A number of points regarding your question (even if I'm afraid I'm not really answering anything you asked):
"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."
This seems to suggest that you interpret the theory as if it proscribes "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say is that in the relational algebra a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with the projection of the second on the common attributes. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. Etc., etc.
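To make the derivations above concrete, here is a minimal sketch in Python that models relations as lists of dicts and builds semijoin and antijoin out of natural join, projection, and difference. The function names and the sample suppliers/shipments data are illustrative assumptions, not taken from Date's book.

```python
# Relations are modelled as lists of dicts (attribute name -> value).
# All names below are illustrative; this is a sketch, not a real RDBMS.

def natural_join(r, s):
    """Pair up tuples that agree on every attribute name they share."""
    out = []
    for x in r:
        for y in s:
            common = x.keys() & y.keys()
            if all(x[a] == y[a] for a in common):
                row = {**x, **y}
                if row not in out:          # relations have no duplicates
                    out.append(row)
    return out

def project(r, attrs):
    """Keep only the named attributes, removing duplicate tuples."""
    out = []
    for x in r:
        row = {a: x[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

def semijoin(r, s):
    """r MATCHING s: r joined with the projection of s on the common attributes."""
    if not r or not s:
        return []
    common = r[0].keys() & s[0].keys()
    return natural_join(r, project(s, common))

def antijoin(r, s):
    """r NOT MATCHING s: r minus (r semijoin s)."""
    matched = semijoin(r, s)
    return [x for x in r if x not in matched]

suppliers = [{"sno": "S1", "city": "London"}, {"sno": "S2", "city": "Paris"}]
shipments = [{"sno": "S1", "pno": "P1"}, {"sno": "S1", "pno": "P2"}]
parts = [{"pno": "P1"}, {"pno": "P2"}]   # shares no attribute with suppliers

print(natural_join(suppliers, shipments))  # two tuples, both for S1
print(semijoin(suppliers, shipments))      # only S1 has shipments
print(antijoin(suppliers, shipments))      # only S2 has none
print(len(natural_join(suppliers, parts))) # 4: no common attributes, so a cartesian product
```

With no common attributes the equality test in `natural_join` is vacuously true for every pair, which is exactly why cartesian product falls out as a special case.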
"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."
Where are such things advised? In the SQL standard? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than advice from someone to never use NJ in SQL at all.
"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."
While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.
Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.
"Good/Better table design should remove/minimise the problems that natural join creates."
Problems that natural join creates? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...

First, the choice between theory and being practical is a fallacy. To quote Chris Date: "the truth is that theory--at least the theory I'm talking about here, which is relational theory--is most definitely very practical indeed".
Second, consider that natural join relies on attribute naming. Please (re)read the following sections of the Accurate SQL Code book:
6.12. The Reliance on Attribute Names. Salient quote:
The operators of the relational algebra… all rely heavily on attribute
naming.
3.9. Column Naming in SQL. Salient quote:
Strong recommendation: …if two columns in SQL represent "the same kind
of information," give them the same name wherever possible. (That's
why, for example, the two supplier number columns in the
suppliers-and-parts database are both called SNO and not, say, SNO in
one table and SNUM in the other.) Conversely, if two columns represent
different kinds of information, it's usually a good idea to give them
different names.
I'd like to address @kuru kuru pa's point (a good one too) about columns being added to a table over which you have no control, such as a "web service you're consuming." It seems to me that this problem is effectively mitigated using the strategy suggested by Date in section 3.9 (referenced above). Quote:
For every base table, define a view identical to that base table except possibly for some column renaming.
Make sure the set of views so defined abides by the column naming discipline described above.
Operate in terms of those views instead of the underlying base tables.
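The three steps quoted above can be sketched with Python's built-in sqlite3 module (SQLite supports NATURAL JOIN). The table, view, and column names here are made up for illustration: the base tables deliberately break the naming discipline (`sno` vs `snum`), and the views repair it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base tables with inconsistent supplier-number names.
    CREATE TABLE supplier (sno TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE shipment (snum TEXT, pno TEXT, qty INTEGER);
    INSERT INTO supplier VALUES ('S1', 'Smith'), ('S2', 'Jones');
    INSERT INTO shipment VALUES ('S1', 'P1', 300), ('S1', 'P2', 200);

    -- One view per base table, identical except for column renaming.
    CREATE VIEW v_supplier AS SELECT sno, name AS sname FROM supplier;
    CREATE VIEW v_shipment AS SELECT snum AS sno, pno, qty FROM shipment;
""")

# Queries operate on the views, so the only shared column name is sno.
rows = conn.execute(
    "SELECT sno, sname, pno, qty "
    "FROM v_supplier NATURAL JOIN v_shipment ORDER BY pno"
).fetchall()
print(rows)  # [('S1', 'Smith', 'P1', 300), ('S1', 'Smith', 'P2', 200)]
```

If someone later adds a column to a base table, the views do not change, so the set of columns the natural join matches on stays the same until a view is deliberately edited.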
Personally, I find the "natural join considered dangerous" attitude frustrating. Not wishing to sound self-righteous but my own naming convention, which follows the guidance of ISO 11179-5 Naming and identification principles, results in schema highly suited to natural join.
Sadly, natural join perhaps won't be supported anytime soon in the DBMS product I use professionally (SQL Server): the relevant feature request on Microsoft Connect was closed as "won't fix" despite having a respectable +38 / -2 score at the time, but has since been reopened and gained a respectable 46 / -2 score (go vote for it now :)

The main problem with the NATURAL JOIN syntax in SQL is that it is typically too verbose.
In Tutorial D syntax I can very simply write a natural join as:
R{a,b,c} JOIN S{a,c,d};
But in SQL the SELECT statement needs either derived table subqueries or a WHERE clause and aliases to achieve the same thing. That's because a single "SELECT statement" is really a non-relational, compound operator in which the component operations always happen in a predetermined order. Projection comes after joins and columns in the result of a join don't necessarily have unique names.
E.g. the above query can be written in SQL as:
SELECT DISTINCT a, b, c, d
FROM
(SELECT a,b,c FROM R) R
NATURAL JOIN
(SELECT a,c,d FROM S) S;
or:
SELECT DISTINCT R.a, R.b, R.c, S.d
FROM R,S
WHERE R.a = S.a AND R.c = S.c;
People will likely prefer the latter version because it is shorter and "simpler".
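As a quick check that the two SQL forms above really return the same rows, here is a small sketch using Python's sqlite3 module. The extra column `e` and the sample data are assumptions added for illustration: `e` exists in both tables so that a bare `R NATURAL JOIN S` (without the projections) gives a different answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- e is an illustrative extra column that both tables happen to share.
    CREATE TABLE R (a INTEGER, b TEXT, c INTEGER, e TEXT);
    CREATE TABLE S (a INTEGER, c INTEGER, d TEXT, e TEXT);
    INSERT INTO R VALUES (1, 'b1', 10, 'x'), (2, 'b2', 20, 'y');
    INSERT INTO S VALUES (1, 10, 'd1', 'z'), (2, 20, 'd2', 'y'), (2, 99, 'd3', 'y');
""")

# Form 1: project first in derived tables, then NATURAL JOIN.
via_derived = conn.execute("""
    SELECT DISTINCT a, b, c, d
    FROM (SELECT a, b, c FROM R) R
    NATURAL JOIN (SELECT a, c, d FROM S) S
    ORDER BY a
""").fetchall()

# Form 2: comma join with the matching conditions spelled out.
via_where = conn.execute("""
    SELECT DISTINCT R.a, R.b, R.c, S.d
    FROM R, S
    WHERE R.a = S.a AND R.c = S.c
    ORDER BY R.a
""").fetchall()

# Without the projections, NATURAL JOIN also matches on e.
raw_natural = conn.execute(
    "SELECT a, b, c, d FROM R NATURAL JOIN S ORDER BY a"
).fetchall()

print(via_derived == via_where)  # True
print(via_derived)               # [(1, 'b1', 10, 'd1'), (2, 'b2', 20, 'd2')]
print(raw_natural)               # [(2, 'b2', 20, 'd2')]
```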

Theory versus reality...
Natural joins are not practical.
There is no such thing as a pure (i.e. practice is identical to theory) RDBMS, as far as I know.
I think Oracle and a few others actually support natural joins -- T-SQL doesn't.
Consider the world we live in -- chances of two tables each having a column with the same name is pretty high (like maybe [name] or [id] or [date], etc.). Maybe those chances are narrowed down a bit by grouping only those tables you might actually want to join. But regardless, without a careful examination of the table structure, you won't know if a "natural join" is a good idea or not. And even if it is, at that moment, it might not be in another year when the application gets an upgrade which adds columns to certain tables, etc., or the web service you're consuming adds fields you didn't know about, etc.
I think a "pure" system would have to be one you had 100% control over at a minimum, and then also, one that would have some good validation in the alter table / create table process that would warn / prevent you from creating a new column in some table that could be "naturally" joined to some other table you might not be intending it to be join-able to.
I guess bottom-line for me would be, valuing my sanity, wanting my applications to have maximum up-time, valuing quick/clean maintenance and upgrades, etc. -- good table design in this context means not using natural joins (ever).
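The upgrade scenario described in this answer is easy to reproduce. The sketch below uses Python's sqlite3 module with made-up `customers` and `orders` tables: the natural join works until a later schema change adds a `name` column to `orders`, at which point the join silently starts matching on `name` as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: the only shared column name is customer_id.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (100, 1), (101, 2);
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

natural = "SELECT COUNT(*) FROM orders NATURAL JOIN customers"
explicit = ("SELECT COUNT(*) FROM orders o "
            "JOIN customers c ON o.customer_id = c.customer_id")

before_count = count(natural)

# A later "upgrade" adds a column whose name collides with customers.name.
conn.execute("ALTER TABLE orders ADD COLUMN name TEXT")

after_count = count(natural)     # now also requires orders.name = customers.name
explicit_count = count(explicit) # the ON form is unaffected

print(before_count, after_count, explicit_count)  # 2 0 2
```

The existing order rows get NULL in the new column, and NULL never compares equal, so the natural join returns nothing and no error is raised.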

Related

Why not have a JOINONE keyword in SQL to hint and enforce that each record has at most one match?

I encounter this a lot when writing SQL. I have two tables that are meant to be in a one-to-one relationship with each other, and I wish I could easily assert that fact in my query. For example, the simplified query:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOIN Location ON Person.LocationID = Location.ID
When I read this query I think to myself, well what if the Location table fails to enforce uniqueness on its ID column? Suddenly you could have the same Person multiple times in your resultset. Sure, I can go look at the schema to assure myself it's unique so everything will be okay, but why shouldn't I simply be able to put it right here in my query, a la:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOINONE Location ON Person.LocationID = Location.ID
Not only would a keyword like this (made up "JOINONE") make it 100% clear to a human reading this query that we are guaranteed to get exactly one row for each Person record, but it lets the db engine optimize its execution plan because it knows there won't be more than one match each, even if the foreign key relationship isn't defined in the schema.
Another advantage of this would be that the db engine could enforce it, so if the data actually did have more than one match, an error could be thrown. This happens for subqueries already, e.g.:
SELECT Person.ID, Person.Name
, (
SELECT Location.Address1
FROM Location
WHERE Location.ID = Person.LocationID
) AS Address1
FROM Person
This is nice and spiffy, 100% clear to the human reader, neatly optimizable, and enforced by the db engine. In fact I often end up doing things this way for all those reasons. The problem is, besides the distracting syntax, you can only select one field this way. (What if I want City, State, and Zip too?) How nice it would be if you could flow this table right along with the rest of your JOINs and select any fields from it you wish in your SELECT clause just like all the rest of your tables.
I couldn't find any other question like this around StackOverflow, though I did find lots of repeats of a close question: people wanting to choose a single record. Close but really quite a different kind of goal, and less meaningful in my opinion.
I'm posting this question to see if there's some mechanism already in the SQL language that I'm missing, or an efficient workaround anyone has come up with. The concept of a one-to-one vs. one-to-many relationship is so fundamental to relational database design, I'm just so surprised at the absence of this language element.

SQL is two languages in one. Constraints, including uniqueness constraints, are set using the data definition language (DDL) in SQL. This is a layer above the data manipulation language (DML), where SELECT statements live, and it's understood that statements issued in the DDL might invalidate statements in the DML.
There's no way for a query to prevent someone from executing an ALTER TABLE command and changing the name of a field that the query refers to between query runs.
And there isn't much more of a way for a query to be written defensively against uncertain constraints; it's OK if you need to ask someone for information outside of the database environment to address this. The information may also be available within the environment; in most engines, you get to it by querying the data dictionary. This is the INFORMATION_SCHEMA in MySQL, for instance.
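Short of a schema constraint, one workaround is to check the "at most one match" assumption against the data before relying on the join. The sketch below is only an illustration using Python's sqlite3 module; the helper name `duplicate_keys` and the sample rows are hypothetical, and the Person/Location tables are simplified from the question.

```python
import sqlite3

def duplicate_keys(conn, table, column):
    """Return the values of `column` that occur more than once in `table`.

    Identifiers are interpolated into the SQL text, so pass trusted names only.
    """
    sql = (f'SELECT "{column}" FROM "{table}" '
           f'GROUP BY "{column}" HAVING COUNT(*) > 1 ORDER BY "{column}"')
    return [row[0] for row in conn.execute(sql)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (ID INTEGER, Name TEXT, LocationID INTEGER);
    CREATE TABLE Location (ID INTEGER, Address1 TEXT);  -- no UNIQUE on ID
    INSERT INTO Person VALUES (1, 'Ann', 10), (2, 'Bob', 20);
    INSERT INTO Location VALUES
        (10, '1 Main St'), (20, '2 Oak Ave'), (20, '2b Oak Ave');
""")

dupes = duplicate_keys(conn, "Location", "ID")
joined = conn.execute(
    "SELECT COUNT(*) FROM Person "
    "LEFT JOIN Location ON Person.LocationID = Location.ID"
).fetchone()[0]

print(dupes)   # [20]
print(joined)  # 3: Bob appears twice because Location.ID 20 is duplicated
```

A check like this only describes the data at the moment it runs; a UNIQUE or PRIMARY KEY constraint in the DDL is what actually guarantees it.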

SQL Server - lack of NATURAL JOIN / x JOIN y USING(field)

I've just been reading up on NATURAL JOIN / USING - SQL92 features which are (sadly?) missing from SQL Server's current repertoire.
Has anyone come from a DBMS that supported these to SQL Server (or another non-supporting DBMS) - were they as useful as they sound, or a can of worms (which also sounds possible!)?

I never use NATURAL JOIN because I don't like the possibility that the join could do something I don't intend just because some column name exists in both tables.
I do use the USING join syntax occasionally, but just as often it turns out that I need a more complex join condition than USING can support, so I convert it to the equivalent ON syntax after all.

Would you consider a DBMS that was truly relational?:
in Tutorial D [a truly relational
language], the only “join” operator is
called JOIN, and it means “natural
join”... There should be no other kind
of join... Few people have had the
experience of using a proper
relational language. Of those who
have, I strongly suspect that none of
them ever complained about some
perceived inconvenience in pairing
columns according to their names
Source: "The Importance of Column Names" by Hugh Darwen

It's a matter of convenience. Not indispensable, but it should have its place, for example in interactive querying (every keystroke brings us closer to RSI, anyway), or some simple cases of hand written SQL even in production code (yes, I wrote that. And even seen JOIN USING in serious code, written by wise programmers other than myself. But, I'm digressing).
I found this question when looking for confirmation that SS is missing this feature, and I got it. I am only bewildered by the amount of hate against this syntax, which I attribute to the Sour Grapes Syndrome. I feel amused when being lectured in a patronising tone: "Sweets (read: syntactic sugar) are bad for your health. You don't need them anyway."
What is nice in the JOIN USING syntax, is that it works not just on column names, but also on column aliases, for example:
-- foreign key "order".customerId references (customer.id)
SELECT c.*, c.id as customerId, o.* from customer c
join "order" o using (customerId);
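Whether a select-list alias is visible to USING may depend on the engine. A variant that sidesteps the question is to introduce the rename in a derived table, so that both sides of the join really do expose a `customerId` column. The sketch below runs that variant on SQLite through Python's sqlite3 module; the column list and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; "order" is quoted because ORDER is a keyword.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "order" (
        orderId INTEGER PRIMARY KEY,
        customerId INTEGER REFERENCES customer(id),
        total REAL
    );
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO "order" VALUES (100, 1, 9.5), (101, 1, 20.0);
""")

# The derived table renames customer.id to customerId before the join.
rows = conn.execute("""
    SELECT customerId, name, orderId
    FROM (SELECT id AS customerId, name FROM customer) c
    JOIN "order" o USING (customerId)
    ORDER BY orderId
""").fetchall()
print(rows)  # [(1, 'Ann', 100), (1, 'Ann', 101)]
```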
I don't agree with "Join using would be better, if only (...)". Or the argument, that you may need more complex conditions. From a different point of view, why use JOIN ON? Why not be pure, and move all conditions to the WHERE clause?
SELECT t1.*, t2.* from t1, t2 where t2.t1_id = t1.id;
I could now go mad and argue, how this is the cleanest way to express a join, and you can immediately start adding more conditions in the where clause, which you usually need anyway, blah blah blah...
So you shouldn't miss this particular syntax too dearly, but there's nothing to be happy about for not having it ("Phew, that was close. So good not to have JOIN USING. I was spared a lot of pain").
So, while I personally use JOIN ON 99% of the time, I feel no Schadenfreude when there is no JOIN USING or NATURAL JOIN.

I don't see the value of the USING or NATURAL syntax - as you've encountered, only ON is consistently implemented so it's best from a portability standpoint.
Being explicit is also better for maintenance, besides that the alternatives can be too limited to deal with situations. I'd also prefer my codebase be consistent.

What is so bad about using SQL INNER JOIN

Every time a database diagram gets looked at, one area people are critical of is inner joins. They look at them hard and ask questions to see if an inner join really needs to be there.
Simple Library Example:
A many-to-many relationship is normally defined in SQL with three tables: Book, Category, BookCategory.
In this situation, Category is a table that contains two columns: ID, CategoryName.
In this situation, I have gotten questions about the Category table: is it needed? Can it be used as a lookup table, with the BookCategory table storing the CategoryName instead of the CategoryID, to avoid having to do an additional INNER JOIN? (For this question, we are going to ignore the changing and deleting of any CategoryNames.)
The question is, what is so bad about inner joins? At what point is doing them a negative thing (general guidelines like # of transactions, # of records, # of joins in a statement, etc)?

Your example is a good counterexample. How do you rename categories if they're spread throughout the various rows of the BookCategory table? Your UPDATE to do the rename would touch all the rows in the same category.
With the separate table, you only have to update one row. There is no duplicate information.

I would be more concerned about OUTER joins, and the potential to pick up info that wasn't intended.
In your example, having the Category table means that a book is limited to being filed under a preset Category (via a foreign key relationship). If you just shoved multiple entries into the BookCategory table, it would be harder to limit what is selected for the Category.
Doing an INNER join is not so bad, it is what databases are made for. The only time it is bad is when you are doing it on a table or column that is inadequately indexed.

I am not sure there is something wrong with inner joins per se. It is like how each IF you add to your code impacts performance (or should I say every line...), but still, you need a minimum number of those to make your system work (yes yes, I know about Turing machines).
So if you have something that is not needed, it will be frowned upon.

When you map your domain model onto the relational model you have to split the information across multiple relations in order to get a normalized model - there is no way around that. And then you have to use joins to combine the relations again and get your information back. The only bad thing about this is that joins are relatively expensive.
The other option would be not to normalize your relational model. This will fill your database with much redundant data, give you many opportunities to turn your data inconsistent and make updates a nightmare.
The only reason not to normalize a relational model (I can think of at the moment) is that reading performance is extremely - and I mean extremely - critical.
By the way, why do you (they) only mention inner joins? How are left, right, and full outer joins significantly different from inner joins?

Nobody can offer much about general guidelines - they'd be specific to the server, hardware, database design, and expectations... way too many variables.
Specifically about INNER JOINs being inefficient or bad... JOINs are the center of relational DBs, and they've been around for decades. It's only wrong when you use it wrong, because obviously someone's doing it right since it's not extinct yet. Personally, I'd assume anyone throwing out blanket statements like that either doesn't know SQL or knows just enough to get in trouble. Next time it comes up, teach them how to use the query cache.
(Not mentioning update/delete, but you didn't say inserts!: the increased maintainability through avoiding humans and their typos can easily be worth at least 10x the time a join will take.)

Denormalizing for sanity or performance?

I've started a new project and they have a very normalized database. Everything that can be a lookup is stored as a foreign key to the lookup table. This is normalized and fine, but I end up doing 5 table joins for the simplest queries.
from va in VehicleActions
join vat in VehicleActionTypes on va.VehicleActionTypeId equals vat.VehicleActionTypeId
join ai in ActivityInvolvements on va.VehicleActionId equals ai.VehicleActionId
join a in Agencies on va.AgencyId equals a.AgencyId
join vd in VehicleDescriptions on ai.VehicleDescriptionId equals vd.VehicleDescriptionId
join s in States on vd.LicensePlateStateId equals s.StateId
where va.CreatedDate > DateTime.Now.AddHours(-DateTime.Now.Hour)
select new {va.VehicleActionId,a.AgencyCode,vat.Description,vat.Code,
vd.LicensePlateNumber,LPNState = s.Code,va.LatestDateTime,va.CreatedDate}
I'd like to recommend that we denormalize some stuff, like the state code. I don't see the state codes changing in my lifetime. Similar story with the 3-letter agency code: these are handed out by the agency of agencies and will never change.
When I approached the DBA with the state code issue and the 5 table joins, I got the response that "we are normalized" and that "joins are fast".
Is there a compelling argument to denormalize? I'd do it for sanity if nothing else.
the same query in T-SQL:
SELECT VehicleAction.VehicleActionID
, Agency.AgencyCode AS ActionAgency
, VehicleActionType.Description
, VehicleDescription.LicensePlateNumber
, State.Code AS LPNState
, VehicleAction.LatestDateTime AS ActionLatestDateTime
, VehicleAction.CreatedDate
FROM VehicleAction INNER JOIN
VehicleActionType ON VehicleAction.VehicleActionTypeId = VehicleActionType.VehicleActionTypeId INNER JOIN
ActivityInvolvement ON VehicleAction.VehicleActionId = ActivityInvolvement.VehicleActionId INNER JOIN
Agency ON VehicleAction.AgencyId = Agency.AgencyId INNER JOIN
VehicleDescription ON ActivityInvolvement.VehicleDescriptionId = VehicleDescription.VehicleDescriptionId INNER JOIN
State ON VehicleDescription.LicensePlateStateId = State.StateId
Where VehicleAction.CreatedDate >= floor(cast(getdate() as float))

I don't know if I would even call what you want to do denormalization -- it looks more like you just want to replace artificial foreign keys (StateId, AgencyId) with natural foreign keys (State Abbreviation, Agency Code). Using varchar fields instead of integer fields will slow down join/query performance, but (a) if you don't even need to join the table most of the time because the natural FK is what you want anyway it's not a big deal and (b) your database would need to be pretty big/have a high load for it to be noticeable.
But djna is correct in that you need a complete understanding of current and future needs before making a change like this. Are you SURE the three letter agency codes will never change, even five years from now? Really, really sure?

Some denormalization can be needed for performance (and sanity) reasons at times. Hard to tell without seeing all your tables / needs etc...
But why not just build a few convenience views (to do a few joins) and then use these to be able to write simpler queries?

Beware of wanting to shape things to your current idioms. Right now the unfamiliar code seems unwieldy and obstructive to your understanding. In time it's possible that you will become acclimatised.
If current (or known future) requirements, such as performance are not being met then that's a whole different issue. But remember anything can be performance tuned, the objective is not to make things as fast as possible, but to make them fast enough.

This previous post dealt with a similar issue to the one you're having. Hopefully it will be helpful to you.
Dealing with "hypernormalized" data
My own personal take on normalization is to normalize as much as possible, and denormalize only for performance. And even denormalization for performance is something to avoid. I'd go the route of profiling, setting correct indexes, etc. before I'd denormalize.
Sanity... That's overrated. Especially in our profession.

Well, what about the performance? If the performance is okay, just make the five table JOIN into a view and, for sanity, SELECT from the view when you need the data.
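The view suggestion can be sketched as follows, using Python's sqlite3 module. The schema is a heavily simplified, assumed version of the one in the question (three tables instead of six, with the plate columns folded into VehicleAction), and the view name `VehicleActionSummary` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the tables in the question.
    CREATE TABLE Agency (AgencyId INTEGER PRIMARY KEY, AgencyCode TEXT);
    CREATE TABLE State (StateId INTEGER PRIMARY KEY, Code TEXT);
    CREATE TABLE VehicleAction (
        VehicleActionId INTEGER PRIMARY KEY,
        AgencyId INTEGER REFERENCES Agency(AgencyId),
        LicensePlateStateId INTEGER REFERENCES State(StateId),
        LicensePlateNumber TEXT
    );
    INSERT INTO Agency VALUES (1, 'ABC');
    INSERT INTO State VALUES (1, 'TX');
    INSERT INTO VehicleAction VALUES (1, 1, 1, 'XYZ123');

    -- The joins are written once, here, and the tables stay normalized.
    CREATE VIEW VehicleActionSummary AS
    SELECT va.VehicleActionId,
           a.AgencyCode,
           va.LicensePlateNumber,
           s.Code AS LPNState
    FROM VehicleAction va
    JOIN Agency a ON va.AgencyId = a.AgencyId
    JOIN State s ON va.LicensePlateStateId = s.StateId;
""")

# Day-to-day queries read like queries against a denormalized table.
rows = conn.execute(
    "SELECT AgencyCode, LPNState FROM VehicleActionSummary "
    "WHERE VehicleActionId = 1"
).fetchall()
print(rows)  # [('ABC', 'TX')]
```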
State abbreviations are one of the cases in which I think meaningful keys are okay. For very simple lookup tables with a limited number of rows and where I'm in complete control of the data (meaning it's not populated from some outside source) I'll sometimes create meaningful four or five character keys so that the key value can proxy for the fully descriptive lookup value in some queries.

Create a view (or inline table-valued function to get parameterization). In any case, I usually put all my code into SPs (some code generated) whether they use views or not and that's that, you pretty much only ever write the join once.

An argument (for this "normalization") that the three-letter codes might change isn't very compelling without a plan for what you will do if the codes do change, and how your artificial-key scenario will address this eventuality better than using the codes as keys. Unless you've implemented a fully temporal schema (which is horribly difficult to do and not suggested by your example), it's not obvious to me how your normalization benefits you at all. Now if you work with agencies from multiple sources and standards that might have colliding code names, or if "state" might eventually mean a two-letter code for state, province, department, canton, or estado, that's another matter. You then need your own keys or you need a two-column key with more information than that code.

Has anyone written a higher level query language (than SQL) that generates SQL for common tasks, on limited schemas

SQL is the standard in query languages; however, it is sometimes a bit verbose. I am currently writing a limited query language that will make my common queries quicker to write and with a bit less mental overhead.
If you write a query over a good database schema, essentially you will always be joining over the primary key / foreign key fields, so I think it should be unnecessary to have to state them each time.
So a query could look like.
select s.name, region.description from shop s
where monthly_sales.amount > 4000 and s.staff < 10
The relations would be
shop -- many to one -- region,
shop -- one to many -- monthly_sales
The SQL that this would be equivalent to is
select distinct s.name, r.description
from shop s
join region r on s.region_id = r.region_id
join monthly_sales ms on ms.shop_id = s.shop_id
where ms.amount > 4000 and s.staff < 10
(the distinct is there as you are joining to a one to many table (monthly_sales) and you are not selecting off fields from that table)
I understand that the original query above may be ambiguous for certain schemas, i.e. if there are two relationship routes between two of the tables. However, there are ways around most of these, especially if you limit the schemas allowed. Most possible schemas are not worth considering anyway.
I was just wondering if there have been any attempts to do something like this?
(I have seen most orm solutions to making some queries easier)
EDIT: I actually really like sql. I have used orm solutions and looked at linq. The best I have seen so far is SQLalchemy (for python). However, as far as I have seen they do not offer what I am after.
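The core of the idea in this question, deriving the join conditions from declared relationships instead of restating them in each query, can be sketched in a few lines of Python. Everything here is hypothetical: the `FOREIGN_KEYS` map, the `join_clause` helper, and the assumption that there is exactly one relationship between any two tables and that both sides use the same column name.

```python
# Hypothetical schema description: (child table, parent table) -> join column.
FOREIGN_KEYS = {
    ("shop", "region"): "region_id",
    ("monthly_sales", "shop"): "shop_id",
}

def join_clause(base, others):
    """Build JOIN ... ON clauses linking each table in `others` to `base`,
    using the single declared relationship between the two tables."""
    clauses = []
    for table in others:
        col = FOREIGN_KEYS.get((base, table)) or FOREIGN_KEYS.get((table, base))
        if col is None:
            raise ValueError(f"no declared relationship between {base} and {table}")
        clauses.append(f"join {table} on {table}.{col} = {base}.{col}")
    return "\n".join(clauses)

sql = (
    "select distinct shop.name, region.description\n"
    "from shop\n"
    + join_clause("shop", ["region", "monthly_sales"])
    + "\nwhere monthly_sales.amount > 4000 and shop.staff < 10"
)
print(sql)
```

This only handles tables directly related to the base table; the ambiguity mentioned in the question (two routes between the same pair of tables) is exactly the case where a lookup like this would need extra information.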

Hibernate and LinqToSQL do exactly what you want

I think you'd be better off spending your time just writing more SQL and becoming more comfortable with it. Most developers I know have gone through just this progression, where their initial exposure to SQL inspires them to bypass it entirely by writing their own ORM or set of helper classes that auto-generates the SQL for them. Usually they continue adding to it and refining it until it's just as complex (if not more so) than SQL. The results are sometimes fairly comical - I inherited one application that had classes named "And.cs" and "Or.cs", whose main functions were to add the words " AND " and " OR ", respectively, to a string.
SQL is designed to handle a wide variety of complexity. If your application's data design is simple, then the SQL to manipulate that data will be simple as well. It doesn't make much sense to use a different sort of query language for simple things, and then use SQL for the complex things, when SQL can handle both kinds of thing well.

I believe that any (decent) ORM would be of help here..

Entity SQL is slightly higher level (in places) than Transact SQL. Other than that, HQL, etc. For object-model approaches, LINQ (IQueryable<T>) is much higher level, allowing simple navigation:
var qry = from cust in db.Customers
select cust.Orders.Sum(o => o.OrderValue);
etc

Martin Fowler plumbed a whole load of energy into this and produced the Active Record pattern. I think this is what you're looking for?

Not sure if this falls in what you are looking for but I've been generating SQL dynamically from the definition of the Data Access Objects; the idea is to reflect on the class and by default assume that its name is the table name and all properties are columns. I also have search criteria objects to build the where part. The DAOs may contain lists of other DAO classes and that directs the joins.
Since you asked for something to take care of most of the repetitive SQL, this approach does it. And when it doesn't, I just fall back on handwritten SQL or stored procedures.
