Why not have a JOINONE keyword in SQL to hint and enforce that each record has at most one match? - sql

I encounter this a lot when writing SQL. I have two tables that are meant to be in a one-to-one relationship with each other, and I wish I could easily assert that fact in my query. For example, the simplified query:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOIN Location ON Person.LocationID = Location.ID
When I read this query I think to myself, well what if the Location table fails to enforce uniqueness on its ID column? Suddenly you could have the same Person multiple times in your resultset. Sure, I can go look at the schema to assure myself it's unique so everything will be okay, but why shouldn't I simply be able to put it right here in my query, a la:
SELECT Person.ID, Person.Name, Location.Address1
FROM Person
LEFT JOINONE Location ON Person.LocationID = Location.ID
Not only would a keyword like this (made up "JOINONE") make it 100% clear to a human reading this query that we are guaranteed to get exactly one row for each Person record, but it lets the db engine optimize its execution plan because it knows there won't be more than one match each, even if the foreign key relationship isn't defined in the schema.
Another advantage of this would be that the db engine could enforce it, so if the data actually did have more than one match, an error could be thrown. This happens for subqueries already, e.g.:
SELECT Person.ID, Person.Name
, (
SELECT Location.Address1
FROM Location
WHERE Location.ID = Person.Location
) AS Address1
FROM Person
This is nice and spiffy, 100% clear to the human reader, neatly optimizable, and enforced by the db engine. In fact I often end up doing things this way for all those reasons. The problem is, besides the distracting syntax, you can only select one field this way. (What if I want City, State, and Zip too?) How nice it would be if you could flow this table right along with the rest of your JOINs and select any fields from it you wish in your SELECT clause just like all the rest of your tables.
I couldn't find any other question like this around StackOverflow, though I did find lots of repeats of a close question: people wanting to choose a single record. Close but really quite a different kind of goal, and less meaningful in my opinion.
I'm posting this question to see if there's some mechanism already in the SQL language that I'm missing, or an efficient workaround anyone has come up with. The concept of a one-to-one vs. one-to-many relationship is so fundamental to relational database design, I'm just so surprised at the absence of this language element.

SQL is two languages in one. Constraints, including uniqueness constraints, are set using the data definition language (DDL) in SQL. This is a layer above the data manipulation language (DML), where SELECT statements live, and it's understood that statements issued in the DDL might invalidate statements in the DML.
There's no way for a query to prevent someone from executing an ALTER TABLE command and changing the name of a field that the query refers to between query runs.
And there isn't much more of a way for a query to be written defensively against uncertain constraints; it's OK if you need to ask someone for information outside of the database environment to address this. The information may also be available within the environment; in most engines, you get to it by querying the data dictionary. This is the INFORMATION_SCHEMA in MySQL, for instance.

Related

What are SQLite triggers useful for?

What are SQLite triggers good for compared to (pre-compiled/macro-style) sequences of statements? Any useful optimization perhaps? Or just a matter of programmer taste?
I find triggers rather hard to use in several cases since Common Table Expressions such as SELECT won´t work in triggers. Maybe I´m looking at this the wrong way. Hence the generic question.
A trigger runs when a thing happens.
Typically, you wouldn't be selecting anything in a trigger, because there's nowhere for the output result to go. Rather, you'd be updating something else - for example, say you're building a family tree database, and you have a table for Individuals, as well as one for Families.
In that example, you might have a trigger such that when a record is inserted into the Individuals table, you might also create a record in the Families table if you don't already have an entry in Families for the LastName of the Individual.

What is so bad about using SQL INNER JOIN

Every time a database diagram gets looked out, one area people are critical of is inner joins. They look at them hard and has questions to see if an inner join really needs to be there.
Simple Library Example:
A many-to-many relationship is normally defined in SQL with three tables: Book, Category, BookCategory.
In this situation, Category is a table that contains two columns: ID, CategoryName.
In this situation, I have gotten questions about the Category table, is it need? Can it be used as a lookup table, and in the BookCategory table store the CategoryName instead of the CategoryID to stop from having to do an additional INNER JOIN. (For this question, we are going to ignore the changing, deleting of any CategoryNames)
The question is, what is so bad about inner joins? At what point is doing them a negative thing (general guidelines like # of transactions, # of records, # of joins in a statement, etc)?
Your example is a good counterexample. How do you rename categories if they're spread throughout the various rows of the BookCategory table? Your UPDATE to do the rename would touch all the rows in the same category.
With the separate table, you only have to update one row. There is no duplicate information.
I would be more concerned about OUTER joins, and the potential to pick up info that wasn't intended.
In your example, having the Category table means that a book is limited to being filed under a preset Category (via a foriegn key relationship), if you just shoved multiple entries in to the BookCategory table then it would be harder to limit what is selected for the Category.
Doing an INNER join is not so bad, it is what databases are made for. The only time it is bad is when you are doing it on a table or column that is inadequately indexed.
I am not sure there is some thing wrong in inner join per se, it is like each IF you add to your code impacts performance (or should I say every line...), but still, you need a minimum number of those to make your system work (yes yes, I know about Turing machines).
So if you have something that is not needed, it will be frowned upon.
When you map your domain model onto the relational model you have to split the information across multiple relations in order to get a normalized model - there is no way around that. And then you have to use joins to combine the relations again and get your information back. The only bad thing about this is that joins are relative expensive.
The other option would be not to normalize your relational model. This will fill your database with much redundant data, give you many opportunities to turn your data inconsistent and make updates a nightmare.
The only reason not to normalize a relational model (I can think of at the moment) is that reading performance is extremely - and I mean extremely - critical.
By the way, why do you (they) only mention inner joins? How are left, right, and full outer joins significantly different from inner joins?
Nobody can offer much about general guidelines - they'd be specific to the server, hardware, database design, and expectations... way too many variables.
Specifically about INNER JOINs being inefficient or bad... JOINs are the center of relational DBs, and they've been around for decades. It's only wrong when you use it wrong, because obviously someone's doing it right since it's not extinct yet. Personally, I'd assume anyone throwing out blanket statements like that either don't know SQL or know just enough to get in trouble. Next time it comes up, teach them how to use the query cache.
(Not mentioning update/delete, but you didn't say inserts!: the increased maintainability through avoiding humans and their typos can easily be worth at least 10x the time a join will take.)

Strategy for avoiding a common sql development error (misleading result on join bug)

Sometimes when i'm writing moderately complex SELECT statements with a few JOINs, wrong key columns are sometimes used in the JOIN statement that still return valid-looking results.
Because the auto numbering values (especially early in development) all tend to fall in similar ranges (sub 100s or so) the SELECT sill produces some results. These results often look valid at first glance and a problem is not detected until much, much later making debugging much more difficult because familiarity with the data structures and code has staled. (Gone stale in the dev's mind.)
i just spent several hours tracking down yet another of this issue that i've run into a too many times before. i name my tables and columns carefully, write my SQL statements methodically but this is an issue i can't seem to competely avoid. It comes back and bites me for hours of productivity about twice a year on average.
My question is: Has anyone come up with a clever method for avoiding this; what i assume is probably a common SQL bug/mistake?
i have thought of trying to auto-number starting with different start values but this feels cludgy and would get ugly trying to keep such a scheme straight for data models with dozens of tables... Any better ideas?
P.S.
i am very careful and methodical in naming my tables and columns. Patient table gets PatientId column, Facility get a FacilityId etc. This issues tends to arise when there are join tables involved where the linkage takes on extra meaning such as: RelatedPatientId, ReferingPatientId, FavoriteItemId etc.
When writing long complex SELECT statements try to limit the result to one record.
For instance, assume you have this gigantic enormous awesome CMS system and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables. Your select statement joins 30 of these tables. Your result should limit your row count by using a WHERE clause.
My advice is to rather then get all this code written and generalized for all cases, break the problem up and use WHERE and limit the row count to only say a record. Check all fields, if they look ok, break it up and let your code return more rows. Only after further checking should you generalize.
It bites a lot of us who keep adding more and more joins until it seems to look ok, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!
One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.
If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.
Your column names should take care of this unless you named them all "ID". Are you writing multiple select statement using the same tables? You may want to create views for the more common ones.
If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.
You could use GUIDs as your primary keys, but it has its pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!
For IDs use TableNameID, for example for table Person, use PersonID
Use db model and look at the drawing when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.

Denormalizing for sanity or performance?

I've started a new project and they have a very normalized database. everything that can be a lookup is stored as the foreign key to the lookup table. this is normalized and fine, but I end up doing 5 table joins for the simplest queries.
from va in VehicleActions
join vat in VehicleActionTypes on va.VehicleActionTypeId equals vat.VehicleActionTypeId
join ai in ActivityInvolvements on va.VehicleActionId equals ai.VehicleActionId
join a in Agencies on va.AgencyId equals a.AgencyId
join vd in VehicleDescriptions on ai.VehicleDescriptionId equals vd.VehicleDescriptionId
join s in States on vd.LicensePlateStateId equals s.StateId
where va.CreatedDate > DateTime.Now.AddHours(-DateTime.Now.Hour)
select new {va.VehicleActionId,a.AgencyCode,vat.Description,vat.Code,
vd.LicensePlateNumber,LPNState = s.Code,va.LatestDateTime,va.CreatedDate}
I'd like to recommend that we denormaize some stuff. like the state code. I don't see the state codes changing in my lifetime. similar story with the 3-letter agency code. these are handed out by the agency of agencies and will never change.
When I approached the DBA with the state code issue and the 5 table joins. i get the response that "we are normalized" and that "joins are fast".
Is there a compelling argument to denormalize? I'd do it for sanity if nothing else.
the same query in T-SQL:
SELECT VehicleAction.VehicleActionID
, Agency.AgencyCode AS ActionAgency
, VehicleActionType.Description
, VehicleDescription.LicensePlateNumber
, State.Code AS LPNState
, VehicleAction.LatestDateTime AS ActionLatestDateTime
, VehicleAction.CreatedDate
FROM VehicleAction INNER JOIN
VehicleActionType ON VehicleAction.VehicleActionTypeId = VehicleActionType.VehicleActionTypeId INNER JOIN
ActivityInvolvement ON VehicleAction.VehicleActionId = ActivityInvolvement.VehicleActionId INNER JOIN
Agency ON VehicleAction.AgencyId = Agency.AgencyId INNER JOIN
VehicleDescription ON ActivityInvolvement.VehicleDescriptionId = VehicleDescription.VehicleDescriptionId INNER JOIN
State ON VehicleDescription.LicensePlateStateId = State.StateId
Where VehicleAction.CreatedDate >= floor(cast(getdate() as float))
I don't know if I would even call what you want to do denormalization -- it looks more like you just want to replace artificial foreign keys (StateId, AgencyId) with natural foreign keys (State Abbreviation, Agency Code). Using varchar fields instead of integer fields will slow down join/query performance, but (a) if you don't even need to join the table most of the time because the natural FK is what you want anyway it's not a big deal and (b) your database would need to be pretty big/have a high load for it to be noticeable.
But djna is correct in that you need a complete understanding of current and future needs before making a change like this. Are you SURE the three letter agency codes will never change, even five years from now? Really, really sure?
Some denormalization can be needed for performance (and sanity) reasons at some times. Hard to tell wihout seeing all your tables / needs etc...
But why not just build a few convenience views (to do a few joins) and then use these to be able to write simpler queries?
Beware of wanting to shape things to your current idioms. Right now the unfamiliar code seems unweildy and obstructive to your understanding. In time it's possible that you will become acclimatised.
If current (or known future) requirements, such as performance are not being met then that's a whole different issue. But remember anything can be performance tuned, the objective is not to make things as fast as possible, but to make them fast enough.
This previous post dealt with a similar issue to the one you're having. Hopefully it will be helpful to you.
Dealing with "hypernormalized" data
My own personal take on normalization is to normalize as much as possible, but denormalize only for performance. And evn the denormalization for performance is something to avoid. I'd go the route of profiling,setting correct indexes, etc before I'd denormalize.
Sanity... That's overrated. Especially in our profession.
Well, what about the performance? If the performance is okay, just make the five table JOIN into a view and, for sanity, SELECT from the view when you need the data.
State abbreviations are one of the cases in which I think meaningful keys are okay. For very simple lookup tables with a limited number of rows and where I'm in complete control of the data (meaning it's not populated from some outside source) I'll sometimes create meaningful four or five character keys so that the key value can proxy for the fully descriptive lookup value in some queries.
Create a view (or inline table-valued function to get parameterization). In any case, I usually put all my code into SPs (some code generated) whether they use views or not and that's that, you pretty much only ever write the join once.
An argument (for this "normalization") that the three-letter codes might change isn't very compelling without a plan for what you will do if the codes do change, and how your artificial-key scenario will address this eventuality better than using the codes as keys. Unless you've implemented a fully temporal schema (which is horribly difficult to do and not suggested by your example), it's not obvious to me how your normalization benefits you at all. Now if you work with agencies from multiple sources and standards that might have colliding code names, or if "state" might eventually mean a two-letter code for state, province, department, canton, or estado, that's another matter. You then need your own keys or you need a two-column key with more information than that code.

When to use SQL Table Alias

I'm curious to know how people are using table aliases. The other developers where I work always use table aliases, and always use the alias of a, b, c, etc.
Here's an example:
SELECT a.TripNum, b.SegmentNum, b.StopNum, b.ArrivalTime
FROM Trip a, Segment b
WHERE a.TripNum = b.TripNum
I disagree with them, and think table aliases should be use more sparingly.
I think they should be used when including the same table twice in a query, or when the table name is very long and using a shorter name in the query will make the query easier to read.
I also think the alias should be a descriptive name rather than just a letter. In the above example, if I felt I needed to use 1 letter table alias I would use t for the Trip table and s for the segment table.
There are two reasons for using table aliases.
The first is cosmetic. The statements are easier to write, and perhaps also easier to read when table aliases are used.
The second is more substantive. If a table appears more than once in the FROM clause, you need table aliases in order to keep them distinct. Self joins are common in cases where a table contains a foreign key that references the primary key of the same table.
Two examples: an employees table that contains a supervisorID column that references the employeeID of the supervisor.
The second is a parts explosion. Often, this is implemented in a separate table with three columns: ComponentPartID, AssemblyPartID, and Quantity. In this case, there won't be any self joins, but there will often be a three way join between this table and two different references to the table of Parts.
It's a good habit to get into.
I use them to save typing. However, I always use letters similar to the function. So, in your example, I would type:
SELECT t.TripNum, s.SegmentNum, s.StopNum, s.ArrivalTime
FROM Trip t, Segment s
WHERE t.TripNum = s.TripNum
That just makes it easier to read, for me.
As a general rule I always use them, as there are usually multiple joins going on in my stored procedures. It also makes it easier when using code generation tools like CodeSmith to have it generate the alias name automatically for you.
I try to stay away from single letters like a & b, as I may have multiple tables that start with the letter a or b. I go with a longer approach, the concatenation of the referenced foreign key with the alias table, for example CustomerContact ... this would be the alias for the Customer table when joining to a Contact table.
The other reason I don't mind longer name, is due to most of my stored procedures are being generated via code CodeSmith. I don't mind hand typing the few that I may have to build myself.
Using the current example, I would do something like:
SELECT TripNum, TripSegment.SegmentNum, TripSegment.StopNum, TripSegment.ArrivalTime
FROM Trip, Segment TripSegment
WHERE TripNum = TripSegment.TripNum
Can I add to a debate that is already several years old?
There is another reason that no one has mentioned. The SQL parser in certain databases works better with an alias. I cannot recall if Oracle changed this in later versions, but when it came to an alias, it looked up the columns in the database and remembered them. When it came to a table name, even if it was already encountered in the statement, it re-checked the database for the columns. So using an alias allowed for faster parsing, especially of long SQL statements. I am sure someone knows if this is still the case, if other databases do this at parse time, and if it changed, when it changed.
I use it always, reasons:
leaving full tables names in statements makes them hard to read, plus you cannot have a same table twice
not using anything is a very bad idea, because later you could add some field to one of the tables that is already present in some other table
Consider this example:
select col1, col2
from tab1
join tab2 on tab1.col3 = tab2.col3
Now, imagine a few months later, you decide to add column named 'col1' to tab2. Database will silently allow you to do that, but applications would break when executing the above query because of ambiguity between tab1.col1 and tab2.col1.
But, I agree with you on the naming: a, b, c is fine, but t and s would be much better in your example. And when I have the same table more than once, I would use t1, t2, ... or s1, s2, s3...
Using the full name makes it harder to read, especially for larger queries or the Order/Product/OrderProduct scenari0
I'd use t and s. Or o/p/op
If you use SCHEMABINDING then columns must be qualified anyway
If you add a column to a base table, then the qualification reduces the chance of a duplicate in the query (for example a "Comment" column)
Because of this qualification, it makes sense to always use aliases.
Using a and b is blind obedience to a bizarre standard.
In simple queries I do not use aliases. In queries whit multiple tables I always use them because:
they make queries more readable (my
aliases are 2 or more capital letters that is a shortcut for the table name and if possible
a relationship to other
tables)
they allow faster developing and
rewriting (my table names are long and have prefixes depending on role they pose)
so instead of for example:
SELECT SUM(a.VALUE)
FROM Domesticvalues a, Foreignvalues b
WHERE a.Value>b.Value
AND a.Something ...
I write:
select SUM(DVAL.Value)
from DomesticValues DVAL, ForeignValues FVAL
where DVAL.Value > FVAL.Value
and DVAL.Something ...
There are many good ideas in the posts above about when and why to alias table names. What no one else has mentioned is that it is also beneficial in helping a maintainer understand the scope of tables. At our company we are not allowed to create views. (Thank the DBA.) So, some of our queries become large, even exceeding the 50,000 character limit of a SQL command in Crystal Reports. When a query aliases its tables as a, b, c, and a subquery of that does the same, and multiple subqueries in that one each use the same aliases, it is easy for one to mistake what level of the query is being read. This can even confuse the original developer when enough time has passed. Using unique aliases within each level of a query makes it easier to read because the scope remains clear.
I feel that you should use them as often as possible but I do agree that t & s represent the entities better than a & b.
This boils down to, like everything else, preferences. I like that you can depend on your stored procedures following the same conventions when each developer uses the alias in the same manner.
Go convince your coworkers to get on the same page as you or this is all worthless. The alternative is you could have a table Zebra as first table and alias it as a. That would just be cute.
i only use them when they are necessary to distinguish which table a field is coming from
select PartNumber, I.InventoryTypeId, InventoryTypeDescription
from dbo.Inventory I
inner join dbo.InventoryType IT on IT.InventoryTypeId = I.InventoryTypeId
In the example above both tables have an InventoryTypeId field, but the other field names are unique.
Always use an abbreviation for the table as the name so that the code makes more sense - ask your other developers if they name their local variables A, B, C, etc!
The only exception is in the rare cases where the SQL syntax requires a table alias but it isn't referenced, e.g.
select *
from (
select field1, field2, sum(field3) as total
from someothertable
) X
In the above, SQL syntax requires the table alias for the subselect, but it isn't referenced anywhere so I get lazy and use X or something like that.
I find it nothing more than a preference. As mentioned above, aliases save typing, especially with long table/view names.
One thing I've learned is that especially with complex queries; it is far simpler to troubleshoot six months later if you use the alias as a qualifier for every field reference. Then you aren't trying to remember which table that field came from.
We tend to have some ridiculously long table names, so I find it easier to read if the tables are aliased. And of course you must do it if you are using a derived table or a self join, so being in the habit is a good idea. I find most of our developers end up using the same alias for each table in all their sps,so most of the time anyone reading it will immediately know what pug is the alias for or mmh.
I always use them. I formerly only used them in queries involving just one table but then I realized a) queries involving just one table are rare, and b) queries involving just one table rarely stay that way for long. So I always put them in from the start so that I (or someone else) won't have to retro fit them later. Oh and BTW: I call them "correlation names", as per the SQL-92 Standard :)
Tables aliases should be four things:
Short
Meaningful
Always used
Used consistently
For example if you had tables named service_request, service_provider, user, and affiliate (among many others) a good practice would be to alias those tables as "sr", "sp", "u", and "a", and do so in every query possible. This is especially convenient if, as is often the case, these aliases coincide with acronyms used by your organization. So if "SR" and "SP" are the accepted terms for Service Request and Service Provider respectively, the aliases above carry a double payload of intuitively standing in for both the table and the business object it represents.
The obvious flaws with this system are first that it can be awkward for table names with lots of "words" e.g. a_long_multi_word_table_name which would alias to almwtn or something, and that it's likely you'll end up with tables named such that they abbreviate the same. The first flaw can be dealt with however you like, such as by taking the last 3 or 4 letters, or whichever subset you feel is most representative, most unique, or easiest to type. The second I've found in practice isn't as troublesome as it might seem, perhaps just by luck. You can also do things like take the second letter of a "word" in the table as well, such as aliasing account_transaction to "atr" instead of "at" to avoid conflicting with account_type.
Of course whether you use the above approach or not, aliases should be short because you'll be typing them very very frequently, and they should always be used because once you've written a query against a single table and omitted the alias, it's inevitable that you'll later need to edit in a second table with duplicate column names.
Because I always fully qualify my tables, I use aliases to provide a shorter name for the fields being SELECTed when JOINed tables are involved. I think it makes my code easier to follow.
If the query deals with only one source - no JOINs - then I don't use an alias.
Just a matter of personal preference.
Always. Make it a habit.