PK Query on M2M Relation - sql

I was reading about Neo4j, a graph database, and how it compares to the relational model. Here is one thing it mentions about how to query a M2M join for the "Departments" associated with a single user:
I would think though if I knew beforehand I'm just looking up a single row-by-PK and there are less than 5 departments likely for that user, I would write the query as follows:
SELECT name FROM department WHERE department_id IN (
    SELECT department_id FROM PersonDepartment WHERE user_id IN (
        SELECT pk FROM Person WHERE name = 'Alice' -- assume unique name
    )
)
I'm sure writing this in the more common 'join format' would be optimized by the RDBMS into something closer to the above, but I'm using this form just to show that the query seems like it would take almost no time to execute. Or am I wrong here? On the other hand, writing it in the more concise Cypher form of (p:Person {name: "Alice"})-[:BELONGS_TO]->(d:Department) is much simpler to read and write.
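For reference, the same lookup in the 'join format' might be sketched as follows, using the ad-hoc names from the snippet above (Person.pk, PersonDepartment, department):

```sql
-- Join form of the nested-IN query above; the names (Person.pk,
-- PersonDepartment, department) are the ad-hoc ones from the question.
SELECT d.name
FROM   Person p
JOIN   PersonDepartment pd ON pd.user_id      = p.pk
JOIN   department d        ON d.department_id = pd.department_id
WHERE  p.name = 'Alice';   -- assume unique name
```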

Preliminary
First, let us get some issues that confuse the problem out of the way, so that we can answer the question in a straightforward manner.
The text blurb in the graphic
It is completely dishonest, the typical Straw Man argument, used to demean what he is against, and to elevate what he is for. He poses the Relational method as something that it is not (a 1960's Record Filing System with IDs as "Primary Keys"), and then takes it down
Whoopee, he destroyed his own concoction, his Straw Man
The Relational method remains, unaffected
Nevertheless, the uneducated will be confused.
ID fields as a "Primary Key"
The Relational Model explicitly prohibits IDs, which are Physical. The PK must be "made up from the data", which is Logical.
Further, the file contains duplicate rows (IDs do not provide row uniqueness, which is demanded in the RM)
IDs complicate DML code, and force more JOINs (that are not required in the equivalent Relational database), which the dear professor is counting on, in his erection of the Straw Man
The IDs need to be removed, and proper Relational Keys need to be implemented
Relational Integrity, which is Logical (as distinct from Referential Integrity, which is Physical), is lost, not possible
Full detail in Relational schema for a book graph.
No one in their right mind is going to step through those three tables in that way, let alone prescribe it.
he is using procedural code, such as in a CURSOR, which is anti-Relational, and stupefyingly slow
the RM and SQL are based on Set Theory, so use set verbs, such as SELECT, and select only what you need
the proposition is a single set, a single SELECT fulfils it.
Questions
I would think though if I knew beforehand I'm just looking up a single row-by-PK and there are less than 5 departments likely for that user, I would write the query as follows: ...
Definitely not. Even with the IDs
the population in each table is irrelevant (the PK has a unique index)
let us assume 10,000 Persons; 10,000 Departments; 16,000,000 PersonDepartments
performance should never be considered when modelling, or when writing DML code
it should be considered only when some code performs badly, with a view to improving it.
Other than for the purpose of clarifying your question, that code can be dismissed.
I'm sure writing this in the more common 'join format' would be optimized by the RDBMS
Yes.
with a genuine SQL Platform, it will do many things re optimisation, at many levels: parsing; determination of a Query Plan; considerations of Statistics; etc.
with the freeware "SQLs", it does a mickey mouse version of that (at best), or none at all (at worst). Which is why performance is a consideration everywhere; but that is abnormal, sub-standard.
into something closer to the above
Definitely not. That is a dog's breakfast. The platform will create a very elegant and optimised Query Plan, and then a hierarchic Query Tree (a run-time executable, which can be shared).
but I'm using the above just to show how the above query seems like it would take almost no time to execute, or am I wrong here?
No, you are right. In the sense that either the horrible code example operating on a RFS, or the correct code operating on a Relational database, will execute in millisecs, "almost no time".
Relational Data Model
If you wish to evaluate what he intended in his proposition (what departments does Alice work for), without the dishonesty of his Straw Man, using a Relational database (no IDs, proper Relational Keys), we need a data model.
All my models are rendered in IDEF1X, the one and only Standard for Relational data modelling. (ERD cannot be used.)
The IDEF1X Introduction is essential reading.
The code is simple.
SELECT NameFirst,
DepartmentCode
FROM Person P
JOIN Employee E ON P.PersonNo = E.EmployeeNo
WHERE NameFirst = 'Alice'
This code may produce a more meaningful result set; it is still a single, simple SELECT.
SELECT NameLast,
NameFirst,
D.Name,
EmploymentDate
FROM Person P
JOIN Employee E ON P.PersonNo = E.EmployeeNo
JOIN Department D ON E.DepartmentCode = D.DepartmentCode
WHERE NameFirst = 'Alice'
Comments
One question regarding the "no IDs, proper keys" -- doesn't the PersonNo act the same way as would an autoincrementing PK to identify a person?
Yes.
Except that AUTOINCREMENT/IDENTITY columns have horrendous maintenance problems, thus we do not allow them in Production, thus we do not allow them in Development that is not intended for Production.
The alternative for an INSERT is:
...
PersonNo = (
SELECT MAX( PersonNo ) + 1
FROM Person
)
...
Of course, for high performance OLTP, there are other methods.
Never use the Oracle method, which is a file of records, each holding a next-sequential-number for some other file.
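As a rough sketch only (the column names here are hypothetical, and the statement must run inside a transaction at an isolation level that serialises the MAX read, so two sessions cannot obtain the same number), the pattern in a full INSERT looks like:

```sql
-- Hypothetical sketch of the MAX()+1 key allocation inside an INSERT.
-- Requires a transaction/isolation level that serialises the MAX read.
BEGIN TRANSACTION;
INSERT INTO Person ( PersonNo, NameLast, NameFirst )
    SELECT COALESCE( MAX( PersonNo ), 0 ) + 1,   -- 0 seeds an empty table
           'Smith',
           'Alice'
    FROM   Person;
COMMIT;
```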
If we went with PK must be "made up from the data" and no SS# or some uniquely-identifying-person-code, it'd then be just combining a bunch of stuff: FirstName+LastName+BirthPlace+Birthdate (or whatever combination would give enough granularity to guarantee uniqueness)
Yes.  That is answered in full detail in the IDEF1X Introduction, please read.
Short answer ...
this is a true surrogate, not a RecordID (which is falsely called a surrogate).
the only justification is when
the natural PK gets too long (here 7 columns and 120+ bytes), to be carried into subordinate tables as FKs,
and
the table is the top of a data hierarchy, which therefore does not suffer an Access Path Independence breach, as stipulated in Codd's Relational Model.  Which is true in the usage here.
Technically, a surrogate (or RecordID) is a breach of the Relational Key Normal Form.  A properly placed surrogate does not breach the Access Path Independence Rule (there is nothing above the breach to be accessed), whereas a RecordID always does. The user does not see the surrogate or RecordID, because it is not data.
Separately, note that ( LastName, FirstName, Initial, Birthdate, BirthCountry, BirthState, BirthPlace ) is an established international convention (not quite a standard) for identifying a person.

Related

What's the best practice to connect a table to a junction table in relational database design?

I'm building a relational database that will act as a CRM for a travel company. I have removed tables and attributes to make this as simple as possible. Users will send quotes to customers.
A hotel can have many rooms (e.g. hotel 1 can have both a twin room and a triple room).
A room can have many hotels (e.g. both hotels 1 and 2 can have a twin room).
Let's say a customer has a group of 6.
A user could send this customer a quote for hotel 1 with either 3x twin rooms or 2x triple rooms.
A quote will need to contain the hotel and appropriate room type and room type quantities.
What's the best practice to connect table HOTEL_ROOM_JUNCTION to QUOTE, as the key is a multi-attribute, composite key?
Thank you
Noting the Relational Database tag.
Problem
There is a lack of precision in your declarations:
A hotel can have many rooms (e.g. hotel 1 can have both a twin room and a triple room).
A room can have many hotels (e.g. both hotels 1 and 2 can have a twin room).
I think you mean RoomType. From the rest of your declarations, the system you are implementing is for Quotations of rooms across all hotels, not a room booking system for each of the hotels. That is, you need to track RoomType, not Room, per Hotel.
The tables as given are not Relational tables, they do not have any of the requirements that make them Relational. When you start with stamping an id field on every file, it cripples the data analysis & data modelling exercise that is required to create a set of Relational tables. That is anti-Relational:
physical pointers such as record id are expressly prohibited in the Relational Model.
The Primary Key must be "made up from the data".
I appreciate that you have been schooled in that, due to the marketing and promotion of primitive methods as "relational".
For starters, each logical row (not physical record with a record id) must be unique.
The fields in each file should not be prefixed with the filename. In SQL (the data sub-language for the implementation of the Relational Model), the fully qualified address for a column is:
[server.][database.][owner.][table.]column
with defaults (obvious) for each element. If a column is ambiguous, simply prefix it with the table name.
Primary Keys are a special case. In order to avoid confusion (and now, to allow the new NATURAL JOIN), they should be the full name, in both the PK and FK locations. An id on every file would ensure buggy code.
Relational Data Model
If I address all those issues, and model the data according to the Relational Model, it would be:
Notation
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993.
My IDEF1X Introduction is essential reading for those who are new to the Relational Model, or its modelling method. Note that IDEF1X models are rich in detail and precision, showing all required details, whereas home-grown models have far less than that. Which means, the notation has to be understood.
Content
Relational Key
In order to make the logical rows unique, we need to make a Key from the data. The users know their data, they know what is unique and what is not. Usually they will have a ShortName for such things as Company; Hotel; Customer; etc.
If you do not communicate with the user, there is no chance of supplying the user's needs.
Hotel, UserName, Customer are ShortNames, which are unique, which therefore are the Primary Key. (More, later)
Relational Keys are composites, because they preserve the natural data hierarchies. Get used to it.
If you need the DDL for composite Keys, please ask.
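By way of illustration only, here is a minimal DDL sketch of a composite Key carried down a hierarchy; the names and datatypes are assumptions, not the model's actual definitions:

```sql
-- Hypothetical sketch: composite PK in the child, formed from the parent's PK
-- plus a differentiating column, and declared as an FK back to the parent.
CREATE TABLE Hotel (
    Hotel       CHAR(12)    NOT NULL,   -- ShortName, unique
    Name        VARCHAR(60) NOT NULL,
    CONSTRAINT PK_Hotel PRIMARY KEY ( Hotel )
);

CREATE TABLE HotelLocation (
    Hotel       CHAR(12)    NOT NULL,
    Location    CHAR(12)    NOT NULL,
    CONSTRAINT PK_HotelLocation PRIMARY KEY ( Hotel, Location ),
    CONSTRAINT HotelLocation_Exists_In_Hotel
        FOREIGN KEY ( Hotel ) REFERENCES Hotel ( Hotel )
);
```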
Presuming that a Hotel may be a chain or franchise, we need a Location to make a specific hotel that has rooms unique.
The following are discrete Facts, and should not be mixed together (doing so will lead to complex constraints and horrendous SQL code):
HotelRoomType
that a Hotel.Location has a particular RoomType; and the Price
RoomTypeAvailable
that a Hotel.Location has one of those RoomTypes available on a particular Date; and the Number.
I presume there is a file from the hotels that you will be importing on a daily basis: this is the central table for that, with the constraints, of course.
Quote
that an User is providing a Quote that is requested by a single Customer, for a single TravelDate, for a single Hotel.Location. This allows separate Quotes for separate Hotel.Locations for a single TravelDate; Quotes for a Customer for more than one TravelDate; etc.
If you need multiple Hotel.Locations (and their RoomTypes) on a single Quote, let me know in the comments, and I will update the data model.
QuoteRoomType
that a Quote contains a line item which is a single RoomType in the single Hotel.Location that is available on the TravelDate.
Relational Integrity
A logical feature of the Relational Model, which is distinct from Referential Integrity, which is a physical feature in SQL. It is not possible to achieve this in a Record Filing System with record ids as "primary keys", not even an advanced and progressed one (after the various errors in the initial RFS have been corrected). Genuine logical Keys ("made up from the data") are required.
In RoomTypeAvailable, we have constrained:
RoomTypes to that which the Hotel.Location actually has (in HotelRoomType)
AND is actually available on Date.
In QuoteRoomType, we have constrained:
Hotel.Location to that which is in the Quote,
AND RoomTypes to that which is available in Hotel.Location (in HotelRoomType),
AND which is available on the TravelDate (RoomTypeAvailable.Date "maps to" QuoteRoomType.TravelDate).
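A sketch of how those constraints might be declared with composite Foreign Keys follows. All names and datatypes here are assumptions, and each referenced column list must be covered by a PRIMARY KEY or UNIQUE constraint in the parent table:

```sql
-- Hypothetical sketch: composite FKs constrain each QuoteRoomType line item
-- to the Hotel.Location and TravelDate of its Quote, and to a RoomType that
-- is actually available on that Date.
CREATE TABLE QuoteRoomType (
    QuoteNo     INT      NOT NULL,
    Hotel       CHAR(12) NOT NULL,
    Location    CHAR(12) NOT NULL,
    TravelDate  DATE     NOT NULL,
    RoomType    CHAR(4)  NOT NULL,
    NumRoom     SMALLINT NOT NULL,
    CONSTRAINT PK_QuoteRoomType PRIMARY KEY ( QuoteNo, RoomType ),
    CONSTRAINT QuoteRoomType_Is_In_Quote
        FOREIGN KEY ( QuoteNo, Hotel, Location, TravelDate )
        REFERENCES Quote ( QuoteNo, Hotel, Location, TravelDate ),
    CONSTRAINT QuoteRoomType_Is_Available
        FOREIGN KEY ( Hotel, Location, RoomType, TravelDate )
        REFERENCES RoomTypeAvailable ( Hotel, Location, RoomType, Date )
);
```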
1960's Record Filing System • Anti-Relational, Sold as "relational"
This section is relevant for those who prescribe a Record ID field as "primary key" in every file. And somehow think that that is "relational". Others can safely skip it.
For comparison, here is the set of files that one would come up with, if one followed the techniques and methods that are promoted and marketed by the Date; Darwen; Fagin; et al crowd, falsely proposed as "relational".
This is a "mature" or "advanced" model, the fourth or fifth iteration. It has a number of improvements over the initial RFS. The initial or second or third iteration would not be equivalent enough to offer a comparison:
the Facts that are required to support the system have been determined (as opposed to the initial model, the record perspective, which is oblivious to Facts).
the content of the records has been improved to prevent duplicates, to the extent possible given the record content (but it is still streets behind the uniqueness provided in a Relational data model)
Fails Relational
Nevertheless it has no Relational features, which are logical. It has only the physical features of SQL reference-ability. Just a few of the many failures, which the mob prescribes as "relational":
Duplicate rows (logical) are not prevented, because rows are not defined.
No Relational Integrity
which depends on Relational Keys. (Refer to the Relational Keys detailed above.)
Eg. QuoteRoomType is constrained to any RoomTypeAvailable.
It is not possible to constrain it to:
the HotelId that is referenced in the Quote only,
OR to RoomTypes that exist in the HotelId only,
OR to RoomTypesAvailable that are available on the TravelDate only.
One additional field, and one additional index, for the Record id on every file. That will have a marvellous effect on performance.
Horrendous navigation and query code.
No Relational Power
When two distal files need to be JOINed, each of the intermediate files must be additionally JOINed, something that is not required in a Relational database. That is because it breaks the Access Path Independence Rule, a concept that the razor gang have not understood in the fifty years since the advent of the RM. But they will come up with yet another abnormal "normal form", to add to their bag of seventeen thus far.
More, Not Fewer, Joins
Let’s look at what that means. We need a query to provide statistics for RoomTypes that have been quoted for the previous year, so that hotels can re-arrange their room types to suit the expected traffic.
Using the Relational data model (separate section above), we would code:
SELECT RoomType.RoomType, -- Relational database
Description,
SUM( NumRoom )
FROM RoomType
JOIN QuoteRoomType ON RoomType.RoomType = QuoteRoomType.RoomType
WHERE DATEPART( YY, TravelDate ) = DATEPART( YY, GETDATE() ) - 1
GROUP BY RoomType.RoomType, Description
Using the Record Filing System data model, which is the result of following the advice of the Date; Darwen; Fagin; philipxy; AntC; et al gang, which is falsely marketed as "relational" (above), we would be forced to code:
SELECT RoomType, -- Record Filing System
Description,
SUM( NumRoom )
FROM RoomType
JOIN HotelRoomType
ON RoomType.RoomTypeId = HotelRoomType.RoomTypeId
JOIN RoomTypeAvailable
ON HotelRoomType.HotelRoomTypeId = RoomTypeAvailable.HotelRoomTypeId
JOIN QuoteRoomType
ON RoomTypeAvailable.RoomTypeAvailableId = QuoteRoomType.RoomTypeAvailableId
JOIN Quote
ON QuoteRoomType.QuoteId = Quote.QuoteId
WHERE DATEPART( YY, TravelDate ) = DATEPART( YY, GETDATE() ) - 1
GROUP BY RoomType, Description
Gotta love the Query Plan that the SQL platform will produce for that.
Re-arranging the order of the JOINs might improve the tortoise.
Resorting to moving fragments such as “partial FDs” or “MVDs” around, might improve it.
Perhaps deploying more “candies”, plus the required additional indices, all over the place, will help. But wait, that would be duplication on a mass scale, it would break Normalisation, there would be Update Anomalies everywhere one looks.
Note that that result set has no reliability; no credibility. Why ? Because, as already proved, the QuoteRoomType is not constrained to the Quote.Hotel (referenced by HotelId);
or to the Quote.TravelDate;
or to the RoomTypes available in QuoteHotel (referenced by HotelId).
Further, there may well be duplicates, because prevention can only be partially implemented. The result of which is unreliable result sets.
Simplicity vs Complexity
If you have the interest and the stamina, you can attempt to elevate the RFS by muddling through their "partial dependencies"; "transitive dependencies"; "candies"; "multi-valued dependencies"; etc, all of which are neither defined in, nor required in, the Relational Model. They are expressly for use in the Record Filing Systems of the last century.
First, the RFS paradigm (marketed as "relational") forces a record mindset, instead of a data-only mindset.
Second, it breaks everything down into fragments, instead of understanding the atoms; the Facts, in their full context (data hierarchies).
Third, it gives you a morass of complexity to handle the fragments, that have no relevance when handling atoms.
When you are done, all that complexity in the Record Filing System will still not be anywhere near the simplicity of the equivalent Relational data model: it will have:
No Relational Integrity (yes, yes, we have Declarative Referential Integrity, and that only for physical records, not for logical rows)
No Relational Power (multiple forced JOINs in every query)
No Relational Speed (those additional columns and indices have an effect).
And the navigation and query code will be horrendous, and prone to errors.
Please feel free to ask specific questions. Also, please supply clarifications as noted, and I will update the data model.
Since a specific room can only exist in one hotel, the table HOTEL_ROOM_JUNCTION is redundant. So PK hotel_id is an FK in ROOM, and the PK in ROOM is a composite key of hotel_id and room_id.
If one quote can consist of several rooms, you need a connecting table between QUOTE and ROOM, with FKs quote_id, room_id and hotel_id; those three will be the PK in that table. (As a rule of thumb, that kind of table will usually need a timestamp.)
(as a side note; I would name the tables QUOTES, ROOMS and HOTELS since they contain many)
EDIT: I misread the question somewhat. To make my model match what the OP wants, I need to add ROOM_TYPES with PK room_type_id, which will be an FK (NOT NULL) in ROOMS but not part of the PK.

It is ok to have duplicated values in SQL

I'm not a DBA so I'm not familiar with the proper lingo, so maybe the title of the question could be a little misleading.
So, the thing. I have Members for a certain system, these members can be part of a demographic segment (any kind of segment: favorite color, gender, job, etc)
These are the tables
SegmentCategory
ID, Name, Description
SegmentCategory_segment
SegmentID, SegmentCategoryID
Segment
ID, Name, Description
MemberSegment
ID, MemberID, SegmentID
So the guy that designed the DB decided to go uber normalizing everything so he put the member's gender on a segment and not in the Member's table.
Is this ok? According to my logic, gender is a property of the Member, so it must be on its entity. But by doing this there must be duplicated data (the gender on the Member, and gender as a segment), though a trigger on the Member table could fix this (update the segment on a gender change).
Having to crawl 4 tables just to get a property from the member seems like over engineering to me.
My question is whether I'm right or not? If so, how could I propose the change to the DBA?
There isn't a blanket rule you can apply to database decisions like this. It depends on what applications/processes it is supporting. A database for reporting is much easier to work with when it is more de-normalized (in a well thought out way) than a more transactional database is.
You can have a customer record spread across 2 tables, for instance, if some data is accessed or updated more often than other parts. Say you only need one half of the data in 90% of your queries, but don't want to drag around the varchar(max) fields you have there for whatever reason.
Having said that, having a table with just a gender/memberid is on the far side of extreme. From my naive understanding of your situation I feel you just need a members table with views over top for your segments.
As for the DBA, ultimately I imagine it will be them who will be needing to maintain the integrity of the data, so I would just approach them and say "hey, what do you think of this?" Hopefully they'll either see the merit or be able to give you reasons to their design decisions.
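A minimal sketch of that view-based suggestion, with assumed table and column names, keeping gender on the member and exposing a segment as a view:

```sql
-- Hypothetical sketch: the attribute lives on the Member table; a "segment"
-- is just a view over it, so nothing is duplicated or kept in sync by trigger.
CREATE TABLE Member (
    MemberID INT         NOT NULL PRIMARY KEY,
    Name     VARCHAR(60) NOT NULL,
    Gender   CHAR(1)     NOT NULL   -- e.g. 'M', 'F'
);

CREATE VIEW MaleMembers AS
    SELECT MemberID, Name
    FROM   Member
    WHERE  Gender = 'M';
```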

Is it better to name the primary key column id or *_id?

I've been using Rails for a few years and I've grown used to the convention of naming the primary key column id. But I've run across lots of examples in SQL books that name the primary key column something like employee_id for an employees table or feed_id for a feeds table.
One advantage of the 2nd system seems to be that you can use USING() to produce more concise SQL queries:
select feeds.title, items.title from items inner join feeds USING(feed_id);
As opposed to
select feeds.title, items.title from items inner join feeds on feeds.id = items.feed_id;
Which naming convention is better? Which is favored by experienced database administrators?
Also, is it better to pluralize the name of the table?
I always use the verbose form (i.e. 'employee_id' rather than 'id') as it is more descriptive. If you are joining more than one table and both have 'id' column you will have to alias 'id' if you need to SELECT both of the ids. Also, as you mentioned, you get the advantage of USING clause. In the grand scheme of things it isn't a huge factor one way or the other but the more verbose form gives you advantages.
Both options are valid but the purists will say use id as its name is specified by the table.
I use table_id because I find it to be more descriptive and makes debugging easier. It's more practical.
Re: Tablenames. Another hotly debated topic among database nerds but I say Singular.
Tablename_Id is my strong preference. When you do joins to FKs, you know exactly what to join to what, and you don't make mistakes where you join to ID in table a when you meant table b. Below is an example of how easy this is to do, especially if you copy the ON clause from somewhere else:
FROM tablea a
JOIN tableb b
    ON a.ID = b.tableaid
JOIN tablec c
    ON a.ID = c.tablebid   -- bug: should be b.ID = c.tablebid
In the case above, you really wanted to join to B.Id but forgot to change it from a when you copied. It will work and give you a resultset that isn't correct. If you use table_id instead, the query would fail the syntax check.
Another problem with using Id is when you are doing complex reports. Since the report queries have to have fields with individual names, you can end up wasting time writing a bunch of aliases you wouldn't need if you had named the id with the tablename.
Now, people who use ORMs don't write a lot of SQL, but what they do write, and what report writers write, are generally complex, complicated statements. You need to design your database to make it easier to do those things than simple queries.
The use of ID as the name for the identifying field is considered a SQL antipattern. http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?s=books&ie=UTF8&qid=1308929815&sr=1-1
This is user preference, but I always name the primary keys of my tables Id. I always name references of that Id in other tables as [SingularEntityName][Id] e.g.
Credentials
Id Password
Users
Id Name CredentialId
Descriptions
Id UserId
Keeps my references clean. However, just be consistent in your naming, and it really shouldn't matter how you set up your schemas.
To open the can of worms again,
I'm willing to bet those who select tablename_id are older, more experienced programmers.
Those who use just id are younger.
Why? Because you learn that redundancy and consistency are not always a bad thing.
The one thing I would add to the #1 answer: use the "_", as it helps make it easier to pick out the variable in code, in the table, etc. I do the same for foreign keys: TableName_FK. Some will argue over that, but it works for me and it's obvious what it is.
I have had to work on other's code many times over the years. Consistency is critical, obfuscation is worthless and meaningful variable names very helpful.
There are those who argue that verbosity makes code harder to read. I don't think that argument flies in today's world of objects.that.derive.from.some.microsoft.class.twenty.layers.deep.that.you.have.to.fully.reference.
BTW - as so many have said, it's your choice. Those folks who spend time arguing over coding syntax don't have enough work to do. Learn to be flexible and to use the standards of the workplace where you are employed. If you are lucky enough to set your own standards, then have at it. The fact your are wondering is great. But choose one and then be consistent (until you change jobs or decide you have a paradigm shift that means you want to change your style.)
You can often pick out what era someone started learning to code by their personal preferences and styles. Guys that write very tight, minimal, hard to read code started back when memory was very limited (DOS) and probably wrote a lot of assembler; those that use Hungarian started back with the Win SDK, etc...
This discussion has been evolving for decades. The older I get, the more I document my code, the more meaningful I make my variable names, etc... because in a week I will have forgotten what I wrote and I need the road maps to make sense of it. Not so much that I'm forgetful, although that's part of the equation, but more so because I'm writing code on so many different projects.
It's entirely your choice, but personally I prefer the second one, as I wouldn't need to look for table names in my code when I come across an id. I think tablename_id is better.
Another advantage to giving your primary keys names that are unique to that table is that it makes it easier to have a naming convention, when referring to those keys in different tables, that indicates the corresponding key.
For example, suppose everything in your alpha table begins alpha_, so that you have alpha_id as your primary key. In your beta table - where everything would begin beta_ - you would use beta_alpha_id to have a reference in that table to the keys in the alpha table.

How important are lookup tables?

A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
Thanks.
The answer depends a little on whether you are limited to freeware such as PostgreSQL (not fully SQL-compliant), or whether you are thinking about SQL (i.e. SQL-compliant) platforms and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if those values change, you may have to update millions of rows, which is a very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
the second point is ease of change, if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
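A minimal DDL sketch of such a Lookup table and its FOREIGN KEY relation (the names and datatypes are assumptions):

```sql
-- Hypothetical sketch: Normalised Status lookup with a short CHAR PK,
-- referenced by the data table via an ordinary FOREIGN KEY.
CREATE TABLE Status (
    StatusCode CHAR(2)     NOT NULL PRIMARY KEY,
    Name       VARCHAR(30) NOT NULL
);

INSERT INTO Status VALUES ( 'O', 'Open'   );
INSERT INTO Status VALUES ( 'C', 'Closed' );

CREATE TABLE Post (
    PostNo     INT     NOT NULL PRIMARY KEY,
    StatusCode CHAR(2) NOT NULL REFERENCES Status ( StatusCode )
);

-- The query then needs no JOIN to obtain a readable predicate:
-- SELECT * FROM Post WHERE StatusCode = 'O'   -- Open
```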
ENUM is non-SQL; do not use it. In SQL, the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well, don't change it; why would you want to? Those values were supposed to be meaningless, remember; only a meaningful Key would ever need to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Example
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE 'Eq%'
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
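A small sketch of the hierarchical-code idea, using the four instrument types above (SQLite via Python; note that standard SQL wants single quotes around the 'Eq%' literal):

```python
import sqlite3

# Hierarchical, meaningful lookup codes: a prefix match on the code
# selects a whole branch of the hierarchy at once.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE InstrumentType (
        InstrumentTypeCode VARCHAR(8) PRIMARY KEY,
        Name               VARCHAR(40) NOT NULL
    )
""")
con.executemany("INSERT INTO InstrumentType VALUES (?, ?)", [
    ("Eq",   "Equity"),
    ("EqCS", "Equity/Common Share"),
    ("OTC",  "OverTheCounter"),
    ("OF",   "OTC/Future"),
])

# 'Eq%' matches Equity and everything under it, but not the OTC branch.
equities = con.execute(
    "SELECT InstrumentTypeCode FROM InstrumentType "
    "WHERE InstrumentTypeCode LIKE 'Eq%' "
    "ORDER BY InstrumentTypeCode"
).fetchall()
print(equities)  # [('Eq',), ('EqCS',)]
```

The GUI would still display the long `Name` values in its drop-down; the codes are what the queries and the developers work with.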
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
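As a sketch of the enforcement point, assuming the `gender`/`people` tables above (SQLite needs foreign keys switched on per connection):

```python
import sqlite3

# A CHAR(1) lookup key still enforces the relationship: rows can only
# reference codes that exist in the lookup table.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE gender (
        id    CHAR(1) PRIMARY KEY,
        label VARCHAR(10) NOT NULL
    );
    CREATE TABLE people (
        name   VARCHAR(30) PRIMARY KEY,
        gender CHAR(1) NOT NULL REFERENCES gender(id)
    );
""")
con.executemany("INSERT INTO gender VALUES (?, ?)",
                [("M", "Male"), ("F", "Female"), ("N", "Neutral")])
con.execute("INSERT INTO people VALUES ('Alice', 'F')")  # valid code, accepted

rejected = False
try:
    con.execute("INSERT INTO people VALUES ('Bob', 'X')")  # 'X' is not in the lookup
except sqlite3.IntegrityError:
    rejected = True
print("rejected:", rejected)  # rejected: True
```

So the SQL stays simple (`WHERE gender = 'M'`) while the foreign key still guards against garbage values.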
Create a function for each lookup.
There is no easy way if you want both performance and query simplicity: the function has to be kept in sync with the lookup table. You could create an SP_TestAppEnums procedure that compares the existing lookup values against the function and flags anything out of sync or returning zero rows.
CREATE FUNCTION [Enum_Post](@postname varchar(10))
RETURNS int
AS
BEGIN
    DECLARE @postId int
    SET @postId =
        CASE @postname
            WHEN 'Open' THEN 1
            WHEN 'Closed' THEN 2
        END
    RETURN @postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
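The suggested sync check (the SP_TestAppEnums idea) can be sketched outside T-SQL as well; here is a hypothetical Python version that compares an in-app enum against a `post_status` lookup table (both names are illustrative):

```python
import sqlite3
from enum import IntEnum

# The application's hard-coded view of the lookup values.
class Post(IntEnum):
    OPEN = 1
    CLOSED = 2

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE post_status "
            "(status_id INTEGER PRIMARY KEY, name VARCHAR(10))")
con.executemany("INSERT INTO post_status VALUES (?, ?)",
                [(1, "Open"), (2, "Closed")])

# Compare the database's values against the enum and flag any drift.
db_values = {name.upper(): sid for sid, name in
             con.execute("SELECT status_id, name FROM post_status")}
app_values = {m.name: m.value for m in Post}

out_of_sync = {k: (app_values.get(k), db_values.get(k))
               for k in app_values.keys() | db_values.keys()
               if app_values.get(k) != db_values.get(k)}
print(out_of_sync)  # {} means the enum and the lookup table agree
```

Run as part of a test suite, a check like this catches the day someone adds a status to the table without updating the function (or vice versa).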
Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your where clause can use a friendly name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute ('name' in your example above) is more likely to change than the key attribute ('status_id'): you're better protected by referencing the desired key value(s) of the domain table directly.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table,
to populate UI controls and the like with user-friendly information,
Naturally, you'd need to suck domain tables into your queries where you actually require the non-key attributes from the domain table (e.g., the descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
Lookup tables involve joins or subqueries, which are not always efficient. I make use of enums a lot to do this job. It's efficient and fast.
Where possible (and it is not always . . .), I use this rule of thumb: if I need to hard-code a value into my application (vs. let it remain a record in the database), and also store that value in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a piece of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always achieve "perfect" normalization in a database design (for reasons of performance, or simply because you CAN take things too far in pursuit of academic perfection . . .), we may make some conscious choices about where we locate our "look-up" values.
Personally, though, I try to stand by my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the database, I try to keep it out of the application code (except, of course, to retrieve it from the database . . .). If it is hardcoded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later, some poor soul will be able to figure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.
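A minimal sketch of the constants approach (the `PostStatus` values here are illustrative; they must match whatever the `status_id` column actually stores):

```python
import sqlite3
from enum import IntEnum

# Statuses live in the application as constants, not in a lookup table.
class PostStatus(IntEnum):
    OPEN = 1
    CLOSED = 2

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts "
            "(post_id INTEGER PRIMARY KEY, status_id INTEGER NOT NULL)")
con.executemany("INSERT INTO posts VALUES (?, ?)",
                [(1, PostStatus.OPEN), (2, PostStatus.CLOSED)])

# No status table, no join: the meaning of the value is application logic.
rows = con.execute("SELECT post_id FROM posts WHERE status_id = ?",
                   (int(PostStatus.OPEN),)).fetchall()
print(rows)  # [(1,)]
```

The trade-off is exactly the one discussed above: the values become invisible to report tools and ad-hoc SQL, so this only fits when the statuses genuinely never change.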

SQL efficiency argument, add a column or solvable by query?

I am a recent college graduate and a new hire for software development. Things have been a little slow lately so I was given a db task. My db skills are limited to pet projects with Rails and Django. So, I was a little surprised with my latest task.
I have been asked by my manager to subclass Person with a 'Parent' table and add a reference to their custodian in the Person table. This is to facilitate going from Parent to Form when the custodian, not the Parent, is the FormContact.
Here is a simplified, mock structure of a sql-db I am working with. I would have drawn the relationship tables if I had access to Visio.
We have a table 'Person' and we have a table 'Form'. There is a table, 'FormContact', that relates a Person to a Form; not all Persons are related to a Form. There is also a relationship table for Person-to-Person relationships (Employer, Parent, etc.).
I've asked, "Why couldn't this be handled by a query?" Response: inefficient. (Really!?!)
So, I asked, "Why not have a reference to the Form? That would be more efficient, since you wouldn't be querying the FormContact table with the reference from the child/custodian." Response: this would essentially make the Parent a FormContact. (Fair enough.)
I went ahead and wrote a query to get from a non-FormContact Parent to the Form, and tested it on the production server. The response time was instantaneous. SOME_VALUE is the Parent's fk ID.
SELECT FormID
FROM FormContact
WHERE FormContact.ContactID
IN (SELECT SourceContactID
FROM ContactRelationship
WHERE (ContactRelationship.RelatedContactID = *SOME_VALUE*)
AND (ContactRelationship.Relationship = 'Parent'));
If I am right that this is an unnecessary change, what should I do: defend my position, or concede to the manager's request?
If I am wrong, what is my error? Is there a better solution than the manager's?
First things first, your query could use some reworking. Rather than subselects, try using a join:
SELECT FormID
FROM FormContact fc
JOIN ContactRelationship cr on cr.SourceContactID = fc.ContactID
and cr.Relationship = 'Parent'
WHERE cr.RelatedContactID = @parent_id
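As a quick sanity check, both forms can be run against a toy data set to confirm they return the same FormIDs; the schema below is guessed from the question's description:

```python
import sqlite3

# Minimal versions of the two tables the query touches.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE FormContact (FormID INTEGER, ContactID INTEGER);
    CREATE TABLE ContactRelationship (
        SourceContactID INTEGER, RelatedContactID INTEGER,
        Relationship VARCHAR(20));
""")
con.executemany("INSERT INTO FormContact VALUES (?, ?)", [(10, 1), (20, 2)])
con.executemany("INSERT INTO ContactRelationship VALUES (?, ?, ?)",
                [(1, 99, 'Parent'), (2, 99, 'Employer')])

parent_id = 99  # stands in for SOME_VALUE in the question

# Original IN-subquery form.
subquery = con.execute("""
    SELECT FormID FROM FormContact
    WHERE ContactID IN (SELECT SourceContactID FROM ContactRelationship
                        WHERE RelatedContactID = ?
                          AND Relationship = 'Parent')
""", (parent_id,)).fetchall()

# Rewritten JOIN form.
join = con.execute("""
    SELECT FormID FROM FormContact fc
    JOIN ContactRelationship cr ON cr.SourceContactID = fc.ContactID
                               AND cr.Relationship = 'Parent'
    WHERE cr.RelatedContactID = ?
""", (parent_id,)).fetchall()
print(subquery, join)  # [(10,)] [(10,)]
```

Most optimizers will plan the two forms identically anyway; the JOIN version is mainly about readability and convention.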
Secondly, the issue you're dealing with is normalization vs. performance. From a purity perspective, yes, your solution is "more correct" (as you aren't duplicating data, which eliminates the possibility for the disparities in the duplicated data causing conflicts and aberrant behavior), but pure normalization is not always the wisest course of action.
Normalization can induce performance penalties, especially in larger data sets. These penalties have to be weighed against the benefits of normalization to see which side "wins".
That being said, I can't see how joining the Person table again on the ParentID column (I'm assuming that's what you'd be adding) would provide a performance boost over the join listed above, assuming that the columns in question are properly indexed.
If the query above works for you and you do rigorous performance testing to show that it's valid, take it to your manager and ask for his input. Because you're new and fresh out of college, be very willing to defer to your manager's judgment and wishes on this one. There will be much bigger battles to fight in the future.