I am a recent college graduate and a new software development hire. Things have been a little slow lately, so I was given a database task. My database skills are limited to pet projects with Rails and Django, so I was a little surprised by my latest task.
I have been asked by my manager to subclass Person with a 'Parent' table and add a reference to their custodian in the Person table. This is to facilitate going from Parent to Form when the custodian, not the Parent, is the FormContact.
Here is a simplified, mock structure of the SQL database I am working with. I would have drawn the relationship tables if I had access to Visio.
We have a table 'Person' and we have a table 'Form'. There is a table, 'FormContact', that relates a Person to a Form; not all Persons are related to a Form. There is also a relationship table for Person-to-Person relationships (Employer, Parent, etc.).
I asked, "Why couldn't this be handled by a query?" Response: inefficient. (Really!?!)
So I asked, "Why not have a reference to the Form? That would be more efficient, since you wouldn't be querying the FormContact table with the reference from the child/custodian." Response: that would essentially make the Parent a FormContact. (Fair enough.)
I went ahead and wrote a query to get from a non-FormContact Parent to a Form, and tested it on the production server. The response time was instantaneous. SOME_VALUE is the Parent's FK ID.
SELECT FormID
FROM FormContact
WHERE FormContact.ContactID IN
      (SELECT SourceContactID
       FROM ContactRelationship
       WHERE ContactRelationship.RelatedContactID = *SOME_VALUE*
         AND ContactRelationship.Relationship = 'Parent');
If I am right that this is an unnecessary change, what should I do: defend my position, or concede to the manager's request?
If I am wrong, what is my error? Is there a better solution than the manager's?
First things first, your query could use some reworking. Rather than subselects, try using a join:
SELECT FormID
FROM FormContact fc
JOIN ContactRelationship cr
    ON cr.SourceContactID = fc.ContactID
    AND cr.Relationship = 'Parent'
WHERE cr.RelatedContactID = @parent_id
Secondly, the issue you're dealing with is normalization vs. performance. From a purity perspective, yes, your solution is "more correct": you aren't duplicating data, which eliminates the possibility of disparities in the duplicated data causing conflicts and aberrant behavior. But pure normalization is not always the wisest course of action.
Normalization can induce performance penalties, especially in larger data sets. These penalties have to be weighed alongside the benefits from normalization to see which side "wins".
That being said, I can't see how joining the Person table again on the ParentID column (I'm assuming that's what you'd be adding) would provide a performance boost over the join listed above, assuming that the columns in question are properly indexed.
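For example, covering indexes along these lines would support the join above (a sketch only; the index names and column choices are assumptions, not a prescription):

CREATE INDEX IX_ContactRelationship_Related
    ON ContactRelationship (RelatedContactID, Relationship, SourceContactID);

CREATE INDEX IX_FormContact_Contact
    ON FormContact (ContactID, FormID);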
If the query above works for you and you do rigorous performance testing to show that it's valid, take it to your manager and ask for his input. Because you're new and fresh out of college, be very willing to defer to your manager's judgment and wishes on this one. There will be much bigger battles to fight in the future.
I was reading about Neo4j, a graph database, and how it compares to the relational model. Here is one thing it mentions about how to query a many-to-many join for the "Departments" associated with a single user (the comparison is presented in a graphic):
I would think though if I knew beforehand I'm just looking up a single row-by-PK and there are less than 5 departments likely for that user, I would write the query as follows:
SELECT name FROM department WHERE department_id IN (
    SELECT department_id FROM PersonDepartment WHERE user_id IN (
        SELECT pk FROM Person WHERE name = 'Alice' -- assume unique name
    )
)
I'm sure writing this in the more common 'join format' would be optimized by the RDBMS into something closer to the above, but I'm using the nested form just to show how the query seems like it would take almost no time to execute. Or am I wrong here? On the other hand, the more concise Cypher form, (p:Person {name: "Alice"})-[:BELONGS_TO]->(d:Department), is much simpler to read and write.
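For reference, the 'join format' I have in mind would be something like this (same hypothetical column names as above):

SELECT d.name
FROM Person p
JOIN PersonDepartment pd ON pd.user_id = p.pk
JOIN department d ON d.department_id = pd.department_id
WHERE p.name = 'Alice' -- assume unique name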
Preliminary
First, let's get some issues that confuse the problem out of the way, so that we can answer the question in a straightforward manner.
The text blurb in the graphic
It is completely dishonest, the typical Straw Man argument, used to demean what he is against, and to elevate what he is for. He poses the Relational method as something that it is not (a 1960's Record Filing System with IDs as "Primary Keys"), and then takes it down
Whoopee, he destroyed his own concoction, his Straw Man
The Relational method remains, unaffected
Nevertheless, the uneducated will be confused.
ID fields as a "Primary Key"
The Relational Model explicitly prohibits IDs, which are Physical. The PK must be "made up from the data", which is Logical.
Further, the file contains duplicate rows (IDs do not provide row uniqueness, which is demanded in the RM)
IDs complicate DML code, and force more JOINs (that are not required in the equivalent Relational database), which the dear professor is counting on, in his erection of the Straw Man
The IDs need to be removed, and proper Relational Keys need to be implemented
Relational Integrity, which is Logical (as distinct from Referential Integrity, which is Physical), is lost, not possible
Full detail in Relational schema for a book graph.
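To make "made up from the data" concrete, a Person key of that kind looks something like this (a sketch only; datatypes and sizes are illustrative, following the identification convention noted in the comments below):

CREATE TABLE Person (
    NameLast     VARCHAR(30) NOT NULL,
    NameFirst    VARCHAR(30) NOT NULL,
    Initial      CHAR(1)     NOT NULL,
    BirthDate    DATE        NOT NULL,
    BirthCountry CHAR(2)     NOT NULL,
    BirthState   CHAR(3)     NOT NULL,
    BirthPlace   VARCHAR(30) NOT NULL,
    CONSTRAINT Person_PK PRIMARY KEY (NameLast, NameFirst, Initial,
        BirthDate, BirthCountry, BirthState, BirthPlace)
)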
No one in their right mind is going to step through those three tables in that way, let alone prescribe it.
he is using procedural code, such as in a CURSOR, which is anti-Relational, and stupefyingly slow
the RM and SQL are based on Set Theory, so use set verbs, such as SELECT, and select only what you need
the proposition is a single set; a single SELECT fulfils it.
Questions
I would think though if I knew beforehand I'm just looking up a single row-by-PK and there are less than 5 departments likely for that user, I would write the query as follows: ...
Definitely not. Even with the IDs:
the population in each table is irrelevant (the PK has a unique index)
let us assume 10,000 Persons; 10,000 Departments; 16,000,000 PersonDepartments
performance should never be considered when modelling, or when writing DML code
it should be considered only when some code performs badly, with a view to improving it.
Other than for the purpose of clarifying your question, that code can be dismissed.
I'm sure writing this in the more common 'join format' would be optimized by the RDBMS
Yes.
with a genuine SQL Platform, it will do many things re optimisation, at many levels: parsing; determination of a Query Plan; considerations of Statistics; etc.
with the freeware "SQLs", it does a mickey mouse version of that (at best), or none at all (at worst). Which is why performance is a consideration everywhere with them; but that is abnormal, sub-standard.
into something closer to the above
Definitely not; that nested form is a dog's breakfast. The platform will create a very elegant and optimised Query Plan, and then a hierarchic Query Tree (a run-time executable that can be shared).
but I'm using the above just to show how the above query seems like it would take almost no time to execute, or am I wrong here?
No, you are right. In the sense that either the horrible code example operating on a RFS, or the correct code operating on a Relational database, will execute in millisecs, "almost no time".
Relational Data Model
If you wish to evaluate what he intended in his proposition (what departments does Alice work for), without the dishonesty of his Straw Man, using a Relational database (no IDs, proper Relational Keys), we need a data model.
All my models are rendered in IDEF1X, the one and only Standard for Relational data modelling. (ERD cannot be used.)
The IDEF1X Introduction is essential reading.
The code is simple.
SELECT NameFirst,
       DepartmentCode
FROM Person P
JOIN Employee E ON P.PersonNo = E.EmployeeNo
WHERE NameFirst = 'Alice'
This code produces a more meaningful result set; it is still a single, simple SELECT.
SELECT NameLast,
       NameFirst,
       D.Name,
       EmploymentDate
FROM Person P
JOIN Employee E ON P.PersonNo = E.EmployeeNo
JOIN Department D ON E.DepartmentCode = D.DepartmentCode
WHERE NameFirst = 'Alice'
Comments
One question regarding the "no IDs, proper keys" -- doesn't the PersonNo act the same way as would an autoincrementing PK to identify a person?
Yes.
Except that AUTOINCREMENT/IDENTITY columns have horrendous maintenance problems; thus we do not allow them in Production, and thus we do not allow them in Development work that is intended for Production.
The alternative for an INSERT is:
...
PersonNo = (
SELECT MAX( PersonNo ) + 1
FROM Person
)
...
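Spelled out as a complete statement, that might look like the following (a sketch only; the table and column names are assumed, and under concurrent inserts the MAX() read must be serialised, e.g. by the transaction's isolation level):

-- COALESCE handles the empty-table case
INSERT INTO Person (PersonNo, NameLast, NameFirst)
SELECT COALESCE(MAX(PersonNo), 0) + 1, 'Smith', 'Alice'
FROM Person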
Of course, for high performance OLTP, there are other methods.
Never use the Oracle method, which is a file of records, each holding a next-sequential-number for some other file.
If we went with PK must be "made up from the data" and no SS# or some uniquely-identifying-person-code, it'd then be just combining a bunch of stuff: FirstName+LastName+BirthPlace+Birthdate (or whatever combination would give enough granularity to guarantee uniqueness)
Yes. That is answered in full detail in the IDEF1X Introduction, please read.
Short answer ...
this is a true surrogate, not a RecordID (which is falsely called a surrogate).
the only justification is when
the natural PK gets too long (here 7 columns and 120+ bytes), to be carried into subordinate tables as FKs,
and
the table is the top of a data hierarchy, which therefore does not suffer an Access Path Independence breach, as stipulated in Codd's Relational Model. Which is true in the usage here.
Technically, a surrogate (or RecordID) is a breach of the Relational Key Normal Form. A properly placed surrogate does not breach the Access Path Independence Rule (there is nothing above the breach to be accessed), whereas a RecordID always does. The user does not see the surrogate or RecordID, because it is not data.
Separately, note that ( LastName, FirstName, Initial, Birthdate, BirthCountry, BirthState, BirthPlace ) is an established international convention (not quite a standard) for identifying a person.
I'm not a DBA so I'm not familiar with the proper lingo, so maybe the title of the question could be a little misleading.
So, the thing: I have Members for a certain system, and these members can be part of a demographic segment (any kind of segment: favorite color, gender, job, etc.).
These are the tables
SegmentCategory
    ID, Name, Description
SegmentCategory_segment
    SegmentID, SegmentCategoryID
Segment
    ID, Name, Description
MemberSegment
    ID, MemberID, SegmentID
So the guy who designed the DB decided to uber-normalize everything, so he put the member's gender in a segment and not in the Member table.
Is this OK? According to my logic, gender is a property of the Member, so it should be on its entity. But doing it that way means there would be duplicated data (the gender on the Member, and Gender as a segment); a trigger on the Member table could fix this, though (update the segment on a gender change).
Having to crawl four tables just to get a property of the member seems like over-engineering to me.
My question is whether I'm right or not. If so, how could I propose the change to the DBA?
There isn't a blanket rule you can apply to database decisions like this. It depends on what applications/processes the database is supporting. A database for reporting is much easier to work with when it is more de-normalized (in a well-thought-out way) than a more transactional database would be.
You can have a customer record spread across two tables, for instance, if some data is accessed or updated more often than other parts. Say you only need one half of the data in 90% of your queries, but don't want to drag around the varchar(max) fields you have there for whatever reason.
Having said that, having a table with just a gender/memberid is on the far side of extreme. From my naive understanding of your situation I feel you just need a members table with views over top for your segments.
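Something like this, say (a rough sketch; the Member table and its Gender column are assumptions based on your description):

CREATE VIEW MemberGenderSegment AS
SELECT m.ID     AS MemberID,
       m.Gender AS SegmentValue
FROM Member m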
As for the DBA, ultimately I imagine it will be them who will need to maintain the integrity of the data, so I would just approach them and say "hey, what do you think of this?" Hopefully they'll either see the merit or be able to give you reasons for their design decisions.
I've been using Rails for a few years and I've grown used to the convention of naming the primary key column id. But I've run across lots of examples in SQL books that name the primary key column something like employee_id for an employees table or feed_id for a feeds table.
One advantage of the second system seems to be that you can use USING() to produce more concise SQL queries:
select feeds.title, items.title from items inner join feeds USING(feed_id);
As opposed to
select feeds.title, items.title from items inner join feeds on feeds.id = items.feed_id;
Which naming convention is better? Which is favored by experienced database administrators?
Also, is it better to pluralize the name of the table?
I always use the verbose form (i.e. 'employee_id' rather than 'id'), as it is more descriptive. If you are joining more than one table and both have an 'id' column, you will have to alias 'id' if you need to SELECT both of the ids. Also, as you mentioned, you get the advantage of the USING clause. In the grand scheme of things it isn't a huge factor one way or the other, but the more verbose form gives you advantages.
Both options are valid, but the purists will say to use id, since the table name already specifies what it identifies.
I use table_id because I find it more descriptive, and it makes debugging easier. It's more practical.
Re: Tablenames. Another hotly debated topic among database nerds but I say Singular.
Tablename_Id is my strong preference. When you join on FKs, you know exactly what to join to what, and you don't make mistakes where you join to ID in table A when you meant table B. Below is an example of how easy this mistake is to make, especially if you copy the ON clause from somewhere else:
FROM tablea a
JOIN tableb b
    ON a.ID = b.tableaid
JOIN tablec c
    ON a.ID = c.tablebid
In the case above, you really wanted to join to b.ID but forgot to change the alias from a when you copied. It will work, and it will give you a result set that isn't correct. If you used table_id instead, the query would fail the syntax check.
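With Tablename_Id naming, the same copy/paste slip no longer binds, so it is caught immediately (a sketch; table and column names are illustrative):

FROM tablea a
JOIN tableb b
    ON a.tablea_id = b.tablea_id
JOIN tablec c
    ON a.tableb_id = c.tableb_id -- fails: tablea has no tableb_id column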
Another problem with using Id is when you are doing complex reports. Since the report queries have to have fields with individual names, you can end up wasting time writing a bunch of aliases you wouldn't need if you had named the id with the table name.
Now, people who use ORMs don't write a lot of SQL, but what they do write, and what report writers write, are generally complex, complicated statements. You need to design your database to make those things easier, not just the simple queries.
The use of ID as the name for the identifying field is considered a SQL antipattern. http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?s=books&ie=UTF8&qid=1308929815&sr=1-1
This is user preference, but I always name the primary keys of my tables Id. I always name references of that Id in other tables as [SingularEntityName][Id] e.g.
Credentials
    Id, Password
Users
    Id, Name, CredentialId
Descriptions
    Id, UserId
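In DDL form, that comes out something like this (a sketch; datatypes are placeholders, just to make the references explicit):

CREATE TABLE Credentials (
    Id       INT PRIMARY KEY,
    Password VARCHAR(64)
);

CREATE TABLE Users (
    Id           INT PRIMARY KEY,
    Name         VARCHAR(30),
    CredentialId INT REFERENCES Credentials (Id)
);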
Keeps my references clean. However, just be consistent in your naming and it really shouldn't matter how you set up your schemas.
To open the can of worms again,
I'm willing to bet those who select tablename_id are older, more experienced programmers.
Those who use just id are younger.
Why? Because you learn that redundancy and consistency are not always a bad thing.
The one thing I would add to the #1 answer: using the "_" makes it easier to pick out the variable in code, in the table, etc. I do the same for foreign keys: TableName_FK. Some will argue over that, but it works for me, and it's obvious what it is.
I have had to work on other's code many times over the years. Consistency is critical, obfuscation is worthless and meaningful variable names very helpful.
There are those who argue that verbosity makes code harder to read. I don't think that argument flies in today's world of objects.that.derive.from.some.microsoft.class.twenty.layers.deep.that.you.have.to.fully.reference.
BTW - as so many have said, it's your choice. Those folks who spend time arguing over coding syntax don't have enough work to do. Learn to be flexible and to use the standards of the workplace where you are employed. If you are lucky enough to set your own standards, then have at it. The fact that you are wondering is great. But choose one and then be consistent (until you change jobs or decide you have a paradigm shift that means you want to change your style).
You can often pick out what era someone started learning to code by their personal preferences and styles. Guys who write very tight, minimal, hard-to-read code started back when memory was very limited (DOS) and probably wrote a lot of assembler; those who use Hungarian notation started back with the Win SDK, etc...
This discussion has been evolving for decades. The older I get, the more I document my code, the more meaningful I make my variable names, etc... because in a week I will have forgotten what I wrote and I need the road maps to make sense of it. Not so much that I'm forgetful, although that's part of the equation, but more so because I'm writing code on so many different projects.
It's entirely your choice, but personally I prefer the second one, as I wouldn't need to look for table names in my code when I come across an id. I think tablename_id is better.
Another advantage to giving your primary keys names that are unique to that table is that it makes it easier to have a naming convention, when referring to those keys in different tables, that indicates the corresponding key.
For example, suppose everything in your alpha table begins alpha_, so that you have alpha_id as your primary key. In your beta table - where everything would begin beta_ - you would use beta_alpha_id to have a reference in that table to the keys in the alpha table.
A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
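For example (a sketch using MySQL-style ENUM):

CREATE TABLE posts (
    id     INT PRIMARY KEY,
    title  VARCHAR(100),
    status ENUM('open', 'closed') NOT NULL DEFAULT 'open'
);

SELECT * FROM posts WHERE status = 'open';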
Thanks.
The answer depends a little on whether you are limited to freeware such as PostgreSQL (not fully SQL compliant), or are thinking about SQL (i.e. SQL-compliant) platforms and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if those values change, you may have to update millions of rows, which is a very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
The second point is ease of change: if Closed were changed to Expired, only one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows would need to be changed.
Adding new data values, e.g. (H, HalfOpen), is then simply a matter of inserting one row (sketched below).
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is non-SQL; do not use it. In SQL, the "enum" is a Lookup table.
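For example, adding the (H, HalfOpen) value mentioned above is a single INSERT (the table and column names are assumed):

INSERT INTO PostStatus (StatusCode, Name)
VALUES ('H', 'HalfOpen')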
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well, don't change it; why would you want to? Those values were supposed to be meaningless, remember; only a meaningful Key would need to be changed.
if you do use meaningful keys, use short alphabetic codes that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users appreciate it too.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open; as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
       s.name
FROM posts p,
     status s
WHERE p.status_id = s.status_id
  AND p.status_id = 'O'
Another Example
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq    Equity
EqCS  Equity/Common Share
OTC   OverTheCounter
OF    OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE 'Eq%'
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
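The drop-down itself is populated straight from the Lookup table, e.g. (the table name is assumed):

SELECT InstrumentTypeCode,
       Name
FROM InstrumentType
ORDER BY Name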
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
Create a function for each lookup.
There is no easy way; you want both performance and query simplicity. Ensure the function is kept in sync with the lookup values: you could create an SP_TestAppEnums procedure to compare the existing lookup values against the function, and flag anything out of sync or returning zero.
CREATE FUNCTION [Enum_Post] (@postname varchar(10))
RETURNS int
AS
BEGIN
    DECLARE @postId int
    SET @postId =
        CASE @postname
            WHEN 'Open' THEN 1
            WHEN 'Closed' THEN 2
        END
    RETURN @postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
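Which lets the function stand in for a hard-coded literal in application queries, e.g. (using the posts table from the question):

SELECT * FROM posts WHERE status_id = dbo.Enum_Post('Open')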
Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your WHERE clause can use a friendly name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the WHERE clause ('name' in your example above) is more likely to change than the key attribute ('status_id' in your example): you're more protected by referencing the desired key value(s) of the domain table directly.
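That is, something along these lines (assuming the id value for 'open' happens to be 1):

SELECT * FROM posts WHERE status_id = 1 -- 1 = 'open' in statuses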
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table,
to populate UI controls and the like with user-friendly information.
Naturally, you'd need to suck domain tables into your queries where you actually require the non-key attributes from the domain table (e.g., the descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
Lookup tables involve joins or subqueries, which are not always efficient. I make use of enums a lot to do this job; it's efficient and fast.
Where possible (and it is not always . . .), I use this rule of thumb: if I need to hard-code a value into my application (vs. let it remain a record in the database), and also store that value in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a piece of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always achieve "perfect" normalization in a database design (for reasons of performance, or simply because you CAN take things too far in pursuit of academic perfection . . .), we may make some conscious choices about where we locate our "look-up" values.
Personally, though, I try to stand by my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the database, I try to keep it out of the application code (except, of course, to retrieve it from the database . . .). If it is hard-coded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later some poor soul will be able to figure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.
If I have a parent and a child table filled with data, is it trivial to add a new table between them?
For example, before introduction the relationship is:
Parent -> Child
Then:
Parent -> New Table -> Child
In this case I'm referring to SQLite3 so a Child in this schema has a Foreign Key which matches the Primary Key of the Parent Table.
Thanks!
This may be too obvious, but...
How trivial it is will depend on how much code has already been written that needs to change with this. If this is a new app with little code written, and you're just beginning to work on the design, then yes, it's trivial. If you have tons of functions (whether external code, or DB code like stored procedures and views) accessing these tables and expecting the original relationship, then it becomes less trivial.
Changing it in the database itself should be relatively trivial, assuming you know enough SQL to populate the new table and set up the relations.
As with all development challenges, you just need to look at what will be affected, how, and determine how you're going to account for those changes.
All of this is a really long-winded way of saying "it depends on your situation".
I am not disagreeing with David at all, just being precise re a couple of aspects of the change.
If you have implemented reasonable standards, then the only code affected will be code that addresses the changed columns in Child (not New_Table). If you have not, then an unknown amount of code, which should not need to change, will have to change.
The second consideration is the quality of the Primary Key in Child. If you have Natural Relational Keys, the addition of New_Table has less impact: no data changes required. If you have IDENTITY-type keys, then you may need to reload, or worse, "re-factor" the keys.
Last, introducing New_Table is a correction to a Normalisation error, which is a good thing. Consequently, certain Child columns will become New_Table columns, and New_Table can be loaded from the existing data. You need to do that correctly and completely, in order to realise the performance gain from the correction. That may mean changing a couple more code segments.
If you have ANSI SQL, all the tasks are fairly straight-forward and easy.
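For instance, the mechanical part in SQLite (the asker's platform) might look like this. A sketch only: the names are generic, and since SQLite cannot add a foreign key to an existing table, Child would need to be rebuilt to actually enforce the new relation.

CREATE TABLE New_Table (
    NewTableID INTEGER PRIMARY KEY,
    ParentID   INTEGER NOT NULL REFERENCES Parent (ParentID)
);

-- one New_Table row per Parent currently referenced by a Child
INSERT INTO New_Table (ParentID)
SELECT DISTINCT ParentID FROM Child;

-- re-point Child at New_Table via a new column
ALTER TABLE Child ADD COLUMN NewTableID INTEGER;
UPDATE Child
SET NewTableID = (SELECT nt.NewTableID
                  FROM New_Table nt
                  WHERE nt.ParentID = Child.ParentID);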