Optional Database Entities - sql

ORIGINAL (see UPDATED QUESTION below)
I am designing a new laboratory database that tests a wide variety of tests on a wide variety of sample types.
The following list is my current candidate for the list of main entities to best model the laboratory work.
For each entity, a 1-to-many relationship exists from that entity to the entity below. In other words, every entity (except REQ) has at least columns for entity_id and parent_id.
Main Entities:
REQ: Request (a form)
SAM: Sample (the material)
TST: Test (requested procedures)
SUB: ** Sub-Test (part of standard test)
TRI: ** Trial (single instance: usually for mean,range, and stddev)
MEA: Measurement (a measured number)
** Not all tests have subtests, and not all tests have trials.
Sub-tests are a set of tests grouped together by a single name for easy referencing. For example, a lot acceptance test (LAT) for a particular product is defined as the following tests: viscosity, %-nitrogen, pH, and density.
A trial is a single experiment performed multiple times for product assurance. For example, fifty bullets might be shot, and each shot is a trial. The accuracy of each bullet might be required to fall within a certain range, and the average accuracy of all fifty bullets might be required to be in a tighter range.
Question: How should I model cases when sub-tests and/or trials are not needed?
Option 1: Use a "blank" sub-test (or trial) if not needed.
Option 2: Consider sub-tests and trials to be tests (and have a test_id as a parent), so that measurements always have a test as a parent.
Option 3: Optional parents for measurement (trial, sub-test, or test) and trials (sub-test or test).
Option x: Any other option worth considering.
FYI: If required to answer the question, I will be using Oracle.
UPDATED QUESTION
In general, my schema is a heirarchy of entities where each entity (except top) must have ONE parent and (except bottom) must have at least one child. What is the best way to handle cases where an internal entity is unnecessary in certain situation, or what is the benefit/drawback to using a particular option?
Option 1 (Dummy): Use a "dummy" entry to indicate entity does not apply in this case.
Option 2 (Rollup): Roll-up optional entities into next higher parent entity.
Option 3 (Pick-a-Parent): Entity (C) below optional entity (B) with required entity (A) must have ONE parent but the parent can be either the optional entity (B) or the next higher one (A).
Option x: Any other option worth considering.

Addressing your simplified question:
Given a hierarchy as you've described, if I found that some levels in the hierarchy were optional, I would question whether a hierarchy really mapped well to my domain. I would consider drawing my relations differently, or redefining the entities in my schema.
I don't think a more detailed answer to the general question is possible in a short space like this, since figuring out the best representation of a domain is a) hard, and b) very specific to the particular domain.

Use Outer Joins. (RIGHT OUTER JOIN and LEFT OUTER JOIN).
They were made specifically for this.

< Edit > This is my first post. Based on the comments, I'll be adding a second post.
Here's my take on an architectural first pass. This stuff generally requires a LOT of back-and-forth with the subject matter experts to get right.
"Test" means one of:
- Take an action, measure results
- Take several actions (subtests), measure results for each
- Make no tests whatsoever (yet you can still have measurements -- ?)
I'd configure this as a "parent" Test table and a child "SubTest" table, where Test can have 0 or more related SubTests, and every SubTest must be related with one and only one Test. (If a test has only one SubTest, enter it in its own table, don't try and track SubTests in the Test table.)
Trials can only exist if there are SubTests. Therefore, Trials are a child of the SubTest table; SubTests can have zero or more Trials, and Trials must be related with one and only one SubTest.
Measures only exist if there are Trials. Therefore, repeat the above, with Measures as a child of Trials.
Can there be SubTests without Trials (or Tests)? If so, then don't enter any Trials.
Can there be Measures without Trials? If no, you don't need any Trials (or SubTests). If yes (?), once again enter some properly labeled dummy/placholder SubTests or Trials as necessary.
Again, this is rudimentary, and more interviews with the folks driving requirements is required.

As others have remarked it is hard for us to give a definitive answer without understanding more about your domain. You have attempted to distill a lot of business rules into a couple of paragraphs but some important information has been lost. Specifically, it is not possible to be sure whether two entities are genuinely distinct without knowing their attributes. Having said all which, let's have a go.
A TEST is a single procedure. Despite containing the word "test" a LAT is not a TEST in its own right but is rather a pre-defined set of such procedures. I would model this scenario as an entity TEST with an optional parent entity, which I would prefer to call TEST_GROUP (as that is what it is) but it is best to use the domain name, SUB_TEST.
A TRIAL appears to be distinct from a TEST, so model it as a separate entity. Therefore you have a choice when it comes to MEASUREMENT: you can have one entity with two optional foreign keys or you can have TEST_MEASUREMENT and TRIAL_MEASUREMENT. Choosing which road to go depends on the characteristics and usage profile.
The following is an initial stab at the entity relationships. This would be the point in the project when the user goes, "Oh no, that is not what I meant at all."
create table sample (
sample_id number not null
, constraint samp_pk primary key (sample_id)
)
/
create table sub_test (
sub_test_id number not null
, sample_id number not null
, constraint subt_pk primary key (sub_test_id)
, constraint subt_samp_fk foreign key (sample_id)
references sample (sample_id)
)
/
create table test (
test_id number not null
, sample_id number not null
, sub_test_id number
, constraint tst_pk primary key (test_id)
, constraint tst_samp_fk foreign key (sample_id)
references sample (sample_id)
, constraint tst_subt_fk foreign key (sub_test_id)
references sub_test (sub_test_id)
)
/
create table trial (
trial_id number not null
, test_id number not null
, constraint trl_pk primary key (trial_id)
, constraint trl_tst_fk foreign key (test_id)
references test (test_id)
)
/
create table measurement (
measurement_id number not null
, trial_id number
, test_id number
, constraint meas_pk primary key (measurement_id)
, constraint meas_tst_fk foreign key (test_id)
references test (test_id)
, constraint meas_trl_fk foreign key (trial_id)
references trial (trial_id)
, constraint measurement_ck check (
(test_id is not null and trial_id is null)
or (test_id is null and trial_id is not null)
)
/
Edit
Addressing your more generic question.
Option 1 (Dummy)
Never use a dummy record. It's is like using a magic value instead of a null. The solution is worse than the problem it solves.
Option 2 (Rollup)
This can work when the parent and the child have the same attributes. But it is not a viable solution if they have different columns, or if they are different dependencies. Even if they have identical data structures but different business uses it may still be a problem.
Option 3 (Pick-a-Parent)
This would be my preferred solution. The snag is the need for a check constraint to ensure that one (and only one) of the eligible foreign keys has been populated. You also need to guard against allowing too many parents/grandparents/great-grandparents into the mix.

I am not entirely sure I understand the details of your question, but it sounds like you should have the following:
Table Test
test_id, request, sample, test
Table SubTest
subtest_id, test_id (foreign key to Test)
Table Trial
trial_id, trial_name, measurement, subtest_id
So, Test is a collection of subtests (possibly just one subtest), and a subtest is a collection of Trials (possibly just one trial)

I'm not entirely certain I understand your domain, but could you do something like this?
Tests has a parent_test_id column, which can be NULL (when set, this is a subtest).
Trials has a test_id column. (All tests have at least one trial, since you did a thing and had at least one measurement, right?)
Measurements has a trial_id column.
This does seem to violate your premise, since it stipulates that all tests have at least one trial, so it's possible I misunderstand the requirements. How can you have a test with no trials?
Anyway, if necessary, you could put both a trial_id and a test_id on Measurements, possibly with a constraint that one or the other must be NULL (and the other must be set).

I'll take a second stab at this one, based on the feedback from my first post. The key thing to understand is that design and architecture can be highly iterative, and I doubt you'll get the ideal model without a lot of back-and-forth--something that doesn't play out to well on Stack Overflow. Odds are you'll take the ideas posted (APC has some good ones), bounce them around with the people you work with, and come up with something that'll work.
My goal these days when designing databases is to try and produce a fully normalized model. Once you've got that, if it doesn't seem reasonable or practical you can denormalize for efficiency, expediency, or whatever -- but the key thing is you denormalize after you've found the ideal model. If you stop normalization before you get to fully normalized, you haven't denormalized, you've just got a sloppy model.
Here's the entities I see to-date:
What you've labeled as the top-level test, for purposes of clarity here I'm going to call an Exam. You define an exam and all its contents (below), and people contact your laboratory to run these exams on their problems.
For any given exam performed for a customer, you run a bunch of Tests. Any given test may be used by (required by?) any number of exams.
Often, you get a set of Tests that are done together for more than one Exam. If there are properties that apply to the specific set of Tests, you might want to identify each set as its own entity. Call these TestGroups. However, if these are only used associate a specific set of Tests with one or more Exams, you might not get any particular benefit our of defining them as their own entity. (These are your SubTests.)
So, an Exam "has" or "contains" one or more Tests. Alternatively, Exams are related with one or more TestGroups. However, trying to relate an Exam with zero or more TestGroups and zero or more individual Tests will produce an overly complex model (let alone physical implmentation), and I'd really want to avoid that. Perhaps a TestGroup can contain a single Test, so Exams only reference TestGroups? Maybe an Exam can only be related to one TestGroup -- in which case that'd be the "many to many" table relating Exams with Tests. This depends on further discussion of requirements with the subject matter experts.
So you have Exams -- Exam definitions, really -- related somehow or other with multiple Tests. Next up, you have a "paid instance" of an Exam (customer X comes in and pays you to test his Widgets). Call this a CustomerExam; it has all the contact and billing info, identifies the Exam to be run, and thus is related to the Tests to be performed for the customer. (There's probably a Customer entity out there too...?)
Trials are perfomed for the Tests that are part of a CustomerExam. They don't relate with the Exam or the Test, they are an instance of the Trial being performed. (Seems safe to assume that the "meaning/definition" of a Trial would actually be part of a Test--for example, if Test = Is gun accurate, then the work required by a Trial for that Test = fire gun 50 times and measure). So as Trials are performed for the Tests of a given CustomerExam. Are they performed once, or more than once? (Is a trial to fire the gun 50 times, or is each shot counted as a trial? What if they do two rounds of 50 shots?) Whatever, the attributes of the Trial event are stored here -- when it happened, who did it, special notes/circumstances, whatever.
Measures are produced by (or for?) Trials. The meaning/definition of each measure is actually part of the definition of a Trial (which is part of the definition of a Test); the event of the Trial produces specific values for the defined/anticipated Measures. The assumption is that a Trial will generate zero (?) or more Measures, so Measures are their own entity.
Looking back at this, it seems like there's some form of implicit double stucture: a set of tables to define available Exams, Tests, Trials, and Measures (what can be examined, how can it be tested, what shall we measure) and a companion set of tables to track specific instances of each (who wanted it, who did the work, when did they do it, what were the results)
I've got to have way over-anazled this problem. The key thing here is, as with all design sessions, in posing ideas and asking questions, did they generate your own ideas, questions, or answers?

Related

Optaplanner: how efficiently enforce one-to-many constraint?

I prepare example with easyScoreCalculatorClass and incrementalScoreCalculatorClass (Java score counting) to solve problem with power consumers and power suppliers (phones and chargers, where voltage must be equal and consumer required ampere must be not greater then supplier provided ampere, and each supplier has cost).
Optaplanner make solution for sample with 2 consumers and one supplier which is not desired as I expected that one supplier can't operate on two consumers in one time and solution must end with at least -1 value fro hard score.
To enforce one-to-namy constraint I see possibility by maintaining map with count how many times suppliers assigned to consumers. Is that one possible solution to enforce constraint in Java code?
Is that possible to organize data or mark fields with annotations so this constraint automatically enforced?
I look to optaplanner: how to enforce planning variables values to be used only once but don't understand how to map Drool code to Java...
Several of the examples have uniqueness constraint that you describe. For example, in the n-queens example all rows must be different. That example has a java score impl too.

Best practices for referencing natural and/or surrogate key values in code

I'm modifying some stored procedures that manage status changes when records are updated.
For example, if I have these two tables
Request(RequestID, StatusID)
Status(StatusID, StatusName)
I'm trying to determine the best to handle calling out the statuses in code.
Do I use StatusID or StatusName?
It's not guaranteed that StatusID will match between environments (DEV, PRE, PROD, etc).
Also, StatusName could be changed and I wouldn't want to have to alter code because I needed to change a StatusName.
I could create a 2nd unique column, which would sort of closely resemble StatusID.
I'd make sure this column was matched between regions, but that doesn't seem that clean either and sort of repetitive.
Can anyone suggest a cleaner, simpler way?
The difficulty of matching code to data can only partially be handled with a second column. When someone adds an item, what does this mean? If they re-use a known constant, what does it mean if you don't require this column to be unique?
Often times we will have user modifiable lookup tables, but they will have to be associated with a number of other flags indicating how to interpret the status - "IsTreatedAsExpired", "IsTreatedAsActive" or perhaps other tables which hold the statuses which are treated as certain things.
I think you really need to figure out the scope of what you want to allow with this table first. Because if you have a LOT of code references, you would be better off using a natural key which is in sync with your code on all installations. A possibility to handle this is to use negative numbers for unmovable codes (identity insert to add new unmovable codes) and then have your sequence only add positive ones. But again, this doesn't address the semantics of how your program would handle or use the user-entered extensions.
Again, it's hard to say without getting the full scope sorted out here.
From the information you've given, StatusID may have different values in different databases, presumably because your keys are generated automatically and are not specified by you. If so then obviously it's impossible to use StatusID consistently in your code anyway (without standardizing the values). Therefore the question becomes "is it acceptable/practical/desirable to hard-code StatusName values in my code?"
The obvious answer is yes, what's the alternative? If you have a certain status that represents 'ready' and you want to reference that in code then you must put something in your code that identifies the status unambiguously.
If you add a second key of some kind (as Carlos suggested) you still have the same basic problem that changing a natural key value is changing the identity of the status and therefore changes the meaning of your code. If you change the 'real' natural key (READY) without changing the second key (RDY) then your code will become more confusing and difficult to maintain.
If you do something more complex like extracting 'constants' or 'configuration parameters' into a configuration file or table or even writing a custom preprocessor to insert key values into your scripts at deployment time, you add lots of complexity for very little gain (unless you have other good reasons for doing it). I've seen this approach used, and it was a huge, unmaintainable mess.
In practice, StatusName is most likely to change because a) someone thinks another name would be 'more accurate' or 'look better', or b) you discover that it doesn't correctly represent your requirements. If you're forced to spend time on a) then just change the display name in your front end or reports and leave the database and code alone. If b) comes up then by definition your current data model and code are inaccurate and must be revised and possibly modified anyway. And when b) does happen, it often results in adding a new code, not changing the existing one (e.g. because someone defined a new process step that there is no existing code for).
And if you are open to changing your development and deployment practices there are other ways to look at this issue too, as others have suggested. Can you make your StatusID values the same everywhere? Technically it's possible, so what are the organizational reasons not to? Can you reduce the probability and impact of StatusName changes through change management and code reviews? Can you improve your requirements process to capture certain information more effectively?
Write a user defined function that accepts status name and gives out the status if wherever you are referring the status id
select * from resources where statusid = dbo.getStatusId("COMPLETED");
This would make sure that resolving the status id always happens within the function that you have defined
As a rule of thumb when you have id,value tables (Status, Result, Area, etc..) I usually add a third field that its the record's mnemonic value and always use that, neither the name or the id.
Now the mnemonic value is like a business key (well, it is a business key) in the sense that its a business value and does not depend on the database (for the id) or the way it displayed (the description) so for example for your status table you may have
StatusID,StatusName,StatusMnemo
1 ,COMPLETED ,COM
2 ,REJETED ,REJ
and so forth.
And in your queries you always join by statusId but you add a clause to join against the status table by StatusMnemo. This is a value that's independent across environments and remains constant.
Also in inserts, you always use statusid.
If you have statusID values that need special treatment then they should be the same across environments.
Why would you introduce a statusID that needs special treatment in Prod that has not gone thru Pre and Dev?
What I often do is start iden at 100 and use that for generic status that don't need special treatment.
Then DEV owns the space under 100 for special treatment using IDENTITY INSERT ON.
If deploy from DEV to PRE insert any records under 100.

Table structure for Scheduling App in SQL DB

I'm working on a database to hold information for an on-call schedule. Currently I have a structure that looks about like this:
Table - Person: (key)ID, LName, FName, Phone, Email
Table - PersonTeam: (from Person)ID, (from Team)ID
Table - Team: (key)ID, TeamName
Table - Calendar: (key dateTime)dt, year, month, day, etc...
Table - Schedule: (from Calendar)dt, (id of Person)OnCall_NY, (id of Person)OnCall_MA, (id of Person)OnCall_CA
My question is: With the Schedule table, should I leave it structured as is, where the dt is a unique key, or should I rearrange it so that dt is non-unique and the table looks like this:
Table - Schedule: (from Calendar)dt, (from Team)ID, (from Person)ID
and have multiple entries for each day, OR would it make sense to just use:
Table - Schedule: (from Calendar)dt, (from PersonTeam)KeyID - [make a key ID on each of the person/team pairings]
A team will always have someone on call, but a person can be on call for more than one team at a time (if they are on multiple teams).
If a completely different setup would work better let me know too!
Thanks for any help! I apologize if my question is unclear. I'm learning fast but nevertheless still fairly new to using SQL daily, so I want to make sure I'm using best practices when I learn so I don't develop bad habits.
The current version, one column per team, is probably not a good idea. Since you're representing teams as a table (and not as an enum or equivalent), it means you expect to add/remove teams over time. That would force you to add/remove columns to the table, which is always a much larger task than adding/removing a few rows.
The 2nd option is the usual solution to a problem like this. A safe choice. You can always define an additional foreign key constraint from Schedule(teamID, personID) to PersonTeam to ensure you don't mistakenly assign schedule duty to a person not belonging to the team.
The 3rd option is pretty much equivalent to the 2nd, only you're swapping a composite natural key for PersonTeam for a surrogate simple key. Since the two components of said composite key are already surrogate, there is no advantage (in terms of immutability, etc.) to adding this additional one. Plus it would turn a very simple N-M relationship (PersonTeam) which most DB managers / ORMs will handle nicely into a more complex object which will need management on its own.
By Occam's razor, I'd do away with the additional surrogate key and use your 2nd option.
In my view, the answer may depend on whether the number of teams is fixed and fairly small. Of course, whether the names of the teams are fixed or not, may also matter, but that would probably have more to do with column naming.
More specifically, my view is this:
If the business requirement is to always have a small and fixed number of people (say, three) on call, then it may well be more convenient to allocate three columns in Schedule, one for every team to hold the ID of the appointed person, i.e. like your current structure:
dt OnCall_NY OnCall_MA OnCall_CA
--- --------- --------- ---------
with dt as the primary key.
If the number of teams (in the Team table) is fixed too, you could include teams' names/designators in the column names like you are doing now, but if the number of teams is more than three and it's just the number of teams in Schedule that is limited to three, then you could just use names like OnCallID1, OnCallID2, OnCallID3.
But even if that requirement is fixed, it may only turn out fixed today, and tomorrow your boss says, "We no longer work with a fixed number of teams (on call)", or "We need to extend the number of teams supported to four, and we may need to extend it further in the future". So, a more universal approach would be the one you are considering switching to in your question, that is
dt Team Person
--- ---- ------
where the primary key would now be dt, Team.
That way you could easily extend/reduce the number of people on call on the database level without having to change anything in the schema.
UPDATE
I forgot to address your third option in my original answer (sorry). Here goes.
Your first option (the one actually implemented at the moment) seems to imply that every team can be presented by (no more than) one person only. If you assign surrogate IDs to the Person/Team pairs and use those keys in Schedule instead of separate IDs for Person and Team, you will probably be unable to enforce the mentioned "one person per team in Schedule" requirement (or, at least, that might prove somewhat troublesome) at the database level, while, using separate keys, it would be just enough to set Team to be part of a composite key (dt, Team) and you are done, no more than one team per day now.
Also, you may have difficulties letting a person change the team over time if their presence in the team was fixated in this way, i.e. with a Schedule reference to the Person/Team pair. You would probably have to change the Team reference in the PersonTeam table, which would result in misrepresentation of historical info: when looking at the people on call back on certain day, the person's Team shown would be the one they belong to now, not the one they did then.
Using separate IDs for people and teams in Schedule, on the other hand, would allow you to let people change teams freely, provided you do not make (Schedule.Team, Schedule.Person) a reference to (PersonTeam.Team, PersonTeam.Person), of course.

How important are lookup tables?

A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
Thanks.
The answer depends a little if you are limited to freeware such as PostGreSQL (not fully SQL compliant), or if you are thinking about SQL (ie. SQL compliant) and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if you those values change, you may have to update millions of rows, which is very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
the second point is ease of change, if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is a Non-SQL, do not use it. In SQL the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well don't change it, why would you want to. Those values were supposed to be meaningless, remember, only a meaningful Key need to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Exaple
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE "Eq%"
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
Create a function for each lookup.
There is no easy way. You want performance and query simplicity. Ensure the following is maintained. You could create a SP_TestAppEnums to compare existing lookup values against the function and look for out of sync/zero returned.
CREATE FUNCTION [Enum_Post](#postname varchar(10))
RETURNS int
AS
BEGIN
DECLARE #postId int
SET #postId =
CASE #postname
WHEN 'Open' THEN 1
WHEN 'Closed' THEN 2
END
RETURN #postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your where clause can use a friend name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the where clause ('name' in your example above) is more likely to get changes than the key attribute ('name' in your example above): you're more protected by referencing the desire key value(s) of the domain table in your join.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table,
to populate UI controls and the like with user-friendly information,
Naturally, you'd need to suck domain tables into your queries where you you actually required the non-key attributes from the domain table (e.g., descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
lookup tables involve joins or subqueries which are not always efficient. I make use of enums a lot to do this job. its efficient and fast
Where possible (and It is not always . . .), I use this rule of thumb: If I need to hard-code a value into my application (vs. let it remain a record in the database), and also store that vlue in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a peice of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always acheive "perfect" normalization in a database design (for reason of performance, or simply because you CAN take thngs too far in pursuit of acedemic perfection . . .), we may make some concious choices about where we locate our "look-up" values.
Personally, though, I try to stand on my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the databse, I try to keep it out of the Application code (except, of course, to retrieve it from the database . . .). If it is hardcoded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later, some poor soul will be able to ficure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.

SSIS Population of Slowly Changing Dimension with outrigger

Working on a data warehouse, a suitable analogy for the problem is that we have Healthcare Practitioners. Healthcare Practitioners have a number of professional attributes and work in an open number of teams and in an open number of clinical areas.
For example, you may have a nurse who works in children's services across a number of teams as a relief/contractor/bank staff person. Or you may have a newly qualified doctor who works general medicine who is doing time in a special area pending qualifying as a consultant of that special area.
So we have an open number of areas of work and an open number of teams, we can't have team 1, team 2 etc in our dimensions. The other attributes may change over time also, like base location (where they work out of), the main team and area they work in..
So, following Kimble I've gone for outriggers:
Table DimHealthProfessionals:
Key (primary key, identity)
Name
Main Team
Main Area of Work
Base Location
Other Attribute 1
Other Attribute 2
Start Date
End Date
Table OutriggerHealthProfessionalTeam:
HPKey (foreign key to DimHealthPRofessionals.Key)
Team Name
Team Type
Other Team Attribute 1
Other Team Attribute 2
Table OutriggerHealthProfessionalAreaOfWork:
HPKey (as above)
Area of Work
Other AoW attribute 1
If any attribute of the HP changes, or the combination of teams or areas of work in which they work change, we need to create a new entry in the SCD and it's outrigger tables to encapsulate this.
And we're doing this in SSIS.
The source data is basically an HP table with the main attributes, a table of areas of work, a table of teams and a pair of mapping tables to map a current set of areas of work to an HP.
I have three data sources, one brings in the HCP information, one the areas of work of all HCPs and one the team memberships.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
Admittedly I may not understand everything here, but it seems to me that the relationship in this example should be reversed. Place TeamKey and the WorkAreaKey in the dimHealthProfessionals -- this should simplify things.
With this in place, you simply make sure to deliver outriggers before the dimHealthProfessionals.
Treat outriggers as dimensions in their own right. You may want to treat dimHealthProfessionals as a type 2 dimension, to properly capture the history.
EDIT
Considering that team to person is many-to-many, a fact is more appropriate.
A column in a dimension table is appropriate only if a person can belong to only one team at a time. Same with work areas.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
I'm not sure I understand your question fully. If you are unsure about change detection, then use Checksums in the package. Build up a temp table with the data as it is in the source, then compare each row to its counterpart (joined via the business keys) by computing the checksum for both rows and comparing those. If they differ, the data has changed.
If you are talking about cascading updates in a historized dimension hierarchy (and you can treat the outriggers like a hierarchy in this context) then the foreign key lookups will automatically lookup the newer entry in DimHealthProfessionals if you have a historization (i.e. have validFrom / validThrough timestamps in DimHealthProfessionals). Those different foreign keys result in a different checksum.