database: summarizing data which expires - sql

I'm struggling to find an efficient and flexible representation for my data. We have a many-to-many relationship between two entities which have arbitrary lifetimes. Let's call these Voter and Candidate. Each relationship has a measurement which we'd like to summarize in various ways. These are timestamped and are guaranteed to be within the lifetime of the two related entities. Let's say the measure is approval rating, or just Rating.
One unusual requirement is that if I'm summarizing a period which has no measurement, I should substitute the latest valid measurement, rather than giving NULL.
Our current solution is to compile a list of valid voters and candidates for each day, then formulate a many-to-many table which records the latest valid measure.
What would your solution be?
This allows me to do a single query to get a daily summary:
select
    avg(rating), valid_date, candidate_SSN, candidate_DOB
from
    daily_rating natural join rating
group by
    valid_date, candidate_SSN, candidate_DOB
This might work ok, but it seems inefficient to me. We're repeating a lot of data, especially if nothing happens for a given day. It is also unclear how to do weekly/monthly summaries without compiling even more tables. Since we're dealing with millions of rows (we're not really talking about voter polls...) I'm looking for a more efficient solution.

I have used a data-warehousing technique here, hence the dim and fact table names.
dimDate is a so-called date dimension, with one row per date.
dimCandidate has all candidate data, new and old records. In data-warehousing terms this is called a type 2 dimension. One candidate can have several rows in this table, only one of them having r_status = 'current'.
Fields
, r_valid_from date
, r_valid_to date
, r_version integer -- (1, 2, 3,..)
, r_status varchar(10) -- (expired, current)
describe a record (row) status. Each time a candidate status changes, a new row is inserted and the previous row's r_valid_to and r_status are modified.
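The expire-and-insert step just described can be sketched as follows. This is only an illustration run through Python's sqlite3 module so that it is self-contained; the Party column, the '9999-12-31' open-ended date and the change_candidate helper are assumptions made for the example, not part of the design above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dimCandidate (
    CandidateKey      INTEGER PRIMARY KEY,  -- surrogate key, one per row version
    CandidateFullName TEXT NOT NULL,        -- business key, one per candidate
    Party             TEXT,                 -- hypothetical tracked attribute
    r_valid_from      TEXT NOT NULL,
    r_valid_to        TEXT NOT NULL,
    r_version         INTEGER NOT NULL,
    r_status          TEXT NOT NULL         -- 'expired' or 'current'
)""")

def change_candidate(conn, full_name, party, change_date):
    """Expire the current row (if any) and insert the next version."""
    with conn:  # both statements commit together or not at all
        version = conn.execute(
            "SELECT COALESCE(MAX(r_version), 0) FROM dimCandidate "
            "WHERE CandidateFullName = ?", (full_name,)).fetchone()[0]
        conn.execute(
            "UPDATE dimCandidate SET r_valid_to = ?, r_status = 'expired' "
            "WHERE CandidateFullName = ? AND r_status = 'current'",
            (change_date, full_name))
        conn.execute(
            "INSERT INTO dimCandidate (CandidateFullName, Party, r_valid_from, "
            "r_valid_to, r_version, r_status) "
            "VALUES (?, ?, ?, '9999-12-31', ?, 'current')",
            (full_name, party, change_date, version + 1))

change_candidate(conn, "john_smith_256", "Red", "2009-01-01")
change_candidate(conn, "john_smith_256", "Blue", "2010-06-01")

versions = conn.execute(
    "SELECT r_version, Party, r_valid_to, r_status "
    "FROM dimCandidate ORDER BY r_version").fetchall()
print(versions)
```

After the second call the candidate has two rows: version 1 is expired with r_valid_to set to the change date, and version 2 is the only row with r_status = 'current'.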
CandidateFullName is a business (natural) key and has to uniquely identify a candidate. No two candidates can have the same CandidateFullName. Note that the CandidateKey uniquely identifies a row in the table, while CandidateFullName uniquely identifies a candidate.
dimVoter has voter data, new and old records -- just like the dimCandidate.
dimCampaign describes campaign details; this is a so-called type 1 dimension, which does not hold historical data.
factRating has the Rating measure.
Normally this would be enough, but there is the requirement to interpolate the missing data for a day; for that, an aggregate table aggDailyRating is introduced. At the end of a day, a scheduled job aggregates ratings for the day. This job takes care of the data-interpolation requirement.
This way the aggregate table has one row for each date-(valid) candidate-campaign combination. Note that voter is not included in the combination, data is aggregated over all voters.
Any reporting is done on the aggregate table, for example
--
-- monthly rating for years 2009-2010
-- for candidate john_smith_256
--
select
CalendarYear
, MonthNumber
, avg(DailyRating) as AverageRating
from aggDailyRating as f
join dimDate as d on d.DateKey = f.DateKey
join dimCandidate as c on c.CandidateKey = f.CandidateKey
where CandidateFullName = 'john_smith_256'
and CalendarYear between 2009 and 2010
group by CalendarYear, MonthNumber
order by CalendarYear desc, MonthNumber desc ;
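The end-of-day job itself is only described in words above. A minimal sketch of its carry-forward step, again through Python's sqlite3, might look like this. The simplified names (fact_rating, agg_daily_rating, ISO date strings as date keys) and the aggregate_day helper are assumptions, and the campaign dimension and the voter/candidate validity checks are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key TEXT PRIMARY KEY);
CREATE TABLE fact_rating (
    date_key      TEXT    NOT NULL,
    candidate_key INTEGER NOT NULL,
    voter_key     INTEGER NOT NULL,
    rating        REAL    NOT NULL
);
CREATE TABLE agg_daily_rating (
    date_key      TEXT    NOT NULL,
    candidate_key INTEGER NOT NULL,
    daily_rating  REAL    NOT NULL,
    PRIMARY KEY (date_key, candidate_key)
);
INSERT INTO dim_date VALUES ('2010-01-01'), ('2010-01-02'), ('2010-01-03');
-- Made-up sample data: nobody rates candidate 1 on 2010-01-02.
INSERT INTO fact_rating VALUES
    ('2010-01-01', 1, 10, 4.0),
    ('2010-01-01', 1, 11, 2.0),
    ('2010-01-03', 1, 10, 5.0);
""")

def aggregate_day(conn, day):
    """Average, per candidate, each voter's latest rating on or before `day`."""
    conn.execute("""
        INSERT INTO agg_daily_rating (date_key, candidate_key, daily_rating)
        SELECT ?, r.candidate_key, AVG(r.rating)
        FROM fact_rating AS r
        WHERE r.date_key = (
            SELECT MAX(r2.date_key)
            FROM fact_rating AS r2
            WHERE r2.candidate_key = r.candidate_key
              AND r2.voter_key = r.voter_key
              AND r2.date_key <= ?)
        GROUP BY r.candidate_key
    """, (day, day))

days = conn.execute("SELECT date_key FROM dim_date ORDER BY date_key").fetchall()
for (day,) in days:
    aggregate_day(conn, day)

rows = conn.execute(
    "SELECT date_key, daily_rating FROM agg_daily_rating ORDER BY date_key"
).fetchall()
print(rows)
```

The day without measurements still gets a row, computed from the latest earlier ratings, so a monthly average over agg_daily_rating never sees a NULL.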

Yes, that is very inefficient and wasteful. It is merely a set of files, not reasonably comparable to a set of "tables" or a "database"; extensions and enhancements to it will compound the duplication and inefficiency. Duplication is the antithesis of a database. In database terms, there are far more efficient and easier ways to implement that.
Assumption
Your post does not provide much info, so I have had to make some assumptions, but I think you can correct my submission quite easily if any of them are incorrect. Otherwise comment, and I will correct my submission.
A Voter is a Person; a Candidate is a Voter; (Candidate = subset of Voter)
A Campaign is related to Candidate (not to a Polling Campaign).
A Poll is a survey of the Voters' response to a Candidate's performance, starting on a set date, running over a few days, and completing on a set date.
There are many Measures, such as ApprovalRating, that are surveyed in each Poll.
The Measures of such surveys across all Voters are aggregated at the Poll level.
Limitation
The expiry requirement is unclear, so I am not suggesting I have implemented that. If the model does not provide that for you (if it is not immediately obvious), supply details and I will add to the model. The current model provides exclusion/inclusion capability for what I understand the expiry requirement to be.
The Poll::Measure does not have enough info to be implemented fully; I need further details. The submission is primitive and unconstrained in that area.
Likewise, any Poll::Campaign relation or constraint ("there are many Polls per Campaign, and they are always related to Campaign") has not been implemented.
The arrangement of the key in the child tables is arbitrary for now: if you identify the most common queries, it can be re-arranged so that those queries obtain the best speed.
Submission
Campaign Poll Data Model
This is just a Relational (Normalised; zero duplication) Database, pure IDEF1X, including provision for the consideration that the child tables will be huge: migration of narrow surrogate keys into the child tables, avoiding migration of wide keys.
It provides "data warehouse" capability as is. In fact, if it does not provide any BI or DSS requirement in a single query, that is only due to lack of detail from you; please provide, and I will happily change it. (Note, your item re "single query" is actually "single file"; joins are pedestrian in a Relational database.)
Keys such as %Code are 2-, 3-, and at most 4-characters. Such keys are just as fast as Integer keys, and very helpful (makes sense) when perusing the tables (without having to join the parent).
Any and all aggregation, either to load the historic rows, or to produce aggregates for the current values, should be possible in a single Relational (set-oriented) command; you should not need to resort to serial (cursor) processing. Again, if you think you need to, please comment and I will provide the set-oriented method.
We implement Versioning in DBs quite differently to the way it is done in DWs, and without limitations. Please identify if you require versioning of (eg) Candidate, and I will provide.
Last, the Null requirement is not unusual. It is catered for here. Again, if you think it isn't ...

Related

How to structure SQL tables with one (non-composite) candidate key and all non-primary attributes?

I'm not very familiar with relational databases but here is my question.
I have some raw data that's collected as a result of a customer survey. For each customer who participated, there is only one record and that's uniquely identifiable by the CustomerId attribute. All other attributes I believe fall under the non-prime key description as no other attribute depends on another, apart from the non-composite candidate key. Also, all columns are atomic, as in, none can be split into multiple columns.
For example, the columns are like CustomerId(non-sequential), Race, Weight, Height, Salary, EducationLevel, JobFunction, NumberOfCars, NumberOfChildren, MaritalStatus, GeneralHealth, MentalHealth and I have 100+ columns like this in total.
So, as far as I understand we can't talk about any form of normalization for this kind of dataset, am I correct?
However, given the excessive number of columns, if I wanted to split this monolithic table into tables with fewer columns, ie based on some categorisation of columns like demographics, health, employment etc, is there a specific name for such a structure/approach in the literature? All the tables are still going to be using the CustomerId as their primary key.
Yes, this is part of an assignment and as part of a task, it's required to fit this dataset into a relational DB, not a document DB which I don't think would gain anything in this case anyway.
So, there is no direct question as such as I worded above but creating a table with 100+ columns doesn't feel right to me. Therefore, what I am trying to understand is how the theory approaches such blobs. Some concept names or potential ideas for further investigation would be appreciated as I even don't know how to look this up.
In a relational database, keeping all of the information in a single table is usually not good practice.
As you mentioned, grouping some columns into other tables and joining them all to a master table works well. With this approach you can also manage one-to-many, many-to-one and many-to-many relationships; for example, a customer could have more than one address or phone number.
Another approach is to make a table like customer_properties, with columns like property_type and property_value, and store the data as rows (example below).
But the first approach is more effective and is the most common.
customer_id  property_type   property_value
1            num_of_child    3
1            age             22
1            marital_status  Single
.
.
.
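To make the two layouts concrete, here is a small sketch using Python's sqlite3. The table names customer_demographics and customer_health, the choice of columns in each group and the sample values are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layout 1: one table per column group, every table keyed by customer_id.
CREATE TABLE customer_demographics (
    customer_id    INTEGER PRIMARY KEY,
    race           TEXT,
    marital_status TEXT,
    num_of_child   INTEGER
);
CREATE TABLE customer_health (
    customer_id    INTEGER PRIMARY KEY
                   REFERENCES customer_demographics (customer_id),
    general_health TEXT,
    mental_health  TEXT
);
-- Layout 2: one row per (customer, property); every value is stored as text.
CREATE TABLE customer_properties (
    customer_id    INTEGER NOT NULL,
    property_type  TEXT    NOT NULL,
    property_value TEXT,
    PRIMARY KEY (customer_id, property_type)
);
-- Made-up sample data.
INSERT INTO customer_demographics VALUES (1, NULL, 'Single', 3);
INSERT INTO customer_health VALUES (1, 'Good', 'Good');
INSERT INTO customer_properties VALUES
    (1, 'num_of_child', '3'), (1, 'age', '22'), (1, 'marital_status', 'Single');
""")

# Layout 1: columns keep their types; groups are recombined with a join.
wide = conn.execute("""
    SELECT d.customer_id, d.marital_status, h.general_health
    FROM customer_demographics AS d
    JOIN customer_health AS h ON h.customer_id = d.customer_id
""").fetchall()

# Layout 2: a single attribute is a row lookup, and comes back as text.
tall = conn.execute("""
    SELECT property_value FROM customer_properties
    WHERE customer_id = 1 AND property_type = 'num_of_child'
""").fetchone()[0]

print(wide, repr(tall))
```

In the first layout each column keeps its own type and constraints; in the second, the number of children comes back as the string '3', which is one reason the first layout is usually preferred.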

What's the best practice to connect a table to a junction table in relational database design?

I'm building a relational database that will act as a CRM for a travel company. I have removed tables and attributes to make this as simple as possible. Users will send quotes to customers.
A hotel can have many rooms (e.g. hotel 1 can have both a twin room and a triple room).
A room can have many hotels (e.g. both hotels 1 and 2 can have a twin room).
Let's say a customer has a group of 6.
A user could send this customer a quote for hotel 1 with either 3x twin rooms or 2x triple rooms.
A quote will need to contain the hotel and appropriate room type and room type quantities.
What's the best practice to connect table HOTEL_ROOM_JUNCTION to QUOTE, as the key is a multi-attribute, composite key?
Thank you
Noting the Relational Database tag.
Problem
There is a lack of precision in your declarations:
A hotel can have many rooms (e.g. hotel 1 can have both a twin room and a triple room).
A room can have many hotels (e.g. both hotels 1 and 2 can have a twin room).
I think you mean RoomType. From the rest of your declarations, the system you are implementing is for Quotations of rooms across all hotels, not a room booking system for each of the hotels. That is, you need to track RoomType, not Room, per Hotel.
The tables as given are not Relational tables, they do not have any of the requirements that make them Relational. When you start with stamping an id field on every file, it cripples the data analysis & data modelling exercise that is required to create a set of Relational tables. That is anti-Relational:
physical pointers such as record id are expressly prohibited in the Relational Model.
The Primary Key must be "made up from the data".
I appreciate that you have been schooled in that, due to the marketing and promotion of primitive methods as "relational".
For starters, each logical row (not physical record with a record id) must be unique.
The fields in each file should not be prefixed with the filename. In SQL (the data sub-language for the implementation of the Relational Model), the fully qualified address for a column is:
[server.][database.][owner.][table.]column
with defaults (obvious) for each element. If a column is ambiguous, simply prefix it with the table name.
Primary Keys are a special case. In order to avoid confusion (and now, to allow the new NATURAL JOIN), they should be the full name, in both the PK and FK locations. An id on every file would ensure buggy code.
Relational Data Model
If I address all those issues, and model the data according to the Relational Model, it would be:
Notation
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993.
My IDEF1X Introduction is essential reading for those who are new to the Relational Model, or its modelling method. Note that IDEF1X models are rich in detail and precision, showing all required details, whereas home-grown models have far less than that. Which means, the notation has to be understood.
Content
Relational Key
In order to make the logical rows unique, we need to make a Key from the data. The users know their data, they know what is unique and what is not. Usually they will have a ShortName for such things as Company; Hotel; Customer; etc.
If you do not communicate with the user, there is no chance of supplying the user's needs.
Hotel, UserName, Customer are ShortNames, which are unique, which therefore are the Primary Key. (More, later)
Relational Keys are composites, because they preserve the natural data hierarchies. Get used to it.
If you need the DDL for composite Keys, please ask.
Presuming that a Hotel may be a chain or franchise, we need a Location to make a specific hotel that has rooms unique.
The following are discrete Facts, and should not be mixed together (doing so will lead to complex constraints and horrendous SQL code):
HotelRoomType
that a Hotel.Location has a particular RoomType; and the Price
RoomTypeAvailable
that a Hotel.Location has one of those RoomTypes available on a particular Date; and the Number.
I presume there is a file from the hotels that you will be importing on a daily basis: this is the central table for that, with the constraints, of course.
Quote
that a User is providing a Quote that is requested by a single Customer, for a single TravelDate, for a single Hotel.Location. This allows separate Quotes for separate Hotel.Locations for a single TravelDate; Quotes for a Customer for more than one TravelDate; etc.
If you need multiple Hotel.Locations (and their RoomTypes) on a single Quote, let me know in the comments, and I will update the data model.
QuoteRoomType
that a Quote contains a line item which is a single RoomType in the single Hotel.Location that is available on the TravelDate.
Relational Integrity
A logical feature of the Relational Model, which is distinct from Referential Integrity, which is a physical feature in SQL. It is not possible to achieve this in a Record Filing System with record ids as "primary keys", not even an advanced and progressed one (after the various errors in the initial RFS have been corrected). Genuine logical Keys ("made up from the data") are required.
In RoomTypeAvailable, we have constrained:
RoomTypes to that which the Hotel.Location actually has (in HotelRoomType)
AND is actually available on Date.
In QuoteRoomType, we have constrained:
Hotel.Location to that which is in the Quote,
AND RoomTypes to that which is available in Hotel.Location (in HotelRoomType),
AND which is available on the TravelDate (RoomTypeAvailable.Date "maps to" QuoteRoomType.TravelDate).
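The constraints listed above can be sketched as composite-key DDL. The data model diagram is not reproduced in this text, so the exact key columns below (in particular the Quote key of Customer, TravelDate, Hotel, Location) are a guess based on the descriptions, the parent tables (Hotel, RoomType, Customer, User) are omitted, and the sample values are made up. The sketch runs through Python's sqlite3 so the rejection can be observed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE HotelRoomType (
    Hotel TEXT NOT NULL, Location TEXT NOT NULL, RoomType TEXT NOT NULL,
    Price NUMERIC NOT NULL,
    PRIMARY KEY (Hotel, Location, RoomType)
);
CREATE TABLE RoomTypeAvailable (
    Hotel TEXT NOT NULL, Location TEXT NOT NULL, RoomType TEXT NOT NULL,
    Date TEXT NOT NULL, Number INTEGER NOT NULL,
    PRIMARY KEY (Hotel, Location, RoomType, Date),
    FOREIGN KEY (Hotel, Location, RoomType)
        REFERENCES HotelRoomType (Hotel, Location, RoomType)
);
CREATE TABLE Quote (
    Customer TEXT NOT NULL, TravelDate TEXT NOT NULL,
    Hotel TEXT NOT NULL, Location TEXT NOT NULL,
    UserName TEXT NOT NULL,
    PRIMARY KEY (Customer, TravelDate, Hotel, Location)
);
CREATE TABLE QuoteRoomType (
    Customer TEXT NOT NULL, TravelDate TEXT NOT NULL,
    Hotel TEXT NOT NULL, Location TEXT NOT NULL,
    RoomType TEXT NOT NULL, NumRoom INTEGER NOT NULL,
    PRIMARY KEY (Customer, TravelDate, Hotel, Location, RoomType),
    -- The line item belongs to the Quote, hence to the Quote's Hotel.Location.
    FOREIGN KEY (Customer, TravelDate, Hotel, Location)
        REFERENCES Quote (Customer, TravelDate, Hotel, Location),
    -- The RoomType must be available at that Hotel.Location on the TravelDate.
    FOREIGN KEY (Hotel, Location, RoomType, TravelDate)
        REFERENCES RoomTypeAvailable (Hotel, Location, RoomType, Date)
);
INSERT INTO HotelRoomType VALUES ('HotelOne', 'Sydney', 'Twin', 200);
INSERT INTO RoomTypeAvailable VALUES ('HotelOne', 'Sydney', 'Twin', '2024-05-01', 5);
INSERT INTO Quote VALUES ('Acme', '2024-05-01', 'HotelOne', 'Sydney', 'alice');
INSERT INTO QuoteRoomType VALUES ('Acme', '2024-05-01', 'HotelOne', 'Sydney', 'Twin', 3);
""")

try:
    # 'Triple' is not available at this Hotel.Location on the TravelDate.
    conn.execute(
        "INSERT INTO QuoteRoomType VALUES "
        "('Acme', '2024-05-01', 'HotelOne', 'Sydney', 'Triple', 2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

Because the Hotel, Location and TravelDate columns in QuoteRoomType are shared by both foreign keys, a line item cannot point at one hotel through the Quote and at another through the availability row.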
1960's Record Filing System • Anti-Relational, Sold as "relational"
This section is relevant for those who prescribe a Record ID field as "primary key" in every file. And somehow think that that is "relational". Others can safely skip it.
For comparison, here is the set of files that one would come up with, if one followed the techniques and methods that are promoted and marketed by Date; Darwen; Fagin; et al crowd, falsely proposed as "relational".
This is a "mature" or "advanced" model, the fourth or fifth iteration. It has a number of improvements over the initial RFS. The initial or second or third iteration would not be equivalent enough to offer a comparison:
the Facts that are required to support the system have been determined (as opposed to the initial model, the record perspective, which is oblivious to Facts).
the content of the records have been improved to prevent duplicates, to the extent possible given the record content (but it is still streets behind the uniqueness provided in a Relational data model)
Fails Relational
Nevertheless it has no Relational features, which are logical. It has only the physical features of SQL reference-ability. Just a few of the many failures, which the mob prescribes as "relational":
Duplicate rows (logical) are not prevented, because rows are not defined.
No Relational Integrity
which depends on Relational Keys. (Refer to the Relational Keys detailed above.)
Eg. QuoteRoomType is constrained to any RoomTypeAvailable.
It is not possible to constrain it to:
the HotelId that is referenced in the Quote only,
OR to RoomTypes that exist in the HotelId only,
OR to RoomTypesAvailable that are available on the TravelDate only.
One additional field, and one additional index, for the Record id on every file. That will have a marvellous effect on performance.
Horrendous navigation and query code.
No Relational Power
When two distal files need to be JOINed, each of the intermediate files must be additionally JOINed, something that is not required in a Relational database. That is because it breaks the Access Path Independence Rule, a concept that the razor gang have not understood in the fifty years since the advent of the RM. But they will come up with yet another abnormal "normal form", to add to their bag of seventeen thus far.
More, Not Fewer, Joins
Let’s look at what that means. We need a query to provide statistics for RoomTypes that have been quoted for previous year, so that hotels can re-arrange their room types to suit the expected traffic.
Using the Relational data model (separate section above), we would code:
SELECT RoomType.RoomType, -- Relational database
Description,
SUM( NumRoom )
FROM RoomType
JOIN QuoteRoomType ON RoomType.RoomType = QuoteRoomType.RoomType
WHERE DATEPART( YY, TravelDate ) = DATEPART( YY, GETDATE() ) - 1
GROUP BY RoomType.RoomType, Description
Using the Record Filing System data model, which is the result of following the advice of the Date; Darwen; Fagin; philipxy; AntC; et al gang, which is falsely marketed as "relational" (above), we would be forced to code:
SELECT RoomType, -- Record Filing System
Description,
SUM( NumRoom )
FROM RoomType
JOIN HotelRoomType
ON RoomType.RoomTypeId = HotelRoomType.RoomTypeId
JOIN RoomTypeAvailable
ON HotelRoomType.HotelRoomTypeId = RoomTypeAvailable.HotelRoomTypeId
JOIN QuoteRoomType
ON RoomTypeAvailable.RoomTypeAvailableId = QuoteRoomType.RoomTypeAvailableId
JOIN Quote
ON QuoteRoomType.QuoteId = Quote.QuoteId
WHERE DATEPART( YY, TravelDate ) = DATEPART( YY, GETDATE() ) - 1
GROUP BY RoomType, Description
Gotta love the QueryPlan for that, that the SQL platform will produce.
Re-arranging the order of the JOINs might improve the tortoise.
Resorting to moving fragments such as “partial FDs” or “MVDs” around, might improve it.
Perhaps deploying more “candies”, plus the required additional indices, all over the place, will help. But wait, that would be duplication on a mass scale, it would break Normalisation, there would be Update Anomalies everywhere one looks.
Note that that result set has no reliability; no credibility. Why ? Because, as already proved, the QuoteRoomType is not constrained to the Quote.Hotel (referenced by HotelId);
or to the Quote.TravelDate;
or to the RoomTypes available in QuoteHotel (referenced by HotelId).
Further, there may well be duplicates, because prevention can only be partially implemented. The result of which is unreliable result sets.
Simplicity vs Complexity
If you have the interest and the stamina, you can attempt to elevate the RFS by muddling through their "partial dependencies"; "transitive dependencies"; "candies"; "multi-valued dependencies"; etc, all of which are neither defined in, nor required in, the Relational Model. They are expressly for use in the Record Filing Systems of the last century.
First, the RFS paradigm (marketed as "relational") forces a record mindset, instead of a data-only mindset.
Second, it breaks everything down into fragments, instead of understanding the atoms; the Facts, in their full context (data hierarchies).
Third, it gives you a morass of complexity to handle the fragments, that have no relevance when handling atoms.
When you are done, all that complexity in the Record Filing System will still not be anywhere near the simplicity of the equivalent Relational data model: it will have:
No Relational Integrity (yes, yes, we have Declarative Referential Integrity, and that only for physical records, not for logical rows)
No Relational Power (multiple forced JOINs in every query)
No Relational Speed (those additional columns and indices have an effect).
And the navigation and query code will be horrendous, and prone to errors.
Please feel free to ask specific questions. Also, please supply clarifications as noted, and I will update the data model.
Since a specific room can only exist in one hotel, the table HOTEL_ROOM_JUNCTION is redundant. So pk hotel_id is an fk in room, and the pk in room is a composite key of hotel_id and room_id.
If one quote can consist of several rooms, you need a connecting table between quote and room, with fk quote_id, room_id and hotel_id; those three will be the pk in that table. (As a rule of thumb, that kind of table will usually need a timestamp.)
(As a side note: I would name the tables QUOTES, ROOMS and HOTELS since they contain many.)
EDIT: I misread the question somewhat. To make my model what the OP wants, I need to add ROOM_TYPES with pk room_type_id, which will be an fk (not null) in ROOMS but not part of the pk.

Table structure for Scheduling App in SQL DB

I'm working on a database to hold information for an on-call schedule. Currently I have a structure that looks about like this:
Table - Person: (key)ID, LName, FName, Phone, Email
Table - PersonTeam: (from Person)ID, (from Team)ID
Table - Team: (key)ID, TeamName
Table - Calendar: (key dateTime)dt, year, month, day, etc...
Table - Schedule: (from Calendar)dt, (id of Person)OnCall_NY, (id of Person)OnCall_MA, (id of Person)OnCall_CA
My question is: With the Schedule table, should I leave it structured as is, where the dt is a unique key, or should I rearrange it so that dt is non-unique and the table looks like this:
Table - Schedule: (from Calendar)dt, (from Team)ID, (from Person)ID
and have multiple entries for each day, OR would it make sense to just use:
Table - Schedule: (from Calendar)dt, (from PersonTeam)KeyID - [make a key ID on each of the person/team pairings]
A team will always have someone on call, but a person can be on call for more than one team at a time (if they are on multiple teams).
If a completely different setup would work better let me know too!
Thanks for any help! I apologize if my question is unclear. I'm learning fast but nevertheless still fairly new to using SQL daily, so I want to make sure I'm using best practices when I learn so I don't develop bad habits.
The current version, one column per team, is probably not a good idea. Since you're representing teams as a table (and not as an enum or equivalent), it means you expect to add/remove teams over time. That would force you to add/remove columns to the table, which is always a much larger task than adding/removing a few rows.
The 2nd option is the usual solution to a problem like this. A safe choice. You can always define an additional foreign key constraint from Schedule(teamID, personID) to PersonTeam to ensure you don't mistakenly assign schedule duty to a person not belonging to the team.
The 3rd option is pretty much equivalent to the 2nd, only you're swapping a composite natural key for PersonTeam for a surrogate simple key. Since the two components of said composite key are already surrogate, there is no advantage (in terms of immutability, etc.) to adding this additional one. Plus it would turn a very simple N-M relationship (PersonTeam) which most DB managers / ORMs will handle nicely into a more complex object which will need management on its own.
By Occam's razor, I'd do away with the additional surrogate key and use your 2nd option.
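A minimal sketch of the 2nd option with the extra foreign key to PersonTeam, run through Python's sqlite3. The column names PersonID and TeamID and the sample rows are assumptions; the Calendar table and the contact columns are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, LName TEXT, FName TEXT);
CREATE TABLE Team   (ID INTEGER PRIMARY KEY, TeamName TEXT NOT NULL);
CREATE TABLE PersonTeam (
    PersonID INTEGER NOT NULL REFERENCES Person (ID),
    TeamID   INTEGER NOT NULL REFERENCES Team (ID),
    PRIMARY KEY (PersonID, TeamID)
);
CREATE TABLE Schedule (
    dt       TEXT    NOT NULL,
    TeamID   INTEGER NOT NULL,
    PersonID INTEGER NOT NULL,
    PRIMARY KEY (dt, TeamID),  -- one person on call per team per day
    FOREIGN KEY (PersonID, TeamID)
        REFERENCES PersonTeam (PersonID, TeamID)
);
-- Made-up sample data.
INSERT INTO Person VALUES (1, 'Doe', 'Ann'), (2, 'Roe', 'Bob');
INSERT INTO Team VALUES (10, 'NY'), (20, 'MA');
INSERT INTO PersonTeam VALUES (1, 10), (1, 20), (2, 10);
-- Ann is on call for both of her teams on the same day.
INSERT INTO Schedule VALUES ('2024-01-01', 10, 1), ('2024-01-01', 20, 1);
""")

try:
    # Bob (2) is not a member of team MA (20), so this must fail.
    conn.execute("INSERT INTO Schedule VALUES ('2024-01-02', 20, 2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

on_call = conn.execute(
    "SELECT COUNT(*) FROM Schedule WHERE dt = '2024-01-01'").fetchone()[0]
print(rejected, on_call)
```

Adding a team is then a row in Team, and no surrogate key on PersonTeam is needed for the membership check.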
In my view, the answer may depend on whether the number of teams is fixed and fairly small. Of course, whether the names of the teams are fixed or not, may also matter, but that would probably have more to do with column naming.
More specifically, my view is this:
If the business requirement is to always have a small and fixed number of people (say, three) on call, then it may well be more convenient to allocate three columns in Schedule, one for every team to hold the ID of the appointed person, i.e. like your current structure:
dt OnCall_NY OnCall_MA OnCall_CA
--- --------- --------- ---------
with dt as the primary key.
If the number of teams (in the Team table) is fixed too, you could include teams' names/designators in the column names like you are doing now, but if the number of teams is more than three and it's just the number of teams in Schedule that is limited to three, then you could just use names like OnCallID1, OnCallID2, OnCallID3.
But even if that requirement is fixed, it may only turn out fixed today, and tomorrow your boss says, "We no longer work with a fixed number of teams (on call)", or "We need to extend the number of teams supported to four, and we may need to extend it further in the future". So, a more universal approach would be the one you are considering switching to in your question, that is
dt Team Person
--- ---- ------
where the primary key would now be dt, Team.
That way you could easily extend/reduce the number of people on call on the database level without having to change anything in the schema.
UPDATE
I forgot to address your third option in my original answer (sorry). Here goes.
Your first option (the one actually implemented at the moment) seems to imply that every team can be represented by (no more than) one person. If you assign surrogate IDs to the Person/Team pairs and use those keys in Schedule instead of separate IDs for Person and Team, you will probably be unable to enforce the mentioned "one person per team in Schedule" requirement at the database level (or, at least, that might prove somewhat troublesome). With separate keys, it is enough to make Team part of a composite key (dt, Team) and you are done: no more than one person per team per day.
Also, you may have difficulties letting a person change the team over time if their presence in the team was fixated in this way, i.e. with a Schedule reference to the Person/Team pair. You would probably have to change the Team reference in the PersonTeam table, which would result in misrepresentation of historical info: when looking at the people on call back on certain day, the person's Team shown would be the one they belong to now, not the one they did then.
Using separate IDs for people and teams in Schedule, on the other hand, would allow you to let people change teams freely, provided you do not make (Schedule.Team, Schedule.Person) a reference to (PersonTeam.Team, PersonTeam.Person), of course.

Two or more similar counts on fact table in dimensional modelling

I have designed a fact table that stores the facts for a specific date dimension and an action type such as create, update or cancelled. The facts can be created and cancelled only once, but updated many times.
myfact
---------------
date_key
location_key
action_type_key
This will allow me to get a count for all the updates done, all the new ones created for a period and specify a specific region through the location dimension.
Now in addition I also have 2 counts for each fact, i.e. Number of People, Number of Buildings. There is no relation between these. And I would like to query how many of the facts have a specific count, such as how many have 10 buildings, how many have 9, etc.
What would be the best table design for these. Basically I see the following options, but am open to hear better solutions.
add the counts as reference info in the fact table as people_count and building_count
add a dimension for each of these that stores the valid options, i.e. people dimension that stores a key and a count and building dimension that stores a key and a count. The main fact will have a people_key and a building_key
add one dimension for the count that is used for both people and building counts, i.e. count dimension that stores a key and a generic count. The main fact will have a people_count_key and a building_count_key
First your counts are essentially "dimensions" in the purest sense (you can think of dimensions as a way to group records for reporting purposes). The question though is whether dimensional modeling is what you want to do. I think you are better off as seeing this as something of an implicit dimension than you are to add dimension tables. What this means essentially is that dimension tables add nothing and they create corner cases of errors I just don't think are very helpful unless you need to track a bunch of information related to numbers.
If it were me I would just add the counts to the fact table, not to other tables.
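With the counts stored directly on the fact table, the "how many have 10 buildings, how many have 9" question becomes a plain GROUP BY. A small sketch through Python's sqlite3, using the people_count and building_count column names from the first option in the question and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myfact (
    date_key        INTEGER NOT NULL,
    location_key    INTEGER NOT NULL,
    action_type_key INTEGER NOT NULL,
    people_count    INTEGER NOT NULL,
    building_count  INTEGER NOT NULL
);
-- Made-up sample rows.
INSERT INTO myfact VALUES
    (20240101, 1, 1, 4, 10),
    (20240101, 2, 1, 7, 10),
    (20240102, 1, 1, 2, 9);
""")

# The count column acts as an implicit (degenerate) dimension: group by it.
histogram = conn.execute("""
    SELECT building_count, COUNT(*) AS facts
    FROM myfact
    GROUP BY building_count
    ORDER BY building_count DESC
""").fetchall()
print(histogram)
```

The same query can still be joined to the date and location dimensions to restrict the period or region.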

Is there ever a time where using a database 1:1 relationship makes sense?

I was thinking the other day on normalization, and it occurred to me, I cannot think of a time where there should be a 1:1 relationship in a database.
Name:SSN? I'd have them in the same table.
PersonID:AddressID? Again, same table.
I can come up with a zillion examples of 1:many or many:many (with appropriate intermediate tables), but never a 1:1.
Am I missing something obvious?
A 1:1 relationship typically indicates that you have partitioned a larger entity for some reason. Often it is because of performance reasons in the physical schema, but it can happen in the logic side as well if a large chunk of the data is expected to be "unknown" at the same time (in which case you have a 1:0 or 1:1, but no more).
As an example of a logical partition: you have data about an employee, but there is a larger set of data that needs to be collected, if and only if they select to have health coverage. I would keep the demographic data regarding health coverage in a different table to both give easier security partitioning and to avoid hauling that data around in queries unrelated to insurance.
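A sketch of that logical partition through Python's sqlite3: the second table uses the employee key as both its primary key and a foreign key, which gives the 1:0 or 1:1 shape. The table and column names (employee_health_coverage, plan_name, dependents) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL
);
-- Primary key = foreign key: at most one coverage row per employee.
CREATE TABLE employee_health_coverage (
    employee_id INTEGER PRIMARY KEY REFERENCES employee (employee_id),
    plan_name   TEXT    NOT NULL,
    dependents  INTEGER NOT NULL
);
INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO employee_health_coverage VALUES (1, 'Basic', 2);
""")

# Queries unrelated to insurance never touch the second table;
# those that need it use an outer join, since the row is optional.
rows = conn.execute("""
    SELECT e.full_name, h.plan_name
    FROM employee AS e
    LEFT JOIN employee_health_coverage AS h ON h.employee_id = e.employee_id
    ORDER BY e.employee_id
""").fetchall()

try:
    # A second coverage row for the same employee violates the primary key.
    conn.execute("INSERT INTO employee_health_coverage VALUES (1, 'Premium', 0)")
    second_row_rejected = False
except sqlite3.IntegrityError:
    second_row_rejected = True

print(rows, second_row_rejected)
```

Access to the coverage data can then be granted or denied on that one table, separately from the rest of the employee data.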
An example of a physical partition would be the same data being hosted on multiple servers. I may keep the health coverage demographic data in another state (where the HR office is, for example) and the primary database may only link to it via a linked server... avoiding replicating sensitive data to other locations, yet making it available for (assuming here rare) queries that need it.
Physical partitioning can be useful whenever you have queries that need consistent subsets of a larger entity.
One reason is database efficiency. Having a 1:1 relationship allows you to split up the fields which will be affected during a row/table lock. If table A has a ton of updates and table b has a ton of reads (or has a ton of updates from another application), then table A's locking won't affect what's going on in table B.
Others bring up a good point. Security can also be a good reason depending on how applications etc. are hitting the system. I would tend to take a different approach, but it can be an easy way of restricting access to certain data. It's really easy to just deny access to a certain table in a pinch.
My blog entry about it.
Sparseness. The data relationship may be technically 1:1, but corresponding rows don't have to exist for every row. So if you have twenty million rows and there's some set of values that only exists for 0.5% of them, the space savings are vast if you push those columns out into a table that can be sparsely populated.
Most of the highly-ranked answers give very useful database tuning and optimization reasons for 1:1 relationships, but I want to focus on nothing but "in the wild" examples where 1:1 relationships naturally occur.
Please note one important characteristic of the database implementation of most of these examples: no historical information is retained about the 1:1 relationship. That is, these relationships are 1:1 at any given point in time. If the database designer wants to record changes in the relationship participants over time, then the relationships become 1:M or M:M; they lose their 1:1 nature. With that understood, here goes:
"Is-A" or supertype/subtype or inheritance/classification relationships: This category is when one entity is a specific type of another entity. For example, there could be an Employee entity with attributes that apply to all employees, and then different entities to indicate specific types of employee with attributes unique to that employee type, e.g. Doctor, Accountant, Pilot, etc. This design avoids multiple nulls since many employees would not have the specialized attributes of a specific subtype. Other examples in this category could be Product as supertype, and ManufacturingProduct and MaintenanceSupply as subtypes; Animal as supertype and Dog and Cat as subtypes; etc. Note that whenever you try to map an object-oriented inheritance hierarchy into a relational database (such as in an object-relational model), this is the kind of relationship that represents such scenarios.
"Boss" relationships, such as manager, chairperson, president, etc., where an organizational unit can have only one boss, and one person can be boss of only one organizational unit. If those rules apply, then you have a 1:1 relationship, such as one manager of a department, one CEO of a company, etc. "Boss" relationships don't only apply to people. The same kind of relationship occurs if there is only one store as the headquarters of a company, or if only one city is the capital of a country, for example.
Some kinds of scarce resource allocation, e.g. one employee can be assigned only one company car at a time (e.g. one truck per trucker, one taxi per cab driver, etc.). A colleague gave me this example recently.
Marriage (at least in legal jurisdictions where polygamy is illegal): one person can be married to only one other person at a time. I got this example from a textbook that used this as an example of a 1:1 unary relationship when a company records marriages between its employees.
Matching reservations: when a unique reservation is made and then fulfilled as two separate entities. For example, a car rental system might record a reservation in one entity, and then an actual rental in a separate entity. Although such a situation could alternatively be designed as one entity, it might make sense to separate the entities since not all reservations are fulfilled, and not all rentals require reservations, and both situations are very common.
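One way the reservation/rental case might be modelled, sketched with Python's built-in sqlite3 module. The schema is an assumption made for illustration (the names reservation, rental and customer are hypothetical): a nullable UNIQUE foreign key lets a rental exist without a reservation, lets a reservation go unfulfilled, and still prevents two rentals from claiming the same reservation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE reservation (
    reservation_id INTEGER PRIMARY KEY,
    customer       TEXT NOT NULL
);
CREATE TABLE rental (
    rental_id      INTEGER PRIMARY KEY,
    customer       TEXT NOT NULL,
    -- NULL means a walk-in rental; UNIQUE keeps the link at most 1:1.
    reservation_id INTEGER UNIQUE REFERENCES reservation(reservation_id)
);
""")

conn.execute("INSERT INTO reservation VALUES (1, 'ann')")
conn.execute("INSERT INTO reservation VALUES (2, 'eve')")  # never fulfilled
conn.execute("INSERT INTO rental VALUES (1, 'ann', 1)")    # fulfils reservation 1
conn.execute("INSERT INTO rental VALUES (2, 'bob', NULL)") # walk-in
conn.execute("INSERT INTO rental VALUES (3, 'cy', NULL)")  # another walk-in

try:
    # A second rental for reservation 1 would break the 1:1 link.
    conn.execute("INSERT INTO rental VALUES (4, 'dan', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rental_count = conn.execute("SELECT COUNT(*) FROM rental").fetchone()[0]
print(duplicate_rejected, rental_count)
```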
I repeat the caveat I made earlier that most of these are 1:1 relationships only if no historical information is recorded. So, if an employee changes their role in an organization, or a manager takes responsibility of a different department, or an employee is reassigned a vehicle, or someone is widowed and remarries, then the relationship participants can change. If the database does not store any previous history about these 1:1 relationships, then they remain legitimate 1:1 relationships. But if the database records historical information (such as adding start and end dates for each relationship), then they pretty much all turn into M:M relationships.
There are two notable exceptions to the historical note: First, some relationships change so rarely that historical information would normally not be stored. For example, most IS-A relationships (e.g. product type) are immutable; that is, they can never change. Thus, the historical record point is moot; these would always be implemented as natural 1:1 relationships. Second, the reservation-rental relationship stores dates separately, since the reservation and the rental are independent events, each with its own dates. Since the entities have their own dates, rather than the 1:1 relationship itself having a start date, these would remain 1:1 relationships even though historical information is stored.
Your question can be interpreted in several ways, because of the way you worded it. The responses show this.
There can definitely be 1:1 relationships between data items in the real world. No question about it. The "is a" relationship is generally one to one. A car is a vehicle.
One car is one vehicle. One vehicle might be one car. Some vehicles are trucks, in which case one vehicle is not a car. Several answers address this interpretation.
But I think what you really are asking is... when 1:1 relationships exist, should tables ever be split? In other words, should you ever have two tables that contain exactly the same keys? In practice, most of us analyze only primary keys, and not other candidate keys, but that question is slightly different.
Normalization rules for 1NF, 2NF, and 3NF never require decomposing (splitting) a table into two tables with the same primary key. I haven't worked out whether putting a schema in BCNF, 4NF, or 5NF can ever result in two tables with the same keys. Off the top of my head, I'm going to guess that the answer is no.
There is a level of normalization called 6NF. The normalization rule for 6NF can definitely result in two tables with the same primary key. 6NF has the advantage over 5NF that NULLS can be completely avoided. This is important to some, but not all, database designers. I've never bothered to put a schema into 6NF.
In 6NF, missing data can be represented by an omitted row, instead of a row with a NULL in some column.
There are reasons other than normalization for splitting tables. Sometimes split tables result in better performance. With some database engines, you can get the same performance benefits by partitioning the table instead of actually splitting it. This can have the advantage of keeping the logical design easy to understand, while giving the database engine the tools needed to speed things up.
I use them primarily for a few reasons. One is a significant difference in the rate of data change. Some of my tables may have audit trails where I track previous versions of records; if I only care to track previous versions of 5 out of 10 columns, splitting those 5 columns onto a separate table with an audit trail mechanism on it is more efficient. Also, I may have records (say for an accounting app) that are write-once. You cannot change the dollar amounts or the account they were for; if you made a mistake, you need to make a corresponding record to write off the incorrect record, then create a correction entry. I have constraints on the table enforcing the fact that they cannot be updated or deleted, but I may have a couple of attributes for that object that are malleable; those are kept in a separate table without the restriction on modification. Another time I do this is in medical record applications. There is data related to a visit that cannot be changed once it is signed off on, and other data related to a visit that can be changed after sign-off. In that case I will split the data and put a trigger on the locked table that rejects updates once the visit is signed off, while still allowing updates to the data the doctor is not signing off on.
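The medical-record split could look roughly like this. It is a sketch using Python's built-in sqlite3 module, not the poster's actual schema: the names visit_locked, visit_notes, diagnosis, billing_note and signed_off are hypothetical, and the trigger syntax is SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Data the doctor signs off on; frozen once signed_off = 1.
CREATE TABLE visit_locked (
    visit_id   INTEGER PRIMARY KEY,
    diagnosis  TEXT NOT NULL,
    signed_off INTEGER NOT NULL DEFAULT 0
);
-- 1:1 companion table that stays editable after sign-off.
CREATE TABLE visit_notes (
    visit_id     INTEGER PRIMARY KEY REFERENCES visit_locked(visit_id),
    billing_note TEXT
);
CREATE TRIGGER visit_locked_no_update
BEFORE UPDATE ON visit_locked
FOR EACH ROW WHEN OLD.signed_off = 1
BEGIN
    SELECT RAISE(ABORT, 'visit is signed off');
END;
""")

conn.execute("INSERT INTO visit_locked VALUES (1, 'sprain', 0)")
conn.execute("INSERT INTO visit_notes VALUES (1, 'pending')")

# Before sign-off, the locked table can still be edited and then signed.
conn.execute("UPDATE visit_locked SET diagnosis = 'fracture' WHERE visit_id = 1")
conn.execute("UPDATE visit_locked SET signed_off = 1 WHERE visit_id = 1")

try:
    conn.execute("UPDATE visit_locked SET diagnosis = 'bruise' WHERE visit_id = 1")
    locked_update_rejected = False
except sqlite3.DatabaseError:
    locked_update_rejected = True

# The companion table is not covered by the trigger.
conn.execute("UPDATE visit_notes SET billing_note = 'paid' WHERE visit_id = 1")

diagnosis = conn.execute(
    "SELECT diagnosis FROM visit_locked WHERE visit_id = 1").fetchone()[0]
note = conn.execute(
    "SELECT billing_note FROM visit_notes WHERE visit_id = 1").fetchone()[0]
print(locked_update_rejected, diagnosis, note)
```

Keeping the editable columns in their own table means the trigger can be a blanket rule on the locked table instead of a column-by-column check.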
Another poster commented on 1:1 not being normalized. I would disagree with that in some situations, especially subtyping. Say I have an employee table and the primary key is their SSN (it's an example; let's save the debate on whether this is a good key or not for another thread). The employees can be of different types, say temporary or permanent, and if they are permanent they have more fields to be filled out, like office phone number, which should only be not null if the type = 'Permanent'. In a 3rd normal form database the column should depend only on the key, meaning the employee, but it actually depends on employee and type, so a 1:1 relationship is perfectly normal, and desirable in this case. It also prevents overly sparse tables, if I have 10 columns that are normally filled, but 20 additional columns only for certain types.
The most common scenario I can think of is when you have BLOBs. Let's say you want to store large images in a database (typically, not the best way to store them, but sometimes the constraints make it more convenient). You would typically want the blob to be in a separate table to improve lookups of the non-blob data.
In terms of pure science, yes, they are useless.
In real databases it's sometimes useful to keep a rarely used field in a separate table: to speed up queries using this and only this field; to avoid locks, etc.
Rather than using views to restrict access to fields, it sometimes makes sense to keep restricted fields in a separate table to which only certain users have access.
I can also think of situations where you have an OO model in which you use inheritance, and the inheritance tree has to be persisted to the DB.
For instance, you have a class Bird and Fish which both inherit from Animal.
In your DB you could have an 'Animal' table, which contains the common fields of the Animal class, and the Animal table has a one-to-one relationship with the Bird table, and a one-to-one relationship with the Fish table.
In this case, you don't have to have one Animal table which contains a lot of nullable columns to hold the Bird and Fish-properties, where all columns that contain Fish-data are set to NULL when the record represents a bird.
Instead, you have a record in the Birds-table that has a one-to-one relationship with the record in the Animal table.
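The Animal/Bird/Fish mapping described above can be sketched as follows, using Python's built-in sqlite3 module. The columns (name, wingspan_cm, max_depth_m) are hypothetical stand-ins for whatever the classes actually hold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Common fields of the Animal class.
CREATE TABLE animal (
    animal_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- Each subclass table shares the animal's key, giving a 1:1 link.
CREATE TABLE bird (
    animal_id   INTEGER PRIMARY KEY REFERENCES animal(animal_id),
    wingspan_cm REAL NOT NULL
);
CREATE TABLE fish (
    animal_id   INTEGER PRIMARY KEY REFERENCES animal(animal_id),
    max_depth_m REAL NOT NULL
);
""")

conn.execute("INSERT INTO animal VALUES (1, 'Robin')")
conn.execute("INSERT INTO animal VALUES (2, 'Trout')")
conn.execute("INSERT INTO bird VALUES (1, 22.5)")
conn.execute("INSERT INTO fish VALUES (2, 30.0)")

# Loading a Bird object means joining the common row to its subtype row.
birds = conn.execute("""
    SELECT a.name, b.wingspan_cm
    FROM animal AS a
    JOIN bird AS b ON b.animal_id = a.animal_id
""").fetchall()
print(birds)
```

No table in this layout needs a nullable Fish column on a bird's row, which is the point made above.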
1-1 relationships are also necessary if you have too much information. There is a record size limitation on each record in the table. Sometimes tables are split in two (with the most commonly queried information in the main table) just so that the record size will not be too large. Databases are also more efficient in querying if the tables are narrow.
In SQL it is impossible to enforce a 1:1 relationship between two tables that is mandatory on both sides (unless the tables are read-only). For most practical purposes a "1:1" relationship in SQL really means 1:0|1.
The inability to support mandatory cardinality in referential constraints is one of SQL's serious limitations. "Deferrable" constraints don't really count because they are just a way of saying the constraint is not enforced some of the time.
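To illustrate the 1:0|1 point, here is a small sketch with Python's built-in sqlite3 module. The person/passport tables are hypothetical. The shared primary key stops a child row from being duplicated or orphaned, but nothing in the declarations can force every person to have a passport row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE passport (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    number    TEXT NOT NULL UNIQUE
);
""")


def rejected(sql):
    """Return True if the statement violates a declared constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


conn.execute("INSERT INTO person VALUES (1, 'Ann')")
conn.execute("INSERT INTO person VALUES (2, 'Bob')")  # no passport row: allowed
conn.execute("INSERT INTO passport VALUES (1, 'P-001')")

orphan_rejected = rejected("INSERT INTO passport VALUES (99, 'P-099')")
second_rejected = rejected("INSERT INTO passport VALUES (1, 'P-002')")

# The "0" side of 1:0|1: people with no matching passport row.
without_passport = conn.execute("""
    SELECT COUNT(*)
    FROM person AS p
    LEFT JOIN passport AS x ON x.person_id = p.person_id
    WHERE x.person_id IS NULL
""").fetchone()[0]
print(orphan_rejected, second_rejected, without_passport)
```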
It's also a way to extend a table which is already in production with less (perceived) risk than a "real" database change. Seeing a 1:1 relationship in a legacy system is often a good indicator that fields were added after the initial design.
Most of the time, designs are thought to be 1:1 until someone asks "well, why can't it be 1:many"? Divorcing the concepts from one another prematurely is done in anticipation of this common scenario. Person and Address don't bind so tightly. A lot of people have multiple addresses. And so on...
Usually two separate object spaces imply that one or both can be multiplied (x:many). If two objects were truly, truly 1:1, even philosophically, then it's more of an is-relationship. These two "objects" are actually parts of one whole object.
If you're using the data with one of the popular ORMs, you might want to break up a table into multiple tables to match your Object Hierarchy.
I have found that when I do a 1:1 relationship it's totally for a systemic reason, not a relational reason.
For instance, I've found that putting the reserved aspects of a user in one table and the user-editable fields in a different table makes it much easier to write the rules about permissions on those fields.
But you are correct: in theory, 1:1 relationships are completely contrived, and are almost a phenomenon. Logically, however, they make the programs and optimizations that abstract the database easier to write.
Extended information that is only needed in certain scenarios. Also in legacy applications and programming languages (such as RPG) where the programs are compiled over the tables (so if the table changes you have to recompile the programs). Tag-along files can also be useful in cases where you have to worry about table size.
Most frequently it is more of a physical than logical construction. It is commonly used to vertically partition a table to take advantage of splitting I/O across physical devices or other query optimizations associated with segregating less frequently accessed data or data that needs to be kept more secure than the rest of the attributes on the same object (SSN, Salary, etc).
The only logical consideration that prescribes a 1-1 relationship is when certain attributes only apply to some of the entities. However, in most cases there is a better/more normalized way to model the data through entity extraction.
The best reason I can see for a 1:1 relationship is a SuperType/SubType database design. I created a Real Estate MLS data structure based on this model. There were five different data feeds: Residential, Commercial, MultiFamily, Hotels and Land.
I created a SuperType called property that contained data that was common to each of the five separate data feeds. This allowed for very fast "simple" searches across all datatypes.
I created five separate SubTypes that stored the unique data elements for each of the five data feeds. Each SuperType record had a 1:1 relationship to the appropriate SubType record.
If a customer wanted a detailed search, they had to select a Super-Sub type, for example PropertyResidential.
In my opinion a 1:1 relationship maps a class Inheritance on a RDBMS.
There is a table A that contains the common attributes, i.e. the parent class status.
Each inherited class status is mapped on the RDBMS to a table B with a 1:1 relationship
to table A, containing the specialized attributes.
Table A also contains a "type" field that represents the "casting" functionality.
Bye
Mario
You can create a one to one relationship table if there is any significant performance benefit. You can put the rarely used fields into separate table.
1:1 relationships don't really make sense if you're into normalization as anything that would be 1:1 would be kept in the same table.
In the real world though, it's often different. You may want to break your data up to match your application's interface.
Possibly if you have some kind of typed objects in your database.
Say in a table T1 you have the columns C1, C2, C3, … in a one-to-one relation. That's fine; it's in normalized form. Now say that in a table T2 you have columns C1, C2, C3, … (the names may differ, but the types and the roles are the same), also in a one-to-one relation. That's fine for T2 for the same reasons as for T1.
In this case, however, I see a fit for a separate table T3 holding C1, C2, C3, …, with a one-to-one relation from T1 to T3 and from T2 to T3. I see an even better fit if there already exists another table with a one-to-many relation over the same C1, C2, C3, …, say from table A to multiple rows in table B. Then, instead of T3, you use B, and have a one-to-one relation from T1 to B, the same from T2 to B, and still the same one-to-many relation from A to B.
I believe normalization does not agree with this, and it may be an idea outside of it: identify object types and move objects of the same type to their own storage pool, using a one-to-one relation from some tables and a one-to-many relation from some other tables.
It is not strictly necessary. It is great for security purposes, but there are better ways to perform security checks. Imagine you create a key that can only open one door. If the key can open any other door, you should ring the alarm. In essence, you can have a "CitizenTable" and a "VotingTable". Citizen One votes for Candidate One, which is stored in the Voting Table. If Citizen One appears in the voting table again, then there should be an alarm. Be advised, this is a one-to-one relationship because we are not referring to the candidate field; we are referring to the voting table and the citizen table.
Example:
Citizen Table
id = 1, citizen_name = "EvryBod"
id = 2, citizen_name = "Lesly"
id = 3, citizen_name = "Wasserman"
Candidate Table
id = 1, citizen_id = 1, candidate_name = "Bern Nie"
id = 2, citizen_id = 2, candidate_name = "Bern Nie"
id = 3, citizen_id = 3, candidate_name = "Hill Arry"
Then, if we see the voting table as so:
Voting Table
id = 1, citizen_id = 1, candidate_name = "Bern Nie"
id = 2, citizen_id = 2, candidate_name = "Bern Nie"
id = 3, citizen_id = 3, candidate_name = "Hill Arry"
id = 4, citizen_id = 3, candidate_name = "Hill Arry"
id = 5, citizen_id = 3, candidate_name = "Hill Arry"
We could say that citizen number 3 is a liar pants on fire who cheated Bern Nie. Just an example.
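One way to make the database itself ring that alarm is a UNIQUE constraint on the voting table's citizen_id. This is a sketch with Python's built-in sqlite3 module; the constraint and the lower-case table names are assumptions added for illustration, while the rows are the ones listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE citizen (
    id           INTEGER PRIMARY KEY,
    citizen_name TEXT NOT NULL
);
CREATE TABLE voting (
    id             INTEGER PRIMARY KEY,
    -- UNIQUE makes citizen -> vote at most 1:1.
    citizen_id     INTEGER NOT NULL UNIQUE REFERENCES citizen(id),
    candidate_name TEXT NOT NULL
);
""")
conn.executemany(
    "INSERT INTO citizen VALUES (?, ?)",
    [(1, "EvryBod"), (2, "Lesly"), (3, "Wasserman")],
)

votes = [
    (1, 1, "Bern Nie"),
    (2, 2, "Bern Nie"),
    (3, 3, "Hill Arry"),
    (4, 3, "Hill Arry"),  # repeat vote by citizen 3
    (5, 3, "Hill Arry"),  # and another
]

alarms = []  # citizen ids that tried to vote more than once
for vote in votes:
    try:
        conn.execute("INSERT INTO voting VALUES (?, ?, ?)", vote)
    except sqlite3.IntegrityError:
        alarms.append(vote[1])

stored = conn.execute("SELECT COUNT(*) FROM voting").fetchone()[0]
print(alarms, stored)
```

With the constraint in place the repeat votes never reach the table, so there is nothing to detect after the fact.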
When you are dealing with a database from a third-party product, you probably don't want to alter their database, so as to prevent tight coupling, but you may have data that corresponds 1:1 with their data.
Anywhere where two entirely independent entities share a one-to-one relationship. There must be lots of examples:
person <-> dentist (it's 1:N, so it's wrong!)
person <-> doctor (it's 1:N, so it's also wrong!)
person <-> spouse (it's 1:0|1, so it's mostly wrong!)
EDIT: Yes, those were pretty bad examples, particularly if I was always looking for a 1:1, not a 0 or 1 on either side. I guess my brain was mis-firing :-)
So, I'll try again. It turns out, after a bit of thought, that the only way you can have two separate entities that must (as far as the software goes) be together all of the time is for them to exist together in a higher categorization. Then, if and only if you fall into a lower decomposition, the things are and should be separate, but at the higher level they can't live without each other. Context, then, is the key.
For a medical database you may want to store different information about specific regions of the body, keeping them as a separate entity. In that case, a patient has just one head, and they need to have it, or they are not a patient. (They also have one heart, and a number of other necessary single organs). If you're interested in tracking surgeries for example, then each region should be a unique separate entity.
In a production/inventory system, if you're tracking the assembly of vehicles, then you certainly want to watch the engine progress differently from the car body, yet there is a one-to-one relationship. A car must have an engine, and only one (or it wouldn't be a 'car' anymore). An engine belongs to only one car.
In each case you could produce the separate entities as one big record, but given the level of decomposition, that would be wrong. They are, in these specific contexts, truly independent entities, although they might not appear so at a higher level.
Paul.