get confused in primary key concept and nomalisation - sql

Let's take an example of a login table with the columns: id (primary key),username,password,status.
id is primary key but still we authenticate user by searching table through username+password. Doesn't it violate normalisation rule?
Another example: suppose we have two tables, employer and job
employer table's id is used injob table as foreign key but job table itself has its own id
job table
---------
id (primary key) || employer_id (foreign key) || etc etc
Now when we will search job posted by a employer we use employer_id but this table has its primary key?

Primary key on any table is an unique identifier and indexed by default. It does not mean that records will always be searched through that field itself. Now, when you are to link a parent record to a child record, you can use the primary key to establish the relationships.
While trying to get the records, you will make use of primary key as a link between different tables. For example, you have employers and employees. A search might look like: "Get me all the employees for this employer". Now, employer is the main entity here and we are looking for linked employee records for it. Here, employer id will help us fetch related records. A query for the same might appear something like this:
SELECT [COLUMN NAMES HERE]
FROM EMPLOYER INNER JOIN EMPLOYEE ON
EMPLOYER.ID = EMPLOYEE.EMPLOYERID

I suggest you to read some tutorial, but, shortly ... a primary key is an id that is unique and not null and identifies an entry in your table. A foreign key is a reference to an id in another table. As you said, employee and job tables. An id is in most cases generated by a sequence, you don't know its value before inserting the record.
You usually perform seraches through username, name, ... datas that are user-known. In your case, when you search for a job, you will probably perform a Join. A join is a relation (and there are more types) between tables. In your case you will do
select *
from employer emp inner join job jjj on emp.id = jjj.employer _id
The ids are usefull, code-wise, when you have to update/delete a record. In this case usually you know everything about your record, id included, then you will use the id (also because ids usually have indexes, so the queries are faster). You can usually create indexes in the columns you use in filters, in order to reduce the execution time of your queries.

We need to distinguish between business (or candidate) keys and technical (or surrogate) keys. The business key is the data item(s) which uniquely identify this row in the real world. The technical key is a convenience for data management, generated by some computer process such as a sequence or sys_guid().
Yes, the use of technical keys means storing redundant information but this is a case where practicality trumps theory. Primary keys should not change but in real life they can (e.g. people change their name due to various life events). Technical keys have no meaning so do not change. Business keys are often compound keys, which is often inconvenient for enforcing foreign keys (and occasionally highly undesirable, for instance when the business key is a sensitive data item such as Social Security Number).
So, it is common for tables to have an ID column as the primary key, which is used for foreign key relationships, and a unique constraint to enforce the business key.
In your first example username is a business key and id is a technical key. This is one reason why we should have two data models. The Logical Data Model has an Entity called user with a candidate key of username. The physical data model has a Table called user with a primary key of id and a unique key of username.
For your second example, it seems you're modelling a Situations Vacant job board. The relationship between Employer and Job is one to many, that is an employer can advertise many jobs. So the Job table has its own primary key, id, plus a foreign key employer_id which references the primary key of the Employer table. This means we can find all the jobs for a specific employer. But because the job table has its own primary key we can identify each job so that we can distinguish the Janitor job at Harrisons Pharmaceuticals from the Janitor job at Ravi's Cash'n'Carry. (Which incidentally shows why we need technical keys: imagine if the employer_id foreign key was a string of varchar2(128) for each row on the Job table. )
In a Logical Data Model the employer_id would be implied on the Entity job by a link to the Entity employee but would be shown (actually it might, it depends on the tool). In the physical model the column must be added to the dependent table, because database constraints (and joins!) physically need the column in order to work.
So we can see that Normalisation applies to the representation and storage of business data but the practicalities of database engines mean that we need additional columns to manage the data.

Related

Database design when having either foreign key or string

In my application I'm creating a case having a supervisor, but as the supervisor can be either an employee or an external supervisor, I'd like to be able to save either an employee id for the internal reference or a string for the name of the external supervisors name.
How should I implement this? Is having a table "case" and the sub-tables "case_internal_sv" and "case_external_sv" the way to go?
There's not a lot of information in your question on which to base an answer.
If there is common data and functionality across all types of supervisors then you will probably want one table to hold that common data. That table would establish the primary key values for supervisors and the case table would have a foreign key into this table. Information that is unique to either internal or external supervisors would go into separate tables and those tables would also have a foreign key back to the common supervisor level data.
This design is superior because you only have one place to go in order to find a list of all supervisors and because you can enforce the supervisor / case relationship directly in the database without a lot of code or additional constraints to ensure that "one and only one" of two columns is populated.
It's sufficiently superior, from a database point of view, that I'd consider using this design even if the data for internal and external supervisors is completely disjoint (which it's unlikely to be).
If your database allows you to define multiple primary keys you could have field employee and field employee_type combine to form the unique primary key. If not you could have an autogenerated primary key for the table and have a field for employee type and a field for employee_id.
What database are you using?
Having tables too normalized can do more harm than help. You should evaluate the benefits/drawbacks of having sub-tables based on your requirements.
A simple solution can be to have a 'sv_type' column in 'case' table. And have two columns
'internal_sv_id', a nullable foreign key to the employee table
'external_sv_name', a nullable string to save external name.
Then check for supervisor in one of these two columns based on the 'sv_type'
This design might not fully conform to 3rd normal form, but it can save lot of costly joins and allow integrity with employee table.
As for me, I'd go for the simplest solution in case of doubt.

Foreign keys vs secondary keys

I used to think that foreign key and secondary key are the same thing.
After Googling the result are even more confusing, some consider them to be the same, others said that a secondary key is an index that doesn't have to be unique, and allows faster access to data than with the primary key.
Can someone explain the difference?
Or is it indeed a case of mixed terminology?
Does it maybe differ per database type?
The definition in wiki/Foreign_key states that:
In the context of relational databases, a foreign key is a field (or
collection of fields) in one table that uniquely identifies a row of
another table. In other words, a foreign key is a column or a
combination of columns that is used to establish and enforce a link
between two tables.
The table containing the foreign key is called the referencing or
child table, and the table containing the candidate key is called the
referenced or parent table.
Take the example of the case:
A customer may place 0,1 or more orders.
From the point of the business, each customer is identified by a unique id (Primary Key) and instead of repeating the customer information with each order, we place a reference, or a pointer to that unique customer id (Customer's Primary Key) in the order table. By looking at any order, we can tell who placed it using the unique customer id.
The relationship established between the parent (Customer table) and the child table (Order table) is established when you set the value of the FK in the Order table after the Customer row has been inserted. Also, deleting a child row may affect the parent depending on your Referential Integrity stings (Cascading Rules) established when the FK was created. FKs help establish integrity in a relational database system.
As for the "Secondary Key", the term refers to a structure of 1 or more columns that together help retrieve 1 or more rows of the same table. The word 'key' is somewhat misleading to some. The Secondary Key does not have to be unique (unlike the PK). It is not the Primary Key of the table. It is used to locate rows in the same table it is defined within (unlike the FK). Its enforcement is only through an index (either unique or not) and it is implementation is optional. A table could have 0,1 or more Secondary Key(s). For example, in an Employee table, you may use an auto generated column as a primary key. Alternatively, you may decide to use the Employee Number or SSN to retrieve employee(s) information.
Sometimes people mix the term "Secondary Key" with the term "Candidate Key" or "Alternate Key" (usually appears in Normalization context) but they are all different.
A foreign key is a key that references an index on some other table. For example, if you have a table of customers, one of the columns on that table may be a country column which would just contain an ID number, which would match the ID of that country in a separate Country table. That country column in the customer table would be a foreign key.
A secondary key on the other hand is just a different column in the table that you have used to create an index (which is used to speed up queries). Foreign keys have nothing to do with improving query speeds.
"Secondary key" is not a term I'm familiar with. It doesn't appear in the index of Database Design for Mere Mortals and I don't remember it in Pro SQL Server 2012 Relational Database Design and Implementation (my two "goto" books for database design). It also doesn't appear in the index for SQL for Smarties. It sounds like its not an actual term at all.
I've always used the term "candidate key".
A candidate key is a way to uniquely identify an entity. You identify all the candidate keys during the design phase of a database system. During the implementation phase, you will decide on a primary key: either one of the candidate keys or an artificial key. The primary key will probably be implemented with a primary key constraint; the candidate keys will probably be implemented with unique constraints.
A foreign key is an instance of one entity's candidate key in another entity, representing a relationship between the two entities. It will probably be implemented with a foreign key constraints.

Converting an ER diagram to relational model

I know how to convert an entity set, relationship, etc. into the relational model but what i wonder is that what should we do when an entire diagram is given? How do we convert it? Do we create a separate table for each relationship, and for each entity set? For example, if we are given the following ER diagram:
My solution to this is like the following:
//this part includes the purchaser relationship and policies entity set
CREATE TABLE Policies (
policyid INTEGER,
cost REAL,
ssn CHAR(11) NOT NULL,
PRIMARY KEY (policyid).
FOREIGN KEY (ssn) REFERENCES Employees,
ON DELETE CASCADE)
//this part includes the dependents weak entity set and beneficiary relationship
CREATE TABLE Dependents (
pname CHAR(20),
age INTEGER,
policyid INTEGER,
PRIMARY KEY (pname, policyid).
FOREIGN KEY (policyid) REFERENCES Policies,
ON DELETE CASCADE)
//This part includes Employees entity set
CREATE TABLE Employees(
ssn Char(11),
name char (20),
lot INTEGER,
PRIMARY KEY (ssn) )
My questions are:
1)Is my conversion true?
2)What are the steps for converting a complete diagram into relational model.
Here are the steps that i follow, is it true?
-I first look whether there are any weak entities or key constraints. If there
are one of them, then i create a single table for this entity set and the related
relationship. (Dependents with beneficiary, and policies with purchaser in my case)
-I create a separate table for the entity sets, which do not have any participation
or key constraints. (Employees in my case)
-If there are relationships with no constraints, I create separate table for them.
-So, in conclusion, every relationship and entity set in the diagram are included
in a table.
If my steps are not true or there is something i am missing, please can you write the steps for conversion? Also, what do we do if there is only participation constraint for a relationship, but no key constraint? Do we again create a single table for the related entity set and relationship?
I appreciate any help, i am new to databases and trying to learn this conversion.
Thank you
Hi #bigO I think it is safe to say that your conversion is true and the steps that you have followed are correct. However from an implementation point of view, there may be room for improvement. What you have implemented is more of a logical model than a physical model
It is common practice to add a Surrogate Instance Identifier to a physical table, this is a general requirement for most persistence engines, and as pointed out by #Pieter Geerkens, aids database efficiency. The value of the instance id for example EmployeeId (INT) would be automatically generated by the database on insert. This would also help with the issue that #Pieter Geerkens has pointed out with the SSN. Add the Id as the first column of all your tables, I follow a convention of tablenameId. Make your current primary keys into secondary keys ( the natural key).
Adding the Ids then makes it necessary to implement a DependentPolicy intersection table
DependentPolicyId, (PK)
PolicyId,
DependentId
You may then need to consider as to what is natural key of the Dependent table.
I notice that you have age as an attribute, you should consider whether this the age at the time the policy is created or the actual age of the dependent, I which case you should be using date of birth.
Other ornamentations you could consider are creation and modified dates.
I also generally favor using the singular for a table ie Employee not Employees.
Welcome to the world of data modeling and design.

Do link tables need a meaningless primary key field?

I am working on a couple of link tables and I got to thinking (Danger Will Robinson, Danger) what are the possible structures of a link table and what are their pro's and con's.
I came up with a few possible strictures for the link table:
Traditional 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key
table2fk - foreign key
It's a classic, in most of the books, 'nuff said.
Indexed 3 column model
id - auto-numbered PRIMARY
table1fk - foreign key INDEX ('table1fk')
table2fk - foreign key INDEX ('table2fk')
In my own experience, the fields that you are querying against are not indexed in the traditional model. I have found that indexing the foreign key fields does improve performance as would be expected. Not a major change but a nice optimizing tweak.
Composite key 2 columns ADD PRIMARY KEY ('table1fk' , 'table2fk')
table1fk - foreign key
table2fk - foreign key
With this I use a composite key so that a record from table1 can only be linked to a record on table2 once. Because the key is composite I can add records (1,1), (1,2), (2,2) without any duplication errors.
Any potential problems with the composite key 2 columns option? Is there an indexing issue that this might cause? A performance hit? Anything that would disqualify this as a possible option?
I would use composite key, and no extra meaningless key.
I would not use a ORM system that enforces such rules on my db structure.
For true link tables, they typically do not exist as object entities in my object models. Thus the surrogate key is not ever used. The removable of an item from a collection results in a removal of an item from a link relationship where both foreign keys are known (Person.Siblings.Remove(Sibling) or Person.RemoveSibling(Sibling) which is appropriately translated at the data access layer as usp_Person_RemoveSibling(PersonID, SiblingID)).
As Mike mentioned, if it does become an actual entity in your object model, then it may merit an ID. However, even with addition of temporal factors like effective start and end dates of the relationship and things like that, it's not always clear. For instance, the collection may have an effective date associated at the aggregate level, so the relationship itself may still not become an entity with any exposed properties.
I'd like to add that you might very well need the table indexed both ways on the two foreign key columns.
If this is a true many-to-many join table, then dump unecessary id column (unless your ORM requires one. in that case you've got to decide whether your intellect is going to trump your practicality).
But I find that true join tables are pretty rare. It usually isn't long before I start wanting to put some other data in that table. Because of that I almost always model these join tables as entities from the beginning and stick an id in there.
Having a single column pk can help out alot in disaster recovery situation. So though while correct in theory that you only need the 2 foreign keys. In practice when the shit hits the fan you may want the single column key. I have never been in a situation where i was screwed because I had a single column identifier but I have been in ones where I was screwed because I didn't.
Composite PK and turn off clustering.
I have used composite key to prevent duplicate entry and let the database handle the exception. With a single key, you are rely on the front-end application to check the database for duplicate before adding a new record.
There is something called identifying and non-identifying relationship. With identifying relationships the FK is a part of the PK in the many-to-many table. For example, say we have tables Person, Company and a many-to-many table Employment. In an identifying relationship both fk PersonID and CompanyID are part of the pk, so we can not repeat PersonID, CompanyID combination.
TABLE Employment(PersonID int (PK,FK), CompanyID int (PK,FK))
Now, suppose we want to capture history of employment, so a person can leave a company, work somewhere else and return to the same company later. The relationship is non-identifying here, combination of PersonID, CompanyID can now repeat, so the table would look something like:
TABLE Employment(EmploymentID int (PK), PersonID int (FK), CompanyID int (FK),
FromDate datetime, ToDate datetime)
If you are using an ORM to get to/alter the data, some of them require a single-column primary key (Thank you Tom H for pointing this out) in order to function correctly (I believe Subsonic 2.x was this way, not sure about 3.x).
In my mind, having the primary key doesn't impact performance to any measurable degree, so I usually use it.
If you need to traverse the join table 'in both directions', that is starting with a table1fk or a table2fk key only, you might consider adding a second, reversed, composite index.
ADD KEY ('table2fk', 'table1fk')
The correct answer is:
Primary key is ('table1fk' , 'table2fk')
Another index on ('table2fk' , 'table1fk')
Because:
You don't need an index on table1fk or table2fk alone: the optimiser will use the PK
You'll most likely use the table "both" ways
Adding a surrogate key is only needed because of braindead ORMs
i've used both, the only benefit of using the first model (with uid) is that you can transport the identifier around as a number, whereas in some cases you would have to do some string concatenation with the composite key to transport it around.
i agree that not indexing the foreign keys is a bad idea whichever way you go.
I (almost) always use the additional single-column primary key. This generally makes it easier to build user interfaces, because when a user selects that particular linking entity I can identify with a single integer value rather than having to create and then parse compound identifiers.

SQL: Do you need an auto-incremental primary key for Many-Many tables?

Say you have a Many-Many table between Artists and Fans. When it comes to designing the table, do you design the table like such:
ArtistFans
ArtistFanID (PK)
ArtistID (FK)
UserID (FK)
(ArtistID and UserID will then be contrained with a Unique Constraint
to prevent duplicate data)
Or do you build use a compound PK for the two relevant fields:
ArtistFans
ArtistID (PK)
UserID (PK)
(The need for the separate unique constraint is removed because of the
compound PK)
Are there are any advantages (maybe indexing?) for using the former schema?
ArtistFans
ArtistID (PK)
UserID (PK)
The use of an auto incremental PK has no advantages here, even if the parent tables have them.
I'd also create a "reverse PK" index automatically on (UserID, ArtistID) too: you will need it because you'll query the table by both columns.
Autonumber/ID columns have their place. You'd choose them to improve certain things after the normalisation process based on the physical platform. But not for link tables: if your braindead ORM insists, then change ORMs...
Edit, Oct 2012
It's important to note that you'd still need unique (UserID, ArtistID) and (ArtistID, UserID) indexes. Adding an auto increments just uses more space (in memory, not just on disk) that shouldn't be used
Assuming that you're already a devotee of the surrogate key (you're in good company), there's a case to be made for going all the way.
A key point that is sometimes forgotten is that relationships themselves can have properties. Often it's not enough to state that two things are related; you might have to describe the nature of that relationship. In other words, there's nothing special about a relationship table that says it can only have two columns.
If there's nothing special about these tables, why not treat it like every other table and use a surrogate key? If you do end up having to add properties to the table, you'll thank your lucky presentation layers that you don't have to pass around a compound key just to modify those properties.
I wouldn't even call this a rule of thumb, more of a something-to-consider. In my experience, some slim majority of relationships end up carrying around additional data, essentially becoming entities in themselves, worthy of a surrogate key.
The rub is that adding these keys after the fact can be a pain. Whether the cost of the additional column and index is worth the value of preempting this headache, that really depends on the project.
As for me, once bitten, twice shy – I go for the surrogate key out of the gate.
Even if you create an identity column, it doesn't have to be the primary key.
ArtistFans
ArtistFanId
ArtistId (PK)
UserId (PK)
Identity columns can be useful to relate this relation to other relations. For example, if there was a creator table which specified the person who created the artist-user relation, it could have a foreign key on ArtistFanId, instead of the composite ArtistId+UserId primary key.
Also, identity columns are required (or greatly improve the operation of) certain ORM packages.
I cannot think of any reason to use the first form you list. The compound primary key is fine, and having a separate, artificial primary key (along with the unique contraint you need on the foreign keys) will just take more time to compute and space to store.
The standard way is to use the composite primary key. Adding in a separate autoincrement key is just creating a substitute that is already there using what you have. Proper database normalization patterns would look down on using the autoincrement.
Funny how all answers favor variant 2, so I have to dissent and argue for variant 1 ;)
To answer the question in the title: no, you don't need it. But...
Having an auto-incremental or identity column in every table simplifies your data model so that you know that each of your tables always has a single PK column.
As a consequence, every relation (foreign key) from one table to another always consists of a single column for each table.
Further, if you happen to write some application framework for forms, lists, reports, logging etc you only have to deal with tables with a single PK column, which simplifies the complexity of your framework.
Also, an additional id PK column does not cost you very much in terms of disk space (except for billion-record-plus tables).
Of course, I need to mention one downside: in a grandparent-parent-child relation, child will lose its grandparent information and require a JOIN to retrieve it.
In my opinion, in pure SQL id column is not necessary and should not be used. But for ORM frameworks such as Hibernate, managing many-to-many relations is not simple with compound keys etc., especially if join table have extra columns.
So if I am going to use a ORM framework on the db, I prefer putting an auto-increment id column to that table and a unique constraint to the referencing columns together. And of course, not-null constraint if it is required.
Then I treat the table just like any other table in my project.