I am in a database course this semester and I'm trying to figure out how to interpret this diagram.
I know the key symbols represent either a primary or foreign key, but I can't tell which ones are which. I think the tables that have the two perpendicular lines have at least one foreign key from the table where the line came from, but I am not 100% sure. That's about all I (think I) understand.
What I really need is someone to either tell me the name of this type of diagram and/or how to interpret it so that I can write the SQL script to represent it.
A key symbol means primary key (PK).
A foreign key (FK) doesn't have any symbol, but you can infer it. For example, student.dept_name is an FK referencing department.dept_name.
The line going from department to student means that one department has 0 to N students.
There are two symbols at the start of the lines: one with a circle and another with double lines. My guess is that one means 0..N and the other 1..N, but without knowing how the diagram was made I can't be sure.
This kind of diagram is called an ER (Entity Relationship) diagram.
Each box is a table (entity) that you have to create in your script; then create the PKs, and then define the FKs.
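To make the "create the table, then the PK, then the FK" order concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than whatever DBMS your course uses (an assumption, chosen only so the example runs anywhere). Only department.dept_name and student.dept_name come from the diagram; the other column names and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE department (
    dept_name TEXT PRIMARY KEY            -- the key symbol in the diagram
);
CREATE TABLE student (
    ID        TEXT PRIMARY KEY,           -- assumed column name
    name      TEXT NOT NULL,              -- assumed column
    dept_name TEXT REFERENCES department (dept_name)  -- FK back to department
);
""")

conn.execute("INSERT INTO department VALUES ('Physics')")
conn.execute("INSERT INTO student VALUES ('S1', 'Ann', 'Physics')")

# A student pointing at a department that does not exist is rejected.
try:
    conn.execute("INSERT INTO student VALUES ('S2', 'Bob', 'Alchemy')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

student_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
```

The parent table has to be created first, because the FK in student can only reference a table that already exists in most database systems.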
This is what I'm seeing in the diagram, although it may not be perfect because not all the keys have direct lines to their primary sources. In standard crow's foot notation, >O or O< (a crow's foot with a circle) means zero or many: a department can have many students. O| or |O (a circle with a bar) means zero or one: each student can be registered with only one department. || (two bars) means one and only one: each section belongs to exactly one course, while a course can have one or many sections (typically determined by how many students wish to attend that course).
For the issue of the keys, there is no apparent distinction between primary and foreign keys. I would assume that in each case, if a key is simply called ID or includes part of the table name (such as department: dept_name), then it is the primary key and all others are foreign keys. Again, somewhat difficult to tell since not all relationships are mapped in this particular diagram (such as teaches/takes and the key set course_id, semester, & year), but in these cases we assume that it's a composite key (values in multiple fields make up a unique record) rather than a single primary key (although there appears to be a single primary key in the section table). In such cases, simply saying section 01 or 01O doesn't mean anything and will likely return as many rows as there are class titles with a section number equivalent to those. You would have to specify course_id = 'CIT261', sec_id = '01O', semester = 'fall', year = 2015 for the first section of an online CIT261 course during the current semester, which should return a single row.
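Here is a small sketch of that composite-key lookup, assuming the section table's key is (course_id, sec_id, semester, year) as guessed above. It is written against SQLite through Python's sqlite3 module so it can be run as-is; the extra rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE section (
    course_id TEXT    NOT NULL,
    sec_id    TEXT    NOT NULL,
    semester  TEXT    NOT NULL,
    year      INTEGER NOT NULL,
    PRIMARY KEY (course_id, sec_id, semester, year)  -- assumed composite key
);
INSERT INTO section VALUES
    ('CIT261', '01O', 'fall',   2015),
    ('CIT261', '01O', 'spring', 2015),
    ('CIT111', '01O', 'fall',   2015);
""")

# Filtering on the section number alone matches several rows.
by_section_only = conn.execute(
    "SELECT * FROM section WHERE sec_id = '01O'"
).fetchall()

# Supplying every column of the composite key pins down exactly one row.
by_full_key = conn.execute(
    "SELECT * FROM section "
    "WHERE course_id = ? AND sec_id = ? AND semester = ? AND year = ?",
    ("CIT261", "01O", "fall", 2015),
).fetchall()
```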
Another interesting note, it would appear that the advisor table satisfies a many to many relationship, and does not contain a primary key, but another composite key, but it doesn't seem to be a solid model as academic advisors are generally tied to the student via the department. This may be meant to reflect the instructors' TA.
I hope this points you in the right direction.
-C§
Recently I inherited a huge app from somebody who left the company.
This app uses a SQL Server database.
The developer always defined an int-based primary key on tables. For example, even if the Users table has a unique UserName field, he always added an integer identity primary key.
This is done for every table, no matter whether other fields are unique and could serve as the primary key.
Do you see any benefits whatsoever in this? That is, using UserName as the primary key vs. adding UserID (an identity column) and setting that as the primary key?
I feel like I have to add another element to my comments, which had started to turn into an essay, so I think it is better that I post it all as an answer instead.
Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
create table Countries
(
countryCode char(2) not null primary key clustered,
countryName varchar(64) not null
);
insert Countries values
('AU', 'Australia'),
('FR', 'France');
create table TourLocations
(
tourLocationName varchar(64) not null,
tourLocationId int identity(1,1) unique clustered,
countryCode char(2) not null foreign key references Countries(countryCode),
primary key (countryCode, tourLocationName)
);
insert TourLocations (tourLocationName, countryCode) values
('Bondi Beach', 'AU'),
('Eiffel Tower', 'FR');
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
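The duplicate-fact problem is easy to reproduce. The sketch below is an approximation in SQLite via Python's sqlite3 module, not the SQL Server schema above: SQLite has no clustered-index option, so here the surrogate is the primary key and the natural key is declared as a UNIQUE constraint. The effect on duplicates is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TourLocations (
    tourLocationId   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate, no meaning
    countryCode      TEXT NOT NULL,
    tourLocationName TEXT NOT NULL,
    UNIQUE (countryCode, tourLocationName)               -- the natural key
);
""")

insert = "INSERT INTO TourLocations (countryCode, tourLocationName) VALUES (?, ?)"
conn.execute(insert, ("AU", "Bondi Beach"))

# The surrogate would happily hand out a new id; the natural key is what
# stops the same fact from being stored twice.
try:
    conn.execute(insert, ("AU", "Bondi Beach"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM TourLocations").fetchone()[0]
```

Drop the UNIQUE line and the second insert succeeds, leaving two rows that differ only in an id no user ever sees.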
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
select countryCode, tourLocationName
from TourLocations
order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes, having a dedicated int is a good thing for PK use.
You may have multiple alternate keys; that's OK too.
Two great reasons for it:
It is performant.
It protects against key mutation (editing a name, etc.).
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key.
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
It is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
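A minimal sketch of the "keys should never change" point, run against SQLite through Python's sqlite3 module instead of SQL Server (an assumption made so the example is self-contained; INTEGER PRIMARY KEY AUTOINCREMENT stands in for an int identity column). The Orders table and the sample names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Users (
    UserID   INTEGER PRIMARY KEY AUTOINCREMENT,  -- meaningless surrogate
    UserName TEXT NOT NULL UNIQUE                -- still unique, but editable
);
CREATE TABLE Orders (                            -- hypothetical child table
    OrderID INTEGER PRIMARY KEY AUTOINCREMENT,
    UserID  INTEGER NOT NULL REFERENCES Users (UserID)
);
""")

user_id = conn.execute(
    "INSERT INTO Users (UserName) VALUES ('jsmith')"
).lastrowid
conn.execute("INSERT INTO Orders (UserID) VALUES (?)", (user_id,))

# Renaming the user touches one row in one table; no child row changes.
conn.execute(
    "UPDATE Users SET UserName = 'john.smith' WHERE UserID = ?", (user_id,)
)

rows = conn.execute("""
    SELECT u.UserName, o.OrderID
    FROM Orders o
    JOIN Users u ON u.UserID = o.UserID
""").fetchall()
```

Had Orders referenced UserName instead, the rename would have needed a cascading update to every referencing row.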
This subject is much like Gulliver's Travels, with wars being fought over which end of the egg you are supposed to crack open to eat.
However, using the SAME "id" name for all tables, with an autonumber? Yes, it is a LONG-established choice.
There are of course MANY different views on this subject, and many advantages and disadvantages.
Regardless of which choice one prefers (or even needs), this is a long-established concept in our industry. In fact, SharePoint tables use "ID" and autonumber by default. So does MS Access, and there are probably more that do this.
The simple concept?
You can build your tables with the PK, and child tables with foreign keys.
At that point you set up your relationships between the tables.
Now, you might decide to add, say, some invoice number or whatever. Rules might mean that such an invoice number is not duplicated.
But WHY do we care if you have some "user" name, or some "invoice" number, or whatever? Why should that fact affect your relational database model?
You mean I don't have a user name, or don't have an invoice number, and the whole database and its relationships don't work anymore? We don't care!!!!
The concept of data, even required fields, or even a column having to be unique?
That has ZERO to do with a working relational data model.
And maybe you decide that the invoice number is not generated until, say, it is sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care. You can have all kinds of business rules for those numbers, but they have ZERO to do with the fact that you designed a working relational data model based on so-called "surrogate" (sometimes called synthetic) keys.
So, once you build that data model, even with JUST the PK "id" and FKs (foreign keys), you are NOW free to start adding columns and defining what type of data you are going to put in each table. But what you shove into each table has ZERO to do with that working relational data model. They are to be thought of as separate concepts.
So, if you have a user name, add that column to the table. If you don't want the user name, remove the column. The data you store in the table has ZERO to do with the automatic PK ID you are using; it is not really any different from, say, what area of memory the computer is going to allocate to load that data. The basic data operations of the system have nothing to do with having built a database with relationships that simply exist. And the data columns you add after having built those relationships are up to you, but they will not, and should not, affect the operation of the database and the relationships you built and set up. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships as opposed to the data columns you add to such tables to store user data.
I mean, in JSON data, or XML? We often have a master + child table relationship. We don't care how that relationship is maintained, only that it exists.
Thus yes, all tables have that PK "ID". Even better? In code, you NEVER have to guess what the PK id is: it is always the same!!!
So, the data and columns you put and toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK, it could just as well be a web service call to some monkeys living in a faraway jungle eating bananas, who give you a PK value based on how many bananas they have eaten. We just really don't care about that number. It is just an internal housekeeping number, one that we don't see or even care about in most code. And thus the number one rule for such automatic PK values?
You NEVER give that auto PK number any meaning from a user and application point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems, it is not only the default, but is in fact required for such systems to operate.
It's better to use userid. The user table is referenced by many other tables.
Each referencing table would contain the primary key of the user table as a foreign key.
It's better to use userid since it's an integer value:
it takes less space than the string values of username, and
the searches by the database engine would be faster.
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)
I am not a SQL expert, so I defer to someone with more knowledge. So here is my question. I have designed a database where every table has an Id column (auto increment) that is the primary key. I use this design without any issue, and it makes sense to me: I simply do referential integrity by way of this simple primary key, since the Id column of each table uniquely identifies each row.
Some of my colleagues have suggested that I use composite primary keys, but I see no value in doing that. The purpose of a primary key is to enable referential integrity, and that is what it does.
For example, this is a toy example but it demonstrates my design:
tbl_Customers
-------------
Id (PK)
Code (VARCHAR)
Name (VARCHAR)
Surname (VARCHAR)
tbl_CustomerDetails
-----------------
Id (PK)
CustomerId (FK to tbl_Customers)
SomeDetails (VARCHAR)
This does not use a separate 'linking' table, but that does not matter; it demonstrates my design.
Some of my colleagues noted that I should have a composite primary key on tbl_Customers to not only include Id as I do now, but also Code. They say that this will improve performance and that it will ensure that Code will not duplicate.
My counter argument is that if I want Code to not duplicate, I can create a UNIQUE INDEX on Code. And that, since my front-end only ever works with Ids and never allows for example searching (SELECTing) by Code, that there can not be a performance improvement. On my presentation layer, if I show for example Customers and I allow the user to select one to see the associated CustomerDetails, I will select the corresponding tbl_CustomerDetails rows on CustomerId where it matches the selected Id of the clicked customer.
What do you suggest? Am I correct or am I wrong? I am always willing to learn, and if I am wrong here I'd love to learn. But at the moment, I do not feel their arguments are valid. Which is why I am asking the community.
Thanks!
I would suggest going with a single-column primary key instead of composite keys. The biggest drawback with a composite key is that you require more than one value/column to identify a row. If your application uses an O/RM (Object/Relational Mapping) layer, then you will have fits mapping these database rows to objects in a programming language. O/RMs are easiest to set up when every table has a single-column primary key.
Programming aside, the major drawback of composite keys in general, and especially composite keys requiring many columns, is that all of this data needs to be specified and copied to child tables in order to set up proper relationships between tables, which wastes space and adds unnecessary complexity too.
The biggest headache I've run into with developers is they assume "uniqueness of data" equates "identifying a row in the database". This is rarely the case. I've found applications and databases to be much more maintainable and easy to build by defaulting to single column primary keys, and using composite keys as an exception to the rule, then enforcing data uniqueness by using unique constraints or indexes on those columns.
After reading your question and arguments, I would like to say you are not wrong.
You have an auto-incremented Id, which will always provide uniqueness for your rows.
Now, talking about the Code column: if Code should be unique, you can always put a UNIQUE constraint on that column, which will not allow duplicate values for Code. Since your front end only works with Ids, there is no need to add a composite primary key on (Id, Code), but make sure you add the UNIQUE constraint on the Code column.
You have already given the explanation, buddy, and I believe you are totally right.
If you are going to make a composite primary key, then you have to consider two things here:
A composite PK on (Id, Code) will allow duplicate Ids and duplicate Codes; it will only reject duplicate combinations.
You have to add the Code column to the tbl_CustomerDetails table as well if you are going to link both tables.
In summary, I would like to say I don't feel that a composite primary key is required in this case.
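The first of those two points is worth seeing in action, because it undercuts the claim that a composite key will stop Code from duplicating. This is a sketch in SQLite via Python's sqlite3 module (an assumption; the asker's DBMS is not stated). The composite_customers table is a hypothetical stand-in for the colleagues' proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The colleagues' proposal: composite primary key on (Id, Code).
CREATE TABLE composite_customers (
    Id   INTEGER NOT NULL,
    Code TEXT    NOT NULL,
    PRIMARY KEY (Id, Code)
);
-- The asker's design: surrogate PK plus a UNIQUE constraint on Code.
CREATE TABLE tbl_Customers (
    Id   INTEGER PRIMARY KEY AUTOINCREMENT,
    Code TEXT NOT NULL UNIQUE
);
""")

# Only the (Id, Code) pair must be unique, so the same Code goes in twice.
conn.execute("INSERT INTO composite_customers VALUES (1, 'ABC')")
conn.execute("INSERT INTO composite_customers VALUES (2, 'ABC')")
composite_dupes = conn.execute(
    "SELECT COUNT(*) FROM composite_customers WHERE Code = 'ABC'"
).fetchone()[0]

# With the UNIQUE constraint, the second 'ABC' is rejected.
conn.execute("INSERT INTO tbl_Customers (Code) VALUES ('ABC')")
try:
    conn.execute("INSERT INTO tbl_Customers (Code) VALUES ('ABC')")
    unique_rejected = False
except sqlite3.IntegrityError:
    unique_rejected = True
```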
If your question is, should you use a composite key in your example, the answer to that is a resounding NO! Your colleague's suggestion to add code as a composite key is not only unnecessary but will more than likely introduce problems for you down the road. Let me illustrate:
Let's say that you'd like to distinguish customers by code: all members have code MEMB plus the Id number, all vendors have code VEND plus the Id number, and all customers have code CUST plus the Id number.
Among the "customers" are donors who don't purchase anything but give a contribution. You decide to make a distinction between donors and customers.
That means you'll have to change the code of some of your customers from CUST to DONOR plus Id. To make that change you will have to UPDATE EVERY INSTANCE of CUST that's a donor into DONOR. That could be a nightmare to say the least as you'll need to figure out every table that has that Id as a reference.
With your current set up, all you have to do is update the Code in ONE place and no more changes are needed. So you're right in your implementation.
I'm new to SQL Server and would really appreciate it if you could help me out here.
We are a healthcare provider, and internally we assign an ID to each patient (for example, 1234). I'm currently constructing another database, and I wonder whether I can use our internal IDs as the primary key, given they are unique. If so, since I am not going to do any calculation on the primary key, can I set them to a string/char data type?
In short, yes you can, but it is not recommended at all!
To give you a heads-up:
Primary keys should never change.
You should not use a natural key or a key from another system.
They should not be derived from any formula.
Use a short but suitable key type.
If you have an external key that you want to use to find some patients, create another column for it and add a UNIQUE constraint to it.
Just make sure that column is indexed (in SQL Server, a UNIQUE constraint creates a unique index for you).
Read this post of mine for more information:
http://pilpag.blogspot.dk/2016/06/relational-database-designsimple-rules.html
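As a rough illustration of the advice above (surrogate primary key, external ID in its own UNIQUE column), here is a sketch in SQLite via Python's sqlite3 module rather than SQL Server. All table and column names, and the sample visit date, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Patients (
    PatientKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- hypothetical surrogate key
    InternalId TEXT NOT NULL UNIQUE                -- the provider's ID, e.g. '1234'
);
CREATE TABLE Visits (                              -- hypothetical child table
    VisitId    INTEGER PRIMARY KEY AUTOINCREMENT,
    PatientKey INTEGER NOT NULL REFERENCES Patients (PatientKey),
    VisitDate  TEXT NOT NULL
);
""")

key = conn.execute(
    "INSERT INTO Patients (InternalId) VALUES ('1234')"
).lastrowid
conn.execute(
    "INSERT INTO Visits (PatientKey, VisitDate) VALUES (?, '2016-06-01')", (key,)
)

# Child tables carry only the surrogate; the internal ID lives in one place
# and is still usable for lookups through the UNIQUE column.
found = conn.execute("""
    SELECT v.VisitDate
    FROM Visits v
    JOIN Patients p ON p.PatientKey = v.PatientKey
    WHERE p.InternalId = ?
""", ("1234",)).fetchall()
```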
The conditions for a primary key are that the key is unique in the table and never NULL.
Your patient id would appear to have these characteristics.
That said, there are good reasons for developing a synthetic primary key (auto-incremented/identity/serial depending on the database). More importantly, the actual patient ID may be sensitive information. For instance, patients might use the id when logging in or it might be printed on invoices.
It might not be a good idea to have sensitive information repeated throughout the database. For this reason, an "internal" id would be used to refer to patients in table and all the sensitive information would be contained in one or a handful of tables.
This would perhaps be more obvious if the "patient id" were a government id ("social security number") or email address.
Yes, but the ID can also be numeric and a primary key - it doesn't have to be a string. As long as the ID is unique, you should be fine.
Yes, you can use your internal IDs if they are unique. SQL Server limits a clustered index key (the default for a primary key) to 900 bytes, which matters for char/varchar columns, so if your IDs are int you are fine. But if your IDs can change over time, or can be reused for more than one patient, I strongly recommend not using them, to avoid chaos. I prefer a surrogate key, like an identity column.
If I understand correctly, you are assigning each patient a number so as to uniquely identify them. So a report would contain the patient number rather than only a patient's name which can be ambiguous. You won't ever change the patient numbers, because then you'd have to change this in all databases and would have to re-print all documents on the patient that are still needed. This makes this number a perfect primary key for a patient table in any of your databases.
You could use a generated technical ID instead as the table's primary key and have the patient number only as another field in the table (which would still have a unique constraint of course, because it is still the business key uniquely identifying a patient). Whether to do this or not is mainly a matter of personal preference and experience. I prefer natural keys over IDs (so I would make the patient number the primary key). This stems from having worked with rather large databases with thousands of tables and much hierarchy where the natural keys proved to result in faster queries, enhanced data consistency and easier maintenance. Others may have different experience, though.
So yes, the patient number seems to be the perfect natural primary key in my opinion.
I am in no way a SQL expert so I am sure I did something wrong. I have read a few questions on here about needing a primary key. The way I created this table I can't find a way to actually have a unique key. It is a survey type database. I have a table for the main details like date, triage number, and the person involved. Another table for the questions results and another for the comments. I would have made the triage unique but more than one person can be involved so the same triage number would be used more than once. The people involved can appear more than once as well. The only truly unique thing is combining the person with the triage. I thought about an auto key but it would serve no purpose. Can using two identifiers be an acceptable practice for a survey type table?
The important part:
"... more than one person can be involved so the same triage number would be used more than once. The people involved can appear more than once as well."
Based on your comments, data in these two fields, for example:
Triage Person
------ ------
1 PersonA
1 PersonB
...
7 PersonA
7 PersonB
is fine in that Triage and Person can make a composite key, provided each person recorded in the Person field is uniquely identifiable. That is, if each Person value is a name like "John Smith", you may have a problem if there are two or more John Smiths answering the survey. So, your Person value itself has to identify people uniquely. Assuming the triage numbers are distinct (i.e., no triage number represents more than one semantically relevant triage position), these two fields will work as the composite key if and only if your survey never creates more than one row for the same triage-person combination.
The foreign key for each of your other tables ought to be the main table's composite key combination, but if the other two tables can be merged into the main one, consider it to reduce join burdens. E.g.: if the comments table stores only comments in a single field and nothing more, why not include that field in the main table and get rid of the comments table?
Your question is quite general and I don't have enough information to give you a definite answer but hopefully my comments below can help.
It is not a problem to use a composite primary key (key consisting of 2 or more columns). It is more often used in linking tables, e.g. in many-to-many relationships.
One thing that you should consider is that if you want to also refer to a table with a composite primary key from other tables, you will have to refer to 2 columns in the foreign key, all the joins, etc. It may be easier to create a separate column for a primary key (e.g. autoincrementing number).
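To show what "refer to 2 columns in the foreign key" looks like in practice, here is a sketch in SQLite via Python's sqlite3 module (the asker's DBMS is not stated, so this is an assumption). Table names, the SurveyDate column and the date values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Survey (                  -- hypothetical main table
    Triage     INTEGER NOT NULL,
    Person     TEXT    NOT NULL,
    SurveyDate TEXT    NOT NULL,
    PRIMARY KEY (Triage, Person)       -- composite primary key
);
CREATE TABLE Comments (                -- hypothetical child table
    Triage  INTEGER NOT NULL,
    Person  TEXT    NOT NULL,
    Comment TEXT    NOT NULL,
    FOREIGN KEY (Triage, Person) REFERENCES Survey (Triage, Person)
);
""")

conn.executemany(
    "INSERT INTO Survey VALUES (?, ?, '2020-01-01')",
    [(1, "PersonA"), (1, "PersonB"), (7, "PersonA"), (7, "PersonB")],
)

# The same triage/person pair cannot be entered twice.
try:
    conn.execute("INSERT INTO Survey VALUES (1, 'PersonA', '2020-01-02')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("INSERT INTO Comments VALUES (7, 'PersonB', 'No issues')")

# Every join has to match on both key columns.
joined = conn.execute("""
    SELECT s.Triage, s.Person, c.Comment
    FROM Survey s
    JOIN Comments c ON c.Triage = s.Triage AND c.Person = s.Person
""").fetchall()
```

With an autoincrementing key on Survey, the child table would carry one column instead of two, at the cost of needing a separate UNIQUE constraint on (Triage, Person).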
I'm trying to model artists and songs, and I have a problem where a Song_Performance can be performed by many artists (say a duet), so I have an Artist_Group to represent who the song is performed by.
Well, I now have a many-to-many relationship between Artist and Artist_Group, where an Artist_Group is uniquely identified by the collection of artists in that group. I can create an intersection entity that represents an Artist's participation in an Artist_Group (Artist_Group_Participation?)
I'm having trouble coming up with how to come up with a primary key for the Artist_Group entity that preserves the fact that the same set of artists represents the same group, and lacking a primary key for the Artist_Group entity means I'm lacking a foreign key for the Artist_Group_Participation entity.
The book "Mastering Data Modeling" by John Carlis and Joseph Maguire mentions this shape, refers to it as a "Many-Many Collection Entity", and states that it is very rare, but doesn't say how to resolve it, since obviously a many-to-many relationship can't be stored directly in an RDBMS. How do I go about representing this?
Edit:
Looks like everyone is suggesting an intersection table, but that's not my issue here. I have that. My issue is enforcing the constraint that you cannot add an Artist_Group entry where the group of artists that it contains are the same as an existing group, ignoring order. I thought about having the ID for Artist_Group be a varchar that is the concatenation of the various artists that comprise it, which would solve the issue if order mattered, but having an Artist_Group for "Elton John and Billy Joel" doesn't prevent the addition of a group for "Billy Joel and Elton John".
I guess I'm missing the point of the "Artist_Group" relation.
The data model in my mind is:
Artist: an individual person.
Song: The song itself.
Performance: A particular performance or arrangement of a song. Usually this would have one song, but you could provide an m:n linking table to accommodate a medley. Ideally, this would be a single real performance, i.e., there would be an associated date.
Recording: A particular fixed version of a performance (CD or whatever). Usually a Performance only has one Recording, but having a separate table would handle the Grateful Dead / multiple-bootleg scenario, as well as re-release albums, radio play vs. live vs. CD versions, etc.
Performance_Artists: A linking table from a particular performance to a list of performers. For each, you could also have an attribute that describes their role(s) in the performance (vocalist, drummer, etc.).
There's no explicit relationship between a set of performers, except that they share performances in common. Thus, any table that attempts to combine random sets of artists outside the context of a recording is not an accurate relational model, as there is no real relationship.
If you are trying to represent an explicit relationship between a set of artists (i.e., they are in the same band), well, bands have names that have uniqueness (though not enough to be a primary key), and a band could be stored simply as an Artist, and then have an Artist_Member linking table that is self-referencing back to the individual Artist records. Or you could have a separate Band table, and a Band_Members table to assign artists to it, perhaps with dates of membership. Either way, just remember that band members change over time and band roles change from one song to the next, so associating a band with a performance should not substitute for linking performances directly to the artists involved.
The primary key for both Artist and Artist_Group would be a numeric, incremental ID. Then you'd have an Artist_Group_Participation table that has two columns: artist_id and group_id. These would be foreign keys that refer to the IDs of their respective tables. Then to SELECT everything you'd use a JOIN.
EDIT: Sorry, I misunderstood your question. The only other way I can think of is add an "artists" column to your Artist_Group table that contains a serialized array (assuming you're using PHP, but other languages have equivalents) of the artists and their IDs. Then just add a UNIQUE constraint to the column.
You could make each artist's ID correspond to a bit in a bitfield. So if Elton John is ID 12 and Billy Joel is ID 123, then the "group" formed by a duet between Elton John and Billy Joel is Artist_Group ID 10633823966279326983230456482242760704 (i.e. it has the 12th and 123rd bit set).
You could enforce the relationship using the intersection table. For example, using a CHECK constraint in PostgreSQL:
CREATE TABLE Artist_Group_Participation (
artist_id int not null,
artist_group_id int not null,
PRIMARY KEY (artist_id, artist_group_id),
FOREIGN KEY (artist_id) REFERENCES Artists (artist_id),
FOREIGN KEY (artist_group_id) REFERENCES Artist_Group (artist_group_id),
CHECK (B'1'<<artist_id & artist_group_id <> 0)
);
Admittedly, this is a hack. It applies extra significance to the Artist_Group surrogate key, when surrogate keys are supposed to be unique but not contain information.
Also if you have thousands of artists, and new artists every day, things could get unwieldy because the length of the Artist_Group key's data type needs to grow larger all the time.
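The arithmetic behind the bitfield idea can be checked in a few lines of Python (used here only because its integers are arbitrary precision; the helper names are hypothetical and this is not the PostgreSQL constraint itself). It also shows how quickly the key grows.

```python
def group_mask(artist_ids):
    """Build a group ID by setting one bit per artist ID."""
    mask = 0
    for artist_id in artist_ids:
        mask |= 1 << artist_id
    return mask


def contains(mask, artist_id):
    """Return True if the bit for artist_id is set in the group ID."""
    return ((mask >> artist_id) & 1) == 1


# Elton John is ID 12 and Billy Joel is ID 123, as in the example above.
duet = group_mask([12, 123])
same_duet_reordered = group_mask([123, 12])

# Number of bits the key needs: driven by the largest artist ID in the group.
bits_needed = duet.bit_length()
```

Because OR-ing bits is order-independent, both orderings give the same group ID, which is exactly the uniqueness property the question asks for. The cost is that an artist with ID n forces a key at least n + 1 bits wide.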
I guess you could build a primary key by sorting and concatenating the artist IDs:
group: 3,2,6 -> 2-3-6 and 6,3,2 -> 2-3-6
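A minimal sketch of that idea, assuming the key is built in application code before the insert (the group_key helper and the column name are hypothetical). It uses SQLite through Python's sqlite3 module so it runs as-is.

```python
import sqlite3


def group_key(artist_ids):
    """Canonical key: distinct artist IDs in ascending order, joined by '-'."""
    return "-".join(str(i) for i in sorted(set(artist_ids)))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist_Group (group_key TEXT PRIMARY KEY)")

conn.execute("INSERT INTO Artist_Group VALUES (?)", (group_key([3, 2, 6]),))

# The same artists in a different order produce the same key, so the
# primary key rejects the second group.
try:
    conn.execute("INSERT INTO Artist_Group VALUES (?)", (group_key([6, 3, 2]),))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Note that the database only enforces uniqueness of the string; it trusts the application to always build the key the same way and to keep it in sync with the participation rows.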
I don't have much experience in RDBMS. However, I have read papers of Codd and books by C.J. Date.
So, instead of using RDBMS jargon, I'll try to explain in more common sensical terms (at least to me!)
Here goes -
Singer names should be standardized on a "First Name - Last Name" basis.
Each "Singer" should have an entry in the "Artists Group" table, even if they have only performed solo.
Each entry in "Artists Group" will consist of multiple "Singer" entries ordered alphabetically. There should be a single occurrence of any specific combination.
Each song will reference a unique record from "Artists Group", regardless of whether the performers are solo, a duet or a gang.
I don't know if this makes much sense, but it's my two cents!