Numeric IDs vs. String IDs - sql

I'm using a very stripped down example here so please ask if you need more context.
I'm in the process of restructuring/normalising a database where the majority of the tables have auto-incremented numeric primary keys (1, 2, 3, etc.), and I'm thinking I need to change the ID field from a numeric value to a string value generated from data in the row.
My reasoning for this is as follows:
I have 5 tables: Staff, Members, Volunteers, Interns and Students. All of these have numeric IDs.
I have another table called BuildingAttendance which logs when people visited the premises and for what reason which has the following relevant fields:
ID  Type  Premises  Attended
To differentiate between staff and members, I use the Type field, with MEM for member, STA for staff, etc. So as an example:
ID  Type  Premises    Attended
1   MEM   Building A  27/6/15
1   STA   Building A  27/6/15
2   STU   Building B  27/6/15
I'm thinking it might be a better design to use an ID similar to the following:
ID    Premises    Attended
MEM1  Building A  27/6/15
STA1  Building A  27/6/15
STU2  Building B  27/6/15
What would be the best way to deal with this? I know that if my primary key is a string my query performance may take a hit, but is this easier than having 2 columns?
tl;dr - How should I deal with a table that references records from other tables that share the same ID system?

Auto-incremented numeric ids have several advantages over strings:
They are easier to implement. In order to generate the strings (as you want them), you would need to implement a trigger or computed column.
They occupy a fixed amount of storage (probably 4 bytes), so they are more efficient in the data record and in indexes.
They allow members to change between types, without affecting the key.
The problem that you are facing is that you have subtypes of a supertype. This information should be stored with the person, not in the attendance record (unless a person could change their type with each visit). There are several ways to approach this in SQL, none as clean as simple class inheritance in a programming language.
One technique is to put all the data in a single table called something like Persons. This would have a unique id, a type, and all the columns from your five tables. The problem is when the columns from your subtables are very different.
In that case, have a table called persons with a unique primary key and the common columns. Then have separate tables for each one and use the PersonId as the primary key for these tables.
The advantage to this approach is that you can have a foreign key reference to Persons for something like BuildingAttendance. And, you can also have foreign key references to each of the subtypes, for other tables where appropriate.
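As a rough illustration of the supertype/subtype layout described above, here is a minimal sketch using Python's built-in sqlite3 module. The column names Name, JobTitle and JoinedOn and the sample rows are assumptions made up for the example; only the table names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.executescript("""
CREATE TABLE Persons (
    PersonId INTEGER PRIMARY KEY,   -- auto-assigned numeric id
    Name     TEXT NOT NULL          -- hypothetical common column
);
CREATE TABLE Staff (
    PersonId INTEGER PRIMARY KEY REFERENCES Persons(PersonId),
    JobTitle TEXT                   -- hypothetical subtype column
);
CREATE TABLE Members (
    PersonId INTEGER PRIMARY KEY REFERENCES Persons(PersonId),
    JoinedOn TEXT                   -- hypothetical subtype column
);
CREATE TABLE BuildingAttendance (
    PersonId INTEGER NOT NULL REFERENCES Persons(PersonId),
    Premises TEXT NOT NULL,
    Attended TEXT NOT NULL
);
""")

# Each person gets one id from Persons; the subtype row reuses it.
alice = conn.execute("INSERT INTO Persons (Name) VALUES ('Alice')").lastrowid
conn.execute("INSERT INTO Staff (PersonId, JobTitle) VALUES (?, 'Caretaker')", (alice,))
bob = conn.execute("INSERT INTO Persons (Name) VALUES ('Bob')").lastrowid
conn.execute("INSERT INTO Members (PersonId, JoinedOn) VALUES (?, '2015-01-10')", (bob,))

conn.executemany(
    "INSERT INTO BuildingAttendance VALUES (?, ?, ?)",
    [(alice, "Building A", "2015-06-27"), (bob, "Building B", "2015-06-27")],
)

# The type is derived from the subtype tables, not stored per visit.
rows = conn.execute("""
    SELECT p.Name,
           CASE WHEN s.PersonId IS NOT NULL THEN 'STA'
                WHEN m.PersonId IS NOT NULL THEN 'MEM' END AS Type,
           a.Premises
    FROM BuildingAttendance a
    JOIN Persons p      ON p.PersonId = a.PersonId
    LEFT JOIN Staff s   ON s.PersonId = p.PersonId
    LEFT JOIN Members m ON m.PersonId = p.PersonId
    ORDER BY p.PersonId
""").fetchall()
print(rows)
```

With this layout the attendance table needs only one numeric column to identify the visitor, and the database can enforce that the id refers to a real person.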

Gordon Linoff already provided an answer that points out the type/supertype issue. I refer to this as class/subclass, but that's just a difference in terminology.
There are two tags in this area that collect questions that relate to class/subclass. Here they are:
class-table-inheritance
shared-primary-key
If you look over the info tab for each of these tags, you'll see a brief outline. The answers to the tagged questions should also help you with your case.
By creating a single table called Person, with an autonumber ID, you provide a handy way of referencing a person, regardless of that person's type. By making the staff, member, volunteer, student, and intern tables use a copy of this ID as their own ID you will facilitate whatever joins you need to perform.
The decision about whether to include type in attendance depends on whether you want to retrieve the data with the person's current type, or with the type the person had at the time of the attendance.

Related

How to structure SQL tables with one (non-composite) candidate key and all non-primary attributes?

I'm not very familiar with relational databases but here is my question.
I have some raw data that's collected as a result of a customer survey. For each customer who participated, there is only one record and that's uniquely identifiable by the CustomerId attribute. All other attributes I believe fall under the non-prime key description as no other attribute depends on another, apart from the non-composite candidate key. Also, all columns are atomic, as in, none can be split into multiple columns.
For example, the columns are like CustomerId(non-sequential), Race, Weight, Height, Salary, EducationLevel, JobFunction, NumberOfCars, NumberOfChildren, MaritalStatus, GeneralHealth, MentalHealth and I have 100+ columns like this in total.
So, as far as I understand we can't talk about any form of normalization for this kind of dataset, am I correct?
However, given the excessive number of columns, if I wanted to split this monolithic table into tables with fewer columns, ie based on some categorisation of columns like demographics, health, employment etc, is there a specific name for such a structure/approach in the literature? All the tables are still going to be using the CustomerId as their primary key.
Yes, this is part of an assignment and as part of a task, it's required to fit this dataset into a relational DB, not a document DB which I don't think would gain anything in this case anyway.
So, there is no direct question as such as I worded above but creating a table with 100+ columns doesn't feel right to me. Therefore, what I am trying to understand is how the theory approaches such blobs. Some concept names or potential ideas for further investigation would be appreciated as I even don't know how to look this up.
In a relational database, keeping all of the information in one table is not good practice.
As you mentioned, grouping some columns into other tables and joining those tables to the master table works well. With this approach you can also model one-to-many, many-to-one and many-to-many relationships; for example, a customer could have more than one address or phone number.
Another approach is to create a table such as customer_properties with columns like property_type and property_value, and store the data as rows:
customer_id  property_type   property_value
1            num_of_child    3
1            age             22
1            marital_status  Single
...
However, the first approach is more effective and more common.
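The first approach, splitting the wide table into category tables that all share CustomerId as their primary key, could look roughly like the sketch below (Python with the standard sqlite3 module). The table names customer, customer_demographics and customer_health and the sample values are assumptions; the column names are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    CustomerId INTEGER PRIMARY KEY
);
-- One row per customer in each category table, keyed by the same id.
CREATE TABLE customer_demographics (
    CustomerId       INTEGER PRIMARY KEY REFERENCES customer(CustomerId),
    Race             TEXT,
    MaritalStatus    TEXT,
    NumberOfChildren INTEGER
);
CREATE TABLE customer_health (
    CustomerId    INTEGER PRIMARY KEY REFERENCES customer(CustomerId),
    Weight        REAL,
    Height        REAL,
    GeneralHealth TEXT
);
""")

conn.execute("INSERT INTO customer VALUES (1017)")
conn.execute("INSERT INTO customer_demographics VALUES (1017, NULL, 'Single', 3)")
conn.execute("INSERT INTO customer_health VALUES (1017, 70.5, 1.75, 'Good')")

# Joining on the shared key reassembles the original wide record.
wide_row = conn.execute("""
    SELECT c.CustomerId, d.MaritalStatus, d.NumberOfChildren, h.Weight
    FROM customer c
    LEFT JOIN customer_demographics d ON d.CustomerId = c.CustomerId
    LEFT JOIN customer_health h       ON h.CustomerId = c.CustomerId
""").fetchone()
print(wide_row)
```

Because every category table has at most one row per customer, the joins are one-to-one and never multiply rows.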

Table needs more than one identifier

I am in no way a SQL expert so I am sure I did something wrong. I have read a few questions on here about needing a primary key. The way I created this table, I can't find a way to actually have a unique key. It is a survey-type database. I have a table for the main details like date, triage number, and the person involved, another table for the question results, and another for the comments. I would have made the triage number unique, but more than one person can be involved, so the same triage number would be used more than once. The people involved can appear more than once as well. The only truly unique thing is the combination of the person with the triage number. I thought about an auto key but it would serve no purpose. Can using two identifiers be an acceptable practice for a survey-type table?
The important part:
"... more than one person can be involved so the same triage number would be used more than once. The people involved can appear more than once as well."
Based on your comments, data in these two fields, for example:
Triage  Person
------  -------
1       PersonA
1       PersonB
...
7       PersonA
7       PersonB
is fine in that Triage and Person can make a composite key, provided each person recorded in the Person field is uniquely identifiable. That is, if each Person value is a name like "John Smith", you may have a problem if two or more John Smiths answer the survey, so the Person value itself has to identify people uniquely. Assuming the triage numbers are distinct (i.e., no triage number represents more than one semantically relevant triage position), these two fields will work as the composite key if and only if your survey never records the same triage-person combination more than once.
The foreign key for each of your other tables ought to be the main table's composite key combination, but if the other two tables can be merged into the main one, consider it to reduce join burdens. E.g.: if the comments table stores only comments in a single field and nothing more, why not include that field in the main table and get rid of the comments table?
Your question is quite general and I don't have enough information to give you a definite answer but hopefully my comments below can help.
It is not a problem to use a composite primary key (key consisting of 2 or more columns). It is more often used in linking tables, e.g. in many-to-many relationships.
One thing that you should consider is that if you want to also refer to a table with a composite primary key from other tables, you will have to refer to 2 columns in the foreign key, all the joins, etc. It may be easier to create a separate column for a primary key (e.g. autoincrementing number).
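To make the trade-off concrete, here is a small sketch of a composite primary key and the two-column foreign key that child tables then need. It uses Python's standard sqlite3 module; the table names survey and survey_comment, the survey_date column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE survey (
    triage      INTEGER NOT NULL,
    person      TEXT    NOT NULL,
    survey_date TEXT,
    PRIMARY KEY (triage, person)        -- the pair is unique, not each column
);
CREATE TABLE survey_comment (
    triage  INTEGER NOT NULL,
    person  TEXT    NOT NULL,
    comment TEXT,
    -- A child table has to carry and reference both key columns.
    FOREIGN KEY (triage, person) REFERENCES survey (triage, person)
);
""")

conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [(1, "PersonA", "2015-06-27"),
     (1, "PersonB", "2015-06-27"),
     (7, "PersonA", "2015-06-28")],
)
conn.execute("INSERT INTO survey_comment VALUES (1, 'PersonA', 'Waited too long')")

# Repeating a triage or a person is fine; repeating the pair is rejected.
try:
    conn.execute("INSERT INTO survey VALUES (1, 'PersonA', '2015-06-29')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Joins must match on both columns as well.
joined = conn.execute("""
    SELECT s.triage, s.person, c.comment
    FROM survey s
    JOIN survey_comment c ON c.triage = s.triage AND c.person = s.person
""").fetchall()
print(duplicate_rejected, joined)
```

If carrying both columns into every child table and join becomes a burden, a single auto-incrementing key on survey plus a UNIQUE constraint on (triage, person) keeps the same guarantee.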

Setup Many-to-Many tables that share a common type

I'm preparing a legacy Microsoft SQL Server database so that I can interface with it through an ORM such as Entity Framework, and my question revolves around handling the setup of some of my many-to-many associations that share a common type. Specifically, should a common type be shared among master types or should each master type have its own linked table?
For example, here is a simple example I concocted that shows how the tables of interest are currently setup:
Notice that there are two types, Teachers and Students, and both can contain zero, one, or many PhoneNumbers. The two tables, Teachers and Students, actually share an association table (PeoplePhoneNumbers). The field FKID is either a TeacherId or a StudentId.
The way I think it ought to be setup is like this:
This way, the Teachers table and the Students table each get their own PhoneNumbers table.
My gut tells me the second way is the proper way. Is this true? What if the PhoneNumbers tables contain several fields? My object-oriented programmer brain is telling me that it would be wrong to have several identical tables, each containing a dozen or so fields, if the only difference between these tables is which master table they are linked to. For example:
Here we have two tables that contain the same information, yet the only difference is that one table holds addresses for Teachers and the other holds addresses for Students. This feels redundant to me and they should really be one table -- but then I lose the ability for the database to constrain them (right?) and also make it messier for myself when I try to apply an ORM to this.
Should this type of common type be merged or should it stay separated for each master type?
Update
The answers below have directed me to the following solution, which is based on subclassing tables in the database. One of my original problems was that I had a common table shared among multiple other tables because that entity type was common to both the other tables. The proper way to handle that is to subclass the shared tables and essentially descend them from a common parent AND link the common data type to this new parent. Here's an example (keep in mind my actual database has nothing to do with Teachers and Students, so this example is highly manufactured but the concepts are valid):
Since Teachers and Students both required PhoneNumbers, the solution is to create a superclass, Party, and FK PhoneNumbers to the Party table. Also note that you can still FK tables that only have to do with Teachers or only have to do with Students. In this example I also subclassed Students and PartTimeStudents one more level down and descended them from Learners.
Where this solution is very satisfactory is when I implement it in an ORM, such as Entity Framework.
The queries are easy. I can query all Teachers AND Students with a particular phone number:
var partiesWithPhoneNumber = from p in dbContext.Parties
                             where p.PhoneNumbers.Any(x => x.PhoneNumber1.Contains(phoneNumber))
                             select p;
And it's just as easy to do a similar query but only for PhoneNumbers belonging to only Teachers:
var teachersWithPhoneNumber = from t in dbContext.Teachers
                              where t.Party.PhoneNumbers.Any(x => x.PhoneNumber1.Contains(phoneNumber))
                              select t;
Teacher and Student are both subclasses of a more general concept (a Person). If you create a Person table that contains the general data that is shared for all people in your database and then create Student and Teacher tables that link to Person and contain any additional details you will find that you have an appropriate point to link in any other tables.
If there is data that is common for all people (such as zero to many phone numbers) then you can link to the Person table. When you have data that is only appropriate for a Student you link it to the Student ID. You gain the additional advantage that Student Instructors are simply a Person with both a Student and Teacher record.
Some ORMs support the concept of subclass tables directly. LLBLGen does so in the way I describe so you can make your data access code work with higher level concepts (Teacher and Student) and the Person table will be managed on your behalf in the low level data access code.
Edit
Some commentary on the current diagram (which may not be relevant in the source domain this was translated from, so a pinch of salt is advised).
Party, Teachers and Learners look good. Salaries looks good if you add start and end dates for the rate so you can track salary history. Also keep in mind it may make sense to use PartyID (instead of TeacherID) if you end up with multiple entities that have a Salary.
PartyPhoneNumbers looks like you might be able to hang the phone number off of that directly. This would depend on if you expect to change the phone number for multiple people (n:m) at once or if a phone number is owned by each Party independently. (I would expect the latter because you might have a student who is a (real world) child of a teacher and thus they share a phone number. I wouldn't want an update to the student's phone number to impact the teacher, so the join table seems odd here.)
Learners to PaymentHistories seems right, but the Students vs PartTimeStudents difference seems artificial. (It seems like PartTimeStudents is more AttendenceDays which in turn would be a result of a LearnerClasses join).
I think you should look into the supertype/subtype pattern. Add a Party or Person table that has one row for every teacher or student. Then, use the PartyID in the Teacher and Student tables as both the PK and FK back to Party (but name them TeacherID and StudentID). This establishes a "one-to-zero-or-one" relationship between the supertype table and each of the subtype tables.
Note that if you have identity columns in the subtype tables they will need to be removed. When creating those entities going forward you will first have to insert to the supertype and then use that row's ID in either subtype.
To maintain consistency you will also have to renumber one of your subtype tables so its IDs do not conflict with the other's. After that, you can use SET IDENTITY_INSERT <table> ON to create the missing supertype rows with explicit ID values.
The beauty of all this is that when you have a table that must allow only one type such as Student you can FK to that table, but when you need an FK that can be either--as with your Address table--you FK to the Party table instead.
A final point is to move all the common columns into the supertype table and put only columns in the subtypes that must be different between them.
Your single Phone table now is easily linked to PartyID as well.
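The insert order described above (supertype row first, then the subtype row reusing its ID) can be sketched as follows. This is an illustration in Python with SQLite rather than SQL Server, so identity handling differs; the Name column, the add_party helper and the sample data are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Party (
    PartyID INTEGER PRIMARY KEY,
    Name    TEXT NOT NULL                 -- hypothetical common column
);
-- Subtype keys are both PK and FK back to Party (one-to-zero-or-one).
CREATE TABLE Teachers (TeacherID INTEGER PRIMARY KEY REFERENCES Party(PartyID));
CREATE TABLE Students (StudentID INTEGER PRIMARY KEY REFERENCES Party(PartyID));
CREATE TABLE PhoneNumbers (
    PartyID     INTEGER NOT NULL REFERENCES Party(PartyID),
    PhoneNumber TEXT    NOT NULL
);
""")

def add_party(subtype_table, id_column, name):
    """Insert the supertype row first, then reuse its id in the subtype.

    Hypothetical helper; the table and column names are trusted constants.
    """
    party_id = conn.execute("INSERT INTO Party (Name) VALUES (?)", (name,)).lastrowid
    conn.execute(f"INSERT INTO {subtype_table} ({id_column}) VALUES (?)", (party_id,))
    return party_id

teacher = add_party("Teachers", "TeacherID", "Ms. Jones")
student = add_party("Students", "StudentID", "Sam")
conn.executemany(
    "INSERT INTO PhoneNumbers VALUES (?, ?)",
    [(teacher, "555-0100"), (student, "555-0100")],
)

# Anyone with this number, regardless of subtype.
party_names = [r[0] for r in conn.execute("""
    SELECT p.Name FROM Party p
    JOIN PhoneNumbers n ON n.PartyID = p.PartyID
    WHERE n.PhoneNumber = ? ORDER BY p.PartyID""", ("555-0100",))]

# Only teachers with this number: add a join to the subtype table.
teacher_names = [r[0] for r in conn.execute("""
    SELECT p.Name FROM Teachers t
    JOIN Party p        ON p.PartyID = t.TeacherID
    JOIN PhoneNumbers n ON n.PartyID = p.PartyID
    WHERE n.PhoneNumber = ?""", ("555-0100",))]
print(party_names, teacher_names)
```

Because both subtype tables draw their IDs from Party, a teacher and a student can never collide on the same ID, which is why the renumbering step is needed when migrating existing data.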
For a much more detailed explanation, please see this answer to a similar question.
The problem that you have is an example of a "one-of" relationship. A person is a teacher or a student (or possibly both).
I think the existing structure captures this information best.
The person has a phone number. Then, some people are teachers and some are students. The additional information about each entity is stored in either the teacher or student table. Common information, such as name, is in the phone table.
Splitting the phone numbers into two separate tables is rather confusing. After all, a phone number does not know whether it is for a student or a teacher. In addition, you don't have space for other phone numbers, such as for administrative staff. You also have a challenge for students who may sometimes teach or help teach a class.
Reading your question, it looks like you are asking for a common database schema for your situation. I've seen several in the past, some easier to work with than others.
One option is having a Student_Address table and a Teacher_Address table that both use the same Address table. This way if you have entity specific fields to store, you have that capability. But this can be slightly (although not significantly) harder to query against.
Another option is how you suggested above -- I would probably just add a primary key on the table. However you'd want to add a PersonTypeId field to that table (PersonTypeId which links to a PersonType table). This way you'd know which entity was with each record.
I would not suggest having two PhoneNumber tables. I think you'll find it much easier to maintain with all in the same table. I prefer keeping same entities together, meaning Students are a single entity, Teachers are a single entity, and PhoneNumbers are the same thing.
Good luck.

how to design a schema where the columns of a table are not fixed

I am trying to design a schema where the columns of a table are not fixed. For example, I have an Employee table whose attributes are not fixed and vary, so frequent addition of a new attribute/column is a requirement. The options I see are:
1. Nullable columns in the Employee table itself, i.e. no normalization.
2. Instead of adding nullable columns, separate those columns out into their own tables, e.g. if Address is a column to be added, then create a table Address [EmployeeId, AddressValue].
3. Create tables ExtensionColumnName [EmployeeId, ColumnName] and ExtensionColumnValue [EmployeeId, ColumnValue]. ExtensionColumnName would have ColumnName as "Address" and ExtensionColumnValue would have ColumnValue as the address value.
Employee table
    EmployeeId
    Name
ExtensionColumnName table
    ColumnNameId
    EmployeeId
    ColumnName
ExtensionColumnValue table
    EmployeeId
    ColumnNameId
    ColumnValue
There is a drawback in the first two ways, as the schema changes with every new attribute. Note that adding a new attribute is frequent and a requirement.
I am not sure if this is a good or bad design. If someone has had a similar decision to make, please give some insight on things like foreign keys / data integrity, indexing, performance, reporting, etc.
It might be useful to look at the current crop of NoSQL databases which allow you to store arbitrary sets of key-value pairs per record.
I would recommend you look at CouchDB, MongoDB, Lucene, etc.
If the schema changes often in an SQL database this ends up in a nightmare, especially with reporting.
Putting everything in (rowId, key, value) triads is flexible, but slower because of the huge number of records.
The way the ERP vendors do it is to build their schema from the fields they're sure of and add a largish number of "flexfields" (i.e. 20 numbers, 20 strings, etc.) in fixed, named columns, then use a lookup table to see which flex column corresponds to what. This allows some flexibility for the future while essentially keeping a static schema.
I recommend using a combination of numbers two and three. Where possible, model tables for standard associations like addresses. This is the most ideal approach...
But for constantly changing values that can't be summarized into logical groupings like that, use two tables in addition to the EMPLOYEES table:
EMPLOYEE_ATTRIBUTE_TYPE_CODES (two columns, employee_attribute_type_code and DESCRIPTION)
EMPLOYEE_ATTRIBUTES (three columns: employee_id foreign key to EMPLOYEES, employee_attribute_type_code foreign key to EMPLOYEE_ATTRIBUTE_TYPE_CODES, and VALUE)
In EMPLOYEE_ATTRIBUTES, set the primary key to be made of:
employee_id
employee_attribute_type_code
This will prevent the same attribute from being recorded twice for the same employee.
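A minimal sketch of these two tables, using Python's standard sqlite3 module, might look like this. The name column on EMPLOYEES, the attribute codes and the try_insert helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE EMPLOYEES (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL                     -- hypothetical column
);
CREATE TABLE EMPLOYEE_ATTRIBUTE_TYPE_CODES (
    employee_attribute_type_code TEXT PRIMARY KEY,
    description                  TEXT NOT NULL
);
CREATE TABLE EMPLOYEE_ATTRIBUTES (
    employee_id INTEGER NOT NULL REFERENCES EMPLOYEES(employee_id),
    employee_attribute_type_code TEXT NOT NULL
        REFERENCES EMPLOYEE_ATTRIBUTE_TYPE_CODES(employee_attribute_type_code),
    value TEXT,
    PRIMARY KEY (employee_id, employee_attribute_type_code)
);
""")

conn.execute("INSERT INTO EMPLOYEES VALUES (1, 'Pat')")
# Adding a new kind of attribute is a row insert, not a schema change.
conn.executemany(
    "INSERT INTO EMPLOYEE_ATTRIBUTE_TYPE_CODES VALUES (?, ?)",
    [("SHOE_SIZE", "Shoe size"), ("BADGE_NO", "Badge number")],
)
conn.execute("INSERT INTO EMPLOYEE_ATTRIBUTES VALUES (1, 'SHOE_SIZE', '42')")

def try_insert(row):
    """Return True if the attribute row was accepted (hypothetical helper)."""
    try:
        conn.execute("INSERT INTO EMPLOYEE_ATTRIBUTES VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert((1, "SHOE_SIZE", "43"))    # same employee + code again
unknown_ok = try_insert((1, "EYE_COLOR", "green"))   # code not in the lookup table
new_ok = try_insert((1, "BADGE_NO", "A-17"))         # a different, known code

attrs = dict(conn.execute(
    "SELECT employee_attribute_type_code, value "
    "FROM EMPLOYEE_ATTRIBUTES WHERE employee_id = 1"))
print(duplicate_ok, unknown_ok, new_ok, attrs)
```

Note that every value is stored as text here, so type checking and range constraints on individual attributes have to be handled elsewhere; that is one of the usual costs of this design.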
If, as you say, new attributes will be added frequently, an EAV data model may work well for you.
There is a pattern called the observation pattern.
For explanation, see these questions/answers: one, two, three.
In general, looks like this:
For example, subjects employee, company and animal can all have observation Name (trait), subjects employee and animal can have observation Weight (measurement) and subject beer bottle can have observations Label (trait) and Volume (measurement). It all fits in the model.
Combine your ExtensionColumn tables into one
Property:
EmployeeID foreign key
PropertyName string
PropertyValue string
If you use a monotonic sequence for assigning primary keys in all your object tables then a single property table can hold properties for all objects.
I would use a combination of 1 and 2. If you are adding attributes frequently, I don't think you have a handle on the data requirements.
I suspect some of the attributes being added belong in another table. If you keep adding attributes like java certified, asp certified, ..., then you need a certification table. This can be related to a certification code table listing the available certifications.
Attributes like manager may be either an attribute or a relationship table. If you have multiple relationships between employees, then consider a relationship table with a relation type. Organizations with a matrix management structure will require a relationship table.
Addresses and phone numbers often go in separate tables. An address key like employee_id, address_type would be appropriate. If history is desired, add a start_date column to the key.
If you are keeping history, I recommend using start_date and end_date columns on the appropriate tables. I try to use a relationship where the record is active when 'start_date <= date-being-considered < end_date' is true.
Attributes like weight, eye color, etc.
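The half-open interval rule for history rows (start_date <= date-being-considered < end_date) can be sketched as below, in Python with the standard sqlite3 module. The employee_address table, the address_on helper, the sample rows and the use of '9999-12-31' as an open-ended end date are all assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_address (
    employee_id  INTEGER NOT NULL,
    address_type TEXT    NOT NULL,
    start_date   TEXT    NOT NULL,   -- ISO dates compare correctly as text
    end_date     TEXT    NOT NULL,   -- exclusive; '9999-12-31' means still current (assumption)
    address      TEXT    NOT NULL,
    PRIMARY KEY (employee_id, address_type, start_date)
);
""")
conn.executemany(
    "INSERT INTO employee_address VALUES (?, ?, ?, ?, ?)",
    [(1, "HOME", "2014-01-01", "2015-07-01", "12 Old Road"),
     (1, "HOME", "2015-07-01", "9999-12-31", "34 New Street")],
)

def address_on(employee_id, address_type, as_of):
    """Return the address active on as_of, or None (hypothetical helper)."""
    row = conn.execute(
        """SELECT address FROM employee_address
           WHERE employee_id = ? AND address_type = ?
             AND start_date <= ? AND ? < end_date""",
        (employee_id, address_type, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

before_move = address_on(1, "HOME", "2015-06-30")
on_move_day = address_on(1, "HOME", "2015-07-01")
too_early = address_on(1, "HOME", "2013-12-31")
print(before_move, on_move_day, too_early)
```

Because the end date is exclusive, one row's end_date can equal the next row's start_date without any day matching both rows or falling between them.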

Two tables or one table?

A quick question regarding table design.
Let's say I am designing a loan application database.
As it is right now, I will have 2 tables:
Applicant (ApplicantID, FirstName, LastName, SSN, Email...)
and
Co-Applicant (CoApplicantID, FirstName, LastName, SSN, Email..., ApplicantID)
Should I consider having just one table, since all the fields are the same?
Person( PersonID, FirstName, LastName , SSN, Email...,ParentID (This determines if it is a co-applicant))
What are the pros and cons of these two approaches ?
I suggest the following data model:
PERSON table
PERSON_ID, pk
LOAN_APPLICATIONS table
APPLICATION_ID, pk
APPLICANT_TYPE_CODE table
APPLICANT_TYPE_CODE, pk
APPLICANT_TYPE_CODE_DESCRIPTION
LOAN_APPLICANTS table
APPLICATION_ID, pk, fk
PERSON_ID, pk, fk
APPLICANT_TYPE_CODE, fk
Person( PersonID, FirstName, LastName , SSN, Email...,ParentID (This determines if it is a co-applicant))
That works if a person will only ever exist in your system as either an applicant or a co-applicant. A person could be a co-applicant to numerous loans and/or an applicant themselves - you don't want to be re-entering their details every time.
This is the benefit of how and why things are normalized. Based on the business rules and the inherent reality of usage, the tables are set up to stop redundant data from being stored. This is for the following reasons:
Redundant data is a waste of space & resources to support & maintain
The action of duplicating the data means it can also be different in subtle ways - capitalizations, spaces, etc that can all lead to complications to isolate real data
Data incorrectly stored due to oversight when creating the data model
Foresight & Flexibility. Currently there isn't any option other than applicant or co-applicant for an APPLICANT_TYPE_CODE value, so it could be stored without using another table and foreign key. But this setup makes it possible to add different applicant codes in the future, as needed, without any harm to the data model.
There's no performance benefit when you risk bad data. Whatever you gain will be eaten up by the hacks you have to perform to get things to work.
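The four-table model suggested above can be tried out with a short sketch in Python using the standard sqlite3 module. The FIRST_NAME and LAST_NAME columns, the codes 'APP' and 'CO', and the sample rows are assumptions for illustration; the table and key names follow the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE PERSON (
    PERSON_ID  INTEGER PRIMARY KEY,
    FIRST_NAME TEXT,
    LAST_NAME  TEXT
);
CREATE TABLE LOAN_APPLICATIONS (
    APPLICATION_ID INTEGER PRIMARY KEY
);
CREATE TABLE APPLICANT_TYPE_CODE (
    APPLICANT_TYPE_CODE             TEXT PRIMARY KEY,
    APPLICANT_TYPE_CODE_DESCRIPTION TEXT NOT NULL
);
CREATE TABLE LOAN_APPLICANTS (
    APPLICATION_ID      INTEGER NOT NULL REFERENCES LOAN_APPLICATIONS(APPLICATION_ID),
    PERSON_ID           INTEGER NOT NULL REFERENCES PERSON(PERSON_ID),
    APPLICANT_TYPE_CODE TEXT    NOT NULL REFERENCES APPLICANT_TYPE_CODE(APPLICANT_TYPE_CODE),
    PRIMARY KEY (APPLICATION_ID, PERSON_ID)
);
""")

conn.executemany("INSERT INTO PERSON VALUES (?, ?, ?)",
                 [(1, "Ann", "Lee"), (2, "Ben", "Lee")])
conn.executemany("INSERT INTO LOAN_APPLICATIONS VALUES (?)", [(100,), (101,)])
conn.executemany("INSERT INTO APPLICANT_TYPE_CODE VALUES (?, ?)",
                 [("APP", "Applicant"), ("CO", "Co-applicant")])
# Ben is a co-applicant on loan 100 and the main applicant on loan 101,
# yet his details are stored only once in PERSON.
conn.executemany("INSERT INTO LOAN_APPLICANTS VALUES (?, ?, ?)",
                 [(100, 1, "APP"), (100, 2, "CO"), (101, 2, "APP")])

ben_roles = conn.execute("""
    SELECT APPLICATION_ID, APPLICANT_TYPE_CODE
    FROM LOAN_APPLICANTS
    WHERE PERSON_ID = 2
    ORDER BY APPLICATION_ID
""").fetchall()
person_count = conn.execute("SELECT COUNT(*) FROM PERSON").fetchone()[0]
print(ben_roles, person_count)
```

The role lives on the link between a person and an application, so the same person can hold different roles on different loans without any duplicated personal data.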
If the Domain Model determines that both people are applicants and that they are related, then they belong in the same table with a self-referential foreign key.
You may want to read up on database normalization.
I think you should have two tables, but not those two. Have a table "loans" which has foreign keys to an applicants table, or just have records in applicants reference the same table.
The advantages:
- Makes searching easier: if you only have a phone number or a name, you can still search in a single table and find the corresponding person, regardless of whether he/she is a co-applicant or a main applicant. Otherwise you'd need to use a UNION construct. (Yet, when you know that you are searching for a particular type of applicant, you add a filter on the type and you only get such applicants.)
- Generally easier to maintain. Say tomorrow you need to add the Twitter id of the applicant ;-), there is only one place to change.
- Also allows inputting persons with, say, an "open/undefined" type, and assigning them as applicant or otherwise at a later date.
- Allows introducing new types of applicants (say a collateral guarantor... whatever).
The disadvantages:
- With really huge tables (multi-million person records), there could be a slight performance gain with a two-table approach (depending on indexes and various other things).
- SQL queries can be a bit more complicated, for example with two separate joins to the person table, one for the applicant and the other for the co-applicant. (Nothing intractable, but a bit more complexity.)
On the whole, the proper design is in all likelihood the one with a single table. The only possible exception is if, over time, the info kept for one type of applicant starts to diverge significantly from the other type(s) of applicant. (And even then we can deal with this situation in different ways, including the introduction of a related table for these extra fields, as it may make more sense. Yes, a two-table system again, but one where the extra fields may fit "naturally" together in terms of their semantics, usage, etc.)
Both of your variants have one disadvantage: any person can be an applicant and co-applicant twice and more. So you should use table Person( PersonID, FirstName, LastName , SSN, Email... ) and table Co-Applicants (PersonID as Applicant, PersonID as CoApplicant)
How about since each Applicant can have a Co-Applicant -- just go with one table in total. So you'd have Applicants, which has an optional foreign key field 'coapplicant' (or similar).
If the fields between applicant and co-applicant are identical, then I would suggest that you put them in the same table and use an "applicant type" field to indicate main or co- applicant. IF there's some information special about the co-applicant (such as relationship to main applicant, extra phone numbers, other stuff) you might want to normalize to a separate table and refer from there back to the co-applicant (by (co-)applicant ID) in the applicant table.
Keep two tables.
The first is a user type code table. In this table you can keep the user types, i.e. applicant and co-applicant.
The second is a User table. Here you can keep all the fields, with similar columns, plus the user type code as a foreign key.
This way you can easily distinguish between the two kinds of user.
I know - I'm too late on this.... The Loan Application is your primary entity. You will have one or more applicants for the loan. Drop the idea of Person - you're creating something that you can't control. I've been there, done that and got the T-Shirt.