Merging tables with common attributes - sql

I am designing a database that contains JOBSEEKERS who can be matched to VACANCIES. I am looking for an effecient and good way to store the common attributes between the 2 (There's a lot). For example a JOBSEEKER has skills and a VACANCY has required skills; or a JOBSEEKER has a salary requirement and a VACANCY has a salary offer.
Right now I am considering two options:
Storing all the attributes or each table in their own table.
Creating another table that contains the common attributes. Each row would represent the attributes for either a VACANCY or JOBSEEKER. I would then link each record to either a VACANCY or a JOBSEEKER.
Which way should is the correct way of going about this? Other suggestions are also welcome.

JobSeeker and Vacancy are two separate entities. In most cases, you would store the values in separate tables with separate columns. Although they have overlapping attributes, they have many attributes that are not common.
The, use application code logic (often implemented in SQL) to match between the two.
For something like skills, you actually want junction tables: JobseekerSkills and VacancySkills to list each of the skills. These would, in turn, reference another table Skills to ensure that the skills are common between the two entities.

Related

Many to many relationship self join SQL Server 2008

Are there any hard and fast rules against creating a junction table out of a table's primary key? Say I have a table with a structure similar to:
In this instance there are a list of items that can be sold together, but should be marked as dangerous. Any one item can have multiple other items with which it is dangerous. All of the items are uniquely identified using their itemId. Is it OK to refer a table to itself in a many-to-many relationship? I've seen other examples on SO, but they weren't SQL specific really.
This is the correct design for your problem, as long as your combinations can only be two-item combos.
In database design a conceptual design that renders a relation in many-to-many form is converted into 2 one-to-many in physical design. Say for example a Student could take one or many courses and a course could have many students, so that's many to many. So, in actual design it would be a Student table, Course table then CourseTaken table that has both the primary key of Student and Course table thus creating 2 one to many relayionship. In your case altough the two tables are one and the same but you have the virtual third table to facilitate the 2 one to many relationship so that to me is still very viable approach.

Setup Many-to-Many tables that share a common type

I'm preparing a legacy Microsoft SQL Server database so that I can interface with in through an ORM such as Entity Framework, and my question revolves around handling the setup of some of my many-to-many associations that share a common type. Specifically, should a common type be shared among master types or should each master type have its own linked table?
For example, here is a simple example I concocted that shows how the tables of interest are currently setup:
Notice that of there are two types, Teachers and Students, and both can contain zero, one, or many PhoneNumbers. The two tables, Teachers and Students, actually share an association table (PeoplePhoneNumbers). The field FKID is either a TeacherId or a StudentId.
The way I think it ought to be setup is like this:
This way, both the Teachers table and the Students table get its own PhoneNumbers table.
My gut tells me the second way is the proper way. Is this true? What about even if the PhoneNumbers tables contains several fields? My object oriented programmer brain is telling me that it would be wrong to have several identical tables, each containing a dozen or so fields if the only difference between these tables is which master table they are linked to? For example:
Here we have two tables that contain the same information, yet the only difference is that one table is addresses for Teachers and the other is for Students. These feels redundant to me and that they should really be one table -- but then I lose the ability for the database to constrain them (right?) and also make it messier for myself when I try to apply an ORM to this.
Should this type of common type be merged or should it stay separated for each master type?
Update
The answers below have directed me to the following solution, which is based on subclassing tables in the database. One of my original problems was that I had a common table shared among multiple other tables because that entity type was common to both the other tables. The proper way to handle that is to subclass the shared tables and essentially descend them from a common parent AND link the common data type to this new parent. Here's an example (keep in mind my actual database has nothing to do with Teachers and Students, so this example is highly manufactured but the concepts are valid):
Since Teachers and Students both required PhoneNumbers, the solution is to create a superclass, Party, and FK PhoneNumbers to the Party table. Also note that you can still FK tables that only have to do with Teachers or only have to do with Students. In this example I also subclassed Students and PartTimeStudents one more level down and descended them from Learners.
Where this solution is very satisfactory is when I implement it in an ORM, such as Entity Framework.
The queries are easy. I can query all Teachers AND Students with a particular phone number:
var partiesWithPhoneNumber = from p in dbContext.Parties
where p.PhoneNumbers.Where(x => x.PhoneNumber1.Contains(phoneNumber)).Any()
select p;
And it's just as easy to do a similar query but only for PhoneNumbers belonging to only Teachers:
var teachersWithPhoneNumber = from t in dbContext.Teachers
where t.Party.PhoneNumbers.Where(x => x.PhoneNumber1.Contains(phoneNumber)).Any()
select t;
Teacher and Student are both subclasses of a more general concept (a Person). If you create a Person table that contains the general data that is shared for all people in your database and then create Student and Teacher tables that link to Person and contain any additional details you will find that you have an appropriate point to link in any other tables.
If there is data that is common for all people (such as zero to many phone numbers) then you can link to the Person table. When you have data that is only appropriate for a Student you link it to the Student ID. You gain the additional advantage that Student Instructors are simply a Person with both a Student and Teacher record.
Some ORMs support the concept of subclass tables directly. LLBLGen does so in the way I describe so you can make your data access code work with higher level concepts (Teacher and Student) and the Person table will be managed on your behalf in the low level data access code.
Edit
Some commentary on the current diagram (which may not be relevant in the source domain this was translated from, so a pinch of salt is advised).
Party, Teachers and Learners looks good. Salaries looks good if you add start and end dates for the rate so you can track salary history. Also keep in mind it may make sense to use PartyID (instead of TeacherID) if you end up with multiple entites that have a Salary.
PartyPhoneNumbers looks like you might be able to hang the phone number off of that directly. This would depend on if you expect to change the phone number for multiple people (n:m) at once or if a phone number is owned by each Party independently. (I would expect the latter because you might have a student who is a (real world) child of a teacher and thus they share a phone number. I wouldn't want an update to the student's phone number to impact the teacher, so the join table seems odd here.)
Learners to PaymentHistories seems right, but the Students vs PartTimeStudents difference seems artificial. (It seems like PartTimeStudents is more AttendenceDays which in turn would be a result of a LearnerClasses join).
I think you should look into the supertype/subtype pattern. Add a Party or Person table that has one row for every teacher or student. Then, use the PartyID in the Teacher and Student tables as both the PK and FK back to Party (but name them TeacherID and StudentID). This establishes a "one-to-zero-or-one" relationship between the supertype table and each of the subtype tables.
Note that if you have identity columns in the subtype tables they will need to be removed. When creating those entities going forward you will first have to insert to the supertype and then use that row's ID in either subtype.
To maintain consistency you will also have to renumber one of your subtype tables so its IDs do not conflict with the other's. You can use SET IDENTITY_INSERT ON to create the missing supertype rows after that.
The beauty of all this is that when you have a table that must allow only one type such as Student you can FK to that table, but when you need an FK that can be either--as with your Address table--you FK to the Party table instead.
A final point is to move all the common columns into the supertype table and put only columns in the subtypes that must be different between them.
Your single Phone table now is easily linked to PartyID as well.
For a much more detailed explanation, please see this answer to a similar question.
The problem that you have is an example of a "one-of" relationship. A person is a teacher or a student (or possibly both).
I think the existing structure captures this information best.
The person has a phone number. Then, some people are teachers and some are students. The additional information about each entity is stored in either the teacher or student table. Common information, such as name, is in the phone table.
Splitting the phone numbers into two separate tables is rather confusing. After all, a phone number does not know whether it is for a student or a teacher. In addition, you don't have space for other phone numbers, such as for administrative staff. You also have a challenge for students who may sometimes teach or help teach a class.
Reading your question, it looks like you are asking for a common database schema to your situation. I've seen several in the past, some easier to work with than others.
One option is having a Student_Address table and a Teacher_Address table that both use the same Address table. This way if you have entity specific fields to store, you have that capability. But this can be slightly (although not significantly) harder to query against.
Another option is how you suggested above -- I would probably just add a primary key on the table. However you'd want to add a PersonTypeId field to that table (PersonTypeId which links to a PersonType table). This way you'd know which entity was with each record.
I would not suggest having two PhoneNumber tables. I think you'll find it much easier to maintain with all in the same table. I prefer keeping same entities together, meaning Students are a single entity, Teachers are a single entity, and PhoneNumbers are the same thing.
Good luck.

Multiple many to many relationships between same tables or one n:m relationship with identifying attribute?

Currently I have multiple n:m relationships between 2 tables:
Users --> Favourites(user_id,post_id) <-- Posts
Users --> Follow(user_id,post_id) <-- Posts
Would you rather have 2 join tables or just one join table with an attribute which marks the type of the join, so something like:
Users --> Users_Posts (user_id,post_id,type(VALUES="favourite,follow") <-- Posts
It's not exactly the same example as I have in my application, but I think you can get the idea.
I don't think there is one "right" answer here. I think it depends. If you use a single table with a "relationship type" column and frequently want to extract just relationships of a single type--say just favorites--then each query against that table will need to apply a WHERE clause to filter out the types you don't want. That might make for slow queries if you don't properly index the non-key "relationship type" column. Also, it makes it so future developers will always need to know and remember to filter to the types they want or they may unexpectedly get relationships data back that they don't intend. Having two separate tables is easier to understand. For example, it is easier for me to quickly know what to expect in a "Favorites" table than in a "Users_Posts" table, so separate tables may communicate the differences more quickly.
On the other hand, if you frequently need to select both relationship types in a single set, then having them in a single table is simpler because you don't need to worry about doing a UNION to combine the data from two tables into a single view. What if there were 10,000 different possible relationship types? Would you want 10,000 different tables, or would you prefer a single table? Most people would prefer a single table in that case.
So I think it depends on many factors, such as expected usage, size, etc. The "right" answer is more of an art than a science.

Database Design for One to One relationships

I'm trying to finalize my design of the data model for my project, and am having difficulty figuring out which way to go with it.
I have a table of users, and an undetermined number of attributes that apply to that user. The attributes are in almost every case optional, so null values are allowed. Each of these attributes are one to one for the user. Should I put them on the same table, and keep adding columns when attributes are added (making the user table quite wide), or should I put each attribute on a separate table with a foreign key to the user table.
I have decided against using the EAV model.
Thanks!
Edit
Properties include thing like marital status, gender, age, first and last name, occupation, etc. All are optional.
Tables:
USERS
USER_PREFERENCE_TYPE_CODES
USER_PREFERENCES
USER_PREFERENCES is a many-to-many table, connecting the USERS and USER_PREFERENCE_TYPE_CODES tables. This will allow you to normalize the preference type attribute, while still being flexible to add preferences without needing an ALTER TABLE statement.
Could you give some examples of what kind of properties you'd want to add to the user table? As long as you stay below roughly 50 columns, it shouldn't be a big deal.
How ever, one way would be to split the data:
One table (users) for username, hashed_password, last_login, last_ip, current_ip etc, another table (profiles) for display_name, birth_day etc.
You'd link them either via the same id property or you'd add an user_id column to the other tables.
It depends.
You need to Look at what percentage of users will have that attribute. If the attribute is 'WalkedOnTheMoon' then split it out, if it is 'Sex' include it on the user's table. Also consider the number of columns on the base table, a few, 10-20, won't hurt that much.
If you have several related attributes you could group them into a common table: 'MedicalSchoolId', 'MedicalSpeciality', 'ResidencyHospitalId', etc. could be combined in UserMedical table.
Personally I would decide on whether there are natural groupings of attributes. You might put the most commonly queried in the user table and the others in a separate table with a one-to-one relationship to keep the table from being too wide (we usually call that something like User_Extended). If some of the attributes fall into natural groupings, they may call for a separate table because those attributes will usually be queried together.
In looking at the attributes, examine if some can be combined into one column (for instance if a user cannot simlutaneoulsy be three differnt things (say intern, resident, attending) but only one of them at a time, it is better to have one field and put the data into it rather than three bit fields that have to be transalted. This is especially true if you will need to use a case statement with all three fileds to get the information (say title) that you want in reporting. IN other words look over your attributes and see if they are truly separate or if they can be abstracted into a more general one.

Do these database design styles (or anti-pattern) have names?

Consider a database with tables Products and Employees. There is a new requirement to model current product managers, being the sole employee responsible for a product, noting that some products are simple or mature enough to require no product manager. That is, each product can have zero or one product manager.
Approach 1: alter table Product to add a new NULLable column product_manager_employee_ID so that a product with no product manager is modelled by the NULL value.
Approach 2: create a new table ProductManagers with non-NULLable columns product_ID and employee_ID, with a unique constraint on product_ID, so that a product with no product manager is modelled by the absence of a row in this table.
There are other approaches but these are the two I seem to encounter most often.
Assuming these are both legitimate design choices (as I'm inclined to believe) and merely represent differing styles, do they have names? I prefer approach 2 and find it hard to convey the difference in style to someone who prefers approach 1 without employing an actual example (as I have done here!) I'd would be nice if I could say, "I'm prefer the inclination-towards-6NF (or whatever) style myself."
Assuming one of these approaches is in fact an anti-pattern (as I merely suspect may be the case for approach 1 by modelling a relationship between two entities as an attribute of one of those entities) does this anti-pattern have a name?
Well the first is nothing more than a one-to-many relationship (one employee to many products). This is sometimes referred to as a O:M relationship (zero to many) because it's optional (not every product has a product manager). Also not every employee is a product manager so its optional on the other side too.
The second is a join table, usually used for a many-to-many relationship. But since one side is only one-to-one (each product is only in the table once) it's really just a convoluted one-to-many relationship.
Personally I prefer the first one but neither is wrong (or bad).
The second would be used for two reasons that come to mind.
You envision the possibility that a product will have more than one manager; or
You want to track the history of who the product manager is for a product. You do this with, say a current_flag column set to 'Y' (or similar) where only one at a time can be current. This is actually a pretty common pattern in database-centric applications.
It looks to me like the two model different behaviour. In the first example, you can have one product manager per product and one employee can be product manager for more than one product (one to many). The second appears to allow for more than one product manager per product (many to many). This would suggest the two solutions are equally valid in different situations and which one you use would depend on the business rule.
There is a flaw in the first approach. Imagine for a second, that the business requirements have changed and now you need to be able to set 2 Product Manager to a product. What will you do? Add another column to the table Product? Yuck. This obviously violates 1NF then.
Another option the second approach gives is an ability to store some attributes for a certain Product Manager <-> Product relation. Like, if you have two Product Manager for a product, then you can set one of them as a primary...
Or, for example, an employee can have a phone number, but as a product manager he/she can have another phone number... This also goes to the special table then.
Approach 1)
Slows down the use of the Product table with the additional Product Manager field (maybe not for all databases but for some).
Linking from the Product table to the Employee table is simple.
Approach 2)
Existing queries using the Product table are not affected.
Increases the size of your database. You've now duplicated the Product ID column to another table as well as added unique constraints and indexes to that table.
Linking from the Product table to the Employee table is more cumbersome and costly as you have to ink to the intermediate table first.
How often must you link between the two tables?
How many other queries use the Product table?
How many records in the Product table?
in the particular case you give, i think the main motivation for two tables is avoiding nulls for missing data and that's how i would characterise the two approaches.
there's a discussion of the pros and cons on wikipedia.
i am pretty sure that, given c date's dislike of this, he defines relational theory so that only the multiple table solution is "valid". for example, you could call the single table approach "poorly typed" (since the type of null is unclear - see quote on p4).