Should I use a name as primary key if there is nothing else to use, or should i create ids for the entities that don't have anything useable as PK? - sql

I have to specify that this is for a database assignment. I'm pretty good with SQL code but the diagram aspect of the assignment is killing me, I think that every step I take is wrong.
They have given us This scenario and requirements :
A research team has asked you to create a database for a project on movie production
companies; the project aims to use machine learning, neural networks and other
methods to extract information about the situation of movie production companies in
Europe and the health of this sector for a set of specific countries, including the UK.
The data analytics application resulting from this project – which you DO NOT have to
develop; your job is to develop the central, server-side database that underpins it – has been commissioned by a research institute (which shall remain nameless), and it is
intended to be open source, and therefore available to anyone.
Basically, it is a machine learning application that would run on a database with the aim
to identify the correlation between different aspects of the sector, including funding
opportunities and development of new production companies or studios.
The database records every production company in Europe, including the name of the
company, the address, ZIP code, city, country, type of the company (e.g., non-profit
organisation), number of employees and net worth (calculated as total assets minus
total liabilities). Every production company has its name registered with one and only
one local government authority (for example, Companies House in the UK) on a specific
date; each company can have many shareholders. The authority typically requires
information about all the shareholders, including town of birth, mother’s maiden name,
father’s first name, their personal telephone number (only one), national insurance
number (each country in Europe has a similar unique ID), and passport number. Also,
the registration procedure has a cost associated with it (e.g., 12£ in the UK).
The database also records the employees’ data for each company: each employee is
assumed to work for a single production company. Due to the complex structure of
movie production companies and the need for various skills and professions,
employees are categorised into crew and staff. The crew consists of three main groups:
the actors, the director(s) and those who work on other jobs relevant to the filming
(producers, editors, production designers, costume designers, composer, etc.). All
other employees belong to the staff group, including those responsible for HR,
advertising, etc. Employees are identified by an employee ID, first name, last name and
an optional middle name, date of birth and start date. Also, each employee has their
contact details recorded, whether it is a single phone number or multiple, with a
description associated with each of them. Each employee has a single email address,
too.
Members of the crew are paid hourly, and this is recorded in the database as well as a
bonus that depends on their contract. Actors get a bonus for each day of work and
another bonus for each scene completed; directors get a bonus at the end of the
shooting; crew members that work in other jobs relevant to the filming get a bonus at
the end of the shooting, and they have their role recorded as well (e.g., producer or
costume designer).
Staff members have the monthly salary and the working hours (e.g., full time 9-5).
Furthermore, each staff member belongs to a specific department (e.g., advertising),
which is located in a given building at a given address (both recorded in the database).
The database records all movies from each production company. More specifically, for
each movie the following information is recorded: a universal unique movie code(similar to the ISBN for books), the title of the movie, the year and the first release date
(different release dates are not important and should NOT be recorded).
Also, the database records each member of the crew that is part of the movie, and the
role they have in the movie: each crew member can play a single role or multiple roles
in the same movie, and each role has a description associated with it. For example, in
each movie there can be a single protagonist or more than one, the same actor can play
one or several roles, or even have a cameo.
One of the aims of the project is to provide insights on the impact of funding and grants
within the movie industry. To this end, the database should be able to record all the
funding that each production company receives. This must include the name of the
grant, the funding body (e.g., the government of a given country or European Union
grants such as the ERDF), the maximum amount for that grant and the deadline to
submit a proposal.
Then, for each company the database must record the date of the application to a given
grant, the amount requested, the outcome (successful/unsuccessful).
A grant can be given to a single production company or shared among several. Finally,
once the database is ready, the project will run a set of machine learning algorithms to
perform high level data analysis based on the different grants and their corresponding
impact with the aim to investigate the impacts of such funding against a list of criteria.
No additional information is provided at this stage from the project.
In the spec, the requirements are numerated from 1 to 5, as the scenario was not given
at that time. The details of each requirement are provided in the following:
Each production company may have received one or multiple grants, and grants
can be shared by more than one company.
It is possible for each employee to have more than one telephone number. Each
telephone number has a description associated with it (e.g., personal, or work).
Each production company is registered only once but can have many shareholders.
Each employee can either be a member of the crew OR a staff member. Each crew
member can be an actor OR a director OR have another role. Each staff member
belongs to a department. No duplication of data is allowed.
Each crew member may be part of one or more movies in a single role or many.
Based on that I have created THIS DIAGRAM.
I think I have all the entities,attributes and relationships down but I'm missing the keys. Keys can't be names right? I will use the company entity as an example. So, should I create new attributes like company_id to use as primary keys or just underline the name attributes and use it as Primary Key?
Also, please tell me if there's anything else wrong with the diagram.
Thanks a lot!
I created an er diagram but some entities don't attributes that can be used as primary keys because they are names. I tried using them but I don't think it's right.

The problem with names as primary keys
In your diagram, you have a couple of name used to identify entities: Grant, Production Company, Shareholder (full name), Employee, Movie (Title). You can in theory use them as primary key. However, this is a bad practice:
names can change (e.g. departments and companies can be renamed, movies can have a temporary working title);
names are often not sufficient to distinguish entities (e.g. there may be different people having the same name, e.g. Adam Smith);
names can be spelled differently across source of information , and are also easily misspelled;
although not really noticeable with modern RDBMS, names are more time consuming to search, and consume more memory when used as foreign keys.
How to chose a primary key?
You'd better use a primary key that guarantees uniqueness. You can then decide easily if a same name correspond to a different entity or not.
The next question that you'll then face in you design is surrogate key vs. natural key:
When there's no other unique information, you'll have not choice than using a surrogate key.
When there are other potential unique attributes, you may chose to use either a natural key (e.g. company registration number, national insurance number together with a country code, movie code?) or a surrogate one.
Keep in mind that both have advantages and inconveniences, but the surrogate key is in general more robust, as natural keys sometimes appear to be not as stable as expected.
Other remarks concerns about your ERD
By the way, here some issues and other remarks:
Works in relation does not relate Staff to anything else. From the name, it's obviously not a reflexive relation either. So this is a diagram error. department (name) and building should either be attached to a Department entity or be attributes of Staff.
In several cases you relate attributes to other attributes (actor-extra role, phone number-description) . This is also a diagramming error. Either add the extra attribute to the same entity, or there's a missing relationship with a missing entity.
In one case you relate two entities without a relation between the two (production company- application). This is an inconsistency that must be corrected also.
The following attributes are not real attributes but probably values of an unidentified attribute: producer, composer, actor, editor, xyzzy designer, advertising, HR, janitor.
Government authority is a misleading entity name: nowhere do you refer to data about the authority itself (name of the authority, e.g. "CNC", country of the government, ...). It's only information about the company's registration.
In your diagram you leave the hourly and monthly wage at the level of the Employee. This does not model accurately the requirements.
The link of the relation receive funding and the entity Application with the same attribute outcome seems very ambiguous.
In the name of the entities, stay consistent: either singular or plural. But mixing both will lead to lots of typos.
Better show cardinality in the link between the relation and the entity, than on the top of the relation: this avoids confusion about the direction of reading.
As a side remark, your question provides wealth of interesting details, but that are not really needed for answering the core of your question. Better limit yourself to only the information directly related to your issue in your next questions ;-)
Research or not research, keep in mind that GDPR may apply and that it requires inter alia privacy by design (some information about the shareholder and the employee may require some additional thoughts).

Related

Database Relation Anomalies

I am tasked to find anomalies within this relation. I had identified a few insertion, deletion and update anomalies.
Commission Percentage: the percentage of the total sales made by a salesperson that is paid as commission to that salesperson.
Year of Hire: the year the salesperson was first hired
Department Number: the number of the department where the salesperson works
Manager Name: name of the manager of the department
However, I am confused with a anomalies that I pulled out. Below is the statement:
There can not be a manager with the same name in the company as there is no primary identifier for the manager entity except for the name, which can be a duplicate within the company.
May I know how should I phrase the above statement and under which (update/deletion/insertion) anomaly should I include it in?
Thank you
May I request additional assistance below as well:
How would you change the current design and how does your new design address the problems you have identified with the current design.
My current design is splitting it into 3 relations:
Salesperson(salespersonNumber, salespersonName, commissionPercentage, YearOfHire, deparetmentNumber)
Product(productNumber, productName, unitPrice)
Manager(managerNumber, managerName, departmentNumber)
However, I am missing out quantity entity.
Quantity requires composite key of productNumber & salespersonNumber.
Should I make it in another relation by itself?
Quantity(productNumber, salespersonNumber)
Anomalies
When identifying identifying (potential) anomalies, you're listing dependent attributes that are affected by the anomalies (you forgot Salesperson Name, btw). Specifically, you listed attributes that depended on a subset of the key (Salesperson Number, Product Number), thus violating 2NF. You're on the right track.
However, be careful not to confuse attributes with anomalies. An update anomaly would be if 1 of the 3 instances of Bilstein got changed. The (assumed) functional dependency Salesperson Name depends on Salesperson Number would be broken and the data would be inconsistent (Salesperson Number 437 would be associated with more than one name). Remember that normalization aims to eliminate redundant associations.
Identity
The problem with identifying managers by name indicates a poor modeling decision. As you stated, a company's set of managers isn't uniquely identified by name, so there's a mismatch between the logical data model and the world it models. This won't cause insert, update or delete anomalies as long as we use different values for different managers, but it will prevent convenient identification of managers. Possible improvements would be to use multiple attributes (abstract domains are often easily identified by a combination of attributes, but natural domains like people usually aren't, e.g. Manager Name, Birthdate would be more identifying but still not a good solution), turn the Manager Name into a surrogate key (e.g. Scott1, Scott2), or introduce a new surrogate key (e.g. a numeric ID).
Proposed improvement
Your proposed design normalizes the original table as well as addressing the identification problem. It's a good answer except for two issues: it doesn't include the association between Salesperson and Manager, and in your Quantity relation, you forgot to include the quantity as a dependent attribute.
Good job so far, hope this helps.

Designing Normalized Database in SQL Server

I am a novice database admin and I am working on a project that would take a hardware inventory spreadsheet and migrate it to SQL server. Currently, the data exists as one large denormalized view. I am trying to achieve 3NF, but somehow I feel that I haven't designed it well.
Let me give you a brief overview of the situation:
The school district has several buildings, which in turn, have several rooms. The school is experiencing an expansionary period so new rooms are constantly "built", so the "Room Num" cannot be used as a primary key, therefore leading me to use a surrogate key.
Teachers are assigned to rooms, but a room doesn't necessarily have to have a teacher (i.e. labs). Teachers can also change rooms.
When the hardware is purchased, it is given a barcode number (think of it as a line item number on the PO). The actual piece of hardware is identified via its serial number.
Hardware is assigned to a room and given a workstation number and assigned an IP address.
Does the design seem solid or is it flawed? Also, is it ok to create a junction table between a one-to-many relationship (a specific piece of hardware is assigned to a room, a room can have many pieces of hardware).
Please find the link below.
Logical Database Design
Start with a functional decomposition of each entity and define what it "is" versus how it is used. 3NF is about avoiding duplication and keeping the right level of references.
For this problem, model a BUILDING (PK: auto-ID, name, etc).
Then model ROOM (PK: auto-ID, number, etc) with foreign-key to BUILDING primary key.
Then model TEACHER (PK: auto-ID, name, etc) with no foreign-keys.
For the relationship between teacher and room assignment, use a join table:
TEACHER_ROOM
This would have a compound primary-key of:
TEACHER_ID
ROOM_ID
Thus a teacher can be assigned to a room but a room does not need a teacher and in this model a teacher could be assigned to MANY rooms (one-to-many cardinality).
Same thing with hardware - define it by itself with SERIAL_NUMBER, etc.
Then have a ROOM_HARDWARE table that has which hardware is in which room with a unique key on the HARDWARE ID as it can only be in one room at a time but a ROOM can have many pieces of HARDWARE.
The only one that sticks out to me is the Hardware table. Category and Manufacturer should be in separate tables.

ER Diagram flaws

I have the following ER Diagram for a bank database - customers may have several accounts, accounts may be held jointly by several customers, and each customer is associated with an account set and accounts are members of one or more account sets. What design rules are violated? What modifications should be made and why?
So far, a few flaws I'm not sure about are:
1) Redundant owner-address attribute in AcctSets Entity.
2) This ER does not include accounts with multiple owners with different addresses.
My Question is: How would I go about fixing these flaws and/or other flaws that I may be missing from my analysis? Thanks!
I'm not sure what the AccountSet entity does.
You have a many to many relationship with Customers and Accounts. Therefore, you need a CustomerAccount entity that ties a customer to one or more accounts, and an account to one or more customers.
CustomerAccount
---------------
CustomerAccountKey
CustomerKey
AccountKey
This entity would be accessed by either the CustomerKey, to get the accounts for a customer, or the AccountKey, to get the customers for an account. The CustomerAccountKey is only used to cluster the data on a database.
The CustomerKey - AccountKey combination in the CustomerAccount entity is unique.
If you want more than one address for a customer, that that becomes a one to many relationship between the Customer entity and the Address entity. This allows customers to have a summer address and a winter address, as one real life example.
You haven't states a reason for having the abstraction Account Sets. You need an intersection between Customer and Account, since your business rules say many to many, but why have an intervening abstraction?
Even assuming you keep it, the attribute owner address doesn't belong on the abstraction/intersection entity between Account and Customer (i.e. Account Set). That just doesn't make sense. There is nothing in your stated rules nor in common experience to suggest that customer address has any functional dependency on the relationship between an account and a customer. If anything, it would be functionally dependent on the customer alone. As it stands, you are modelling this dependency to be multi-valued, so the address isn't even fully functionally dependent on the customer. The only 3NF place to put it would be on the Addresses entity type.
You should consider a better name than Addresses. First, your entities should be named for the object they represent. Resist the urge to use plural nouns. The entity type is not a collection. The table that implements the entity type will be a collection of your entity instances, but that goes without saying and plural noun naming will just lead you into confusion when you are thinking about cardinalities of relationships. I'd suggest Location as an entity type name, with address as an attribute. When you go beyond the conceptual level address is almost certainly going to have to be decomposed.
Other than that, you are on a pretty good track, based on the business rules you've cited.

Confusion about 1:1 relationship

I've been learning database design and I'm confused about 1:1 relationships. From what I understand, you can simply add columns to the appropriate table. Can someone provide a real world example of where a 1:1 relationship was either necessary or provided some significant benefit? I.e., where would I use a 1:1 relationship and what would it look like?
I'll give you a real practical example.
In the medical billing world, doctors who want to get paid by medicare handle billing by creating a dictation report for each visit with a patient. This might actually be a recorded audio dictation transcribed by a secretary, but more often it's just a written description of what they did and talked about with the patient, along with history, impressions, and so forth. A licensed medical coder will then read this dictation and decide what the doctor is allowed to bill.
Separate from the dictation, there is demographic information about the patient involved: name, age, billing address, etc. This information must be strictly separate from information about the dictation, to prevent coders from allowing bias to cloud their billing judgements or violating patients' privacy.
This data is often kept well-normalized with a 1:many relationship in the data systems at the point of origin, and only the right parts are displayed to the right people at the right times. However, a significant number of offices out-source their billing function to a third party. This way a small clinic, for example, doesn't have to keep a licensed medical coder on staff; one coder at the billing office can handle the needs of many clinics. When the data is sent from the clinic to the billing office, the patient demographic information and the dictations need to come over as separate pieces, possibly at separate times. At this point, they'll likely be stored in completely separate tables with a 1:1 relationship and a shared ID field to match them up later.
In this case, the 1:1 relationship has very little to do with the data model. You could probably match up the records at the time of import, and as a bill moves through the system eventually the provincial patient information received in the clinic's demographic record will be matched to a real person so the 1:many relationship can be restored. Otherwise you'd get a separate statement on a separate account for each visit to the doctor.
Instead, it has almost everything to do with the systems design. There are likely entirely different people building and using the billing part verses the coding part at our imaginary billing service. This way, each side can each have full control of it's own fiefdom, and you are sure that no one, not even a developer, is breaking any privacy rules.
True one-to-one relationships seldom
occur in the real world. This type of
relationship is often created to get
around some limitation of the database
management software rather than to
model a real-world situation. In
Microsoft Access, one-to-one
relationships may be necessary in a
database when you have to split a
table into two or more tables because
of security or performance concerns or
because of the limit of 255 columns
per table. For example, you might keep
most patient information in
tblPatient, but put especially
sensitive information (e.g., patient
name, social security number and
address) in tblConfidential (see
Figure 3). Access to the information
in tblConfidential could be more
restricted than for tblPatient. As a
second example, perhaps you need to
transfer only a portion of a large
table to some other application on a
regular basis. You can split the table
into the transferred and the
non-transferred pieces, and join them
in a one-to-one relationship.
That's a quote from here: Fundamentals of Relational Database Design
And here's a similar question on SO.
Another reason I can see for using a 1:1 (where I have used it in the past) is if you have a table with a lot of columns, and only a few of them are involved in very intensive and frequent queries which need to be fast, I would break it into two tables that are related 1:1 where I could query the lightweight table and get good performance, but still have the other data related to it easily with a simple join.
I belief tables should be designed with the domain background. So if those columns form two different entities, they should not be mixed in one table. From my experience 1:1 relationships tend to evolve into 1:n relationships over time.
For example you may want to store the postal address of a person. But after some time, you are required to store more than one address per person. Refactoring programs from a 1:1 relationship into 1:n is usually a lot easier than extracting some columns from an old table into a new one.
Many database systems allow defining of access permissions per table in a very easy way. But defining permissions on individual columns is often quite painful.
It's useful if X has a 1:1 relationship with Y and Z also has a 1:1 relationship with Y. Y can be abstracted out into a shared table rather than duplicating in both X and Z.
EDIT: A real world example would be Customers, Companies, and Addresses. There can be a N:N relationship between Customer and Company. But both Customer and Company have 1:1 relationships with Address. Some Address rows could be related to both a Customer and a Person.
First, because they are talking about Access (Jet, Ace, whatever) -- credit to #Richard DesLonde for spotting this -- then they are probably talking about 1:0..1 relationships. I do not believe true 1:1 relationships are workable in Access because it has no mechanism for deferring constraints nor executing multiple statements in a SQL PROCEDURE. Most Access practitioners are satisfied to use a 1:0..1 relationship to model a true 1:1 relationship, so I guess the authors are satisfied to use the term "1:1" informally to refer to both.
Of course, 1:1 and 1:1..0 relationships are common enough in the real world. I rather think they are trying to convey the (valid) point that some 1:1 and 1:1..0 relationships are invented in a data model for business purposes.
Consider a "natural person" (i.e. human) and a "corporation". They have no attributes in common (sure, both have a "name" but their domains are different e.g. "natural person name" has sub atomic domains for "family name", "given name" and "title", etc).
However, in a given data model distinct entity types may play the same role. For example, both a "natural person" and a "corporation" can be the officer of a "corporation". In the data model, we could have two distinct entity types "natural person officers" and "corporate officers" that are likely to have many attributes in common and from the same domains e.g. appointment date, termination date, etc; further, they business rules would be the same e.g. appointment date must be before termination date. Also, both would participate in equivalent relationships e.g. "natural person representing", etc.
The data model could be 'split' at high level, resulting in pairs of very similar tables e.g. "natural person officer" and "corporate officer", "natural person officer natural person representing" and "corporate officer natural person representing", etc.
However, another approach is to model the common attributes and relationships using a fabricated entity type. For example, both a "natural person" and a "corporation" could be considered to be a "legal person" (aside: there is such a concept of "legal person" in law but does this mean the same as existing in the real world?!)
Therefore, we could have a superclass table for "legal persons" and subtype tables for "natural persons" and "corporations" respectively. The "officers" table would reference the "legal persons" table. All subsequent relationship tables could reference the "officers" table, which would half the number of tables from this point down.
There are practical problems to such a 'subclassing' approach. Because a "natural person" and a "corporation" has no attributes in common, they have no common key, therefore the "legal persons" table would need to have an artificial key, with all the problems that this entails, especially if it needs to be exposed in the application. Also, because the relationships between "legal persons", "natural persons" and "corporations" are truly 1:1 some DBMSs, as Access, will lack the necessary functionality to effectively implement them and many will have to settle for making them 1:0..1.
A 1:1 relationship is an abstract concept that you model in your data, but at the database level (assuming RDBMS) doesn't really exist. You always have a foreign key on one table pointing to another, so technically the parent table being pointed to by the FK could have multiple children. This is something you'll want to enforce in your business logic.
A good example of a 1:1 relationship in a modeling sense would be the relationship between employee and person. You have a person with certain data, then you have extra attributes on that same person that you put on an employee. A good way to think of this in OO programming terms is as inherited classes. The Employee class, inherits from Person. In fact may ORM systems will model the 1:1 relationship in the database with each table having a shared primary key.

SSIS Population of Slowly Changing Dimension with outrigger

Working on a data warehouse, a suitable analogy for the problem is that we have Healthcare Practitioners. Healthcare Practitioners have a number of professional attributes and work in an open number of teams and in an open number of clinical areas.
For example, you may have a nurse who works in children's services across a number of teams as a relief/contractor/bank staff person. Or you may have a newly qualified doctor who works general medicine who is doing time in a special area pending qualifying as a consultant of that special area.
So we have an open number of areas of work and an open number of teams, we can't have team 1, team 2 etc in our dimensions. The other attributes may change over time also, like base location (where they work out of), the main team and area they work in..
So, following Kimble I've gone for outriggers:
Table DimHealthProfessionals:
Key (primary key, identity)
Name
Main Team
Main Area of Work
Base Location
Other Attribute 1
Other Attribute 2
Start Date
End Date
Table OutriggerHealthProfessionalTeam:
HPKey (foreign key to DimHealthPRofessionals.Key)
Team Name
Team Type
Other Team Attribute 1
Other Team Attribute 2
Table OutriggerHealthProfessionalAreaOfWork:
HPKey (as above)
Area of Work
Other AoW attribute 1
If any attribute of the HP changes, or the combination of teams or areas of work in which they work change, we need to create a new entry in the SCD and it's outrigger tables to encapsulate this.
And we're doing this in SSIS.
The source data is basically an HP table with the main attributes, a table of areas of work, a table of teams and a pair of mapping tables to map a current set of areas of work to an HP.
I have three data sources, one brings in the HCP information, one the areas of work of all HCPs and one the team memberships.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
Admittedly I may not understand everything here, but it seems to me that the relationship in this example should be reversed. Place TeamKey and the WorkAreaKey in the dimHealthProfessionals -- this should simplify things.
With this in place, you simply make sure to deliver outriggers before the dimHealthProfessionals.
Treat outriggers as dimensions in their own right. You may want to treat dimHealthProfessionals as a type 2 dimension, to properly capture the history.
EDIT
Considering that team to person is many-to-many, a fact is more appropriate.
A column in a dimension table is appropriate only if a person can belong to only one team at a time. Same with work areas.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
I'm not sure I understand your question fully. If you are unsure about change detection, then use Checksums in the package. Build up a temp table with the data as it is in the source, then compare each row to its counterpart (joined via the business keys) by computing the checksum for both rows and comparing those. If they differ, the data has changed.
If you are talking about cascading updates in a historized dimension hierarchy (and you can treat the outriggers like a hierarchy in this context) then the foreign key lookups will automatically lookup the newer entry in DimHealthProfessionals if you have a historization (i.e. have validFrom / validThrough timestamps in DimHealthProfessionals). Those different foreign keys result in a different checksum.